OpenAI FINALLY Releases the ChatGPT Agent The Internet Has Been Asking For

Hello AI Enthusiasts!

Welcome to the Twenty-Eighth edition of “This Week in AI Engineering”!

This week, OpenAI launched the revolutionary ChatGPT Agent, Moonshot AI’s Kimi K2 beat Claude Opus 4 while being up to 90% cheaper, Mistral released the world’s #1 open speech recognition models, Perplexity unveiled its smartest AI browser, and Cursor’s CEO had to apologise publicly.

As always, we’ll also explore some under-the-radar tools that can supercharge your development workflow.


ChatGPT Agent is FINALLY here

OpenAI has released ChatGPT Agent, a unified system that combines deep research capabilities with computer operation abilities. The agent can browse the web, use terminals, write code, analyze data, and create reports, spreadsheets, and presentations, all while achieving state-of-the-art performance across multiple benchmarks.

What’s New

  • Unified Computer Operation: The agent operates on its own virtual computer, intelligently switching between web browsers, terminals, and API access based on task requirements.
  • Collaborative Workflow: Users can interrupt, redirect, or take control at any point during execution, maintaining human oversight over complex workflows.
  • Real-Time Narration: Provides live updates of its activities and asks for permission before taking consequential actions.
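To make the oversight model in the list above concrete, here is a minimal Python sketch of an approval gate: every step is narrated, and consequential steps wait for a human decision. This is not OpenAI's implementation; `Action`, `run_with_oversight`, and the `approve` callback are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    description: str
    consequential: bool  # hypothetical flag, e.g. sending an email or paying


def run_with_oversight(actions: List[Action],
                       approve: Callable[[Action], bool]) -> List[str]:
    """Narrate every step and ask for approval before consequential ones."""
    log = []
    for action in actions:
        log.append(f"narrate: {action.description}")
        if action.consequential and not approve(action):
            # The human declined, so the step is not executed.
            log.append(f"skipped: {action.description}")
            continue
        log.append(f"done: {action.description}")
    return log


plan = [
    Action("search the web for flights", consequential=False),
    Action("purchase the selected ticket", consequential=True),
]

# A reviewer who declines every consequential action.
log = run_with_oversight(plan, approve=lambda action: False)
for line in log:
    print(line)
```

In a real agent the `approve` callback would be a prompt shown to the user; here it is a plain function so the control flow is easy to follow.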

Benchmark Domination

ChatGPT Agent is crushing industry benchmarks across the board:

  • Humanity’s Last Exam (Expert-Level Questions): 41.6% (new state-of-the-art, significantly outperforming Deep Research at 26.6% and OpenAI o3 at 24.9%)
  • FrontierMath (Expert Mathematics): 27.4% (beating OpenAI o4-mini at 19.3% and o3 at 10.3%)
  • DSBench Data Analysis: 89.9% (surpassing human performance at 64.1% and GPT-4o at 34.1%)
  • BrowseComp (Agentic Browsing): 68.9% (new state-of-the-art, ahead of Deep Research at 51.5%)
  • Investment Banking Modeling: 71.3% (dramatically outperforming OpenAI o3 at 41.0%)
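For a quick sense of the gaps, the short Python sketch below computes the lead in percentage points for each benchmark. The scores are copied from the list above, pairing ChatGPT Agent with the strongest comparison score quoted for that benchmark; the `scores` table and the `margins` helper are illustrative names only.

```python
# (ChatGPT Agent score, best comparison score quoted in the list above)
scores = {
    "Humanity's Last Exam": (41.6, 26.6),
    "FrontierMath": (27.4, 19.3),
    "DSBench": (89.9, 64.1),
    "BrowseComp": (68.9, 51.5),
    "Investment Banking Modeling": (71.3, 41.0),
}


def margins(table):
    """Return the lead in percentage points, rounded to one decimal."""
    return {name: round(agent - other, 1)
            for name, (agent, other) in table.items()}


lead = margins(scores)
for name, points in lead.items():
    print(f"{name}: +{points} points")
```

The largest quoted gap is on Investment Banking Modeling (30.3 points) and the smallest on FrontierMath (8.1 points).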

Use Cases & Practical Applications

ChatGPT Agent excels in several key areas that demonstrate its real-world utility:

Research & Analysis

  • Conduct comprehensive market research by gathering data from multiple sources and synthesizing insights
  • Analyze financial documents and create investment reports with supporting charts and visualizations
  • Perform academic literature reviews across multiple databases and compile structured summaries

Business Operations

  • Manage your calendar, whip up a PowerPoint presentation and automate routine administrative tasks
  • Create detailed project reports by collecting data from various team tools and platforms
  • Build financial models and perform complex calculations in Excel with human-level accuracy

Content Creation & Documentation

  • Generate comprehensive technical documentation by analyzing codebases and system architectures
  • Create presentations with data-driven insights pulled from live web sources
  • Develop training materials by researching best practices and organizing information logically

What Makes It Superior to Other Agents

  • Multi-Modal Integration: Unlike specialized agents that focus on single tasks, ChatGPT Agent seamlessly combines web browsing, code execution, data analysis, and content creation in one unified workflow.
  • Human-in-the-Loop Design: Most autonomous agents run independently with limited oversight. ChatGPT Agent maintains collaborative control, allowing users to intervene, redirect, or approve actions at any point.
  • State-of-the-Art Performance: ChatGPT agent’s output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, significantly outperforming existing solutions like Claude or specialized research tools.
  • Real-Time Adaptability: While other agents follow rigid workflows, ChatGPT Agent dynamically switches between different tools and approaches based on task requirements, making it more flexible and efficient.

Availability & Safety

ChatGPT Agent is rolling out now to Pro, Plus, and Team users. Pro users get 400 messages per month, and other paid users receive 40 messages per month. OpenAI has implemented extensive safeguards, including explicit user confirmation for consequential actions and enhanced biological and chemical safety controls.


Kimi K2 Beats Claude Opus 4 While Being 90% Cheaper

Moonshot AI’s Kimi K2 has achieved the remarkable feat of becoming the #1 open model on the LMSys Chatbot Arena while delivering exceptional performance at a fraction of the cost of proprietary alternatives.

What’s New

  • Open Source Excellence: Available as both Kimi-K2-Base (foundation model) and Kimi-K2-Instruct (chat-ready model) with 32 billion activated parameters and 1 trillion total parameters.
  • Blazing Speed: Achieves over 200 tokens/second on Groq hardware, making it one of the fastest inference models available.
  • Cost Revolution: Up to 90% cheaper than Claude Opus 4 while outperforming it on coding benchmarks.
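Two of the numbers above are easy to sanity-check with a few lines of Python: the share of parameters that is active for each token, and how long a reply takes at the quoted speed. The constants come from the list above; the 2,000-token reply length is an assumption chosen for illustration, and because the quoted speed is "over 200 tokens/second", the time shown is an upper bound.

```python
TOTAL_PARAMS = 1_000_000_000_000  # 1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000    # 32 billion activated parameters
TOKENS_PER_SECOND = 200           # quoted lower bound on Groq hardware

# Fraction of the model that is actually used for each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS


def seconds_to_generate(num_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Time to stream num_tokens at a constant generation speed."""
    return num_tokens / tokens_per_second


reply_seconds = seconds_to_generate(2_000)  # assumed reply length

print(f"active share: {active_fraction:.1%}")
print(f"2,000-token reply: {reply_seconds:.0f} s")
```

Only about 3.2% of the weights are active per token, which helps explain how a 1-trillion-parameter model can be served quickly and cheaply.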

Technical Innovation

  • MuonClip Optimizer: Revolutionary training technique that solved exploding attention logits, enabling stable pre-training on 15.5T tokens with zero training spikes.
  • Agentic Focus: Designed not just to answer but to act. It can use tools and execute complex workflows, a capability built through large-scale agentic data synthesis.

Benchmark Performance

Kimi K2 is setting new standards across coding and STEM tasks:

  • LiveCodeBench v6: 53.7% (beating Claude Sonnet 4 at 48.5% and Claude Opus 4 at 47.4%)
  • AIME 2024: 69.6% (significantly ahead of Claude Opus 4 at 48.2%)
  • MATH-500: 97.4% (outperforming Claude Opus 4 at 94.4%)
  • SWE-bench Verified: 65.8% single attempt, 71.6% multiple attempts

Real-World Applications

Data Science & Analytics

  • Salary Analysis Workflows: Performed comprehensive salary data analysis using 16 IPython calls, including data cleaning, statistical analysis, visualization creation, and trend identification across multiple demographics and job categories
  • Market Research Automation: Automated collection and analysis of market data from multiple sources, creating comprehensive reports with statistical insights and predictive modeling

Academic & Research Applications

  • Stanford NLP Genealogy Research: Executed complex genealogy research involving multiple tool interactions, database queries, cross-referencing academic papers, and generating family tree visualizations with supporting documentation
  • Literature Review Automation: Systematically searched academic databases, extracted key insights, categorized findings, and synthesized comprehensive literature reviews with proper citations

Software Development

  • Full-Stack Game Development: Developed a complete JavaScript Minecraft game through iterative debugging, including game engine setup, 3D rendering implementation, player controls, world generation algorithms, and performance optimization
  • Code Refactoring Projects: Analyzed legacy codebases, identified optimization opportunities, implemented improvements, and validated changes through automated testing

Business Intelligence

  • Financial Modeling: Created complex financial models with scenario planning, risk analysis, and automated reporting features
  • Process Optimization: Analyzed business workflows, identified bottlenecks, and implemented automated solutions to improve efficiency

Content & Documentation

  • Technical Documentation Generation: Automatically generated comprehensive API documentation, user guides, and system architecture diagrams from existing codebases
  • Multi-Language Content Creation: Produced technical content and educational materials across multiple languages with cultural adaptation

Mistral Releases World’s Best Open Speech Recognition Models

Mistral AI has unveiled Voxtral, claiming to deliver the world’s best open-source speech recognition models. Available in two sizes, Voxtral (24B) for production and Voxtral Mini (3B) for edge deployment, both are released under the Apache 2.0 license.

What’s New

  • State-of-the-Art Performance: Outperforms OpenAI Whisper large-v3, GPT-4o Mini Transcribe, and Gemini 2.5 Flash across all transcription tasks.
  • Multilingual Excellence: Beats Whisper in every language tested on FLEURS benchmark, including Arabic, with automatic detection and top-tier support.
  • Text-Native Capabilities: Retains full language model capabilities, addressing the major pain point where audio LMs often lose text abilities.

Enterprise-Ready Features

  • 32k Token Context: Handles up to 30 minutes of audio for transcription and 40 minutes for understanding.
  • Built-in Intelligence: Direct Q&A and summarization from speech without chaining separate models.
  • Function Calling: Trigger workflows directly from voice commands.
  • Affordable Access: API pricing starts at just $0.001/minute, making high-quality speech intelligence accessible at scale.
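To put the pricing and context limits above in perspective, here is a small back-of-the-envelope sketch in Python. It uses the starting price of $0.001/minute and the 30-minute transcription limit quoted above. Splitting longer recordings into 30-minute chunks is an assumption about how a client might work around the limit, not documented Voxtral behavior, and the 100-hour archive is a made-up workload.

```python
import math

PRICE_PER_MINUTE_USD = 0.001   # starting API price quoted above
MAX_TRANSCRIBE_MINUTES = 30    # transcription limit quoted above


def transcription_cost(minutes, price=PRICE_PER_MINUTE_USD):
    """Estimated cost in USD at the starting price (a lower bound)."""
    return minutes * price


def chunks_needed(minutes, limit=MAX_TRANSCRIBE_MINUTES):
    """Number of requests if audio is split into limit-sized chunks (assumed)."""
    return math.ceil(minutes / limit)


archive_minutes = 100 * 60  # hypothetical 100-hour audio archive

print(f"cost: ${transcription_cost(archive_minutes):.2f}")
print(f"requests: {chunks_needed(archive_minutes)}")
```

At the starting rate, a 100-hour archive comes to about $6 spread over 200 requests.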

Availability

Available via API, Hugging Face downloads, and Le Chat voice interface, with enterprise options including private deployment and fine-tuning for specialized domains.


Perplexity’s Latest AI web browser

Perplexity has officially launched Comet, an AI-powered browser that moves beyond traditional search to create an intelligent, conversational web experience. Now in early access for Perplexity Max users, Comet transforms passive browsing into active thinking.

From Navigation to Cognition

  • Unified Intelligence: Organizes web activity into a single intelligent interface, eliminating tab overload and context-switching friction.
  • Conversational Browsing: Ask follow-up questions as you browse, compare content, and dig deeper, turning browsing into flow-state research.
  • Contextual Understanding: Maintains context over time, turning long sessions into seamless interactions.

From Answers to Action

  • Action Agent: Book meetings, send emails, shop, or organize your day, all in one continuous conversation.
  • Workflow Delegation: Ask Comet to brief you, make comparisons, or complete complex workflows through natural conversation.
  • Curiosity-Driven: Highlight text on any page for on-the-fly explanations, explore tangents without losing place, and request counterpoints or deeper questions.

Key Advantages Over Traditional Browsers

  • Contextual Memory: Unlike traditional browsers that treat each tab as isolated, Comet maintains conversational context across your entire browsing session, remembering previous queries and building upon them.
  • Real-Time Intelligence: I used Perplexity’s new Comet browser to book a restaurant while I wrote this article – demonstrating capabilities far beyond traditional browsers’ passive information consumption.
  • Reduced Tab Chaos: Eliminates the need for dozens of open tabs by intelligently synthesizing information and maintaining context within a single conversational flow.

How Comet Surpasses Chrome, Safari, and Arc

Chrome Comparison

  • Intelligence Integration: While Chrome requires switching between tabs and external AI tools, Comet is a web browser built for today’s internet with native AI integration that understands context across your entire browsing session
  • Reduced Cognitive Load: Eliminates the need to manually synthesize information from multiple sources – Comet automatically connects related information and provides insights
  • Task Automation: Features include real-time summarization, product comparisons, and task automation, all in a conversational interface, unlike Chrome’s static browsing experience

Safari Comparison

  • Cross-Platform Intelligence: Unlike Safari’s ecosystem lock-in, Comet works across platforms while maintaining intelligent context
  • Proactive Assistance: Instead of Safari’s reactive search, Comet anticipates information needs and provides contextual suggestions
  • Research Efficiency: Transforms Safari’s linear browsing into dynamic, interconnected knowledge discovery

Arc Comparison

  • AI-First Design: While Arc focuses on organization and aesthetics, Comet prioritizes intelligent interaction and automated reasoning
  • Conversational Interface: Arc’s sidebar organization pales compared to Comet’s natural language interaction model
  • Action Capabilities: Arc organizes content, but Comet can act on it – booking reservations, sending emails, and completing tasks directly

Tasks Made Significantly Easier

Research & Analysis

  • Comparative Shopping: Automatically compares products across multiple sites, synthesizing reviews, prices, and specifications without manual tab switching
  • Academic Research: Connects related papers, cross-references citations, and builds comprehensive understanding across multiple sources
  • Market Analysis: Aggregates data from various financial sources and creates real-time analytical insights

Daily Productivity

  • Travel Planning: Books flights, hotels, and restaurants while maintaining context about your preferences and constraints
  • Email Management: Drafts responses based on web research and sends them directly from the browser
  • Calendar Integration: Schedules meetings by automatically finding availability and sending invites

Content Creation

  • Fact-Checking: Verifies information in real-time as you write, providing sources and alternative perspectives
  • Research Synthesis: Combines information from multiple sources into coherent summaries and reports
  • Citation Management: Automatically tracks and formats sources for academic or professional writing

Trust and Accuracy

Comet is built on Perplexity’s signature commitment to factual answers with trust, transparency, and truth, which makes it well suited to high-stakes decisions like comparing insurance plans or understanding investments.


Cursor Faces Backlash Over Pro Plan Pricing Shift

Cursor, the AI-powered coding platform by Anysphere, came under fire after an abrupt change to its $20/month Pro plan sparked user confusion, unexpected charges, and widespread frustration.

What Changed

  • Old Model: 500 fast responses per month using advanced models like Claude or GPT-4, plus unlimited slow responses after the cap.
  • New Model: $20 monthly credit for frontier model usage at real API rates, with unlimited usage only via “Auto mode” that dynamically selects cheaper or slower models.

User Frustration

  • Unexpected Charges: Many users hit the $20 usage cap after just a few prompts, especially when using models like Claude Opus 4.
  • Automatic Billing: Users were charged beyond their plan without realizing spend limits had to be manually configured.
  • Limited Premium Access: The only truly “unlimited” access was through Auto mode, which often doesn’t route to premium models.

Cursor’s Response

  • CEO Michael Truell issued an apology acknowledging poor communication: “These changes hurt the trust we work hard to build… We missed the mark.”
  • Full Refunds: Available for any unexpected charges from June 16 to July 4 by contacting [email protected].
  • Future Improvements: Better pre-change communication, clearer dashboard visibility, and enhanced UI features to alert users approaching usage limits.

The Rationale

Cursor cited growing API costs from model providers, explaining that request-based pricing couldn’t reflect the real cost of longer, token-heavy prompts, while API-based pricing provides a more accurate cost structure for advanced usage.
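A rough Python sketch shows why token-based pricing drains the $20 credit so quickly on large prompts. The per-million-token prices and the prompt sizes below are illustrative assumptions, not figures from Cursor or from this newsletter; only the $20 credit and the old 500-request cap come from the text above.

```python
import math

MONTHLY_CREDIT_USD = 20.0  # Pro plan credit described above
OLD_REQUEST_CAP = 500      # fast responses under the old plan

# Assumed, illustrative prices in USD per million tokens (not from the article).
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def request_cost(input_tokens, output_tokens):
    """USD cost of one request under token-based pricing."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


def requests_per_credit(input_tokens, output_tokens,
                        credit=MONTHLY_CREDIT_USD):
    """How many identical requests fit in the monthly credit."""
    return math.floor(credit / request_cost(input_tokens, output_tokens))


small = requests_per_credit(2_000, 500)     # assumed short question
large = requests_per_credit(50_000, 4_000)  # assumed context-heavy prompt

print(f"short prompts per month: {small} (old cap: {OLD_REQUEST_CAP})")
print(f"long prompts per month: {large} (old cap: {OLD_REQUEST_CAP})")
```

Under these assumed prices, the credit covers 296 short requests but only 19 long ones, compared with 500 fast responses of any size under the old plan.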


Tools & Releases YOU Should Know About

Leap AI is a no-code workflow automation platform for building and deploying AI-powered workflows. Connect AI services and tools to create sophisticated automation pipelines that automate repetitive work and streamline your processes. Perfect for teams looking to integrate AI capabilities without complex development overhead.

Windframe.dev is a powerful drag-and-drop UI builder built on top of Tailwind CSS. Think of it like Figma for front-end developers, but with live Tailwind code generation and component-level control. Design interfaces visually and export clean, production-ready code instantly, making it ideal for rapid prototyping and professional development.

Replicate is a leading cloud platform enabling software developers to run, fine-tune, and deploy machine learning models effortlessly with a simple API. Removing the barriers of complex AI infrastructure, Replicate offers access to thousands of open-source models as well as the ability to host custom solutions, making AI deployment accessible to developers at any scale.


And that wraps up this issue of “This Week in AI Engineering.”

Thank you for tuning in! Be sure to share this newsletter with your fellow AI enthusiasts and follow for more weekly updates.

Until next time, happy building!
