OpenAI FINALLY Releases the ChatGPT Agent The Internet Has Been Asking For

mrarup82 · 4 hours ago

Hello AI Enthusiasts!

Welcome to the Twenty-Eighth edition of “This Week in AI Engineering”!

This week, OpenAI launched ChatGPT Agent, Moonshot AI’s Kimi K2 beat Claude Opus 4 while being up to 90% cheaper, Mistral released what it calls the world’s best open speech recognition models, Perplexity unveiled its AI browser, and Cursor’s CEO had to apologise publicly.

As always, we’ll also explore some under-the-radar tools that can supercharge your development workflow.

ChatGPT Agent is FINALLY here

OpenAI has released ChatGPT Agent, a unified system that combines deep research capabilities with computer operation abilities. The agent can browse the web, use terminals, write code, analyze data, and create reports, spreadsheets, and presentations, all while achieving state-of-the-art performance across multiple benchmarks.

What’s New

Unified Computer Operation: The agent operates on its own virtual computer, intelligently switching between web browsers, terminals, and API access based on task requirements.
Collaborative Workflow: Users can interrupt, redirect, or take control at any point during execution, maintaining human oversight over complex workflows.
Real-Time Narration: Provides live updates of its activities and asks for permission before taking consequential actions.
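
To make the collaborative workflow concrete, here is a minimal sketch of an agent loop that narrates each step and pauses for approval before consequential ones. It is an illustration only, not OpenAI’s implementation: the Step class, the run_plan function, and the approve callback are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    tool: str            # hypothetical tool label, e.g. "browser" or "terminal"
    description: str     # narration shown to the user
    consequential: bool  # True if the step should be confirmed first


def run_plan(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Narrate each step and stop at the first consequential step the user declines."""
    log: List[str] = []
    for step in steps:
        log.append(f"[{step.tool}] {step.description}")
        if step.consequential and not approve(step):
            log.append("stopped: user declined")
            break
        log.append("done")
    return log


plan = [
    Step("browser", "search for flight prices", False),
    Step("terminal", "aggregate prices into a CSV", False),
    Step("browser", "submit the booking form", True),
]

# A policy that declines every consequential step.
log = run_plan(plan, approve=lambda step: False)
for line in log:
    print(line)
```

Passing a different approve callback (for example, one that prompts on the command line) changes the oversight policy without touching the loop itself.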

Benchmark Domination

ChatGPT Agent is crushing industry benchmarks across the board:

Humanity’s Last Exam (Expert-Level Questions): 41.6% (new state-of-the-art, significantly outperforming Deep Research at 26.6% and OpenAI o3 at 24.9%)
FrontierMath (Expert Mathematics): 27.4% (beating OpenAI o4-mini at 19.3% and o3 at 10.3%)
DSBench Data Analysis: 89.9% (surpassing human performance at 64.1% and GPT-4o at 34.1%)
BrowseComp (Agentic Browsing): 68.9% (new state-of-the-art, ahead of Deep Research at 51.5%)
Investment Banking Modeling: 71.3% (dramatically outperforming OpenAI o3 at 41.0%)

Use Cases & Practical Applications

ChatGPT Agent excels in several key areas that demonstrate its real-world utility:

Research & Analysis

Conduct comprehensive market research by gathering data from multiple sources and synthesizing insights
Analyze financial documents and create investment reports with supporting charts and visualizations
Perform academic literature reviews across multiple databases and compile structured summaries

Business Operations

Manage your calendar, whip up a PowerPoint presentation and automate routine administrative tasks
Create detailed project reports by collecting data from various team tools and platforms
Build financial models and perform complex calculations in Excel with human-level accuracy

Content Creation & Documentation

Generate comprehensive technical documentation by analyzing codebases and system architectures
Create presentations with data-driven insights pulled from live web sources
Develop training materials by researching best practices and organizing information logically

What Makes It Superior to Other Agents

Multi-Modal Integration: Unlike specialized agents that focus on single tasks, ChatGPT Agent seamlessly combines web browsing, code execution, data analysis, and content creation in one unified workflow.
Human-in-the-Loop Design: Most autonomous agents run independently with limited oversight. ChatGPT Agent maintains collaborative control, allowing users to intervene, redirect, or approve actions at any point.
State-of-the-Art Performance: ChatGPT Agent’s output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, and it significantly outperforms existing solutions such as Claude and specialized research tools.
Real-Time Adaptability: While other agents follow rigid workflows, ChatGPT Agent dynamically switches between different tools and approaches based on task requirements, making it more flexible and efficient.

Availability & Safety

ChatGPT Agent is rolling out now to Pro, Plus, and Team users. Pro users get 400 messages per month, and other paid users get 40. OpenAI has implemented extensive safeguards, including explicit user confirmation for consequential actions and enhanced biological and chemical safety controls.

Kimi K2 Beats Claude Opus 4 While Being 90% Cheaper

Moonshot AI’s Kimi K2 has achieved the remarkable feat of becoming the #1 open model on the LMSys Chatbot Arena while delivering exceptional performance at a fraction of the cost of proprietary alternatives.

What’s New

Open Source Excellence: Available as both Kimi-K2-Base (foundation model) and Kimi-K2-Instruct (chat-ready model) with 32 billion activated parameters and 1 trillion total parameters.
Blazing Speed: Achieves over 200 tokens/second on Groq hardware, making it one of the fastest inference models available.
Cost Revolution: Up to 90% cheaper than Claude Opus 4 while outperforming it on coding benchmarks.
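
The gap between activated and total parameters is easier to appreciate as a ratio. The short calculation below uses only the two figures quoted above; the function name is made up for this example.

```python
TOTAL_PARAMS = 1_000_000_000_000  # 1 trillion total parameters (figure quoted above)
ACTIVE_PARAMS = 32_000_000_000    # 32 billion activated parameters (figure quoted above)


def active_fraction(active: int, total: int) -> float:
    """Share of the model's parameters that take part in a single forward pass."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"{fraction:.1%} of parameters are active per token")
```

Only a small slice of the model runs for each token, which is one reason a model with a very large total parameter count can still be served quickly and cheaply.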

Technical Innovation

MuonClip Optimizer: Revolutionary training technique that solved exploding attention logits, enabling stable pre-training on 15.5T tokens with zero training spikes.
Agentic Focus: Designed not just to answer but to act. It can use tools and execute complex workflows, a capability trained through large-scale agentic data synthesis.
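
For readers wondering what “exploding attention logits” means in practice, the sketch below shows one general way to keep them bounded: measure the largest query-key logit and, if it exceeds a cap, shrink queries and keys by the same factor. This is an illustration of the idea only, not Moonshot’s MuonClip code. The threshold value, the helper names, and the choice to scale both sides equally are assumptions made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul_t(a: Matrix, b: Matrix) -> Matrix:
    """Return a @ b^T for small row-major matrices."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def max_abs_logit(q: Matrix, k: Matrix) -> float:
    """Largest absolute scaled dot-product logit between query rows and key rows."""
    dim = len(q[0])
    return max(abs(v) for row in matmul_t(q, k) for v in row) / math.sqrt(dim)


def clip_scale(max_logit: float, threshold: float) -> float:
    """Factor applied to BOTH queries and keys so the largest logit drops to the threshold."""
    if max_logit <= threshold:
        return 1.0
    return math.sqrt(threshold / max_logit)


def scaled(m: Matrix, s: float) -> Matrix:
    return [[s * v for v in row] for row in m]


q = [[3.0, 4.0], [1.0, 0.0]]
k = [[6.0, 8.0], [0.0, 1.0]]
threshold = 10.0  # hypothetical cap chosen for the example

before = max_abs_logit(q, k)
s = clip_scale(before, threshold)
after = max_abs_logit(scaled(q, s), scaled(k, s))
print(f"max logit before: {before:.2f}, after: {after:.2f}")
```

Because each logit is a product of a query and a key, scaling both by the square root of the ratio brings the largest logit down to exactly the cap. In a real training run the same factor would be applied to the projection weights rather than to individual activations.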

Benchmark Performance

Kimi K2 is setting new standards across coding and STEM tasks:
LiveCodeBench v6: 53.7% (beating Claude Sonnet 4 at 48.5% and Claude Opus 4 at 47.4%)
AIME 2024: 69.6% (significantly ahead of Claude Opus 4 at 48.2%)
MATH-500: 97.4% (outperforming Claude Opus 4 at 94.4%)
SWE-bench Verified: 65.8% single attempt, 71.6% multiple attempts

Real-World Applications

Data Science & Analytics

Salary Analysis Workflows: Performed comprehensive salary data analysis using 16 IPython calls, including data cleaning, statistical analysis, visualization creation, and trend identification across multiple demographics and job categories
Market Research Automation: Automated collection and analysis of market data from multiple sources, creating comprehensive reports with statistical insights and predictive modeling

Academic & Research Applications

Stanford NLP Genealogy Research: Executed complex genealogy research involving multiple tool interactions, database queries, cross-referencing academic papers, and generating family tree visualizations with supporting documentation
Literature Review Automation: Systematically searched academic databases, extracted key insights, categorized findings, and synthesized comprehensive literature reviews with proper citations

Software Development

Full-Stack Game Development: Developed a complete JavaScript Minecraft game through iterative debugging, including game engine setup, 3D rendering implementation, player controls, world generation algorithms, and performance optimization
Code Refactoring Projects: Analyzed legacy codebases, identified optimization opportunities, implemented improvements, and validated changes through automated testing

Business Intelligence

Financial Modeling: Created complex financial models with scenario planning, risk analysis, and automated reporting features
Process Optimization: Analyzed business workflows, identified bottlenecks, and implemented automated solutions to improve efficiency

Content & Documentation

Technical Documentation Generation: Automatically generated comprehensive API documentation, user guides, and system architecture diagrams from existing codebases
Multi-Language Content Creation: Produced technical content and educational materials across multiple languages with cultural adaptation

Mistral Releases World’s Best Open Speech Recognition Models

Mistral AI has unveiled Voxtral, claiming to deliver the world’s best open-source speech recognition models. Available in two sizes, Voxtral (24B) for production and Voxtral Mini (3B) for edge deployment, both are released under the Apache 2.0 license.

What’s New

State-of-the-Art Performance: Outperforms OpenAI Whisper large-v3, GPT-4o Mini Transcribe, and Gemini 2.5 Flash across all transcription tasks.
Multilingual Excellence: Beats Whisper in every language tested on FLEURS benchmark, including Arabic, with automatic detection and top-tier support.
Text-Native Capabilities: Retains full language model capabilities, addressing a major pain point: audio language models often lose text abilities.

Enterprise-Ready Features

32k Token Context: Handles up to 30 minutes of audio for transcription and 40 minutes for understanding.
Built-in Intelligence: Direct Q&A and summarization from speech without chaining separate models.
Function Calling: Trigger workflows directly from voice commands.
Affordable Access: API pricing starts at just $0.001/minute, making high-quality speech intelligence accessible at scale.
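
As a back-of-the-envelope check on those numbers, the sketch below estimates the bill and the number of transcription requests for a large audio archive. It assumes that cost scales linearly with duration at the quoted starting price and that long recordings are split into pieces of at most 30 minutes. Both are simplifying assumptions for this example, and the function name is made up.

```python
import math

PRICE_PER_MINUTE_USD = 0.001  # starting API price quoted above
MAX_TRANSCRIBE_MINUTES = 30   # transcription limit quoted above


def transcription_plan(audio_minutes: float) -> dict:
    """Estimate chunk count and cost, assuming simple linear per-minute billing."""
    chunks = math.ceil(audio_minutes / MAX_TRANSCRIBE_MINUTES)
    cost = audio_minutes * PRICE_PER_MINUTE_USD
    return {"chunks": chunks, "cost_usd": round(cost, 4)}


plan = transcription_plan(1000 * 60)  # 1,000 hours of audio
print(plan)
```

Under these assumptions, a 1,000-hour archive works out to 2,000 requests and roughly $60.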

Availability

Available via API, Hugging Face downloads, and Le Chat voice interface, with enterprise options including private deployment and fine-tuning for specialized domains.

Perplexity’s Latest AI Web Browser

Perplexity has officially launched Comet, an AI-powered browser that moves beyond traditional search to create an intelligent, conversational web experience. Now in early access for Perplexity Max users, Comet transforms passive browsing into active thinking.

From Navigation to Cognition

Unified Intelligence: Organizes web activity into a single intelligent interface, eliminating tab overload and context-switching friction.
Conversational Browsing: Ask follow-up questions as you browse, compare content, and dig deeper, turning browsing into flow-state research.
Contextual Understanding: Maintains context over time, turning long sessions into seamless interactions.

From Answers to Action

Action Agent: Book meetings, send emails, shop, or organize your day, all in one continuous conversation.
Workflow Delegation: Ask Comet to brief you, make comparisons, or complete complex workflows through natural conversation.
Curiosity-Driven: Highlight text on any page for on-the-fly explanations, explore tangents without losing place, and request counterpoints or deeper questions.

Key Advantages Over Traditional Browsers

Contextual Memory: Unlike traditional browsers that treat each tab as isolated, Comet maintains conversational context across your entire browsing session, remembering previous queries and building upon them.
Real-Time Intelligence: I used Perplexity’s new Comet browser to book a restaurant while I wrote this article – demonstrating capabilities far beyond traditional browsers’ passive information consumption.
Reduced Tab Chaos: Eliminates the need for dozens of open tabs by intelligently synthesizing information and maintaining context within a single conversational flow.

How Comet Surpasses Chrome, Safari, and Arc

Chrome Comparison

Intelligence Integration: While Chrome requires switching between tabs and external AI tools, Comet has native AI integration that understands context across your entire browsing session
Reduced Cognitive Load: Eliminates the need to manually synthesize information from multiple sources – Comet automatically connects related information and provides insights
Task Automation: Features include real-time summarization, product comparisons, and task automation, all in a conversational interface, unlike Chrome’s static browsing experience

Safari Comparison

Cross-Platform Intelligence: Unlike Safari’s ecosystem lock-in, Comet works across platforms while maintaining intelligent context
Proactive Assistance: Instead of Safari’s reactive search, Comet anticipates information needs and provides contextual suggestions
Research Efficiency: Transforms Safari’s linear browsing into dynamic, interconnected knowledge discovery

Arc Comparison

AI-First Design: While Arc focuses on organization and aesthetics, Comet prioritizes intelligent interaction and automated reasoning
Conversational Interface: Arc’s sidebar organization pales compared to Comet’s natural language interaction model
Action Capabilities: Arc organizes content, but Comet can act on it – booking reservations, sending emails, and completing tasks directly

Tasks Made Significantly Easier

Research & Analysis

Comparative Shopping: Automatically compares products across multiple sites, synthesizing reviews, prices, and specifications without manual tab switching
Academic Research: Connects related papers, cross-references citations, and builds comprehensive understanding across multiple sources
Market Analysis: Aggregates data from various financial sources and creates real-time analytical insights

Daily Productivity

Travel Planning: Books flights, hotels, and restaurants while maintaining context about your preferences and constraints
Email Management: Drafts responses based on web research and sends them directly from the browser
Calendar Integration: Schedules meetings by automatically finding availability and sending invites

Content Creation

Fact-Checking: Verifies information in real-time as you write, providing sources and alternative perspectives
Research Synthesis: Combines information from multiple sources into coherent summaries and reports
Citation Management: Automatically tracks and formats sources for academic or professional writing

Trust and Accuracy

Built on Perplexity’s signature commitment to factual answers with trust, transparency, and truth, ideal for high-stakes decisions like comparing insurance plans or understanding investments.

Cursor Faces Backlash Over Pro Plan Pricing Shift

Cursor, the AI-powered coding platform by Anysphere, came under fire after an abrupt change to its $20/month Pro plan sparked user confusion, unexpected charges, and widespread frustration.

What Changed

Old Model: 500 fast responses per month using advanced models like Claude or GPT-4, plus unlimited slow responses after the cap.
New Model: $20 monthly credit for frontier model usage at real API rates, with unlimited usage only via “Auto mode” that dynamically selects cheaper or slower models.

User Frustration

Unexpected Charges: Many users hit the $20 usage cap after just a few prompts, especially when using models like Claude Opus 4.
Automatic Billing: Users were charged beyond their plan without realizing spend limits had to be manually configured.
Limited Premium Access: The only truly “unlimited” access was through Auto mode, which often doesn’t route to premium models.

Cursor’s Response

CEO Michael Truell issued an apology acknowledging poor communication: “These changes hurt the trust we work hard to build… We missed the mark.”
Full Refunds: Available for any unexpected charges from June 16 to July 4 by contacting [email protected].
Future Improvements: Better pre-change communication, clearer dashboard visibility, and enhanced UI features to alert users approaching usage limits.

The Rationale

Cursor cited growing API costs from model providers, explaining that request-based pricing couldn’t reflect the real cost of longer, token-heavy prompts, while API-based pricing provides more accurate cost structure for advanced usage.
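
A quick calculation shows why token-heavy prompts can exhaust a fixed credit so fast. The per-million-token prices and prompt sizes below are hypothetical placeholders chosen for illustration, not Cursor’s or any provider’s published rates; only the $20 credit comes from the plan described above.

```python
MONTHLY_CREDIT_USD = 20.0  # Pro plan credit described above

# Hypothetical frontier-model prices in USD per million tokens (illustrative only).
INPUT_PRICE_PER_M = 15
OUTPUT_PRICE_PER_M = 75


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request when billed per token at the assumed rates."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def requests_covered(credit: float, cost_per_request: float) -> int:
    """Number of whole requests the credit pays for."""
    return int(credit // cost_per_request)


# A long agentic prompt with lots of code context vs. a short question.
heavy = requests_covered(MONTHLY_CREDIT_USD, request_cost(50_000, 4_000))
light = requests_covered(MONTHLY_CREDIT_USD, request_cost(2_000, 500))
print(f"heavy prompts covered: {heavy}, light prompts covered: {light}")
```

With these assumed numbers, the same credit covers a few hundred short questions but fewer than twenty large-context requests, which is the mismatch a flat per-request quota hides.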

Tools & Releases YOU Should Know About

Leap AI is a no-code workflow automation platform for building and deploying AI-powered workflows. Connect AI services and tools to create sophisticated automation pipelines that automate repetitive work and streamline your processes. Perfect for teams looking to integrate AI capabilities without complex development overhead.

Windframe.dev is a powerful drag-and-drop UI builder built on top of Tailwind CSS. Think of it like Figma for front-end developers, but with live Tailwind code generation and component-level control. Design interfaces visually and export clean, production-ready code instantly, making it ideal for rapid prototyping and professional development.

Replicate is a leading cloud platform enabling software developers to run, fine-tune, and deploy machine learning models effortlessly with a simple API. Removing the barriers of complex AI infrastructure, Replicate offers access to thousands of open-source models as well as the ability to host custom solutions, making AI deployment accessible to developers at any scale.

And that wraps up this issue of “This Week in AI Engineering.”

Thank you for tuning in! Be sure to share this newsletter with your fellow AI enthusiasts and follow for more weekly updates.

Until next time, happy building!
