
xAI’s Grok 3: All the GPUs, None of the Breakthroughs

At the end of February, Elon rolled out his latest model. Of course, it was “the best in the world.”

Is it really the smartest AI on Earth?

As usual, Musk brought the hype train. But there wasn’t much objective data at launch. xAI’s short blog post mentioned that it was still in beta and the models were actively training.

They flashed some benchmarks showing Grok 3 ahead. However, they didn't provide API access, which matters because independent benchmarks rely on the API for evaluation.

So, Elon claims Grok 3 is “scarily smart” and beats everything else. But the only ways to check were chatting with it yourself or looking at their benchmarks.

And those benchmarks? Take a look:

See that lighter area on the right? That’s the boost Grok got by having way more compute power (test-time compute) to get more consistent answers. It’s not exactly a fair fight.

You probably know AI models often give slightly different answers each time—sometimes better, sometimes worse. Most benchmarks ignore this variability, evaluating only the first response (pass@1). It’s simpler and matches how we actually use AI—we expect a good answer on the first try.

But Grok's results were all shown using cons@64, meaning it got 64 tries at each question and the most common answer was graded. Then xAI compared that boosted score against the pass@1 scores of competitors.
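To make the difference concrete, here's a minimal Python sketch of the two scoring schemes (the `ask` argument stands in for a single, non-deterministic model call; this is purely illustrative, not xAI's evaluation code):

```python
from collections import Counter
from typing import Callable

def pass_at_1(ask: Callable[[str], str], question: str, correct: str) -> bool:
    # Standard scoring: grade a single sample as-is.
    return ask(question) == correct

def cons_at_k(ask: Callable[[str], str], question: str, correct: str, k: int = 64) -> bool:
    # Consensus scoring: sample k answers, grade only the most common one.
    answers = [ask(question) for _ in range(k)]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return majority_answer == correct
```

Majority voting smooths out the model's run-to-run variance, so a cons@64 score will almost always look better than the same model's pass@1 score.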

So on one hand, they claim it’s a next-gen model. On the other, they’re using pretty cheap tricks.

To be fair, in such a competitive field, all labs bend the rules. They cherry-pick benchmarks or exclude stronger models from comparisons—but rarely as blatantly.

Okay, benchmarks aside. What are experienced users saying after actually using it? The general consensus is:

The model is huge but hasn’t brought breakthroughs. It still hallucinates and tends toward overly long responses.

Performance-wise, Grok 3 lands somewhere near the top OpenAI models, maybe a bit better than DeepSeek and Google’s stuff at the time of release.

However, two months later, Gemini 2.5, Claude 3.7, and the new GPT-4o arrived. We also finally got partial API access for Grok 3 and its mini version. Unfortunately, only the mini version received the thinking mode in the API.

So today we know it’s expensive and definitely not the absolute best.

But hold on, there’s still more to the story.

The model is interesting and worth looking at. And you have to hand it to them, Elon and xAI jumped into the market quickly, becoming a key player in record time.

1 – The Hardware

The big story here?

In 2024, xAI built a massive compute cluster. We’re talking 100,000 Nvidia H100 GPUs up and running in just 4 months. Then they doubled that to 200,000 cards in another 3 months.

Nvidia’s CEO, Jensen Huang, mentioned this usually takes about 4 years.

This was a massive engineering feat. And this time, no funny business—it’s the largest data center in the world. Nobody else has managed to link up that many GPUs in one spot.

Typically, such clusters are multiple regular data centers linked by costly InfiniBand cables. During training, these centers need to swap tons of data constantly. If the connection is slow, those pricey GPUs sit idle, which is bad news.

A typical data center might have 10,000-20,000 GPUs, sucking down 20-30 megawatts of power. For example, Microsoft (for OpenAI) operates a 100k-GPU network in Arizona, and Meta runs 128k.

See the two H-shaped buildings? That’s two standard Meta data centers next to each other.

Power needs for top-tier clusters have exploded up to 10x since 2022. We’re now talking around 150 MW per cluster. That’s like powering a small city. This creates a huge load on regional power grids. In some places, it’s actually cheaper to generate the power than to deliver it because there aren’t enough power lines.

So, Elon enters this market way behind. And… does the “Elon thing.” Hate his tweets all you want, the man knows how to build factories like nobody else.

He bought an old Electrolux factory in Memphis and decided to build one giant data center instead of a network like everyone else.

Predictably, power became an issue.

The factory only had 7 MW from the local grid—enough for only 4,000 GPUs. The local utility, Tennessee Valley Authority, promised another 50 MW, but not until August. And xAI’s own 150 MW substation was still being built, not ready until year-end.
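A quick back-of-the-envelope check on that 4,000-GPU figure (the ~1.75 kW per GPU all-in draw is my assumption for an H100 server including cooling and networking overhead, not a published xAI number):

```python
# Rough check: how many GPUs can a 7 MW grid connection feed?
grid_power_kw = 7 * 1_000       # 7 MW from the local grid
kw_per_gpu_all_in = 1.75        # assumed: H100 + server + cooling + networking

print(f"~{grid_power_kw / kw_per_gpu_all_in:,.0f} GPUs")  # ~4,000 GPUs
```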

But waiting isn’t Musk’s style.

Dylan Patel (from Semianalysis) spotted via satellite images that Elon just brought in 14 massive mobile diesel generators from VoltaGrid. Hooked them up to 4 mobile substations and powered the data center. Literally trucked in the electricity.

Patel mentioned they might have bought up 30% of the entire US market for these generators (though I couldn’t find anything on that).

Impressively, the data center also uses liquid cooling. Only Google has really done this at scale before. This is a big deal because the next generation of Nvidia chips, the Blackwell B200s, require liquid cooling. Everyone else will have to retrofit their existing data centers.

You can check out the first few minutes of this video to see what it looks like inside. I got a chuckle out of how hyped the guy is about gray boxes and cables:

It’s seriously cool engineering—just look at the cable management.

No one has done such massive work in so little time.

2 – Even More Hardware!

Elon says by summer 2025, they’ll have a 300k GPU cluster with Blackwell B200 chips. Given Musk’s habit of exaggeration, let’s say it’s realistically somewhere between 200-400k new chips by the end of 2025. B200 is roughly 2.2 times better than H100 for model training (based on Nov 2024 estimates).
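For a rough sense of scale, here's that range converted into H100-equivalents; this is just the arithmetic from the estimates above, and the chip counts themselves are speculation:

```python
# Convert the rumored Blackwell expansion into H100-equivalent training throughput.
b200_vs_h100 = 2.2  # estimated B200 training speedup over H100 (Nov 2024)

for b200_count in (200_000, 300_000, 400_000):
    h100_equiv = b200_count * b200_vs_h100
    print(f"{b200_count:,} B200s ~ {h100_equiv:,.0f} H100-equivalents")
```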

Musk even plans to build a dedicated 2.2 GW power plant. That’s more power than a medium-sized city consumes.

And he’s not alone—all the big players are doing something similar:

  • Meta is building two gas plants in Louisiana.
  • OpenAI/Microsoft is setting up something similar in Texas.
  • Amazon and Google are also building gigawatt-scale data centers.

Why not nuclear? It’s got the power, but building a nuclear plant takes way too long. You can’t just pop one up next to your data center in a year. Wind and solar farms plus batteries are promising, but they also take too long to deploy at the needed scale.

As a result, both Microsoft and Meta have already had to backtrack on their green renewable energy promises. They broke their backs lifting Moloch to Heaven!

3 – Grok 3 is Huge

So, Elon built this massive, expensive box. Now what?

Estimates suggest Grok 2 trained on ~20k H100s, while Grok 3 used over 100k. For context, GPT-4 trained for about 90-100 days on ~25k older A100 chips, with the H100 roughly 2.25x faster than the A100.

Doing the math, Grok 2 got about twice the computing power thrown at it compared to GPT-4. And Grok 3 got five times more than Grok 2. Google’s Gemini 2.0 probably used a similar amount of hardware (100k of their own TPUv6 chips), but the model itself is likely smaller.
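Here's the arithmetic behind those ratios, using only the rough estimates above (I'm assuming Grok 2 and Grok 3 trained for roughly as long as GPT-4 did, which is a simplification):

```python
# Relative pretraining compute in "H100-days" (very rough).
H100_OVER_A100 = 2.25          # H100 ~2.25x faster than A100
TRAIN_DAYS = 95                # assume ~90-100 day runs for all three models

gpt4  = 25_000 * TRAIN_DAYS / H100_OVER_A100   # ~25k A100s, converted to H100-days
grok2 = 20_000 * TRAIN_DAYS                    # ~20k H100s
grok3 = 100_000 * TRAIN_DAYS                   # 100k+ H100s

print(f"Grok 2 vs GPT-4:  ~{grok2 / gpt4:.1f}x")   # ~1.8x
print(f"Grok 3 vs Grok 2: ~{grok3 / grok2:.1f}x")  # ~5.0x
```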

Basically, the total compute cost for Grok 3 is an order of magnitude (10 times!) higher than its closest competitor. Sadly, we don’t have public data for GPT-4.5 or Gemini 2.5.

So they poured insane amounts of resources into building this mega-cluster, and the resulting model is… just on par with the incumbents. Definitely not leagues better.

It seems xAI’s expertise in training still lags behind OpenAI, Google, or Anthropic. They essentially brute-forced their way into the top tier. No magic tricks shown, just: “If brute force isn’t solving your problem, you aren’t using enough of it.”

But there’s a catch with that approach. Epoch AI estimates that over the last decade, algorithmic improvements accounted for about a third of the progress in model capabilities. The other two-thirds came from just throwing more hardware and data at bigger models.

Brute force worked for Grok 3 this time, but costs will grow exponentially while delivering less and less improvement. And xAI needs to catch up on the algorithmic side. The good news is that now they're seen as pushing the frontier, so it will likely be much easier to attract top talent.

4 – What’s Good About Grok?

  1. It’s completely free (probably until the full release).

And without Anthropic’s tight limits, DeepSeek’s outages, or OpenAI’s paid tiers.

Even with all the new models dropped in the last couple of months, Grok is still holding its own near the top of the Chatbot Arena leaderboard.

We now also have independent benchmarking by Epoch AI:

And by LiveBench:

  2. Reasoning & Deep Research Mode

Back in February, a free Deep Research feature was mostly a Perplexity exclusive. Now Google and OpenAI offer some version of it in their basic tiers (maybe Grok pushed them?).

This mode automatically analyzes 30-100 links (Google might do more) in minutes and spits out a detailed (and bloated) summary that you just need to skim and fact-check. It's way easier than researching anything from scratch. I've found Grok's version works faster than the others, so I've started using it when I need to research something, like when buying new headphones.

  3. Integration with X

This could be its killer feature: semantic search not just for keywords, but for what you meant. You can also ask it to summarize posts on a topic to track trends. Or to find recent posts from a specific user.

Twitter is the closest thing to a real-time information platform, so that's great. But so far Grok often lags, pulling data from the last couple of days instead.
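As an illustration of what "search by meaning" looks like, here's a minimal sketch using the open-source sentence-transformers library; this shows the general embedding-search idea, not how Grok's X integration actually works under the hood:

```python
# Minimal semantic search over a handful of posts (illustrative only).
from sentence_transformers import SentenceTransformer, util

posts = [
    "Nvidia just announced the Blackwell B200 lineup.",
    "My sourdough starter finally survived the week.",
    "Rumors say xAI is doubling its Memphis GPU cluster again.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
post_embeddings = model.encode(posts, convert_to_tensor=True)

query = "news about new AI training hardware"  # no keyword overlap needed
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, post_embeddings)[0]
best = int(scores.argmax())
print(posts[best])  # matches the Blackwell post by meaning, not by keywords
```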

  4. The Unfiltered Stuff

And for the grand finale, the 18+ mode. Grok is notoriously easy to jailbreak without much effort. You can get it to do… well, whatever you might want, from flirty voices to questionable recipes. The voice mode examples are particularly wild.

Listen to the end, it’s hilarious!

Ironically, Grok itself doesn’t seem to hold Musk (or Trump) in high regard. When this came out, xAI tried a fix—literally hardcoding a rule that Grok couldn’t criticize Elon. When that blew up, they blamed a former OpenAI employee for “not fitting the company culture.” Super cringe.

The real issue is that Grok’s opinions are just a reflection of its training data (i.e., the internet), not some intentional bias. Trying to patch these views without messing up the whole model is hard.

5 – Should You Bother Trying It?

Definitely try it, but as your co-pilot, not your main model.

TLDR:

  • Cost way more to train than competitors’ models.

  • Despite that, performance is almost on par with the best.

  • But it’s super fast and free (for now).

  • The Deep Research mode is genuinely useful—try it if you haven’t.

  • More prone to hallucinations and jumping to conclusions too fast.

  • Responses are usually well-structured but often feel bloated.

  • Unique access to Twitter data.

xAI proved capable of building world-class infrastructure at unprecedented speed. But in actual AI capabilities, they’re basically buying their way to the top with sheer compute power.

This adds another strong player pressuring OpenAI, Google, and Anthropic, pushing the AI industry toward commoditization. Competition is heating up and the exclusivity of top-tier models is fading.

Enjoyed this? Give an upvote or subscribe to my newsletter. I’d appreciate it!
