Last week, Nvidia announced that 8 Blackwell GPUs in a DGX B200 could demonstrate 1,000 tokens per second (TPS) per user on Meta's Llama 4 Maverick. Today, the same independent benchmark firm, Artificial Analysis, measured Cerebras at more than 2,500 TPS/user, more than double the performance of Nvidia's flagship solution.
"Cerebras
has beaten the Llama 4 Maverick inference speed record set by NVIDIA
last week," said Micah Hill-Smith, Co-Founder and CEO of Artificial
Analysis. "Artificial Analysis has benchmarked Cerebras' Llama 4
Maverick endpoint at 2,522 tokens per second, compared to NVIDIA
Blackwell's 1,038 tokens per second for the same model. We've tested
dozens of vendors, and Cerebras is the only inference solution that
outperforms Blackwell for Meta's flagship model."
With today's results, Cerebras has set a world record for LLM inference speed on the 400B-parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family. Artificial Analysis tested multiple other vendors as well: SambaNova reached 794 TPS, Groq 549 TPS, Amazon 290 TPS, Google 125 TPS, and Microsoft Azure 54 TPS.
Andrew Feldman, CEO of Cerebras Systems, said, "The most important AI applications being deployed in enterprise today (agents, code generation, and complex reasoning) are bottlenecked by inference latency. These use cases often involve multi-step chains of thought or large-scale retrieval and planning, with generation speeds as low as 100 tokens per second on GPUs, causing wait times of minutes and making production deployment impractical. Cerebras has led the charge in redefining inference performance across models like Llama, DeepSeek, and Qwen, regularly delivering over 2,500 TPS/user."
With its world-record performance, Cerebras is the optimal solution for Llama 4 in any deployment scenario. Not only is Cerebras Inference the first and only API to break the 2,500 TPS/user milestone on this model, but unlike the Nvidia Blackwell system used in the Artificial Analysis benchmark, the Cerebras hardware and API are available now. Nvidia relied on custom software optimizations that are not available to most users, and notably, none of Nvidia's inference providers offers a service at Nvidia's published performance. This suggests that in order to achieve 1,000 TPS/user, Nvidia had to sacrifice throughput by running at batch size 1 or 2, leaving the GPUs at less than 1% utilization. Cerebras, by contrast, achieved this record-breaking performance without any special kernel optimizations, and it will be available to everyone through Meta's API service coming soon.
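Why would batch size 1 or 2 cost so much throughput? A minimal back-of-the-envelope sketch in Python illustrates the trade-off. Every number in it is an illustrative assumption rather than a measured value: the model simply assumes each user receives one token per decode step, and that larger batches amortize weight reads so the step time grows far more slowly than the batch size.

```python
# Back-of-the-envelope sketch of the batch-size trade-off (all latencies
# are hypothetical assumptions, not benchmark measurements).

def per_user_tps(step_ms: float) -> float:
    """Tokens per second seen by a single user: one token per decode step."""
    return 1000.0 / step_ms

def aggregate_tps(step_ms: float, batch: int) -> float:
    """Tokens per second summed across every request in the batch."""
    return batch * per_user_tps(step_ms)

# Assumed decode-step latencies: bigger batches amortize weight reads,
# so the step slows only slightly while throughput multiplies.
for batch, step_ms in [(1, 1.0), (2, 1.1), (32, 2.0), (256, 8.0)]:
    print(f"batch={batch:3d}  per-user={per_user_tps(step_ms):6.0f} TPS  "
          f"aggregate={aggregate_tps(step_ms, batch):7.0f} TPS")
```

Under these assumptions, batch 1 delivers 1,000 TPS to a single user but only 1,000 TPS in total, while batch 256 delivers 32,000 TPS in aggregate at just 125 TPS per user: hitting a headline per-user number on GPUs means giving up most of the hardware's throughput.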
For cutting-edge AI applications such as reasoning, voice, and agentic workflows, speed is paramount. These applications gain intelligence by processing more tokens during inference, but those extra tokens also make them slow and force customers to wait. And when customers are forced to wait, they leave for competitors who provide answers faster, a finding Google demonstrated with search more than a decade ago.
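To put the speed gap in concrete terms, here is a quick arithmetic sketch. The 30,000-token workload is an assumed size for a long multi-step agent run, not a figure from the benchmark; the TPS values are the ones quoted in this piece.

```python
# Rough wait-time arithmetic for a long agentic run. The 30,000-token
# workload is an assumption for illustration only.
tokens = 30_000
for name, tps in [("GPU at 100 TPS/user", 100),
                  ("Cerebras at 2,522 TPS/user", 2_522)]:
    print(f"{name}: {tokens / tps:,.0f} s to complete")
```

At 100 TPS/user the run takes five minutes, consistent with the "wait times of minutes" described above; at 2,522 TPS/user it finishes in about 12 seconds.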
With record-breaking performance, Cerebras hardware and the resulting API service are the best choice for developers and enterprise AI users around the world.