
Groq API - 500+ Tokens/s - First Impression and Tests - WOW!

All About AI · 4 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Groq’s LPUs are positioned as inference-focused hardware designed to reduce bottlenecks in sequential LLM decoding.

Briefing

Groq’s API delivers striking inference speeds, especially with Mixtral 8x7B, hitting roughly 417 tokens per second in a like-for-like text generation test. The practical takeaway is that Groq’s Language Processing Units (LPUs) are built for fast, compute-heavy decoding, and the results suggest a meaningful performance gap versus both a hosted baseline (ChatGPT 3.5 Turbo) and local inference on a PC.

The testing starts with a quick primer on Groq hardware: LPUs are designed to reduce the bottlenecks that slow large language model (LLM) inference, particularly compute density and memory bandwidth. Groq positions LPUs as outperforming GPUs and CPUs for inference workloads, while explicitly not competing in the model-training market. The hardware pitch includes on-die SRAM (230 MB per chip) and very high memory bandwidth (quoted as up to 8 terabytes per second), all aimed at accelerating sequential LLM generation.

To stress the system in a real-time setting, the workflow is expanded into speech-to-speech. Microphone audio is transcribed using Faster Whisper, then fed into Groq-backed text generation, and finally converted back to speech using a local text-to-speech model from OpenVoice. The conversation includes a short “pirate” persona instruction (no emojis, very short and conversational), and the interaction runs with only minor lag—enough to suggest the API can support low-latency, conversational loops rather than just offline batch generation.
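
A minimal sketch of that pipeline in Python might look like the following, assuming the faster-whisper and groq packages, a GROQ_API_KEY environment variable, and a hypothetical synthesize_with_openvoice() helper standing in for the video’s OpenVoice wrapper; the Mixtral model id is also an assumption, not confirmed by the transcript.

```python
# Sketch of the speech-to-speech loop: Faster Whisper -> Groq -> OpenVoice.
# Assumptions: input audio is in input.wav, GROQ_API_KEY is set, and
# synthesize_with_openvoice() is a hypothetical stand-in for the local
# OpenVoice text-to-speech step used in the video.
from faster_whisper import WhisperModel
from groq import Groq

def synthesize_with_openvoice(text: str) -> None:
    """Hypothetical placeholder for local OpenVoice text-to-speech."""
    raise NotImplementedError

# 1) Speech -> text with Faster Whisper
whisper = WhisperModel("base.en", compute_type="int8")
segments, _info = whisper.transcribe("input.wav")
user_text = " ".join(segment.text for segment in segments)

# 2) Text -> text via the Groq API, with the short "pirate" persona
client = Groq()  # reads GROQ_API_KEY from the environment
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed model id
    messages=[
        {"role": "system",
         "content": "You are Ali the pirate. No emojis. Keep replies very short and conversational."},
        {"role": "user", "content": user_text},
    ],
)
reply = response.choices[0].message.content

# 3) Text -> speech with the local OpenVoice model
synthesize_with_openvoice(reply)
```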

Speed comparisons then get more quantitative. Using the same prompt, ChatGPT 3.5 Turbo clocks in at about 83.6 tokens per second (760 tokens in 9 seconds). A local run in LM Studio using an Open Hermes Mistral 7B model lands around 34 tokens per second. Switching to a smaller local model (a 3B parameter model) improves throughput to about 77 tokens per second, still below the hosted baseline. The standout result comes when Groq is used with Mixtral 8x7B, producing about 417 tokens per second—described as “crazy” speed—and noted as potentially close in quality to ChatGPT 3.5, given Mixtral 8x7B’s reputation.
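
A tokens-per-second figure like those above can be derived by timing a single completion and dividing the completion token count by the elapsed time. The snippet below is a sketch assuming the groq Python SDK and an OpenAI-style usage object on the response, not the exact script used in the video; the prompt and model id are illustrative.

```python
# Sketch of a throughput measurement: time one completion, then divide
# the completion token count by the elapsed wall-clock time.
import time

from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
prompt = "Write a detailed essay about the history of computing."  # illustrative prompt

start = time.perf_counter()
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed model id
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.perf_counter() - start

completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")
```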

Finally, a chain-prompt experiment tests iterative prompting speed. A “simplify” function repeatedly compresses text: the output from one loop becomes the input to the next, with the process running for 10 iterations. The loop completes in under a second per iteration on average, totaling roughly 8 seconds for the full loop, ending with a heavily shortened version (down to a couple of sentences). The overall message is that Groq’s inference speed remains strong even when the workload involves multiple sequential generations, not just a single completion.
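
The chain-prompt experiment is, in effect, a loop that keeps feeding the model its own output. A rough sketch follows, again assuming the groq SDK and with an illustrative prompt and model id rather than the video’s exact code.

```python
# Sketch of the iterative "simplify" loop: each pass asks the model to
# shorten the previous pass's output, for 10 iterations in total.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
text = open("long_text.txt").read()  # illustrative starting text

for i in range(10):
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",  # assumed model id
        messages=[{
            "role": "user",
            "content": f"Simplify the following text and make it shorter:\n\n{text}",
        }],
    )
    text = response.choices[0].message.content
    print(f"Iteration {i + 1}: {len(text)} characters")

print(text)  # after 10 passes, only a sentence or two remains
```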

Cornell Notes

Groq’s API performance is benchmarked primarily through token generation speed, with the biggest win appearing when using Mixtral 8x7B on Groq’s infrastructure. In a prompt-matching test, ChatGPT 3.5 Turbo reaches about 83.6 tokens per second, while local LM Studio runs land around 34 tokens per second (7B) and about 77 tokens per second (3B). Groq with Mixtral 8x7B reaches roughly 417 tokens per second, and the speed holds up in a multi-step “simplify” loop where text is repeatedly compressed for 10 iterations. The practical implication: Groq’s LPUs are optimized for fast inference, including sequential decoding patterns common in real-time and iterative workflows.

What hardware idea does Groq use to speed up LLM inference?

Groq’s Language Processing Units (LPUs) are designed for rapid inference in compute-heavy, sequential workloads like LLM decoding. The pitch centers on reducing inference bottlenecks tied to compute density and memory bandwidth. The transcript notes 230 MB of on-die SRAM per chip and up to 8 terabytes per second of memory bandwidth, and it emphasizes that LPUs target inference rather than training (so they aren’t positioned as a direct training competitor).

How was real-time speech-to-speech tested, and what did it show?

The setup uses Faster Whisper to transcribe microphone input, then sends the text to Groq via an API call, and converts the response back to audio using a local text-to-speech model from OpenVoice. A short “pirate” system instruction is added (Ali the pirate, no emojis, very short conversational replies). The interaction shows only small lag, suggesting the API can support near-real-time conversational loops rather than only offline generation.

How did Groq’s token throughput compare to ChatGPT 3.5 Turbo and local models?

With the same prompt, ChatGPT 3.5 Turbo is measured at about 83.6 tokens per second (760 tokens in 9 seconds). Local inference in LM Studio using Open Hermes Mistral 7B runs at about 34 tokens per second. A smaller local 3B model improves to roughly 77 tokens per second. When switching to Groq with Mixtral 8x7B, throughput jumps to about 417 tokens per second.

Why does Mixtral 8x7B matter in the results?

The transcript treats Mixtral 8x7B as the key configuration for the speed win. It reports extremely high throughput (about 417 tokens per second) and adds a quality expectation: Mixtral 8x7B is described as “pretty close” to ChatGPT 3.5 in capability, implying the speed gain isn’t coming from using a trivial model.

What does the chain-prompt test measure beyond raw single-shot speed?

It measures performance under repeated sequential generations. A loop runs a “simplify” function 10 times: each iteration takes the previous output and asks for further simplification, aiming to compress long text down to about one sentence. The transcript reports roughly under 1 second per loop on average, about 8 seconds total for all 10 iterations, ending with a much shorter output (down to two sentences).

Review Questions

  1. In the token-per-second comparison, what were the approximate throughputs for ChatGPT 3.5 Turbo, local LM Studio (7B), local LM Studio (3B), and Groq with Mixtral 8x7B?
  2. How does the speech-to-speech pipeline connect Faster Whisper, Groq API calls, and OpenVoice text-to-speech?
  3. What does the 10-iteration simplify loop demonstrate about Groq’s inference speed in multi-step workflows?

Key Points

  1. Groq’s LPUs are positioned as inference-focused hardware designed to reduce bottlenecks in sequential LLM decoding.
  2. A speech-to-speech test used Faster Whisper for transcription and OpenVoice for local text-to-speech, with Groq driving the text generation.
  3. ChatGPT 3.5 Turbo produced about 83.6 tokens per second in the same prompt test (760 tokens over 9 seconds).
  4. Local LM Studio inference ran at about 34 tokens per second on Open Hermes Mistral 7B and about 77 tokens per second on a 3B model.
  5. Groq with Mixtral 8x7B reached roughly 417 tokens per second, far above both the hosted baseline and local runs.
  6. A 10-iteration chain-prompt “simplify” loop compressed text down to two sentences in about 8 seconds total, averaging under 1 second per loop.

Highlights

Mixtral 8x7B on Groq hit about 417 tokens per second, roughly an order-of-magnitude jump over the local 7B run and a clear lead over ChatGPT 3.5 Turbo.
The speech-to-speech demo combined Faster Whisper transcription, Groq text generation, and OpenVoice speech output with only minor lag.
Iterative prompting stayed fast: 10 rounds of simplification finished in roughly 8 seconds total.
