DeepSeek R1 0528 - Better Coding & Tool Calling | Is It Faster Now?
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
DeepSeek R1 0528 adds JSON output and function calling, making it more practical for tool-using coding agents and structured workflows.
Briefing
DeepSeek R1 0528’s update centers on making the model more usable for real-world coding agents by adding support for JSON output and function calling, the capabilities that typically power tool use, structured responses, and agent workflows. The release also claims enhanced front-end behavior, fewer hallucinations, and improved benchmark performance, and the model is available through DeepSeek’s UI and API, with the weights posted to Hugging Face.
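As a concrete illustration, here is a minimal sketch of what function calling might look like against DeepSeek's OpenAI-compatible API. The endpoint URL and the deepseek-reasoner model id are taken from DeepSeek's public documentation; the run_tests tool is a hypothetical example, not something from the video.

```python
# Minimal sketch: function calling via DeepSeek's OpenAI-compatible API.
# Endpoint and model id are taken from DeepSeek's public docs; the
# `run_tests` tool below is a hypothetical example for illustration.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

# Describe a tool the coding agent may call.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, not from the video
        "description": "Run the project's test suite and return the results.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string",
                         "description": "Test file or directory to run."},
            },
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user",
               "content": "Fix the failing test in tests/test_todo.py"}],
    tools=tools,
)

# If the model decided to use the tool, the structured call arrives here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)  # arguments are JSON
```

JSON output is requested through the same interface by passing response_format={"type": "json_object"} to the create call (and mentioning JSON in the prompt, as OpenAI-compatible JSON modes generally require).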
For coding specifically, the update is framed as a meaningful step up. Early community chatter points to substantial gains on coding tasks, and the creator highlights that DeepSeek R1-style “thinking” models can be awkward in practice: they spend time generating reasoning tokens, which slows down inference compared with faster models where “thinking” can be disabled. That tradeoff matters for developers building interactive coding tools, where latency and responsiveness often decide whether an agent feels usable.
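To make that tradeoff concrete, a quick back-of-envelope calculation; the token counts and decode speed below are illustrative assumptions, not figures from the video.

```python
# Back-of-envelope: why reasoning tokens slow down interactive coding tools.
# All numbers are illustrative assumptions, not measurements from the video.
def response_latency(reasoning_tokens: int, answer_tokens: int,
                     tokens_per_second: float) -> float:
    """Seconds until the complete answer arrives at a given decode speed."""
    return (reasoning_tokens + answer_tokens) / tokens_per_second

# A thinking model emitting 1,500 reasoning tokens before a 500-token answer:
print(response_latency(1_500, 500, tokens_per_second=40.0))  # 50.0 s
# The same answer from a model with "thinking" disabled:
print(response_latency(0, 500, tokens_per_second=40.0))      # 12.5 s
```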
One benchmark singled out is LiveCodeBench, described as a holistic, contamination-free evaluation of large language model coding performance. In that leaderboard snapshot, OpenAI models such as o4-mini (including its medium reasoning-effort setting) and o3 appear near the top, while other commercial models such as Sonnet 4 and Opus 4 land lower. Against that backdrop, the updated DeepSeek R1 is reported to score extremely well, potentially even ahead of Gemini 2.5 Pro, though the transcript also cautions against placing full confidence in any single benchmark.
Under the hood, DeepSeek R1 0528 is presented as a “minor version upgrade” that nonetheless delivers a large practical change. The update reportedly deepens reasoning and improves inference by spending more computational resources (more GPUs) and applying algorithmic optimizations during post-training. The exact optimizations aren’t detailed, but the release notes mention system prompt support and note that a “think” tag no longer has to be forced at the start of the output to trigger the thinking pattern.
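In practice that simplifies the prompting flow. A sketch, reusing the client from the earlier example; reasoning_content is the field DeepSeek's API docs describe for exposing the reasoner model's chain-of-thought separately from the answer.

```python
# Sketch: with system prompt support, no "<think>" prefix has to be forced.
# Reuses the client from the earlier example; `reasoning_content` is the
# field DeepSeek's API docs describe for the reasoner's chain-of-thought.
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a debounce helper in JavaScript."},
    ],
)

message = response.choices[0].message
print(message.reasoning_content)  # the model's reasoning, exposed separately
print(message.content)            # the final answer, no <think> scaffolding
```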
A key detail for developers is the distilled variant: the only distilled version mentioned uses an 8B-parameter base model (Qwen3 8B). It is created by distilling chain-of-thought from the full R1 0528 model into that 8B base, with the claim that this pushes performance into the state-of-the-art range for the 8B tier.
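For readers who want to try the distilled variant locally, a standard Hugging Face transformers load might look like the following. The repo id deepseek-ai/DeepSeek-R1-0528-Qwen3-8B is taken from the release; the generation settings are generic placeholders, not tuned recommendations.

```python
# Sketch: loading the distilled 8B variant with Hugging Face transformers.
# The repo id is taken from the release; generation settings are generic
# placeholders, not tuned recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Reverse a linked list in Python."}],
    tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```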
In hands-on testing, the model is fed a large, detailed specification (roughly 4,000–5,000 tokens) for a to-do application and then asked to produce a single-file HTML/CSS/JavaScript landing page. The model spends roughly 23 seconds “thinking” to confirm and summarize the specification, then takes about 19 seconds to plan the landing page and another couple of minutes to output the full code on the free inference setup. The resulting page includes animations, hover effects, pricing sections, social links, and even an image; overall it is described as professional and “picture perfect” for the task, though the tester notes that the model’s verbosity (summarizing what was provided) can be a mixed blessing in interactive workflows.
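One way to reproduce that kind of timing is to stream the response and clock the thinking and answer phases separately. A sketch, reusing the client from above; it assumes DeepSeek's streaming format, where chunks carry reasoning_content during thinking and content once the answer starts, and todo_spec.md is a hypothetical file holding the 4,000–5,000-token specification.

```python
import time

# Sketch: clock the "thinking" phase vs. the answer phase while streaming.
# Reuses the client from above; streamed chunks carry `reasoning_content`
# while the model thinks and `content` once the final answer begins.
# `todo_spec.md` is a hypothetical file holding the large specification.
spec = open("todo_spec.md").read()

start = time.monotonic()
answer_started_at = None

stream = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user",
               "content": spec + "\n\nBuild a single-file HTML/CSS/JS landing page."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if answer_started_at is None and getattr(delta, "content", None):
        answer_started_at = time.monotonic()  # first non-reasoning token

end = time.monotonic()
if answer_started_at is not None:
    print(f"thinking: {answer_started_at - start:.1f}s, "
          f"answer: {end - answer_started_at:.1f}s")
```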
Overall, DeepSeek R1 0528 is positioned as a stronger open coding model with agent-ready structured output, but with practical latency considerations tied to its reasoning-heavy behavior—especially compared with faster models that can reduce or skip “thinking” tokens.
Cornell Notes
DeepSeek R1 0528 adds two agent-critical features: JSON output support and function calling. Those changes are meant to make the model easier to plug into tool-using coding agents and structured workflow systems. The update also claims better reasoning depth and inference performance through more compute and post-training optimizations, including system prompting support that removes the need for a forced “think” tag. Community and benchmark signals (notably LiveCodeBench) suggest strong coding gains, though benchmark trust is treated cautiously. In practical tests, the model handles large prompts and can generate a polished single-file HTML/CSS/JavaScript landing page, but it is slower due to visible “thinking” and reasoning token generation.
What new capabilities in DeepSeek R1 0528 matter most for building coding agents?
Why might a “thinking” model feel slower for coding tasks in real products?
Which benchmark is used to gauge coding performance, and what’s the caveat?
What changes are described for reasoning behavior and prompting?
How is the distilled 8B model constructed, and why does it matter?
What did the hands-on test reveal about context handling and output quality?
Review Questions
- How do JSON output and function calling change what an LLM can do inside an agentic coding workflow?
- What tradeoff does the transcript describe between reasoning-heavy models and faster coding models in terms of latency?
- Why might system prompting support and removal of a required “think” tag affect how developers integrate DeepSeek R1 into their pipelines?
Key Points
1. DeepSeek R1 0528 adds JSON output and function calling, making it more practical for tool-using coding agents and structured workflows.
2. The update claims improved reasoning depth and inference performance through more GPU compute and post-training algorithmic optimizations.
3. System prompt support is included, and a required “think” tag at the start of the output is no longer needed to force the thinking pattern.
4. Coding performance signals are strong on LiveCodeBench, but the transcript treats any single leaderboard as not fully trustworthy on its own.
5. “Thinking” models can feel slower for coding because they generate extra reasoning tokens; faster models may deliver better interactive latency.
6. A distilled 8B variant is available, built by distilling chain-of-thought from the full R1 0528 model into an 8B base (Qwen3 8B) to target strong performance at a smaller size.
7. In a large-prompt test, the model generated a polished single-file landing page, but the free inference run took minutes, reflecting reasoning overhead.