
Only 40 lines of code

The PrimeTime · 4 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenJDK’s thread user-time retrieval was dramatically slowed by a `/proc`-based workflow that combined file I/O with string parsing and numeric conversions.

Briefing

A small change in OpenJDK—switching how thread “user time” is retrieved—wiped out a long-standing 400x performance gap, cutting the cost of the operation from roughly 11 microseconds down to about 279 nanoseconds. The fix traces back to an older JDK issue (JDK-8210452) noting that getting current thread user time via one path was dramatically slower than using the CPU-time route. What made the improvement notable wasn’t just the result; it was the scale of hidden work that the slower approach performed.

The discarded implementation relied on Linux-style `/proc` data. It opened a `/proc` file for the thread/process, read a large chunk into a 2K buffer, then performed string manipulation to locate the relevant section of the text. After finding the last parenthesis, it skipped whitespace, read a fixed set of numeric fields (13 numbers), and then converted those stringified values back into numbers in user space to extract the specific timing fields needed for “user time.” The transcript emphasizes that `/proc` entries are generated on demand, so this path effectively turns a timing query into file I/O plus parsing overhead—exactly the kind of work that can balloon under measurement.

The replacement approach avoids `/proc` entirely. Instead of opening and parsing text, it adjusts internal clock identifiers (via bit-level manipulation of a clock ID) and then directly fetches the thread time through the appropriate clock mechanism. The code may still look dense—low-level systems code often does—but the key difference is eliminating the “file land” detour and the repeated string/scanf parsing cycle.

Flame graphs make the performance story concrete. In the before state, the largest portions of time cluster around file operations: opening and closing files dominate, while the actual numeric parsing work (including `scanf`) appears as a small fraction of total time. The graph suggests most time is spent shuffling through `/proc`-related activity and synchronization primitives (including fast user-space mutexes and futex-related behavior), not on the arithmetic needed to compute the timing value. After the change, the flame graph collapses: only a thin trace remains, with no broad swaths of file-handling frames.

Benchmarks reported alongside the change quantify the win: the operation drops from about 11 microseconds to 279 nanoseconds. The takeaway is less about micro-optimizing a single call and more about how measurement can expose “obvious” bottlenecks that were never obvious—like the cost of turning a timing query into text parsing. For engineers, it’s a reminder that performance problems can hide behind convenience layers, and that careful profiling can justify surprisingly small diffs with outsized impact.

Cornell Notes

OpenJDK improved thread “user time” retrieval by replacing a slow `/proc`-based parsing path with a direct clock-based mechanism. The old method opened a `/proc` file, read a 2K buffer, used string scanning to find the right section, then parsed and converted numeric fields back into values—turning a timing query into file I/O and heavy text processing. Flame graphs showed the time was dominated by file open/close activity, while actual parsing work (e.g., `scanf`) was a small slice. Benchmarks reported the change cut latency from about 11 microseconds to about 279 nanoseconds, eliminating a long-standing ~400x gap tied to JDK-8210452. The result highlights how profiling can reveal hidden costs behind seemingly straightforward system calls.

Why was retrieving thread user time so slow in the older approach?

The older implementation pulled timing data from `/proc`, which is generated on demand. It opened a `/proc` file, read roughly 2K bytes into a buffer, searched within the text (including locating the last parenthesis), skipped whitespace, then read a fixed set of numeric fields (13 numbers). Those numbers were initially stringified by the kernel’s `/proc` output and then had to be parsed and converted back into numeric values in user space to extract the needed timing fields.

What did the flame graph reveal about where time was actually going?

The flame graph concentrated the largest blocks around file handling: opening and closing the `/proc` file consumed the majority of the time (with open/close frames dominating the top-level share). By contrast, the numeric parsing work—such as `scanf`—occupied only a small percentage (the transcript cites about 3.9% for the `scanf` portion). The rest of the time appeared in synchronization and low-level runtime activity associated with the file-based path, including futex-related behavior and fast user-space mutexes.

How did the new implementation avoid the bottlenecks?

Instead of reading and parsing `/proc` text, the new code adjusts internal clock identifiers (including flipping bits in a clock ID) and then directly fetches the thread time using the clock mechanism. That removes the file I/O and the string-to-number parsing loop, replacing it with a direct timing call.

What magnitude of improvement was measured, and how was it expressed?

Benchmark results reported the operation dropping from about 11 microseconds to about 279 nanoseconds. That corresponds to eliminating a long-standing ~400x performance gap associated with JDK-8210452, where getting current thread user time was far slower than getting current thread CPU time.

What does the change set size suggest about the nature of the fix?

The diff, described as “40 lines of code,” included 96 insertions and 54 deletions, plus a 55-line JMH benchmark addition. The transcript notes that production code was reduced overall, implying the fix wasn’t just additive—it streamlined the implementation while adding measurement coverage to validate the performance impact.

Review Questions

  1. In the older `/proc`-based approach, which steps (file operations, string scanning, numeric parsing) contribute most to total time according to the flame graph?
  2. What specific design change—data source and retrieval method—distinguishes the new implementation from the old one?
  3. How do the benchmark numbers (11 microseconds vs 279 nanoseconds) relate to the previously reported ~400x gap in JDK-8210452?

Key Points

  1. OpenJDK’s thread user-time retrieval was dramatically slowed by a `/proc`-based workflow that combined file I/O with string parsing and numeric conversions.

  2. Flame graphs showed the dominant cost came from `/proc` file open/close activity, not from the arithmetic or even the parsing itself.

  3. The fix removed `/proc` parsing by switching to a direct clock-based mechanism, using clock ID adjustments to fetch thread time efficiently.

  4. Benchmarking reported a drop from ~11 microseconds to ~279 nanoseconds, effectively erasing a long-standing ~400x gap tied to JDK-8210452.

  5. A relatively small code change can yield outsized performance gains when profiling exposes hidden work behind convenience layers.

  6. Adding a focused benchmark (JMH) helps confirm that the optimization is real, measurable, and not just a theoretical improvement.

Highlights

A 400x thread user-time slowdown traced back to `/proc` text parsing: opening, reading, scanning, parsing, and converting turned a timing query into heavy work.
Flame graphs made the culprit obvious: file open/close dominated, while `scanf` parsing was only a tiny fraction of the total time.
The replacement approach fetched thread time directly via clock ID manipulation, collapsing the flame graph into a thin trace.
Measured results dropped from ~11 microseconds to ~279 nanoseconds, validating the optimization with concrete numbers.

Topics

  • OpenJDK Performance
  • Thread Timing
  • Flame Graphs
  • /proc Parsing
  • Benchmarking