Only 40 lines of code
Based on ThePrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
A small change in OpenJDK, switching how thread "user time" is retrieved, wiped out a long-standing ~40x performance gap, cutting the cost of the operation from roughly 11 microseconds down to about 279 nanoseconds. The fix traces back to an older JDK issue (JDK-8210452) noting that getting the current thread's user time via one path was dramatically slower than using the CPU-time route. What made the improvement notable wasn't just the result; it was the scale of hidden work that the slower approach performed.
The discarded implementation relied on the Linux `/proc` filesystem. It opened the thread's `/proc` stat file, read a large chunk into a 2K buffer, then performed string manipulation to locate the relevant section of the text. After finding the last parenthesis, it skipped whitespace, scanned a fixed run of 13 fields, and converted those stringified values back into numbers in user space to extract the specific timing fields needed for "user time." The transcript emphasizes that `/proc` entries are generated on demand, so this path effectively turns a timing query into file I/O plus parsing overhead, exactly the kind of work that balloons under repeated measurement.
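To make the hidden work concrete, here is a minimal C++ sketch of what such a `/proc` path looks like. The function name and error handling are illustrative rather than the actual HotSpot code, but the shape (open, 2K read, `strrchr` for the last parenthesis, then a long `sscanf`) follows the steps described above.

```cpp
#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <sys/syscall.h>

// Illustrative /proc-based thread user-time query (sketch, not HotSpot source).
// Returns user time in clock ticks, or -1 on failure.
long proc_thread_user_time_ticks() {
  // Each thread has its own stat file under /proc/self/task/<tid>/stat,
  // and the kernel generates its contents on demand.
  char path[64];
  snprintf(path, sizeof(path), "/proc/self/task/%ld/stat",
           (long) syscall(SYS_gettid));

  FILE* fp = fopen(path, "r");            // file I/O just to read a timer
  if (fp == nullptr) return -1;

  char buf[2048];                         // the 2K buffer from the description
  size_t len = fread(buf, 1, sizeof(buf) - 1, fp);
  fclose(fp);
  if (len == 0) return -1;
  buf[len] = '\0';

  // The comm field can contain spaces and parentheses, so locate the LAST
  // ')' and parse the fixed fields that follow it.
  char* s = strrchr(buf, ')');
  if (s == nullptr) return -1;
  s++;

  // Fields after comm: state, ppid, pgrp, session, tty_nr, tpgid, flags,
  // minflt, cminflt, majflt, cmajflt, utime, stime (13 values scanned).
  char state;
  long ppid, pgrp, session, tty, tpgid;
  unsigned long flags, minflt, cminflt, majflt, cmajflt, utime, stime;
  if (sscanf(s, " %c %ld %ld %ld %ld %ld %lu %lu %lu %lu %lu %lu %lu",
             &state, &ppid, &pgrp, &session, &tty, &tpgid, &flags,
             &minflt, &cminflt, &majflt, &cmajflt, &utime, &stime) != 13) {
    return -1;
  }
  return (long) utime;  // caller still scales by sysconf(_SC_CLK_TCK)
}
```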
The replacement approach avoids `/proc` entirely. Instead of opening and parsing text, it adjusts internal clock identifiers (via bit-level manipulation of a clock ID) and then directly fetches the thread time through the appropriate clock mechanism. The code may still look dense—low-level systems code often does—but the key difference is eliminating the “file land” detour and the repeated string/scanf parsing cycle.
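For contrast, here is a minimal sketch of the direct route, assuming Linux's CPU-clock encoding; the constant and helper names below are hypothetical stand-ins, not the OpenJDK patch itself.

```cpp
#include <ctime>
#include <unistd.h>
#include <sys/syscall.h>

// Linux encodes CPU-time clocks inside a clockid_t (mirroring the kernel's
// posix-timers macros): the low two bits pick the clock type (0 = PROF:
// user+system, 1 = VIRT: user only, 2 = SCHED), bit 2 marks a per-thread
// clock, and the remaining bits hold the bitwise-NOT of the thread ID.
static const clockid_t CPUCLOCK_VIRT_TYPE = 1;       // user time only
static const clockid_t CPUCLOCK_PERTHREAD_FLAG = 4;

static clockid_t thread_user_clockid(pid_t tid) {
  return ((~(clockid_t) tid) << 3) | CPUCLOCK_PERTHREAD_FLAG | CPUCLOCK_VIRT_TYPE;
}

// One clock_gettime call: no file open/close, no buffer, no text parsing.
long thread_user_time_ns() {
  struct timespec ts;
  clockid_t clk = thread_user_clockid((pid_t) syscall(SYS_gettid));
  if (clock_gettime(clk, &ts) != 0) return -1;
  return ts.tv_sec * 1000000000L + ts.tv_nsec;
}
```

Notably, the portable `pthread_getcpuclockid` API only hands back a thread's total-CPU-time clock; there is no named constant for the user-only variant, which plausibly explains why the change works through bit-level adjustment of the clock ID.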
Flame graphs make the performance story concrete. In the "before" state, the largest portions of time cluster around file operations: opening and closing files dominate, while the actual numeric parsing work (including `scanf`) appears as a small fraction of total time. The graph suggests most time is spent shuffling through `/proc`-related activity and synchronization primitives (futexes, the kernel's fast user-space mutexes), not on the arithmetic needed to compute the timing value. After the change, the flame graph collapses: only a thin trace remains, with no broad swaths of file-handling frames.
Benchmarks reported alongside the change quantify the win: the operation drops from about 11 microseconds to roughly 279 nanoseconds, a ~40x improvement. The takeaway is less about micro-optimizing a single call and more about how measurement can expose "obvious" bottlenecks that were never obvious, like the cost of turning a timing query into text parsing. For engineers, it's a reminder that performance problems can hide behind convenience layers, and that careful profiling can justify surprisingly small diffs with outsized impact.
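As a rough illustration of how such numbers can be gathered (the figures above come from the benchmarks shipped with the change, not from this sketch), a standalone timing loop over the two sketches might look like:

```cpp
#include <cstdio>
#include <ctime>

// Crude latency comparison; assumes the two sketch functions above are in
// scope. A production harness (e.g. JMH for the Java-level API) would add
// warmup iterations, forks, and statistical reporting.
static long now_ns() {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

template <typename F>
static void bench(const char* name, F f, int iters) {
  volatile long sink = 0;  // keep the compiler from eliding the calls
  long start = now_ns();
  for (int i = 0; i < iters; i++) sink += f();
  printf("%-26s %9.1f ns/op\n", name, (double)(now_ns() - start) / iters);
}

int main() {
  bench("/proc parse (old style)", proc_thread_user_time_ticks, 10000);
  bench("clock_gettime (new style)", thread_user_time_ns, 100000);
  return 0;
}
```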
Cornell Notes
OpenJDK improved thread "user time" retrieval by replacing a slow `/proc`-based parsing path with a direct clock-based mechanism. The old method opened a `/proc` file, read a 2K buffer, used string scanning to find the right section, then parsed and converted numeric fields back into values, turning a timing query into file I/O and heavy text processing. Flame graphs showed the time was dominated by file open/close activity, while actual parsing work (e.g., `scanf`) was a small slice. Benchmarks reported the change cut latency from about 11 microseconds to about 279 nanoseconds, eliminating the long-standing ~40x gap tracked in JDK-8210452. The result highlights how profiling can reveal hidden costs behind seemingly straightforward system calls.
- Why was retrieving thread user time so slow in the older approach?
- What did the flame graph reveal about where time was actually going?
- How did the new implementation avoid the bottlenecks?
- What magnitude of improvement was measured, and how was it expressed?
- What does the change-set size suggest about the nature of the fix?
Review Questions
- In the older `/proc`-based approach, which steps (file operations, string scanning, numeric parsing) contribute most to total time according to the flame graph?
- What specific design change—data source and retrieval method—distinguishes the new implementation from the old one?
- How do the benchmark numbers (11 microseconds vs. 279 nanoseconds) relate to the ~40x gap previously reported in JDK-8210452?
Key Points
1. OpenJDK's thread user-time retrieval was dramatically slowed by a `/proc`-based workflow that combined file I/O with string parsing and numeric conversions.
2. Flame graphs showed the dominant cost came from `/proc` file open/close activity, not from the arithmetic or even the parsing itself.
3. The fix removed `/proc` parsing by switching to a direct clock-based mechanism, using clock-ID adjustments to fetch thread time efficiently.
4. Benchmarking reported a drop from ~11 microseconds to ~279 nanoseconds, effectively erasing a long-standing ~40x gap tied to JDK-8210452.
5. A relatively small code change can yield outsized performance gains when profiling exposes hidden work behind convenience layers.
6. Adding a focused benchmark (JMH, the Java Microbenchmark Harness) helps confirm that the optimization is real and measurable, not just a theoretical improvement.