Building Empathy-Driven Developer Documentation - Kat King - Write the Docs Portland 2018

Write the Docs · 6 min read

Based on the Write the Docs video on YouTube. If you find this content useful, support the original creators by watching, liking, and subscribing.

TL;DR

Treat documentation quality as a developer outcome, not a traffic outcome, by collecting direct satisfaction signals from users.

Briefing

Twilio’s developer education team built “empathy-driven documentation” by treating developer frustration as a measurable, research-backed signal—not an afterthought. The core shift: documentation quality couldn’t be judged by traffic or signups alone, because those metrics don’t reveal whether developers feel unstuck, confident, and able to complete real tasks. The team’s answer was to combine lightweight satisfaction scoring with deeper user research and continuous iteration, then wire the results directly into the documentation workflow.

The journey began with a reality check from real-time public feedback. Team members monitored Twitter for comments about the docs—compliments were nice, but complaints about duplicate pages, outdated content, and missing information were a “canary in a coal mine.” That prompted a bigger internal question: what does “good documentation” actually mean, and who exactly is it serving? The team realized it didn’t fully understand its audience across personas (from CTOs to junior developers) or how different content types were being used.

To measure quality beyond vanity metrics, the team launched a pop-up widget across documentation pages asking a single yes/no question: “Did this content serve your needs?” Over a couple of months, the widget produced quantitative “percent happy” data and helped pinpoint weak areas by content type—voice docs lagged, while error code documentation was “abysmal.” When users answered “no,” the team offered short follow-up calls, turning negative ratings into qualitative insight. Those conversations surfaced recurring pain points and also forced the team to confront an emotional truth: documentation work can feel personal when developers describe it as confusing or incomplete.
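
As an illustration of how yes/no responses could roll up into a per-category “percent happy” figure, here is a minimal Python sketch. The record schema, content-type names, and numbers are invented for the example; the talk does not describe Twilio’s actual data model.

```python
from collections import defaultdict

# Hypothetical feedback records from the yes/no pop-up.
# Each record: (content_type, served_needs).
responses = [
    ("quickstart", True), ("quickstart", True), ("quickstart", False),
    ("voice", True), ("voice", False), ("voice", False),
    ("error-codes", False), ("error-codes", False), ("error-codes", True),
]

def percent_happy(records):
    """Compute the share of 'yes' answers per content type."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for content_type, served in records:
        total[content_type] += 1
        yes[content_type] += served  # True counts as 1
    return {ct: 100 * yes[ct] / total[ct] for ct in total}

# Print weakest categories first, mirroring how the team spotted
# lagging areas like voice docs and error codes.
for content_type, pct in sorted(percent_happy(responses).items(),
                                key=lambda kv: kv[1]):
    print(f"{content_type}: {pct:.0f}% happy")
```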

That empathy work deepened through one-on-one user research in September, where a researcher observed nine developers from different personas completing tasks or working through real scenarios. Watching people navigate the docs revealed a major information architecture mismatch. The docs were organized around content types (API references, tutorials, walkthroughs), but developers behaved task-first. With more than 20 products and thousands of URLs and code samples, the “tutorial vs. reference” framing didn’t help people find what they needed—developers kept asking practical questions like how to send a text message or make a phone call.

In October, the team upgraded its measurement system from yes/no to a custom ratings widget that allowed star ratings per page. Pages rated below four stars triggered prompts for more detail, and the feedback flowed into a dashboard and a Slack integration so contributors could act quickly. The team initially worried the tooling would create noise, but it instead helped surface trends like “can’t find what I’m looking for” and “not enough examples,” while also catching bugs and outdated guidance.
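
The talk describes what this pipeline did, not how it was built. As a rough sketch of the below-four-stars trigger, a handler could forward low ratings to Slack’s standard incoming-webhook API; the webhook URL, the `handle_rating` helper, and the message format below are all assumptions for illustration.

```python
import json
import urllib.request

# Placeholder; a real Slack incoming-webhook URL would go here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def handle_rating(page_url: str, stars: int, comment: str = "") -> None:
    """Route one star rating: anything below four stars triggers a
    Slack alert so contributors can follow up while context is fresh."""
    if stars >= 4:
        return  # high ratings just accumulate in the dashboard
    text = f":warning: {page_url} rated {stars}/5"
    if comment:
        text += f' with comment: "{comment}"'
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

handle_rating("/docs/voice/quickstart/python", 2, "code sample is outdated")
```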

Iteration then became the operating model: investigate, test, measure, and only then build. A/B tests helped kill ineffective ideas—like adding a prominent “sign up now” button to QuickStart pages, which didn’t increase registrations and annoyed users. For cases that couldn’t be tested cheaply, the team rewrote content based on research. One Python QuickStart rose from 3.67 to 4.75 stars after being rebuilt with clearer steps and up-to-date code samples. Alongside the quality metric, business indicators improved too: signups rose, bounce rate fell, and sessions increased.

The team’s concluding lessons were blunt: accurate code samples are the most important feature, content-type labels matter less than task success, and docs must support “information foraging” by giving developers clear signposts once they land from Google. The overall thesis was practical rather than sentimental: build mechanisms for feedback and research, iterate in small steps, and let empathy guide what gets fixed first—because that’s the only path to documentation that reliably helps developers succeed.

Cornell Notes

Twilio’s documentation team shifted from measuring success with traffic and signups to measuring whether developers feel able to complete tasks. They started with a yes/no satisfaction widget, then followed up with short calls when users reported pages didn’t meet their needs. User research with nine developers showed that organizing docs primarily by content type (tutorial vs. API reference) didn’t match how people actually behave; developers are task-oriented and want the fastest path to “send a text,” “make a call,” or “find the right error code.” The team then built a star-rating system with dashboards and a Slack integration for pages rated below four stars, enabling continuous fixes. Iterative testing and targeted rewrites improved developer ratings and also moved business metrics in the right direction.

Why did traffic and signup metrics fail to capture documentation quality?

Those metrics can show whether people arrive and register, but they don’t reveal whether developers feel confident or stuck once they reach a page. A new developer or a CTO exploring Twilio for a decision might generate signups and page views even if the docs don’t help them complete real tasks. The team needed a way to measure “delighted vs. frustrated” directly, not just engagement at the top of the funnel.

How did the team turn developer feedback into actionable signals?

They launched a lightweight pop-up widget asking one yes/no question: “Did this content serve your needs?” For “no” responses, they offered a 15-minute follow-up call to learn what specifically went wrong. Later, they replaced the binary approach with a custom star-rating widget; pages rated below four stars triggered prompts for more detail and automatically fed into a dashboard and a Slack integration so contributors could address issues quickly.

What did user testing reveal about the docs’ information architecture?

The docs were organized around content types—API references, tutorials, walkthroughs—based on the assumption that developers choose entry points by label. Watching nine developers complete tasks showed a different reality: people didn’t care what something was called. They wanted to accomplish specific outcomes (e.g., sending an SMS, placing a voice call, obtaining a phone number, using Twilio features), so the content-type distinctions weren’t guiding them to success.
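
A task-first architecture can be pictured as a mapping from developer goals to landing pages rather than from content-type labels. The `TASK_NAV` table and paths below are hypothetical, not Twilio’s real URL scheme.

```python
# Hypothetical task-first navigation map: entries are keyed by what a
# developer is trying to do, not by content type (tutorial vs. reference).
TASK_NAV = {
    "send a text message": "/docs/sms/send-message",
    "make a phone call": "/docs/voice/make-call",
    "get a phone number": "/docs/phone-numbers/purchase",
    "look up an error code": "/docs/api/errors",
}

def route(task_query: str) -> str:
    """Return the landing page for a task, falling back to search."""
    return TASK_NAV.get(task_query.lower(), f"/search?q={task_query}")

print(route("Send a text message"))  # -> /docs/sms/send-message
```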

How did the team use experimentation to avoid building unwanted changes?

They applied iterative development and A/B testing when possible. One example: a hypothesis that a big red “sign up now” button on QuickStart pages would increase registrations. The team tested it by routing traffic to versions with and without the button; registrations didn’t improve, and users found the button annoying. That result prevented a broader rollout of a change that didn’t serve developers.
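
The talk reports the outcome of this test but not the raw numbers. A standard way to judge such an experiment is a two-proportion z-test on signup conversions, sketched here with invented counts:

```python
from math import erfc, sqrt

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value
    return z, p_value

# Illustrative numbers only; the talk did not publish the counts.
z, p = two_proportion_ztest(conv_a=120, n_a=5000,   # with the button
                            conv_b=118, n_b=5000)   # without it
print(f"z = {z:.2f}, p = {p:.3f}")
```

With conversion rates this close, the p-value stays large: no evidence of a registration lift, which matches the team’s conclusion that the button only added annoyance.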

What evidence showed that rewriting docs based on research improved outcomes?

A Python QuickStart page was rebuilt from scratch after user research identified multiple issues (too many options, unclear prerequisites, outdated or confusing structure). Before the revision, the page averaged 3.67 stars. After publishing the revised QuickStart mid-month, the average rating rose to 4.75 out of 5. The team also reported improvements in Google Analytics metrics—bounce rate down, sessions up, and signups up—suggesting the developer-quality focus aligned with business results.

What three operational lessons guided ongoing doc improvements?

First, accurate code samples are the most important feature, even though maintaining thousands of examples is hard. Second, content-type labels matter less than task success because developers arrive with specific goals. Third, docs must support “information foraging”: developers often start with Google and then scan quickly to find the right next step, so signposts and navigational clarity determine whether they can succeed once they land.

Review Questions

  1. What measurement approach did the team use to assess documentation quality beyond traffic and signups, and why was it necessary?
  2. How did user research change the team’s view of content organization (content types vs. task-oriented navigation)?
  3. Describe one example of experimentation (A/B test or rewrite) and the metric(s) used to judge success.

Key Points

  1. Treat documentation quality as a developer outcome, not a traffic outcome, by collecting direct satisfaction signals from users.
  2. Use lightweight feedback (yes/no) to start, then evolve to richer metrics (star ratings) that can trigger follow-up and prioritization.
  3. Watch how developers actually work through tasks; don’t rely on assumptions about how they choose between tutorials, references, and walkthroughs.
  4. Wire feedback into the workflow—dashboards and Slack alerts for low-rated pages—so contributors can act quickly and consistently.
  5. Apply iterative development: test hypotheses with A/B tests when feasible, and rewrite content when research indicates deeper structural problems.
  6. Focus on accurate, up-to-date code samples as the highest-impact documentation feature, even under maintenance constraints.
  7. Design docs for “information foraging” by providing clear signposts that help developers move from Google landing to task completion.

Highlights

A yes/no “did this page serve your needs?” widget exposed which doc categories were failing—voice docs lagged and error code documentation was especially poor.
Nine observed developers showed that content-type labels didn’t drive behavior; developers were task-first and wanted direct paths to outcomes like sending messages or resolving errors.
A custom star-rating system fed low-rated pages into dashboards and Slack, turning frustration into a prioritized backlog rather than anonymous complaints.
An A/B test on a prominent “sign up now” button found no registration lift and added annoyance, preventing a costly rollout.
Rewriting one Python QuickStart based on user research lifted its average rating from 3.67 to 4.75 stars and coincided with improved analytics.
