Lab 08: Monitoring (FSDL 2022)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Model monitoring for a production text recognizer has to go beyond infrastructure health checks and into “behavioral” signals—whether the system’s outputs are actually correct and whether they could upset users. The lab builds that feedback loop by adding Gradio-based flagging controls (“incorrect,” “offensive,” “other”), then routing those user judgments—along with the input image and model output—into Gantry for centralized logging and analysis. Cloud metrics like EC2 and Lambda health in CloudWatch can catch outages, but they won’t alert when a model confidently returns garbage text; user feedback and model-specific metrics are what fill that gap.
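For concreteness, here is a minimal sketch of what enabling that flagging looks like, assuming a Gradio 3-style `Interface` API; the `predict_text` function is a placeholder, not the lab's actual model wrapper:

```python
# Minimal sketch: a Gradio interface with manual flagging and the three flag
# options from the lab. `predict_text` is a stand-in for the real recognizer.
import gradio as gr

def predict_text(image):
    # Placeholder: the real app would run the text recognizer on `image`.
    return "predicted text goes here"

demo = gr.Interface(
    fn=predict_text,
    inputs=gr.Image(type="pil", label="Handwritten text"),
    outputs=gr.Textbox(label="Recognized text"),
    allow_flagging="manual",  # renders a Flag button next to the output
    flagging_options=["incorrect", "offensive", "other"],
)

if __name__ == "__main__":
    demo.launch()
```

With no custom callback, Gradio's default behavior writes each flag to a local CSV and saves the uploaded image alongside it, which is the setup the lab starts from and then replaces.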
The workflow starts with a small UI change: enabling Gradio’s flagging mechanism so users can label problematic predictions. When a user flags an example, the system initially stores a local CSV containing references to the uploaded image plus metadata about the prediction and the chosen flag. Because binary data is stored separately, the CSV keeps pointers to local files, and analysis requires reloading those images when needed. The lab then argues that local logging doesn’t scale well for production, so it replaces the local approach with a Gantry logger callback. That callback follows a two-step pattern: upload the image to S3, then send Gantry the S3 URL plus structured metadata (model output text, user-selected flag, and other context). From there, the notebook can pull logged records back into pandas for exploratory analysis, while Gantry’s UI supports the same investigation interactively.
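A hedged sketch of that two-step pattern is below; the bucket name, application name, and the exact Gantry calls (`gantry.init`, `gantry.log_record`) are assumptions for illustration rather than verbatim from the lab:

```python
# Sketch of the two-step logging pattern: (1) upload the flagged image to S3,
# (2) log the S3 URL plus structured metadata to Gantry. Names marked below
# are hypothetical.
import os
import uuid

import boto3
import gantry  # assumed Gantry SDK import; check the current client docs

BUCKET = "text-recognizer-feedback"   # hypothetical bucket name
APP_NAME = "text-recognizer"          # hypothetical Gantry application name

gantry.init(api_key=os.environ["GANTRY_API_KEY"])  # assumed init call
s3 = boto3.client("s3")

def log_flagged_example(image_path: str, predicted_text: str, flag: str) -> None:
    # Step 1: push the binary image to S3 and build a URL that points to it.
    key = f"flagged/{uuid.uuid4()}.png"
    s3.upload_file(image_path, BUCKET, key)
    image_url = f"https://{BUCKET}.s3.amazonaws.com/{key}"

    # Step 2: send Gantry only the URL plus structured metadata.
    gantry.log_record(
        APP_NAME,
        inputs={"image_url": image_url},
        outputs={"pred_text": predicted_text},
        feedback={"flag": flag},
    )
```

In the lab, this logic lives inside a custom Gradio flagging callback, so it runs automatically whenever a user clicks one of the flag buttons.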
In Gantry, the monitoring dashboard highlights how many feedback events arrive over time and which categories dominate. In the lab’s production window, most flags fall under “incorrect,” which is treated as a healthier sign than “offensive” or “other.” To catch harmful content automatically, the lab adds “detoxify” projections—scores for obscenity and related risks—computed on logged outputs. When an obscenity score bumps upward, the key question becomes whether it reflects real model behavior or just user-provided inputs containing swear words. With enough data, the lab emphasizes distribution comparisons: production values are compared against a trusted baseline from the model’s validation/testing sets. In this case, obscenity values are higher in testing than in production, reducing concern that the model is generating worse content than expected.
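The obscenity projection can be approximated locally with the open-source detoxify package, which is handy for reproducing the production-versus-baseline comparison outside Gantry; the example texts below are placeholders:

```python
# Rough sketch of the obscenity projection: score output text with detoxify,
# then compare production scores against a trusted validation/test baseline.
import pandas as pd
from detoxify import Detoxify

detox = Detoxify("original")  # downloads a pretrained toxicity/obscenity model

def obscenity_scores(texts):
    """Return the 'obscene' score for each output string."""
    preds = detox.predict(list(texts))
    return pd.Series(preds["obscene"])

# Placeholder stand-ins for logged production outputs and test-set outputs.
prod_texts = ["meet me at noon", "asdf asdf asdf asdf"]
test_texts = ["the quick brown fox", "a letter to my friend"]

prod_scores = obscenity_scores(prod_texts)
test_scores = obscenity_scores(test_texts)

# Compare the two distributions; a production shift above the baseline is
# what would warrant a closer look.
print(prod_scores.describe())
print(test_scores.describe())
```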
The monitoring system also supports debugging, not just alerting. By projecting output-text entropy, the lab detects a concerning shift: production shows more low-entropy outputs, consistent with repetition or degenerate generation. Filtering to those low-entropy cases reveals gibberish repetition, prompting a root-cause check. The lab then compares production inputs to the original test data and finds a preprocessing mismatch: training data was inverted (to stabilize grayscale training), but production preprocessing didn’t apply the same inversion consistently. The lab concludes that fixing such issues often requires changing the model and training data pipeline to handle real-world input variability—like different background colors and contrast—rather than relying on brittle preprocessing assumptions.
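The entropy projection itself is simple to reproduce; here is a minimal character-level version (the lab computes it on logged output text, and its exact formulation may differ):

```python
# Minimal character-level entropy, the intuition behind the output-entropy
# projection: degenerate, repetitive outputs score near zero while normal
# text scores noticeably higher.
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(char_entropy("the quick brown fox jumps"))  # normal text: several bits
print(char_entropy("ererererererererererer"))     # 1.0 bit: repetition
print(char_entropy("aaaaaaaaaaaaaaaa"))           # 0.0 bits: degenerate
```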
Finally, the lab turns the logged failures into a product roadmap. Common user-reported failure modes include inability to recognize printed text (the model was trained on handwriting), difficulty with complex spatial layouts (diagrams rather than paragraphs of text), lack of support for characters outside the ASCII set, and poor performance on heterogeneous backgrounds. The overarching lesson is that model quality in real products is determined less by infrastructure tweaks and more by data: richer logging enables discovering unknown unknowns, and those discoveries translate into new data collection, synthesis, and retraining strategies.
Cornell Notes
The lab builds a production monitoring loop for a text recognizer by collecting user feedback on model outputs and analyzing it with model-specific metrics. Gradio flagging (“incorrect,” “offensive,” “other”) captures behavioral failures that infrastructure monitoring (like CloudWatch) would miss. A Gantry logger sends images to S3 and then logs the S3 URL plus prediction text and user flags to Gantry, enabling dashboards, projections, and distribution comparisons against validation/testing baselines. Projections such as detoxify obscenity scores and output-text entropy help detect both harmful content risks and degenerate generation patterns. Debugging low-entropy, repetitive outputs led to a preprocessing mismatch (image inversion) and highlighted the need to train for real-world input variability.
Why can’t standard system monitoring (health checks, latency, instance metrics) replace model monitoring for a text recognizer?
How does the lab collect and store user feedback on predictions, and why does it separate binary images from metadata?
What does Gantry add beyond basic logging, and how do projections help?
How does the lab decide whether an obscenity-score bump is a real model problem?
What debugging signal led to a deeper investigation in the lab, and what did it reveal?
What product-level failure modes did users expose, and how does the lab propose addressing them?
Review Questions
- Which specific monitoring signals in the lab are designed to catch “silent” model failures that wouldn’t trigger infrastructure alerts?
- How does distribution comparison against validation/testing data reduce false alarms when using metrics like detoxify obscenity scores?
- Why can a preprocessing mismatch (such as image inversion) produce low-entropy, repetitive outputs, and what kinds of training changes would address it?
Key Points
1. Add user-facing flagging for model outputs because infrastructure monitoring won’t detect incorrect or harmful predictions that don’t crash services.
2. Route feedback plus input/output data into a centralized system (Gantry) instead of relying on local CSV files that don’t scale.
3. Use a two-step logging pattern: upload binary images to S3, then log S3 URLs with structured metadata (prediction text and user flags).
4. Apply model-specific projections (e.g., detoxify obscenity scores, output entropy) to detect both safety risks and degenerate generation patterns.
5. Compare production metric distributions to trusted validation/testing baselines to interpret bumps and avoid overreacting to noise.
6. When debugging, treat low-entropy/repetition signals as prompts to check input distribution shifts and preprocessing consistency, not just model weights.
7. Translate recurring user-reported failure modes into targeted data collection or synthesis (printed text, complex layouts, expanded character sets, varied backgrounds).