
Lab 8: Testing and Continuous Integration (Full Stack Deep Learning - Spring 2021)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Linting is implemented as a multi-tool static-analysis pipeline (Safety, PyLint, PyCodeStyle, PyDocStyle, MyPy, Bandit, ShellCheck) plus Black formatting to enforce consistent code quality.

Briefing

Lab 8 focuses on making a full-stack handwriting OCR project safer to change by adding automated linting, targeted tests, and continuous integration. The core idea is straightforward: every code push should trigger a repeatable quality check so style issues, common bugs, and broken functionality get caught immediately—before they reach production.

Linting is the first gate. A top-level lint task runs a chain of static-analysis tools: Safety checks Python dependencies for known security vulnerabilities, PyLint performs static code analysis for bugs and style-related problems, PyCodeStyle enforces style conventions, PyDocStyle verifies docstrings, MyPy validates type hints by flagging mismatches (for example, passing a string where an int is expected), Bandit looks for common Python security pitfalls like unsafe use of eval, and ShellCheck covers bash script issues. Configuration lives in files such as .pylintrc and setup.cfg, where teams can set rules like maximum line length and which messages to disable. To keep formatting consistent across developers, the workflow also recommends Black, an automated formatter that normalizes quoting, indentation, and line wrapping in a way that stays compatible with the linting rules.
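
To make the chaining concrete, here is a minimal sketch of a lint task that runs the tools named above in sequence. The script name, target directories (`text_recognizer`, `training`, `tasks/lint.sh`), and tool arguments are illustrative assumptions, not the lab's actual script; adjust them to your repository layout.

```python
"""Hedged sketch of a top-level lint task chaining several static-analysis tools."""
import subprocess
import sys

# Each entry is one gate; the tools are assumed to be installed already
# (e.g. pip install safety pylint pycodestyle pydocstyle mypy bandit).
LINT_COMMANDS = [
    ["safety", "check"],                        # known-vulnerable dependencies
    ["pylint", "text_recognizer", "training"],  # static analysis for bugs/style
    ["pycodestyle", "text_recognizer"],         # style conventions
    ["pydocstyle", "text_recognizer"],          # docstring checks
    ["mypy", "text_recognizer"],                # type-hint consistency
    ["bandit", "-r", "text_recognizer"],        # common security pitfalls
    ["shellcheck", "tasks/lint.sh"],            # bash script issues
]

def main() -> int:
    failed = False
    for cmd in LINT_COMMANDS:
        print(f"Running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            failed = True  # keep going so every tool reports its findings
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())
```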

Next come functionality tests for the handwriting-to-text pipeline. A module-level test targets the paragraph text recognizer added in Lab 7. The test disables CUDA to avoid GPU-related variability and uses support assets: images from a dataset plus a JSON file containing ground-truth text and expected character error rate. The test runs the recognizer on each sample, checks that predicted text matches expectations via the character error rate threshold, and records runtime—explicitly aiming to finish within about a minute so it can run frequently.
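
A sketch of what such a test could look like follows. `ParagraphTextRecognizer`, its `predict` method, the `character_error_rate` helper, the support-folder path, and the JSON keys are all assumptions based on the description above, not the lab's exact code.

```python
"""Hedged sketch of a module-level functionality test for the recognizer."""
import json
import os
import time
from pathlib import Path

# Disable CUDA before any torch import so the test is repeatable on CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

SUPPORT_DIR = Path("text_recognizer/tests/support/paragraphs")  # assumed layout
TIME_BUDGET_SECONDS = 60  # "finish within about a minute"

def test_paragraph_text_recognizer():
    # Assumed names standing in for the lab's recognizer class and CER helper.
    from text_recognizer.paragraph_text_recognizer import ParagraphTextRecognizer
    from text_recognizer.util import character_error_rate

    recognizer = ParagraphTextRecognizer()
    ground_truth = json.loads((SUPPORT_DIR / "data_by_file_id.json").read_text())

    start = time.time()
    for image_path in sorted(SUPPORT_DIR.glob("*.png")):
        predicted = recognizer.predict(image_path)
        expected = ground_truth[image_path.stem]
        cer = character_error_rate(predicted, expected["ground_truth"])
        assert cer <= expected["cer"], f"{image_path.name}: CER {cer:.3f} too high"
    assert time.time() - start < TIME_BUDGET_SECONDS
```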

A separate evaluation test measures model quality more strictly. It loads trained weights and configuration for the paragraph text recognizer and runs on the evaluation dataset while using the GPU, since evaluation is expected to be heavier and CircleCI lacks GPU access. The test asserts both accuracy and speed: the character error rate must stay below a target (around 17% in the current setup), and the runtime must also remain under a defined limit.
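
The shape of that stricter test might look like the sketch below. Apart from the ~17% character error rate target quoted above, everything here is an assumption: the dataset loader, the `evaluate` helper, and the runtime limit are hypothetical stand-ins.

```python
"""Hedged sketch of the GPU evaluation test; names and limits are assumed."""
import time

MAX_CER = 0.17         # accuracy target quoted in the lab (~17%)
MAX_SECONDS = 30 * 60  # assumed runtime limit; tune to your hardware

def test_evaluate_paragraph_text_recognizer():
    # Assumed API: the recognizer loads trained weights and config on init,
    # and evaluate() returns mean character error rate over the eval dataset.
    from text_recognizer.paragraph_text_recognizer import ParagraphTextRecognizer
    from training.evaluate import evaluate, load_eval_dataset

    recognizer = ParagraphTextRecognizer()  # trained weights + config
    dataset = load_eval_dataset()           # held-out evaluation split

    start = time.time()
    cer = evaluate(recognizer, dataset)     # runs on GPU when available
    elapsed = time.time() - start

    print(f"character error rate: {cer:.3f}, time: {elapsed:.1f}s")
    assert cer < MAX_CER          # accuracy gate
    assert elapsed < MAX_SECONDS  # speed gate
```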

Finally, an infrastructure test verifies that the training system still runs end-to-end. It executes the training command against a synthetic “fake image data” module, trains a small model (a ConvNet) for a limited number of epochs, and checks that training completes successfully. This isn’t meant to prove accuracy; it’s designed to catch regressions that would prevent training from running at all.
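
A smoke test of this kind can be as simple as shelling out to the training command and checking the exit code. The module path and flags below mirror a typical `run_experiment`-style CLI and are assumptions, not the lab's exact arguments.

```python
"""Hedged sketch of a training-system smoke test."""
import subprocess

def test_training_runs_end_to_end():
    # Hypothetical command; substitute your project's real training entry point.
    cmd = [
        "python", "training/run_experiment.py",
        "--data_class", "FakeImageData",  # synthetic data: nothing to download
        "--model_class", "ConvNet",       # small model, fast epochs
        "--max_epochs", "2",              # just enough to prove the loop completes
    ]
    completed = subprocess.run(cmd)
    assert completed.returncode == 0  # success means training ran end-to-end
```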

Continuous integration ties it together. Using CircleCI, the repository is configured with a top-level config.yml that runs linting and tests on every push. The pipeline uses a Python 3.6 Docker image, installs git lfs, ShellCheck, and the exact pinned requirements, then executes lint and tests in sequence. Even if lint fails, tests still run to provide additional signal. The evaluation test is intentionally skipped in CI due to missing GPU resources and runtime constraints. The assignment is to fork the repo, set up CircleCI, and confirm the build turns green.

Cornell Notes

Lab 8 adds a quality-control pipeline for a handwriting OCR system by combining linting, module tests, evaluation tests, and a training “smoke test,” then wiring them into continuous integration. Linting runs multiple static-analysis tools (Safety, PyLint, PyCodeStyle, PyDocStyle, MyPy, Bandit, ShellCheck) plus Black formatting to enforce consistent code style and catch common bugs early. Functionality tests validate the paragraph text recognizer using support images and a JSON file of ground truth and expected character error rate, with CUDA disabled to keep results stable and fast. An evaluation test uses GPU to load trained weights and assert both character error rate and runtime thresholds. A training system test runs training on synthetic “fake image data” to ensure the training command completes successfully. CircleCI runs linting and tests on every push, but skips the GPU-heavy evaluation test.

Why does the linting step run several different tools instead of just one?

Linting is split across complementary checks. Safety scans dependency packages for known security vulnerabilities. PyLint performs static analysis for bugs and some style issues. PyCodeStyle enforces style rules not covered by PyLint. PyDocStyle checks docstrings for functions, classes, and modules. MyPy uses static type hints to catch type mismatches (e.g., passing a string where an int is expected). Bandit targets common Python security vulnerabilities such as unsafe eval usage. ShellCheck covers bash scripts, which helps prevent shell-specific errors that static Python tools won’t catch.
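
A tiny example of the kind of mismatch MyPy catches statically: running `mypy` over this file reports the error below before the code ever runs. The function here is invented purely for illustration.

```python
from typing import List

def mean_confidence(scores: List[float]) -> float:
    return sum(scores) / len(scores)

# MyPy reports: Argument 1 to "mean_confidence" has incompatible type
# "List[str]"; expected "List[float]" -- caught without executing anything.
mean_confidence(["0.9", "0.8"])
```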

How does the paragraph text recognizer functionality test decide whether predictions are correct?

It runs the recognizer on a small set of support images and compares outputs using character error rate. The support folder provides images expected in production-like use, while a JSON file supplies ground-truth text and the character error rate the model should achieve. The test disables CUDA so the run stays stable and avoids GPU-related interference, and it also tracks runtime with the goal that the test finishes within about a minute.
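
Character error rate is conventionally defined as the Levenshtein edit distance between prediction and reference, divided by the reference length. The sketch below implements that standard definition; it is not necessarily the lab's exact implementation, but it could back the assumed `character_error_rate` helper used in the test sketch above.

```python
def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance between strings, normalized by reference length."""
    # Classic dynamic-programming Levenshtein distance over characters.
    m, n = len(predicted), len(reference)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
    return dist[m][n] / max(n, 1)

# One substitution ("a" -> "e") against an 11-character reference.
assert character_error_rate("hallo world", "hello world") == 1 / 11
```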

What makes the evaluation test different from the functionality test?

The evaluation test is stricter and heavier. It loads trained weights and configuration for the paragraph text recognizer and runs on the evaluation dataset, using the GPU. At the end it reports character error rate and time taken, then asserts both accuracy (the character error rate must be below a target, roughly 17% in the current setup) and performance (the runtime must be under an expected limit).

What does the training system infrastructure test verify, and what does it not verify?

It verifies that the training command can run end-to-end without crashing. The test uses a synthetic “fake image data” data module so it doesn’t download real datasets or waste time. It trains a ConvNet for a limited number of epochs with specified parameters and checks that training completes successfully. It does not primarily validate correctness or high accuracy; it’s a regression guard against training-system breakage.

Why is the evaluation test skipped in CircleCI?

CircleCI in this setup doesn’t provide GPU access, and the evaluation test is designed to use the GPU for loading weights and running over the evaluation dataset in a reasonable time. Without a GPU, the evaluation would either fail due to missing hardware or take too long, so CI focuses on linting and non-GPU tests.
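
The lab may simply leave the evaluation test out of the command CI runs, but an equivalent pattern is to mark the test so pytest skips it whenever no GPU is visible. This sketch uses pytest's standard `skipif` marker and PyTorch's `cuda.is_available()` check:

```python
import pytest
import torch

@pytest.mark.skipif(not torch.cuda.is_available(), reason="evaluation requires a GPU")
def test_evaluate_paragraph_text_recognizer():
    ...  # the heavy GPU evaluation from the sketch earlier
```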

What does the CircleCI pipeline run on each push?

The top-level config.yml uses a Python 3.6 Docker image and installs Git LFS, ShellCheck, and the exact pinned requirements. It then runs linting and tests. Both steps are executed so that even if linting fails, tests still run to provide additional debugging signal. A green check indicates success; a red cross indicates CI failure, and the result appears in commit history and pull requests.

Review Questions

  1. Which linting tools would catch type mismatches, and which would catch dependency vulnerabilities?
  2. How do the functionality and evaluation tests differ in hardware requirements and assertion criteria?
  3. What is the purpose of using synthetic “fake image data” in the training infrastructure test?

Key Points

  1. Linting is implemented as a multi-tool static-analysis pipeline (Safety, PyLint, PyCodeStyle, PyDocStyle, MyPy, Bandit, ShellCheck) plus Black formatting to enforce consistent code quality.
  2. The paragraph text recognizer functionality test uses support images and a JSON ground-truth file to validate predictions via character error rate, with CUDA disabled for stability and speed.
  3. The evaluation test loads trained weights and runs on the evaluation dataset using GPU, asserting both character error rate (target around 17%) and runtime limits.
  4. The training infrastructure test is a smoke test: it trains on synthetic “fake image data” to confirm the training command completes successfully, not to prove high accuracy.
  5. CircleCI is configured to run linting and tests on every push using a Python 3.6 Docker image and pinned requirements.
  6. GPU-heavy evaluation is intentionally excluded from CI because CircleCI lacks GPU access and the evaluation would be too slow or fail without it.
  7. The workflow goal is fast feedback: tests are designed to finish quickly enough to run frequently, while CI provides immediate pass/fail signals on commits and pull requests.

Highlights

Linting chains security, style, documentation, type checking, and shell-script checks into one gate, so issues get caught before runtime.
The paragraph text recognizer test validates outputs using character error rate against JSON ground truth while keeping CUDA off for repeatability.
The evaluation test enforces both accuracy and speed thresholds and relies on GPU, which is why it’s excluded from CircleCI.
The training test uses synthetic “fake image data” to ensure the training pipeline runs end-to-end without needing real datasets.
CircleCI runs linting and tests on every push and reports results directly in commit history and pull requests via green checks or red crosses.

Topics

Mentioned

  • CI
  • GPU
  • LFS