
Storage (4) - Data Management - Full Stack Deep Learning

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Match each storage system to its job: file systems for simple file-based workflows (with limited parallelism when data sits on a single physical disk), object storage for versioned binaries with parallel reads, databases for structured metadata and labels, and data lakes for raw multi-source logs transformed at read time.

Briefing

Storage choices determine how data moves, how fast it can be read, and how safely it can be reused across training and production. The core takeaway is a practical division of labor: file systems handle foundational “files,” object storage wraps files as addressable objects with API-style operations and built-in versioning/redundancy, databases store structured records for fast querying, and data lakes aggregate semi-structured or log-like data for later transformation.

At the base layer, file systems treat the fundamental unit as a file—text or binary—typically not versioned at the file-system level and easy to overwrite or delete. They work well when access patterns are simple, such as reading a whole dataset from a single disk (which can be fast but limits parallelism because the data must pass through one physical location). Network file systems extend this by letting multiple machines share the same file namespace, and distributed file systems (including Hadoop-style setups) allow many machines to access files without needing to know where the bytes physically live.

Object storage changes the performance and management model by presenting an API over storage rather than a traditional filesystem. Instead of thinking in terms of “where the file is,” the system treats the fundamental unit as an object—often binary, but sometimes text—and can add features like versioning (new writes create additional versions rather than overwriting) and redundancy (replicating across multiple disks). That redundancy enables parallel reads: many “get” requests can be served from different disks at once. The tradeoff is that the extra abstraction layer can make object storage slower than raw local disk access, and it must handle concurrent readers. Amazon S3 is the canonical example, with similar patterns available via other cloud providers.
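The versioning behavior described above can be sketched in a few lines. This is a hypothetical in-memory stand-in for illustration only, not the S3 API: writes append new versions rather than overwriting, and reads can name a specific version.

```python
# Hypothetical in-memory sketch of object-store versioning semantics.
# A real system (e.g. S3) also replicates each object across disks,
# which is what makes parallel reads possible.
class ObjectStore:
    def __init__(self):
        self._versions = {}  # key -> list of byte payloads, oldest first

    def put(self, key, data):
        """Store a new version of the object; earlier versions survive."""
        self._versions.setdefault(key, []).append(data)
        return len(self._versions[key]) - 1  # version id of this write

    def get(self, key, version=None):
        """Fetch the latest version by default, or a specific one."""
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version]

store = ObjectStore()
v0 = store.put("images/cat.png", b"old bytes")
v1 = store.put("images/cat.png", b"new bytes")
assert store.get("images/cat.png") == b"new bytes"          # latest wins
assert store.get("images/cat.png", version=v0) == b"old bytes"  # old survives
```

The key contrast with a plain file system is that the second `put` does not destroy the first write, so a training run can pin an exact dataset version.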

Databases sit on top of this with a different mental model: data is treated as if it lives in RAM for speed, while the database engine ensures durability by persisting changes to disk so nothing is lost on shutdown or power failure. The fundamental unit is a row with a unique identifier, columns for values, and references to other rows (including across tables). Databases are meant for structured, query-heavy data—not binary blobs. A common pattern is to store images or other binaries in object storage and keep metadata in the database: labels, dimensions, uploader identity, and the object path/ID. Logs are treated differently: they’re often stored for later investigation or metric computation, not as the primary operational dataset.
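The binaries-in-object-storage, metadata-in-database pattern can be sketched with the standard library's sqlite3. Table and column names here are illustrative, not from the source:

```python
# Sketch of the "binaries in object storage, metadata in the database"
# pattern, using stdlib sqlite3. Table/column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        id INTEGER PRIMARY KEY,
        uploader TEXT,
        width INTEGER,
        height INTEGER,
        label TEXT,
        object_key TEXT  -- pointer into object storage, not the bytes
    )
""")
conn.execute(
    "INSERT INTO images (uploader, width, height, label, object_key) "
    "VALUES (?, ?, ?, ?, ?)",
    ("alice", 640, 480, "cat", "s3://bucket/images/cat.png"),
)
# Queries touch only structured metadata; the blob stays in object storage.
row = conn.execute(
    "SELECT object_key FROM images WHERE label = 'cat'"
).fetchone()
print(row[0])  # the key used to fetch the actual bytes separately
```

The query returns a pointer, not pixels: the database answers "which images?" quickly, and the object store serves the bytes on demand.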

For the database layer, PostgreSQL is presented as a pragmatic default because it supports both SQL and JSON documents, letting teams handle “schema on read” needs without abandoning structured querying. SQL is framed as the right interface for structured data, and avoiding it risks reinventing query logic poorly.
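PostgreSQL's dual SQL-plus-JSON model can be approximated in a self-contained sketch using sqlite3's JSON1 functions (which most modern Python builds include); the table and field names are made up for illustration:

```python
# Sketch of SQL querying over JSON documents. PostgreSQL would use
# jsonb columns and operators; here stdlib sqlite3's json_extract
# mirrors the same idea. Field names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, meta TEXT)")
conn.execute(
    "INSERT INTO samples (meta) VALUES (?)",
    ('{"label": "dog", "source": "mobile_upload"}',),
)
# Schema-on-read over the JSON column, still via a plain SQL query.
label = conn.execute(
    "SELECT json_extract(meta, '$.label') FROM samples"
).fetchone()[0]
print(label)  # -> dog
```

The point is that flexible documents and structured querying are not mutually exclusive: the JSON column absorbs schema changes while SQL remains the query interface.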

Data lakes aggregate data from multiple sources—often logs and event streams—using “schema on read.” Raw data lands in the lake, then downstream processes impose structure later, transforming it into databases, packaged training files, or other artifacts. When training time arrives, minimizing distance between data and GPUs matters: the needed subset is copied to local or network storage near the compute so training doesn’t stall on slow transfers.
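Schema-on-read can be shown concretely: raw events land as-is, and a downstream step imposes structure only when it reads them. The event shapes below are invented for illustration:

```python
# Schema-on-read sketch: raw JSON-lines events land in the "lake" as-is;
# structure is imposed only when a downstream step reads them.
import json

raw_events = [
    '{"event": "predict", "latency_ms": 12, "model": "v3"}',
    '{"event": "upload", "user": "alice"}',  # different shape, still accepted
    '{"event": "predict", "latency_ms": 30, "model": "v3"}',
]

# Transformation happens at read time: pick a schema, skip what doesn't fit.
latencies = [
    rec["latency_ms"]
    for rec in map(json.loads, raw_events)
    if rec.get("event") == "predict"
]
print(sum(latencies) / len(latencies))  # mean prediction latency: 21.0
```

A database would have rejected the mixed-shape rows at write time; the lake accepts everything and defers the schema decision to each consumer.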

Finally, “what goes where” is summarized as: binaries to object storage (for versioning and parallel access), metadata/labels to databases, and raw event/log data to data lakes, with feature extraction or aggregation steps producing the training-ready datasets. For deeper study, the discussion points to Designing Data-Intensive Applications as a first-principles guide to databases, logs, and related tradeoffs.

Cornell Notes

Storage architecture works best when each system type owns a different job. File systems manage traditional files and shared mounts, but parallelism is constrained by physical placement. Object storage wraps binaries as addressable objects with API operations, enabling versioning and redundancy and often supporting parallel reads, at the cost of some latency. Databases store structured records (rows) for fast querying and durability, while binary data typically lives in object storage with only metadata/labels stored in the database. Data lakes aggregate multi-source data like logs using schema-on-read, then transform and package only what training or analytics needs—often copying the final training subset close to the GPUs to reduce data transfer distance.

Why do access patterns matter when choosing between file systems and object storage?

File systems can be fast when a single disk holds the dataset and reads stream through that disk, but parallelism is limited because the data must be read from one physical location. Object storage can support parallel reads because objects are stored redundantly across multiple disks; many concurrent get requests can be served from different disks at once. The tradeoff is that the object-storage abstraction layer adds latency and must handle concurrent access.

What’s the practical rule for storing binaries versus metadata?

Binary content (images, audio, other large blobs) fits object storage because it can be versioned and accessed in parallel. The database should store metadata and labels—such as who uploaded an image, image dimensions, and the object path/ID—so training and retrieval can query structured attributes without dragging binary blobs through database queries.

How does schema-on-read in a data lake differ from schema in a database?

A data lake typically ingests raw data without enforcing a strict schema up front, then applies transformations later when the data is actually needed—this is schema-on-read. In contrast, databases enforce schema at write time (or at least strongly structure how rows and columns are stored), enabling efficient joins and queries over structured fields.

Why is SQL emphasized even for teams that think in JSON or NoSQL terms?

PostgreSQL is highlighted as a middle ground: it supports SQL for structured querying and also supports JSON documents so teams can store flexible documents without giving up relational querying. The warning is that avoiding SQL can lead to reinventing query capabilities in ad hoc ways, which tends to be slower and less complete than using a mature query interface.

When should training data be moved near GPUs?

When training starts, the needed subset of data should be copied to local or network storage accessible to the training machines. The goal is to minimize “distance” between data and GPUs so training doesn’t stall on slow reads from far-away storage systems.
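A minimal staging step might look like the following sketch, where the manifest format and paths are hypothetical and temporary directories stand in for remote and local storage:

```python
# Sketch of staging a training subset onto fast local storage before a
# run. The manifest format and directory layout are hypothetical.
import shutil
import tempfile
from pathlib import Path

def stage_subset(manifest, source_dir, local_dir):
    """Copy only the files named in the manifest to local storage."""
    source_dir, local_dir = Path(source_dir), Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    for name in manifest:
        shutil.copy2(source_dir / name, local_dir / name)
    return sorted(p.name for p in local_dir.iterdir())

# Demo: temp dirs stand in for remote storage and the GPU node's disk.
remote = Path(tempfile.mkdtemp())
for name in ["a.bin", "b.bin", "c.bin"]:
    (remote / name).write_bytes(b"data")

staged = stage_subset(["a.bin", "c.bin"], remote, tempfile.mkdtemp() + "/fast")
print(staged)  # ['a.bin', 'c.bin'] -- only the needed subset moved
```

Copying only the manifest's subset, rather than the whole dataset, is what keeps the transfer step from dominating the training run.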

How do feature stores relate to data lakes and raw data pipelines?

A feature store can be populated as features are computed during service operation—transforming raw events immediately and storing the resulting features for future training. The counterpoint is that recomputing features is sometimes necessary when feature definitions change; starting from raw data each training run can be mentally cleaner because it avoids stale or inconsistent cached features, though at very large scale teams may adopt different tradeoffs.
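Both options can be contrasted in a small sketch; the feature definition (total clicks per user) is made up for illustration:

```python
# Sketch contrasting the two approaches: cache features as events arrive
# (feature-store style) vs. recompute from raw events at training time.
# The feature definition here is invented for illustration.
raw_events = [
    {"user": "alice", "clicks": 3},
    {"user": "alice", "clicks": 5},
    {"user": "bob", "clicks": 2},
]

def compute_features(events):
    """Aggregate raw events into per-user features (total clicks)."""
    features = {}
    for e in events:
        features[e["user"]] = features.get(e["user"], 0) + e["clicks"]
    return features

# Option 1: populate a feature store incrementally as events arrive.
feature_store = {}
for e in raw_events:
    feature_store[e["user"]] = feature_store.get(e["user"], 0) + e["clicks"]

# Option 2: recompute from raw data at training time.
recomputed = compute_features(raw_events)

# The two agree only while the feature definition stays fixed; if
# compute_features changes, the cached store goes stale until backfilled.
assert feature_store == recomputed
```

This is the drift the answer above warns about: the cached store saves compute but must be invalidated whenever `compute_features` changes, whereas recomputing from raw events is always consistent with the current definition.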

Review Questions

  1. Which storage layer best fits binary blobs, and what should be stored alongside them for efficient querying?
  2. Explain how schema-on-read in a data lake changes when and where structure is imposed compared with a database.
  3. What performance tradeoffs arise from file systems versus object storage when many machines read data concurrently?

Key Points

  1. Use file systems for traditional file-based workflows, but expect limited parallelism when data sits on a single physical disk.
  2. Store binaries in object storage to gain versioning and redundancy, and to enable parallel reads across disks.
  3. Keep structured metadata and labels in a database as rows so joins and queries remain fast and durable.
  4. Use data lakes for multi-source raw data such as logs, relying on schema-on-read and later transformation into training-ready datasets.
  5. Prefer PostgreSQL as a default because it supports both SQL and JSON documents, covering structured and semi-structured needs.
  6. When training begins, copy only the required data subset close to the GPUs to reduce transfer latency and keep compute fed.
  7. Treat feature stores as a pipeline decision: either compute features early and reuse them, or recompute from raw data to avoid feature-definition drift.

Highlights

Object storage treats data as addressable objects with API-style operations, making versioning and redundancy practical and enabling parallel reads.
Databases are optimized for structured rows and durability; binary blobs belong in object storage while labels/metadata belong in the database.
Data lakes ingest raw multi-source data with schema-on-read, then transform later—often right before training when the needed subset is packaged.
PostgreSQL is positioned as a flexible default: SQL for structured querying plus JSON support for schema-less document storage.
Training performance depends on physical proximity: copying the training subset near GPUs can matter as much as the storage system itself.

Topics

  • File Systems
  • Object Storage
  • Databases
  • Data Lakes
  • Feature Stores

Mentioned

  • SQL
  • NoSQL
  • RAM
  • GPU
  • S3