6. GraphDB Fundamentals: Loading Data
Based on Ontotext's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Loading RDF data into GraphDB can be done through the Workbench interface for quick, interactive imports—or through command-line tools designed for large-scale, high-performance ingestion. The core takeaway is that GraphDB offers multiple loading paths depending on whether the goal is convenience (uploading files, importing from URLs, pasting snippets) or throughput and reliability (offline loaders that serialize data directly into internal indexes).
In GraphDB Workbench, local RDF files are imported by uploading them, selecting which files to include, and running an import that asks for a base IRI and target graphs. Import settings are saved for repeat runs, and an “advanced settings” area lets users adjust additional parameters before starting the import. Workbench also supports remote ingestion: users can choose an RDF import flow that pulls data from a provided URL. For smaller or ad-hoc datasets, RDF text snippets can be pasted directly into the Workbench import dialog.
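As a rough programmatic equivalent of the paste-a-snippet flow, the same data can be pushed over GraphDB's RDF4J-compatible REST API. This is a sketch only: the repository name, port, and graph IRI below are placeholders.

```shell
# Hypothetical example: POST a Turtle snippet into repository "myrepo",
# targeting the named graph <http://example.org/graph1> via the
# URL-encoded "context" query parameter.
curl -X POST \
  "http://localhost:7200/repositories/myrepo/statements?context=%3Chttp://example.org/graph1%3E" \
  -H "Content-Type: text/turtle" \
  --data-binary @- <<'EOF'
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ;
    ex:name "Alice" .
EOF
```

Omitting the context parameter adds the triples to the default graph instead.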
Workbench further supports server-side file loading for arbitrary-size datasets. Files are read from a directory on the GraphDB server, with a default location of ${user.home}/graphdb-import/. The import directory can be changed via the graphdb.workbench.importDirectory system property, and Workbench lists the available files in that server directory so users can import them selectively.
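A minimal sketch of overriding the server-side import directory at startup; the path shown is a hypothetical example:

```shell
# Pass the system property to the GraphDB startup script;
# /data/rdf-imports is a placeholder path.
graphdb -Dgraphdb.workbench.importDirectory=/data/rdf-imports
```

Files placed in that directory then appear in the Workbench server-files import list.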
For very large datasets, GraphDB relies on offline, batch-oriented tools. The loadrdf tool is built for offline loading of RDF datasets; it cannot target a running server. Its performance rationale is straightforward: it serializes RDF directly into GraphDB’s internal indexes and produces a ready-to-use repository. It can create a new repository from a standard Turtle configuration template (configs/templates) or initialize from an existing repository, in which case the repository’s data is overwritten. Execution happens from the command line by invoking loadrdf with parameters including the repository name, serial or parallel mode, and the RDF data file(s).
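A command-line sketch, under the assumption that the server is stopped; the file names are placeholders, and the exact flags should be verified against the tool's built-in help:

```shell
# Create a new repository from a Turtle config template and load in parallel mode.
loadrdf -c repo-config.ttl -m parallel dataset.nt.gz

# Or target an existing repository by id (its current data is overwritten).
loadrdf -i myrepo -m parallel dataset.nt.gz
```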
When datasets are so large that initial indexing becomes a multi-hour process, preload becomes the critical step. Preload converts RDF files into GraphDB indexes at a low level and is positioned for initial loads of datasets larger than several billion RDF statements. It supports transactional behavior such as stop requests, resume, and consistent output even after failures. If a run ends abnormally—due to disk space, out-of-memory, or other interruptions—preload can resume from intermediate restore points rather than restarting from scratch, using collected data sufficient to reinitialize internal components and continue.
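A sketch of a preload invocation with placeholder file names; per the recovery behavior described above, re-running the tool after an abnormal exit is expected to continue from the last restore point rather than starting over:

```shell
# Low-level offline indexing of a very large dataset,
# creating the repository from a Turtle config template.
preload -c repo-config.ttl statements-part1.nt.gz statements-part2.nt.gz
```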
Finally, OntoRefine provides a transformation layer for mapping structured data into RDF. It can convert tabular and semi-structured formats—CSV, TSV, XLS/XLSX, JSON, XML, RDF/XML, and Google Sheets—into RDF using a locally stored RDF schema. The workflow runs inside the GraphDB Workbench visual interface: open the OntoRefine view, create a project, upload data (from the computer, URLs, the clipboard, or existing projects), then define RDF mappings. Users map table headers to RDF triple construction, configure value mappings (value source, value type, optional transformation), and apply transformations using language-specific expressions, including GREL (the Google Refine Expression Language). The mapping editor supports prefixes from common RDF vocabularies (e.g., FOAF, Geo, RDF, RDFS, SKOS, XSD), enabling users to build IRIs and literals without importing every schema into the repository.
Cornell Notes
GraphDB supports RDF ingestion through Workbench for interactive imports and through offline command-line tools for high-volume loading. Workbench can upload local RDF files, import RDF from URLs, accept pasted RDF text snippets, or load files from a server directory. For very large datasets, loadrdf creates or overwrites repositories by serializing RDF directly into internal indexes, using serial or parallel modes. preload is a low-level offline indexing tool designed for datasets with billions of statements; it supports stop/resume and can restart from restore points after failures. For structured data, OntoRefine maps tabular formats (CSV/TSV/XLS/JSON/XML/Google Sheets) into RDF using an RDF schema and configurable value mappings with transformations via GREL.
What are the main ways to load RDF data in GraphDB Workbench, and what extra settings appear during import?
How does GraphDB handle server-side file loading, and where do files come from by default?
Why use the offline loadrdf tool instead of loading against a running server?
What makes preload different from loadrdf, and how does it recover from failures?
How does OntoRefine turn structured data into RDF, and what are the key mapping concepts?
Review Questions
- When importing RDF files through Workbench, which two parameters are requested during the import step, and how can users reuse prior import settings?
- What operational difference distinguishes loadrdf from preload, and which one is intended for datasets with billions of RDF statements?
- In OntoRefine, how do value source, value type, and optional transformation work together to produce RDF values from tabular cells?
Key Points
1. GraphDB Workbench supports RDF ingestion from local files, remote URLs, and pasted RDF text snippets, with import settings that include base IRI and target graphs.
2. Workbench can also load RDF from a server directory, controlled by the graphdb.workbench.importDirectory system property and listed in the server-files tab of the Import view.
3. loadrdf is an offline loader that serializes RDF directly into GraphDB's internal indexes and creates or overwrites repositories; it cannot run against a running server.
4. preload is a low-level offline indexing tool for extremely large datasets (billions of statements) and supports resume from intermediate restore points after failures.
5. OntoRefine converts structured formats (CSV/TSV/XLS/JSON/XML/Google Sheets) into RDF by mapping tabular data to an RDF schema using configurable value mappings and transformations (including GREL).
6. In OntoRefine mappings, each table row generates RDF triples, while each cell-to-value conversion is controlled by value source, value type, and optional transformation rules.
7. Prefixes from common RDF vocabularies (like FOAF, RDFS, SKOS, and XSD) can be used in mapping configurations without importing those schemas into the repository.