
GraphDB Fundamentals 6: Loading Data

Ontotext · 5 min read

Based on Ontotext's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GraphDB Workbench supports RDF ingestion from local files, remote URLs, and pasted RDF text snippets, with import settings that include base IRI and target graphs.

Briefing

Loading RDF data into GraphDB can be done through the Workbench interface for quick, interactive imports—or through command-line tools designed for large-scale, high-performance ingestion. The core takeaway is that GraphDB offers multiple loading paths depending on whether the goal is convenience (uploading files, importing from URLs, pasting snippets) or throughput and reliability (offline loaders that serialize data directly into internal indexes).

In GraphDB Workbench, local RDF files are imported by uploading them, selecting which files to include, and running an import that asks for a base IRI and target graphs. Import settings are saved for repeat runs, and an “advanced settings” area lets users adjust additional parameters before starting the import. Workbench also supports remote ingestion: users can choose an RDF import flow that pulls data from a provided URL. For smaller or ad-hoc datasets, RDF text snippets can be pasted directly into the Workbench import dialog.
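As a minimal sketch, here is the kind of Turtle snippet that could be uploaded as a local file or pasted into the Workbench import dialog (the file name and IRIs are illustrative, not from the source):

```shell
# Write a small Turtle file of the kind Workbench accepts as an upload or pasted snippet.
cat > people.ttl <<'EOF'
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/alice> a foaf:Person ;
    foaf:name "Alice" .
EOF

# Relative IRIs in a pasted snippet are resolved against the base IRI entered in
# the import dialog; the target graph setting decides which named graph receives
# the triples.
head -n 1 people.ttl
```

Since this snippet uses only absolute IRIs, the base IRI would not change its meaning; it matters once relative IRIs appear in the data.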

Workbench further supports server-side file loading for arbitrary-size datasets. Files are read from a directory on the GraphDB server, with a default location of $user.home/graphdb-import/. The import directory can be changed via the graphdb.workbench.importDirectory system property, and Workbench can list the available files in that server directory so users can import them selectively.
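Staging a file for server-side import can be sketched like this (the file name is hypothetical; the directory and system property follow the defaults described above):

```shell
# Stage an RDF file in the Workbench server-side import directory. The default
# location is $HOME/graphdb-import; it can be changed by starting GraphDB with
# -Dgraphdb.workbench.importDirectory=/some/other/dir
IMPORT_DIR="${HOME}/graphdb-import"
mkdir -p "$IMPORT_DIR"
printf '<http://example.org/s> <http://example.org/p> "o" .\n' > "$IMPORT_DIR/sample.nt"

# Files placed here appear in Workbench's server-files import list for selective import.
ls "$IMPORT_DIR"
```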

For very large datasets, GraphDB relies on offline, batch-oriented tools. The loadrdf tool is built for offline loading of RDF datasets; it cannot target a running server. Its performance rationale is straightforward: it serializes RDF directly into GraphDB's internal indexes and produces a ready-to-use repository. It can create a new repository from a standard Turtle configuration template (shipped under configs/templates) or initialize from an existing repository, in which case the repository's data is overwritten. Execution happens from the command line using loadrdf with parameters including the repository name, serial/parallel mode, and the RDF data files.
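A loadrdf run might look like the sketch below. The repository configuration is a simplified Turtle fragment modeled on the templates shipped under configs/templates; the exact properties, the binary path, and the command-line flags are assumptions to verify against the templates and docs of your GraphDB distribution:

```shell
# Minimal repository configuration sketch (adapted from the configs/templates
# Turtle templates; properties are illustrative, check your distribution's template).
cat > repo-config.ttl <<'EOF'
@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix sr: <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .

[] a rep:Repository ;
   rep:repositoryID "my-repo" ;
   rep:repositoryImpl [
       rep:repositoryType "graphdb:SailRepository" ;
       sr:sailImpl [ sail:sailType "graphdb:Sail" ]
   ] .
EOF

# Illustrative invocation, run from the GraphDB distribution directory with the
# server stopped (loadrdf cannot target a running server):
#   ./bin/loadrdf -c repo-config.ttl -m parallel data.ttl
grep 'rep:repositoryID' repo-config.ttl
```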

When datasets are so large that initial indexing becomes a multi-hour process, preload becomes the critical step. Preload converts RDF files into GraphDB indexes at a low level and is positioned for initial loads of datasets larger than several billion RDF statements. It supports transactional behavior such as stop requests, resume, and consistent output even after failures. If a run ends abnormally—due to disk space, out-of-memory, or other interruptions—preload can resume from intermediate restore points rather than restarting from scratch, using collected data sufficient to reinitialize internal components and continue.
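The resume behavior can be sketched as follows; the binary path and flags are assumptions (they vary across GraphDB versions), and the command is kept as a string so the sketch runs without a GraphDB installation:

```shell
# Illustrative preload invocation; verify the binary name and flags against your
# GraphDB distribution's documentation. preload cannot target a running server.
PRELOAD_CMD="./bin/preload -c repo-config.ttl big-dataset.nt.gz"
echo "initial load: $PRELOAD_CMD"

# If the run dies abnormally (disk full, out-of-memory, interruption), rerunning
# the same command continues from the last intermediate restore point instead of
# reprocessing the whole dataset from scratch.
echo "resume:       $PRELOAD_CMD"
```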

Finally, OntoRefine provides a transformation layer for mapping structured data into RDF. It can convert tabular and semi-structured formats (CSV, TSV, XLS/XLSX, JSON, XML, RDF/XML, and Google Sheets) into RDF using a locally stored RDF schema. The workflow runs inside the GraphDB Workbench visual interface: start Workbench mode, create a project, upload data (from the computer, URLs, the clipboard, or existing projects), then define RDF mappings. Users map table headers to RDF triple construction, configure value mappings (value source, value type, optional transformation), and apply transformations using language-specific expressions, including GREL (the General Refine Expression Language). The mapping editor supports prefixes from common RDF vocabularies (e.g., FOAF, Geo, RDF, RDFS, SKOS, XSD), enabling users to build IRIs and literals without importing every schema into the repository.
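To make the row-to-triples idea concrete, here is an illustrative input/output pair for an OntoRefine-style mapping (all file names, columns, and IRIs are examples, not from the source): the "name" column is the value source for both the subject IRI (the kind of string manipulation a GREL transformation handles) and a plain literal.

```shell
# A one-row CSV input of the kind OntoRefine ingests.
cat > people.csv <<'EOF'
name,email
Alice,alice@example.org
EOF

# RDF a mapping could produce for that row: the subject IRI is built from the
# "name" cell, foaf:name is a literal, and the email becomes a mailto: IRI.
cat <<'EOF'
<http://example.org/person/Alice>
    a <http://xmlns.com/foaf/0.1/Person> ;
    <http://xmlns.com/foaf/0.1/name> "Alice" ;
    <http://xmlns.com/foaf/0.1/mbox> <mailto:alice@example.org> .
EOF
```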

Cornell Notes

GraphDB supports RDF ingestion through Workbench for interactive imports and through offline command-line tools for high-volume loading. Workbench can upload local RDF files, import RDF from URLs, paste RDF text snippets, or load files from a server directory. For very large datasets, loadrdf creates or overwrites repositories by serializing RDF directly into internal indexes, using serial or parallel modes. Preload is a low-level offline indexing tool designed for datasets with billions of statements; it supports stop/resume and can restart from restore points after failures. For structured data, OntoRefine maps tabular formats (CSV/TSV/XLS/JSON/XML/Google Sheets) into RDF using an RDF schema and configurable value mappings with transformations via GREL.

What are the main ways to load RDF data in GraphDB Workbench, and what extra settings appear during import?

Workbench supports uploading local RDF files, importing RDF from a remote URL, and importing RDF from a pasted text snippet. For local files, users upload files, select which ones to import (either “select all” or per-file selection), then run an import that requires a base IRI and target graphs. Advanced settings can be expanded to adjust additional parameters, and import settings are saved for repeat imports.

How does GraphDB handle server-side file loading, and where do files come from by default?

Workbench can load RDF files that already exist on the GraphDB server where Workbench runs. By default, the server directory is $user.home/graphdb-import/. Users can change the directory via the graphdb.workbench.importDirectory system property, then browse the "Import RDF server files" list to select files from that directory for import.

Why use the offline loadrdf tool instead of loading against a running server?

loadrdf is designed for offline loading and cannot be used against a running server. The performance rationale is that it serializes RDF directly into GraphDB's internal indexes and produces a ready-to-use repository, avoiding the overhead of online ingestion. It can initialize a new repository using the standard configuration templates (configs/templates) or overwrite an existing repository.

What makes preload different from load rdf, and how does it recover from failures?

preload converts RDF files into GraphDB indexes at a very low level, targeting initial loads of datasets larger than several billion RDF statements. It supports stop requests, resume, and consistent output even after failure. If a run terminates abnormally (e.g., insufficient disk space or out of memory), preload restarts from intermediate restore points rather than beginning again, using restore-point data to reinitialize internal components and continue.

How does OntoRefine turn structured data into RDF, and what are the key mapping concepts?

OntoRefine maps structured inputs (CSV, TSV, XLS/XLSX, JSON, XML, RDF/XML, and Google Sheets) into RDF using a locally stored RDF schema. In the mapping editor, each table row becomes RDF triples. Value mappings define how a single tabular cell becomes an RDF value, specifying a value source (column name, constant, record id, or row index), a value type (e.g., IRI, literal, language literal, datatype literal, blank node variants), and an optional transformation. Transformations use a chosen language and expression, including GREL for complex transformations, and the editor provides prefixes from common RDF vocabularies (e.g., FOAF, Geo, RDF, RDFS, SKOS, XSD).

Review Questions

  1. When importing RDF files through Workbench, which two parameters are requested during the import step, and how can users reuse prior import settings?
  2. What operational difference distinguishes loadrdf from preload, and which one is intended for datasets with billions of RDF statements?
  3. In onto refine, how do value source, value type, and optional transformation work together to produce RDF values from tabular cells?

Key Points

  1. GraphDB Workbench supports RDF ingestion from local files, remote URLs, and pasted RDF text snippets, with import settings that include base IRI and target graphs.

  2. Workbench can also load RDF from a server directory, controlled by the graphdb.workbench.importDirectory system property and listed under "Import RDF server files."

  3. loadrdf is an offline loader that serializes RDF directly into GraphDB internal indexes and creates or overwrites repositories; it cannot run against a running server.

  4. preload is a low-level offline indexing tool for extremely large datasets (billions of statements) and supports resuming from intermediate restore points after failures.

  5. OntoRefine converts structured formats (CSV/TSV/XLS/JSON/XML/Google Sheets) into RDF by mapping tabular data to an RDF schema using configurable value mappings and transformations (including GREL).

  6. In OntoRefine mappings, each table row generates RDF triples, while each cell-to-value conversion is controlled by value source, value type, and optional transformation rules.

  7. Prefixes from common RDF vocabularies (like FOAF, RDFS, SKOS, and XSD) can be used in mapping configurations without importing those schemas into the repository.

Highlights

Workbench import dialogs require a base IRI and target graphs, and advanced settings can be saved for repeat imports.
loadrdf is built for offline performance by serializing RDF straight into GraphDB's internal indexes, producing a ready-to-use repository.
preload targets initial loads of datasets larger than several billion RDF statements and can resume from restore points after abnormal termination.
OntoRefine turns tabular rows into RDF triples by letting users drag column headers into an RDF mapping editor and define value mappings with optional GREL transformations.

Topics

  • RDF Loading
  • GraphDB Workbench
  • Offline Loaders
  • Preload Indexing
  • OntoRefine Mapping