Public UBS Knowledge Graph – Building a Connected Data Catalog

6 min read

Based on The Knowledge Graph Conference's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

UBS’s core goal is a curated data layer that enables employees to find, understand, trust, and legally reuse data across roughly 5,000 business applications.

Briefing

UBS is building a connected data catalog at massive scale by shifting from tool-centric data management to a capability-driven model: catalog data assets, map them to a shared enterprise data model, and then federate the work through small, reusable “data services” that translate each application’s data into a common semantic format. The payoff is a curated data layer where tens of thousands of employees can find, understand, trust, and legally reuse data—without forcing every application to change its underlying systems.

The scale is the central problem. UBS runs roughly 5,000 business applications across a global workforce of about 70,000 employees. Each application typically comes with its own budget, team, database, and data structure, and privacy compartmentalization is deliberately built in, which matters under Swiss banking rules. Integration is manageable when connecting a few systems, but it turns into a “major headache” when the number of applications balloons. UBS’s goal is to provide a consistent, well-described data layer for all employees: discover the right datasets, interpret them despite thousands of structures and languages, verify quality and timeliness, and confirm licensing and usage rights.

The approach relies on four steps that have been stable for about a decade: catalog data assets, build a conceptual enterprise data model, map application data to that conceptual layer, and use the mapped layer for consistent access and reuse. What changed after years of trying “one big catalog” is the recognition that multiple catalogs and schemas already exist, often scattered across spreadsheets and local descriptions. Instead of centralizing governance and cataloging into a single monolithic team, UBS federates the work.

At the technical core is a “data service,” described as the smallest unit of work: a Kubernetes-based virtual machine that reads from a source system and transforms the output into JSON-LD (linked data). That transformation is driven by configuration mappings, letting developers expose data without rewriting their systems. UBS then pipes the resulting linked data through Kafka, with roughly 80 source systems feeding the pipeline. The semantic layer is anchored in an internal schema aligned with schema.org-style structured data.
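
To make the mechanics concrete, here is a minimal Python sketch of such a data service, assuming the kafka-python client. The field mapping, topic name, broker address, and schema context are illustrative placeholders, not UBS's actual configuration.

```python
# A minimal sketch of a "data service": read rows from a source system
# (faked here with a literal), apply a declarative field mapping, and
# publish the resulting JSON-LD to Kafka.
import json

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical configuration mapping: source column -> target schema property.
FIELD_MAPPING = {
    "cust_id": "identifier",
    "cust_name": "name",
    "created_ts": "dateCreated",
}

CONTEXT = "https://schema.org"  # stand-in for UBS's internal schema

def to_json_ld(row: dict) -> dict:
    """Translate one source row into a JSON-LD document via the mapping."""
    doc = {"@context": CONTEXT, "@type": "Thing"}
    for source_field, target_property in FIELD_MAPPING.items():
        if source_field in row:
            doc[target_property] = row[source_field]
    return doc

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# In a real service these rows would come from the source system's database.
for row in [{"cust_id": "42", "cust_name": "Example AG", "created_ts": "2023-01-01"}]:
    producer.send("data-service.customers", to_json_ld(row))
producer.flush()
```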

UBS connects not just application data but also the metadata needed to understand it end-to-end. Application registries describe business applications and their environments; infrastructure systems describe servers, networks, ports, and connectivity; scanners identify data assets such as buckets and relational databases; and asset-structure metadata is integrated from other systems. At the top sits business glossaries, including separate glossaries per line of business. All of this is carried through Kafka as JSON-LD and written into both a graph database and a data lake. The graph database supports connected, pre-linked exploration, while the data lake supports time-based analysis over large file sets.
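
The talk does not publish UBS's internal schema, but the linking idea can be illustrated with a hypothetical JSON-LD record (expressed here as a Python dict) in which an application registry entry points at its infrastructure, a scanned data asset, and a glossary term. All IRIs and the `ex:` properties are invented for the example; only the schema.org terms are real vocabulary.

```python
# Hypothetical JSON-LD record showing how the metadata layers link:
# application registry entry -> infrastructure -> scanned asset -> glossary.
import json

application_record = {
    "@context": {
        "@vocab": "https://schema.org/",
        "ex": "https://example.com/ns#",  # invented namespace for custom links
    },
    "@id": "urn:example:app/payments-ledger",
    "@type": "SoftwareApplication",
    "name": "Payments Ledger",
    "ex:runsOn": {"@id": "urn:example:infra/server-0031"},  # infrastructure layer
    "about": {                                              # scanned data asset
        "@id": "urn:example:asset/ledger-db",
        "@type": "Dataset",
        "name": "Ledger relational database",
        "ex:glossaryTerm": {"@id": "urn:example:glossary/settlement"},
    },
}

print(json.dumps(application_record, indent=2))
```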

The model is built around capabilities—find, understand, trust, use, and reuse—supported by open standards. UBS cites DCAT for dataset description and DQV for data quality and lineage. Lessons learned emphasize business value first (including separating data registration from data governance), using open standards to reduce adoption friction, and aligning team topology with the federated architecture. The rollout is described as a long game: about three years to reach the current state, with additional years expected to mature and expand scanning and onboarding. In follow-up Q&A, UBS frames “small wins” as cost savings (like identifying and archiving unused or archive tables) to keep funding moving while the broader revenue potential is deferred to later stages.

Cornell Notes

UBS is tackling data integration and data management at extreme scale by building a connected data catalog that employees can trust and reuse. With about 5,000 business applications and ~70,000 employees, the challenge is that each application has its own database, structure, and language—making integration and governance difficult. UBS keeps the core cataloging approach (catalog assets, create an enterprise data model, map to it) but changes execution: instead of one central team, it federates work using small Kubernetes-based “data services” that transform source data into JSON-LD and publish it via Kafka. Metadata and semantics flow into both a graph database (for connected discovery) and a data lake (for time-based analysis). Open standards like DCAT and DQV underpin dataset description, quality, and lineage, while business value and team structure drive adoption.

Why does UBS treat “data management at scale” as a different problem than integrating a few systems?

Integration is manageable when only a handful of systems need to connect, but UBS faces a combinatorial explosion with roughly 5,000 business applications. Each application typically has its own team, budget, database, and data structure, and privacy compartmentalization is intentionally built in for Swiss banking compliance. The result is a fragmented landscape where employees must still find the right data, understand it across many structures and languages, trust its quality and timeliness, and confirm licensing/usage rights—tasks that become far harder as the number of applications grows.

What are the four steps UBS uses to build its connected data layer, and what changed after a decade of attempts?

UBS describes a stable four-step method: (1) catalog data assets from the applications, (2) build a conceptual layer called the enterprise data model, (3) map application data to that conceptual layer, and (4) use the mapped layer to enable consistent access and reuse. The change came in execution after years of trying to rely on a single central catalog/tooling approach. UBS observed that multiple catalogs and schemas already existed across the firm, often described in spreadsheets, so it shifted to a federated model rather than centralizing everything.

How does the “data service” enable federated data exposure without forcing every application to change?

A data service is the smallest unit of work: a Kubernetes virtual machine that reads from a source system and translates the data into JSON-LD. Developers supply a configuration mapping that defines how source fields map into UBS’s internal schema (aligned with schema.org-style structured data). This lets applications expose data “for the greater good” while keeping their underlying systems intact. UBS then streams the JSON-LD output through Kafka so downstream consumers can retrieve it.
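
The "configuration over code" split can be sketched as one generic transform function parameterized by a per-application mapping file, so that teams contribute only configuration. The file name, field names, and schema context below are hypothetical.

```python
# Sketch: a generic transform driven entirely by per-application config.
import json
from pathlib import Path

def load_mapping(path: str) -> dict:
    """Load a per-application field mapping from a JSON config file."""
    return json.loads(Path(path).read_text())

def transform(row: dict, mapping: dict, context: str = "https://schema.org") -> dict:
    """Apply a mapping to one source row, producing a JSON-LD document."""
    doc = {"@context": context}
    doc.update({mapping[field]: value for field, value in row.items() if field in mapping})
    return doc

# A new application onboards by shipping a new mapping file; the service
# code above never changes. (Written inline so the sketch is self-contained.)
Path("trade_booking.mapping.json").write_text(
    json.dumps({"trade_ref": "identifier", "desk": "department"})
)
mapping = load_mapping("trade_booking.mapping.json")
print(transform({"trade_ref": "T-981", "desk": "FX"}, mapping))
# -> {'@context': 'https://schema.org', 'identifier': 'T-981', 'department': 'FX'}
```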

What metadata layers does UBS connect to make data usable, not just searchable?

UBS connects multiple layers: application registries (business applications and environments), infrastructure metadata (servers, network connectivity, ports, and related systems), scanners that identify data assets (buckets, relational databases, and other assets), and asset structure metadata. On top, business glossaries provide meaning, including separate glossaries per line of business. This layered metadata is carried through Kafka as JSON-LD and written into a graph database and a data lake.
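
As an illustration of the scanner layer, the following hypothetical record shows what a scanner might emit for a discovered relational table, folding in the infrastructure link, asset structure, and glossary pointer described above. Only the schema.org terms are real vocabulary; the `ex:` terms and IRIs are invented.

```python
# Hypothetical scanner output for one discovered table, as JSON-LD.
scanned_asset = {
    "@context": {
        "@vocab": "https://schema.org/",
        "ex": "https://example.com/ns#",
    },
    "@id": "urn:example:asset/crm-db/customers",
    "@type": "Dataset",
    "name": "customers (table in crm-db)",
    "dateModified": "2024-03-01T04:15:00Z",
    "ex:hostedOn": {"@id": "urn:example:infra/server-0117"},    # infrastructure
    "ex:columns": ["cust_id", "cust_name", "country"],          # asset structure
    "ex:glossaryTerm": {"@id": "urn:example:glossary/client"},  # business meaning
}
```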

Why write into both a graph database and a data lake?

The graph database stores JSON-LD in a connected form so relationships are pre-linked, supporting discovery and connected exploration. The data lake supports analysis over time, where users typically work with thousands of files and can “cobble together” datasets for longitudinal or analytical workloads. Kafka is positioned as the near-real-time backbone to keep updates flowing.
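
The split can be sketched with a single consumer that fans each message out to both stores. The sketch below assumes the kafka-python client and uses an in-process rdflib graph (rdflib 6+ parses JSON-LD natively) as a stand-in for the real graph database, with a date-partitioned directory standing in for the data lake; the topic, broker, and paths are placeholders.

```python
# Minimal dual-write consumer: Kafka -> graph store + date-partitioned lake.
from datetime import datetime, timezone
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python
from rdflib import Graph         # pip install rdflib (>= 6.0 for JSON-LD)

graph = Graph()  # in-process stand-in for the graph database

consumer = KafkaConsumer(
    "catalog.jsonld",                    # placeholder topic
    bootstrap_servers="localhost:9092",  # placeholder broker
    value_deserializer=lambda v: v.decode("utf-8"),
)

for message in consumer:
    payload = message.value
    # (a) Graph side: parse the JSON-LD so relationships are pre-linked
    # and can be explored as a connected graph.
    graph.parse(data=payload, format="json-ld")
    # (b) Lake side: append the raw document under a date partition so
    # time-based analysis can scan whole days of files.
    partition = datetime.now(timezone.utc).strftime("dt=%Y-%m-%d")
    out_dir = Path("lake") / partition
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{message.offset}.json").write_text(payload)
```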

Which open standards anchor UBS’s approach to dataset description, quality, and lineage?

UBS cites DCAT (an open W3C standard) for describing datasets. For data quality and lineage, it references DQV. The broader theme is that open standards reduce adoption friction because they are easier to sell than homegrown formats, and they provide consistent semantics across many teams and applications.
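
For concreteness, here is what a DCAT dataset description with an attached DQV quality measurement can look like in JSON-LD (shown as a Python dict). The vocabulary terms come from the W3C specs; the IRIs, metric, and values are invented for the example.

```python
# Illustrative DCAT + DQV description of one dataset.
dataset_description = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
        "dqv": "http://www.w3.org/ns/dqv#",
    },
    "@id": "urn:example:dataset/client-reference",
    "@type": "dcat:Dataset",
    "dct:title": "Client reference data",
    "dct:publisher": {"@id": "urn:example:org/client-data-team"},
    "dqv:hasQualityMeasurement": {
        "@type": "dqv:QualityMeasurement",
        "dqv:isMeasurementOf": {"@id": "urn:example:metric/completeness"},
        "dqv:value": 0.97,
    },
}
```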

Review Questions

  1. How does UBS’s federated model differ from a centralized data catalog approach, and what problem does that shift solve?
  2. Describe the role of JSON-LD and Kafka in UBS’s pipeline from application data to graph database and data lake.
  3. What business and organizational lessons did UBS learn about adoption, including the distinction between data registration and data governance?

Key Points

  1. UBS’s core goal is a curated data layer that enables employees to find, understand, trust, and legally reuse data across roughly 5,000 business applications.

  2. The approach keeps a four-step catalog-to-enterprise-model workflow but changes execution by federating work instead of building one central catalog team.

  3. A Kubernetes-based “data service” transforms source data into JSON-LD using configuration mappings, allowing applications to expose data without rewriting their systems.

  4. Kafka acts as the streaming backbone, enabling near-real-time propagation of JSON-LD metadata and data into downstream stores.

  5. UBS builds a connected metadata graph by linking application registries, infrastructure metadata, scanned data assets, and business glossaries (including line-of-business glossaries).

  6. Data is written into both a graph database (for connected discovery) and a data lake (for time-based analysis over large file sets).

  7. Adoption is supported by open standards like DCAT and DQV, plus “small wins” such as cost savings from identifying archive tables while the long-term program matures.

Highlights

UBS’s “data service” is positioned as the smallest unit of work: a Kubernetes virtual machine that converts source data into JSON-LD via configuration mappings, enabling federated exposure.
The system connects not only datasets but also the surrounding context—application environments, infrastructure, scanned assets, and business glossaries—so data becomes interpretable and governable.
Writing JSON-LD into both a graph database and a data lake balances connected discovery with large-scale, time-based analytics.
UBS credits open standards (DCAT and DQV) and team topology changes as key levers for scaling adoption across decentralized teams.
Funding momentum comes from incremental “small wins,” especially cost savings from discovering and archiving unused or archive tables.

Topics

  • Connected Data Catalog
  • Federated Data Services
  • JSON-LD
  • Kafka Metadata Pipeline
  • Graph Database
  • Data Governance

Mentioned

  • Greg WBY
  • UBS
  • DCAT
  • DQV
  • JSON-LD
  • Kafka
  • LLM
  • ROI
  • CFA