Public UBS Knowledge Graph – Building a Connected Data Catalog
Based on The Knowledge Graph Conference's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
UBS is building a connected data catalog at massive scale by shifting from tool-centric data management to a capability-driven model: catalog data assets, map them to a shared enterprise data model, and then federate the work through small, reusable “data services” that translate each application’s data into a common semantic format. The payoff is a curated data layer where tens of thousands of employees can find, understand, trust, and legally reuse data—without forcing every application to change its underlying systems.
The scale is the central problem. UBS runs roughly 5,000 business applications across a global workforce of about 70,000 employees. Each application typically comes with its own budget, team, database, and data structure—privacy compartmentalization included, which matters under Swiss banking rules. Integration becomes manageable when connecting a few systems, but it turns into a “major headache” when the number of applications balloons. UBS’s goal is to provide a consistent, well-described data layer for all employees: discover the right datasets, interpret them despite thousands of structures and languages, verify quality and timeliness, and ensure licensing and usage rights.
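The combinatorics behind that "major headache" can be made concrete: point-to-point integration grows quadratically with the number of applications, while connecting each application once to a shared data layer grows linearly. A small illustration (the 5,000-application figure is from the talk; the comparison itself is mine):

```python
def point_to_point_links(n_apps: int) -> int:
    """Pairwise integrations if every application connects
    directly to every other one: n * (n - 1) / 2."""
    return n_apps * (n_apps - 1) // 2

def hub_links(n_apps: int) -> int:
    """Integrations if every application instead connects
    once to a shared, curated data layer."""
    return n_apps

# At UBS scale (~5,000 applications):
print(point_to_point_links(5000))  # 12,497,500 potential pairwise links
print(hub_links(5000))             # 5,000 links to the shared layer
```

This is why a handful of systems can be wired together directly, while thousands cannot.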
The approach relies on a sequence of steps that has been stable for about a decade: catalog assets, build a conceptual enterprise data model, and map application data to that conceptual layer. What changed after years of trying "one big catalog" is the recognition that multiple catalogs and schemas already exist, often scattered across spreadsheets and local descriptions. Instead of centralizing governance and cataloging into a single monolithic team, UBS federates the work.
At the technical core is a "data service," described as the smallest unit of work: a small Kubernetes-hosted workload that reads from a source system and transforms the output into JSON-LD (linked data). That transformation is driven by configuration mappings, letting developers expose data without rewriting their systems. UBS then pipes the resulting linked data through Kafka, with roughly 80 source systems feeding the pipeline. The semantic layer is anchored in an internal schema aligned with schema.org-style structured data.
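A minimal sketch of what such a configuration-driven transformation could look like. The field names, mapping format, and context URL below are invented for illustration — the talk does not specify UBS's actual configuration schema:

```python
import json

# Hypothetical mapping config: it renames source columns to
# shared-vocabulary terms and declares the JSON-LD context.
MAPPING = {
    "@context": "https://example.org/internal-schema",  # placeholder context
    "type": "DataAsset",
    "fields": {  # source column -> shared-model property
        "tbl_name": "name",
        "owner_id": "accountablePerson",
        "last_load": "dateModified",
    },
}

def to_jsonld(record: dict, mapping: dict = MAPPING) -> dict:
    """Translate one source record into JSON-LD using only the
    mapping config -- the source system itself stays unchanged."""
    doc = {"@context": mapping["@context"], "@type": mapping["type"]}
    for src_field, target_prop in mapping["fields"].items():
        if src_field in record:
            doc[target_prop] = record[src_field]
    return doc

row = {"tbl_name": "trades_eu", "owner_id": "u123456", "last_load": "2024-05-01"}
print(json.dumps(to_jsonld(row)))
```

The point of the pattern is that adding a new source means writing a mapping, not modifying the source application.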
UBS connects not just application data but also the metadata needed to understand it end-to-end. Application registries describe business applications and their environments; infrastructure systems describe servers, networks, ports, and connectivity; scanners identify data assets such as buckets and relational databases; and asset-structure metadata is integrated from other systems. At the top sit business glossaries, including separate glossaries per line of business. All of this is carried through Kafka as JSON-LD and written into both a graph database and a data lake. The graph database supports connected, pre-linked exploration, while the data lake supports time-based analysis over large file sets.
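One way these metadata layers become a connected graph is through JSON-LD `@id` cross-references, which the graph database can pre-join for exploration. The identifiers and property names below are invented examples, not UBS's actual model:

```python
# Illustrative metadata documents from three layers, linked by "@id".
application = {
    "@id": "app:trading-eu",
    "@type": "BusinessApplication",        # application registry layer
    "runsOn": {"@id": "infra:server-042"}, # infrastructure layer
}
asset = {
    "@id": "asset:trades-db",
    "@type": "RelationalDatabase",             # found by a scanner
    "hostedBy": {"@id": "app:trading-eu"},     # link to the application
    "describedBy": {"@id": "glossary:trade"},  # link to the glossary
}
glossary_term = {
    "@id": "glossary:trade",
    "@type": "GlossaryTerm",  # business-glossary layer
    "name": "Trade",
}

def linked_ids(doc: dict) -> set:
    """Collect the @id values a document points at."""
    return {v["@id"] for v in doc.values()
            if isinstance(v, dict) and "@id" in v}

print(linked_ids(asset))  # the asset links to its application and glossary term
```

Following these links end-to-end is what lets an employee go from a glossary term to the physical asset and the infrastructure it runs on.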
The model is built around capabilities—find, understand, trust, use, and reuse—supported by open standards. UBS cites DCAT for dataset description, DQV for data quality, and PROV for lineage. Lessons learned emphasize business value first (including separating data registration from data governance), using open standards to reduce adoption friction, and aligning team topology with the federated architecture. The rollout is described as a long game: about three years to reach the current state, with additional years expected to mature and expand scanning and onboarding. In follow-up Q&A, UBS frames "small wins" as cost savings (like identifying and archiving archive tables) to keep funding moving while the broader revenue potential is deferred to later stages.
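A hand-written JSON-LD sketch of how those three W3C vocabularies can combine on a single dataset record: DCAT describes the dataset, DQV attaches a quality measurement, and PROV records lineage. The dataset, metric, and `ex:` identifiers are invented examples:

```python
import json

dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
        "dqv": "http://www.w3.org/ns/dqv#",
        "prov": "http://www.w3.org/ns/prov#",
    },
    "@id": "ex:trades-eu-daily",
    "@type": "dcat:Dataset",                  # DCAT: what the dataset is
    "dct:title": "EU trades, daily snapshot",
    "dqv:hasQualityMeasurement": {            # DQV: how good it is
        "@type": "dqv:QualityMeasurement",
        "dqv:isMeasurementOf": {"@id": "ex:completenessMetric"},
        "dqv:value": 0.997,
    },
    "prov:wasDerivedFrom": {                  # PROV: where it came from
        "@id": "ex:trading-system-raw-feed"
    },
}

print(json.dumps(dataset, indent=2))
```

Using shared vocabularies like these is precisely the "reduce adoption friction" lesson: consumers can interpret any team's dataset record without learning a bespoke schema.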
Cornell Notes
UBS is tackling data integration and data management at extreme scale by building a connected data catalog that employees can trust and reuse. With about 5,000 business applications and ~70,000 employees, the challenge is that each application has its own database, structure, and language—making integration and governance difficult. UBS keeps the core cataloging approach (catalog assets, create an enterprise data model, map to it) but changes execution: instead of one central team, it federates work using small Kubernetes-based “data services” that transform source data into JSON-LD and publish it via Kafka. Metadata and semantics flow into both a graph database (for connected discovery) and a data lake (for time-based analysis). Open standards like DCAT and DQV underpin dataset description, quality, and lineage, while business value and team structure drive adoption.
Why does UBS treat “data management at scale” as a different problem than integrating a few systems?
What steps does UBS use to build its connected data layer, and what changed after a decade of attempts?
How does the “data service” enable federated data exposure without forcing every application to change?
What metadata layers does UBS connect to make data usable, not just searchable?
Why write into both a graph database and a data lake?
Which open standards anchor UBS’s approach to dataset description, quality, and lineage?
Review Questions
- How does UBS’s federated model differ from a centralized data catalog approach, and what problem does that shift solve?
- Describe the role of JSON-LD and Kafka in UBS’s pipeline from application data to graph database and data lake.
- What business and organizational lessons did UBS learn about adoption, including the distinction between data registration and data governance?
Key Points
1. UBS’s core goal is a curated data layer that enables employees to find, understand, trust, and legally reuse data across roughly 5,000 business applications.
2. The approach keeps the long-standing catalog-to-enterprise-model workflow but changes execution by federating work instead of building one central catalog team.
3. A Kubernetes-based “data service” transforms source data into JSON-LD using configuration mappings, allowing applications to expose data without rewriting their systems.
4. Kafka acts as the streaming backbone, enabling near-real-time propagation of JSON-LD metadata and data into downstream stores.
5. UBS builds a connected metadata graph by linking application registries, infrastructure metadata, scanned data assets, and business glossaries (including line-of-business glossaries).
6. Data is written into both a graph database (for connected discovery) and a data lake (for time-based analysis over large file sets).
7. Adoption is supported by open standards like DCAT and DQV, plus “small wins” such as cost savings from identifying archive tables while the long-term program matures.