Automating Data Vault in Databricks | WhereScape Recap

Automating Data Vault in Databricks can reduce time-to-value by up to 70%, which is why we hosted a recent WhereScape webinar to show exactly how.
At WhereScape, we believe modern data teams shouldn’t have to choose between agility and governance. That’s why we hosted a live webinar, “Building a Data Vault in Databricks with Streaming Tables,” designed to demonstrate how automation with WhereScape RED and 3D unlocks scalable, real-time Data Vault architectures within Databricks.
This session, led by Endika Pascual, provided data professionals—including data architects, engineers, BI analysts, DevOps teams, and data warehouse managers—with a firsthand look at metadata-driven modeling, automation, and ingestion pipelines. The webinar featured a live technical walkthrough and a robust Q&A segment with real-world questions from the audience.
Introduction: Streaming + Automation for Modern Data Vaults
In this session, we showcased how to create and manage a Data Vault using WhereScape RED and 3D within the Databricks environment. From metadata modeling to deployment and streaming data ingestion, the webinar demonstrated a fully automated workflow.
70% Faster Time-to-Value: How WhereScape RED and 3D Streamline Your Data Vault Architecture

Endika introduced the foundation of our automation ecosystem: WhereScape RED and WhereScape 3D.
- WhereScape RED is our low-code automation platform for building, deploying, and orchestrating data warehouses and vaults. It generates all DDL and DML code automatically, manages version control, integrates scheduling, and supports multiple environments (Dev, UAT, Prod). It uses metadata templates to dynamically generate optimized code—such as Spark SQL or Python—for your target platform, like Databricks, Snowflake, Synapse, Microsoft Fabric, or Redshift.
- WhereScape 3D enables rapid discovery, profiling, and modeling of data across any JDBC/ODBC source, including flat files and APIs. It captures and maps metadata through discovery queries, applies user-defined model conversion rules, and lets data engineers visually design architectures such as Data Vault 2.0, 3NF, and star schemas. 3D is technology-agnostic—no database is required until deployment—and generates a metadata-rich XML for seamless handoff to RED.
Both tools store metadata in PostgreSQL repositories, ensuring complete transparency. All transformations, logic, and queries are fully exposed and editable—there are no black-box operations. Users remain in full control of how their pipelines are built and evolve.
To further accelerate and simplify Data Vault 2.0 adoption, teams can also leverage WhereScape Data Vault Express (DVE). DVE builds on RED and 3D to streamline model creation with wizard-driven templates, automate the generation of hubs, links, and satellites, and unify batch and real-time processing.
As a purpose-built tool for Data Vaults, DVE dramatically reduces complexity, supports metadata consistency across layers, and enables agile, insight-ready analytics with minimal rework.
Building Connections and Profiling Sources
Endika demonstrated how to establish connections using JDBC/ODBC, APIs, or flat files. WhereScape then executes SQL discovery methods to pull metadata. The profiling tools flag data quality issues, such as sparsely populated attributes. As Endika explained, “You’ll see your region is not very well populated, and we can use that to flag attributes for developers.”
Modeling and Automating Your Data Vault in 3D

With WhereScape 3D, users tag attributes by volatility level to generate hubs, links, and satellites:
- Hubs store business keys.
- Links represent relationships.
- Satellites store historical attributes.
Users can export the model to RED with one click for immediate deployment.
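To make the hub/link/satellite split concrete, here is a minimal Python sketch of the kind of logic the generated code applies to each source row. This is an illustration only, not WhereScape's generated code, and the column names (`customer_id`, `order_id`, `status`, `amount`) are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*values):
    """Data Vault 2.0-style hash key: normalize the business key(s),
    concatenate with a delimiter, and hash."""
    normalized = "||".join(str(v).strip().upper() for v in values)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def split_row(row, load_dts=None):
    """Split one source row into hub, link, and satellite records.
    All column names here are illustrative."""
    load_dts = load_dts or datetime.now(timezone.utc).isoformat()
    # Hubs store business keys
    hub_customer = {"customer_hk": hash_key(row["customer_id"]),
                    "customer_id": row["customer_id"], "load_dts": load_dts}
    hub_order = {"order_hk": hash_key(row["order_id"]),
                 "order_id": row["order_id"], "load_dts": load_dts}
    # Links represent relationships between hubs
    link = {"customer_order_hk": hash_key(row["customer_id"], row["order_id"]),
            "customer_hk": hub_customer["customer_hk"],
            "order_hk": hub_order["order_hk"], "load_dts": load_dts}
    # Satellites store the descriptive, historized attributes
    sat = {"order_hk": hub_order["order_hk"],
           "status": row["status"], "amount": row["amount"],
           "load_dts": load_dts}
    return hub_customer, hub_order, link, sat

row = {"customer_id": "C001", "order_id": "O42", "status": "open", "amount": 99.5}
hub_c, hub_o, link, sat = split_row(row)
```

The key design point is that hubs and links carry only keys and hashes, so the same splitting logic can be regenerated for any target platform from the tagged metadata.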
Code Generation for Automating Data Vault in Databricks
WhereScape RED transforms the 3D model into executable Databricks notebooks. It generates code to:
- Create streaming tables
- Upload and execute notebooks
- Manage ingestion pipelines
During the live demo, we showed how RED builds load, stage, and hub tables automatically, complete with hashing logic and change tracking.
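The hashing and change-tracking logic mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration of the general "hashdiff" pattern used in Data Vault 2.0, not the code RED emits; attribute names are hypothetical.

```python
import hashlib

def hash_diff(record, attrs):
    """Hash of the descriptive attributes; a changed hash signals a new version."""
    payload = "||".join(str(record.get(a, "")).strip().upper() for a in attrs)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def detect_change(incoming, current_sat, attrs):
    """Insert-only change tracking: stage the incoming record only if its
    hashdiff differs from the current satellite version (or none exists)."""
    new_hd = hash_diff(incoming, attrs)
    if current_sat is None or current_sat["hash_diff"] != new_hd:
        return {**{a: incoming.get(a) for a in attrs}, "hash_diff": new_hd}
    return None  # unchanged: nothing to load

attrs = ["status", "amount"]
current = {"status": "open", "amount": 100,
           "hash_diff": hash_diff({"status": "open", "amount": 100}, attrs)}
unchanged = detect_change({"status": "open", "amount": 100}, current, attrs)
changed = detect_change({"status": "shipped", "amount": 100}, current, attrs)
```

Comparing one hash instead of every column keeps change detection cheap even on wide tables.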
Streaming Data Vault Pipelines in Databricks
Streaming tables are essential to automating Data Vault in Databricks, enabling near-instant ingestion from ADLS with zero manual scripting.
With Databricks, RED creates streaming pipelines that ingest files from Azure Data Lake Storage (ADLS) as soon as they land. “The pipeline pushes the information through the whole lineage—as long as the whole lineage is included,” Endika noted. Pipelines can run continuously or on a schedule.
Scheduling, JSON Parsing, and Multi-Env Deployment
The session also covered:
- Scheduling options like Azkaban (included), Airflow, Data Factory, and Databricks Workflows
- JSON Parsing using automated flattening via templates
- API Support with full control over authentication, including OAuth
- Multi-environment promotion for Dev, UAT, and Production
Documentation & Visibility
WhereScape RED automatically generates full documentation—lineage, rules, scripts, and annotations—allowing teams to share metadata and logic with business users and developers alike.
Q&A Highlights about Automating Data Vault in Databricks
Q: How do you handle performance with large tables, such as 350 million records and 260 columns?
Endika shared, “Performance depends on your cluster configuration. You can set compute limits to meet processing requirements.”
Q: What happens if multiple business keys exist for the same entity?
“We hash business keys—regardless of the source—and store them in a single hub. The hash ensures uniqueness.”
Q: How does the system handle a value flipping back and forth?
“The satellite logs both versions with timestamps. The latest version becomes current.”
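The flip-flop behavior described in this answer follows from the satellite being insert-only: each change is appended with a load timestamp, and "current" simply means the latest row. A minimal sketch, with hypothetical keys and attributes:

```python
class Satellite:
    """Insert-only satellite: every change is appended with a load timestamp;
    the most recent row per hub key is the current version. Sketch only."""
    def __init__(self):
        self.rows = []

    def load(self, hub_key, attrs, load_dts):
        current = self.current(hub_key)
        if current is None or current["attrs"] != attrs:
            self.rows.append({"hub_key": hub_key, "attrs": attrs,
                              "load_dts": load_dts})

    def current(self, hub_key):
        versions = [r for r in self.rows if r["hub_key"] == hub_key]
        return max(versions, key=lambda r: r["load_dts"], default=None)

sat = Satellite()
sat.load("HK1", {"status": "open"}, "2025-01-01T00:00:00Z")
sat.load("HK1", {"status": "closed"}, "2025-01-02T00:00:00Z")
# the value flips back: a third version is appended, nothing is overwritten
sat.load("HK1", {"status": "open"}, "2025-01-03T00:00:00Z")
```

Because nothing is updated in place, the full back-and-forth history stays queryable.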
Q: Is this compatible with Microsoft Fabric?
“Fabric doesn’t support streaming tables,” Endika noted, “but WhereScape integrates through Data Factory for similar functionality.”
Q: Can you build fact tables from the Data Vault?
“Yes. We recommend building fact views using satellites and links. That’s in line with Data Vault 2.0 principles.”
Q: How do you parse JSON and handle API-based ingestion?
“We use a JSON parser to flatten nested structures with commands like LATERAL VIEW EXPLODE. APIs can be configured for GET/POST with full auth control.”
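Spark SQL's `LATERAL VIEW EXPLODE` turns each element of a nested array into its own row while keeping the parent fields. The same flattening effect can be sketched in plain Python; the field names (`order_id`, `items`, `sku`, `qty`) are illustrative only.

```python
import json

def explode(records, array_field):
    """Expand a nested array into one row per element, keeping parent
    fields (the effect of Spark SQL's LATERAL VIEW EXPLODE)."""
    for rec in records:
        parent = {k: v for k, v in rec.items() if k != array_field}
        for item in rec.get(array_field, []):
            # prefix child keys so they don't collide with parent keys
            yield {**parent, **{f"{array_field}_{k}": v for k, v in item.items()}}

raw = json.loads("""[
  {"order_id": "O42", "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}
]""")
flat = list(explode(raw, "items"))
```

One source document with a two-element array becomes two flat rows, each still carrying its `order_id`.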
Q: Does the pipeline detect changes automatically?
“Yes. Streaming pipelines support real-time and batch ingestion and only process new or changed records.”
Conclusion & Takeaways: Automating Data Vault in Databricks
This webinar showed how WhereScape empowers teams to automate their Data Vault development using Databricks. From metadata-driven modeling to real-time ingestion and cloud deployment, WhereScape RED and 3D simplify complex data workflows.
Key Takeaways:
- Automate Data Vaults with confidence
- Stream data in real time with Databricks
- Deliver transparent, editable metadata and documentation
- Parse JSON and integrate APIs easily
Get Started with WhereScape: Watch, Join, and Explore
Whether you’re starting from scratch or modernizing legacy systems, automating Data Vault in Databricks with WhereScape ensures speed, consistency, and agility. If you missed the live session or want to revisit the insights, the full webinar recording is available to watch on demand now.
Don’t miss our upcoming session, “From Data Discovery to Deployment: Automating Star Schema Modeling in Microsoft Fabric,” on May 1st. Register Here!
Ready to see how WhereScape can streamline your data automation journey? Book a personalized demo with our team today and discover how RED and 3D can accelerate your time-to-value.
About the Presenter
Endika Pascual is a Principal Solutions Architect at WhereScape with deep expertise in Big Data, Business Intelligence, and Data Analytics. He specializes in designing modern data infrastructures and cloud-native solutions. Endika thrives in agile environments and is passionate about helping clients maximize the value of their data.