Dimensional Modeling for Machine Learning

April 16, 2025

Kimball’s dimensional modeling continues to play a critical role in machine learning and data science outcomes. The Kimball Group’s 10 Essential Rules of Dimensional Modeling remain a framework widely applied in modern data workflows.

In a recent WhereScape-hosted webinar, Dave Langer, founder of Dave on Data and former head of BI at Microsoft, and Patrick O’Halloran, Senior Solutions Architect at WhereScape, unpacked how this decades-old methodology remains one of the fastest ways to build clean, actionable pipelines for today’s AI and ML projects. With over 125 data professionals in attendance, the session focused on practical strategies and real use cases that connect dimensional modeling with modern machine learning.

This recap highlights the most valuable insights from the webinar—paired with technical guidance and applied examples to help you rethink your approach to data modeling in the context of ML.

Why Dimensional Modeling Still Matters for Machine Learning


Dimensional modeling organizes data in ways that are intuitive, reusable, and aligned with business logic. It remains essential to modern analytics because it supports consistency, fast querying, and schema reuse across domains. According to a TDWI whitepaper, dimensional models continue to serve as the foundation for scalable, governed architectures in the age of cloud and machine learning.

Though initially designed for OLAP and dashboards, this approach adapts well to the demands of machine learning.

“The Kimball approach was all about making data easy to use and understand—not just for IT teams, but for the business. That still matters today.” – Dave Langer

Dimensional models retain transaction-level data, giving machine learning algorithms access to the granular inputs they require.

“When I worked on the Xbox supply chain at Microsoft, we had a Kimball-style warehouse. Once I started learning machine learning in grad school, it just clicked—I realized we could use the same data to predict outcomes, not just report on them.” – Dave Langer

Technical Benefits of Dimensional Modeling for Machine Learning

Dimensional models create a solid foundation for machine learning by offering:

  • Preserved Grain: Maintain data at the transactional level for deeper behavioral insights.
  • Feature-Rich Dimensions: Use dimensions like dim_date to access pre-built features such as fiscal weeks and holiday indicators (see the query sketch after this list).
  • Accumulating Snapshots: Leverage tables that track each phase of a business process for use in time-series and outcome-driven models.
  • High Data Quality: Pull from curated, validated sources that reduce the need for manual data cleaning.
  • Schema Intuition: Speed up data exploration and collaboration through understandable, business-aligned schemas.
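
To make the feature-rich-dimensions point concrete, here is a minimal Python sketch of pulling transaction-grain training data by joining a fact table to dim_date. The table and column names (fact_orders, fiscal_week, is_holiday) and the connection string are illustrative assumptions, not specifics from the webinar:

    # Pull ML-ready features by joining a fact table to dim_date.
    # Table, column, and connection details below are hypothetical.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host/warehouse")  # hypothetical DSN

    query = """
        SELECT f.customer_key,
               f.order_amount,
               d.fiscal_week,      -- pre-built feature from dim_date
               d.is_holiday        -- pre-built feature from dim_date
        FROM   fact_orders f
        JOIN   dim_date   d ON f.order_date_key = d.date_key
    """
    features = pd.read_sql(query, engine)  # one row per transaction: grain preserved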

Use Cases: Dimensional Modeling in Machine Learning Projects

1. Fraud Detection

A U.S. state government trained a model on dimensional data to identify fraudulent unemployment claims. The model surfaced subtle patterns that manual review could not realistically catch.

2. Student Retention

Universities built clustering models using existing student dimensions and fact tables to predict which students were most at risk of dropping out.

3. Supply Chain Forecasting

At Microsoft, the Xbox team converted their Kimball-based warehouse into a predictive engine to improve demand forecasting and mitigate inventory issues.

“Our data warehouse became our crystal ball.” – Dave Langer

Where Automation Fits In


WhereScape automates the entire modeling workflow, allowing teams to:

  • Profile source systems quickly
  • Build conceptual, logical, and physical models in minutes
  • Generate code for DDL, ETL, and orchestration automatically
  • Adapt models rapidly as ML needs evolve

“Whether you’re modifying a dimension, building prototypes, or adapting your model for new ML flows, WhereScape lets you do it without slowing down.” – Patrick O’Halloran

Automation not only accelerates delivery but ensures consistency, governance, and traceability across machine learning pipelines.

Best Practices for Building ML-Ready Dimensional Models

Here’s what Dave and Patrick recommend for aligning your dimensional models with machine learning goals:

  1. Start with Business KPIs
    Look at what matters to leadership—bonus-tied metrics are a great place to start—and model around them.
  2. Use What You Already Have
    If your warehouse already has things like dim_product, fact_orders, or dim_customer, use them. They’ve likely been battle-tested for quality and are easier to access than messy data lakes.
  3. Flatten External Data
    When using JSON or semi-structured data, flatten it into tabular form before feeding it into ML models (a sketch follows this list). WhereScape can automate this transformation as part of your pipeline.
  4. Think Features First
    Build dimensions and facts with future machine learning use cases in mind. What behaviors or timelines will you want to predict?
  5. Avoid Rework Between Teams
    When possible, reuse transformation logic between your data warehouse ETL and ML inference pipelines to ensure consistency and reduce duplication.
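
To illustrate best practice 3, here is a minimal sketch of flattening nested JSON into a tabular DataFrame with pandas; the payload shape is invented for the example:

    # Flatten semi-structured JSON into tabular form before modeling.
    import pandas as pd

    raw = [
        {"order_id": 1, "customer": {"id": 42, "region": "West"}, "amount": 19.99},
        {"order_id": 2, "customer": {"id": 7, "region": "East"}, "amount": 5.50},
    ]

    # json_normalize expands nested objects into flat, dot-named columns
    flat = pd.json_normalize(raw)
    print(list(flat.columns))  # ['order_id', 'amount', 'customer.id', 'customer.region']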

Q&A Highlights: Dimensional Modeling for Machine Learning

Attendees had great questions throughout the webinar—here are a few highlights:

Q: Why use a Kimball model instead of data lakes or raw JSON for ML?
A: Convenience and speed. A Kimball warehouse often includes features like dim_date, which can contain 100+ columns that would otherwise have to be manually built. It’s also clean, curated, and documented.

Q: How do I move data from my data warehouse into a machine learning model?
A: Query the warehouse using SQL, pull data into Python or R as a DataFrame, and feed that table into your ML algorithm. “Most ML models work best with tabular data,” Dave noted.
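
Dave’s answer translates to just a few lines. A minimal sketch, assuming a warehouse view named ml_training_view with a label column (both names are hypothetical) and scikit-learn as the ML library:

    # Query the warehouse -> DataFrame -> ML algorithm.
    import pandas as pd
    from sqlalchemy import create_engine
    from sklearn.ensemble import RandomForestClassifier

    engine = create_engine("postgresql://user:pass@host/warehouse")  # hypothetical DSN
    df = pd.read_sql("SELECT * FROM ml_training_view", engine)       # hypothetical view

    X = df.drop(columns=["label"])  # feature columns sourced from dims and facts
    y = df["label"]                 # the outcome to predict

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)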

Q: How can I validate machine learning results when I don’t have a ‘golden dataset’?
A: Unlike traditional BI validation, ML models are validated statistically—through train/test splits, cross-validation, or ROC/AUC scoring. There’s often no direct tie-out, but model accuracy metrics guide you.
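
A minimal sketch of that statistical validation, using synthetic data and scikit-learn so it runs standalone:

    # Validate without a golden dataset: k-fold cross-validation scored by ROC AUC.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X, y,
        cv=5, scoring="roc_auc",  # 5 folds, area under the ROC curve per fold
    )
    print(f"ROC AUC per fold: {scores.round(3)}, mean: {scores.mean():.3f}")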

Q: What if I’m stuck with JSON or semi-structured data?
A: You’ll need to flatten that data into tabular form before modeling. If you already transform JSON into relational tables for your warehouse, reuse that logic for your ML pipeline.

Q: Is Kimball too rigid compared to Data Vault?
A: Not necessarily. Many Data Vault implementations actually create dimensional views on top of raw vault structures. Dave’s take: “If a clean star schema exists—it’s your best starting point.”

Final Takeaway: It’s Not Either/Or—It’s Both

Dimensional modeling may be old-school, but when paired with automation and machine learning, it becomes a powerful tool for modern analytics. As Dave put it:

“ML and AI are just predictive models. Dimensional modeling makes your data easier to use, and that’s never going out of style.”

Turn Insight Into Action

The conversation between Dave Langer and Patrick O’Halloran made one thing clear: dimensional modeling is far from outdated—it’s foundational to successful machine learning initiatives. If you’re relying solely on data lakes or scattered sources, you’re likely spending more time wrangling data than building models that drive value.

In the session, attendees walked away with a clear understanding of how dimensional models support feature engineering, improve data quality, and accelerate machine learning outcomes. They also saw how WhereScape’s automation platform reduces complexity across the entire data lifecycle—from ingestion to deployment.

If you missed the live event, now’s your chance to catch up.

Watch the full webinar on demand to hear firsthand how dimensional modeling and automation come together to support faster, more reliable data science. Whether you’re just getting started or scaling existing efforts, the insights are immediately actionable.

Want to see WhereScape in action?

Schedule a personalized demo to explore how our platform can simplify dimensional modeling, streamline ML data prep, and drive greater business impact—no matter your data architecture.

About the Contributors

Dave Langer is the founder of Dave on Data, a data science education and consulting firm. With a background spanning software engineering, enterprise architecture, and hands-on analytics, Dave has helped thousands of professionals build real-world machine learning capabilities. He previously led BI and analytics teams at Microsoft, including the Xbox supply chain division. Learn more at daveondata.com

Patrick O’Halloran is a Senior Solutions Architect at WhereScape, where he works with organizations across industries to implement automated data infrastructure. With decades of experience in data warehousing and analytics, Patrick specializes in translating business needs into technical solutions using automation.

FAQ: Dimensional Modeling for Machine Learning

Q: Can I use a dimensional model if my data is still messy or semi-structured?

A: Yes, but you’ll first need to normalize and flatten the data. JSON, XML, and other semi-structured formats need to be transformed into tabular structures. This preprocessing ensures your model can benefit from the schema clarity of a dimensional approach.

Q: What’s the difference between a feature and a dimension in machine learning?

A: A feature is an individual measurable input to a machine learning model—usually a column in your dataset. A dimension is a structured way to organize related features, often representing business entities like time, products, or customers. You can derive multiple features from a single dimension (e.g., fiscal week, holiday flag from dim_date).
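
A minimal pandas sketch of that distinction, deriving three features from a single date dimension (the dim_date contents here are simulated):

    # One dimension, several derived features.
    import pandas as pd

    dim_date = pd.DataFrame({"date": pd.date_range("2025-01-01", periods=7)})
    dim_date["day_of_week"] = dim_date["date"].dt.day_name()
    dim_date["fiscal_week"] = dim_date["date"].dt.isocalendar().week
    dim_date["is_holiday"] = dim_date["date"].isin(
        pd.to_datetime(["2025-01-01"])  # e.g., New Year's Day
    )
    print(dim_date.head(3))  # each derived column is a candidate ML feature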

Q: Do I need to rebuild my warehouse to support ML?

A: Not necessarily. If your current warehouse follows dimensional modeling principles and includes validated, curated data, you can often repurpose it for ML by extracting tabular datasets directly. You may only need to adjust for additional feature engineering.

Q: Is dimensional modeling only useful for structured data?

A: Primarily, yes. Dimensional models are most effective with structured, relational data. However, semi-structured data can still be included once it’s flattened and transformed appropriately.

Q: How do I ensure the same data logic is used between my reports and ML models?

A: By reusing transformation logic—such as the SQL or ETL code used to build your dimensions and facts—you reduce discrepancies between reporting outputs and ML inputs. Automation tools like WhereScape can help ensure this consistency.
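
One way to put that reuse into practice is a single transformation function shared by both paths. A minimal sketch; the function and column names are illustrative:

    # One source of truth for derived columns, used by ETL and inference alike.
    import numpy as np
    import pandas as pd

    def add_order_features(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["is_large_order"] = out["amount"] > 100  # same rule everywhere
        out["log_amount"] = np.log1p(out["amount"])  # same scaling everywhere
        return out

    # The batch ETL path (building facts) and the inference path (scoring new
    # records) both call the identical function, so reports and models never drift.
    print(add_order_features(pd.DataFrame({"amount": [12.0, 250.0]})))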
