On Demand | Something Old, Something New: Dimensional Modeling for Machine Learning

Is Kimball’s dimensional modeling outdated—or is it the missing link to better machine learning and data science outcomes?

In this on-demand session, Patrick O’Halloran of WhereScape and Dave Langer, founder of Dave on Data, explore how dimensional modeling remains a powerful, relevant approach for preparing the data that trains and feeds machine learning models.

What You’ll Learn:

✅ Why dimensional modeling is still a practical choice for machine learning and data science-ready data
✅ How automation accelerates the design and delivery of dimensional models
✅ Real-world examples of using dimensional modeling to tackle data science challenges

See how WhereScape’s automation platform simplifies dimensional modeling—enabling faster delivery, consistent precision, and built-in data governance.

Whether you’re new to machine learning or optimizing existing workflows, discover how automation and proven modeling techniques can simplify complexity and drive better results.

🎥 Watch Now On-Demand!

Transcript

Patrick O'Halloran: Hey everybody, my name is Patrick O'Halloran and I'm with WhereScape. It's the top of the hour; thank you all for joining. I'm going to give us about two minutes to let a few more people trickle in, and then we'll get started. I've been looking forward to this. Let me go ahead and share my screen. There we go. For those of you who just popped in: again, my name is Patrick O'Halloran, and I'll give us about 60 to 90 seconds before we start. This is being recorded, so if you have to leave in the middle, or if you can't be here at all, we'll send out a recording as well. There's a Q&A button at the bottom, and I've got that window open. If you have any questions, feel free to type them in; we'll either answer them as they fit into the conversation, or at the end, when there will also be a Q&A session. I got a little popup saying someone has requested that captions be turned on. I'm still amazed by voice-to-text; that's still just magic to me. I was talking with Dave about this a while ago: I used to understand technology. I took classes in microprocessor design and assembly language and knew what was happening, and now it's all just magic.

Dave Langer: Don't say that.

Patrick: I'll give us about 30 more seconds. You know what, it's two minutes after the hour; if someone walks in a little late, they'll just have to catch up. All right, thank you for joining us. I'm here with Dave Langer, the data science guy and founder of Dave on Data. He has taught hundreds of people in person and thousands remotely on all aspects of data and technology. We're going to talk today about dimensional modeling and machine learning. If I can advance the slide: we're not doing too many, and we're done with the first bullet already, so we're ahead of schedule. I'm going to spend a minute or two talking about WhereScape, who we are and what we do, since we're sponsoring this I get to do that. Then: why dimensional modeling? Why does it exist? I'm assuming a lot of you already know about it, but it never hurts to go back to the fundamentals and talk about what dimensional modeling offers you. We'll talk about how it helps with machine learning, look at some real-world examples, and then, suppose you decide that yes, you do want to use dimensional modeling to help with your machine learning, because you either have existing data or can build something fairly easily on top of your existing data foundation: how do you do that? And then we'll have Q&A at the end. I'm seeing Pennsylvania, Jamaica, Central Florida, Nebraska. Cool, people from all over the place.

WhereScape is a software company. We started 25 years ago in New Zealand as a consulting company building data warehouses, realized we were doing a lot of repetitive tasks, and built tools to automate them. The tools we built became very productive and very powerful, and people started asking if they could buy them, so WhereScape stopped being a consulting company and became a software company. We have about 1,200 customers or more around the world, in pretty much every industry you can think of, doing dimensional modeling, Inmon, Data Vault, data lakes, and data lakehouses, using pretty much every target platform you can imagine. And this pretty much describes what we do.
We have a design tool that helps you profile source systems, build a conceptual and logical model, tie your source systems to that logical model, and then push it out to our builder tool, where you can build a physical model: a lot of DDL, a lot of DML, a lot of Python or PowerShell scripts that generate and run what I call your data infrastructure. Just a quick example of what we build: you've got data sources here, ODBC connections, JSON files, Parquet files of some sort, and you've got data analysts over here, among whom I would now include AI engines and machine learning. It used to be that the limiting factor in data warehousing was how much data you could consume; then it seemed to be the number of data sources you could consume; now it seems to be who's using the data. The struggle with data warehousing now, with building a data foundation, is getting data into the right format for your users, who are not only dashboards and reports but machines as well. Then there's what we call a NASCAR slide, lots of very happy customers, and a slide for the very end with Q&A. With that, I'll stop my screen sharing. Dave, I just did a very quick introduction for you. You're based in Bozeman, Montana?

Dave: Bozeman, yes.

Patrick: Another B-town, same general area of the Rockies. We've talked before; you said you worked at Microsoft, and you have Dave on Data now. How would you describe what it is you're currently doing and what you're offering to people?

Dave: I've been in technology a long time. I've been a software engineer, I've been an enterprise architect, and I've spent about half my career working in hands-on analytics, starting with traditional BI and data warehousing: reporting, dashboards, that sort of thing. These days I'm mostly doing data-science-type work, and I'll air-quote that because it means different things to different people; we can certainly talk about what I think it means if you think folks would be interested. So I'm a data science consultant and educator these days. I spend most of my time training people on how to use machine learning and other advanced analytics techniques, as well as helping clients build and use machine learning more effectively.

Patrick: We talked a couple of times remotely, and then we actually met in person in Las Vegas about two weeks ago at TDWI, which used to stand for The Data Warehousing Institute but now stands for Transforming Data With Intelligence. You were teaching several classes there. So, we're talking about machine learning, and the way this whole topic came up is that I've been told that dimensional modeling is passé and that there are much, much better ways of doing things. In your experience, and again, we've got a lot of data modelers and analysts on already, but at a very basic level: why dimensional modeling? What was Kimball trying to do 20 or 25 years ago? What was the problem he was trying to solve?

Dave: From my perspective, and I've been a Kimball guy from way back, I'll admit that like most technologists who've been around a long time, I have certain biases and opinions, strongly held beliefs and convictions. So I'm a Kimball guy, proud to say it, from way back. In the early days, Kimball was trying to achieve arguably two main goals, and you mentioned these. One was handling the physical limitations of the technology back in the day.
Some folks might not know about technologies like this, but OLAP cubes were big back then, and OLAP cubes were essentially a way to cache data in an aggregated form to increase performance. The downside was that you were using aggregated data: you weren't getting at the lowest grain of the business process, the transactional stuff. So Kimball was really trying to say: given the limitations of the technology back then, could we get away from caching data in an aggregated form as the default way we served up data, and could we actually capture data in such a way that we can access the transactional detail? And then surface it, and this is the second aspect, which you mentioned as well, in a way that's easy for folks to understand and consume. That's where dimensional modeling really came into play. Now, as time has moved on, as you correctly indicated, the technology aspect has become less and less important: databases are far more powerful, we have more powerful hardware, cloud, all the things. So it has shifted more to this idea of ready-to-use data, and dimensional modeling is a great way to structure your data so that the maximum number of users in your organization can potentially use it. When done right, the Kimball model tends to be more accessible to a lot more folks: not just technologists and IT people, but also, quote-unquote, business folks.

Patrick: One of the phrases I hear a lot is data mesh, where individual departments, individual domains, are creating their own data products. I think Kimball lends itself to that well, because now you have a standard way of accessing and approaching the data. If I'm in HR and I need to go look at purchasing data, I don't have to teach myself a whole new architecture and a whole new way of approaching the data.

Dave: In theory, assuming you've built all the dims and facts that the HR people need. Sometimes that's true and sometimes it's not. But in general, yes: assuming you've got your bus matrix filled out using the Kimball methodology, and you've filled in all the conformed dimensions and the facts that you need, that's pretty true.

Patrick: You said "in theory they're the same." There's a quote I heard a long, long time ago: the difference between theory and reality is that in theory they're the same. For me, it's always been time and budget.

Dave: Yeah, in reality that's usually the gap.

Patrick: You mentioned grain, and how, because of limitations in processing, throughput, and data transfer, OLAP cubes tend to deal with aggregates, averages, and large sums of data, as opposed to dimensional modeling, where you can get down to the finest grain: individual transactions, individual actions. I'm assuming that really helps with machine learning.

Dave: It absolutely does. The story I like to tell is that I first learned about machine learning while I was getting my master's degree. At the time, I was going to school at night and working at Microsoft during the day, in the Xbox division, supporting manufacturing and supply chain for Xbox, and we had a Kimball-style data warehouse. When I was taking that machine learning course, it just clicked. I thought: wait a second, we can use our data warehouses for far more than just reporting and dashboards.
We can actually use them like a crystal ball: we can take that historical data, feed it into a machine learning model, and, if we do it right, accurately predict future events. That's the whole promise of machine learning and predictive models. And what struck me was that all of that clean, tasty enterprise data was in the raw format that you need for machine learning. You can't really work with aggregates; you have to work with the low-level data, at least at the start. You might aggregate it as part of building your machine learning model, but typically you don't; it's usually much closer to the transactional grain than to the aggregated grain. So, for example, using Cognos or SSAS, which are OLAP cube technologies, usually wouldn't give you the data you actually need to build a good machine learning model, but a Kimball-style data warehouse, with all the low-grain data, could be used for that quite well.

Patrick: You mentioned using your Kimball architecture for training machine learning models and doing predictive analysis on supply chain and things like that. I've been to a lot of AI events, and a lot of the attendees at these events are still trying to figure out what exactly AI and machine learning are. I've learned to describe it as a spectrum; it's not a switch you either have or don't have. Can you give some examples, for someone who's just getting into it and looking at capabilities, of the kinds of things they might be using machine learning for? You mentioned predictive analytics.

Dave: Let me take just a quick second so we can level-set everybody. I have an educational background as well as a professional background in computer science. AI is an umbrella term: from the computer scientist's perspective, there are many areas of study and technologies that all fall under the umbrella of artificial intelligence, one of which is machine learning. Now, technically speaking, what people currently think of as AI, ChatGPT, LLMs, large language models, that sort of thing, is just a specialized form of machine learning called supervised learning, which produces a predictive model. So AI, in the current nomenclature, is a predictive model. Granted, it's a very large, very sophisticated predictive model; we all know how much processing power OpenAI needs to build ChatGPT. It's insane. But in the end, it's still a predictive model: it's predicting the words it sends back in response to your prompts. So from that perspective, what we really need to do is stop thinking about AI per se and start thinking about machine learning along the spectrum you're talking about. The biggest, most powerful predictive models we know of right now are what we call AI: LLMs, ChatGPT, Claude, that sort of thing. That being said, you can create all kinds of predictive models, ranging from ChatGPT down to things like: is this claim fraudulent? Should we approve this insurance application? Can we predict whether a student is going to drop out of our university? All of those are possible, as well as forecasting; you can use machine learning predictive models for forecasting too. In fact, some of the state-of-the-art techniques for forecasting sales and demand use machine learning, the kind of stuff typically taught in a computer science curriculum, which is a bit of a shift from statistical techniques toward machine-learning-based ones.
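To make that concrete, here is a minimal sketch of the machine-learning style of forecasting Dave mentions: recasting a time series as a supervised problem with lagged features. The numbers and column names are invented for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Invented daily demand series; ML-style forecasting recasts it as supervised
# learning by deriving lagged values of the series as feature columns.
units = pd.Series([112, 118, 97, 130, 125, 140, 133, 121, 129, 150, 142, 138, 155, 149])

frame = pd.DataFrame({
    "lag_1": units.shift(1),   # yesterday's demand
    "lag_7": units.shift(7),   # demand one week ago
    "y": units,
}).dropna()

model = GradientBoostingRegressor(random_state=42)
model.fit(frame[["lag_1", "lag_7"]], frame["y"])

# One-step-ahead forecast: build the feature row from the latest actuals.
next_row = pd.DataFrame({"lag_1": [units.iloc[-1]], "lag_7": [units.iloc[-7]]})
print(model.predict(next_row))
```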
Dave: So predictive modeling is the obvious one; that's the one that gets all the play. But that's not all. In a lot of places you have lots of data, but your data doesn't have what's known in machine learning as a label, which is an outcome of interest. The easiest way to think about that is as some sort of yes/no indicator: is this claim fraudulent, yes or no? Is this student likely to drop out of university, yes or no? A column in your data set with that kind of yes/no indicator is what's known as a label, an outcome of interest. Most of the world's data is not labeled. So another form of machine learning that's really, really useful in practice is what's known as unsupervised learning, which is for data that doesn't have these labels. Cluster analysis is the quintessential example; that's another form of machine learning, and by extension cluster analysis is AI too. So somebody just getting into this really needs to ask: along this spectrum, where should we actually start working first? Where are we going to get the biggest potential bang, the biggest potential ROI, for the least amount of effort? Typically it's not by trying to work with LLMs, because those are the largest and most complicated types of machine learning. It's usually a good idea in the early days to scale it back and try something a little easier, a little more understandable to your business, something with a more direct line of sight to business ROI, and start there. And if you're under pressure from your executives, you can say with a straight face: boss, this is AI.

Patrick: I like that. It's one of those things that everybody wants but nobody quite knows exactly what it is. You mentioned that at Microsoft you had been in charge of the Xbox supply chain, and you had analytics in a Kimball-style warehouse that you'd been gathering for that. How were you using that pre machine learning? What was that data being used for?

Dave: Just so we're clear, I was not in charge of the supply chain. I want to make sure that's clear. At the time, I was the architect for the BI and analytics platforms for Microsoft's Xbox supply chain. I eventually became the senior director of BI and analytics and actually ran the team, yes, but I don't want anyone to say, "Dave, you were not in charge of the whole thing," because I wasn't. Okay, so before my epiphany in that machine learning course I took as part of my master's degree, we used our data warehouse for exactly what you would expect: reporting and dashboards. I worked at Microsoft, so not surprisingly we were on the SQL Server stack, and we had all the things: SSAS, SQL Server Analysis Services, OLAP cubes; SSRS, SQL Server Reporting Services. I was at Microsoft when Power BI first rolled out in the early days, so of course we had Power BI dashboards all over the place. All the traditional stuff that, under the Gartner maturity model, would be classified as descriptive analytics: reporting on what happened last week, last month, last quarter. That was predominantly what we did.
The analyses we did were, generally speaking, quite light, as at most companies, and what typically ended up happening was that if any of our quote-unquote business partners were interested in doing analyses, more often than not they exported the data and analyzed it in Excel, or they asked for a data dump or data pull and then analyzed it in Excel. So we were right in the descriptive-analytics, fairly low maturity part of the spectrum.

Patrick: And I assume that was all backward-looking; all the reports and dashboards were "here's what happened."

Dave: Right, and of course there was a certain amount of diagnostic analytics that people tried to do as well. For example, if we had a bad batch of components or consoles or something like that, trying to root-cause it: classic manufacturing and supply chain work. But typically that was all done using Excel or tools outside the official IT BI stack.

Patrick: I do have to say I was very disappointed when the Kinect broke on my Xbox. I should probably have talked to you about that, because it was actually really handy.

Dave: Oh yeah, I loved the Kinect. It was great. I use it as an example in some of my machine learning courses, because there was a machine learning model inside the device that made the Kinect actually work, that recognized your body movements and so on.

Patrick: So you already had a Kimball-style architecture for doing this look-back, diagnostic, and other kinds of business analytics. But suppose I don't have that. Suppose I've got a data lake, a Data Vault, some unstructured or semi-structured data. How do I go about it? What should I look for first? What are some easy wins, where I can say: if I can model certain areas, certain types of data, that would give me, as you said before, the most bang for the buck?

Dave: It's a great question, and the advice I give is the same advice I've followed myself, so I'm preaching what I do, if you will. Look at the KPIs. Look at the KPIs that your managers are being held accountable to, or your manager's managers, or even farther up the chain. In general, that's a great place to start. Then, once you've looked at the KPIs, look at the stuff that's got them worried. I'm going to be completely frank: look at the stuff that actually determines whether or not they get a bonus. If you can find that place, where bonuses are at stake and things aren't going well, that's a great place to ask: is there a way I could apply advanced analytics in this space to positively drive that KPI? For example, you might be able to perform a cluster analysis around things related to that KPI of interest. Let's say you're a university and you're worried about students dropping out for whatever reason. You might do a cluster analysis along those lines, where the rows of your data set are the entities of interest, the students, and the columns are their behaviors, attributes, and characteristics. You cluster them, then take a look at those clusters and start asking: is this cluster likely, for some reason, to drop out of the university? Are these people not? And so on. That becomes insights you can use and present, as part of some sort of data story or presentation, to the executives. And because you're addressing something the leadership cares about, they're usually far more interested in what you have to present than in, for example, yet another dashboard. Been there, done that, by the way: I've built many dashboards myself, a lot of which never got used very much.
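As a rough sketch of the kind of cluster analysis described here, assuming a hypothetical students.csv extract with invented behavioral columns, the workflow might look like this in Python:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical extract: one row per student, columns are behaviors/attributes.
students = pd.read_csv("students.csv")  # file and columns below are assumptions
features = students[["credits_attempted", "avg_grade", "logins_per_week",
                     "assignments_late", "financial_aid_flag"]]

# Standardize so no single column dominates the distance calculation.
X = StandardScaler().fit_transform(features)

# Fit k-means; k=4 is an arbitrary starting point worth tuning (e.g., elbow plot).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
students["cluster"] = kmeans.fit_predict(X)

# Profile each cluster to see which groups look like dropout risks.
print(students.groupby("cluster")[features.columns].mean())
```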
Patrick: I'm used to seeing somebody's monitors up on the wall with KPIs and dashboards going green, yellow, red. The idea of being predictive about it is enticing. I'm primarily a technical creature; you're primarily a technical creature, but in talking with you, you've definitely got a business sense. One of the things we talk about when we talk about building data warehouses is that it's typically a technical task involving a lot of technical tools, including WhereScape, and I can talk about that in a little bit, but it should primarily be a business function. Building a data warehouse, an analytics system, AI, should answer business questions, not just tactical questions. So we make a push to encourage people, if they're a data engineer or a data architect or the like, to expand their views. I think a lot of the audience today will be data analysts, who are already closer to the business. But you mentioned finding the key KPIs, the things that are near and dear to executives' hearts, like bonuses. Do you see a problem with AI being treated specifically as an IT function, or as a technical tool, rather than as a business tool?

Dave: Yes. And I'm going to make an assumption, Patrick, that you're using AI as the umbrella term, as I described earlier?

Patrick: Yes, machine learning. I have to retrain myself.

Dave: Because if we don't, then we have to start talking about data science, and that's another can of worms to open up. Generally speaking, the answer is that I come from an IT background, and I had to learn the hard way, especially when I moved into data, that technology is great insofar as executives and business leaders are sometimes looking for certain checkboxes they can tick: yes, we're doing this; yes, we're doing that. Which is totally fine, just the reality of the situation. But in general, they also don't care too much about the how. One of the earliest mistakes I made was telling people about the algorithm, the machine learning model that I used, as part of a presentation of the insights I'd found. Nobody cares. Nobody in that room cares about the how. All they care about is the what: why am I here, Dave? What's the insight? What's the so-what factor? So moving to this idea that, yes, these are technical skills for sure, but they are technical skills that quite readily serve business concerns, is super important. That's different from hardcore software engineering, where there isn't necessarily such a direct correlation. When I was a software engineer, that used to bum me out big time: nobody in the business cared about my beautiful object-oriented design, all my design patterns. Nobody cared. That was a big shift for me: realizing that all of my technical skills are just an enabler to get to the "so what," and that's what allowed me to get into the room and have conversations.
That's what folks really need to think about, especially if they come from a more traditional technical background, and especially if they're on a traditional DW/BI team with a culture or operating model of just taking in requests: I need a dashboard, I need you to change this, I need you to change that. Which is all well and good; you have to have descriptive analytics. But a lot of these folks want to move beyond it, and the way to move beyond it is to ask: how do we take these technical skills and evangelize them as addressing business concerns? How do we evangelize them as business skills, not just technical skills? Along the same vein: pivot tables. Pivot tables in Excel are arguably a technical skill, but the perception among business people is that pivot tables are the best thing ever. We need to get machine learning, data science, and analytical technical skills into that same realm as pivot tables in Excel.

Patrick: I come from a very academic, very technical background. My dad taught physics, and I was raised in a university environment; a college town is where I grew up. Probably one of the best skills I developed was being able to not over-explain things. I always thought that if someone asked me a question, I needed to go into all the minutiae. You've talked about that: as soon as you start introducing Greek letters, you've probably gone too far.

Dave: Yeah, if you have to use Greek letters in your explanation, for most people that's way too far.

Patrick: We've got a couple of poll questions here that I'll launch: how familiar are you with machine learning, and are you using machine learning now? Oh, it looks like I can only launch one at a time. I also see that someone has their hand raised in the attendees; I don't know if that's an accident or on purpose, but if you do have a question, there's a Q&A button, so feel free to type it in there. You mentioned the Microsoft supply chain, and you've worked with a lot of people in a lot of situations. Can you describe some other real-world examples of where you've helped move someone from doing after-the-fact analysis to being more predictive?

Dave: There are quite a few examples. One project I worked on was for one of the state governments here in the US, which will remain nameless. They were interested in trying to create a machine learning model that could accurately predict whether or not an unemployment insurance claim was potentially fraudulent, because they had been doing that by hand. They had a fraud team that would comb through the historical data, try to find patterns by hand, and hard-code rules. For example, and this is no joke: they had noticed that if somebody used one particular middle name in their claim application, that was disproportionately associated with fraud. Which is all well and good; in principle, a machine learning model will do exactly the same thing. But what a machine learning model will do is do it at scale, far more comprehensively than a human could.
What the machine learning model learned from the data would have taken this team many, many years to figure out, if they could at all. So machine learning was great for that. That was a great example of moving away from combing through the historical data by hand, looking for things, to a predictive model. And one of the things they found most interesting, by the way, was not just that the model accurately predicted fraud based on the historical data, but also what the model found to be important for predicting that fraud, which was stuff they had never thought of before. That's one example; I've got more if you want more.

Patrick: I imagine with that middle name, as a human being looking at that, there's confirmation bias: if I see that middle name again, I go, oh, that's another one, and possibly overlook the ninety-nine that didn't have that middle name.

Dave: Right: "See, that's my rule, that's the rule I figured out. Look at that. Look at me." Which is human nature. It's just the way it is.

Patrick: Sometime in my twenties I decided I would adopt the attitude of egoless programming, and like Zen, I never quite got there. There was always some ego in what I was doing: that's mine, I came up with that solution.

Dave: There's meme potential in the egoless software engineer. I've got to think about it more, because I used to be a software engineer, and I certainly was not egoless, for sure. Okay, so we do have a question that fits in well right now, from an anonymous attendee: "I'm still struggling to understand why a Kimball architecture would help with machine learning. My experience with machine learning is very basic. I've always thought people use data for machine learning that exists in data lakes or the like. So what advantage does a Kimball architecture have for machine learning over a data lake?" That's a great question, and the answer, quite frankly, is convenience. As a consulting machine learning person, data scientist, whatever you want to call it, when I find out that my client has a Kimball data warehouse, that makes me happy, because it means I can get more done with less effort. Let me give you a specific example: dim date. A date dimension in the Kimball style, if it's done well, typically has many, many columns for various aspects of a date: is it a weekday, what day of the week is it, is it a holiday, all these kinds of features. Those features end up being quite useful in many real-world machine learning scenarios, and if they're already built into dim date, I don't have to build them myself in my Python code, for example. So that's one example of the convenience factor. Here's another: the accumulating snapshot fact table. This is one of the relatively advanced design patterns from the Kimball style. Basically, think of it as a table where each column represents a step in the business process, and usually the way a step gets recorded is as a key reference to dim date: this step happened on this date. Accumulating snapshot fact tables are awesome because, if they're modeled well, somebody has already gone through the rigmarole of understanding the business process step by step, has figured out which steps are the most important, and has embodied them in the table. So as a data scientist, I can just pull that data directly and start using it in my machine learning models to explore whether any of those steps are important.
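A sketch of that convenience factor, with hypothetical table and column names: role-playing joins to dim date turn the warehouse's prebuilt date attributes, and an accumulating snapshot's step dates, into ready-made features.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical schema: an order-fulfillment accumulating snapshot fact joined
# to dim_date twice, so the warehouse's prebuilt date attributes become features.
engine = create_engine("postgresql://user:pass@host/warehouse")  # placeholder DSN

query = """
SELECT f.order_key,
       d_order.is_weekday AS ordered_on_weekday,
       d_order.is_holiday AS ordered_on_holiday,
       d_ship.fiscal_week AS ship_fiscal_week,
       d_ship.date_value - d_order.date_value AS days_to_ship  -- step-to-step lag
FROM   fact_order_fulfillment f
JOIN   dim_date d_order ON f.order_date_key = d_order.date_key
JOIN   dim_date d_ship  ON f.ship_date_key  = d_ship.date_key
"""
features = pd.read_sql(query, engine)  # one tidy row per order, ready for a model
```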
Dave: Now, all that aside, there's also one underlying foundational element: generally speaking, the data quality of what you pull out of a Kimball-style enterprise data warehouse is going to be very, very high, because all the data checks were put in place, oftentimes many years ago. That's also very nice, because I spend less time cleaning the data; it's already been cleaned and normalized at the enterprise level. So it's mostly about convenience, and it's also about priming your brain, as the machine learning practitioner, about what sorts of things are potentially important to the business process, because they're typically embodied somewhere in the collection of dims and facts in the data warehouse.

Patrick: One of the things WhereScape does is autogenerate dim dates, with 90 to 110 columns of attributes: is it a holiday, is it a weekend, what fiscal week is it, what fiscal year is it, what's the Julian date. And the nice thing about that is I can create different dim dates for different locations; maybe I've got holidays in one country that I don't have in another. That stuff gets out of my head and becomes reproducible and consistent, so that everyone knows that in the UK, December 26 is Boxing Day and it's a holiday, or that every other Friday is a bank holiday, or whatever they do in the UK; it seems like every other Friday is a national holiday.

Dave: That's a great example as well, because if you're new to machine learning, and let's say you haven't been on the data warehousing team for the past two decades, you can start going in and looking at these schemas, looking at the tables themselves, and it helps you understand: if this was put in the data warehouse in the first place, it must be important for some reason. Why would they go through the trouble of modeling it, populating it, and keeping the data clean if it didn't have some sort of value? So it also helps you in the process of brainstorming your features, the columns that you feed into your machine learning model. Now, that being said, it doesn't necessarily mean your data warehouse is the be-all and end-all of where you pull data for a machine learning model. You might certainly pull data from additional sources as well. But typically, in my experience, I start with the data in the data warehouse, assuming it exists, and then reach out to other types of data if needed: for example from the US federal government, or from a data lake where maybe they're storing log records in a semi-structured format, whatever it might be. But typically, when I have access to a good Kimball-style data warehouse, it becomes my center of gravity. It's usually where I pull most of the features I use, and then I reach out from there to other data sources.

Patrick: One of the questions that came in: is Kimball a bit too rigid compared to Data Vault? I can take that one; I've got a little bit of experience with Data Vault.
Data Vault is a specific methodology that comprises several pieces; its practitioners describe it not as an architecture but as a methodology that involves people and processes, and I have to say that part. But when people think Data Vault, they think hubs, satellites, and links, a specific architecture. What Dan Linstedt will tell you is that you've got the raw vault, which is essentially a one-to-one copy of what you're importing from the source systems, and you almost never, ever use that data directly. You're going to create what he calls a business vault on top of that, and he even advocates creating facts and dimensions on top of that as views. So you've got your raw vault with the uncurated data, you've got your business vault with views over that data, applying the business rules, and taking data from the raw vault and implementing it as star schemas. So actually, he recommends that you put views, facts and dimensions where appropriate, on top of your Data Vault. Excuse me, as I said earlier, I've got this little cough that never seems to quite go away. Now, data quality: it's become so ubiquitous in talking about machine learning and other things that I have been forbidden at certain conferences from saying "garbage in, garbage out." I'm not allowed to say that anymore.

Dave: Why not? It's just as true now as when it was first coined.

Patrick: Yeah, but it's so overused that I'm working on another phrase. I haven't quite figured one out yet.

Dave: Fair enough.

Patrick: Does the Kimball-style methodology, with its breaking things into facts and dimensions, lend itself better to data quality checks and data governance than a data lake, where you're just pumping data in?

Dave: In general, yes, though to be fair, that idea is more of an enterprise data warehousing truism than something Kimball-specific. If you've got an old-school Inmon-style snowflake data warehouse, ideally you've still got all the same kinds of data quality checks as you would have in a Kimball data warehouse, so tomato, tomahto to a certain extent. But once again, people might ask: Dave, don't data scientists, machine learning people, just want to go access the raw data? And typically the answer is yes, mainly because what they're looking for is maximum flexibility. They want to avoid any aggregation or preprocessing of the data that takes away the flexibility they need to engineer the best features, the best columns, to feed into their machine learning models. Now, that being said, a well-designed Kimball data warehouse is, from the get-go, trying to store high-quality, enterprise-grade transactional data, so oftentimes that's basically all you need, and you've got sufficient data quality. Now here's the thing: if you're a quote-unquote data scientist type person, you know how to write code. You know how to write SQL, Python, R, whatever it might be. So you're probably okay going after basically any kind of data format you want, whether that's dims and facts in your enterprise data warehouse, or a CSV stored in some sort of blob storage in the cloud, or a data lake, whatever it might be, because you can just write the code and work with it.
The question isn't so much whether you can work with the data; it really boils down to how quickly you can get to a good result that the business needs. That's what you're really worried about, not the technical aspects. And yes, even though I can access any kind of data in any kind of format and technology because I have the coding skills, it still benefits me to get things done faster, and that's where data warehousing and more formal data structures are really, really useful. And again, as I mentioned earlier and it bears repeating: sometimes I have to go beyond the DW, whatever its form. Sometimes I have to go past the enterprise-blessed, data-governed sources because I need more and different types of data. But usually, if I have access to a good data warehouse, that forms the core of the data, and then I go outside for other types.

Patrick: Again representing WhereScape, what I'd add is that one of the things I talk about with machine learning, AI, and data quality is that having a consistently generated way of enforcing governance and quality standards lends itself very, very well to having good clean data at the end. I think we all know the potential problems, like longer training times, bad results, and hallucinations, if you start feeding poor data into an AI or machine learning engine.

Dave: Although we wouldn't call those hallucinations; that's more of an LLM thing. Those are just errors, plain old-fashioned errors.

Patrick: I'll throw in one more plug for WhereScape: the idea of being able to generate large models very quickly, to prototype, to change the format of data. Maybe I create a dimension and then realize I need to add something to it, or break it into two dimensions. Automation lets you do that very, very quickly. With machine learning, how to format data and what to put in it has no single right answer; a lot of it is going to be self-training and learning, and automation and WhereScape will definitely help with that. I've got a couple of other questions. We talked a little about adapting existing data: find the KPIs, find the important things. But for my own ignorance: I've got a dimensional model, facts and dimensions, sitting somewhere in the cloud or on-prem. I'm just curious, how do I get that data into a machine learning engine?

Dave: There are many ways to do this. If you're on a particular cloud provider, let's say Google Cloud, and by the way, I'm not endorsing any particular cloud provider, I'm just using GCP, Google Cloud Platform, as an example, it's going to vary based on what cloud platform you're on and what technologies you have access to. But the classic strategy, at a super high level, irrespective of the individual architectural decisions in your technology stack, is that more often than not you're pulling data from some sort of data storage that speaks SQL, if you will. It could be a formal RDBMS, or it could be some big-data solution with a SQL interface on top of it, like Spark, for example, or Databricks.
What you're doing is pulling data, and more often than not a lot of the cleaning and transformation logic for engineering your features is written in SQL and run against that storage engine. Ideally you join everything up in the storage engine, which is usually more powerful than your laptop, and then you pull the result into something like Python or R. What you get back is what's known as a data frame, essentially a table of data, and you feed that table into your machine learning algorithm. One of the things that always cracks me up is that I teach a course on how to do data wrangling for machine learning using Python, and I'm teaching people how to use Python to wrangle the data, work with entire tables, join them up, all those kinds of things, and inevitably somebody raises their hand and asks, "Can I do this in SQL?" The answer is yes, of course you can. So that's typically what ends up happening: in most of the client situations I'm in, most of the data is in some sort of data storage technology that speaks SQL. You fire off your SQL, do a bunch of work, pull the data back, and that net result, that data frame, that table, is then what gets fed to the machine learning algorithm.

Patrick: What's considered a large table? How much data are we talking about feeding in? You being a consultant, the answer is going to be "it depends."

Dave: I used to be an enterprise architect; the answer is "it depends." But most people would be surprised. We have a tendency to be overly fascinated with quote-unquote big data sets, especially in the technical community, because they represent an interesting engineering problem, and that's legit. However, I've been doing machine learning now for 12 or 13 years, and most data sets in the real world that I've worked on, or worked on with my clients, are a lot smaller than you'd think: oftentimes less than a million rows. Quite frankly, the terms I tend to use, more than "big data" and "small data," is that most of them are Excel-sized.

Patrick: I was going to ask: isn't a million rows the limit in Excel now?

Dave: Something like that is the technical limit. In practical terms, Excel will choke well before that most of the time.
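A minimal sketch of that loop, assuming an entirely hypothetical fact_claims table with a yes/no fraud label: fire off the SQL, get a data frame back, and hand it to the algorithm.

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import RandomForestClassifier

engine = create_engine("postgresql://user:pass@host/warehouse")  # placeholder DSN

# Push the joins and cleaning into the storage engine; pull back one tidy table.
df = pd.read_sql(
    """
    SELECT claim_amount, claimant_tenure_days, prior_claims, is_fraud
    FROM   fact_claims            -- hypothetical fact table
    WHERE  claim_date >= '2023-01-01'
    """,
    engine,
)

# The data frame's feature columns and its yes/no label feed the algorithm directly.
X = df.drop(columns="is_fraud")
y = df["is_fraud"]
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
```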
Patrick: So, I've gone through my list of questions, but we've got a question that's peripherally associated with this: "We develop features from our data warehouse to train an ML model. Once the model is trained, we point it at our real-time transactional source, where the data is formatted in JSON. The challenge we have is recreating the features in real time against the JSON." So if I'm using Kimball style or something similar to train a machine learning model, how do I handle data that's maybe JSON or CSVs?

Dave: Right, so that's a production scenario. MLOps, machine learning operations, talks a lot about these characteristics. You have a couple of different pipelines of data. You have your training pipeline, which is what the person was referring to: we built that from our Kimball-style data warehouse. Now, if I understand the situation correctly, the data in said Kimball data warehouse came from JSON, so there was some processing pipeline upstream, in data engineering, or what we used to call ETL in the old days, that takes the JSON, cracks it open, applies some rules and transformations, and feeds it into a relational, tabular model. You essentially need to duplicate that processing for the JSON at run time, because that now becomes what's known as your inference pipeline: I'm taking the raw data in, and I have to transform it into a format the machine learning model understands before it can make a prediction. So essentially, ideally, you would look at that, and hopefully there's some code reuse between the upstream data engineering ETL pipeline that feeds your data warehouse and your inference pipeline.

Patrick: Okay, so if you can't get to the source system directly, and you're only getting things through an API or JSON files, you're essentially going to break them apart into the individual attributes, the same way you load your data foundation, and then use that.

Dave: Yes. The most commonly used machine learning algorithms in practice require tabular data, not hierarchical, document-style data like JSON, the kind of thing NoSQL document databases such as MongoDB are pretty good at. The model only understands tabular data. So if you build a model using tabular data, what you're saying is: in production, my inference pipeline has to transform any of the data sources into that tabular representation before the model can be used. That's just the reality of the situation, so it needs to be part of your project planning effort. As part of your early POC or MVP efforts, you also need to verify: look, if we're really dependent on JSON, what's the actual quality of the JSON coming in? I'll speak purely from my own experience, so folks should take this as a single data point, but typically, what I've seen when folks rely a lot on JSON is that the data quality can be potentially spotty. Not always, but generally speaking, especially if it's a hierarchical, loosey-goosey document-style JSON format, you can run into all kinds of problems, and that's just something you'll have to take on as part of your inference pipeline work, if that's the case.
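One way to picture the code-reuse point, with invented field names: a single feature-building function that both the batch training pipeline and the real-time inference path call, so the JSON is flattened identically in both.

```python
import pandas as pd

def build_features(claim: dict) -> dict:
    """Flatten one raw JSON claim into the tabular features the model expects.
    Field names here are hypothetical; the point is one shared definition."""
    return {
        "claim_amount": float(claim["amount"]),
        "prior_claims": len(claim.get("history", [])),
        "has_middle_name": int(bool(claim["applicant"].get("middle_name"))),
    }

# Training pipeline (batch): apply the same function over historical records.
historical = [  # stand-in for records staged upstream of the warehouse
    {"amount": "1200.50", "history": [{}], "applicant": {"middle_name": "X"}},
]
train_df = pd.DataFrame(build_features(c) for c in historical)

# Inference pipeline (real time): one incoming JSON document, same transformation.
def predict_one(model, raw_claim: dict) -> int:
    row = pd.DataFrame([build_features(raw_claim)])
    return int(model.predict(row)[0])
```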
Patrick: One of the things I do, in addition to hosting a lot of webinars and going to shows, is record videos for our marketing team, and last week they asked me to do a five-minute video. They gave me three choices, one of which was how to break apart a JSON file and get it into a database very quickly, so I'll work on that and hopefully have it done in a week or so.

Dave: If you're working with an API endpoint, typically the JSON is relatively structured, so that's usually not a problem. But if it's truly JSON representing arbitrary document structures, it can be quite difficult.

Patrick: What I thought was interesting: we've got a file parser in WhereScape that automatically splits apart JSON, XML, Avro, Parquet, all these different file types, whether they're coming from a file or from an API. I ran into an interesting use case where someone was getting JSON through a table: a table with a big, long text field, and inside the text field was the JSON document. So we had to adapt our application to read from a select statement instead of from a file or an API, which was kind of a nice way to do it. Plus, the data was encrypted; this was on Google BigQuery, so they had a way of decrypting it as they pulled the data out, which I thought was a pretty cool way of doing it.

Dave: One thing I will say, and I've seen this as an anti-pattern before, because I love this question, if you can't tell: what often ends up happening in practice, especially in larger organizations, is that the data warehousing team has quote-unquote figured out the logic for breaking apart, for example, these JSON documents and populating the warehouse, and then a different team ends up building and maintaining the production machine learning environment, and oftentimes they don't reuse all the work the data warehousing team already went through. They reinvent the wheel. I used to be a software engineer, and I loved to cut code from scratch; as an enterprise architect, I hated that. So there's a constant balance. One thing I will mention: if you've already figured this stuff out as part of populating your data warehouse, please, please, please reuse it as much as you can when you start moving into machine learning models.

Patrick: I'm trying to launch a few more questions here. This one is not a question but a statement, so I'll read it; I don't know if you can see the Q&A panel, Dave. "Why are you so awesome? Thanks for all the buzzword-stripped, best-practice answers you always provide." So, why are you so awesome?

Dave: Well, first of all, I don't know if my dad would say that about me. Or my brother, anyway. If people think I'm awesome, that's great; obviously I like to hear that, I'm human after all, and I'm a little bit embarrassed and flustered, if you can't tell. But here's the thing: I've just been burned a lot, that's all. It's not like I came down from the mountaintop with truth and wisdom. I had to learn the hard way, got burned a lot, so I just speak from my experience, and everyone's mileage may vary a little bit.

Patrick: Maybe I'm a little too emphatic when I tell my boss: I make mistakes, I make a lot of mistakes, I just try really hard not to repeat mistakes.

Dave: Me too. Unfortunately I'm not always successful, but that is my goal. I didn't say I don't repeat mistakes; I said I try not to repeat mistakes.

Patrick: Exactly right. Nice weaselly phraseology there. We've got about eight minutes left, so again, if you have any questions, feel free to type them into the Q&A panel. Here's another from an anonymous attendee; we're getting good questions from Anonymous. "How does validation work with machine learning? In general, when we build dims and facts for reports, there's already a data set out there that we use to tie out to, to ensure the data is good. How does this work when architecting fact and dim tables for ingesting into a machine learning model? How do you know if you got good data? How do you know if you got good results?"

Dave: Okay, so there are a couple of things here, and forgive me, but as an enterprise architect I used to be called Captain Pedantic. "Validation" in machine learning terminology means something different than in an enterprise data quality context: validation is literally how you validate the predictive performance of a model. So I'm going to put that to the side for a second and assume we mean: how do we know that the data coming into the machine learning data frame, the table we feed into the algorithm to train the model, is good?
And the answer is that typically, most of the time, there isn't any double-check kind of validation. In the old days, many, many years ago, when I worked in BI and data warehousing, ideally what we would do was hand a specification for a dim or a fact, including the mappings from the source systems and the transformations and all that, to the engineering team and to the testing team, and they would build two things separately: the engineering team would build the production pipeline, and the testing team would build automation to test the pipeline. That was basically a double check to validate that everything worked according to spec. Typically you don't have that in most machine learning scenarios, just like, for example, you don't have it when a user hooks Excel up to your data warehouse and performs some sort of analysis or pivot table on it: there's no double check. Is it a good idea to have those two in place? Absolutely, and if you build a production machine learning model, that could obviously be something you add into the mix. Of course, that increases the expense of the initial development as well as the maintenance, but it's probably a good idea. However, I will say that that kind of double-check isn't that common in practice. And I'm going to pause there to make sure I didn't totally miss the nature of the question in my response.

Patrick: No, there were two questions in there, and we used "validation" for both. Validating the data in the star schema is one thing; validating the results of machine learning is a completely separate area.

Dave: Completely separate. And something you have to do, by the way.
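For that second sense of validation, checking a model's predictive performance, here is a minimal sketch of the standard approach: score the model only on rows held out from training. The data here is synthetic, standing in for a feature table pulled from the warehouse.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a labeled feature table.
X, y = make_classification(n_samples=5000, n_features=10, random_state=42)

# Hold out 20% of rows; the model is scored only on data it never trained on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Cross-validation repeats the hold-out idea across 5 different splits.
print("5-fold CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
```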
Patrick: I think the very first time I heard about machine learning, years and years ago, someone was trying to train a model by feeding it pictures of people so it could tell whether they were happy or sad, and after feeding it hundreds of thousands of pictures and spending weeks doing it, they had actually trained it to tell when someone was indoors versus outdoors.

Dave: Right. And of course, there's also the unspoken question of what bias was there, based on the distribution of the actual images, but that's well beyond this particular talk today.

Patrick: This is a good question, again from our anonymous attendee: "Where do you recommend we get real-world instruction and examples of utilizing data warehouse data in machine learning, besides Dave on Data?" Or including Dave on Data.

Dave: Well, I'm not going to sing my own praises; somebody else has done that for me, thank you. That's a really good question, because, quite frankly, and I'm going to be brutally honest, if you go out into social media, the blogosphere, or YouTube, what you're typically not seeing are tutorials or courses or videos or blog posts that show you specifically, "Hey, I'm pulling all these features from dim date, because date features are awesome." You don't actually see that very often, because there's an implicit assumption, and this was alluded to in an earlier question, that these quote-unquote data scientist types just go against the raw transactional data, that they're not actually using the enterprise assets. I would argue that's not necessarily true, and it's certainly not something I would advocate. So unfortunately, I don't have any good examples of that. Maybe I should start making some, quite frankly, based on this question.

Patrick: We were at, I'll just say, a trade show focused on training a couple of weeks ago. I don't want to put you on the spot, but are there any particular certifications, conferences, or classes, maybe remote classes at universities, that you thought were helpful for getting a better handle on how to implement machine learning?

Dave: There are lots of good resources for implementing machine learning in general. I teach courses at said conference multiple times a year. But I think the question was specifically about using, for example, a Kimball schema as the data source and what you would do with it, and even my own classes don't do that, because they tend to be more generalized. When you're teaching people machine learning in general, you're usually trying to reach a broad audience, so you don't get too specific about particular technologies. But there are plenty of resources out there. You can go to YouTube and see tens of thousands of videos on machine learning, of course. I have a YouTube channel; you can check out some of my videos if you like me and my style, but I wouldn't say I'm the be-all and end-all in that regard.

Patrick: Okay, well, we're coming up on the top of the hour; we have about two minutes left. I always do these, and fifteen minutes in I look at my watch and think, we have to talk for another forty-five minutes, and the next time I look at my watch, we're done.

Dave: Well, I'll take that as a compliment. Thank you.

Patrick: I appreciate you joining us. I got a lot out of this, and I'm sure our other attendees did as well. We've got this recorded if you want to watch it again, and you can reach Dave at daveondata.com. I didn't mention WhereScape too much, so I fell down on my duties as a WhereScape employee, but I will say that if you go to wherescape.com you can check us out. Again, the automation of building prototypes very quickly, building dimensional models or other architectures very quickly: it's a fantastic tool for accelerating the design, development, and building of a data warehouse. Check us out at wherescape.com. So Dave, again, thank you very, very much. I appreciate it.

Dave: Thank you for the time. Hopefully folks enjoyed it, and hopefully they find it useful.

Patrick: And if you have any follow-up questions, feel free to reach out to us, either me, Patrick O'Halloran, or moreinfo or marketing, all at wherescape.com; all those email addresses work just fine. And as I said before, you can go to wherescape.com and check it out there as well. All right, Dave, thank you again. Everybody, enjoy the rest of your day, and thank you for your time and attention. Thanks, everyone.

Learn more about our unique data productivity capabilities for these leading platforms

Deploy on Microsoft Azure and integrate with Microsoft applications.

Seamlessly work with Amazon Web Services (AWS).

Leverage a complete range of Google infrastructure and data solutions.

Ingest data from multiple sources and deliver more business insights.

Databricks

Deliver a wider variety of real-time data for AI, ML and data science.

“It took the architects a day and a half to solve all four use cases. They built two Data Vaults on the host application data, linked the two applications together and documented the whole process. This was impressive by any standard. After that it was an easy process to get all the documents signed.”

Daniel Seymore, Head of BI, Investec South Africa

Read Case Study

"At seven months into the project we can say it really worked out. We have been able to really quickly develop an initial MVP for our first country and that was really good. The automation and the changes we needed to do were rapidly applied. We had to remodel a few things and that was done within a day with the automation in WhereScape."

Carsten Griefnow, Senior BI Manager

Read Case Study

"It’s like having five people with only really two people working on it."

Will Mealing, Head of Data & Analytics at L&G

Read Case Study

EBOOK

Achieving Carbon Neutrality: A WhereScape Case Study

Download eBook →
GARTNER REPORT

Gatepoint Research Pulse Report: Approaches to Data Warehouse Automation

Read Gartner Report →
ON-DEMAND WEBINAR

On Demand | IT-Logix | DACH | Tips and Tricks for Successfully Gathering Your BI-Specific Requirements

Watch Webinar →
VIDEO

On Demand | Leveraging AI in Healthcare: Transforming Patient Insights into Actionable Outcomes

Watch Video →