Operating a Data Warehouse

July 24, 2017


Having designed and built your data warehouse, I imagine that you’d like to deliver it successfully to the business and run it smoothly on a daily basis. That’s the topic of today’s article.

As digitalization continues apace across all industries, the role and value of a data warehouse—together with its attendant data marts and associated data lake—becomes ever more central to business success. With such informational systems now becoming as important as traditional operational systems, and often more so, it should be self-evident that highly reliable and efficient operating practices must be adopted.

Historically, however, approaches to operating a data warehouse environment have been somewhat lax. Manual and semi-automated methods for populating and updating the contents of the warehouse have been widespread. Advances made in the data warehouse itself have been offset by the spread of ad hoc approaches in departmentally managed data marts. The data lake has seen the re-emergence of a cottage industry of hand-crafted scripts, often operated by individual data scientists.

The challenges of data lake operations and management have recently attracted widespread comment (centered on the phrase data swamp) and increasing focus by vendors. Nonetheless, it is the data warehouse—as the repository of truth about the legally binding history of the business and the basis for reliable exploration and analysis of business challenges and opportunities—where most can be gained by the adoption of advanced management and automation practices. The combination of data warehouse automation (DWA) and Data Vault addresses these needs from two perspectives: deployment of function and ongoing day-to-day operations.

Deployment seldom gets the attention it deserves; it’s not exactly the sexiest part of any IT project! However, for a data warehouse, deployment needs to be treated as a long-term, monogamous relationship. A data warehouse is substantially more complex than most IT projects, given the variety and number of systems involved. It is also significantly more important to get right as we move toward data-driven business and more agile development approaches.

As a data warehouse moves from the development phase (the design and build discussed previously) to test, quality assurance, and on to production, seemingly mundane—yet highly important—issues such as packaging and installation of the code built in the previous phase must be addressed. In the case of DWA, where all cleansing, transformation, and (of course) loading occur in the target databases of the data warehouse, mart, and lake, this code consists of both the definitional SQL (DDL) that builds database structures such as tables and indexes, and the processing code (DML) that populates them with data.
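To make the DDL/DML distinction concrete, here is a minimal, hypothetical sketch using Python's built-in sqlite3 module as a stand-in target database. The table and column names are invented for illustration, not WhereScape-generated code: the DDL builds a staging table, a Data Vault hub, and an index; the DML then populates the hub from staging, entirely inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: build the database structures (tables, indexes, ...)
conn.executescript("""
    CREATE TABLE stage_customer (customer_id INTEGER, name TEXT);
    CREATE TABLE hub_customer   (customer_hk TEXT PRIMARY KEY,
                                 customer_id INTEGER);
    CREATE INDEX ix_hub_customer_id ON hub_customer (customer_id);
""")

# DML: the processing code that populates the structures.
# A simple INSERT ... SELECT stands in for a generated ELT procedure.
conn.execute("INSERT INTO stage_customer VALUES (42, 'Acme Ltd')")
conn.execute("""
    INSERT INTO hub_customer (customer_hk, customer_id)
    SELECT 'hk-' || customer_id, customer_id
    FROM stage_customer
    WHERE customer_id NOT IN (SELECT customer_id FROM hub_customer)
""")

rows = conn.execute("SELECT customer_hk, customer_id FROM hub_customer").fetchall()
print(rows)  # -> [('hk-42', 42)]
```

Because both kinds of code run in the target database, deploying the warehouse means deploying exactly these two artifacts: structure definitions and the procedures that fill them.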

In the context of a warehouse designed with the Data Vault model and methods, WhereScape® Data Vault Express™ allows a set of objects, such as the related Hub, Satellite, and Link objects of a customer ensemble, to be bundled together, transported, and installed with ease from the development environment to quality assurance and then into production. This bundle includes the selected objects and related metadata—structure (DDL), processing code (ELT procedures, etc.), and jobs. When installed into an environment, the product determines what DDL changes need to be made if an object already exists (for example, added columns, new indexes, and so on) and constructs the appropriate DDL statements. New ELT code, such as stored procedures, is then installed and compiled in the database.
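The "determine what DDL changes are needed" step can be sketched in a few lines. This is a deliberately simplified illustration of the general technique (schema diffing), not WhereScape's actual logic; the function and table names are invented:

```python
# Compare the columns a table should have with the columns it already has
# in the target environment, and emit only the DDL needed to close the gap.
def incremental_ddl(table, target_cols, existing_cols):
    """target_cols: {name: type}; existing_cols: set of names, or None
    if the object does not yet exist in this environment."""
    if existing_cols is None:
        # Object is new here: generate a full CREATE TABLE.
        cols = ", ".join(f"{n} {t}" for n, t in target_cols.items())
        return [f"CREATE TABLE {table} ({cols})"]
    # Object exists: generate one ALTER per missing column.
    return [f"ALTER TABLE {table} ADD COLUMN {n} {t}"
            for n, t in target_cols.items() if n not in existing_cols]

target = {"customer_hk": "TEXT", "customer_id": "INTEGER", "load_dts": "TEXT"}

# Production already has the first two columns; only load_dts is new.
stmts = incremental_ddl("hub_customer", target, {"customer_hk", "customer_id"})
print(stmts)  # -> ['ALTER TABLE hub_customer ADD COLUMN load_dts TEXT']
```

A real deployment tool also handles indexes, type changes, and dependent procedures, but the principle is the same: derive the delta from metadata rather than hand-writing migration scripts.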

The clear aim—subject, of course, to internal policies—is to automate the deployment activities in order to speed deployment in agile development approaches and to reduce the chance of human error across the full life cycle.

Having deployed the system to production, the next—and ongoing—task is to schedule, execute, and monitor the continuing process of loading and transforming data into the data warehouse. In this phase, jobs defined by WhereScape consist of a sequence of interdependent tasks. For example, one sequence could be to drop certain indexes on a table, load new data to the table, and rebuild the indexes. The administrator creating the job can specify the tasks and their interdependencies to ensure the objects are processed in the correct order. During execution, the tasks run in parallel (up to a specified threshold, to ensure the system is not overloaded), subject to dependency constraints. This feature is particularly useful in a Data Vault 2.0 environment, where the design supports elevated levels of parallelization of load tasks.

To ensure that data consistency is maintained, if a task fails during execution, all downstream dependent tasks are halted. Once the problem has been resolved, the job can be restarted; it picks up where it left off and continues through to completion.
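The scheduling behavior described above—dependency ordering, a parallelism cap, halting downstream of a failure, and restart from the point of failure—can be sketched generically. This is a minimal illustration of the pattern using Python's standard concurrent.futures, with invented task names (drop indexes, load, rebuild indexes); it is not WhereScape's scheduler:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_job(tasks, deps, actions, completed, max_workers=4):
    """Run tasks in dependency order, in parallel up to max_workers.
    Tasks already in `completed` are skipped (restart); any task
    downstream of a failure is never launched."""
    failed = set()
    pending = {t for t in tasks if t not in completed}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Launch every task whose dependencies have all completed.
            ready = [t for t in pending
                     if all(d in completed for d in deps.get(t, ()))]
            for t in ready:
                pending.discard(t)
                running[pool.submit(actions[t])] = t
            if not running:
                break  # remaining tasks depend on a failed task: halted
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                task = running.pop(fut)
                (failed if fut.exception() else completed).add(task)
    return completed, failed

deps = {"load": ["drop_ix"], "rebuild_ix": ["load"]}

def boom():
    raise RuntimeError("load failed")

# First run: the load fails, so the index rebuild is halted.
actions = {"drop_ix": lambda: None, "load": boom, "rebuild_ix": lambda: None}
done, failed = run_job(["drop_ix", "load", "rebuild_ix"], deps, actions, set())
print(sorted(done), sorted(failed))  # ['drop_ix'] ['load']

# After the problem is fixed, restart: it picks up where it left off.
actions["load"] = lambda: None
done, failed = run_job(["drop_ix", "load", "rebuild_ix"], deps, actions, done)
print(sorted(done), sorted(failed))  # ['drop_ix', 'load', 'rebuild_ix'] []
```

Passing the set of completed tasks back into the second call is what makes the restart resume rather than rerun the whole job—the same idea, at small scale, as a scheduler persisting task state between runs.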

As mentioned earlier, a modern analytical environment consists of a combination of data warehouse, marts, and data lake. From an operational point of view, given potential interdependencies of data across these systems, it makes sense to manage this ensemble as a single, logical environment. WhereScape supports this aim, both in terms of its job definitions that span multiple systems and in its provision of a centralized monitoring and logging repository. Here, all job execution activities such as processing times (start and finish), rows loaded, errors, exceptions, and so on are logged. This provides historical information that can be used to track load performance over time, allowing administrators to make any necessary adjustments as data volumes grow.
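A tiny sketch shows the kind of analysis a centralized log repository enables. SQLite stands in for the repository here, and the job name, columns, and figures are all invented for illustration:

```python
import sqlite3

repo = sqlite3.connect(":memory:")  # stand-in for the central repository
repo.execute("""
    CREATE TABLE job_log (
        job TEXT, started TEXT, finished TEXT,
        rows_loaded INTEGER, status TEXT)
""")

def log_run(job, started, finished, rows_loaded, status):
    repo.execute("INSERT INTO job_log VALUES (?, ?, ?, ?, ?)",
                 (job, started, finished, rows_loaded, status))

# Three nightly runs of the same load, as a scheduler might record them.
log_run("load_customer", "2017-07-20T01:00", "2017-07-20T01:12", 10000, "OK")
log_run("load_customer", "2017-07-21T01:00", "2017-07-21T01:15", 12500, "OK")
log_run("load_customer", "2017-07-22T01:00", "2017-07-22T01:21", 16000, "OK")

# Track load performance over time: rows and duration (minutes) per run.
trend = repo.execute("""
    SELECT started, rows_loaded,
           CAST(ROUND((julianday(finished) - julianday(started)) * 1440)
                AS INTEGER) AS minutes
    FROM job_log WHERE job = 'load_customer' ORDER BY started
""").fetchall()
print(trend)
```

Even this toy history makes the trend visible—durations of 12, 15, and 21 minutes as row counts grow—which is exactly the signal an administrator needs to adjust capacity or partitioning before a nightly load overruns its window.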

The smooth, ongoing daily operation of the entire data warehouse environment is a fundamental prerequisite to its acceptance by users and its overall value to the business. Nothing will destroy users’ confidence more quickly than arriving every morning and not knowing whether the warehouse is up and running with the latest data updates.

And yet, there’s more! How do you support the changes in requirements that occur almost continuously when a data warehouse has been successful? Data Vault and Data Warehouse Automation can help here too. That is the topic of the fourth and final post of this series.

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988. Barry is founder and principal of 9sight Consulting. A regular blogger, writer and commentator on information and its use, Barry is based in Cape Town, South Africa and operates worldwide. 
