Data Warehousing Lessons Learned:
A Data Warehousing CMM: Garbage In, Quality Out
Capability maturity model (CMM) integration project models are now available for people, software acquisition, systems engineering and integrated product development. It is a scandal that no similar initiative focuses on improving data management capabilities in the enterprise. Indeed, the irony is that good software is often thwarted by bad data as the platitude "garbage in, garbage out" trumps the process-improvement movement. The benefits of applying such a process to data warehousing include attaining a repeatable, defined, managed, optimized process for transforming data into information and information into knowledge.
The CMM for software development exists because software is a complex artifact created by fallible human beings whose practices are susceptible to improvement. Software systems are so complex and unforgiving of small defects that their development benefits from a well-defined, repeatable, optimized process with stages and a high degree of structure to define and manage quality. It is worth noting that the maturity framework out of which the CMM emerged was inspired by Philip Crosby's book Quality Is Free (McGraw-Hill, 1979). Crosby's quality management maturity grid describes five stages in adopting quality practices. This maturity framework was adapted to software by Ron Radice and Watts Humphrey at IBM and was subsequently brought to the Software Engineering Institute in 1986. Thus, the CMM is not a methodology but a set of guidelines for improving the process of design and implementation as applied to software or (in this case) other abstract subject areas.
Ask any database administrator (DBA) or manager of DBAs what life is like, and the answer will frequently describe a continuous firefight over data management: data warehousing, process exceptions and data integrity issues. Heroics are the order of the day, which is precisely what characterizes the initial level of the CMM. Overcommitment is common. Exceptional experience, leadership and "tribal knowledge" are key success factors. The process of data management is a black box to the user community - data goes in and meaningful information about customers, products, markets, etc. potentially comes out. The point of the data management CMM is to move beyond heroics to a process that is repeatable, defined, managed and optimized (which, of course, are the names of the four stages that follow the initial one).
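As a minimal sketch, the five stages named above can be written down as an ordered type. The class and function names are hypothetical, chosen only for illustration; the one-line comments paraphrase this article's descriptions of each stage.

```python
from enum import IntEnum


class MaturityLevel(IntEnum):
    """The five CMM stages as this article names them (illustrative only)."""
    INITIAL = 1      # heroics and firefighting
    REPEATABLE = 2   # earlier successes can be repeated on similar projects
    DEFINED = 3      # a standard, documented enterprise-wide process
    MANAGED = 4      # enterprise-wide quality metrics
    OPTIMIZED = 5    # continuous process improvement


def next_level(level: MaturityLevel) -> MaturityLevel:
    """Return the next stage, or the same stage if already at the top."""
    return MaturityLevel(min(level + 1, MaturityLevel.OPTIMIZED))
```

Using an ordered enumeration makes comparisons such as `level >= MaturityLevel.DEFINED` read naturally when describing what an organization at a given stage can be expected to do.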
At stage two, the planning and management of data is based on experience with similar projects. The cost, schedule and functionality of data management are tracked. Project discipline is sufficient to repeat earlier successes on data with similar applications. Data requirements and deliverables are baselined. Data modeling is a rigorous discipline, well defined, with a large body of research and experience. A good data model represents the subject matter with a high degree of (data) normalization (usually at least third normal form) to reduce redundancy and update anomalies. Designs such as the star schema are deployed where they can add value - usually in a business intelligence context. Exceptions based on particular scenarios are also well defined (e.g., the customer dimension of a voluminous customer base for a telecommunications firm with millions of customer lines). However, processes often differ between projects, reducing opportunities for collaboration between teams and reuse of data models. The users get visibility into the project at defined occasions such as data review and acceptance of major deliverables, allowing limited participation and control. An information-quality safe harbor is established by executive management, and fear is driven out as the staff is given permission to surface data quality issues without the risk of dysfunctional organizational behavior.
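The point about normalization reducing update anomalies can be illustrated with a small sketch using Python's built-in sqlite3 module. The table names, columns and sample rows are invented for this example and are not taken from any real system. Because the customer's city is stored once rather than copied onto every order, a single update keeps every order consistent.

```python
import sqlite3

# In-memory database; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Acme Telecom', 'Chicago');
    INSERT INTO orders VALUES (100, 1, 250.0), (101, 1, 75.5);
""")

# The city lives in exactly one row, so one UPDATE is enough;
# no order row carries a stale copy of the old value.
conn.execute("UPDATE customer SET city = 'Denver' WHERE customer_id = 1")

rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.amount
    FROM orders AS o JOIN customer AS c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()

for row in rows:
    print(row)
```

In a denormalized design that repeated the city on each order, the same change would have to touch every order row, and missing one would leave the data inconsistent. A star schema accepts some of that redundancy deliberately in exchange for simpler, faster analytical queries.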
In stage three, data is managed as a corporate information asset. A standard data management process is established, documented and integrated into the information supply chain by which the business operates. Management has insight into the technology used to build and operate the corporate information supply chain. All projects use an approved, tailored version of the enterprise's standard data management process. Data is shared via a central repository or a small set of federated repositories. Leveraging the repository, metadata-driven design is implemented, and data models and the information they represent can be reused between projects, applications and systems. The user is able to get rapid and accurate status updates concerning data integrity and availability.
Stage four introduces metrics on an enterprise-wide basis to the quality-improvement process. (Note that in earlier stages this may have been done on a case-by-case basis.) Data defects are tracked, collected and addressed on an enterprise-wide basis to improve the process of managing data as a corporate information asset. Information quality (IQ) is measured as a function of the objectivity, usability and trustworthiness of the data. Key metrics for information quality encompass three areas - representational, procedural and judgmental - and cover not only the form (syntax) of the data, but also the content (semantics). The latter two areas can be especially labor-intensive, requiring peer reviews, professional development, supplementary staffing and outside professional expertise. All three dimensions require an ongoing assessment of capabilities for data management. Management establishes goals and measures progress toward them. The user community can understand the data management issues and risks before a project or operational initiative begins or is implemented.
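As a rough sketch of what tracking data defects might look like in practice, the following computes a defect rate for one check on form (syntax) and one on content (semantics). The record layout, the two rules and the sample data are all assumptions made for illustration; they are not the article's own metrics.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    customer_id: str
    postal_code: str
    signup_year: int
    last_order_year: int


# Hypothetical rules, for illustration only.
def syntax_ok(r: Record) -> bool:
    """Form check: postal code looks like a five-digit US ZIP code."""
    return re.fullmatch(r"\d{5}", r.postal_code) is not None


def semantics_ok(r: Record) -> bool:
    """Content check: a customer cannot place an order before signing up."""
    return r.last_order_year >= r.signup_year


def defect_rate(records, rule) -> float:
    """Fraction of records that fail the given rule (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(not rule(r) for r in records) / len(records)


records = [
    Record("C1", "60601", 2001, 2003),
    Record("C2", "6060", 2002, 2004),    # malformed postal code
    Record("C3", "80202", 2004, 2002),   # well formed, but implausible dates
    Record("C4", "10001", 2000, 2000),
]

syntax_defects = defect_rate(records, syntax_ok)
semantic_defects = defect_rate(records, semantics_ok)
print(f"syntax defect rate:   {syntax_defects:.2f}")
print(f"semantic defect rate: {semantic_defects:.2f}")
```

Record C3 shows why checking form alone is not enough: every field is well formed, yet the content contradicts itself. Rates like these only become a stage-four practice when they are collected consistently across the enterprise and compared against goals set by management.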
At stage five, continuous process improvement is established by quantitative feedback from the data management process, including information quality metrics, and by piloting innovative ideas and technologies. Data management is improved by performing root-cause analysis on a systematic basis and by committing to data design for defect prevention (not mere inspection and removal). The user community and the data management function collaborate to establish a strong win/win relationship.
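One simple way to make root-cause analysis systematic is to tally defects by the cause assigned to them and rank the causes, so that prevention effort goes where it removes the most defects. The sketch below assumes a hypothetical defect log; the cause labels and the function name are invented for illustration.

```python
from collections import Counter

# Hypothetical defect log: (defect id, root cause assigned after analysis).
defect_log = [
    (1, "manual entry"),
    (2, "source system feed"),
    (3, "manual entry"),
    (4, "transformation rule"),
    (5, "manual entry"),
    (6, "source system feed"),
]


def rank_causes(log):
    """Return (cause, count, cumulative share) tuples, most frequent first."""
    counts = Counter(cause for _, cause in log)
    total = sum(counts.values())
    ranked, running = [], 0
    for cause, count in counts.most_common():
        running += count
        ranked.append((cause, count, running / total))
    return ranked


ranking = rank_causes(defect_log)
for cause, count, cumulative in ranking:
    print(f"{cause:20s} {count:2d}  {cumulative:.0%}")
```

A ranking like this supports prevention rather than inspection: instead of correcting each bad record as it is found, the process that produces the most frequent cause is redesigned so the defect cannot be introduced in the first place.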
Lou Agosta, Ph.D., is a business intelligence strategist with IBM WorldWide Business Intelligence Solutions focusing on competitive dynamics. He is a former industry analyst with Giga Information Group and has served many years in the trenches as a database administrator. His book is The Essential Guide to Data ... He can be reached at LoAgosta@us.ibm.com.