Published in DM Review in June 2000.

Meta Data and Data Administration: XML's Uses In Data Warehousing: Getting Data In

by David Marco

Over the past 10 years, data warehousing has proven to be a highly valuable technology that the vast majority of corporations have leveraged to gain a competitive edge in the marketplace. As we enter the next decade, Extensible Markup Language (XML) is poised to accomplish much the same. The one unanswered question is how these two essential technologies will function together.

Virtually all Web sites have been built with Hypertext Markup Language (HTML), which describes how data will be formatted but does not provide information about the data itself. Consequently, this unstructured Web-site data is very difficult to bring into a data warehouse system. XML provides a remedy to this situation by assigning data tags to this Web-site information. To understand how these data tags function, let's use XML to describe the information about a textbook:

Building and Managing the Meta Data Repository

David Marco

John Wiley & Sons
New York
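
As a minimal sketch, the textbook details above might be tagged as shown below. The tag names (`book`, `title`, `author`, `publisher`, `city`) are assumptions chosen for illustration, not a published schema, and Python's standard `xml.etree.ElementTree` module is used only to show that tagged data can be read back by name.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagging of the textbook details; the tag names are
# illustrative assumptions, not a published schema.
BOOK_XML = """
<book>
  <title>Building and Managing the Meta Data Repository</title>
  <author>David Marco</author>
  <publisher>John Wiley &amp; Sons</publisher>
  <city>New York</city>
</book>
"""


def read_book(xml_text):
    """Return a dict mapping each data tag to the text it encloses."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


book = read_book(BOOK_XML)
for tag, value in book.items():
    print(f"{tag}: {value}")
```

Because each value carries its own tag, a program can ask for the publisher or the city directly instead of guessing from page layout, which is what HTML alone would require.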

By adding context to the content on a Web site, XML enables corporations to bring unstructured Web-site data into their data warehouses. This is critical for many companies' analysts, who need this information to make better decisions. Let's walk through an example using a healthcare company. Many doctors who research drugs publish their results on their Web sites. Often the decision-makers in these healthcare organizations want to know about the latest developments in this drug research in order to make better patient-care decisions. To see how XML simplifies this challenge, we will examine Figure 1.

Figure 1: XML Bringing Data into the Data Warehouse

Figure 1 illustrates data being read from a physician's Web site and brought into an XML transformation process (see Figure 1, bullet 1). This transformation process (bullet 3) matches the Web-site data to the corresponding XML schema (data tag layout). Remember that one of the key challenges for XML is to standardize the names and meanings of the data tags. As an industry, IT has had limited success in defining global standards, and I don't expect XML to change this trend. Therefore, we will have to juggle multiple XML schemas in our corporations. Next, the XML transformation process converts the tagged Web-site data into record format by removing the XML data tags, which is important because these tags increase processing overhead. These records are sent to the extraction, transformation and load (ETL) process of the data warehouse (bullet 4). The ETL process will clean, integrate and load this data into the data warehouse and its corresponding data marts (bullet 5). Keep in mind that several ETL tool vendors are looking to expand their current toolsets to include XML transformation functionality, so this XML transformation process (bullet 3) could be completely merged into the ETL process.
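
The transformation step described above can be sketched roughly as follows: check each tagged row against an expected tag layout, then drop the tags to produce plain records for the ETL process. The schema, the tag names and the sample research result are all invented for illustration; a real deployment would use whatever schemas the organization has agreed on.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tag layout ("schema") for one drug-research result.
# Every name here is an assumption made for this sketch.
RESEARCH_SCHEMA = ["drug", "physician", "finding", "published"]

# Made-up sample data standing in for tagged content from a physician's site.
SAMPLE = """
<results>
  <result>
    <drug>ExampleDrug</drug>
    <physician>Dr. A. Example</physician>
    <finding>Reduced symptoms in trial group</finding>
    <published>2000-05-01</published>
  </result>
</results>
"""


def to_records(xml_text, schema, row_tag="result"):
    """Match each row element to the schema and return tag-free tuples."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for row in root.iter(row_tag):
        tags = [child.tag for child in row]
        if tags != schema:
            raise ValueError(f"row does not match schema: {tags}")
        records.append(tuple((child.text or "").strip() for child in row))
    return records


def to_delimited(records):
    """Serialize records as comma-delimited lines, ready for an ETL step."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()


records = to_records(SAMPLE, RESEARCH_SCHEMA)
print(to_delimited(records), end="")
```

The delimited output carries only the values, so the downstream ETL process does not have to parse or store the tags themselves.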

Oftentimes when we think of the Internet we think about business-to-customer (B2C) transactions; however, the potential for business-to-business (B2B) commerce on the Internet is far greater than that of B2C. Many companies are in the business of selling information. XML plays a major role in this effort, as it allows B2B transactions to be brought directly into a data warehouse. Figure 1, bullet 2 shows how the B2B trading partner sends information into the XML transformation process. As before, not all B2B trading partners will use the standard XML schemas, so multiple XML schemas will need to be maintained. This process (bullet 3) uses the XML schemas stored in the XML database and moves these converted transactions into the ETL process of the data warehouse (bullet 4). The ETL process then integrates this information into the data warehouse and its data marts (bullet 5).
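
One way to picture maintaining multiple schemas is a small registry that records, for each trading partner, how its data tags map onto the warehouse's column names. The sketch below is only an illustration under stated assumptions: the partner names, tag names and `sales` table are hypothetical, a Python dict stands in for the XML database of schemas, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema registry: per partner, the row tag and the mapping
# from that partner's data tags to warehouse column names.
SCHEMAS = {
    "partner_a": {"row": "order", "fields": {"sku": "product_id", "qty": "quantity"}},
    "partner_b": {"row": "Sale", "fields": {"ItemCode": "product_id", "Units": "quantity"}},
}


def convert(partner, xml_text):
    """Apply the partner's schema and return rows keyed by warehouse columns."""
    schema = SCHEMAS[partner]
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter(schema["row"]):
        row = {"partner": partner}
        for tag, column in schema["fields"].items():
            row[column] = element.findtext(tag, default="").strip()
        rows.append(row)
    return rows


def load(connection, rows):
    """Stand-in for the ETL step: clean the quantity and load the rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(partner TEXT, product_id TEXT, quantity INTEGER)"
    )
    cleaned = [{**row, "quantity": int(row["quantity"])} for row in rows]
    connection.executemany(
        "INSERT INTO sales VALUES (:partner, :product_id, :quantity)", cleaned
    )


warehouse = sqlite3.connect(":memory:")
load(warehouse, convert(
    "partner_a", "<orders><order><sku>X1</sku><qty>3</qty></order></orders>"))
load(warehouse, convert(
    "partner_b", "<Sales><Sale><ItemCode>X1</ItemCode><Units>2</Units></Sale></Sales>"))

total = warehouse.execute(
    "SELECT SUM(quantity) FROM sales WHERE product_id = 'X1'"
).fetchone()[0]
print(total)
```

Keeping the per-partner differences in the registry means a new trading partner is added by registering one more mapping, while the load step and the warehouse table stay unchanged.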

As we can see, XML is a critical technology, and it is coming to a data warehouse near you!

David Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence and is the world's foremost authority on meta data. He is the author of Universal Meta Data Models (Wiley, 2004) and Building and Managing the Meta Data Repository: A Full Life-Cycle Guide (Wiley, 2000). Marco has taught at the University of Chicago and DePaul University, and in 2004 he was selected to the prestigious Crain's Chicago Business "Top 40 Under 40." He is the founder and president of Enterprise Warehousing Solutions, Inc., a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing and meta data repository technologies. He may be reached at (866) EWS-1100.

Copyright 2005, SourceMedia and DM Review.