-
Marketplace
-
Channel Resources
Articles from this Site
Composite Software Releases Data Integration Strategy Recommendation Tool
IBM Introduces New Informix Dynamic Server
Quantrix Announces a New Release of Quantrix Modeler
IBM Acquires InfoDyne to Create Platform for Market Data Delivery
Air and Space Expeditionary Force Enlists WebFOCUS
White Papers
Pragmatic Approach to Compliance Data Collation
Informatica - Handling Variable Length Files Using XML
Putting Metadata to Work to Achieve the Goals of Data Governance
Enterprise Information Management - Insights and Strategies into the Direction of EIM
Automated Analysis Technology
Web Seminars
Making the Business Case for Predictive Analytics: Innovative Strategies for Maximizing ROI
Master Data Management: Best Practices for Success
Modeling Unstructured Data
Books
Data Management: Databases and Organizations, 3rd Edition
Data Modeler's Workbench: Tools and Techniques for Analysis and Design
Effective Databases for Text & Document Management
Mobile Handheld Devices - Enabling Enterprise Communications and Data Management
Mobile Data Management (MDM 2002), 3rd International Conference
Dataless Hybrid Approach Solves Many Universal Data Integration, Sharing and Interoperability Problems
Mid-to-large companies and federal agencies are moving away from infrequent batch/incremental one-way updates towards global 24x7, real-time interactions.
Service-oriented architectures (SOAs) and, in particular Web services, trend away from centralized monolithic applications and centralized data storage toward distributed applications as a service and distributed data as a service. In many cases, moving data to a remote location raises data ownership and storage issues of responsibility, accountability, security, privacy and legal restrictions. In addition, Sarbanes-Oxley regulations include the legal requirement to store and access original data or copies of original data, not the modified data that typically resides in data warehouses.
A recent DM Review article discussed major database products that offer universal data integration, including structured and unstructured data, and binary large object files (BLOBs) such as images, audio and video.1 However, when the large database vendors refer to data integration, they are generally talking about moving and storing all data in a single database - a data warehouse. Other data integration vendors leave data at its source and deal with it there, as with federated database systems, which fit better with the increasingly favored SOA approach than data warehouses. In fact, data warehouses seem to be in conflict with global 24x7, real-time interactions and the SOA approach.
Data Integration, Sharing and Interoperability Options
Some of the reasons that data warehouses are chosen over federated database systems in many situations are to have:
- Clean and transformed (usable) data;
- Indexes that support a wide range of queries, including fuzzy matching;
- Name variation, nearest neighbor, etc.;
- Complex queries without shutting down data source systems;
- Aggregations and other calculations; and
- Support for business and other intelligence applications.
Federated database systems typically cannot support any of these operations, besides fetching data for applications and populating associated databases to perform these operations. Enterprise search is an alternative approach, but generally cannot cope with the more structured data requirements of business monitoring, analysis and reporting. Therefore, the data integration, sharing and interoperability options facing organizations are usually a difficult choice between a data warehouse or a federated database system, as enterprise search is generally considered a distant third choice.
This article discusses a fourth choice: a dataless hybrid of data warehouses, federated database systems and enterprise search.
Problems with Universal Access to Structured Data
Many of the problems in achieving universal access to structured data involve having to deal with multiple variations in:
- Data source schemas;
- Data types and lengths;
- Data content distribution (domain values);
- Nonstandard or unclean data, e.g., typos;
- Combined and/or separated data fields, e.g., full name to last name, first name, and middle name/initial and vice versa;
- Indexes and types;
- Query processing support;
- Connections to data sources;
- Support for query standards such as SQL;
- ID authentication methods;
- Access security methods; and
- Languages.
The seemingly simplest solution to overcoming these variations is to build a data warehouse where 100 percent control can be enforced on all of the above. Unfortunately, accessing, cleansing, transforming and moving data in bulk from multiple disparate data sources and storing it all in a one-size-fits-all data warehouse has its own problems. These include time and effort, and therefore cost, and the data ownership and storage issues mentioned previously.
Problems with Universal Access to Unstructured Data
Problems with universal access to unstructured data are that there is little to no sense-making required to index unstructured data, query success can be low due to the ambiguity of words and human review is generally required to make sense of results. Some search engines are now incorporating knowledge management tools to bring more sense and structure to searches on unstructured data.
Using Unstructured Text Search on Structured Data
Another option would be to use an unstructured text search engine on structured data, as offered by some search engines. However, this only works to a limited extent for simple cases, because the following problems typically remain:
- In the case of relational databases, only records in tables with search terms are returned; no JOINs between tables are made and no validation that tables are connected in any way.
- No range queries;
- No nearest neighbor matching;
- No allowance for data transforms;
- Only return results where an exact match is found, e.g., for name, unless some form of fuzzy match index is used; and
- Records found regardless of where a search term occurs, e.g., a street address name could be found instead of a person's name, unless some form of column-level constraints are predefined.
Advantages and Disadvantages of Data Warehouses Advantages
Data warehouses tend to have a high query success, because they have complete control over the main related areas of data management systems:
- Clean data,
- Indexes - multiple types, including text, and
- Query processing: multiple options.
Other advantages include:
- High performance,
- No index or query load on data source systems,
- Almost any data source,
- Preaggregated and precalculated fields,
- Archive,
- Denormalized views,
- Data mining/link analysis options,
- Minimum existing system interference,
- Data sources not aware of queries, and
- Security access managed in data warehouse (schema, row, column and data element).
Disadvantages
There are considerable disadvantages involved in moving data from multiple, often highly disparate data sources to a one-size-fits-all data warehouse. It translates into a long implementation time, high cost, lack of flexibility, dated information and limited capabilities.
- Major data schema transforms from each of the data sources to one schema in the data warehouse can represent more than 50 percent of the total data warehouse effort.
- Data owners lose control over their data, raising responsibility, accountability, security, privacy and legal issues.
- Long implementation time and associated high cost.
- Adding new data sources takes time, with associated cost.
- Limited flexibility of use and types of users - requires multiple, separate data marts for multiple, different uses and users.
- Typically, data is static and dated.
- Typically, there are no data drill-down capabilities.
- Typically, it cannot actively monitor changes in data sources.
- Additional cost and maintenance of the data warehouse infrastructure.
Advantages and Disadvantages of Federated Database Systems Advantages
Federated database systems deal with data sources as they are, and as such, they offer the following advantages:
- Data remains at source,
- Latest data available,
- Little or no additional storage,
- Drill-down capabilities,
- Assuming a denormalized front-end schema, no major schema transform,
- Good for standard applications and related data sources, e.g., Peoplesoft, SAP and Siebel.
Disadvantages
Federated database systems tend to have a lower query success than data warehouses, because they do not have complete control over the main related areas of data management systems. Other disadvantages include:
- Unclean data;
- Limited indexes, e.g., no text;
- Limited query processing;
- Imposes a heavy query load on data source systems and adapters;
- Low query performance - total system as fast as the slowest individual database;
- Queries tend to be "hard-wired" or preconfigured;
- No precalculated aggregation or calculated fields;
- Limited types of data sources;
- Limited to no schema denormalization;
- Limited to no data mining/link analysis;
- Limited to no row, column and data element security (unless data source supports it);
- Data sources aware of queries;
- Cannot actively monitor data sources; and
- No archive.
Ideal Universal Access to Data
There is an approach that ideally combines the advantages of both data warehouses and federated database systems through external indexing and query processing in a layer that resides above a data source, similar to an adapter in a conventional federated database system. Similar to data warehouses, data is read, transformed and indexed with two major exceptions: data is not stored separately, but remains at source, and usually, no schema transforms, as indexes are built as per data sources. Queries are resolved 100 percent in the external index and query layer, including complex JOINs, range queries, nearest neighbor matches and text search, resulting in pointers back to the source data. Then, using the result-set pointers, results data is read, transformed and merged with other data sources' results data, and passed back to the calling application in the desired format. Indexes are maintained in batch, incremental or near real-time through various methods.
This dataless hybrid approach offers the following advantages and disadvantages, where similarities to data warehouses and federated database systems are marked as DW and FED, respectively:
Advantages
- High query success (DW) due to data in indexes and results clean and usable; consistent and multiple indexes across disparate data sources, including text; and complete control over query processing.
- Data remains at source (FED);
- Latest data (FED)
- Almost no index or query load on data source systems (DW)
- No major ETL - indexes built as per data source schemas (FED)
- Almost any data source - structured, semistructured and unstructured (DW maybe);
- Actively monitor data sources - index updates serve as constant monitors (only active or real-time DW);
- Denormalized schema indexes (DW);
- Preaggregated and precalculated field indexes (DW);
- Link mapping and analysis/data mining; also for performance;
- Highly flexible - accommodate multiple uses and users;
- Row, column (and data element) security indexes (DW maybe);
- User-level access to data sources - no special connectors or adapters;
- Drill-down capabilities; and
- Data sources only aware of low-level results; not queries (DW 100 percent isolated).
Disadvantages
- Establishing index updates - anywhere from two hours to two weeks per data source
- Indexes require storage - 20 to 80 percent of actual source data (typically, 60 percent); not allocated space; not indexes; not specialized views; not logs; not user profiles; not anything else, e.g., a "10 TB data warehouse" is reduced to 500GB of actual data, of which 50 percent needed to be indexed and queried, leading to 150GB of actual index storage.
- No inherent archive - although mirrored copies of data sources can be indexed, queried and accessed like an archive; archives themselves can be indexed, queried and accessed; and results data can be stored and indexed for further analysis, communities of interest, audit, etc.
Until recently, there have been three conventional choices for data integration, sharing and interoperability. A fourth choice is available; a dataless hybrid that ideally combines the advantages of all three conventional choices and at the same time overcomes most of the disadvantages of all three conventional choices. Now, a nonstandard data source with a proprietary API and perhaps only a single primary key index can appear to an application/Web service as a standard driver and SQL data source, with all data indexed and available for advanced queries.
Reference:
1. Guy Creese. "Information as a Service." DM Review, July 20. 2006.
Gavin Robertson is chief technology officer of WhamTech Inc., a privately held VLDB technology company that develops breakthrough database-related products and technologies designed for enterprise-scale database and search systems. Robertson has almost 30 years of experience in IT and database application development and has developed an economics application, creating a global enterprise database application that now sells commercially. He can be reached at gavin.robertson@whamtech.com.
For more information on related topics, visit the following channels:


