DM Review | Covering Business Intelligence, Integration & Analytics
Document Warehousing & Content Management:
ETL Meets Content Management

  Column published in DMReview.com
October 1, 2001
  By Dan Sullivan

Oracle has unleashed the marketing machine to trumpet the virtues of Oracle9i. Improvements such as multiple block sizes and more dynamic parameters may make DBAs' lives a little easier, and performance improvements in partitions, function-based indexes and materialized views are welcomed by data warehouse designers. A significant improvement has also been made in the content management arena: a new tool, Oracle Ultra Search, gives developers dealing with distributed content the equivalent of an extract, transform and load tool for unstructured text.

Ultra Search offers the ability to dynamically monitor several different types of unstructured content sources and catalog meta data about documents in a centralized repository. In its most basic form, it is akin to a Web search engine for the enterprise. Users specify content sources (the Web, file systems, e-mail servers or databases), how often to check each source and a few other parameters to control the depth of search and the file types to examine. The Ultra Search crawler then compiles meta data about each document and stores that information in a database. (The content itself stays in its original repository, so this tool should not be considered a traditional document management system.) Of course, Oracle provides full-text indexing on documents through Oracle Text.
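The crawl-and-catalog idea described above can be sketched in a few lines. This is a minimal illustration, not Ultra Search's actual interface: it handles only a file system source, and the function name `crawl_file_source`, the `doc_catalog` table and its columns are all assumptions made for the example. It uses SQLite as a stand-in for the central repository.

```python
import os
import sqlite3
from datetime import datetime, timezone


def crawl_file_source(root, extensions, max_depth, conn):
    """Walk a directory tree and catalog meta data about matching files.

    Only meta data is stored; the files themselves stay where they are.
    The table layout is hypothetical, chosen for this example.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_catalog ("
        "path TEXT PRIMARY KEY, file_type TEXT, "
        "size_bytes INTEGER, modified TEXT)"
    )
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    cataloged = 0
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # depth limit reached: do not descend further
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in extensions:
                continue  # file type not selected for this source
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            modified = datetime.fromtimestamp(
                info.st_mtime, tz=timezone.utc
            ).isoformat()
            # Re-crawling the same source refreshes the existing row.
            conn.execute(
                "INSERT OR REPLACE INTO doc_catalog VALUES (?, ?, ?, ?)",
                (full_path, ext, info.st_size, modified),
            )
            cataloged += 1
    conn.commit()
    return cataloged
```

Calling `crawl_file_source("/shared/docs", {".txt", ".doc"}, 2, sqlite3.connect("catalog.db"))` on a schedule would approximate the "how often to check the source" parameter; the path and database file name here are placeholders.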

So, what's the big deal? Crawlers have been around almost as long as the Web; the content management space is filled with vendors offering Web-based tools to manage intranets; and search tools such as Autonomy are available for enterprise-scale operations. The significance of this tool, and of others such as IBM's Enterprise Information Portal and InStranet's InStranet 2000, is that the incorporation of such a tool for distributed content shows that unstructured content is now recognized as an essential element of enterprise information assets, one that needs management as much as structured data does. Relational databases and the repository model are as much fixtures of current development practices as portals and Java 2 Enterprise Edition (J2EE) architecture. To successfully control the information assets of an organization, we need to handle unstructured text in a structured manner - in a relational database with enterprise-level tools.

There are a few broad levels for structuring text, or content in general. At the first level, content, such as a word processing document, is treated as a binary large object and stored along with simple identifying attributes such as a document ID. This level of management works best for vertical applications with simple storage and retrieval requirements.

At the second level of structuring, additional attributes about the type of content are gathered. Typically, these include file type, creation and modification dates, author and access control attributes. With this additional information, more flexible retrieval is possible. However, for the most part, we are still dealing with superficial features.

The third level of structuring is the most useful because it gets inside the document to answer the question, "What is this document about?" Third level structures include full-text indexes, thematic or topical indexes, and summaries. We've had tools to solve parts of the structuring problem, such as full-text indexing programs, thesauri for describing relationships between terms and linguistic tools for creating summaries and extracting key features.
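The three levels can be made concrete with a small sketch. The `Document` record below is a hypothetical layout, not any vendor's schema: the first two fields correspond to level one, the next two to level two, and the `terms` set plus the inverted index to level three. Only full-text indexing is shown; thematic indexes and summaries would need linguistic tools beyond this example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: int                 # level 1: simple identifying attribute
    content: bytes              # level 1: opaque binary large object
    file_type: str = ""         # level 2: attribute about the content
    author: str = ""            # level 2: attribute about the content
    terms: set = field(default_factory=set)  # level 3: what the text says


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_full_text_index(docs):
    """Build an inverted index mapping each term to the ids of documents using it."""
    index = defaultdict(set)
    for doc in docs:
        doc.terms = set(tokenize(doc.content.decode("utf-8", errors="ignore")))
        for term in doc.terms:
            index[term].add(doc.doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With only levels one and two, a retrieval request is limited to filters such as `file_type` or `author`. The level-three index is what allows a query like `search(index, "sales policy")` to select documents by what they are about.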

Of course, we've had database management systems to manage the output of any of these as well. The problem has been lack of integration. Oracle Text and similar tools made significant inroads into integrating unstructured text into OLTP and decision support systems by providing both a storage/retrieval mechanism and content meta data extraction tools, such as theme identification and summary generation programs. That state is analogous to data warehousing five to seven years ago, when we had the means to store and aggregate numeric data but few options other than custom programs for extracting, cleansing and loading the data. With tools such as Ultra Search, we are seeing the emergence of content management tools analogous to data warehousing extract, transform and load tools.

This emergence implies two things. First, vendors understand that organizations need to manage unstructured text that is distributed throughout the enterprise, not just what is intentionally published to the intranet in a dedicated content management system. Customer service representatives need access not just to sales records and return merchandise authorizations, but also to customer e-mails, policy memos and other documents about customers. Users will want functionally related information (e.g., sales figures, product descriptions and marketing material) accessible from a single point. This leads to the second point: Both structured and unstructured data need to be integrated along functional lines. When developing a sales proposal, we do not think linearly, considering numeric measures such as past sales and moving averages first and only then turning to unstructured information such as conditions, past contracts and competitor offerings. Decision support and content management systems require integration to support the dynamic way users think about problems.
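As a rough sketch of what integration along functional lines might look like, the example below gives a customer service application one access point that returns both structured sales records and cataloged documents for a customer. The table names, columns and the `customer_view` function are illustrative assumptions, and SQLite stands in for the enterprise database.

```python
import sqlite3


def create_schema(conn):
    """Create one structured table and one document catalog (hypothetical layout)."""
    conn.executescript(
        """
        CREATE TABLE sales (customer_id INTEGER, order_date TEXT, amount REAL);
        CREATE TABLE customer_docs (customer_id INTEGER, doc_type TEXT, location TEXT);
        """
    )


def customer_view(conn, customer_id):
    """Return structured and unstructured information about one customer together."""
    sales = conn.execute(
        "SELECT order_date, amount FROM sales "
        "WHERE customer_id = ? ORDER BY order_date",
        (customer_id,),
    ).fetchall()
    # Only document locations are returned; the content stays in its source.
    documents = conn.execute(
        "SELECT doc_type, location FROM customer_docs "
        "WHERE customer_id = ? ORDER BY location",
        (customer_id,),
    ).fetchall()
    return {"sales": sales, "documents": documents}
```

The design choice here is that both kinds of data share a functional key (the customer), so the caller does not need to know which system holds the numbers and which holds the e-mails and memos.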



Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at dsullivan@ballstongroup.com.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.