Warehousing Web Log Data, Part 4
Editor's Note: This article is the fourth and final installment in Ralph Hughes' series on warehousing Web log data for shopping portals. Part 1 was published in the September 15, 2000 issue of DM Direct; Part 2 was published in the September 22, 2000 issue of DM Direct; and Part 3 was published in the October 6, 2000 issue of DM Direct.
Slow Changes and High Volume
Though the base cubes presented in Part 3 will meet the portal company's warehouse need for day- to- day business intelligence, one naturally wonders which of the dimensions will be subjected to the challenges of slowly changing attributes and how the warehouse should aggregate the cubes presented in order to cope with the high data volumes these cubes would amass during the long term.
Following Kimball's writings on data marts, there are three ways to handle slowly changing dimensional data - update the attributes for the new values, record the last value in a "prior" value column or track history. The first two types are relatively easy to process, so we will focus on the track history of slowly changing dimensions. When they change, the transformation facility should generate of a new record altogether with the updated attribute values, and mark the prior dimension record with a "good through" date. Fortunately only four of the 12 dimensions presented so far should require such treatment: customer, inventory, category and affiliate referral.
Category is probably the most obvious dimension that should require the warehouse to track the history of its changes. Given that engineering will be laboring to improve the usability of the site, the improvements cannot be measured intelligibly if the warehouse does not maintain an image of the prior organizations of inventory items the company has tried.
Of the attributes in the customer dimension, the local fields require the warehouse to track history, the most obvious reason being that marketing - as it looks back to determine which branding campaigns enticed the most new merchants and advertisers to consider the site - must have a steady mark on where the potential customers were located when each advertising effort was made.
In the inventory dimension, the ads displayed by the site will probably not change once the advertiser initiates the campaign. But, for the offers, it will be very important to track changes to site, URLs and titles that identify them. As we discussed business requirements, we noted that the warehouse will aid accounts receivable (A/R) in investigating any challenges by customers to their invoices. To do so, A/R must be able to convincingly demonstrate to the customer that there were a specific number of clicks on a given offer during a given span of time. Should the merchants change the URLs of the linked offers in the middle of the month, the warehouse must reflect this change, otherwise customers could easily claim that a given URL did not exist for the entire month and, therefore, all of the A/R numbers must be suspect. Finally, the warehouse should track the inventory item's history each time it is modified, for if this attribute was simply updated, content management would lose sight of the trend in its staff's efficiency in maintaining the inventory of offers.
Concerning the affiliate dimension, a change in any of the attributes indicates a new affiliate, except for affiliate type. It is conceivable that an established site could change its format to where it actually evolves into another type of site - from a pure information source, for example, to an actual merchant by the simple addition of a catalog and shopping cart. Marketing would be quite interested in the impact of this sort of change on the number and quality of referrals the site made to the portal and, thus, the history of this dimension must be retained.
As one can easily imagine, with thousands of merchants, tens of thousand of offers and hundreds of thousands of shoppers generating millions of clicks per day, the base cubes described can get very large. Current hardware should be able to support the base cubes with, perhaps, six months of data in them. But, for BI reporting over the long term (typically three years), the portal company will need to aggregate these cubes to keep them workably small. Luckily, the attributes that cause the largest of attributes to balloon in size are of short- term interest.
Roughly speaking, the average dimension becomes too large when it approaches 50,000 records. At this point, the speed of the preliminary queries for the parameter values upon the dimension tables, when run on a mid-range processor, begin to tax a user's patience. Of the dimensions presented, customer, item, sales order (SO), shopper and the three clickstream dimensions (visits, searches and queries) will exceed this approximate limit with three years of data in the cubes.
Customers can be easily be generalized in order to permit cube aggregation by dropping those attributes that identify individual merchants and advertisers. Furthermore, ZIP codes can be generalized into standard metropolitan statistical areas (SMSAs). These abstractions are safe to make because the portal company's long-term interest in this dimension is to look back on past branding campaigns to see which generated the most interest among potential new customers. SMSAs are a sufficient level of resolution to evaluate the impact of this broadcast advertising. This generalization can bring the expected size of this dimension with three years of data down from 10 million to one or two thousand records.
The long-term interest in inventory item is to twofold. First, the company will want to document and trend the portal's indirect relationship with manufacturers and distributors as a basis for more effective negotiating on advertising contracts. Second, the company will want to keep an eye on the maintenance rates of offers as the months go by. Naturally, the item type (ads versus offers) needs to be retained in the aggregated cubes and, as suggested, the manufacturer and brand. The date fields can be generalized to their calendar months, and this step will make date created and date approved resolve, for the most, part to the same month, making only date approved (i.e., the date the offer or ad went online) of interest. During a three-year period, a portal site will typically have manufacturers and brands in the order of magnitude of 10,000 to 100,000, making this dimension large enough. Accordingly, the two remaining data fields (month approved and modified) should be moved to month dimensions of their own, saving the item dimension three orders of magnitude (36 x 36).
The SO dimension can be similarly reduced by generalizing away from individual SOs to groups of SOs as defined by the bill-to company and bill-to parent, attributes which will allow the portal company to report on trends with major customers. Start and end date attributes now belong in a month dimension.
The portal company's long-term interest in the shopper dimension can be summarized into three questions: What ad campaigns brought in the most new shoppers? What long-term shopping populations can we boast to merchants in a given locale? What are the trends in lost versus retained shoppers? Accordingly, we can drop the cookie data from this dimension when we aggregate because these questions have no need for the identity of individual shoppers. Locale attributes can be generalized to ZIP code but not to SMSA because smaller merchants can be interested in tightly defined locales. Like other dates, the last visit date can be abstracted to last visit month and, optionally, moved to its own dimension in order to save two orders of magnitude. Taken all together, these suggestions will reduce the shopper dimension from around 10 million to less than 10,000 records.
For long- term clickstream analysis, the visit and search dimensions no longer need to identify individual visits and searches, but can be generalized to the point where they identify classes of records by whether they were early instances (first, second or third visit, for example) versus "all others" and what was their peak event. In the three-year cubes, these steps will reduce visit and search from about 10 and 100 million records, respectively, down to less than one hundred each.
For the query dimension, there is still the need for some detail, mostly so the company can still track trends in the products, brands and merchants requested over the long term. The individual identity of the query can be dropped, similar to the summaries made on its parent dimensions, and its ordinal number can generalized in the same fashion. These two alterations alone will bring the number of query records in the three-year cube down from more than 1 billion to somewhere in the neighborhood of a 100,000.
This article has explored the challenge of business intelligence for the shopping portal company and introduced a series of eight dimensional models that address this need using data from the portal's Web server logs. The dimensions of customer, shopper (user), inventory, visit, search and query are basic to most Web enterprises, so it should be no surprise that the OLAP solution for the shopping portal will address the needs of many commercial sites of the Internet.
Although the models presented can only be a sketch of the detailed solution required for a Web company, they will provide a quick orientation for the BI sponsor and project manager as to how a solution might generally appear and what capabilities it should have. The models will also serve to kick-start the analysis of source data and serve as a good framework for dimensional design. Furthermore, it will allow the site engineers to get an early start on extending the standard Web logs to capture the data elements the warehouse will require, thus removing an obstacle that has often caused major delays in Web warehouse implementations.
For more information on related topics visit the following related portals...
Ralph Hughes, President of Ceregenics, Inc.
Provided by IndustryBrains
TechExcel CRM sets the standard for high-end CRM: powerful, configurable, affordable and easy to use.
|Intuit Track-It! Help Desk & CRM Software|
Intuit IT Solutions provides Track-It! - the leading help desk software solution for employee & customer self-help, call tracking, problem resolution, remote control, asset management, LAN/PC auditing, and electronic software distribution. Free demo
|Customer Relationship Management for IT|
Web-based CRM and more with Autotask: Great business management software optimizes resources and track billable project and service work. Get a demo, then try it free with sample data. Click here for your free trial!
|Online Data Mining Courses|
Predictive analytics (neural nets, CART, logistic reg., k-nearest neighbor...), clustering, homeland security applications, text mining. Interact with leading experts over 4 weeks. No set hours - go online at your convenience. 10+ hours per week.
|TopLine Leadership Sales & Sales Mgmt. Training|
Offering two customized workshops. One for sales managers to improve coaching and leadership skills; and one for sales people to improve consulting/solution selling skills.Visit our website to download free report.
|Click here to advertise in this space|