SignUp for Free Exclusive Services:  Portals    eNewsletters    Web Seminars    dataWarehouse.com    DM Review Magazine  
Covering Business Intelligence, Integration & Analytics  Advanced Search  


The Link is the Thing, Part 2
Part I of this series (August 2003 issue of DM Review) reviewed the work in network analysis of complex systems, particularly the recent research into the smallworld (SW) property, aristocraticegalitarian (AE) distinction and tipping points. Insights are popping up in strange places. Scientists from anthropology to zoology are making significant progress in understanding their particular complex systems by applying these concepts. These systems have a surprising efficiency interconnecting elements, along with unanticipated tipping points caused by minor stimuli. Understanding the underlying structure of interactions among numerous elements is the determining factor to explaining and even predicting system behavior. This installment concentrates on the implications to the business intelligence (BI) and data warehousing (DW) fields. In particular, how can the concepts of the SW property help us understand the data in our enterprise data warehouse? Additionally, how can we leverage those insights to solve practical problems? To answer these questions, this article suggests a methodology called Associative Link Analysis (ALA). Enterprise data richly describes many complex systems essential to our business. These systems are composed of a loosely coupled network of many interacting elements such as customers, suppliers and distributors. The intent is that, by applying ALA to our enterprise data, we can derive insights into a specific business problem and apply actionable information to alleviate that problem. The ALA methodology is a four stage process, as follows:
Let's examine a critical part of this methodology: translating the warehouse schema into an associative graph. ER Schema to Associative GraphThe challenge of the association specification step is to abstract, from the overwhelming complexity within the enterprise data warehouse, those aspects giving us the greatest insights into associations among customers, products, etc. In particular, we need to distill from entityrelationship (ER) schema those key associations into a graph structure consisting of nodes (representing the entity instances as the unit of analysis) connected by links (representing the association among entity instances).^{1 } Analysis of a graph structure requires an adjacency matrix.^{2} This matrix has an equal number of rows and columns corresponding to each of the nodes. For a directed graph with N nodes and M links, the adjacency matrix G is defined as: g_{ij} = 1 where i != j and i = 1,2...N(1) where a link from node i to node j exists; otherwise, g_{ij} is zero. The diagonal (where i = j) is usually undefined because it implies a node with a link back to itself. For undirected graphs (whose links have no direction), matrix G is symmetrical so that g_{ij} is always equal to g_{ji}. When the strength of a link is important, the adjacency matrix may also be valued, so that g_{ij} may be any nonzero number to indicate the link's strength. A valued adjacency matrix is usually normalized if there is a specific maximum for the strength. In this section, we will examine various schemas (recursive, manyto many and star) and illustrate the generation of associative graphs. Recursive SchemaGraphs are explicitly represented in relational databases through the recursive schema. A schema is recursive when it relates an entity to itself, such as supervisor to employees or billof materials assemblies to parts. The simplest example is the self referent schema shown in Figure 1. The table EMPLOYEE has the primary key of empID and a foreign key attribute of supervisorID, which points back to EMPLOYEE.
An instance table shows some typical data, which directly generates the corresponding graph and adjacency matrix, as shown in Figure 2. The three links in the graph are represented in the adjacency matrix with a "1" for n_{1}*n_{2}, n_{1}*n_{3}, and n_{3}*n_{4}.
Graphs generated from a selfreferent schema are often uninteresting because they are limited to hierarchical structures (i.e., every node will have a maximum of one link into it). A more interesting recursive schema is the classic billofmaterials (BOM) schema as used in manufacturing systems. A variation that could be used by a database supporting a customer referral program is shown in Figure 3. An existing customer submits another person's name for some promotion and receives a reward. The program tracks the productivity of this referral by the units purchased. The table CUSTOMER has the primary key custID and some attribute ADDRESS. An instance of table REFERRAL is created every time the first person refers the name of another person.
Note that two onetomany relationships tie the two tables together. Going counterclockwise (down REFERS and up REFERRED BY), a forward explosion of "who referred whom" would be listed. Going clockwise, a reverse explosion of "who was referred by whom" would be listed. From the instance table, the graph and adjacency matrix can be directly generated, as shown in Figure 4. The graph is directed because it is important to indicate the direction of the referral.
Although it may not be permitted in an actual referral program, customer #1 refers customer #2, and sometime later customer #2 refers customer #1. This generates a bidirectional link between the two customers. Also, note that customer #3 is referred by more than one person. The adjacency matrix in the center is not valued so that all links are of equal strength. However, the adjacency matrix on the bottom is valued based on the number of units purchased. Note that person #2 has the best performance because this person has referred the most people and the most units purchased. ManytoMany SchemaA more common pattern in warehouse schemas is the Vshaped manyto many relationship between two entities. The theme is "guilty by association." The instances of one entity type are associated together based how they occur in relationship with another entity. For example, the table CUSTOMER is related to table PRODUCT through intersection table ORDER, as shown in Figure 5.
The instance table generates a different type of graph than before, as shown in Figure 6. First, there are two types of nodes  one for customers (circles) and another for products (triangles). Second, the links are not directed because the direction between customers and products is not meaningful. This is a bimodal graph  a graph having two types of nodes.^{3} It often appears in social networks where people have specific affiliations with each other (such as friendships, club memberships and board of directors).
In the center, the matrix is an affiliation matrix, which is different than an adjacency matrix. Note that it is a 4x3 matrix whose rows represent the four customers and whose columns represent the three products. Like an adjacency matrix, there is a "1" when a link is present and "0" otherwise. Like a valued adjacency matrix, the affiliation matrix can also be valued to indicate the strength to a link. In this example, the sum of a specific product ordered by a specific customer is indicated as the strength of that link. Directly analyzing the affiliation matrix is difficult. The usual approach is to project the affiliation matrix into two adjacency matrices, one for each entity involved. The dualism with a bimodal graph results in two projections onto normal (unimodal) graphs.^{4}
Consider the association among products based on the orders by customer, as shown in Figure 7. From the instance table, note that Customer 1 purchased both Product a and Product c and thus associated the two products. That resulted in the link between the two triangles and the n_{a}*n_{c} and n_{c}*n_{a} items in the adjacency matrices. The valued adjacency matrix indicates "14" as the sum of "12+2." Likewise, Customer 2 purchased both Product b and Product c. Consider the association among customer based on the orders for products. It is like looking at the same coin but from a reversed perspective. Figure 8 shows the instance table ordered on product ID instead of customer ID.
Note that Product a was purchased by both Customer 1 and Customer 4, thus associating the two customers. That resulted in the link between the two circles and the n_{1}*n_{4} and n_{4}*n_{1} items in the adjacency matrices of Figure 9. The valued adjacency matrix indicates "11" as the sum of "2+9." In general, the adjacency matrix for the first entity is derived from the (nonvalued) affiliation matrix A by: g_{ij} = (sum) a_{ik}*a_{jk} for k over columns of the second entity; while the adjacency matrix for the second entity is: g_{ij} = (sum) a_{ki}*a_{kj} for k over rows of the first entity. This projection technique identifies the cliques (i.e., associations among nodes that are fully linked), first within products and then within customers. The important aspect is the overlap among cliques, identifying the similarities among products and among customers. The nodes that are common across cliques play a critical role.
Star SchemaThe final schema is the star schema, ever present in data warehousing. The focus is on a set of dimensions associated with a key event in a business process. Figure 10 lists some common examples of star schema patterns.^{5}
Figure 11 shows a generalized star schema, consisting of a fact table F and N dimension tables D_{1}, D_{2}, ... D_{n}. Note that each pair of dimensions is like the manyto many relationship discussed previously. Each pair can be considered a special association and converted to a bimodal graph with its affiliation matrix and dual adjacency matrices. There is a possibility of N(N 1)/2 pairs and hence N (N1) associative graphs for analysis. Considering the simple example of the retail star schema in Figure 10, its five dimensions (time, product, customer, store and promotion) yield 10 affiliation matrices and 20 adjacency matrices. How should we choose which association of a star schema to study? First, the problem definition should imply which association is likely to give us meaningful insights and effective intervention for solving the problem. Second, the two dimensions involved should have a high cardinality and rich distribution of intersection instances resulting in overlapping cliques. For example, a bad dimension would be "sex" which usually has only two instances and allows for no overlap. Part 3 of this series will cover: the metrics (e.g., distance, degree, clustering and betweenness) used to analyze graph structures; definition for the SW property; strategies and tactics to foster or deter the SW property within a system; the issues of blanacing size, scope and heterogeneity; the implementation considerations for the ALA methodology; and the similarity to data mining within data warehousing. References: Dr. Richard Hackathorn is president and founder of Bolder Technology, Inc., a twelveyearold consultancy in Boulder, Colorado. Richard has more than 30 years of experience in the IT industry and is a well known technology innovator and international educator, conducting professional seminars in 18 countries. He has written three textbooks entitled Enterprise Database Connectivity, Using the Data Warehouse (with W.H. Inmon) and Web Farming for the Data Warehouse. He earned his B.S. degree from the California Institute of Technology and his M.S. and Ph.D. degrees from the University of California, Irvine. To contact, send email to richardh@bolder.com. ............................................................................... For more information on related topics visit the following related portals... Enterprise Intelligence and Data Modeling. Dr. Richard Hackathorn is president and founder of Bolder Technology, Inc., an established consultancy in Boulder, Colorado. Hackathorn has more than 30 years of experience in the IT industry and is a wellknown technology innovator and international educator, conducting professional seminars in 18 countries. He has written three textbooks entitled Enterprise Database Connectivity, Using the Data Warehouse (with W.H. Inmon), and Web Farming for the Data Warehouse. He earned his B.S. degree from the California Institute of Technology and his M.S. and Ph.D. degrees from the University of California, Irvine. To contact, send email to richardh@bolder.com.




Site Map  Terms of Use  Privacy Policy  
