Data Infrastructure Hygiene and the Imperative of Organized Growth
Corporations are not all that dissimilar from individuals. During a lifetime, we tend to accumulate things. At one time, some of these things were undoubtedly useful, but after a while, we forget why we obtained them in the first place. The result is clutter and disorganization, not to mention the difficulty of finding something amid all the "stuff."
I have a neighbor who has yet to park a single car in his two-car garage in the 10 years I have lived next door. He plans on building a shed, but I'd bet money he would only fill that up as well. Corporations are no different. They accumulate data, and lots of it. By some accounts, storage capacity requirements within the data center are growing between 45 and 125 percent compounded annually, and there are a number of reasons for this huge growth.
People and corporations typically deal with the inevitable growth of things to manage in two ways. Either you keep everything (like my neighbor) or you start throwing things out once you can't remember why you have them. I fall into the latter category, which is why my wife got upset at Christmas when she couldn't find her food processor. Oops! In this age of regulatory compliance - especially if your company is based in the U.S., the most litigious society in the world - you tend to become part of the former group by necessity.
Neither approach works all that well, however, so we need a third option. If we accept that data growth is inevitable, then companies can decide either to let that growth control them or to organize it so that the company stays in control. The consequences of choosing uncontrolled growth are universally bad. If becoming adaptive or agile is at the heart of virtually every organization's IT strategy, then organized growth is the only answer.
Growth versus Cost
To address a problem, we must first identify that a problem exists. To establish that a problem exists and that a solution is possible, look for a correlation: two (or more) measurable factors that move together. For example, we all agree that when the outside temperature is greater than 90 degrees, humans tend to sweat and become uncomfortable. If we can lower the temperature (i.e., air conditioning), we can improve the comfort level.
This leads us to the question: is there a correlation between the growth in data managed and the cost of actually maintaining that data? The answer is highly dependent on whom you ask. Certainly most PC users would likely see little correlation between maintenance cost and the amount of data under management. They have (for the most part) experienced the positive impact of larger storage capacity, coupled with improved processing speeds for a lower cost. Indeed, the average PC user now likely spends less time managing their storage than ever before. Hence, there is little correlation between the growth in the amount of data the average PC manages and the cost to maintain it.
To some degree, this mantra that storage is cheap has entered the conventional wisdom of most IT organizations. However, there is now evidence that a growing number of organizations have reached the point of diminishing returns, in which even sharply declining storage costs are no longer positively correlated with overall management costs. After all, while storage cost continues to decline 33 percent annually, the actual performance (as measured in disk revolutions per minute) has remained relatively unchanged for some time. We have overcome some of these factors through the use of larger caches and improvements in data partitioning and parallel processing (and yes, even more storage), but for larger IT shops with massive amounts of data to manage, we are now facing the unpleasant reality of unorganized growth.
The Birth of ILM
Storage vendors knew this day would come. They have seized this opportunity to expand their footprints into other areas in which they did not traditionally play — areas such as application, content and document management. Information lifecycle management (ILM) was marketed as a solution to any number of problems, including resource optimization, compliance, data protection and application performance. Organizations should accept the fact that not all data currently under management is equal, and the relative value of the data should match the relative cost of its underlying storage infrastructure. To accomplish this, experts urged organizations to create a process by which data could be assessed and classified, and then automate the movement of data to appropriate storage platforms.
I am oversimplifying the breadth of what ILM attempts to address, but suffice it to say, many IT organizations view ILM with a skeptical eye, seeing it as just another overhyped marketing ploy that would ultimately prove to be a $100 solution to a 10-cent problem.
At the very core of ILM is the notion that data is moved around the infrastructure as its relative value as "information" to an application or business user changes (i.e., as it diminishes). This process is often referred to as "archiving," which in my many discussions with end users would seem a most unfortunate term. I say this because archiving means different things to different people, even within the same organization. A storage manager might equate archiving with a backup to tape. An end user might think archiving means the data is no longer available or at least very difficult to access. This confusion, I believe, has had a negative impact on some very positive aspects of ILM.
For me, a light popped on during a discussion last year with Sai Gundavelli, CEO and founder of Solix Technologies, which is one of a handful of software companies that market software to manage the growth of application data and semi-structured and unstructured data. Collectively, these companies have been called "archive vendors." In our discussion, Gundavelli stated, "Our goal is to do for enterprise data what Google has done for Web-based information. We want to help customers to better organize their data so that the infrastructure is easier to manage and all the data is ultimately easier to find." Sounds like more than just archiving to me.
Gundavelli's vision of organizing data immediately made sense to me in a way that none of the other vendors had been able to articulate. While I have always seen the inherent value in what Solix and others were doing, finding a way to effectively communicate the concept at a higher level had been difficult. This notion was coupled with a question I received while giving a presentation on trends in the database market a few years ago. One gentleman in the audience asked what other organizations were doing to clean up their IT environments as applications become obsolete or hardware technology and standards advanced. These two notions came together for me as the concept I call "organized growth and data infrastructure hygiene."
Organized Growth and Data Infrastructure Hygiene
If ILM is a concept so massive in its implications as to overwhelm most decision-makers, the notion of organized growth is for me a more accessible concept. Remember my neighbor's garage. The problem there is disorganization - chaos, really. Gardening tools mixed in with power equipment, building materials, bikes and children's toys. Through some reorganization, the space in the garage could have accommodated at least one car and finding (or accessing) the right tools would have been simpler. The analogy is also applicable to enterprise data.
Organized growth is not a storage-centric concept. At its core, it is a business performance concept, which ILM certainly encompasses but has clearly done a poor job of communicating. It is easy to see why, as many managers and C-level executives can understand the rapidly declining costs of overall storage capacity. A reasonable person would conclude that it is silly to spend money to save overall storage costs when the price is falling naturally.
What needs to be communicated is the ripple effect that growing volumes of data have on the overall performance of the business's data infrastructure. By constantly cleaning little-used data out of production environments, we can lower the costs associated with large production databases of any kind. The more data that a database must manage, the more difficult it becomes to maintain acceptable performance. This leads to higher personnel costs, missed opportunity costs, lower user productivity and lower levels of application availability.
Take a look at the example in Figure 1, which combines information about the ratio of database administrators (DBAs) to database instances managed for both production and nonproduction environments. This analysis assumes a fully loaded cost of a DBA of $112,586 (base salary of $75,057) and three percent annual growth in pay. It assumes that the average database instance today is a relatively small 100GB, and uses this to calculate the management cost per database instance. It then assumes a 45 percent compounded annual growth rate (CAGR) in data under management, considered by many to be a reasonable expectation. I also assume that our management processes, tools and the databases themselves will improve over time, making management easier - in this case, at a perhaps aggressive rate of 30 percent annually. Even so, instance sizes grow faster than each DBA's capacity to manage data, so the number of instances each DBA can handle declines over time, pushing our personnel costs up 74 percent in four years and 164 percent over a six-year period.
Figure 1: Impact of Unchecked Growth
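The arithmetic behind this kind of projection can be sketched in a few lines. The following is a rough model, not the exact spreadsheet behind Figure 1, using the assumptions stated above: a $112,586 fully loaded DBA cost, 3 percent annual pay growth, 45 percent CAGR in data under management and 30 percent annual improvement in how much data each DBA can manage.

```python
# A simple cost-growth model under the article's stated assumptions.
# Headcount scales with data volume divided by per-DBA capacity;
# cost per head scales with pay growth.

DBA_COST = 112_586      # fully loaded annual cost per DBA (year 0)
PAY_GROWTH = 0.03       # annual DBA pay growth
DATA_CAGR = 0.45        # compounded annual growth of data under management
MGMT_GAIN = 0.30        # annual improvement in data managed per DBA

def personnel_cost_factor(years: int) -> float:
    """Total DBA personnel cost after `years`, relative to year 0."""
    headcount = ((1 + DATA_CAGR) / (1 + MGMT_GAIN)) ** years
    pay = (1 + PAY_GROWTH) ** years
    return headcount * pay

for y in (0, 4, 6):
    print(f"year {y}: cost x{personnel_cost_factor(y):.2f}, "
          f"${DBA_COST * personnel_cost_factor(y):,.0f} per original DBA")
```

Even with this crude model, costs rise roughly 74 percent by year four, matching the figure above; the exact six-year number depends on how the compounding is scheduled. The key point survives any reasonable variation: when data grows 45 percent a year and management efficiency only 30 percent, personnel costs compound upward.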
This example doesn't begin to capture the true costs resulting from unchecked or disorganized growth. If we have software to apply business rules about the relative value of information in an application, we can move the information that is less likely to be needed by most users to an operational archive of the application database. Some have called this "active archiving" to illustrate that the data can still be transparently accessed by application users with proper privileges. Data is copied from production and purged, thereby lowering the growth rate of the production environment.
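At its simplest, the copy-then-purge step described above is a single transaction against the production database. Here is a minimal, self-contained sketch using SQLite and a hypothetical `orders` table, where the business rule is simply "orders older than a cutoff date move to the archive" - real archiving tools apply far richer rules, but the mechanics are the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, placed DATE, total REAL);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, placed DATE, total REAL);
    INSERT INTO orders VALUES
        (1, '2001-03-15', 99.50),
        (2, '2005-11-02', 12.00),
        (3, '2006-01-20', 45.25);
""")

CUTOFF = "2004-01-01"  # hypothetical business rule: archive anything older

with conn:  # one transaction: copy to the archive, then purge from production
    conn.execute(
        "INSERT INTO orders_archive SELECT * FROM orders WHERE placed < ?",
        (CUTOFF,))
    conn.execute("DELETE FROM orders WHERE placed < ?", (CUTOFF,))

# Production now holds only recent rows; archived rows remain queryable.
live = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0]
print(live, archived)  # 2 1
```

The "active" part of active archiving comes from keeping the archive online and queryable (often via a view that unions both tables for privileged users), rather than shipping it to tape.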
Additionally, most companies create an average of seven full copies of a production system - for QA, test, training and reporting environments. Another aspect of data infrastructure hygiene is delivering the right amount of data to the appropriate environment. For example, QA may need an exact copy of production, but a training or development environment may only need one to two GB of data. So, in addition to active archiving, we want to employ instance-subsetting tools. Perhaps you have a reporting instance for sales; a subset of production containing only customers from a particular region could easily be built. These tools enable administrators to define a referentially intact subset of data to be copied from production to a nonproduction environment. The savings on storage would be substantial, but also think of the improved productivity of users, administrators and developers no longer forced to deal with long refresh cycles, performance issues and more. As an added bonus, data masking or encryption can be applied to sensitive data as part of the subsetting process, ensuring that nonproduction environments also comply with data security and privacy rules - a requirement often overlooked by many IT organizations.
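To make the subsetting idea concrete, here is a small sketch under assumed, hypothetical tables (`customers` and `orders`): copy one region's customers, copy only the orders that reference those customers so foreign keys stay intact, and mask the sensitive email column on the way in. Commercial subsetting tools walk the full referential graph automatically; this just illustrates the principle.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id), total REAL);
    INSERT INTO customers VALUES
        (1, 'WEST', 'ann@example.com'),
        (2, 'EAST', 'bob@example.com'),
        (3, 'WEST', 'cho@example.com');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0);
""")

def build_subset(source, region):
    """Copy one region's customers plus only their orders, masking emails."""
    dest = sqlite3.connect(":memory:")
    dest.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id), total REAL);
    """)
    # Parent rows first, with the sensitive column masked as they are copied
    for cid, reg, _email in source.execute(
            "SELECT id, region, email FROM customers WHERE region = ?", (region,)):
        dest.execute("INSERT INTO customers VALUES (?, ?, ?)",
                     (cid, reg, f"masked+{cid}@example.invalid"))
    # Child rows restricted to the parents we copied, keeping FKs intact
    for row in source.execute(
            "SELECT o.id, o.customer_id, o.total FROM orders o "
            "JOIN customers c ON c.id = o.customer_id WHERE c.region = ?",
            (region,)):
        dest.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
    dest.commit()
    return dest

west = build_subset(src, "WEST")
print(west.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 2
print(west.execute("SELECT COUNT(*) FROM orders").fetchone()[0])     # 2
```

Note that no real email address ever lands in the nonproduction copy, which is exactly the compliance benefit described above.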
Data infrastructure hygiene is a foundational management process that keeps production and nonproduction systems lean and performing optimally without sacrificing data access. Add the vision of organized growth as articulated by Sai Gundavelli for a new twist on traditional ILM thinking. It assumes that the value of information can and will change. Not only can information lose value, it can quickly gain value. For example, a lawsuit is brought against your company and subpoenas are issued for a number of potential emails or transactional records about the issue at hand. Finding that data is critical, and the value has increased exponentially overnight. Doing so may mean millions to your organization. Now that your enterprise data is organized, it can be searched in much the same way that information is located on the Web using Google-like search engines. Data organizing (staying away from the "archive" word) tools would have built and maintained the metadata regarding information location and its content. For this reason, such tools can truly be termed information lifecycle management tools. They are not concerned with data on storage; they are focused on managing information for improved business productivity.
Maintaining an organized and clean data infrastructure is imperative. Some believe this because of compliance concerns, some because of management costs, and others because they fear that uncontrolled growth in data hurts the company's ability to adapt quickly to changing demands. For all of these reasons and more, implementing a data infrastructure hygiene or organized growth process is essential. Look no further than your current list of IT initiatives and consider which ones would not be positively impacted by better-performing systems that free up resources (both human and hardware) for higher value work. Such a process enables continued innovation, which every organization needs to survive.
Charles Garry is a former vice president and director with META Group's Technology Research Services organization and has more than 18 years' experience in the database market. He can be reached at firstname.lastname@example.org.