|Covering Business Intelligence, Integration & Analytics||Advanced Search|
Deciding to Buy or Build ETL for Your Data Warehouse
Buy versus build is a multidimensional decision. Which option is right for you?
When starting a new data warehouse project, you must make the decision to either buy or build your ETL (extract, transform and load). There are a few constraints that may prevent you from purchasing a product. After buying the hardware, database software and BI tools, you may not have enough money left in your capital budget for an ETL purchase.
If you are starting with a proof of concept, then it is unlikely that you will have the time or the budget to go through an ETL product selection and purchase cycle in the time allotted. Because there is no vendor selection, tool evaluation or training involved with custom coding, you can start the analysis and programming immediately. You also don't need to worry about sustainability with a proof of concept; therefore, you can do without elements you need in a production scenario such as integrated scheduling, dependency management and recoverability. After the proof of concept, you may consider evaluating a tool rather than doing the additional development, or you may continue with what you have.
If you already have a data warehouse with custom ETL, you've probably encountered the problem of growing ETL complexity due to enhancements, new data requests and maintenance. At some point in this cycle, you will probably consider purchasing an ETL product rather than continuing down the custom coding path.
When it's time to consider buying versus building, there are a few key items that can guide your decision. The consideration that arises first in most people's minds is cost.
Unlike many other technology products, ETL tool prices have not dropped much over the last few years. This is balanced by a steady increase in functionality, performance and usability. Products tend to be clustered at two ends of the cost spectrum: large ETL suites that carry high six-figure price tags and narrowly targeted, ETL-only products that cost considerably less.
Is a large ETL suite overkill? This should be the first question you ask yourself. High-priced full-featured products are great if you exploit all the features; however, they are overly complex and inefficient tools if you only use a portion of what they offer.
On the other hand, if you buy a product because it is less expensive, you run the risk that it won't meet all of your needs. If it doesn't, then you'll end up paying developers to write code to do what the product can't, buying supplemental products or, at worst, you might end up with shelfware.
It's important to evaluate the project realistically. Many projects go for the expensive products without considering whether a low-end tool or hand coding would be easier, faster or more cost-effective. A feature such as an all-encompassing meta data repository is great, provided you need to share meta data from many applications and you have the time and people to make proper use of it. Do you really need that SNMP console integration and automatic ticket generation for the help desk? Automatic parallelism and data pipelining in the high-end products offer easy performance and scalability, but are not required if you do not process large data volumes.
While examining your project, look at the IT organization's approach to staffing projects. If you are building your ETL, then you will be dependent on skilled programmers. The primary requirements for these people are familiarity with your data warehouse operating system and database, expertise in your language of choice, solid SQL skills and familiarity with the source formats you are dealing with - plus knowledge of data warehousing basics. This is a tall order. It means that you are looking for more senior developers, often the same people who are in high demand for other projects.
If your organization likes to share the in-demand resources part-time across projects or does not have many people who fit the bill, then you may be better off buying a tool. If you have developers available or can guarantee that the developers with the proper skills will be dedicated to your team, then building can be a reasonable option.
A tool provides some isolation from the technical skill factors, particularly the languages and platforms. It allows you to bring in someone who is a good programmer but doesn't have many of these skills. The easier-to-use tools can also allow non-development staff such as data analysts or DBAs to develop ETL jobs. Using a tool does not mean you can do away with decent developers; it still takes skill to decompose complex ETL tasks and build the proper solution.
When purchasing an ETL product, particularly the more complex ETL suites, it will still be necessary to train the project staff. The time for training and the additional learning curve add to the cost of buying ETL for a project and should be factored in if you are comparing the cost of buying versus the cost of building.
If you buy an ETL product but you're on a tight budget, then you probably can't afford expert tool consultants. Instead, you will be relying on novice developers. In this case, it may be faster to use existing in-house skills and develop your own custom ETL than it would be to go through the steep learning curve associated with some of the more complex ETL tool suites. The speed to learn and build with a tool should be compared if you plan to evaluate tools in more detail.
Development speed can be rapid with a tool when compared to hand coding extracts. It depends on the complexity of the extract and how well the tool can deal with that complexity. The development time for simple extracts is often equivalent -- complex extracts can sometimes favor hand coding; however, the broad area in between is often the place where tools fit best.
Many Disparate Source Systems
A factor that favors tools over hand coding is the number of disparate source applications. If you must source data from multiple databases, operating systems and nonrelational sources, then a tool can make the effort considerably easier. Many ETL products have the ability to connect to these sources without exposing the developer to the underlying technology.
If you hand code ETL in this environment, then the project will require developers with additional skills to deal with all of the source platforms. If those resources are not already assigned to the project, one of the maintainers of the source system will be required to help develop the extracts. Using source system developers creates dependencies outside the project team for the programming effort in addition to the source data analysis and mapping that you already have. These outside resources will likely be responsible for maintaining the extracts, adding operational dependencies outside the data warehouse staff and creating resource scheduling problems when the maintainer's primary responsibilities conflict with the data warehouse project's needs.
Software Package as Primary Source
When packaged software such as ERP systems are the data source, the right ETL product will greatly improve developer productivity. Most of the major ETL vendors have pre-built extraction interfaces for common software packages such as Oracle Applications, SAP or PeopleSoft.
The interfaces they offer usually include a connector to the source so you have minimal configuration work to do in order to get the data you want. Some products provide meta data all the way to commonly used BI metrics. Others limit themselves to the basic source meta data. A few follow the data-mart-in-a-box approach, providing pre-built extracts, target schemas and basic reports.
Unless you are an expert in the underlying implementation details of the package, mapping the extract for the data you want is difficult. You must learn the vendor's schema and processing rules via their documentation and may need to reverse engineer some components to understand how they work. You must also work out the proper change detection mechanisms and manage the ETL changes whenever the package is upgraded. This effort alone is often enough to justify the additional expense of an ETL tool that works with that package.
Frequent Source Changes
Dealing with source system changes isn't isolated to packaged software. Packaged software upgrades are usually planned well in advance. Changes to the organization's internal applications may not be as well coordinated or controlled. If you find yourself facing frequent source system changes, then an ETL product may be a better choice than building your own.
Many ETL tools provide some form of reporting or visual change impact specifically to alleviate this problem. They can do this because the extracts built in the tool are usually captured in the tool's meta data repository, making it easy to identify code where columns or fields are used, or where specific logic has been reused.
Without impact analysis tools, it is difficult to find every place where a changed element is used, identify the code that requires maintenance or identify the downstream effects of modifications to the ETL code.
Another area where impact analysis can help is in the design of source system changes. Many times, it is possible to fix a bug or add an enhancement in more than one way. By using impact analysis tools, the warehouse developer can work with developers on the source system to assess various design options at the source and minimize the impact on warehouse extracts.
Long-term ETL maintenance can be an issue if your organization has high staff turnover. If ETL code is developed in house, then it is more difficult for new people to step in and become productive. They must understand both the ETL subsystem they will be working on and the source systems and data. In contrast, new staff can be trained quickly on an ETL tool. This means the primary learning curve is for the source systems and data, without the additional problem of how those are treated in custom ETL programs.
Dealing with long-term maintenance is one of the areas that can be used to help justify buying over building, particularly when source systems are in a constant state of flux. In contrast, if the systems are relatively stable and there are no plans for major upgrades in the near future, then building ETL can be a cost-effective choice.
A large portion of the ETL development effort is spent on data mapping and analysis. This is usually much more than simply mapping one column to another and includes identifying the most reliable sources, deriving the rules used by the source application to create data, identifying bad data and dealing with data exceptions. All these tasks must be completed, whether using an ETL tool or not.
An ETL tool will not offer a quicker start over building when the source data is coming from custom applications. This is also true if the ETL tool does not have a pre-built extract for a software package you are using as a source. This generally puts the tool on a more level comparison with custom code, and you then must look at the relative merits of what you plan to build versus what the vendors offer.
There is one advantage that you might have when using a tool: the source and extract rule meta data are captured. Unless you take care to document your extraction logic or build meta data usage into your custom ETL, you will not benefit as much from the analysis. However, don't fall into the trap that because you have meta data, you won't need to document your extracts or analyze the source system again for a different extract. It's common to revisit the prior analysis, if only to refresh your knowledge.
A custom ETL effort must address many of the technology problems that are built-in features of ETL products. The vendors spend a lot of time and money to take a holistic approach to data integration. As new technologies or design techniques appear, the vendors build them into their products. This is more difficult for a data warehouse project to do; therefore, the custom ETL solution is usually locked in to its initial development choices, and reuse of the custom-built ETL software for other purposes is unlikely.
That drawback aside, more than half of all data warehouse implementations use custom built ETL. The right answer to the buy versus build question will depend on your budget, project timelines and the features that are important to your specific implementation.
Editor's note: If you do decide to buy, please read Madsen's follow-up to this article, which will be featured in the October 1 issue of DM Direct. To subscribe to this e-newsletter, please visit http://www.dmreview.com/editorial/newsletters.cfm.
Mark Madsen is an award-winning IT architect who has been working with technology for more years than he cares to divulge. He is a principal author of Clickstream Data Warehousing (Wiley, 2002) and currently teaches classes on ETL tool selection at The Data Warehouse Institute. For more information or to contact Madsen, visit http://clickstream.blogspot.com/.