Sid Adelman's Answer: You have a number of choices when you find data that can't be cleaned. Depending on the importance of the data and the need for completeness, you may decide to throw the row away completely, to put in a value indicating that it is invalid, to insert a null value or, as you suggest, to put it in an exception/error table. Your decision will depend on what you expect the users to need (do they need all the rows?) and on their ability to understand missing, flagged and null values. If you do put the unclean data in an exception table, what are you expecting the users to do with it? What purpose will this table serve? (It would be very difficult for the users to work with this exception table.)
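The four options Sid lists can be sketched as a small dispatch routine. This is a minimal illustration, not a prescribed design; the function, list names and the "INVALID" sentinel are all assumptions for a simple row-dict pipeline.

```python
# Sketch of the four strategies for an uncleanable value in one column.
# `clean_rows` stands in for the warehouse load, `error_rows` for an
# exception/error table; both names are hypothetical.

INVALID_FLAG = "INVALID"  # assumed sentinel marking a known-bad value

def handle_dirty_value(row, column, strategy, clean_rows, error_rows):
    """Apply one of four strategies to a row whose `column` failed cleansing."""
    if strategy == "discard":
        return                            # throw the row away completely
    if strategy == "flag":
        row[column] = INVALID_FLAG        # keep the row, mark the value invalid
        clean_rows.append(row)
    elif strategy == "null":
        row[column] = None                # insert a null value
        clean_rows.append(row)
    elif strategy == "exception":
        error_rows.append(row)            # route to the exception/error table
```

As Sid notes, which branch is right depends on whether users need every row and whether they can interpret flagged or null values.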
David Marco's Answer: The answer to your question is maybe. Sometimes I load these values into an error table; other times they are loaded into the data warehouse. Which you do is up to the business users assessing the system. As far as the table layout goes, I recommend going to the Real-World Decision Support (RWDS) newsletter at http://www.EWSolutions.com/Newsletter.asp and looking up Mike Jennings' excellent two-part article on operational meta data. In that article he provides a good beginning layout. Look at the first two issues of RWDS in the archives.
Larissa Moss's Answer: Dirty data, regardless of its source, is a time bomb waiting to go off for most companies that are moving toward CRM, BI, B2B and B2C. The quality of information used and sold as a product can make or break a company. My suggestions go beyond the question of how to dispense or suspend these data exceptions. Certainly, as a short-term solution, the ETL program/tool can write the "error report" into a table and populate such columns as: load date, source file (table) name, source field (column) name, target table name, target column name and the value of the rejected data item. This table can then be used to assign each rejected data item to an analyst for resolution; the assignment date, assigned-to name, resolution date, resolution comment and even reprocess date can be added to this table and updated manually.

But for the long term, the company will need policies governing its data. This policy must identify the company's critical data, its important but not critical data, its less important data and its unimportant data, as seen through the eyes of the business executives who make strategic decisions based on this data, or through the eyes of the business operations units that sell this data as a product. The policy must also include guidelines for the acceptability of data exceptions in each of those categories: in some cases 30% "dirty data" may be acceptable; in other cases 5% is too high. It must also include procedures for dispensing or suspending data depending on its significance and its exception threshold; writing exceptions out to a table may be acceptable in one situation but not another. Finally, the policy has to be followed with procedures to minimize the proliferation of bad data from its inception on the business operations side to its use on the business intelligence side.
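The short-term suspense table Larissa describes, with its ETL-populated columns and manually updated resolution columns, can be sketched as follows. SQLite is used purely for illustration, and the table and column names are assumptions, not a standard layout.

```python
# Sketch of a data-exception (suspense) table: the ETL process populates the
# rejection columns, and the resolution columns are updated manually later.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_exception (
        load_date          TEXT,
        source_table       TEXT,
        source_column      TEXT,
        target_table       TEXT,
        target_column      TEXT,
        rejected_value     TEXT,
        -- the columns below are filled in manually during resolution
        assignment_date    TEXT,
        assigned_to        TEXT,
        resolution_date    TEXT,
        resolution_comment TEXT,
        reprocess_date     TEXT
    )
""")

# The ETL program writes one row per rejected data item (sample values only)...
conn.execute(
    "INSERT INTO data_exception (load_date, source_table, source_column, "
    "target_table, target_column, rejected_value) VALUES (?, ?, ?, ?, ?, ?)",
    ("2001-05-01", "CUST_FILE", "BIRTH_DT", "CUSTOMER", "BIRTH_DATE", "13/45/01"),
)

# ...and an analyst assignment is recorded later by a manual update.
conn.execute(
    "UPDATE data_exception SET assignment_date = ?, assigned_to = ? "
    "WHERE rejected_value = ?",
    ("2001-05-02", "J. Smith", "13/45/01"),
)
```

As the answer stresses, this is only a short-term mechanism; the long-term fix is the governance policy, not the table.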
This would include establishing a data quality management team whose responsibility is to monitor the quality of the company's strategically significant and important data. It would also include changes to development methods, such as testing guidelines for the various categories of data, asking downstream users to participate in operational requirements definition sessions, setting data standards from naming to definitions to meta data, and so on. It would even include changes to current job descriptions and to the incentives paid to employees, from data creation on the business side to data distribution on the IT side, so that data quality becomes part of the performance appraisal guidelines. In other words, writing data exceptions into a table is barely the beginning of raising a company's (or maybe only a business unit's) consciousness of data quality. It is not a long-term solution to a problem that could severely impact a company's competitive edge in the best case and put it out of business in the worst case, as it did some automobile makers a few decades ago.
Ross Armstrong's Answer: The answer may not be as straightforward as simply putting the data in an error table. Your strategy should be to put as much data into the warehouse as possible without manual intervention. Using default values, minimum or maximum values, eye-catchers, etc., can accomplish this. As long as the data isn't a dimension key or a fact, this can save a lot of effort correcting data that may be infrequently analyzed. I also design an exception table that contains the suspect data, the source it came from, the ETL process that detected it, the error reason code and a timestamp.
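Ross's two-tier approach can be illustrated with a short sketch: default bad non-key, non-fact attributes so the row still loads, and suspend rows whose facts are unusable. The column names, defaults and valid range here are hypothetical, and a plain list stands in for his exception table.

```python
# Sketch: substitute eye-catcher defaults for bad attributes, but suspend
# rows whose fact values are out of range. All names/values are assumed.
from datetime import datetime, timezone

DEFAULTS = {"region": "UNKNOWN"}   # assumed eye-catcher default for an attribute
MAX_DISCOUNT = 100.0               # assumed valid range for the fact

def load_row(row, exceptions, etl_process="load_sales"):
    """Return the (possibly defaulted) row, or None if it was suspended."""
    for col, default in DEFAULTS.items():
        if row.get(col) is None:
            row[col] = default     # default keeps the row loadable
    # A fact cannot safely be defaulted; route the row to the exception table
    # with the fields Ross lists: data, source, process, reason, timestamp.
    if not (0.0 <= row["discount_pct"] <= MAX_DISCOUNT):
        exceptions.append({
            "suspect_data": dict(row),
            "source": row.get("source", "UNKNOWN"),
            "etl_process": etl_process,
            "error_reason": "DISCOUNT_OUT_OF_RANGE",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return None
    return row
```

The design choice is the one Ross names: automatic defaults keep rarely analyzed attributes from demanding manual correction, while keys and facts are held to a stricter standard.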