How (not) to Fit a Terabyte into the Courtroom
Recent research shows that one in four companies, regardless of size, will be involved in legal proceedings each year. One in four. The explosion in electronic documents over the last decade, combined with increasingly aggressive legal discovery practices, is creating a mountain of data that needs to be exchanged between parties during this litigation. Obviously, this is a tremendous load on the legal department (and its budget), but at some point in the process, that mountain of data falls squarely onto the shoulders of IT managers, as corporate legal counsel turns to them and asks, "How do we get a terabyte of data into the courtroom?"
At which point the IT manager gets to smugly answer: "You don't."
Nowadays, of course, it's possible to bring a terabyte (TB) of data anywhere, and rather easily. While you probably can't fit a TB in your pocket, chances are you can carry it under your arm or in a briefcase. But that's not the point. It's not the cost of the hardware that matters, but the cost of the data; or, more specifically, the cost of the data review.
Many companies that are pulled into e-discovery for the first time have the impression that they need to search, save, gather and bring with them copies of literally everything they have. And where does that data come from? The sources that feed into e-discovery can potentially be any data, electronic or otherwise, that's owned or controlled by an entity, whether that entity is an individual or a corporation. (These entities are often referred to as "custodians" in legal circles, as they are in custody of the data.) And that data can be anywhere: in the data center, in off-site archives, on desktop hard drives, CDs, flash drives or even in file cabinets.
In today's email-centric world, the vast majority of that data resides in employee mailboxes. While corporations often limit the size of an individual mailbox to prevent "packrat-itis" on the server, employees are extremely adept at getting around those limitations by archiving data on the desktop, oftentimes for years. Even if a company has a strong retention policy in place, such policies are rarely enforced in a comprehensive manner. Once a suit is pending, it's far too late to go back and suddenly enforce the policies without landing in serious legal hot water. As a result, if an employee has been at a company for a number of years, they most likely have email from each of those years in their possession, outside the control of the IT group but within its responsibility from a legal standpoint.
In most instances, the IT department will have much of this data stored somewhere in an archive. Perhaps it's a sophisticated and searchable email archiving environment. Or, just as likely, the data is stored on a multitude of archived tapes that have been cut every month, quarter or year. Thus, a single email message from several years back could reside on the email server, on each of the last four daily tapes, on the last weekly tape, on each of the last quarterly tapes and on the preceding several annual tapes, as well as on the desktops of any employees who were involved in the mail chain.
That's a lot of data. The paranoia that has been drilled into IT groups about the potential for a catastrophic loss of data actually works to their disadvantage, because in a litigation setting each one of those copies needs to be identified, reviewed and categorized. This could add up to hundreds of gigabytes of data per employee; consider it the dark side of backup and archiving.
So back to the original question: how do you fit a TB into the courtroom? The true answer is that you don't need to.
The strategy instead is to effectively cull that information so that the legal decisions regarding what is relevant or not are focused on useful data. In other words, the goal is not just to limit the amount of data, but also to limit the number of decisions that have to be made regarding that data, thus minimizing the time, effort and legal expense around discovery.
In Search of the Duplicate
There are a number of tricks and tools an IT department or forensic consultant can use to identify and procure only the relevant information for review. The central concept in this process is the art of deduplication: finding the one instance of a file per custodian that is appropriate for review. (Note the "per custodian" part of that statement: the importance of a file can vary greatly depending on who had it. Was it just the CFO, or an entire department, for instance?) In short, we want to be able to identify a single email that is in 15 different archives and on a dozen desktops, and review it just once, while ignoring all of the files that have nothing to do with the questions at hand.
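As a rough illustration of per-custodian deduplication, the sketch below keeps the first copy of each item for each custodian and drops byte-identical repeats. The `Item` record, the location labels and the choice of SHA-256 as the fingerprint are assumptions made for this example, not a description of any particular e-discovery product.

```python
import hashlib
from typing import NamedTuple


class Item(NamedTuple):
    """Hypothetical record for one collected copy of a file or message."""
    custodian: str
    location: str
    content: bytes


def dedupe_per_custodian(items):
    """Keep the first copy of each distinct content per custodian."""
    seen = set()
    unique = []
    for item in items:
        # The key includes the custodian, so the same message held by two
        # different people is kept once for each of them.
        key = (item.custodian, hashlib.sha256(item.content).hexdigest())
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


message = b"Subject: Q3 numbers\n\nSee attached."
collected = [
    Item("cfo", "mail server", message),
    Item("cfo", "weekly tape", message),
    Item("cfo", "desktop archive", message),
    Item("analyst", "desktop archive", message),
]
for_review = dedupe_per_custodian(collected)
```

Here four collected copies reduce to two items for review: one for the CFO and one for the analyst.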
The first step in this process is a technique called known file exclusion: using an authoritative source to define files that can safely be ignored. One of the standards in this instance is the U.S. government's National Software Reference Library. This database contains the electronic signatures (hash values) of millions of files that are known to be part of software applications or related sources, such as help files, documentation, executables, etc. In most cases, these files can all be ignored.
The next strategy is to take this same technique and apply it to the files that are specific to a corporation (internal applications, help files, etc.). The IT department can compile their own database of these files to be excluded.
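Known file exclusion can be sketched as a simple set lookup. The example below assumes the known-file signatures, both the industry list and the company's own list, are available as sets of SHA-1 hex digests; the file names and contents are made up for illustration, and a real collection would be hashed from disk rather than from in-memory bytes.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the data as a hex string."""
    return hashlib.sha1(data).hexdigest()


def exclude_known_files(files, known_digests):
    """Return the names of files whose digest is not on the known list.

    `files` maps a file name to its content (hypothetical in-memory stand-in
    for files read from a collection).
    """
    return [
        name for name, data in files.items()
        if sha1_hex(data) not in known_digests
    ]


# Hypothetical signature lists: one from an authoritative industry source,
# one compiled by the IT department for internal applications.
industry_known = {sha1_hex(b"generic help file")}
company_known = {sha1_hex(b"internal app binary")}
known = industry_known | company_known

collected = {
    "help.chm": b"generic help file",
    "intranet_tool.exe": b"internal app binary",
    "q3_forecast.xls": b"sales forecast figures",
}
remaining = exclude_known_files(collected, known)
```

Only `q3_forecast.xls` survives the filter; the other two files match a known signature and never reach review.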
So now we've eliminated the known industry- and company-specific files from consideration, slowly whittling down the vast mountain of data that will need to be presented to attorneys for review.
From this point forward, the IT group will need to consider specialized e-discovery tools to get the amount of data down even further. One of the most effective techniques these tools can apply is a process known as near-deduplication, basically the art of excluding email chains. In near-deduplication, the tool reviews an email exchange between two or more people, dissects each message instance, and determines whether all of the preceding messages in the string are wholly contained in that one email. This process helps identify the last message in a given chain of, say, 20 emails and, if that final message contains the entire context of the conversation, exclude the previous 19 copies. When applied to long email threads that have a large number of recipients, this technique can be very effective in limiting the amount of data for review.
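The containment test behind near-deduplication can be approximated as follows. This is a deliberately naive sketch: it assumes replies quote earlier messages verbatim with `>` prefixes, normalizes whitespace and case, and uses plain substring matching. The function names are hypothetical, and commercial tools are considerably more robust than this.

```python
def normalize(text: str) -> str:
    """Strip quote markers, collapse whitespace and lowercase the text."""
    lines = (line.lstrip("> \t") for line in text.splitlines())
    return " ".join(" ".join(lines).split()).lower()


def keep_inclusive_messages(messages):
    """Drop every message whose full text is contained in another message."""
    normalized = [normalize(m) for m in messages]
    kept = []
    for i, body in enumerate(normalized):
        contained = any(
            i != j
            and body in other
            # A longer message subsumes this one; for identical copies,
            # keep only the earliest.
            and (len(other) > len(body) or j < i)
            for j, other in enumerate(normalized)
        )
        if not contained:
            kept.append(messages[i])
    return kept


thread = [
    "Can we meet Friday?",
    "Yes, 10am works.\n> Can we meet Friday?",
    "Booked the room.\n> Yes, 10am works.\n> > Can we meet Friday?",
    "Friday is a holiday.\n> Can we meet Friday?",
]
to_review = keep_inclusive_messages(thread)
```

The first two messages are dropped because the third contains them in full. The fourth is kept because it branches off the thread and adds text that appears nowhere else.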
Once the data is as clean as the IT group can reasonably make it, the time is at hand for human legal review. Fortunately, technology can still help at this stage; the focus now turns to helping organize the data for review. The objective is to help the attorneys better understand or recognize the context regarding document contents so that they can make faster and more accurate decisions, thus reducing the total cost of the project.
Reading the Documents
The only way to effectively organize a large body of documents is to read them first. That's where content analysis comes in: extracting information from a document to determine what the document is about. Of course, software tools don't know (or care) what any given file is about, but they can certainly look at the nouns, noun phrases, etc., and identify the main topics of the document. This allows the software to locate other documents about the same or similar topics and organize them so that the human reviewer making decisions about them is doing so in the most effective way possible.
Thus, the tools will analyze all of the documents in a case to identify how they're related and organize them for efficient review. Instead of reviewing a sales forecast followed by a software bug report followed by a memo about the company picnic, the attorneys can now review all of the documents about sales or software development or the company picnic. This creates context and, thus, efficiency.
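To make the grouping idea concrete, here is a toy sketch that stacks documents by shared vocabulary. It is only a stand-in for real content analysis: instead of extracting nouns and noun phrases, it assumes a small hand-picked stopword list, treats every remaining word as a key term, and uses Jaccard overlap with an arbitrary threshold of 0.2. All names and sample documents are hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "for", "about", "this", "will", "with"}


def key_terms(text: str) -> set:
    """Return the lowercase words of a document, minus short words and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a: set, b: set) -> float:
    """Share of terms the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_topic(docs, threshold=0.2):
    """Greedily place each document in the best-matching stack of documents."""
    groups = []
    for name, text in docs.items():
        terms = key_terms(text)
        best, best_score = None, threshold
        for group in groups:
            score = jaccard(terms, group["terms"])
            if score >= best_score:
                best, best_score = group, score
        if best is None:
            # No stack is similar enough, so start a new one.
            groups.append({"terms": set(terms), "names": [name]})
        else:
            best["terms"] |= terms
            best["names"].append(name)
    return [group["names"] for group in groups]


docs = {
    "forecast_q3": "Sales forecast for the third quarter",
    "bug_101": "Software bug report: login page crash",
    "picnic": "Memo about the company picnic in July",
    "forecast_q4": "Sales forecast for the fourth quarter",
    "bug_102": "Software bug report: crash on export",
}
stacks = group_by_topic(docs)
```

With these sample documents, the two forecasts land in one stack, the two bug reports in another, and the picnic memo in a stack of its own, so a reviewer sees related documents together.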
Think of it as creating the virtual equivalent of a stack of papers for review. Unlike paper, however, technology allows the reviewer to follow a chain of thought through electronic copies. The reviewer can then dynamically reorganize the documents into different or more detailed stacks.
Slowly but surely, these tools and techniques can whittle down the amount of data that needs to be taken into a courtroom, often by a factor of 10 or more. Certainly this can all be done manually and, in fact, was always done by hand in the past. But with the exponential increase in the amount of email and other content in the world, technology is the only practical way to address increasingly detailed and burdensome review requirements.
Talk to me, Goose. Talk to me.
You might think you're Maverick, but don't try to go it alone on these projects. Document everything and make sure your legal department understands how you are culling and filtering data (and get their approval in writing). You might not be court-martialed, but a flameout on a project like this won't launch your career like Top Gun did for Tom Cruise.
So in a very real sense, the IT group might be able to use technology to help win the case or even prevent it from ever going to trial. And that is certainly the very best way to keep a TB of data out of the courtroom.
Greg Lawn is the professional services manager of Technical Services at Attenex, a provider of open software platforms and expertise that enables corporations and their law firms to standardize e-discovery processes to reduce the risk, complexity and cost of litigation, regulatory requests and internal investigations.