March 2010

Data Culling Strategies for Electronically Stored Information

Data Culling

An effective data culling strategy is essential to the e-discovery process.  Data culling is the process of reducing a large document population to a smaller set, which saves time and money by reducing the number of documents that require review.  More than 90 percent of collected electronic content is non-responsive, and reviewing it would be a significant waste of time and resources.  A variety of methodologies can be used to reduce an electronic data collection to a manageable size, including deduplication, file extension filtering, NSRL signature filtering, keyword searching, and data sampling.  The data culling process produces a dataset of potentially responsive documents, which an attorney then reviews for responsiveness and for the applicability of privilege.

File Extension Filtering

File extensions can be used to streamline the process of locating and processing relevant data.  From a file extension, one can infer information about what sort of data might be stored in any given file.

A file extension consists of one or more characters following the proper filename, usually prefixed with a period.  Most applications work with a predetermined set of file types.  Because an extension can be changed or missing, file types should generally be identified at the binary level rather than by extension alone.  One way to verify a file's type is TrID, a utility designed to identify file types based on their binary signatures (available at http://mark0.net/soft-trid-e.html).
Some common applications (and their extensions) include text editors (.txt), word processors (.doc or .odt), web browsers (.htm or .html), PDF viewers/editors (.pdf), and spreadsheet programs (.xls or .ods).  File extensions can also indicate which files are executable programs, which generally do not contain relevant user-created information.  For example, under Microsoft’s operating systems DOS and Windows, extensions that are usually irrelevant include .exe, .com, .bat, and .cmd (see http://www.filext.com).
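
As a rough illustration of identifying a file type at the binary level rather than by its extension, the Python sketch below compares the leading bytes of a file against a small, hand-picked table of well-known signatures. The table and the names sniff_type and sniff_file are assumptions made for this example; this is not how TrID works internally, and a real signature database is far larger.

```python
from typing import Optional

# A few well-known leading-byte signatures ("magic numbers").
# Illustrative only: real tools rely on much larger signature databases.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx, .xlsx, .odt
    (b"MZ", "exe"),          # DOS/Windows executables and DLLs
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc, .xls, .ppt
]


def sniff_type(header: bytes) -> Optional[str]:
    """Return a coarse type label for the leading bytes, or None if unknown."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

A file named report.txt whose first bytes are "MZ" would be labeled "exe" here, which is the kind of mismatch that extension-only filtering would miss.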

Common file types along with those applications most commonly associated with these file types include:

  • Word processor documents (Word, WordPerfect);
  • Spreadsheets (Excel, OmniPro);
  • Page description language (Acrobat, Fax Server, Crystal Reports);
  • Presentation (PowerPoint, Keynote);
  • Webpage (Internet Explorer, Netscape, FrontPage);
  • Archive and compressed (Zip, Exchange, Novell);
  • Databases (Access, Oracle);
  • Computer-aided design (Visio, AutoCAD);
  • Font file (binary data);
  • Graphics (Photoshop);
  • Object code, executable files, shared and dynamically-linked libraries (executable code);
  • Script (PHP, Python, Perl);
  • Sound and music (mp3, iTunes); and
  • Source code for compiled programs (.c).

The relevance of a particular application's files can partially be determined by the nature of the application, how frequently it was used, the number of its files found on the evidence drive, and the relative size of the data associated with it.  Files from some applications are unlikely to contain relevant information.  For example, an mp3 music file or a compiled program likely would not be relevant, whereas a Word or Excel document likely would contain relevant information.

Some file types can often be excluded because they are significantly expensive to process (for example, music, graphic, and video files, along with other binary data).  Non-binary application or Internet data that can be opened by an industry-standard application and/or viewed using Quick View Plus (available at http://www.avantstar.com/) generally should be processed.
 

NSRL Database Filtering

A typical desktop computer contains between 10,000 and 100,000 files, each of which may need to be reviewed.  To eliminate as many known irrelevant files as possible from review, an automated filter program can screen files for specific profiles and signatures.  If a file's profile and signature match an entry in a database of known irrelevant files, the file can be eliminated from review.  Only those files that do not match are subject to further investigation.

NIST maintains a repository of known software signatures at the National Software Reference Library (NSRL) (available at http://www.nsrl.nist.gov/).  An automated filter program can be used to screen files against that list of computer file signatures to separate those generated by a system from those generated by a user.  In the industry this is called De-NISTing.  See THE SEDONA GLOSSARY: for E-Discovery and Digital Information Management 2nd Edition, pp 14, 36 (The Sedona Conference, 2007) (available at http://www.thesedonaconference.org/dltForm?did=TSCGlossary_12_07.pdf).
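
The screening step can be sketched in Python as follows. This is a minimal illustration, assuming the known-file signatures have already been loaded into a set of uppercase hexadecimal SHA-1 strings by some other means (loading the NSRL data itself is not shown). The names sha1_of_file, de_nist, and known_hashes are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the uppercase hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def de_nist(paths: Iterable[Path], known_hashes: Set[str]) -> List[Path]:
    """Keep only files whose SHA-1 is NOT in the known-file hash set.

    `known_hashes` is assumed to hold uppercase hex SHA-1 values.
    """
    return [p for p in paths if sha1_of_file(p) not in known_hashes]
```

Files that survive the filter are not necessarily relevant; they are simply not recognized as known software and still need the other culling steps.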

 

Deduplication

Deduplication is a technique that identifies and segregates files that are exact duplicates of one another, with the end goal of delivering a data set that includes one copy of each original document while maintaining the information associated with each instance of that document within the collection.  Electronic deduplication is accomplished by software that applies a mathematical algorithm to each file to create a “hash” value that is, for practical purposes, unique to that item.  The hash values of the documents are then compared to find duplicates.

A hash is a number generated by applying a mathematical formula to a document or sequence of text.  The result acts as a signature for the original document: identical content always yields the same hash, and it is extremely unlikely that two different documents will share one by chance.
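
The behavior described above can be seen in a few lines of Python. This is only a sketch: MD5 is used here because it is a common choice in e-discovery tools, but any standard hash function would illustrate the same point, and the sample texts are made up.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex MD5 digest of the raw bytes of a document."""
    return hashlib.md5(data).hexdigest()


original = b"Quarterly results memo, final version."
exact_copy = b"Quarterly results memo, final version."
edited = b"Quarterly results memo, final version!"

# Identical bytes always produce the same hash value.
same = content_hash(original) == content_hash(exact_copy)
# Changing a single byte produces a different hash value.
differs = content_hash(original) != content_hash(edited)
```

Because even a one-character change alters the hash, hash comparison only finds exact duplicates.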

Duplication may occur within the data of a single custodian or user (such as e-mail) or across an entire infrastructure, spanning all users and data sets (such as data from file servers).  To accommodate this, two types of deduplication may be used:

  Vertical deduplication locates duplicates within the records and data of a single custodian, and

  Horizontal deduplication applies globally across all custodians.

The drawback of vertical deduplication is that it usually leaves a higher number of duplicate files in the review set.  If the same memo sits in several custodians' collections, each custodian's copy will be produced rather than a single copy.  While horizontal deduplication is more efficient because only one copy of that memo would be submitted for review, vertical deduplication may be necessary if it becomes important to compare identical documents in different custodians’ collections.
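
The difference between the two modes comes down to what counts as the deduplication key. The Python sketch below assumes a hypothetical record shape of (custodian, content) pairs and uses SHA-256 for the content hash; it is an illustration, not a description of any particular product.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

# Hypothetical record shape for this sketch: (custodian, raw content).
Document = Tuple[str, bytes]


def dedupe(docs: Iterable[Document], horizontal: bool) -> List[Document]:
    """Keep the first instance of each document.

    horizontal=True  -> one copy across all custodians
    horizontal=False -> one copy per custodian (vertical)
    """
    seen: Set[Tuple[str, str]] = set()
    kept: List[Document] = []
    for custodian, content in docs:
        digest = hashlib.sha256(content).hexdigest()
        # For vertical deduplication the custodian is part of the key.
        key = ("" if horizontal else custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append((custodian, content))
    return kept


docs = [
    ("alice", b"memo"),
    ("alice", b"memo"),   # duplicate within one custodian
    ("bob", b"memo"),     # same memo held by another custodian
    ("bob", b"budget"),
]
vertical = dedupe(docs, horizontal=False)
across_all = dedupe(docs, horizontal=True)
```

With this input, vertical deduplication keeps three documents and horizontal deduplication keeps two. A production tool would also record which custodians held each suppressed copy, which this sketch omits.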

E-Mail Deduplication

In some instances, e-mail can be deduplicated using the MD5 hash value of four common metadata fields, as opposed to the entire message.  The fields typically used for deduplication are the sender’s name (the “From” field), the “Sent On” date and time, the “Subject” line, and the attachment count (just the count, since some e-mail servers strip attachments completely and automatically).  Additional fields such as “CC” and “BCC” may also be used.
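
A metadata-based e-mail key along these lines might look like the following Python sketch. The Email record, the field separator, and the light normalization (trimming whitespace and lowercasing) are assumptions made for illustration; actual tools differ in which fields they use and how they normalize them.

```python
import hashlib
from typing import NamedTuple


class Email(NamedTuple):
    """Hypothetical message record holding only the four fields used here."""
    sender: str
    sent_on: str           # for example, an ISO 8601 timestamp string
    subject: str
    attachment_count: int


def email_key(msg: Email) -> str:
    """MD5 over the four metadata fields, joined with a separator character."""
    # Trimming and lowercasing are assumptions of this sketch.
    parts = [
        msg.sender.strip().lower(),
        msg.sent_on.strip(),
        msg.subject.strip().lower(),
        str(msg.attachment_count),
    ]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Two copies of a message that differ only in how a server rewrote the body would receive the same key, while a message with a different attachment count would not.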

Near Deduplication

In near deduplication, files that are not hash value duplicates but are materially similar may be culled or grouped together for review.  Near deduplication relies on linguistic pattern-matching.  Whereas hash values will only flag duplicates that are completely identical, linguistic pattern-matching technology attempts to recognize files that are extremely similar, such as the body of an e-mail copied into a Word document.

In addition to exact copies, this technique identifies near-duplicates with only a few words changed, or electronic files that are similar, but in different formats, such as a Word document that is also found in PDF.
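
Commercial near-deduplication engines use their own, often proprietary, matching techniques. As a stand-in, the Python sketch below uses a simple and well-known measure, Jaccard similarity over overlapping three-word sequences ("shingles"), to show how two texts can be scored as near-duplicates even when their hashes differ. The 0.8 threshold is an arbitrary assumption.

```python
import re
from typing import Set, Tuple


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Set of overlapping word n-grams taken from the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when the similarity score meets the (assumed) threshold."""
    return jaccard(a, b) >= threshold
```

Because the comparison works on extracted words rather than raw bytes, the same text saved in two formats can match once its text has been extracted, which hash comparison cannot do.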

Search Terms Filtering

Search term filtering is one of the most common and effective culling strategies.  Search terms may include the names of specific custodians coupled with litigation-specific keywords; the search determines whether the combination appears within the text or metadata of a document.  This filtering method can help sift through vast quantities of records and make the process more manageable.

Filtering can narrow a dataset by selecting responsive files based on file-level criteria such as metadata. The custodian list, file type and time frame associated with the matter are standard criteria.
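
A file-level filter on the three standard criteria can be sketched in Python as below. The FileRecord shape and the function name are hypothetical, and real processing tools expose these criteria through their own interfaces.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass
class FileRecord:
    """Hypothetical file-level metadata record for this sketch."""
    custodian: str
    extension: str   # lowercase, including the leading dot
    modified: date


def filter_records(
    records: Iterable[FileRecord],
    custodians: Set[str],
    extensions: Set[str],
    start: date,
    end: date,
) -> List[FileRecord]:
    """Keep records matching the custodian list, file types, and date range."""
    return [
        r for r in records
        if r.custodian in custodians
        and r.extension in extensions
        and start <= r.modified <= end   # inclusive on both ends
    ]
```

Each criterion narrows the set independently, so the order in which they are applied does not change the result.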

There are many powerful search options available in modern search tools:

  • Fuzzy: finds search terms even if they are misspelled;
  • Phonic: finds words that sound alike;
  • Synonym: finds words with the same meaning; and
  • Stemming: finds words with the same root.
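
Of the options above, fuzzy matching is the easiest to illustrate with Python's standard library. The sketch below uses difflib's similarity ratio to find words in a text that are close to a search term; the 0.8 cutoff and the function name fuzzy_hits are assumptions, and commercial search tools use their own algorithms.

```python
import difflib
import re
from typing import List


def fuzzy_hits(term: str, text: str, cutoff: float = 0.8) -> List[str]:
    """Distinct words in `text` whose similarity to `term` is at least `cutoff`."""
    words = sorted(set(re.findall(r"\w+", text.lower())))
    # get_close_matches requires n > 0, so fall back to 1 for empty text.
    return difflib.get_close_matches(
        term.lower(), words, n=len(words) or 1, cutoff=cutoff
    )
```

For example, searching for "contract" in a message containing the typo "contarct" would return the misspelled word, whereas an exact keyword search would miss it.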

Data Sampling

When there is a large volume of data to sift through, a statistical method called data sampling can be used to direct discovery processing efforts.  Data sampling is also used for purposes of quality control.

In data sampling, a statistically representative portion of the information is examined to determine whether the set contains responsive data.  Data sampling can help narrow the requirements of an e-discovery plan.

The larger the sample, the more accurate the assessment, yet the added value of increasing the sample size diminishes as the sample gets larger.  The appropriate sample size depends on the desired confidence level and on the margin of error one is willing to accept.  The confidence level specifies how confident one can be that the result measured in the sample, plus or minus the margin of error, reflects the full collection; normally, a 95% confidence level is used.
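
To make the relationship between confidence level and sample size concrete, the Python sketch below applies the standard sample-size formula for estimating a proportion, with an optional finite population correction. The margin of error and the worst-case proportion of 0.5 are assumptions that the text above does not specify; the function name is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Optional


def sample_size(
    confidence: float = 0.95,
    margin: float = 0.05,
    proportion: float = 0.5,
    population: Optional[int] = None,
) -> int:
    """Documents to sample to estimate a proportion within +/- `margin`."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction for a collection of known size.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)
```

Under these assumptions, a 95% confidence level with a 5% margin of error calls for 385 documents, while tightening the margin to 2% calls for 2,401. The smaller margin costs roughly six times as many documents, which illustrates the diminishing return from larger samples.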

Generally, both sides of the dispute should agree to use a court-approved statistical sampling protocol.  This can demonstrate whether evidence actually exists in the collection before the entire collection is ordered to be surrendered for review.  See Zurich Am. Ins. Co. v. Ace Am. Reinsurance Co., 2006 WL 3771090 (S.D.N.Y. Dec. 22, 2006) (finding that the parties shall submit either a stipulation or separate proposals for sampling the database to obtain claims files in which the allocation of policy limits was at issue).

Data sampling can also be used for quality assurance in e-discovery reviews.  Elusion is a measure of the quality of information retrieval: it is the proportion of documents judged not responsive that are in fact responsive (in other words, the rate of false negatives that the culling methodology produces).  (See http://www.umiacs.umd.edu/~oard/desi-ws/papers/roitblat.doc.)
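
An elusion check can be sketched in Python as follows, assuming a random sample has been drawn from the culled (discarded) documents and a reviewer has marked each sampled document as responsive or not. The function names and the input shape are hypothetical.

```python
from typing import Sequence


def elusion_rate(sample_is_responsive: Sequence[bool]) -> float:
    """Fraction of sampled discarded documents that were actually responsive."""
    if not sample_is_responsive:
        raise ValueError("sample must not be empty")
    return sum(sample_is_responsive) / len(sample_is_responsive)


def estimated_missed(discard_count: int, rate: float) -> int:
    """Point estimate of responsive documents left in the discard set."""
    return round(discard_count * rate)
```

If 4 of 400 sampled discards turn out to be responsive, the elusion rate is 1%, and a discard set of 50,000 documents would be estimated to contain about 500 missed responsive documents. This is a point estimate only and carries its own margin of error.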
