The eDiscovery industry is a mature one. While we can debate when the industry officially began, it’s certain that we are at least a decade into it. Over the years, many details in the eDiscovery workflow have been scrutinized, thoroughly debated, and resolved. Despite this maturity, there remain a number of gray areas in eDiscovery data processing that have not received the level of scrutiny one would expect. In some of these areas, data may be silently dropped during processing or presented incorrectly in review.
Such gray areas fall into different categories, but of particular interest are those related to database data and obscure data types.
eDiscovery is largely about documents: Word documents, Excel spreadsheets, PowerPoint presentations, and emails dominate this landscape. The term of art for this document-based information is unstructured data. It can be created in a freeform nature by a user and has no definitive structure. Unstructured data is, somewhat ironically, easy to work with in the context of eDiscovery. A document is a single item with relatively fixed metadata that can be represented easily in printed form. But not all eDiscovery data is unstructured, and some data types, such as databases, carry complicated issues.
Databases represent structured data. Structured data is stored in a systematic way. One might believe that this is an advantage over unstructured data but, in fact, it often causes more problems than it solves. There are few rules about how structured data should be organized. Most databases, however, have three broad types of data represented: tables, forms, and reports.
Tables are where the data is stored and forms provide data entry fields that populate the data in those tables. Reports are different ways of looking at the data stored in tables once it has been queried, grouped, and sorted.
A common type of database is one created with Microsoft Access. Microsoft Access is nearly ubiquitous in corporations and it is fairly easy for a business user to create a new database on a whim. Most eDiscovery data processing engines, however, do not gracefully deal with the types of data embedded within Microsoft Access.
In eDiscovery, it is important to be able to extract text from the tables, forms, and reports of a Microsoft Access database so that the complete text of the database is captured. Depending on the case, it may also be important to produce data from all three so that the opposing party has context for what is being produced. What text is extracted and what data is produced depends on the needs of each case.
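To make the table portion of that extraction concrete, here is a minimal sketch that walks every table in a database and emits one line of searchable text per row. It deliberately uses Python's built-in sqlite3 module as a stand-in, because reading a real Microsoft Access file requires a driver that may not be installed. The function names and the sample tables are hypothetical, and forms and reports are not covered.

```python
import sqlite3


def extract_table_text(conn):
    """Return {table_name: [row_text, ...]} for every user table."""
    cur = conn.cursor()
    cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    tables = [row[0] for row in cur.fetchall()]

    extracted = {}
    for table in tables:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + table.replace('"', '""') + '"'
        rows = cur.execute(f"SELECT * FROM {quoted}").fetchall()
        columns = [d[0] for d in cur.description]
        # Keep the column name next to each value so reviewers retain context.
        extracted[table] = [
            " | ".join(
                f"{col}: {val}" for col, val in zip(columns, row) if val is not None
            )
            for row in rows
        ]
    return extracted


def build_demo_db():
    """Create a small in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, customer TEXT, note TEXT)")
    conn.execute("INSERT INTO invoices VALUES (1, 'Acme', 'Paid late')")
    conn.execute("INSERT INTO invoices VALUES (2, 'Globex', NULL)")
    conn.execute("CREATE TABLE contacts (name TEXT, email TEXT)")
    conn.execute("INSERT INTO contacts VALUES ('J. Doe', 'jdoe@example.com')")
    return conn


if __name__ == "__main__":
    for table, lines in extract_table_text(build_demo_db()).items():
        print(table)
        for line in lines:
            print("  " + line)
```

Keeping the table and column names alongside each value is one way to preserve context for a reviewer; whether that is the right production format still depends on the case.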
The reality is that there is no single answer that addresses the needs for every Microsoft Access database, so each one needs to be handled with individual attention. There are few standards developed around these questions, so we often rely on the expertise of individual service providers. This sort of case-by-case, individualized handling of data allows experts and service providers to provide additional value, but for an industry that prides itself on forensic results, it is clear that the way we process Microsoft Access files raises questions once we dig into the details.
Another type of database is Lotus Notes. Many people think of Lotus Notes as an email system, but it is actually a sophisticated database system that can be used to create databases of nearly any shape or size. In the mid-to-late ’90s, email was of high interest to corporations, so Lotus Notes was often used as an email database, but email was certainly not its only use. Many eDiscovery tools treat Lotus Notes databases as email-only and, therefore, important data can be missed. All Lotus Notes databases should first be inspected to see if they contain data not related to email.
Even for Lotus Notes databases that contain only email data, many eDiscovery processing tools do not know how to open them properly. This is in part because a Lotus Notes email database is not a collection of individual email files (as one would find in a Microsoft email database format, such as a PST file). Rather, it is a database full of tables, columns, and rows that collectively represent emails, and those columns and rows need to be stitched together before any message can be read.
Many eDiscovery tools cannot appropriately act on structured data, such as records in a database, and prefer to treat every file as a single unstructured document or as a collection of unstructured documents. In fact, it is common for eDiscovery software to simply convert Lotus Notes databases into a collection of Microsoft email files (bundled together in a personal storage file) and then process the email in Microsoft format. The act of converting Lotus Notes to Microsoft format, however, almost always causes a loss of Lotus Notes-specific metadata, such as departmental data, formatting of the body text, and other custom fields used by the organization. This destruction of data is often glossed over, and clients and customers are rarely made aware of the data loss. There are some eDiscovery software programs that can handle Lotus Notes data natively, so it is important to ask your litigation support department or eDiscovery service provider whether Lotus Notes is natively supported.
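The stitching and the conversion loss described above can be illustrated with a small sketch. It assumes a simplified layout in which each row is a (document id, field name, value) triple; that layout, the field names, and the fixed target field set are all hypothetical and are not the actual Lotus Notes or PST storage formats.

```python
from collections import defaultdict

# Hypothetical fixed set of fields that a target email format can hold.
TARGET_FIELDS = {"From", "To", "Subject", "Date", "Body"}


def stitch_records(rows):
    """Group (doc_id, field, value) rows into one dict per document."""
    docs = defaultdict(dict)
    for doc_id, field, value in rows:
        docs[doc_id][field] = value
    return dict(docs)


def fields_lost_on_conversion(docs, target_fields=TARGET_FIELDS):
    """Return {doc_id: [field, ...]} for fields with no slot in the target."""
    lost = {}
    for doc_id, fields in docs.items():
        dropped = sorted(set(fields) - set(target_fields))
        if dropped:
            lost[doc_id] = dropped
    return lost


# Illustrative rows only; not taken from any real database.
SAMPLE_ROWS = [
    ("doc1", "From", "alice@example.com"),
    ("doc1", "Subject", "Q3 forecast"),
    ("doc1", "Department", "Finance"),
    ("doc1", "ApprovalStatus", "Pending"),
    ("doc2", "From", "bob@example.com"),
    ("doc2", "Subject", "Lunch"),
]

if __name__ == "__main__":
    stitched = stitch_records(SAMPLE_ROWS)
    print(fields_lost_on_conversion(stitched))
```

A report like this, produced before any conversion takes place, gives the client a concrete list of what would be dropped instead of leaving the loss undocumented.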
Another category of data that often falls into a gray area is that of obscure data types. Two examples are Exif data and data with extended metadata properties.
Exif stands for exchangeable image file format (oddly, it is not officially written EXIF, even though it is an acronym). Exif data is embedded in many media files, such as JPEG, TIFF, and WAV files. It can record the time, location, and equipment settings in effect when the media was captured. A photograph, for example, commonly records the type of camera used, the location where the photograph was taken, whether a flash was used, and so on. When media files are processed for eDiscovery, Exif data is not necessarily extracted and made available in metadata fields during review. This may exclude a treasure trove of information that could be useful for a case. To include this metadata in your review, ask whether your service provider extracts it as part of its processing workflow.
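For readers curious about what extracting Exif data involves, the following is a minimal sketch using only the Python standard library. It locates the Exif segment in a JPEG and reads three text tags (Make, Model, DateTime) from the first image file directory. The function names are hypothetical, and the sample file is synthetic so the sketch can run without a real photograph. A production tool would also need to follow the pointers to the Exif and GPS sub-directories and handle truncated or malformed files, neither of which is attempted here.

```python
import struct

# Tag numbers as defined in the TIFF and Exif specifications.
ASCII_TAGS = {0x010F: "Make", 0x0110: "Model", 0x0132: "DateTime"}


def find_exif_tiff(jpeg):
    """Return the TIFF block of the first Exif APP1 segment, or None."""
    if jpeg[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            return None
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # start of scan: no more header segments
            return None
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        payload = jpeg[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload[:6] == b"Exif\x00\x00":
            return payload[6:]
        pos += 2 + length
    return None


def read_ascii_tags(tiff):
    """Read the ASCII tags listed in ASCII_TAGS from the first directory."""
    order = {b"II": "<", b"MM": ">"}.get(tiff[:2])
    if order is None:
        return {}
    magic, ifd_offset = struct.unpack(order + "HI", tiff[2:8])
    if magic != 42:
        return {}
    (entry_count,) = struct.unpack(order + "H", tiff[ifd_offset:ifd_offset + 2])
    found = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, type_, count = struct.unpack(order + "HHI", tiff[start:start + 8])
        if tag in ASCII_TAGS and type_ == 2:  # type 2 is ASCII text
            if count <= 4:  # short values are stored inline
                raw = tiff[start + 8:start + 8 + count]
            else:  # longer values live at an offset from the TIFF start
                (offset,) = struct.unpack(order + "I", tiff[start + 8:start + 12])
                raw = tiff[offset:offset + count]
            found[ASCII_TAGS[tag]] = raw.rstrip(b"\x00").decode("ascii", "replace")
    return found


def build_sample_jpeg():
    """Build a tiny synthetic JPEG header with two Exif tags (test data)."""
    make, model = b"ExampleCam\x00", b"X1\x00"
    data_offset = 8 + 2 + 2 * 12 + 4  # header, count, entries, next pointer
    tiff = b"II" + struct.pack("<HI", 42, 8)
    tiff += struct.pack("<H", 2)
    tiff += struct.pack("<HHII", 0x010F, 2, len(make), data_offset)
    tiff += struct.pack("<HHI", 0x0110, 2, len(model)) + model.ljust(4, b"\x00")
    tiff += struct.pack("<I", 0)  # no further directories
    tiff += make
    payload = b"Exif\x00\x00" + tiff
    app1 = b"\xff\xe1" + struct.pack(">H", len(payload) + 2) + payload
    return b"\xff\xd8" + app1 + b"\xff\xd9"


if __name__ == "__main__":
    print(read_ascii_tags(find_exif_tiff(build_sample_jpeg())))
```

In a processing workflow, the returned dictionary would be mapped onto metadata fields in the review platform so that reviewers can search and filter on it.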
Another kind of obscure data accompanies data stored in the cloud. Much of the data stored in the cloud has extended metadata properties. For example, data stored in the popular Amazon S3 file storage service contains not only typical metadata about the file, but also metadata about when it was uploaded, whether older versions of the document exist, and any custom or user-defined metadata. This information can contain key data for a case, but it is often ignored. It is important to ensure that this metadata be preserved and extracted if you believe it could be important to your litigation matter.
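As a rough illustration of preserving this metadata, the sketch below flattens a response dictionary of the kind returned by the `head_object` call in boto3 (the AWS SDK for Python) into flat fields suitable for loading into a review platform. The function name, the `s3_` field prefixes, and the sample values are hypothetical. The sample dictionary is hard-coded so the sketch runs without network access or credentials.

```python
from datetime import datetime, timezone


def flatten_s3_metadata(head_response):
    """Map a head_object-style response dict onto flat review fields."""
    fields = {}
    last_modified = head_response.get("LastModified")
    if last_modified is not None:
        fields["s3_last_modified"] = last_modified.isoformat()
    for source_key, field_name in (
        ("VersionId", "s3_version_id"),
        ("ETag", "s3_etag"),
        ("ContentType", "s3_content_type"),
    ):
        if source_key in head_response:
            fields[field_name] = head_response[source_key]
    # User-defined metadata arrives as a nested dict; a prefix keeps it from
    # colliding with the system fields above.
    for key, value in sorted(head_response.get("Metadata", {}).items()):
        fields[f"s3_user_{key}"] = value
    return fields


# Illustrative values only; not taken from a real bucket.
SAMPLE_RESPONSE = {
    "LastModified": datetime(2015, 3, 2, 14, 30, tzinfo=timezone.utc),
    "VersionId": "example-version-id",
    "ETag": '"abc123"',
    "ContentType": "application/pdf",
    "Metadata": {"custodian": "jdoe", "matter": "12345"},
}

if __name__ == "__main__":
    for name, value in flatten_s3_metadata(SAMPLE_RESPONSE).items():
        print(f"{name}: {value}")
```

Against a live bucket, the input would come from something like `boto3.client("s3").head_object(Bucket=..., Key=...)`. Checking whether older versions of an object exist requires a separate listing call and is not shown here.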
eDiscovery is a mature industry with many high standards. It is important, however, to understand that there remain types of data, such as database (structured) data and obscure data, which require us to handle data on a case-by-case basis. Structured data requires an understanding of each database and how its data is stored and presented in order to be used effectively in eDiscovery. Obscure data types require a deep understanding of the method in which data is stored and an additional understanding of the types of extended metadata that a file may contain. While the tools for eDiscovery have come a long way over the years, eDiscovery still requires expert knowledge of data, how it is stored, and how to extract information within the data in the most effective way possible.