There are also unknown unknowns. There are things we don’t know we don’t know. - Donald Rumsfeld
Keyword searches embody this remark; it is easy enough to enter words that make sense to you as being associated with what you want to find, but the truth is keyword searches only find some data in e-discovery. If you don’t know exactly what you are looking for, and all the words that were used in communicating information, then a keyword search will likely leave relevant data hidden from your view.
How Keyword Searches Fail
A keyword search is a basic search technique, where the user enters words into a search engine and is provided with a list of all of the documents that contain those specific words. Think of the last time you used your favorite internet search engine to find crock pot recipes. Here, your keywords are crock, pot, and recipes, and you probably get a lot of websites that contain… you guessed it: recipes for crock pots! What keyword searches won’t find are documents that are related to your search terms but do not actually contain them. That isn’t a fault of keyword searching, as those searches are performing exactly as designed; it’s the fault of the user expecting the search to do more and find other items the user wants, such as slow cooker meals.
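To make this concrete, here is a minimal sketch of a keyword search in Python. It is not how any particular search engine or e-discovery platform works; the three documents and the function names are invented for illustration, and the search simply returns documents that contain every keyword.

```python
import re

def tokenize(text):
    """Lowercase the text and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(documents, keywords):
    """Return the names of documents that contain every keyword."""
    wanted = {k.lower() for k in keywords}
    return [name for name, text in documents.items() if wanted <= tokenize(text)]

# Hypothetical collection, invented for this example.
documents = {
    "doc1": "Ten easy crock pot recipes for busy weeknights",
    "doc2": "Slow cooker meals: beef stew with root vegetables",
    "doc3": "Quarterly budget forecast for the marketing team",
}

hits = keyword_search(documents, ["crock", "pot", "recipes"])
print(hits)  # ['doc1']
```

The search does exactly what it was asked to do. The slow cooker document (doc2) is about the same subject, but it never comes back because it shares no words with the query.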
Unfortunately, attorneys may find documents through a keyword search that relate to their case but fail to find the other documents that contain relevant information. Worse yet, the attorney may assume no other documents exist or, if they do, have no idea how to search for them. In our crock pot example above, the search will leave out documents that use potentially relevant terms such as slow cooker, meal, and beef stew. It is easy to fall into the trap of using keyword searches to guess what other documents may contain. As the court put it in Nat’l Day Laborer Org. Network v. United States Immigration & Customs Enf’t Agency, 877 F. Supp. 2d 87, 108 (S.D.N.Y. 2012), “…searching for an answer on Google (or Westlaw or Lexis) is very different from searching for all responsive documents in the FOIA or e-discovery context.”
It is easy to run into other issues when using keyword searches in more complex matters, like litigation. The data searched may include everyday e-mails and documents that contain jargon, slang, acronyms, punctuation, and misspellings that a keyword search simply will not find. Again, in our crock pot example, that keyword search may miss the obvious trademarked term crock-pot due to the hyphen.
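The hyphen problem is easy to reproduce. The sketch below assumes a naive exact-phrase match, and the messages are invented for illustration. Real review platforms differ in how they handle punctuation, so treat this as a demonstration of the risk, not a description of any specific tool.

```python
def phrase_search(messages, phrase):
    """Return the messages that contain the exact phrase, ignoring case."""
    needle = phrase.lower()
    return [m for m in messages if needle in m.lower()]

# Hypothetical e-mail snippets with the kinds of variation found in real data.
messages = [
    "Bring the crock pot to the office potluck",
    "My Crock-Pot is still in the box",       # hyphenated trademark
    "Is the crockpot on the list?",           # run together
    "Found the crok pot manual",              # misspelling
]

found = phrase_search(messages, "crock pot")
missed = [m for m in messages if m not in found]

print(len(found), "found;", len(missed), "missed")  # 1 found; 3 missed
```

Three of the four messages plainly refer to the same appliance, yet only the one that matches the query character for character is returned.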
If you’re asking yourself why anyone would ever use a keyword search in an e-discovery matter, you are not alone. Courts have weighed in on this issue. Judge Facciola opined:
“Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics. . . Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.”
Judge Andrew Peck, the federal judge who first approved technology-assisted review, also eloquently stated:
“In too many cases, however, the way lawyers choose keywords is the equivalent of the child’s game of ‘Go Fish.’ The requesting party guesses which keywords might produce evidence to support its case without having much, if any, knowledge of the responding party’s ‘cards’ (i.e., the terminology used by the responding party’s custodians). Indeed, the responding party’s counsel often does not know what is in its own client’s ‘cards.’”
Alternatives to Keyword Searches
I can hear it already: “OK, you’ve convinced me to stop using keyword searches, but what do I look for in my next e-discovery solution?” The two most popular alternatives used in e-discovery are predictive coding and concept searching. While both processes are complicated, and explaining just one of them would take an entire law journal article, we provide a brief overview of each below.
Predictive coding has been the buzzword in e-discovery over the last few years. This method has humans review small portions of a large dataset in order to train the computer to find similar documents based on the characteristics of those the reviewer marked relevant. The reviewer marking the training set will typically be a lawyer intimately familiar with the case, to ensure the intricacies of that case are identified. The predictive coding algorithm then sorts the remaining data based on how the reviewer categorized the training set (i.e., relevant or non-relevant). Eventually the user will be presented with a smaller “relevant” subset of data and a larger “non-relevant” pile. How long the lawyer spends training the system depends on the accuracy of the reviewer and the acceptable cost/benefit level the law firm and client have identified for that particular case.
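The train-then-sort workflow can be sketched in a few lines. The example below uses a small naive Bayes text classifier as a stand-in, which is an assumption made for illustration; commercial predictive coding tools use their own, generally more sophisticated, models. The seed documents, labels, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(seed_set):
    """Count words per label from the reviewer-coded seed set."""
    word_counts = {"relevant": Counter(), "non-relevant": Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def predict(model, text):
    """Return the label with the higher naive Bayes log score."""
    word_counts, doc_counts = model
    vocab = set(word_counts["relevant"]) | set(word_counts["non-relevant"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical seed set, coded by a reviewer who knows the case.
seed_set = [
    ("slow cooker beef stew recipe", "relevant"),
    ("crock pot chili with beans", "relevant"),
    ("quarterly budget meeting agenda", "non-relevant"),
    ("parking garage closed on friday", "non-relevant"),
]
model = train(seed_set)

# Unreviewed documents are sorted into the two piles by the trained model.
unreviewed = ["beef chili for the slow cooker", "agenda for friday budget meeting"]
piles = {doc: predict(model, doc) for doc in unreviewed}
for doc, label in piles.items():
    print(label, "<-", doc)
```

With a seed set this small the result is only a toy, but the shape matches the process described above: the reviewer codes a sample, the model learns from those decisions, and the rest of the collection is sorted without a human reading every document.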
In contrast to predictive coding, a concept search does not require the user to train a computer system. While there are several categories of concept searches, latent semantic indexing is a commonly used one. This technique is based on the assumption that words used in the same contexts will likely have similar meanings. Latent semantic indexing does what keyword searches cannot do; it finds groups of words that a human reader would likely relate to the keyword being searched for, even if the specific keyword is not located within the document at all. Continuing with the crock pot recipe example, documents might contain slow cooker, beef stew, and ingredients, which a human reader could reasonably relate to the keywords crock pot recipe even though the documents do not contain those keywords. The methods used to accomplish this vary from algorithm to algorithm, but the basics remain the same.
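The underlying assumption, that words used in the same contexts tend to mean similar things, can be shown with a much simpler technique than latent semantic indexing itself. The sketch below is not latent semantic indexing (it skips the matrix decomposition that method relies on). It describes each word by the other words it appears alongside, then compares a query to each document using those descriptions. The corpus and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_context_vectors(corpus):
    """Describe each word by the words that appear in the same documents."""
    contexts = defaultdict(Counter)
    for text in corpus.values():
        words = tokenize(text)
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    contexts[word][other] += 1
    return contexts

def concept_vector(contexts, text):
    """Sum the context vectors of every word in the text."""
    total = Counter()
    for word in tokenize(text):
        total.update(contexts.get(word, Counter()))
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical corpus, invented for this example.
corpus = {
    "d1": "crock pot beef stew recipe",
    "d2": "slow cooker beef stew recipe",
    "d3": "quarterly budget forecast meeting",
    "d4": "budget meeting agenda",
}
contexts = build_context_vectors(corpus)
query = concept_vector(contexts, "crock pot")
scores = {name: cosine(query, concept_vector(contexts, text)) for name, text in corpus.items()}

for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Document d2 never mentions crock or pot, so a keyword search would skip it. Here it still scores above zero because slow cooker keeps the same company (beef, stew, recipe) that crock pot does, while the two budget documents score zero.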
Closing
At the 2012 LegalTech conference, Judge Peck stated, “Keyword searching is absolutely terrible, in terms of statistical responsiveness.” While keyword searches might be appropriate in certain cases, when you are faced with large amounts of data that you know nothing about, instead of playing Go Fish with your client’s data and money, ditch the keyword searches and invest in technology that will help you find those documents that you didn’t even know existed.