TAR and Non-English Documents: Does it Work?

In recent years, technology-assisted review (TAR) has become a staple of electronic discovery. TAR’s ability to cut the cost and time of document review in litigation is so well-established that a federal magistrate judge recently declared it “black letter law” that courts will authorize TAR whenever a producing party seeks to use it.

Even so, there remains a perception that TAR is effective only for English documents. Given the era of global commerce in which businesses now operate, this would be a serious limitation. Legal disputes these days can involve documents in a range of languages, from Asian to European to Middle Eastern and beyond.

The truth of the matter is that TAR can be as effective for other languages as it is for English. All that it requires is that the documents be properly processed in advance.

TAR Doesn’t ‘Speak’ Any Language

To explain how TAR can work with non-English documents, it is important to emphasize that TAR doesn’t understand English or the actual meaning of documents. Rather, it is simply an algorithm that analyzes words according to their frequency and proximity in relevant documents compared to irrelevant documents.

We train a TAR system by marking documents as relevant or irrelevant. When we mark a document relevant, the algorithm analyzes the words in that document and ranks them. When we mark a document irrelevant, the algorithm does the same, this time giving the words a negative score. At the end of the training, the computer sums up the analysis from the training documents and uses that information to build a search against a larger set of documents.

While different algorithms work differently, think of the TAR system as creating huge searches using the words developed during training. It might use 10,000 positive terms, with each ranked for importance. It might similarly use 10,000 negative terms, with each ranked in a similar way. The search results would come up ordered by importance, with the most likely relevant ones ranked first.
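To make the idea concrete, here is a deliberately simplified sketch in Python. It is not how any particular TAR product works. The scoring rule (plus one for each relevant training document containing a term, minus one for each irrelevant one) and the names `train` and `rank` are assumptions chosen for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and pull out runs of letters and digits.
    return re.findall(r"\w+", text.lower())

def train(labeled_docs):
    """Build a term -> weight map from (text, is_relevant) pairs."""
    weights = Counter()
    for text, is_relevant in labeled_docs:
        # Count each term once per document: +1 if relevant, -1 if not.
        for term in set(tokenize(text)):
            weights[term] += 1 if is_relevant else -1
    return weights

def rank(weights, docs):
    """Order documents from most to least likely relevant."""
    def score(text):
        # Terms never seen in training contribute 0.
        return sum(weights[term] for term in tokenize(text))
    return sorted(docs, key=score, reverse=True)

# Toy training set (invented for this example).
training = [
    ("patent license royalty dispute", True),
    ("patent infringement claim", True),
    ("lunch menu for friday", False),
    ("friday parking notice", False),
]
weights = train(training)

ranked = rank(weights, [
    "friday lunch order",
    "royalty payment under the patent license",
    "meeting notes",
])
for doc in ranked:
    print(doc)
```

With this toy data, the document about the patent license comes first and the one about Friday lunch comes last. Nothing in the code depends on the terms being English words. It only needs the text to be split into terms.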

By Tokenizing Documents, TAR Will Work

In performing this ranking, the TAR system does not actually understand the words. Rather, it is programmed to recognize words by the spaces and punctuation that separate them. Because these groupings of letters and characters might not even be actual words, computer scientists instead call them “tokens.”
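A minimal sketch of this kind of splitting, in Python, might look like the following. The exact set of separator characters is an assumption. Real systems make their own choices about which punctuation ends a token.

```python
import re

def tokenize(text):
    # Split on runs of whitespace and common punctuation, dropping empty pieces.
    return [t for t in re.split(r"[\s.,;:!?\"'()\[\]]+", text) if t]

tokens = tokenize("Re: Q3 invoice #A-1047, pls review ASAP!")
print(tokens)
# ['Re', 'Q3', 'invoice', '#A-1047', 'pls', 'review', 'ASAP']
```

Several of the results, such as `Q3`, `#A-1047` and `pls`, are not dictionary words, which is why "token" is the more accurate term.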

The difficulty with some non-English languages is that they do not use spaces and punctuation in the same way that English does. For example, Asian languages such as Chinese and Japanese do not use spaces between words. If a document contains no spaces, how is the computer to recognize words and compile the index?

The solution is to tokenize the document before feeding it to the TAR system. Tokenization software is programmed to recognize characters and words in specific languages. It segments the words within the document so that the computer can “see” and index them just as it would with English-language documents.
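As a rough illustration, the Python sketch below segments a short Japanese phrase using forward maximum matching, a simple dictionary-based technique. This is an assumption made for the example, not a description of any specific tokenization product. Production tokenizers rely on large dictionaries and statistical models. The three-word vocabulary here is a toy.

```python
def segment(text, vocabulary):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    longest = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if candidate in vocabulary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens

vocabulary = {"特許", "契約", "確認"}  # toy word list, for illustration only
text = "特許契約を確認"

naive = text.split()
segmented = segment(text, vocabulary)

print(naive)      # ['特許契約を確認']
print(segmented)  # ['特許', '契約', 'を', '確認']
```

Splitting on spaces treats the whole phrase as a single token, while the segmenter produces separate tokens that can be indexed and ranked like English words.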

It is true that when TAR first came on the market, some of the early systems could not process many non-English languages. Today, however, some advanced TAR systems include a text tokenizer and are able to analyze documents effectively in virtually any language.

Case Study: TAR for Japanese Documents

A recent case we were involved in illustrates how a corporation was able to use TAR to reduce the time and cost of review in a case involving Japanese documents.

The case involved a U.S. multinational in an international intellectual property dispute. Its Japan-based legal team faced a four-week deadline to review more than 15 million Japanese documents. Using culling software, the team was able to reduce the collection considerably, but even then it was left with 3.5 million unique documents to review.

In the hope of speeding the review, the team turned to TAR. Japanese reviewers had already coded roughly 10,000 of the documents as relevant or not. These were used as seeds to train the TAR system. Then, the full collection was run through TAR.

The result was that the legal team was able to disregard 83 percent of the remaining documents, yet still achieve a confidence level of 97 percent. That meant that they had to review only 17 percent of the documents yet could be highly confident that they had found virtually all of the relevant ones.
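To put those percentages in terms of document counts, the short Python calculation below applies them to the 3.5 million unique documents. That is an assumption. If the percentages were measured against the collection excluding the roughly 10,000 documents already coded, the figures would shift slightly.

```python
unique_documents = 3_500_000  # documents left after culling
percent_set_aside = 83        # share the team did not need to review

# Integer arithmetic avoids floating-point rounding.
set_aside = unique_documents * percent_set_aside // 100
to_review = unique_documents - set_aside

print(f"Set aside: {set_aside:,}")   # Set aside: 2,905,000
print(f"To review: {to_review:,}")   # To review: 595,000
```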


Legal matters these days are increasingly likely to involve non-English documents. A belief persists that TAR cannot be effective in these cases. The truth, however, is that with the proper technology and expertise, TAR can be used with any language, even languages such as Chinese and Japanese that do not separate words with spaces.

About John Tredennick

John Tredennick is the founder and CEO of the e-discovery company Catalyst. John is passionate about the role of search in e-discovery. Before founding Catalyst in 2000, he was a trial lawyer and litigation partner. John has been a frequent speaker on legal and technology issues for more than 30 years. He’s also written and edited five best-selling books and countless articles on litigation and technology issues.
