Monday, June 9, 2008

A Primer on Foreign Language E-Discovery

While e-discovery may be Greek to many, it is documents written in Chinese, Japanese, Korean and Russian that cause much of the trouble. Many of these are “multi-byte” languages, with far more characters than the 26 letters and handful of punctuation marks needed by languages written in the Latin alphabet, such as English, Spanish, French and German. In fact, the Kangxi dictionary includes over 47,000 Chinese characters (though only 3,000 to 4,000 are reportedly necessary for full literacy). The impact on e-discovery is significant, because evaluating a case that involves such documents demands more sophisticated tools and expertise.

At the most basic level, computers think in ones and zeros, with each one or zero being a bit. Eight bits make a byte. A single byte can hold 256 different combinations (2 to the 8th power, since each of the eight bits has two possible values). For languages that are not based solely on letters, i.e., those where symbols represent a concept or a syllable, you need a second byte (256 x 256, which equals 65,536). That is the essence of multi-byte vs. single-byte languages: single-byte languages have 256 possible combinations, while two-byte languages have 65,536.
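The arithmetic above is easy to check. The following is a minimal Python sketch; the function name `combinations` is made up for illustration and does not come from any e-discovery tool.

```python
def combinations(num_bytes):
    """Return how many distinct values fit in the given number of bytes."""
    bits = 8 * num_bytes   # eight bits per byte
    return 2 ** bits       # each bit is either a one or a zero

single_byte = combinations(1)
double_byte = combinations(2)

print(single_byte)   # 256
print(double_byte)   # 65536
print(double_byte == single_byte * single_byte)   # True: 256 x 256
```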

Confused? Then let’s address encodings. An encoding is a programmatic mapping between the bytes stored in a file and the characters you see on the screen. Problems arise when more than one encoding is involved. For example, if you analyze an Outlook 2000 e-mail file (PST format) created under a Japanese operating system and then move it to an English-language machine for review, the native Japanese text can be corrupted because the two systems assume different encodings.
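To make the encoding problem concrete, here is a small Python sketch. It assumes, purely for illustration, that the Japanese text was stored in the Shift_JIS encoding and that the review machine reads it with the Western Windows code page (cp1252); real e-mail archives may use other encodings.

```python
text = "日本語のメール"  # "Japanese e-mail"

# Bytes as a legacy Japanese system might store them (assumed: Shift_JIS).
raw = text.encode("shift_jis")

# The same bytes read with a Western code page: every byte is treated as
# a separate Latin character, so the result is unreadable.
wrong = raw.decode("cp1252", errors="replace")

# The same bytes read with the encoding they were written in.
right = raw.decode("shift_jis")

print(len(text), "characters stored in", len(raw), "bytes")
print("wrong encoding:", wrong)
print("right encoding:", right)
```

The bytes on disk never change in this example. Only the assumption about how to interpret them does, which is why identifying the original encoding matters before conversion.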

Unicode was created to solve some of these problems and offer a universal solution; however, it is only available for files created on newer systems, making legacy data a continuing area of concern. “Each language family has its own unique set of problems and solutions,” says Thomas Barnett, Special Counsel for Sullivan & Cromwell, LLP.

In fact, “in some parts of the world, you are not allowed to take the data out of the country due to local data protection laws,” adds Brian Kim of PriceWaterhouseCoopers LLP. He highlights that certain countries also have native applications that are more popular than those commonly used in the United States, requiring additional evaluation of your program inventory.

Whether your data is in Unicode or not, proper preservation is the key. While Microsoft Windows NT, 2000, XP and subsequent versions support Unicode, many archiving and compression tools do not. This can result in missing files that may or may not be reported in the error logs. For that reason, you must test carefully, notes Kim. Also, to ensure correct extraction, match the regional settings of the processing machine to the language of the source data.
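One way to test for silently dropped files is to compare an inventory taken before archiving with one taken after extraction. The Python sketch below is an assumption about how such a check could look, not a description of any particular vendor's process; the helper names `manifest` and `compare` are hypothetical, and the "archive tool" is simulated by simply omitting a file.

```python
import hashlib
import tempfile
from pathlib import Path

def manifest(root):
    """Map each file's relative path to the SHA-256 hash of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def compare(before, after):
    """Return (missing, altered) file names, relative to the 'before' manifest."""
    missing = sorted(name for name in before if name not in after)
    altered = sorted(
        name for name in before if name in after and before[name] != after[name]
    )
    return missing, altered

with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "contract.txt").write_text("契約書", encoding="utf-8")
    (Path(src) / "memo.txt").write_text("メモ", encoding="utf-8")
    # Simulate a tool that silently drops one file during archiving.
    (Path(dst) / "contract.txt").write_text("契約書", encoding="utf-8")

    missing, altered = compare(manifest(Path(src)), manifest(Path(dst)))

print("missing:", missing)   # missing: ['memo.txt']
print("altered:", altered)   # altered: []
```

Because the comparison does not rely on the tool's own error log, it catches omissions the log fails to report.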

Some languages share characters, e.g., Chinese and Japanese, and others do not put spaces between words, which makes search more complicated. Many corporate documents also combine English with another language.
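The spacing issue can be seen in a few lines of Python. This is only an illustration with made-up sample sentences: it shows why a search tool that finds words by splitting on spaces works for English but not for Japanese.

```python
english = "the contract was signed"
japanese = "契約書に署名しました"  # "(I) signed the contract"
mixed = "Please review the 契約書 before signing"

# Splitting on whitespace finds four English words ...
print(english.split())   # ['the', 'contract', 'was', 'signed']

# ... but treats the whole Japanese sentence as a single "word".
print(len(japanese.split()))   # 1

# A plain substring search still finds a term inside the unbroken text.
print("署名" in japanese)   # True

# Mixed-language documents produce a blend of both behaviors.
print(len(mixed.split()))   # 6
```

Tools built for these languages typically need language-aware word segmentation instead of whitespace splitting, which is one reason keyword lists are harder to establish.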

To avoid mistakes and enhance defensibility, consider organizing data for review by means other than keyword searching, given the difficulty of establishing search terms in foreign languages. Also bear in mind that translation is a substantial expense. While expert translators, ordinary native speakers and machine translation tools are all options, the issue is often one of timing and the reliability of the end product.

Remember, e-discovery is only Greek to you if you don’t know the code.

This article was originally published in the December 2007 issue of the Legal Tech newsletter.

About the Author

Ari Kaplan is the founder of Ari Kaplan Advisors and author of The Opportunity Maker: Strategies for Inspiring Your Legal Career Through Creative Networking and Business Development. He teaches professionals how to promote their work. Get free editorial calendars at http://www.AriKaplanAdvisors.com.
