June 2009

Legal Disclaimer

Your use of this Blog does not create an attorney-client relationship. Your e-mail or comments do not create an attorney-client relationship. We have no duty to keep confidential the information that is submitted to this blog. This blog is not a substitute for, nor does it constitute legal advice. Only an attorney who knows the details of your particular situation and is properly licensed in the applicable state (or states) is able to appropriately and properly address any legal issues you may have.


Fingerprinting (Writeprinting) Text Using Stylistic Features Can Be Used To Accurately Identify the Authorship of Anonymous Emails, Blog Entries and IRC Chat Sessions

Going to court to force an ISP to disclose the identity of an anonymous blogger raises many legal roadblocks, including issues of First Amendment rights. For example,

On June 13, 2007, the New Jersey Township of Manalapan filed a malpractice suit against its former attorney Stuart Moskovitz, alleging misconduct regarding the Township’s purchase of polluted land in 2005. The decision to file suit was met by a lively debate in the regional press and among local bloggers. One blogger who was particularly critical of the Township, of this and other decisions, was Blogspot blogger “datruthsquad”

(http://www.eff.org/cases/manalapan-v-moskovitz).

Long story short, the Township lost. A copy of the Electronic Frontier Foundation’s (“EFF”) motion to quash is available here: motiontoquashmpa-signed; the Court order quashing the subpoena is available here: order-122107. However, there may exist an alternative method for “unmasking” anonymous bloggers, cyber-stalkers, etc. using public information. Everyone has a unique writeprint (basically a written fingerprint that can be used to identify him or her).

This technique has traditionally been used to identify the true author of a text (e.g., a book) where authorship is disputed or unknown. Forensic linguistics has been used to provide evidence in trademark disputes, to identify the authors of anonymous texts (such as threat or harassment letters), and to identify cases of plagiarism. The identification process relies on the analysis of an individual’s particular patterns of language use (vocabulary, collocations, pronunciation, spelling, grammar, etc.). The term “idiolect” is defined as the speech patterns of a specific person (a dialect, unique in pronunciation, grammar, and vocabulary to a single person).

Stylistic features can be used to create a fingerprint of an individual’s writing style (a linguistic fingerprint is called a “writeprint”). A writeprint is composed of features that represent an author’s writing style and that are consistent across all of an individual’s writings. For a gentle introduction, see Digital fingerprints: tiny behavioral differences can reveal your identity, by Julie Rehmeyer in the January 13, 2007 issue of Science News (Westlaw cite 2007 WLNR 2239738).

Email identification is a unique subset of authorship identification. When identifying authorship of anonymous emails, the following considerations have been noted:

  • The identification of an author is usually attempted from a small set of known candidates; and

  • Other evidence in the form of e-mail headers, e-mail trace route, e-mail attachments, time stamps, or other independent evidence is often used in conjunction with linguistic analysis to establish the identity of the author.

Two studies (both funded by security-related government agencies) have applied forensic linguistics to the identification of the authorship of anonymous emails. (See A. Anderson, M. Corney, O. de Vel, and G. Mohay; Identifying the Authors of Suspect E-mail, Communications of the ACM, 2001 (available at eprints.qut.edu.au/archive/00008039/01/8039.pdf); see also Jiexun Li, Rong Zheng, Hsinchun Chen; From Fingerprint to Writeprint, Communications of the ACM (April 2006)).

Characteristics of an email that are relevant in establishing authorship include:

  • Composition and writing, such as particular syntactic and structural layout traits;

  • Patterns of vocabulary usage;

  • Unusual language usage (e.g., converting the letter “f” to “ph”); and

  • The excessive use of digits or upper-case letters.

Id.

These studies have found that a dataset of available e-mail used to conduct an evaluation ideally should include about 50 emails per author, where each author’s emails include approximately 12,000 words in total. Id. However, other studies have shown that a total of 20 documents for each author is adequate to achieve sufficient accuracy for purposes of authorship identification of an unknown email if additional independent corroborating features are also available. Id. One study, focusing on knowledge acquisition within an organization (for purposes of maintaining institutional knowledge, which is lost when an employee leaves an organization), found that email text analysis was superior to a content-based approach in identifying subject matter expertise within an organization. Campbell, Christopher S.; Maglio, Paul P; Cozzi, Alex; and Dom, Bryon, Expertise Identification using Email Communications, IBM Almaden Research Center (ACM © 2003). Moreover, this study finds a small number of emails sufficient to identify a subject matter expert within an organization. Id.

The literature has found the following stylistic features relevant in describing an individual’s idiolect:

  • Number of blank lines/ total number of lines;

  • Average sentence length;

  • Average word length (number of characters);

  • Vocabulary richness: (distinct words (V) / total number of words (M));

  • Total number of function words (Conjunctions, prepositions, and articles) / total number of words;

  • Total number of short words (three letters or fewer, e.g., all, at, his);

  • Hapax legomenon / total number of words (hapax legomenon is a word which occurs only once in the text);

  • Hapax legomenon/ total number of unique words;

  • Total number of characters in words/ total number of characters in the body of the email (C);

  • Total number of alphabetic characters in words/ total number of characters in the body of the email (C);

  • Total number of upper case characters in words/ total number of characters in the body of the email (C);

  • Total number of digit characters in words/ total number of characters in the body of the email (C);

  • Total number of white space characters/ total number of characters in the body of the email (C);

  • Total number of space characters/ total number of white space characters; and

  • Total number of tab spaces/ total number of characters in the body of the email (C).
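
As a rough illustration, several of the ratios listed above can be computed from the body of a single email in a few lines of Python. This is a minimal sketch and is not taken from Unmask or from the cited studies: the function name `style_features`, the short `FUNCTION_WORDS` set, and the simple word and sentence splitting rules are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative subset of function words
# (articles, prepositions, conjunctions); real lists are much longer.
FUNCTION_WORDS = {"a", "an", "the", "and", "but", "or",
                  "of", "in", "on", "at", "to", "for", "with"}


def style_features(body: str) -> dict:
    """Compute a few of the listed stylistic ratios for one email body."""
    lines = body.split("\n")
    # Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", body)
    counts = Counter(w.lower() for w in words)
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]

    m = max(len(words), 1)   # M: total number of words (guarded against 0)
    c = max(len(body), 1)    # C: total characters in the body (guarded against 0)
    hapax = sum(1 for n in counts.values() if n == 1)  # words used exactly once
    whitespace = sum(ch.isspace() for ch in body)

    return {
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / len(lines),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / m,
        "vocabulary_richness": len(counts) / m,           # V / M
        "function_word_ratio": sum(counts[w] for w in FUNCTION_WORDS) / m,
        "short_word_count": sum(1 for w in words if len(w) <= 3),
        "hapax_per_word": hapax / m,
        "hapax_per_unique_word": hapax / max(len(counts), 1),
        "upper_case_ratio": sum(ch.isupper() for ch in body) / c,
        "digit_ratio": sum(ch.isdigit() for ch in body) / c,
        "whitespace_ratio": whitespace / c,
        "space_per_whitespace": body.count(" ") / max(whitespace, 1),
        "tab_ratio": body.count("\t") / c,
    }
```

Calling `style_features(text)` on each known email and on the anonymous email yields comparable feature vectors; how those vectors are then compared is a separate design choice that this sketch does not address.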

To date, there is only one application publicly available for performing authorship analysis of emails. This application is a Python script called Unmask. The application was presented at a computer security conference in 2002 to demonstrate the ease with which stylistic patterns could be used to identify authorship and demographic information of an author using only the text of an email or IRC chat session log. Unmask has been used by forensic examiners for the last few years to identify the authorship of unknown emails with a high degree of accuracy (depending on the stylistic features used). Accuracy ranges between 97.85% and 99.01%. Unmask identifies the author of anonymous email text by analyzing select stylistic features and matching properties of the anonymous text with a known email text. Unmask does not use all the listed stylistic features. A summary of features recognized by various researchers has been compiled for reference purposes. The stylistic features detailed above can also be used to classify emails based on the geographical origin of the author, gender, age, occupation, and sexual orientation.

Unmask is available at http://www.immunitysec.com/downloads/unmask1.0.tar.gz. Unmask was developed by Dave Aitel, who currently is CTO of Immunity Security.1 Unmask was written soon after Dave Aitel’s departure from the National Security Agency, where he worked for six years. Similar tools are known to be in use by the Federal Government for purposes of identifying terrorists and other criminals; these tools are not publicly available.

By compounding the score, the algorithm expands the differences between different people. The more you match, the more an individual score will increase; however, this is not a linear function. There are some really obvious words, like “a”, “the”, “I”, and “an”, that a hypothetical email user will use, and thus there are common doubles. Triples are significantly less frequent. Punctuation

Relatively minor differences between the raw scores for two hypothetical test users may reflect significant differences in the likelihood of a match. For example, when an unknown email is compared against each user’s known sample emails, Jane may have a raw score of 20 and John a raw score of 18, so that John’s score is ninety percent of Jane’s. Numerous normal stylistic similarities between Jane and John will result in their scores hitting a local minimum value that reflects these “normal” stylistic similarities. Beyond this local minimum value, unusual and unique stylistic features become a factor. Although these differences are much smaller in magnitude than the normal stylistic similarities, the few matches they produce reflect an exponential difference in the quality of the match. Accordingly, a 10% relative difference in raw score may equate to a 99% likelihood of a match for Jane and a 10% (or lower) likelihood of a match for John, even though Jane’s and John’s styles are objectively very close to each other.
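
The arithmetic behind this point can be sketched in Python. The source does not give the formula Unmask uses to turn raw scores into likelihoods, so the following is only an illustration of why discounting a shared baseline magnifies small raw differences. The function `relative_scores` and the baseline value of 17.5 are hypothetical, and the linear rescaling used here is a simplification rather than the exponential relationship described above.

```python
def relative_scores(raw_scores: dict, baseline: float) -> dict:
    """Return each candidate's share of the score that lies above a baseline.

    Assumption: `baseline` stands in for the score that ordinary stylistic
    similarities give every candidate. It is a hypothetical parameter,
    not a value taken from Unmask.
    """
    excess = {name: max(score - baseline, 0.0)
              for name, score in raw_scores.items()}
    total = sum(excess.values())
    if total == 0:
        # No candidate rises above the baseline; nothing to distinguish.
        return {name: 0.0 for name in raw_scores}
    return {name: value / total for name, value in excess.items()}


raw = {"Jane": 20.0, "John": 18.0}
print(raw["John"] / raw["Jane"])            # raw ratio of John to Jane: 0.9
print(relative_scores(raw, baseline=17.5))  # Jane holds 2.5 / 3.0 of the excess
```

With the baseline removed, a 10% gap in raw score becomes a five-to-one gap in the portion of the score that actually distinguishes the two candidates.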

Some unique features of the matching algorithm should be carefully considered when evaluating the quality of a given match:

  • For two hypothetical users with a strong command of English who use many articles, prepositions, and conjunctions, the less either user is biased toward a given combination of words, the more significant small variations become;

  • Individuals with a limited vocabulary will have their stylistic features padded by less common words and, by default, will generally match less well. Accordingly, the likelihood of error is significantly higher when comparing an anonymous email against a universe of potential email users, some of whom have a good command of English and others of whom have a limited English vocabulary. However, users with a limited command of English will likely have stylistic variations that are indicative of their demographic group or nationality; and

  • Unique words have been shown to be strongly correlated to a given user. However, the Unmask algorithm may not match long and/or odd word combinations, especially where the sample size for a given library of emails for a given user test case becomes extremely large. Nevertheless, the matching algorithm should not be significantly affected with emails, because emails are relatively short (as opposed to other types of written texts), provided that the total sample size of 12,000 words among all emails for a given user is maintained.
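
The idea of matching on word combinations can be sketched as follows. This is not Unmask's code. It assumes that the "doubles" and "triples" mentioned earlier are two-word and three-word sequences, and the function names and weights (rarer triples counting more than common doubles) are hypothetical choices made only for illustration.

```python
import re
from collections import Counter


def word_ngrams(text: str, n: int) -> Counter:
    """Count every run of n consecutive words in the text (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def overlap_score(known: str, unknown: str, weights=None) -> float:
    """Score how many distinct word combinations two texts share.

    Assumption: the weights are illustrative; shared triples are weighted
    more heavily because they occur less often than shared doubles.
    """
    weights = weights or {2: 1.0, 3: 3.0}
    score = 0.0
    for n, weight in weights.items():
        shared = word_ngrams(known, n).keys() & word_ngrams(unknown, n).keys()
        score += weight * len(shared)
    return score
```

For example, `overlap_score(known_emails_text, anonymous_text)` can be computed once per candidate author, and the candidates then ranked by score. A sketch like this treats every shared combination equally within its size class, which is one reason very common pairs built from words such as “a” and “the” contribute little to telling authors apart.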

Figure 1 – Function Words (Prepositions, Articles, and Conjunctions) Are Distinctive Features

The few courts that have addressed the issue over the last century have generally found linguistic stylistic features to be admissible evidence:

  • In the Matter of the Estate of Violet Houssien, 3AN-98-59 P/R, Superior Court for the State of Alaska (1999) (available at http://www.touchngo.com/sp/html/sp-5496.htm), the court held that the disputed will was not authored by the decedent but by the Appellants [or at their direction].

  • In the Matter of the Appeal of Amarjit Saluja, 30082 and 94-16 (1994 California State Personnel Board) (available at http://www.spa.ca.gov/spblaw/pdsindex.htm), the Court found that the employee authored anonymous letters that harmed other employees.

  • In United States v Larson, 596 F2d 759 (CA8 Minn. 1979), the court held that the jury in a criminal prosecution had been properly permitted to consider evidence showing that one ransom note contained three separate misspellings of “approach” as “approuch,” while a letter known to be written by the accused also contained the same misspelling.

  • In Josephs v Briant, 115 Ark 538, 172 SW 1002 (Ark. 1914), the court allowed evidence of spelling peculiarities, as well as syntactical peculiarities, to establish authorship of a document.

  • In Bartholomew v Walsh, 191 Mich. 252, 157 NW 575 (Mich. 1916), evidence of punctuation characteristics and technical typing characteristics was found admissible.

  • In Re Cravens’ Estate, 206 Okla. 174, 242 P2d 135 (Okla. 1952), the court allowed evidence of distinctive punctuation technique along with other typing characteristics to show that a purported testator had not typed certain portions of a disputed will.

Over the last 25 years, with the evolution of more advanced statistical methods and algorithms to identify authorship of a document, this type of evidence has not been challenged. Statistical methods of evaluating the authorship of an article are distinct from traditional literary theory (which, in at least one researcher’s opinion, is not sufficient to satisfy a Daubert challenge). See C. Chaski, A Daubert-inspired assessment of current techniques for language-based author identification, Technical Report, US National Institute of Justice, 1998 (available at www.ncjrs.org). Writeprinting authors using stylistic features is a new method to combat cybercrime: law enforcement or victims of cybercrimes can use a criminal’s own anonymous emails, blog entries, and IRC chat sessions as evidence of illegal conduct.

1 Dave Aitel is a computer security professional who worked at the NSA as a research scientist for six years.
