<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Law Blog 2.0 &#187; Statistical Methods Use Thereof</title>
	<atom:link href="http://law2point0.com/wordpress/topics/e-discovery/discovery-plan/statistical-methods-use-thereof/feed/" rel="self" type="application/rss+xml" />
	<link>http://law2point0.com/wordpress</link>
	<description>This blog covers privacy, security, health information technology and e-discovery related topics. The primary goal of this blog is to raise public awareness of legal issues pertaining to the use of law and technology.</description>
	<lastBuildDate>Sat, 12 Jun 2010 02:39:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Fingerprinting (Writeprinting) Text Using Stylistic Features Can Be Used To Accurately Identify the Authorship of Anonymous Emails, Blog Entries and IRC Chat Sessions</title>
		<link>http://law2point0.com/wordpress/2009/06/20/fingerprinting-writeprinting-text-using-stylistic-features-can-be-used-to-accurately-identify-the-authorship-of-anonymous-emails-blog-entries-and-irc-chat-sessions/</link>
		<comments>http://law2point0.com/wordpress/2009/06/20/fingerprinting-writeprinting-text-using-stylistic-features-can-be-used-to-accurately-identify-the-authorship-of-anonymous-emails-blog-entries-and-irc-chat-sessions/#comments</comments>
		<pubDate>Fri, 19 Jun 2009 17:54:29 +0000</pubDate>
		<dc:creator>Robert Hudock</dc:creator>
				<category><![CDATA[Forensic Linguistics]]></category>
		<category><![CDATA[Forensic Tools]]></category>
		<category><![CDATA[Law and Technology]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Privacy]]></category>
		<category><![CDATA[Privacy Law]]></category>
		<category><![CDATA[Statistical Methods Use Thereof]]></category>
		<category><![CDATA[1st Amendment]]></category>
		<category><![CDATA[Bloggers]]></category>
		<category><![CDATA[forensic linguistics]]></category>
		<category><![CDATA[Writeprint]]></category>

		<guid isPermaLink="false">http://law2point0.com/wordpress/?p=773</guid>
		<description><![CDATA[Going to Court to force an ISP to disclose the identity of an anonymous blogger raises many issues, including First Amendment issues. For example,

    On June 13, 2007, the New Jersey Township of Manalapan filed a malpractice suit against its former attorney Stuart Moskovitz, alleging misconduct regarding the Township's purchase of polluted land in 2005. The decision to file suit was met by a lively debate in the regional press and among local bloggers. One blogger who was particularly critical of the Township, of this and other decisions, was Blogspot blogger "datruthsquad"

(http://www.eff.org/cases/manalapan-v-moskovitz).

Long story short, the Township lost. A copy of EFF's motion to quash is available here motiontoquashmpa-signed, and the Court order quashing the subpoena is available here order-122107.  However, there may be an alternative method for "unmasking" anonymous bloggers, cyber-stalkers, etc. using public information.  Everyone has a unique writeprint (basically a written fingerprint that can be used to identify him or her).  This technique has traditionally been used to identify the true author of a text (e.g. a book) where authorship is disputed or unknown. Forensic linguistics has been used to provide evidence in trademark dispute cases, to identify the authors of anonymous texts (such as threatening or harassing letters), and to identify cases of plagiarism. The identification process relies on the analysis of an individual’s particular patterns of language use (vocabulary, collocations, pronunciation, spelling, grammar, etc.). The term “idiolect” is defined as the speech patterns of a specific person (a dialect, unique in pronunciation, grammar, and vocabulary to a single person). Stylistic features can be used to create a fingerprint of an individual’s writing style (a linguistic fingerprint is called a “writeprint”). A writeprint is composed of features that represent an author’s writing style, which are consistent across all of an individual’s writings. For a gentle introduction, see Digital fingerprints: tiny behavioral differences can reveal your identity, by Julie Rehmeyer in the January 13, 2007 issue of Science News (Westlaw cite 2007 WLNR [...]]]></description>
			<content:encoded><![CDATA[<p align="justify">
<p>Going to Court to force an ISP to disclose the identity of an anonymous blogger raises many legal roadblocks, including First Amendment issues. For example,</p>
<blockquote>
<p align="justify"><em>On June 13, 2007, the New Jersey Township of Manalapan filed a malpractice suit against its former attorney Stuart Moskovitz, alleging misconduct regarding the Township&#8217;s purchase of polluted land in 2005. The decision to file suit was met by a lively debate in the regional press and among local bloggers. One blogger who was particularly critical of the Township, of this and other decisions, was Blogspot blogger &#8220;datruthsquad&#8221;</em></p>
</blockquote>
<p>(http://www.eff.org/cases/manalapan-v-moskovitz).</p>
<p align="justify">Long story short, the Township lost. A copy of the Electronic Frontier Foundation&#8217;s (&#8220;EFF&#8221;) motion to quash is available here <a href="http://law2point0.com/wordpress/wp-content/uploads/2009/06/motiontoquashmpa-signed.pdf"  >motiontoquashmpa-signed</a>, and the Court order quashing the subpoena is available here <a href="http://law2point0.com/wordpress/wp-content/uploads/2009/06/order-122107.pdf"  >order-122107</a>.  However, there may be an alternative method for &#8220;unmasking&#8221; anonymous bloggers, cyber-stalkers, etc. using public information.  Everyone has a unique writeprint (basically a written fingerprint that can be used to identify him or her).  This technique has traditionally been used to identify the true author of a text (e.g. a book) where authorship is disputed or unknown.  Forensic linguistics has been used to provide evidence in trademark dispute cases, to identify the authors of anonymous texts (such as threatening or harassing letters), and to identify cases of plagiarism.  The identification process relies on the analysis of an individual’s particular patterns of language use (vocabulary, collocations, pronunciation, spelling, grammar, etc.).  The term “idiolect” is defined as the speech patterns of a specific person (a dialect, unique in pronunciation, grammar, and vocabulary to a single person).  Stylistic features can be used to create a fingerprint of an individual’s writing style (a linguistic fingerprint is called a “writeprint”).  A writeprint is composed of features that represent an author’s writing style, which are consistent across all of an individual’s writings. For a gentle introduction, see <span style="text-decoration: underline;">Digital fingerprints: tiny behavioral differences can reveal your identity</span>, by Julie Rehmeyer in the January 13, 2007 issue of Science News (Westlaw cite 2007 WLNR 2239738).</p>
<p align="justify">Email identification is a unique subset of authorship identification.  When identifying authorship of anonymous emails, the following considerations have been noted:</p>
<ul>
<li>
<p align="justify">The 	identification of an author is usually attempted from a small set of 	known candidates; and</p>
</li>
<li>
<p align="justify">Other evidence in 	the form of e-mail headers, e-mail trace route, e-mail attachments, 	time stamps, or other independent evidence is often used in 	conjunction with linguistic analysis to establish the identity of 	the author.</p>
</li>
</ul>
<p align="justify">Two studies (both funded by security-related government agencies) have applied forensic linguistics to the identification of the authorship of anonymous emails. (<em>See </em>A. Anderson, M. Corney, O. de Vel, and G. Mohay; <span style="text-decoration: underline;">Identifying the Authors of Suspect E-mail</span>, Communications of the ACM, 2001 (available at eprints.qut.edu.au/archive/00008039/01/8039.pdf); see also Jiexun Li, Rong Zheng, Hsinchun Chen;  <span style="text-decoration: underline;">From Fingerprint to Writeprint,</span> Communications of the ACM (April 2006)).</p>
<p align="justify">Characteristics of an email that are relevant in establishing authorship include:</p>
<ul>
<li>
<p align="justify">Composition and 	writing, such as particular syntactic and structural layout traits;</p>
</li>
<li>
<p align="justify">Patterns of 	vocabulary usage;</p>
</li>
<li>
<p align="justify">Unusual language usage (e.g., converting the letter “f” to “ph”); and</p>
</li>
<li>
<p align="justify">The excessive use 	of digits or upper-case letters.</p>
</li>
</ul>
<p align="justify"><span style="text-decoration: underline;">Id.</span></p>
<p align="justify">These studies have found that a dataset of e-mail used to conduct an evaluation ideally should include about 50 emails per author, with each author’s emails totaling approximately 12,000 words. <span style="text-decoration: underline;">Id.</span> However, other studies have shown that a total of 20 documents per author is adequate to achieve sufficient accuracy for purposes of authorship identification of an unknown email if additional independent corroborating features are also available. <span style="text-decoration: underline;">Id.</span> One study, focusing on knowledge acquisition within an organization (for purposes of maintaining institutional knowledge, which is lost when an employee leaves), found that email text analysis was superior to a content-based approach in identifying subject matter expertise within an organization. Campbell, Christopher S.; Maglio, Paul P.; Cozzi, Alex; and Dom, Byron, <span style="text-decoration: underline;">Expertise Identification using Email Communications,</span> IBM Almaden Research Center (ACM © 2003).   Moreover, this study found that a small number of emails is sufficient to identify a subject matter expert within an organization. <em>Id.</em></p>
<p align="justify">The literature has found the following stylistic features relevant in describing an individual’s idiolect:</p>
<ul>
<li>
<p align="justify">Number of blank 	lines/ total number of lines;</p>
</li>
<li>
<p align="justify">Average sentence 	length;</p>
</li>
<li>
<p align="justify">Average word 	length (number of characters);</p>
</li>
<li>
<p align="justify">Vocabulary 	richness: (distinct words (V) / total number of words (M));</p>
</li>
<li>
<p align="justify">Total number of 	function words (Conjunctions, prepositions, and articles) / total 	number of words;</p>
</li>
<li>
<p align="justify">Total number of short words (three letters or fewer, e.g., all, at, his);</p>
</li>
<li>
<p align="justify">Hapax legomena / total number of words (a hapax legomenon is a word which occurs only once in the text);</p>
</li>
<li>
<p align="justify">Hapax legomena / total number of unique words;</p>
</li>
<li>
<p align="justify">Total number of 	characters in words/ total number of characters in the body of the 	email (C);</p>
</li>
<li>
<p align="justify">Total number of 	alphabetic characters in words/ total number of characters in the 	body of the email (C);</p>
</li>
<li>
<p align="justify">Total number of 	upper case characters in words/ total number of characters in the 	body of the email (C);</p>
</li>
<li>
<p align="justify">Total number of 	digit characters in words/ total number of characters in the body of 	the email (C);</p>
</li>
<li>
<p align="justify">Total number of 	white space characters/ total number of characters in the body of 	the email (C);</p>
</li>
<li>
<p align="justify">Total number of 	space characters/ total number white space characters; and</p>
</li>
<li>
<p align="justify">Total number of 	tab spaces/ total number of characters in the body of the email (C).</p>
</li>
</ul>
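<p align="justify">Several of the ratios listed above can be computed directly from the text of an email. The following is a minimal Python sketch under stated assumptions, not code from Unmask or from the cited studies: the function name stylistic_features, the abbreviated FUNCTION_WORDS list, the word-matching pattern and the sample text are all hypothetical choices made for illustration.</p>

```python
import re
from collections import Counter

# Hypothetical, deliberately short function-word list (articles,
# conjunctions, prepositions); a real study would use a fuller one.
FUNCTION_WORDS = {"a", "an", "the", "and", "but", "or",
                  "of", "in", "on", "at", "to", "for", "with"}


def stylistic_features(body):
    """Compute a handful of the listed ratios for one email body."""
    lines = body.split("\n")
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", body)
    counts = Counter(w.lower() for w in words)
    m = len(words) or 1   # M: total words ("or 1" avoids dividing by zero)
    c = len(body) or 1    # C: total characters in the body
    hapax = sum(1 for n in counts.values() if n == 1)
    whitespace = sum(1 for ch in body if ch.isspace())
    return {
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / len(lines),
        "avg_word_length": sum(len(w) for w in words) / m,
        "vocabulary_richness": len(counts) / m,            # V / M
        "function_word_ratio": sum(counts[w] for w in FUNCTION_WORDS) / m,
        "short_word_count": sum(1 for w in words if len(w) in (1, 2, 3)),
        "hapax_per_word": hapax / m,
        "hapax_per_unique_word": hapax / (len(counts) or 1),
        "upper_case_ratio": sum(1 for ch in body if ch.isupper()) / c,
        "digit_ratio": sum(1 for ch in body if ch.isdigit()) / c,
        "whitespace_ratio": whitespace / c,
        "space_per_whitespace": body.count(" ") / (whitespace or 1),
        "tab_ratio": body.count("\t") / c,
    }


sample = "The cat saw the dog.\n\nThe dog ran."
features = stylistic_features(sample)
print(features["vocabulary_richness"])   # 5 distinct words out of 8
```

<p align="justify">Each email then becomes a vector of numbers, which could be compared across a set of candidate authors. Which features to keep, and how to weight them, is a separate question that this sketch does not address.</p>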
<p align="justify">To date, only one application for performing authorship analysis of emails is publicly available: a Python script called Unmask.  The application was presented at a computer security conference in 2002 to demonstrate the ease with which stylistic patterns could be used to identify authorship and demographic information of an author using only the text of an email or IRC chat session log.  Unmask has been used by forensic examiners for the last few years to identify the authorship of unknown emails with a high degree of accuracy (depending on the stylistic features used).  Accuracy ranges between 97.85% and 99.01%.  Unmask identifies the author of anonymous email text by analyzing select stylistic features and matching properties of the anonymous text with a known email text.  Unmask does not use all the listed stylistic features.  A summary of features recognized by various researchers has been compiled for reference purposes.  The stylistic features detailed above can also be used to classify emails based on the geographical origin of the author, gender, age, occupation, and sexual orientation.</p>
<p align="justify">Unmask is available at <span style="color: #0000ff;"><span style="text-decoration: underline;"><a target="_blank" href="http://www.immunitysec.com/downloads/unmask1.0.tar.gz"  >http://www.immunitysec.com/downloads/unmask1.0.tar.gz</a></span></span>.  Unmask was developed by Dave Aitel, who currently is CTO of Immunity Security.<sup><a target="_blank" href="https://docs.google.com/a/securitydotmatrix.com/Doc?id=ddxnjtjz_467dk9rkwgt&amp;hl=en#sdfootnote1sym" rel="nofollow"  name="sdfootnote1anc" ><sup>1</sup></a></sup> Unmask was written soon after Dave Aitel’s departure from the National Security Agency, where he worked for six years.  Similar tools are known to be in use by the Federal Government for purposes of identifying terrorists and other criminals; these tools are not publicly available.  Unmask compounds its score, which expands the differences between different people: the more features match, the more an individual score increases, but the increase is not linear.  Some really obvious words, like &#8220;a&#8221;, &#8220;the&#8221;, &#8220;I&#8221;, and &#8220;an&#8221;, will be used by any hypothetical email user, and so will common doubles (two-word combinations).  Triples (three-word combinations) occur significantly less frequently.   Punctuation</p>
<p align="justify">Relatively minor differences between the raw scores for two hypothetical test users may reflect significant differences in the likelihood of a match.  For example, when an unknown email is compared against each user’s known sample emails, Jane may have a raw score of 20 and John a raw score of 18, so John’s score is ninety percent of Jane’s.  The numerous normal stylistic similarities between Jane and John will result in their scores hitting a local minimum value that reflects these “normal” stylistic similarities.  Beyond this local minimum value, unusual and unique stylistic features become a factor. Although these differences are much smaller in relative magnitude than the normal stylistic similarities, the few matches involved reflect an exponential difference in the quality of the match.  Accordingly, a 10% relative difference in raw score may equate to a 99% match for Jane and a 10% (or lower) likelihood of a match for John, even though Jane’s and John’s styles are objectively very close to each other.</p>
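<p align="justify">To make the idea of matching word combinations concrete, the sketch below counts the two-word and three-word runs that an unknown text shares with a known sample, and then reproduces the Jane and John arithmetic from the paragraph above. It is only an illustration under stated assumptions: it is not Unmask&#8217;s scoring function, and the names word_ngrams and shared_ngrams, the whitespace tokenization and the sample sentences are hypothetical.</p>

```python
from collections import Counter


def word_ngrams(text, n):
    """Count every run of n consecutive lower-cased words in text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def shared_ngrams(unknown, known, n):
    """Number of n-word runs that occur in both texts (counted with multiplicity)."""
    unknown_counts = word_ngrams(unknown, n)
    known_counts = word_ngrams(known, n)
    return sum(min(count, known_counts[gram])
               for gram, count in unknown_counts.items())


# Hypothetical texts standing in for an anonymous email and a known sample.
unknown = "i saw the cat in the hat"
known = "the cat in the hat sat on the mat"

doubles = shared_ngrams(unknown, known, 2)
triples = shared_ngrams(unknown, known, 3)
print(doubles, triples)

# Raw scores taken from the hypothetical example in the text above.
jane_raw, john_raw = 20, 18
relative = john_raw / jane_raw
print(relative)   # John's raw score as a fraction of Jane's
```

<p align="justify">Longer runs are shared less often than shorter ones, which is why a handful of matching triples can carry more weight than many matching doubles. How a raw count is turned into a likelihood of a match is left open here, since the article describes that step only as non-linear.</p>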
<p align="justify">Some unique features of the matching algorithm should be carefully considered when evaluating the quality of a given match:</p>
<ul>
<li>
<p align="justify">For two hypothetical users with a strong command of English who use many articles, prepositions and conjunctions, the less either user is biased toward a given combination of words, the more significant small variations become;</p>
</li>
<li>
<p align="justify">Individuals with a limited vocabulary will have their stylistic features padded by less common words and, by default, will generally match less well. Accordingly, the likelihood of error is significantly higher when comparing an anonymous email against a universe of potential email users, some of whom have a good command of English and others of whom have a limited English vocabulary.  However, users with a limited command of English will likely have stylistic variations that are indicative of their demographic group or nationality; and</p>
</li>
<li>
<p align="justify">Unique words have been shown to be strongly correlated to a given user.  However, the Unmask algorithm may not match long and/or odd word combinations, especially where the sample size for a given library of emails for a given user test case becomes extremely large.   Nevertheless, the matching algorithm should not be significantly affected with emails, because emails are relatively short (as opposed to other types of written texts), provided that the total sample size of 12,000 words among all emails for a given user is maintained.</p>
</li>
</ul>
<p align="justify"><img src="https://docs.google.com/a/securitydotmatrix.com/File?id=ddxnjtjz_469f5jqdhc7_b" border="0" alt="" width="609" height="357" align="bottom" /></p>
<p align="justify"><span style="color: #4f81bd;"><span style="font-size: x-small;"><strong>Figure 1 &#8211; Function Words (Prepositions, Articles, and Conjunctions) Are Distinctive Features</strong></span></span></p>
<p align="justify">The few courts that have addressed the issue over the last century have generally found linguistic stylistic features to be admissible evidence:</p>
<ul>
<li>
<p align="justify"><span style="text-decoration: underline;">In the Matter of the Estate of Violet Houssien</span>, 3AN-98-59 P/R, Superior Court for the State of Alaska (1999) (available at <span style="color: #0000ff;"><span style="text-decoration: underline;"><a target="_blank" href="http://www.touchngo.com/sp/html/sp-5496.htm"  >http://www.touchngo.com/sp/html/sp-5496.htm</a></span></span>), the court held that the disputed will was not authored by the decedent but by the Appellants [or at their direction].</p>
</li>
<li>
<p align="justify"><span style="text-decoration: underline;">In the Matter of the Appeal of Amarjit Saluja</span>, 30082 and 94-16 (1994 California State Personnel Board) (available at <span style="color: #0000ff;"><span style="text-decoration: underline;">http://www.spa.ca.gov/spblaw/pdsindex.htm</span></span>), the Board found that the employee had authored anonymous letters that harmed other employees.</p>
</li>
<li>
<p align="justify">In <span style="text-decoration: underline;">United 	States v Larson</span>, 596 F2d 759 (CA8 Minn. 1979), the court held 	that the jury in a criminal prosecution had been properly permitted 	to consider evidence showing that one ransom note contained three 	separate misspellings of &#8220;approach&#8221; as &#8220;approuch,&#8221; 	while a letter known to be written by the accused also contained the 	same misspelling.</p>
</li>
<li>
<p align="justify">In <span style="text-decoration: underline;">Josephs v Briant</span>, 115 Ark 538, 172 SW 1002 (Ark. 1914), the court allowed evidence of spelling peculiarities, as well as syntactical peculiarities, to establish authorship of a document.</p>
</li>
<li>
<p align="justify">In <span style="text-decoration: underline;">Bartholomew v Walsh</span>, 191 Mich. 252, 157 NW 575 (Mich. 1916), evidence of punctuation characteristics and technical typing characteristics was found admissible.</p>
</li>
<li>
<p align="justify">In <span style="text-decoration: underline;">Re Cravens&#8217; 	Estate</span>, 206 Okla. 174, 242 P2d 135 (Okla. 1952), the court 	allowed evidence of distinctive punctuation technique along with 	other typing characteristics to show that a purported testator had 	not typed certain portions of a disputed will.</p>
</li>
</ul>
<p align="justify">Over the last 25 years, with the evolution of more advanced statistical methods and algorithms to identify authorship of a document, this type of evidence has not been challenged.  Statistical methods of evaluating the authorship of an article are distinct from traditional literary theory (which, in at least one researcher’s opinion, is not sufficient to satisfy a Daubert challenge). <em>See </em>C. Chaski, <span style="text-decoration: underline;">A Daubert-inspired assessment of current techniques for language-based author identification</span>, Technical Report, US National Institute of Justice, 1998 (available at <span style="color: #0000ff;"><span style="text-decoration: underline;">www.ncjrs.org</span></span>).  Writeprinting authors using stylistic features is a new method to combat cybercrime, in which law enforcement or victims of cybercrimes can use a criminal’s own anonymous emails, blog entries and IRC chat sessions as evidence of illegal conduct.</p>
<div id="sdfootnote1">
<p><a target="_blank" href="https://docs.google.com/a/securitydotmatrix.com/Doc?id=ddxnjtjz_467dk9rkwgt&amp;hl=en#sdfootnote1anc" rel="nofollow"  name="sdfootnote1sym" >1</a> Dave Aitel is a computer security professional who worked at the NSA 	as a research scientist for six years.</div>
<p><a href="http://law2point0.com/wordpress/2009/06/20/fingerprinting-writeprinting-text-using-stylistic-features-can-be-used-to-accurately-identify-the-authorship-of-anonymous-emails-blog-entries-and-irc-chat-sessions/" rel="bookmark">Fingerprinting (Writeprinting) Text Using Stylistic Features Can Be Used To Accurately Identify the Authorship of Anonymous Emails, Blog Entries and IRC Chat Sessions</a> originally appeared on <a href="http://law2point0.com/wordpress">Law Blog 2.0</a> on June 20, 2009.</p>
]]></content:encoded>
			<wfw:commentRss>http://law2point0.com/wordpress/2009/06/20/fingerprinting-writeprinting-text-using-stylistic-features-can-be-used-to-accurately-identify-the-authorship-of-anonymous-emails-blog-entries-and-irc-chat-sessions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
