September 2009

Legal Disclaimer

Your use of this Blog does not create an attorney-client relationship. Your e-mail or comments do not create an attorney-client relationship. We have no duty to keep confidential the information that is submitted to this blog. This blog is not a substitute for, nor does it constitute legal advice. Only an attorney who knows the details of your particular situation and is properly licensed in the applicable state (or states) is able to appropriately and properly address any legal issues you may have.

Is Truly De-identified Data an Impossibility?

De-identification of Data

Social networking sites, efficient search tools (Bing, Dogpile, Google, Yahoo), blogs, cookies, mailing lists, message boards, ActiveX controls and embedded JavaScript on websites, and other databases make it easy to identify a new business prospect or to cross-reference materials from multiple sources to yield unique insights into a matter of interest.  However, these online repositories of data are making it much more difficult to maintain the anonymity of those whose confidential information has been de-identified.  De-identified data has many useful purposes; in the aggregate it can be used for tracking disease and flu outbreaks, for tax purposes, etc.  There is also a darker use of these data sources: the ethically challenged in our society use them for socially unproductive purposes.  For example, cyber-stalking and cyber-harassment are now serious problems for both companies and individuals; if you have ever tried to stop such individuals, you will have noted the absence of a well-developed corpus of law in these areas.

De-identified information is information that does not allow an individual to be identified because specified identifiers have been removed.  Scientists have demonstrated, however, that they can often “reidentify” or “de-anonymize” individuals hidden in anonymized data.  See Ohm, Paul, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization (August 13, 2009), University of Colorado Law Legal Studies Research Paper No. 09-12 (available at SSRN: http://ssrn.com/abstract=1450006); see also Cassa, Christopher A.; Wieland, Shannon C.; and Mandl, Kenneth D., Re-identification of home addresses from spatial locations anonymized by Gaussian skew, International Journal of Health Geographics (August 2008) (available at http://www.ij-healthgeographics.com/content/pdf/1476-072X-7-45.pdf) (finding that multiple de-identified versions of the same data set, each anonymized using a method known as nondeterministic Gaussian skew, can be used to ascertain original geographic locations).

The fundamental flaw in data anonymization methodologies is that an adversary may be able to find a unique data fingerprint (e.g., date of birth, ZIP code, and gender) and link that data to auxiliary or outside information.  A potential adversary can use resources such as web search (Google), public records, blogs, and social networks (Facebook, etc.); the issue is particularly troublesome when multiple organizations independently release anonymized data about the same or similar populations.  The ultimate balance lies in de-identifying data sufficiently to withstand inspection by a potential adversary while keeping it useful for public health or other similar needs.
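The linkage described above can be illustrated with a minimal Python sketch.  Everything in it is hypothetical: the records, the names, and the helper functions `fingerprint` and `link` are invented for illustration and are not drawn from any of the cited studies.  It assumes the adversary holds an auxiliary list (for example, a public record) that shares date of birth, ZIP code, and gender with the released data.

```python
from collections import defaultdict

# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released = [
    {"dob": "1961-07-14", "zip": "60614", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1975-02-02", "zip": "60614", "gender": "M", "diagnosis": "diabetes"},
    {"dob": "1975-02-02", "zip": "60614", "gender": "M", "diagnosis": "hypertension"},
]

# Hypothetical auxiliary source that still carries names.
auxiliary = [
    {"name": "Jane Roe", "dob": "1961-07-14", "zip": "60614", "gender": "F"},
    {"name": "John Doe", "dob": "1975-02-02", "zip": "60614", "gender": "M"},
]

QUASI_IDENTIFIERS = ("dob", "zip", "gender")


def fingerprint(record):
    """Build the data fingerprint from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released_rows, auxiliary_rows):
    """Return {name: released_row} where a fingerprint matches exactly one
    released row and exactly one auxiliary row."""
    released_by_fp = defaultdict(list)
    for row in released_rows:
        released_by_fp[fingerprint(row)].append(row)

    aux_by_fp = defaultdict(list)
    for row in auxiliary_rows:
        aux_by_fp[fingerprint(row)].append(row)

    matches = {}
    for fp, people in aux_by_fp.items():
        rows = released_by_fp.get(fp, [])
        if len(people) == 1 and len(rows) == 1:
            matches[people[0]["name"]] = rows[0]
    return matches


reidentified = link(released, auxiliary)
```

In this toy data only the record with a unique fingerprint is linked to a name; the two released rows that share a fingerprint remain ambiguous, which is why uniqueness of the fingerprint is the crux of the problem.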

De-identification of health information is essential on the one hand, but on the other, information that is re-identified can be used to embarrass, extort, or otherwise annoy the person whose information has been disclosed.  With respect to Protected Health Information (PHI), the HIPAA Privacy Rule permits covered entities to release data that have been de-identified without obtaining an authorization and without further restrictions upon use or disclosure, because de-identified data is not PHI and, therefore, not subject to the Privacy Rule.  Generally a covered entity can de-identify PHI in one of two ways.  The first way, the “safe-harbor” method, is to remove all 18 identifiers enumerated at section 164.514(b)(2) of the regulations.  Data that are stripped of these 18 identifiers are regarded as de-identified, unless the covered entity has actual knowledge that it would be possible to use the remaining information, alone or in combination with other information, to identify the subject.  However, the copious amounts of auxiliary information publicly available on the Internet may make safe-harbor protection impossible to achieve in practice.  On the other hand, the “actual knowledge” requirement may allow the release of data that a hacker (super user) could readily re-identify (i.e., by associating a person with the medical or other confidential data) while the covered entity “reasonably” believes the data are de-identified.

The 18 identifiers are:

a) Names;
b) Geographic subdivisions smaller than a state;
c) All elements of dates (except year) related to an individual, including dates of admission, discharge, birth, and death; for individuals over 89 years old, the year of birth also must not be used;
d) Telephone numbers;
e) Fax numbers;
f) Electronic mail addresses;
g) Social Security numbers;
h) Medical record numbers;
i) Health plan beneficiary numbers;
j) Account numbers;
k) Certificate/license numbers;
l) Vehicle identifiers and serial numbers, including license plates;
m) Device identifiers and serial numbers;
n) Web URLs;
o) Internet Protocol (IP) addresses;
p) Biometric identifiers (including finger and voice prints);
q) Full-face photos and comparable images; and
r) Any other unique identifying number, characteristic, or code.
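As a rough illustration of the safe-harbor approach, the Python sketch below drops identifier fields from a record and reduces a birth date to a year.  The field names, the `DIRECT_IDENTIFIER_FIELDS` set, and the `strip_identifiers` helper are all hypothetical, and the sketch does not implement the full regulatory standard; it only shows the general shape of the transformation.

```python
# Hypothetical field names standing in for some of the listed identifiers.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "street_address", "city", "zip", "phone", "fax", "email", "ssn",
    "medical_record_number", "health_plan_number", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
    "biometric_id", "photo",
}


def strip_identifiers(record, reference_year):
    """Drop identifier fields and keep only the year of birth.

    Assumes `birth_date` is an ISO "YYYY-MM-DD" string. The age check is
    approximate (it compares years only) and errs toward suppression.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIER_FIELDS and key != "birth_date"
    }
    birth_date = record.get("birth_date")
    if birth_date:
        birth_year = int(birth_date[:4])
        # Item (c): for individuals over 89, the year of birth is not used.
        if reference_year - birth_year > 89:
            cleaned["birth_year"] = None
        else:
            cleaned["birth_year"] = birth_year
    return cleaned


example = strip_identifiers(
    {"name": "Jane Roe", "zip": "60614", "birth_date": "1961-07-14",
     "diagnosis": "asthma"},
    reference_year=2009,
)
```

Note that what survives (year of birth plus clinical detail) can still act as a fingerprint when combined with outside information, which is the concern raised above.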

The second method to de-identify data is to have a qualified statistician determine, using generally accepted statistical and scientific principles and methods, that the risk is very small that the information could be used, alone or in combination with other reasonably available information, to identify the subject of the information.  The qualified statistician must document the methods and results of the analysis that justify such a determination.  (See 67 Fed. Reg. 53233 (August 14, 2002).)

As is typically the case, if some method is built into the system to allow for re-identification, then the covered entity may not (1) use or disclose the code or other means of record identification for any purpose other than as a re-identification code for the de-identified data, or (2) disclose its method of re-identifying the information.  In essence the method and key (the code) function almost like an encryption scheme, and as with encryption, when the key is compromised the data are compromised.
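A re-identification code of this kind might be sketched in Python as follows.  The `ReidentificationKey` class and its method names are hypothetical, and the design choice of a random code (rather than one computed from the record itself) is an assumption made here so that the code reveals nothing on its own.

```python
import secrets


class ReidentificationKey:
    """Holds the code-to-identity table, which stays with the covered entity."""

    def __init__(self):
        self._identity_by_code = {}

    def assign_code(self, identity):
        """Issue a random code that is not computed from the identity."""
        code = secrets.token_hex(8)
        while code in self._identity_by_code:  # regenerate on a collision
            code = secrets.token_hex(8)
        self._identity_by_code[code] = identity
        return code

    def reidentify(self, code):
        """Look up the identity; only the holder of the table can do this."""
        return self._identity_by_code[code]


key = ReidentificationKey()
code = key.assign_code("MRN-001")  # hypothetical record identifier
```

The released data would carry only `code`.  As with an encryption key, anyone who obtains the table inside `key` can reverse every code at once.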

One study using 1990 census data showed that 87% (216 million of 248 million) of the United States population reported characteristics that made them uniquely identifiable using only three pieces of data: 5-digit ZIP code, gender, and date of birth.  Fifty-three percent of the U.S. population could be uniquely identified using only gender, location (city, town, or municipality), and date of birth.  At the county level, approximately 18% of the U.S. population could be uniquely identified.  See L. Sweeney, Uniqueness of Simple Demographics in the U.S. Population, LIDAP-WP4, Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA: 2000 (available at http://privacy.cs.cmu.edu/dataprivacy/papers/LIDAP-WP4abstract.html).
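The kind of measurement behind these percentages can be sketched in a few lines of Python.  The four-person `population` and the `unique_fraction` helper are hypothetical and far too small to say anything about real data; they only show how uniqueness falls as the geographic field gets coarser.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Hypothetical toy population.
population = [
    {"zip": "60614", "county": "Cook", "gender": "F", "dob": "1961-07-14"},
    {"zip": "60614", "county": "Cook", "gender": "M", "dob": "1975-02-02"},
    {"zip": "60614", "county": "Cook", "gender": "M", "dob": "1975-02-02"},
    {"zip": "60657", "county": "Cook", "gender": "M", "dob": "1975-02-02"},
]

by_zip = unique_fraction(population, ("zip", "gender", "dob"))        # 0.5
by_county = unique_fraction(population, ("county", "gender", "dob"))  # 0.25

# The headline figure quoted above: 216 million of 248 million.
reported_share = round(216 / 248 * 100)  # 87
```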

Interestingly, the older the population, the more likely it is that an individual can be uniquely identified.  Accordingly, greater care must be taken with the medical data of elderly populations.  Philippe Golle, Revisiting the Uniqueness of Simple Demographics in the US Population (Palo Alto Research Center, October 30, 2006) (available at http://www.truststc.org/wise/articles2009/articleM3.pdf).  Additional research has found that when multiple de-identified data sets are made from overlapping data sets, re-identification of data becomes progressively easier.  Accordingly, even where extremely large geographical areas are used to aggregate data for population studies, this information may still be re-identified.
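Why multiple releases weaken protection can be seen in a simplified Python sketch.  It assumes each release independently adds Gaussian noise to the same true coordinates; the coordinates, the noise level `sigma`, the number of releases, and the helper names are all hypothetical, and this is not a reproduction of the method in any cited paper.

```python
import math
import random


def skew(point, sigma, rng):
    """Return the point with independent Gaussian noise added to each axis."""
    x, y = point
    return (x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))


def average(points):
    """Average a list of (x, y) points coordinate by coordinate."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


rng = random.Random(0)        # fixed seed so the sketch is repeatable
true_location = (10.0, 20.0)  # hypothetical coordinates
sigma = 1.0                   # hypothetical noise level per axis

# 400 independently skewed releases of the same location.
releases = [skew(true_location, sigma, rng) for _ in range(400)]
estimate = average(releases)
error = distance(estimate, true_location)
```

Because the noise in each release is independent, averaging shrinks the expected error roughly in proportion to one over the square root of the number of releases, so `error` ends up far smaller than `sigma`.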

A limited data set is even easier to re-identify than de-identified data (although there are significant legal restrictions on the use of this information).  A limited data set is one that excludes the direct identifiers in 164.514(e)(2).  Unlike a de-identified data set, a limited data set is PHI because it may include dates, city, state, and ZIP codes, and other unique identifying codes or characteristics not listed as direct identifiers.  A limited data set may be used or disclosed, without Authorization, for research, public health, or health care operations purposes, in accordance with section 164.514(e), only if the covered entity and limited data set recipient enter into a data use agreement.  However, if the use or disclosure could be made under another provision of the Privacy Rule, such as for public health purposes in accordance with section 164.512(b), such an agreement is not required.

“Value-added” de-identification that replaces personal health information with tags that retain temporal sequences and the geographic context simply may not work in a networked world.  Covered entities, business associates, and others who aggregate and de-identify data sets may need to start limiting the downstream rights of licensees of de-identified data and conduct some type of quality assurance process on their de-identification techniques.  What works today to de-identify data may not work in a year, yet your data will likely still be available somewhere on the Internet.  At the same time, simply removing all personal health information may negate the value of the data.

Other Resources:

Federal Committee on Statistical Methodology, Office of Management and Budget, Statistical Policy Working Paper 22 (Revised 2005), Report on Statistical Disclosure Limitation Methodology (available at http://www.fcsm.gov/working-papers/SPWP22_rev.pdf).

The New York Times reported, in an article entitled When 2+2 Equals a Privacy Question, that “Some healthcare concerns say they have been able to offer study data to researchers stripped of specific personal details like your name, phone number, and email address,” but “in some cases researchers may be able to re-identify you by correlating anonymous information with the digital trail that you’ve left on blogs, chat rooms and Twitter.” (See http://www.nytimes.com/2009/10/18/business/18stream.html.)
