What is Anonymisation?

Anonymisation (also called de-identification or confidentialisation) is a process that removes all personal identifying information from data that represents an identifiable individual. One of the main purposes of anonymising personal information (e.g. health or financial records) is to make this information accessible for secondary uses (such as publication or research) without infringing upon an individual’s privacy.

Three levels of data identifiability

The NHMRC National Statement on Ethical Conduct in Human Research (2007 – Updated May 2015) states that with respect to data identifiability, data may be collected, stored or disclosed in three mutually exclusive forms:

  • individually identifiable data, where the identity of a specific individual can reasonably be ascertained;
  • re-identifiable data, from which identifiers have been removed and replaced by a code, but it remains possible to re-identify a specific individual by, for example, using the code or linking different data sets;
  • non-identifiable data, which have never been labelled with individual identifiers or from which identifiers have been permanently removed, and by means of which no specific individual can be identified. A subset of non-identifiable data are those that can be linked with other data so it can be known that they are about the same data subject, although the person’s identity remains unknown.

What makes information identifiable?

In Australia, “identification information” about an individual and an “identifier” of an individual are defined in the Privacy Act 1988 as follows:

Identification information about an individual means:

  1. the individual’s full name; or
  2. an alias or previous name of the individual; or
  3. the individual’s date of birth; or
  4. the individual’s sex; or
  5. the individual’s current or last known address, and 2 previous addresses (if any); or
  6. the name of the individual’s current or last known employer; or
  7. if the individual holds a driver’s licence—the individual’s driver’s licence number.

Identifier of an individual means a number, letter or symbol, or a combination of any or all of those things, that is used to identify the individual or to verify the identity of the individual, but does not include:

  1. the individual’s name; or
  2. the individual’s ABN (within the meaning of the A New Tax System (Australian Business Number) Act 1999); or
  3. anything else prescribed by the regulations.

Under what circumstances is Anonymisation required?

Anonymisation of information pertaining to an individual is required whenever the information is to be shared (for example, for publication or research) without infringing upon that individual’s privacy.

How is Anonymisation achieved?

Anonymisation or de-identification of data can be achieved using two broad methods:

  1. Removing explicit identifying information about an individual.
  2. Applying expert statistical knowledge to render information not individually identifiable and to ensure there is only a very small risk that the information could be used, alone or in combination with other information, to identify an individual.
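
As a minimal sketch of the first method, assume records are held as Python dictionaries and that the field names below (all hypothetical) mark the explicit identifiers. The identifying fields can then be dropped before release. This illustrates the removal step only; it does not address the residual identification risk that the second method is meant to manage.

```python
# Hypothetical field names; a real dataset needs its own list of identifiers.
DIRECT_IDENTIFIERS = {
    "full_name",
    "address",
    "date_of_birth",
    "drivers_licence_number",
}


def remove_direct_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the explicit identifying fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


# Invented example record.
record = {
    "full_name": "Jane Citizen",
    "address": "1 Example Street",
    "date_of_birth": "1980-02-14",
    "diagnosis_related_group": "DRG-X",
}
released = remove_direct_identifiers(record)
print(released)  # {'diagnosis_related_group': 'DRG-X'}
```

The original record is left unchanged, so the custodian keeps the identified copy while only the reduced copy is released.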

What are the Australian Standards for Anonymisation?

The “Guidelines for the Disclosure of Secondary Use Health Information for Statistical Reporting, Research and Analysis 2015”, issued by the National Health Information Standards and Statistics Committee (NHISSC), provide general guidance to assist in managing the risks of identifying individual patients/clients and health service providers.

It points to the Australian Bureau of Statistics for a comprehensive range of techniques, including alternative approaches for managing the risk of personal identification that could be applied to health information by data custodians.

Additionally, the NHISSC endorses some specific techniques which are briefly summarised below. Please refer to the original document for the full content and note that in all cases, professional judgement is required to assess the privacy implications, to utilise one or more of the techniques, and to assess whether an individual or an organisation’s commercial operations can be identified and have previously unknown information about them disclosed:

  1. Anonymising data:
    • removing and/or modifying personal identifiers such as a person’s name, address, date of birth and unit record number. For example, if it is an essential requirement of the data request to know that multiple episodes relate to the same person in the same hospital, then the unit record numbers provided should be encrypted;
    • not providing other specific dates unless absolutely necessary, otherwise provide month and year of admission/separation, etc.;
    • aggregating variables wherever possible: e.g. provide 5-year age groups rather than date of birth; a metropolitan/rural indicator or Statistical Area Level 2 (SA2) rather than postcode and locality of residence; diagnosis related group instead of individual diagnosis and procedure codes, etc. The aggregating technique is based on the data minimisation principle and addresses the concern that if the population in the community is too small, there is a risk of individual identification and information disclosure. Custodians should ensure that the pool of people who could potentially have contributed to unit record data or to a cell in aggregate data is as large as possible while still enabling the user to do their job. This approach could be assisted by a numerical test, i.e. unit record data would not be provided for sub-groups whose estimated population is less than a value set by the custodian. Custodians should require external users of unit record data to sign ‘conditions of release’ covering specific confidentiality requirements such as the purpose for which data may be used, requirements for the secure storage and retention of data, restrictions on the publication of data, the provision of data to a third party and any attempt to re-identify individuals. The conditions should indicate the applicable laws covering release and the penalties that apply for a breach of the conditions.
  2. Small cell suppression in aggregate data. It is an easy test to apply and detects cells with potential identification problems possibly leading to the release of previously unknown information, e.g. age, diagnosis or procedure. Small cells (e.g. containing values between 1 and 4) may be avoided by aggregating variables, e.g. age group ranges 65-74, 75-84, 85+ are replaced with 65+, data from small areas or communities are aggregated over a number of years, etc. If this is not possible, then the small cells may be suppressed.
  3. Cells in aggregate data where the value of the cell is the same as a row/column total should be suppressed if it is considered that it could lead to disclosure of an additional attribute.
  4. The application of the small cell, and cell = row/column total techniques may require the suppression or amalgamation of several cells in a table, possibly including some with values of zero or greater than 4, in order that a cell not be derivable by subtraction. In these circumstances, it is advisable that the compiler of the table choose a method of confidentialisation that maintains the column and row totals and results in the loss of the least amount of useful information.
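
A few of the techniques summarised above can be sketched in Python. This is an illustration under stated assumptions, not part of the NHISSC guidelines: the function names, the band labels and the threshold of 5 (chosen to match the “between 1 and 4” example) are all hypothetical. The sketch performs primary suppression only, so it does not by itself prevent a suppressed cell from being derived by subtraction from row or column totals.

```python
from collections import Counter
from datetime import date

SMALL_CELL_THRESHOLD = 5  # assumption: counts of 1 to 4 are treated as small cells


def five_year_age_group(age):
    """Map an age in years to a 5-year band such as '40-44'."""
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


def month_and_year(d):
    """Reduce a full date to year and month, e.g. '2015-03'."""
    return f"{d.year:04d}-{d.month:02d}"


def suppress_small_cells(counts, threshold=SMALL_CELL_THRESHOLD):
    """Replace counts from 1 up to threshold - 1 with None (suppressed)."""
    return {
        cell: (None if 0 < n < threshold else n)
        for cell, n in counts.items()
    }


# Invented example: ages of people recorded for one small area.
ages = [41, 42, 43, 44, 40, 67, 68, 71]
table = Counter(five_year_age_group(a) for a in ages)

print(suppress_small_cells(table))
# {'40-44': 5, '65-69': None, '70-74': None}
print(month_and_year(date(2015, 3, 17)))
# 2015-03
```

In this invented table the two older bands are suppressed. Following the guidance above, a custodian would usually first try merging them into a broader band such as 65+ and suppress only if the merged cell were still small.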

The document also contains a number of case studies to illustrate the various approaches.

An additional guide from the Australian Government’s National Statistical Service entitled “How to confidentialise data: the basic principles” provides further advice and tips on ensuring individuals or organisations are not identifiable within aggregated datasets.


Best practice for generating temporary codes for re-identification purposes

Best practice guidance on generating temporary codes for re-identification purposes is available from the US Department of Health and Human Services, which suggests:

(A) “The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual”; and

(B) One “does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for re-identification”.
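
One way to satisfy condition (A) is to generate each code at random rather than deriving it from the person’s details (for example, by hashing a name or date of birth). The sketch below uses Python’s standard `secrets` module; the function name, the code length and the in-memory lookup table are illustrative assumptions, not part of the cited guidance. In line with condition (B), the lookup table would be stored securely by the custodian and never released with the data.

```python
import secrets


def assign_random_codes(person_ids, n_bytes=8):
    """Assign each person a random code unrelated to their information.

    Returns a dict mapping each original identifier to its code. This lookup
    table is the re-identification mechanism and stays with the custodian.
    """
    lookup = {}
    used = set()
    for person_id in person_ids:
        if person_id in lookup:  # keep one code per person
            continue
        code = secrets.token_hex(n_bytes)
        while code in used:  # regenerate on the (very unlikely) collision
            code = secrets.token_hex(n_bytes)
        used.add(code)
        lookup[person_id] = code
    return lookup


# Invented unit record numbers for illustration.
lookup = assign_random_codes(["URN-0001", "URN-0002", "URN-0003"])
released_codes = list(lookup.values())  # only the codes accompany released data
```

Because the codes are random, they reveal nothing about the individual on their own; re-identification is possible only for someone holding the lookup table.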

Other useful resources