Data anonymization in R

Written on 2014-11-07

Use cases

  • Public reports.
  • Public data sharing, e.g. R packages download logs from CRAN's RStudio mirror - cran-logs.rstudio.com - mask ip addresses.
  • Reports or data sharing for external vendor.
  • Development works can operate on anonymized PRODUCTION data.
    Manually or semi-manually populated data can often brings some new issue after migration to PRODUCTION data.
    Such anonymized PRODUCTION data can be quite handy for the devs.

Dependencies

suppressPackageStartupMessages({
  library(data.table)
  library(digest)
  library(knitr) # used only for post creation
})

Sample of survey data

Anonymize sensitive information in survey data, data storage in a single table.

# pretty print
kable(head(SURV))
City Postal Code Address Name Sex Age Height Weight Score
London SW1H 0QW Silk Road 17 John Lennon M 48 176 94 3
Cardiff CF23 9AE Queen Road 19 Edward Snowden M 55 185 74 2
London SW1P 3BU Edinburgh Road 19 John Kennedy M 46 156 84 1
London SW1P 3BU Cardiff Road 21 Mahatma Gandhi M 56 186 54 5
Cardiff CF23 9AE King Road 10 Nelson Mandela M 61 181 84 2
London SW1P 2EE Cardiff Road 23 Vandana Shiva F 41 192 64 5

Anonymize function

Function will calculate hashes only for unique inputs and return vector of masked inputs.
My version will use digest(x, algo="crc32") because it fits better into html tables, algo crc32 is not really secure.
Read ?digest::digest for supported algo, also consider to salt your input vector, e.g. x=paste0("prefix",x,"suffix").
Performance improvement possible using Rcpp / C: digest #2.

anonymize <- function(x, algo="crc32"){
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo), FUN.VALUE="", USE.NAMES=TRUE)
  unname(unq_hashes[x])
}

Anonymize survey data

We will keep city and sex fields unmasked.

# choose columns to mask
cols_to_mask <- c("name","address","postal_code")
# backup original data
SURV_ORG <- copy(SURV)
# anonymize
SURV[,cols_to_mask := lapply(.SD, anonymize),.SDcols=cols_to_mask,with=FALSE]
# pretty print
kable(head(SURV))
City Postal Code Address Name Sex Age Height Weight Score
London 913ad86c c26dc5a8 a6ccb226 M 48 176 94 3
Cardiff 921485db 58be1ead 14404453 M 55 185 74 2
London 4c0d9ac8 7996c8e1 66dc3ad0 M 46 156 84 1
London 4c0d9ac8 1a5ecf8b 44f84c46 M 56 186 54 5
Cardiff 921485db b4dce820 b3445a6d M 61 181 84 2
London 1f39765c f450aea7 56efd861 F 41 192 64 5

Why not just random data or integer sequence

When using the digest function to hide sensitive data you:

  • keep rows distribution:
    aggregates by masked columns will still match to aggregates on original columns, see simple grouping below:
SURV_ORG[,.(.N,mean_age=mean(age),mean_score=mean(score)),by=.(city,postal_code)
         ][,kable(.SD)]
City Postal Code N Mean Age Mean Score
London SW1H 0QW 1 48.00 3.00
Cardiff CF23 9AE 3 65.33 2.33
London SW1P 3BU 2 51.00 3.00
London SW1P 2EE 2 36.50 3.50
Glasgow G40 3AS 1 53.00 2.00
SURV[,.(.N,mean_age=mean(age),mean_score=mean(score)),by=.(city,postal_code)
     ][,kable(.SD)]
City Postal Code N Mean Age Mean Score
London 913ad86c 1 48.00 3.00
Cardiff 921485db 3 65.33 2.33
London 4c0d9ac8 2 51.00 3.00
London 1f39765c 2 36.50 3.50
Glasgow 90b79e54 1 53.00 2.00
  • keep relationships on equi joins:
    if t1.col1 == t2.col4 TRUE then also digest(t1.col1) == digest(t2.col4) TRUE
    Example in next section below.

Sample of sales data

Anonymize relational data in sales data, data normalized into SALES and CUSTOMER tables.

kable(head(SALES,4))
Customer Uid Product Name Transaction Date Quantity Value
CUST_3 rgr 2014-10-28 34 612
CUST_4 jfc 2014-10-13 42 588
CUST_6 hnm 2014-11-06 40 200
CUST_9 zgm 2014-11-04 40 760
kable(head(CUSTOMER,2))
Customer Uid City Postal Code Address Name Sex
CUST_1 London SW1H 0QW Silk Road 17 John Lennon M
CUST_2 Cardiff CF23 9AE Queen Road 19 Edward Snowden M
# join
kable(head(
  CUSTOMER[SALES]
))
Customer Uid City Postal Code Address Name Sex Product Name Transaction Date Quantity Value
CUST_3 London SW1P 3BU Edinburgh Road 19 John Kennedy M rgr 2014-10-28 34 612
CUST_4 London SW1P 3BU Cardiff Road 21 Mahatma Gandhi M jfc 2014-10-13 42 588
CUST_6 London SW1P 2EE Cardiff Road 23 Vandana Shiva F hnm 2014-11-06 40 200
CUST_9 Glasgow G40 3AS Simple Road 11 Bob Marley M zgm 2014-11-04 40 760
CUST_2 Cardiff CF23 9AE Queen Road 19 Edward Snowden M qej 2014-11-06 29 493
CUST_9 Glasgow G40 3AS Simple Road 11 Bob Marley M fnz 2014-10-30 59 649
# join and aggregate
kable(head(
  CUSTOMER[SALES][,.(quantity = sum(quantity),value = sum(value)),by=.(city,postal_code)]
))
City Postal Code Quantity Value
London SW1P 3BU 845 10783
London SW1P 2EE 729 9732
Glasgow G40 3AS 376 4887
Cardiff CF23 9AE 981 12983
London SW1H 0QW 329 4099

Anonymize sales data

SALES[, customer_uid := anonymize(customer_uid)]
cols_to_mask <- c("customer_uid","name","address","postal_code")
CUSTOMER[,cols_to_mask := lapply(.SD, anonymize),.SDcols=cols_to_mask,with=FALSE]
setkey(CUSTOMER,customer_uid)
# preview result
kable(head(CUSTOMER,2))
Customer Uid City Postal Code Address Name Sex
4a7d777 Cardiff 921485db a759d95 b51a2e5c F
73a0e7e1 Glasgow 90b79e54 a8708751 7c739cd6 M
kable(head(SALES,2))
Customer Uid Product Name Transaction Date Quantity Value
93750eff rgr 2014-10-28 34 612
d119b5c jfc 2014-10-13 42 588
# datasets will still join correctly even on masked columns
kable(head(
  CUSTOMER[SALES]
))
Customer Uid City Postal Code Address Name Sex Product Name Transaction Date Quantity Value
93750eff London 4c0d9ac8 7996c8e1 66dc3ad0 M rgr 2014-10-28 34 612
d119b5c London 4c0d9ac8 1a5ecf8b 44f84c46 M jfc 2014-10-13 42 588
e31ffa70 London 1f39765c f450aea7 56efd861 F hnm 2014-11-06 40 200
73a0e7e1 Glasgow 90b79e54 a8708751 7c739cd6 M zgm 2014-11-04 40 760
e4723e69 Cardiff 921485db 58be1ead 14404453 M qej 2014-11-06 29 493
73a0e7e1 Glasgow 90b79e54 a8708751 7c739cd6 M fnz 2014-10-30 59 649
# also the aggregates on masked columns will match to the origin
kable(head(
    CUSTOMER[SALES][,.(quantity = sum(quantity),value = sum(value)),by=.(city,postal_code)]
))
City Postal Code Quantity Value
London 4c0d9ac8 845 10783
London 1f39765c 729 9732
Glasgow 90b79e54 376 4887
Cardiff 921485db 981 12983
London 913ad86c 329 4099

Reproduce from Rmd

Script used to produce this post is available in the github repo (link in the page footer) as Rmd file and can be easily reproduced locally in R (required knitr or rmarkdown) to any format (md, html, pdf, docx).

# html output
rmarkdown::render("2014-11-07-Data-Anonymization-in-R.Rmd", html_document())
# markdown file used as current post
knitr::knit("2014-11-07-Data-Anonymization-in-R.Rmd")

Minimal script

Minimal script example on survey data as SURV_ORG data.table:

anonymize <- function(x, algo="crc32"){
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo), FUN.VALUE="", USE.NAMES=TRUE)
  unname(unq_hashes[x])
}
cols_to_mask <- c("name","address","postal_code")
SURV_ORG[, cols_to_mask := lapply(.SD, anonymize), .SDcols=cols_to_mask, with=FALSE][]