Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII in R data frames and tibbles.
PII is any information that can be used to uniquely identify a person, such as names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared or further processed.
The name sanityzeR was chosen because it is fun and it is a play on the word “sanitize”, which is what the package does to the data.
The closest R package in functionality is anonymizer, a package for finding and removing PII from text. It is not designed to work with data frames directly, and we believe that sanityzeR will be more user-friendly and intuitive because it accepts data frames directly. In addition, sanityzeR lets users define new types of spotters to redact new types of PII.
You can install the development version of sanityzeR from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/sanityzeR")
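This basic example shows how to redact email addresses from a data frame:
library(sanityzeR)
# Data frame to clean (left empty here; supply your own data)
df <- data.frame()
# Each spotter is a list of: redact_* function, hash_spotted, replace_with (0 = use the default)
spotter_1 <- list(redact_email, TRUE, 0)
# Wrap the spotter in list() so that append() adds it as a single item
spotters <- list()
spotters <- append(spotters, list(spotter_1))
df_cleaned <- clean_data_frame(df, spotters)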
Conceptually, sanityzeR is a package that provides a way to remove PII from R data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact it.
The library comes with two default redaction functions, redact_creditcardnumber and redact_email, each of which takes a character vector and redacts the corresponding PII using either a constant string replacement or a hash of the redacted value.
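The two replacement modes described above can be illustrated with a minimal sketch. This is not the package's implementation: the function name redact_email_sketch, the helper md5_string, the simplified email pattern, and the default values are all assumptions made for illustration. Only the argument names hash_spotted and replace_with come from the package description.

```r
# Hypothetical helper: MD5 of a single string, using only packages shipped with R.
# tools::md5sum() works on files, so the string is written to a temporary file first.
md5_string <- function(s) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(s), tmp)
  unname(tools::md5sum(tmp))
}

# Illustrative sketch only; the defaults below are assumptions, not sanityzeR's.
redact_email_sketch <- function(x, hash_spotted = FALSE, replace_with = "<EMAIL>") {
  # Deliberately simplified email pattern (assumption)
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  matches <- gregexpr(pattern, x)
  # Replace every match with either its MD5 hash or the constant string
  regmatches(x, matches) <- lapply(regmatches(x, matches), function(found) {
    if (hash_spotted) {
      vapply(found, md5_string, character(1), USE.NAMES = FALSE)
    } else {
      rep(replace_with, length(found))
    }
  })
  x
}

redact_email_sketch("mail me at jane@example.com please")
redact_email_sketch("jane@example.com", hash_spotted = TRUE)
```

Hashing keeps identical values linkable across rows (the same address always maps to the same hash), while a constant string removes that link entirely.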
- redact_creditcardnumber(): a function that takes a character vector (string) and redacts credit card numbers contained within that string, replacing them with either:
  - A constant string that the user can specify
  - A hash of the redacted value (using MD5)
- redact_email(): a function that takes a character vector (string) and redacts email addresses contained within that string, replacing them with either:
  - A constant string that the user can specify
  - A hash of the redacted value (using MD5)
- clean_data_frame(): a function that takes the arguments below and returns a deep copy of the cleaned data.frame:
  - An input data.frame df to clean
  - A list of spotter information arguments. Each item in the list is a list of 3 elements:
    - The redact_* function to use (e.g. redact_creditcardnumber).
    - The second argument of the redact_* function: hash_spotted (TRUE or FALSE), or 0 to use the default argument.
    - The third argument of the redact_* function: replace_with (a redaction string), or 0 to use the default argument.
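To make the spotter convention concrete, here is a rough sketch of how a function like clean_data_frame could apply a list of spotters, treating 0 as "use the redaction function's default". It is an assumption-laden illustration, not the package's code: clean_data_frame_sketch, is_default, and the toy redact_digits function are hypothetical names, and restricting redaction to character columns is a design choice made here.

```r
# Toy redaction function with the same argument layout as the redact_* functions.
# Hypothetical: it masks digits and ignores hash_spotted.
redact_digits <- function(x, hash_spotted = FALSE, replace_with = "#") {
  gsub("[0-9]", replace_with, x)
}

# 0 in a spotter means "use the redaction function's default"
is_default <- function(v) is.numeric(v) && length(v) == 1 && v == 0

# Illustrative sketch only, not sanityzeR's implementation
clean_data_frame_sketch <- function(df, spotters) {
  cleaned <- df  # R copies on modification, so the caller's df is left untouched
  for (spotter in spotters) {
    redact_fun <- spotter[[1]]
    # Only pass the arguments the user did not leave at 0
    args <- list()
    if (!is_default(spotter[[2]])) args$hash_spotted <- spotter[[2]]
    if (!is_default(spotter[[3]])) args$replace_with <- spotter[[3]]
    for (col in names(cleaned)) {
      # Assumption: only character columns are scanned for PII
      if (is.character(cleaned[[col]])) {
        cleaned[[col]] <- do.call(redact_fun, c(list(cleaned[[col]]), args))
      }
    }
  }
  cleaned
}

df <- data.frame(id = 1:2, note = c("call 555", "none"), stringsAsFactors = FALSE)
spotters <- list(list(redact_digits, 0, "X"))
clean_data_frame_sketch(df, spotters)
```

Because the spotter's second element is 0, hash_spotted is not passed and the toy function falls back to its own default, while replace_with is overridden with "X".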
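Below is a simple quick start example:
library(sanityzeR)
# Data frame to clean (left empty here; supply your own data)
df <- data.frame()
# Each spotter is a list of: redact_* function, hash_spotted, replace_with (0 = use the default)
spotter_1 <- list(redact_email, TRUE, 0)
# Wrap the spotter in list() so that append() adds it as a single item
spotters <- list()
spotters <- append(spotters, list(spotter_1))
df_cleaned <- clean_data_frame(df, spotters)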
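One base R detail is worth keeping in mind when building the spotters list, assuming clean_data_frame expects each spotter to be its own 3-element list as described above: append() concatenates lists, so appending a spotter directly splices its three elements into the outer list. The short sketch below uses a placeholder function (hypothetical, standing in for a redact_* function) to show the difference.

```r
# Placeholder standing in for a redact_* function (hypothetical, for illustration only)
redact_placeholder <- function(x, hash_spotted = FALSE, replace_with = "") x

spotter_1 <- list(redact_placeholder, TRUE, 0)

# append() splices the three elements of spotter_1 into the outer list
flattened <- append(list(), spotter_1)
length(flattened)    # 3

# Wrapping the spotter in list() keeps it together as a single item
nested <- append(list(), list(spotter_1))
length(nested)       # 1
length(nested[[1]])  # 3
```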
To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found here.
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
sanityzeR was created by Caesar Wong, Jonah Hamilton and Tony Zoght. It is licensed under the terms of the MIT license.
sanityzeR was created using the devtools and usethis R packages.