diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..86c3800 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,75 @@ +# Contributing + +Contributions are welcome, and they are greatly appreciated! Every little bit +helps, and credit will always be given. + +## Types of Contributions + +### Report Bugs + +If you are reporting a bug, please include: + +* Your operating system name and version. +* Any details about your local setup that might be helpful in troubleshooting. +* Detailed steps to reproduce the bug. + +### Fix Bugs + +Look through the GitHub issues for bugs and Project. Anything tagged with "bug" and "help +wanted" is open to whoever wants to implement it. + +### Implement Features + +Look through the GitHub issues for features. Anything tagged with "enhancement" +and "help wanted" is open to whoever wants to implement it. + +### Write Documentation + +You can never have enough documentation! Please feel free to contribute to any +part of the documentation, such as the official docs, docstrings, or even +on the web in blog posts, articles, and such. + +### Submit Feedback + +If you are proposing a feature: + +* Explain in detail how it would work. +* Keep the scope as narrow as possible, to make it easier to implement. +* Remember that this is a volunteer-driven project, and that contributions + are welcome + +## Get Started! + +Ready to contribute? Here's how to set up `sanityzeR` for local development. + +1. Fork and Clone a copy of `sanityzeR` locally. +2. Install locally in R studio + + ```console + library(devtools) + library(usethis) + load_all() + ``` + +3. Use `git` (or similar) to create a branch for local development and make your changes: + + ```console + $ git checkout -b name-of-your-bugfix-or-feature + ``` + +4. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests. + +5. Commit your changes and open a pull request. 
+ +## Pull Request Guidelines + +Before you submit a pull request, check that it meets these guidelines: + +1. The pull request should include additional tests if appropriate. +2. If the pull request adds functionality, the docs should be updated. +3. The pull request should work for all currently supported operating systems and versions of R. + +## Code of Conduct + +Please note that the `sanityzeR` project is released with a +Code of Conduct. By contributing to this project you agree to abide by its terms. diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md new file mode 100644 index 0000000..5cb3dad --- /dev/null +++ b/CONTRIBUTORS.md @@ -0,0 +1,11 @@ +# Contributors + +## Special thanks for all the people who had helped this project so far: + +- [Tony Zoght](https://github.com/tzoght) +- [Caesar Wong](https://github.com/caesarw0) +- [Jonah Hamilton](https://github.com/xXJohamXx) + +## I would like to join this list. How can I help the project? + +For more information, please refer to our [CONTRIBUTING](CONTRIBUTING.md) guide. diff --git a/DESCRIPTION b/DESCRIPTION index d51d687..6e12c94 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -5,13 +5,14 @@ Authors@R: c(person(given = "Jonah", family = "Hamilton", email = "jonah.hamilton@alumni.ubc.ca", - role = c("aut", "cre")), + role = c("aut")), person(given = "Caesar", family = "Wong", - role = "ctb"), + role = c("ctb")), person(given = "Tony", family = "Zoght", - role = "ctb") + email = "tony@zoght.com", + role = c("cre")) ) Description: Data scientists often need to remove or redact Personal Identifiable Information (PII) from their data. 
This package provides utilities diff --git a/NAMESPACE b/NAMESPACE index 6ae9268..45d77bf 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,2 +1,5 @@ # Generated by roxygen2: do not edit by hand +export(clean_data_frame) +export(redact_creditcardnumber) +export(redact_email) diff --git a/R/clean_data_frame.R b/R/clean_data_frame.R new file mode 100644 index 0000000..93c4624 --- /dev/null +++ b/R/clean_data_frame.R @@ -0,0 +1,23 @@ +#' Cleans a data.frame by redacting PII information from character vector columns +#' +#' @param df A data.frame to clean +#' @param spotters_list A list containing lists of 3 elements each: +#' 1. the redact function +#' 2. hash_spotted value to pass or 0 to keep the default +#' 3. the replace_with value or 0 to keep the default +#' +#' +#' @return A deep copy of the cleaned data.frame. +#' @export +#' +#' @examples +#' df <- data.frame() +#' spotters <- list() +#' spotter_1 <- list(redact_email,TRUE,0) +#' spotters <- append(spotters,spotter_1) +#' df_cleaned <- clean_data_frame(df, spotters) +clean_data_frame <- function(df, spotters_list) { + # to be implemented in the next milestone + print(df) + print(spotters_list) +} diff --git a/R/redact_creditcardnumber.R b/R/redact_creditcardnumber.R new file mode 100644 index 0000000..debc618 --- /dev/null +++ b/R/redact_creditcardnumber.R @@ -0,0 +1,19 @@ +#' Redacts credit card numbers from a given string +#' +#' @param string A character vector with, at most, one element. The input string to redact credit card numbers from +#' @param hash_spotted When TRUE, the redaction of the credit cards will be a hash of the redacted (Default False) +#' @param replace_with A character vector with, at most, one element. When hash_spotted is FALSE, this character vector will be the replacement redacted credit card numbers. +#' +#' +#' @return A character vector. 
+#' @export +#' +#' @examples +#' x <- "You can use my 5567554868135971 here" +#' redact_creditcardnumber(x) +redact_creditcardnumber <- function(string, hash_spotted=FALSE, replace_with="CREDITCARD") { + # to be implemented in the next milestone + print(string) + print(hash_spotted) + print(replace_with) +} diff --git a/R/redact_email.R b/R/redact_email.R new file mode 100644 index 0000000..bf6c72e --- /dev/null +++ b/R/redact_email.R @@ -0,0 +1,19 @@ +#' Redacts email addresses from a given string +#' +#' @param string A character vector with, at most, one element. The input string to redact email addresses from +#' @param hash_spotted When TRUE, the redaction of the email addresses will be a hash of the redacted (Default False) +#' @param replace_with A character vector with, at most, one element. When hash_spotted is FALSE, this character vector will be the replacement redacted email addresses. +#' +#' +#' @return A character vector. +#' @export +#' +#' @examples +#' x <- "my email address is foo@gaga.com" +#' redact_email(x) +redact_email <- function(string, hash_spotted=FALSE, replace_with="EMAILADDRS") { + # to be implemented in the next milestone + print(string) + print(hash_spotted) + print(replace_with) +} diff --git a/README.Rmd b/README.Rmd index a07ce08..706794c 100644 --- a/README.Rmd +++ b/README.Rmd @@ -2,7 +2,6 @@ output: github_document --- - ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, @@ -12,24 +11,27 @@ knitr::opts_chunk$set( ) ``` -** note this is a initial version of the README file which will be updated for the future milestones** - # sanityzeR + +![](logo.png) + The goal of sanityzeR: Data scientists often need to remove or redact Personal Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII from r data frames/Tibbles. PII can be used to uniquely identify a person. 
This includes names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers, and therefore regulatory bodies such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) require that PII be removed or redacted from data sets before they are shared an further processed -## Why `sanityzeR` ? +## Why `sanityzeR` ? + Because it's a fun name and it's a play on the word "sanitize" which is what we are doing to the data -## Similar R packages - **add to this** +## Similar R packages + +The closest R package in functionality is [**anonymizer**](https://www.rdocumentation.org/packages/anonymizer/versions/0.2.0)which is a package for finding and removing PII from text. The package is not designed to work with data frames directly and we believe that our package will be more user-friendly and intuitive as it accepts data frames directly. In addition, sanityzeR gives the ability for users to define new type of spotters to redact new types of PII. ## Installation @@ -42,59 +44,60 @@ devtools::install_github("UBC-MDS/sanityzeR") ## Example -This is a basic example which shows you how to solve a common problem: **edit this** +This is a basic example which shows you how to solve a common problem: ```{r example} library(sanityzeR) -## basic example code +df <- data.frame() +spotters <- list() +spotter_1 <- list(redact_email,TRUE,0) +spotters <- append(spotters,spotter_1) + +df_cleaned <- clean_data_frame(df, spotters) ``` ## Features and Usage -Conceptually, `sanityzeR` is a package that provides a way to remove PII from Pandas data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact them. -The main entry point to the package is the `Cleanser` class. The `Cleanser` class is used to add `Spotter`s to the cleanser, which will be used to identify PII in the data. 
The cleanser can then be used to cleanse the data, and redact the PII from the given data frame (all future data structures that will be suppportd by the package, in the future). **edit this as needed** +Conceptually, `sanityzeR` is a package that provides a way to remove PII from Pandas data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact them. -The package comes with a number of default spotters, as subclassess of `Spotter`: -1. `CreditCardSpotter` - identifies credit card numbers -2. `EmailSpotter` - identifies email addresses +The library comes with two default redaction functions `redact_creditcardnumber` and `redact_email` and which simply takes a character vector and redacts the corresponding PII using either a constant string replacement or a hash of the redaction. -Spotters can be added to it using the `add_spotter()` method. The cleanser can then be used to cleanse data using the `cleanse()` method which takes a Pandas data frame and returns a Pandas data frame with PII redacted. +## Functions -The redaction options provided by `sanityze`` are: -1. Redact using a fixed string - The string in this case is the ID of the spotter. For example, if the spotter is an instance of `CreditCardSpotter`, the string will be `{{CREDITCARD}}`, or `{{EMAILADDRS}}` for an instance of `EmailSpotter`. -2. Redact using a hash of the input - The hash is computed using the `hashlib` package, and the hash function is `md5`. For example, if the spotter is an instance of `CreditCardSpotter`, the string will be `{{6a8b8c6c8c62bc939a11f36089ac75dd}}`, if the input is contains a PII `1234-5678-9012-3456`. +1. `redact_creditcardnumber()`: a function that takes a character vector (string) and redacts credit card numbers contained within that string, replacing them with either: + 1. A constant string that the user can specify + 2. A hash of the redaction (using MD5) +2. 
`redact_email`: a function that takes a character vector (string) and redacts email addresses contained within that string, replacing them with either: + 1. A constant string that the user can specify -## Classes and Functions -1. `Cleanser`: the main class of the package. It is used to add spotters to it, and then cleanse data using the spotters. - 1. `add_spotter()`: adds a spotter to the cleanser - 2. `remove_spotter()`: removes a spotter from the cleanser - 3. `clean()`: cleanses the data in the given data frame, and returns a new data frame with PII redacted -2. `EmailSpotter`: a spotter that identifies email addresses - 1. `getUID()`: returns the unique ID of the spotter - 2. `process()`: performs the PII matching and redaction -3. `CreditCardSpotter`: a spotter that identifies credit card numbers - 1. `getUID()`: returns the unique ID of the spotter - 2. `process()`: performs the PII matching and redaction + 2. A hash of the redaction (using MD5) +3. `clean_data_frame`: a function that takes as input the following list of arguments below and returns a deep copy of the cleaned data.frame: + 1. An input data.frame `df` to clean -> You can checkout detailed API Documentations [here](https://ubc-mds.github.io/sanityze/). + 2. A list of spotter information arguments. Each item in the list is a list of 3 elements: -Below is a simple quick start example: + 1. The redact\_\* function to use (e.g. `redact_creditcardnumber` ). -```python -import pandas as pd -from sanityze import Cleanser, EmailSpotter + 2. The second argument of the redact\_\* function: `hash_spotted` (TRUE or FALSE) or 0 to use the default argument. -# Create a cleanser, and don't add the default spotters -cleanser = Cleanser(include_default_spotters=False) -cleaner.add_spotter(from sanityze import Cleanser, EmailSpotter()) -cleaned_df = cleanser.clean(df) -``` + 3. The third argument of the redact\_\* function: `replace_with` (a redaction string) or 0 to use the default argument. 
+ +Below is a simple quick start example: +``` r +library(sanityzeR) +df <- data.frame() +spotters <- list() +spotter_1 <- list(redact_email,TRUE,0) +spotters <- append(spotters,spotter_1) +df_cleaned <- clean_data_frame(df, spotters) +``` ## High-level Design -To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found [here](HighLevelDesign.md). + +To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found [here](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md). ## Contributing @@ -106,13 +109,13 @@ Interested in contributing? Check out the [contributing guidelines](CONTRIBUTING ## Credits -`sanityzeR` was created with [`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/) and the `py-pkgs-cookiecutter` [template](https://github.com/py-pkgs/py-pkgs-cookiecutter). +`sanityzeR` was created using **devtools** and **usethis** R packages. 
## Quick Links - * [Documentation](https://ubc-mds.github.io/sanityze/) - * [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15) - * [Issues](https://github.com/UBC-MDS/sanityze/issues) - * [High Level Design](HighLevelDesign.md) - * [Contributing Guidelines](CONTRIBUTING.md) - * [Code of Conduct](CODE_OF_CONDUCT.md) - * [License](LICENSE) + +- [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15) +- [Issues](https://github.com/UBC-MDS/sanityzeR/issues) +- [High Level Design](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md) +- [Contributing Guidelines](CONTRIBUTING.md) +- [Code of Conduct](CODE_OF_CONDUCT.md) +- [License](LICENSE.md) diff --git a/README.md b/README.md index 741db3a..fe0d63b 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,11 @@ -\*\* note this is a initial version of the README file which will be -updated for the future milestones\*\* - # sanityzeR +![](logo.png) + The goal of sanityzeR: Data scientists often need to remove or redact Personal Identifiable @@ -27,7 +26,13 @@ what we are doing to the data ## Similar R packages -**add to this** +The closest R package in functionality is +[**anonymizer**](https://www.rdocumentation.org/packages/anonymizer/versions/0.2.0)which +is a package for finding and removing PII from text. The package is not +designed to work with data frames directly and we believe that our +package will be more user-friendly and intuitive as it accepts data +frames directly. In addition, sanityzeR gives the ability for users to +define new type of spotters to redact new types of PII. 
## Installation @@ -42,11 +47,31 @@ devtools::install_github("UBC-MDS/sanityzeR") ## Example This is a basic example which shows you how to solve a common problem: -**edit this** ``` r library(sanityzeR) -## basic example code +df <- data.frame() +spotters <- list() +spotter_1 <- list(redact_email,TRUE,0) +spotters <- append(spotters,spotter_1) + +df_cleaned <- clean_data_frame(df, spotters) +#> data frame with 0 columns and 0 rows +#> [[1]] +#> function (string, hash_spotted = FALSE, replace_with = "EMAILADDRS") +#> { +#> print(string) +#> print(hash_spotted) +#> print(replace_with) +#> } +#> +#> +#> +#> [[2]] +#> [1] TRUE +#> +#> [[3]] +#> [1] 0 ``` ## Features and Usage @@ -55,60 +80,62 @@ Conceptually, `sanityzeR` is a package that provides a way to remove PII from Pandas data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact them. -The main entry point to the package is the `Cleanser` class. The -`Cleanser` class is used to add `Spotter`s to the cleanser, which will -be used to identify PII in the data. The cleanser can then be used to -cleanse the data, and redact the PII from the given data frame (all -future data structures that will be suppportd by the package, in the -future). **edit this as needed** - -The package comes with a number of default spotters, as subclassess of -`Spotter`: 1. `CreditCardSpotter` - identifies credit card numbers 2. -`EmailSpotter` - identifies email addresses - -Spotters can be added to it using the `add_spotter()` method. The -cleanser can then be used to cleanse data using the `cleanse()` method -which takes a Pandas data frame and returns a Pandas data frame with PII -redacted. - -The redaction options provided by -``` sanityze`` are: 1. Redact using a fixed string - The string in this case is the ID of the spotter. 
For example, if the spotter is an instance of ```CreditCardSpotter`, the string will be`{{CREDITCARD}}`, or`{{EMAILADDRS}}`for an instance of`EmailSpotter`. 2. Redact using a hash of the input - The hash is computed using the`hashlib`package, and the hash function is`md5`. For example, if the spotter is an instance of`CreditCardSpotter`, the string will be`{{6a8b8c6c8c62bc939a11f36089ac75dd}}`, if the input is contains a PII`1234-5678-9012-3456\`. - -## Classes and Functions - -1. `Cleanser`: the main class of the package. It is used to add - spotters to it, and then cleanse data using the spotters. - 1. `add_spotter()`: adds a spotter to the cleanser - 2. `remove_spotter()`: removes a spotter from the cleanser - 3. `clean()`: cleanses the data in the given data frame, and - returns a new data frame with PII redacted -2. `EmailSpotter`: a spotter that identifies email addresses - 1. `getUID()`: returns the unique ID of the spotter - 2. `process()`: performs the PII matching and redaction -3. `CreditCardSpotter`: a spotter that identifies credit card numbers - 1. `getUID()`: returns the unique ID of the spotter - 2. `process()`: performs the PII matching and redaction - -> You can checkout detailed API Documentations -> [here](https://ubc-mds.github.io/sanityze/). +The library comes with two default redaction functions +`redact_creditcardnumber` and `redact_email` and which simply takes a +character vector and redacts the corresponding PII using either a +constant string replacement or a hash of the redaction. + +## Functions + +1. `redact_creditcardnumber()`: a function that takes a character + vector (string) and redacts credit card numbers contained within + that string, replacing them with either: + 1. A constant string that the user can specify + + 2. A hash of the redaction (using MD5) +2. `redact_email`: a function that takes a character vector (string) + and redacts email addresses contained within that string, replacing + them with either: + 1. 
A constant string that the user can specify + + 2. A hash of the redaction (using MD5) +3. `clean_data_frame`: a function that takes as input the following + list of arguments below and returns a deep copy of the cleaned + data.frame: + 1. An input data.frame `df` to clean + + 2. A list of spotter information arguments. Each item in the list + is a list of 3 elements: + + 1. The redact\_\* function to use + (e.g. `redact_creditcardnumber` ). + + 2. The second argument of the redact\_\* function: + `hash_spotted` (TRUE or FALSE) or 0 to use the default + argument. + + 3. The third argument of the redact\_\* function: + `replace_with` (a redaction string) or 0 to use the default + argument. Below is a simple quick start example: -``` python -import pandas as pd -from sanityze import Cleanser, EmailSpotter +``` r +library(sanityzeR) +df <- data.frame() +spotters <- list() +spotter_1 <- list(redact_email,TRUE,0) +spotters <- append(spotters,spotter_1) -# Create a cleanser, and don't add the default spotters -cleanser = Cleanser(include_default_spotters=False) -cleaner.add_spotter(from sanityze import Cleanser, EmailSpotter()) -cleaned_df = cleanser.clean(df) +df_cleaned <- clean_data_frame(df, spotters) ``` ## High-level Design To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package -evolves. The document can be found [here](HighLevelDesign.md). +evolves. The document can be found +[here](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md). ## Contributing @@ -124,17 +151,14 @@ It is licensed under the terms of the [MIT license](LICENSE). ## Credits -`sanityzeR` was created with -[`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/) and the -`py-pkgs-cookiecutter` -[template](https://github.com/py-pkgs/py-pkgs-cookiecutter). +`sanityzeR` was created using **devtools** and **usethis** R packages. 
## Quick Links -- [Documentation](https://ubc-mds.github.io/sanityze/) - [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15) -- [Issues](https://github.com/UBC-MDS/sanityze/issues) -- [High Level Design](HighLevelDesign.md) +- [Issues](https://github.com/UBC-MDS/sanityzeR/issues) +- [High Level + Design](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md) - [Contributing Guidelines](CONTRIBUTING.md) - [Code of Conduct](CODE_OF_CONDUCT.md) -- [License](LICENSE) +- [License](LICENSE.md) diff --git a/logo.png b/logo.png new file mode 100644 index 0000000..ffa934a Binary files /dev/null and b/logo.png differ diff --git a/man/clean_data_frame.Rd b/man/clean_data_frame.Rd new file mode 100644 index 0000000..4ea0c90 --- /dev/null +++ b/man/clean_data_frame.Rd @@ -0,0 +1,31 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/clean_data_frame.R +\name{clean_data_frame} +\alias{clean_data_frame} +\title{Cleans a data.frame by redacting PII information from character vector columns} +\usage{ +clean_data_frame(df, spotters_list) +} +\arguments{ +\item{df}{A data.frame to clean} + +\item{spotters_list}{A list containing lists of 3 elements each: +\enumerate{ +\item the redact function +\item hash_spotted value to pass or 0 to keep the default +\item the replace_with value or 0 to keep the default +}} +} +\value{ +A deep copy of the cleaned data.frame. 
+} +\description{ +Cleans a data.frame by redacting PII information from character vector columns +} +\examples{ +df <- data.frame() +spotters <- list() +spotter_1 <- list(redact_email,TRUE,0) +spotters <- append(spotters,spotter_1) +df_cleaned <- clean_data_frame(df, spotters) +} diff --git a/man/redact_creditcardnumber.Rd b/man/redact_creditcardnumber.Rd new file mode 100644 index 0000000..c7d34e3 --- /dev/null +++ b/man/redact_creditcardnumber.Rd @@ -0,0 +1,29 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/redact_creditcardnumber.R +\name{redact_creditcardnumber} +\alias{redact_creditcardnumber} +\title{Redacts credit card numbers from a given string} +\usage{ +redact_creditcardnumber( + string, + hash_spotted = FALSE, + replace_with = "CREDITCARD" +) +} +\arguments{ +\item{string}{A character vector with, at most, one element. The input string to redact credit card numbers from} + +\item{hash_spotted}{When TRUE, the redaction of the credit cards will be a hash of the redacted (Default False)} + +\item{replace_with}{A character vector with, at most, one element. When hash_spotted is FALSE, this character vector will be the replacement redacted credit card numbers.} +} +\value{ +A character vector. +} +\description{ +Redacts credit card numbers from a given string +} +\examples{ +x <- "You can use my 5567554868135971 here" +redact_creditcardnumber(x) +} diff --git a/man/redact_email.Rd b/man/redact_email.Rd new file mode 100644 index 0000000..f7d88c7 --- /dev/null +++ b/man/redact_email.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/redact_email.R +\name{redact_email} +\alias{redact_email} +\title{Redacts email addresses from a given string} +\usage{ +redact_email(string, hash_spotted = FALSE, replace_with = "EMAILADDRS") +} +\arguments{ +\item{string}{A character vector with, at most, one element. 
The input string to redact email addresses from} + +\item{hash_spotted}{When TRUE, the redaction of the email addresses will be a hash of the redacted (Default False)} + +\item{replace_with}{A character vector with, at most, one element. When hash_spotted is FALSE, this character vector will be the replacement redacted email addresses.} +} +\value{ +A character vector. +} +\description{ +Redacts email addresses from a given string +} +\examples{ +x <- "my email address is foo@gaga.com" +redact_email(x) +}
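
The three exported functions in this diff are stubs that only print their arguments ("to be implemented in the next milestone"). As a rough illustration of the documented behaviour, here is a minimal sketch in base R. It is not the package's implementation: the names `md5_string`, `redact_email_sketch` and `clean_data_frame_sketch` are hypothetical, the email regular expression is an assumption, and the MD5 helper goes through a temporary file only to avoid extra dependencies. The sketch assumes `spotters_list` really is a list of 3-element lists, as the `@param` text describes. Note that `append(spotters, spotter_1)` flattens `spotter_1` into three top-level elements, whereas `append(spotters, list(spotter_1))` keeps it as one nested element.

```r
# Hypothetical helper: MD5 of a single string using base R only.
# tools::md5sum() works on files, so the string goes through a temp file.
md5_string <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(charToRaw(x), tmp)
  unname(tools::md5sum(tmp))
}

# Sketch of redact_email(); the pattern below is an assumed, simplified
# email regex, not the one the package will necessarily use.
redact_email_sketch <- function(string, hash_spotted = FALSE,
                                replace_with = "EMAILADDRS") {
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  if (!hash_spotted) {
    # Constant replacement of every match.
    return(gsub(pattern, replace_with, string))
  }
  # Hash path: replace each match with the MD5 of the matched text.
  hits <- gregexpr(pattern, string)
  regmatches(string, hits) <- lapply(
    regmatches(string, hits),
    function(found) vapply(found, md5_string, character(1), USE.NAMES = FALSE)
  )
  string
}

# Sketch of clean_data_frame(): applies every spotter to every character
# column. A 0 in position 2 or 3 means "keep the redact function's default".
clean_data_frame_sketch <- function(df, spotters_list) {
  for (spotter in spotters_list) {
    redact_fun <- spotter[[1]]
    args <- list()
    if (!identical(spotter[[2]], 0)) args$hash_spotted <- spotter[[2]]
    if (!identical(spotter[[3]], 0)) args$replace_with <- spotter[[3]]
    for (col in names(df)) {
      if (is.character(df[[col]])) {
        # The redact functions are documented as taking one string at a time.
        df[[col]] <- vapply(
          df[[col]],
          function(s) do.call(redact_fun, c(list(s), args)),
          character(1),
          USE.NAMES = FALSE
        )
      }
    }
  }
  df  # R copies on modify, so the caller's data.frame is left untouched
}

# Usage with one nested spotter that keeps both defaults.
df <- data.frame(
  id = 1:2,
  note = c("my email address is foo@gaga.com", "no PII here"),
  stringsAsFactors = FALSE
)
spotters <- list(list(redact_email_sketch, 0, 0))
cleaned <- clean_data_frame_sketch(df, spotters)
print(cleaned$note)
```

Non-character columns such as `id` pass through unchanged. Passing `TRUE` as the second element of a spotter would switch that spotter to the MD5 path.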