Merge pull request #3 from UBC-MDS/tzoght-milestone2_1
ready for review
caesarw0 authored Jan 21, 2023
2 parents 349ccd9 + 27b7ccd commit efa01e7
Showing 13 changed files with 371 additions and 108 deletions.
75 changes: 75 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,75 @@
# Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit
helps, and credit will always be given.

## Types of Contributions

### Report Bugs

If you are reporting a bug, please include:

* Your operating system name and version.
* Any details about your local setup that might be helpful in troubleshooting.
* Detailed steps to reproduce the bug.

### Fix Bugs

Look through the GitHub issues and the project board for bugs. Anything tagged with "bug" and "help
wanted" is open to whoever wants to implement it.

### Implement Features

Look through the GitHub issues for features. Anything tagged with "enhancement"
and "help wanted" is open to whoever wants to implement it.

### Write Documentation

You can never have enough documentation! Please feel free to contribute to any
part of the documentation, such as the official docs, docstrings, or even
on the web in blog posts, articles, and such.

### Submit Feedback

If you are proposing a feature:

* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.
* Remember that this is a volunteer-driven project, and that contributions
  are welcome.

## Get Started!

Ready to contribute? Here's how to set up `sanityzeR` for local development.

1. Fork `sanityzeR` and clone your fork locally.
2. Load the package for local development in RStudio:

```r
library(devtools)
library(usethis)
load_all()
```

3. Use `git` (or similar) to create a branch for local development and make your changes:

```console
$ git checkout -b name-of-your-bugfix-or-feature
```

4. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests.

5. Commit your changes and open a pull request.

## Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

1. The pull request should include additional tests if appropriate.
2. If the pull request adds functionality, the docs should be updated.
3. The pull request should work for all currently supported operating systems and versions of R.

## Code of Conduct

Please note that the `sanityzeR` project is released with a
Code of Conduct. By contributing to this project you agree to abide by its terms.
11 changes: 11 additions & 0 deletions CONTRIBUTORS.md
@@ -0,0 +1,11 @@
# Contributors

## Special thanks to all the people who have helped this project so far:

- [Tony Zoght](https://github.com/tzoght)
- [Caesar Wong](https://github.com/caesarw0)
- [Jonah Hamilton](https://github.com/xXJohamXx)

## I would like to join this list. How can I help the project?

For more information, please refer to our [CONTRIBUTING](CONTRIBUTING.md) guide.
7 changes: 4 additions & 3 deletions DESCRIPTION
@@ -5,13 +5,14 @@ Authors@R:
c(person(given = "Jonah",
family = "Hamilton",
email = "jonah.hamilton@alumni.ubc.ca",
role = c("aut", "cre")),
role = c("aut")),
person(given = "Caesar",
family = "Wong",
role = "ctb"),
role = c("ctb")),
person(given = "Tony",
family = "Zoght",
role = "ctb")
email = "tony@zoght.com",
role = c("cre"))
)
Description: Data scientists often need to remove or redact Personally Identifiable
Information (PII) from their data. This package provides utilities
3 changes: 3 additions & 0 deletions NAMESPACE
@@ -1,2 +1,5 @@
# Generated by roxygen2: do not edit by hand

export(clean_data_frame)
export(redact_creditcardnumber)
export(redact_email)
23 changes: 23 additions & 0 deletions R/clean_data_frame.R
@@ -0,0 +1,23 @@
#' Cleans a data.frame by redacting PII from character vector columns
#'
#' @param df A data.frame to clean
#' @param spotters_list A list containing lists of 3 elements each:
#' 1. the redact function
#' 2. hash_spotted value to pass or 0 to keep the default
#' 3. the replace_with value or 0 to keep the default
#'
#'
#' @return A deep copy of the cleaned data.frame.
#' @export
#'
#' @examples
#' df <- data.frame()
#' spotters <- list()
#' spotter_1 <- list(redact_email, TRUE, 0)
#' # Wrap in list() so the spotter is added as one element rather than flattened
#' spotters <- append(spotters, list(spotter_1))
#' df_cleaned <- clean_data_frame(df, spotters)
clean_data_frame <- function(df, spotters_list) {
# to be implemented in the next milestone
print(df)
print(spotters_list)
}
19 changes: 19 additions & 0 deletions R/redact_creditcardnumber.R
@@ -0,0 +1,19 @@
#' Redacts credit card numbers from a given string
#'
#' @param string A character vector with, at most, one element. The input string to redact credit card numbers from.
#' @param hash_spotted When TRUE, each credit card number is replaced with a hash of the redacted value (default FALSE).
#' @param replace_with A character vector with, at most, one element. When hash_spotted is FALSE, this string replaces each redacted credit card number.
#'
#'
#' @return A character vector.
#' @export
#'
#' @examples
#' x <- "You can use my 5567554868135971 here"
#' redact_creditcardnumber(x)
redact_creditcardnumber <- function(string, hash_spotted=FALSE, replace_with="CREDITCARD") {
# to be implemented in the next milestone
print(string)
print(hash_spotted)
print(replace_with)
}
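
The stub above only prints its arguments. As a rough illustration of the fixed-string behaviour that the roxygen comments describe, here is a minimal sketch. The function name `redact_creditcardnumber_sketch` and the digit pattern are assumptions made for illustration, not the package's implementation, and hashing is left out.

```r
# Hypothetical sketch, not the package's actual implementation.
redact_creditcardnumber_sketch <- function(string, replace_with = "CREDITCARD") {
  # Assumed pattern: 13 to 16 digits, optionally separated by single
  # spaces or hyphens. Real card validation (e.g. a Luhn check) is omitted.
  pattern <- "\\b\\d(?:[ -]?\\d){12,15}\\b"
  gsub(pattern, replace_with, string, perl = TRUE)
}

x <- "You can use my 5567554868135971 here"
redact_creditcardnumber_sketch(x)
# "You can use my CREDITCARD here"
```

A pattern like this will also match other long digit runs, so a real implementation would likely need stricter checks.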
19 changes: 19 additions & 0 deletions R/redact_email.R
@@ -0,0 +1,19 @@
#' Redacts email addresses from a given string
#'
#' @param string A character vector with, at most, one element. The input string to redact email addresses from.
#' @param hash_spotted When TRUE, each email address is replaced with a hash of the redacted value (default FALSE).
#' @param replace_with A character vector with, at most, one element. When hash_spotted is FALSE, this string replaces each redacted email address.
#'
#'
#' @return A character vector.
#' @export
#'
#' @examples
#' x <- "my email address is foo@gaga.com"
#' redact_email(x)
redact_email <- function(string, hash_spotted=FALSE, replace_with="EMAILADDRS") {
# to be implemented in the next milestone
print(string)
print(hash_spotted)
print(replace_with)
}
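
To make the two documented redaction modes concrete (a constant replacement versus an MD5 hash), the following is one possible sketch using only base R. The names `md5_of_string` and `redact_email_sketch`, the simplified email pattern, and the temp-file approach to hashing are all assumptions for illustration; the package may well implement this differently.

```r
# Hypothetical helper: MD5 of a string using only base R.
# tools::md5sum() hashes files, so the string is written to a temp file first.
md5_of_string <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(x), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical sketch, not the package's actual implementation.
redact_email_sketch <- function(string, hash_spotted = FALSE,
                                replace_with = "EMAILADDRS") {
  # Simplified, assumed email pattern; real-world addresses are more varied.
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  hits <- gregexpr(pattern, string)
  matched <- regmatches(string, hits)
  # Build one replacement per match: its hash, or the constant string.
  replacement <- lapply(matched, function(m) {
    if (hash_spotted) {
      vapply(m, md5_of_string, character(1), USE.NAMES = FALSE)
    } else {
      rep(replace_with, length(m))
    }
  })
  regmatches(string, hits) <- replacement
  string
}

x <- "my email address is foo@gaga.com"
redact_email_sketch(x)
# "my email address is EMAILADDRS"
redact_email_sketch(x, hash_spotted = TRUE)
```

With `hash_spotted = TRUE`, the same address always maps to the same 32-character hex digest, which keeps redacted values joinable across rows without exposing the original.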
97 changes: 50 additions & 47 deletions README.Rmd
@@ -2,7 +2,6 @@
output: github_document
---


```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
@@ -12,24 +11,27 @@ knitr::opts_chunk$set(
)
```

**Note: this is an initial version of the README file, which will be updated in future milestones.**

# sanityzeR

<!-- badges: start -->

<!-- badges: end -->

![](logo.png)

The goal of sanityzeR:

Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII from R data frames and tibbles.

PII can be used to uniquely identify a person. This includes names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared or further processed.

## Why `sanityzeR` ?
## Why `sanityzeR` ?

Because it's a fun name, and it's a play on the word "sanitize", which is what we are doing to the data.

## Similar R packages
**add to this**
## Similar R packages

The closest R package in functionality is [**anonymizer**](https://www.rdocumentation.org/packages/anonymizer/versions/0.2.0), which is a package for finding and removing PII from text. That package is not designed to work with data frames directly, and we believe that our package will be more user-friendly and intuitive because it accepts data frames directly. In addition, sanityzeR lets users define new types of spotters to redact new types of PII.

## Installation

@@ -42,59 +44,60 @@ devtools::install_github("UBC-MDS/sanityzeR")

## Example

This is a basic example which shows you how to solve a common problem: **edit this**
This is a basic example which shows you how to solve a common problem:

```{r example}
library(sanityzeR)
## basic example code
df <- data.frame()
spotters <- list()
spotter_1 <- list(redact_email, TRUE, 0)
# Wrap in list() so the spotter is added as one element rather than flattened
spotters <- append(spotters, list(spotter_1))
df_cleaned <- clean_data_frame(df, spotters)
```

## Features and Usage
Conceptually, `sanityzeR` is a package that provides a way to remove PII from Pandas data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact them.

The main entry point to the package is the `Cleanser` class. The `Cleanser` class is used to add `Spotter`s to the cleanser, which will be used to identify PII in the data. The cleanser can then be used to cleanse the data and redact the PII from the given data frame (and from any other data structures the package supports in the future). **edit this as needed**
Conceptually, `sanityzeR` is a package that provides a way to remove PII from R data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact it.

The package comes with a number of default spotters, as subclasses of `Spotter`:
1. `CreditCardSpotter` - identifies credit card numbers
2. `EmailSpotter` - identifies email addresses
The library comes with two default redaction functions, `redact_creditcardnumber` and `redact_email`, each of which takes a character vector and redacts the corresponding PII using either a constant replacement string or a hash of the redacted value.

Spotters can be added to it using the `add_spotter()` method. The cleanser can then be used to cleanse data using the `cleanse()` method which takes a Pandas data frame and returns a Pandas data frame with PII redacted.
## Functions

The redaction options provided by `sanityze` are:
1. Redact using a fixed string - The string in this case is the ID of the spotter. For example, if the spotter is an instance of `CreditCardSpotter`, the string will be `{{CREDITCARD}}`, or `{{EMAILADDRS}}` for an instance of `EmailSpotter`.
2. Redact using a hash of the input - The hash is computed using the `hashlib` package, and the hash function is `md5`. For example, if the spotter is an instance of `CreditCardSpotter`, the string will be `{{6a8b8c6c8c62bc939a11f36089ac75dd}}` if the input contains the PII `1234-5678-9012-3456`.
1. `redact_creditcardnumber()`: a function that takes a character vector (string) and redacts credit card numbers contained within that string, replacing them with either:
1. A constant string that the user can specify

2. A hash of the redaction (using MD5)
2. `redact_email`: a function that takes a character vector (string) and redacts email addresses contained within that string, replacing them with either:
1. A constant string that the user can specify

## Classes and Functions
1. `Cleanser`: the main class of the package. It is used to add spotters to it, and then cleanse data using the spotters.
1. `add_spotter()`: adds a spotter to the cleanser
2. `remove_spotter()`: removes a spotter from the cleanser
3. `clean()`: cleanses the data in the given data frame, and returns a new data frame with PII redacted
2. `EmailSpotter`: a spotter that identifies email addresses
1. `getUID()`: returns the unique ID of the spotter
2. `process()`: performs the PII matching and redaction
3. `CreditCardSpotter`: a spotter that identifies credit card numbers
1. `getUID()`: returns the unique ID of the spotter
2. `process()`: performs the PII matching and redaction
2. A hash of the redaction (using MD5)
3. `clean_data_frame`: a function that takes the arguments below and returns a deep copy of the cleaned data.frame:
1. An input data.frame `df` to clean

> You can checkout detailed API Documentations [here](https://ubc-mds.github.io/sanityze/).
2. A list of spotter information arguments. Each item in the list is a list of 3 elements:

Below is a simple quick start example:
1. The redact\_\* function to use (e.g. `redact_creditcardnumber` ).

```python
import pandas as pd
from sanityze import Cleanser, EmailSpotter
2. The second argument of the redact\_\* function: `hash_spotted` (TRUE or FALSE) or 0 to use the default argument.

# Create a cleanser, and don't add the default spotters
cleanser = Cleanser(include_default_spotters=False)
# Register the email spotter, then clean an existing data frame `df`
cleanser.add_spotter(EmailSpotter())
cleaned_df = cleanser.clean(df)
```
3. The third argument of the redact\_\* function: `replace_with` (a redaction string) or 0 to use the default argument.

Below is a simple quick start example:

``` r
library(sanityzeR)
df <- data.frame()
spotters <- list()
spotter_1 <- list(redact_email, TRUE, 0)
# Wrap in list() so the spotter is added as one element rather than flattened
spotters <- append(spotters, list(spotter_1))

df_cleaned <- clean_data_frame(df, spotters)
```
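
Since `clean_data_frame()` is still a stub at this milestone, the sketch below shows one way the documented contract could work: each spotter is a list of three elements, and a `0` in the second or third position means "use the redact function's default". The names `clean_data_frame_sketch` and `toy_redact_email` are hypothetical and exist only to keep the example self-contained; this is an illustration under those assumptions, not the package's implementation.

```r
# Hypothetical stand-in for a redact_* function (ignores hash_spotted).
toy_redact_email <- function(string, hash_spotted = FALSE,
                             replace_with = "EMAILADDRS") {
  gsub("[^[:space:]@]+@[^[:space:]@]+", replace_with, string)
}

# Hypothetical sketch of the documented clean_data_frame() contract.
clean_data_frame_sketch <- function(df, spotters_list) {
  cleaned <- df  # R copies on modification, so `df` itself is left untouched
  is_char <- vapply(cleaned, is.character, logical(1))
  char_cols <- names(cleaned)[is_char]
  for (spotter in spotters_list) {
    redact_fn <- spotter[[1]]
    # A literal 0 means "keep the redact function's default argument".
    args <- list()
    if (!identical(spotter[[2]], 0)) args$hash_spotted <- spotter[[2]]
    if (!identical(spotter[[3]], 0)) args$replace_with <- spotter[[3]]
    for (col in char_cols) {
      cleaned[[col]] <- vapply(
        cleaned[[col]],
        function(value) do.call(redact_fn, c(list(value), args)),
        character(1),
        USE.NAMES = FALSE
      )
    }
  }
  cleaned
}

df <- data.frame(id = 1:2, note = c("mail foo@gaga.com", "nothing here"))
spotters <- list(list(toy_redact_email, 0, "[EMAIL]"))
df_cleaned <- clean_data_frame_sketch(df, spotters)
df_cleaned$note
# "mail [EMAIL]" "nothing here"
```

Note that `spotters` here is a list whose single element is itself a three-element list, which matches the structure the roxygen documentation describes.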

## High-level Design
To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found [here](HighLevelDesign.md).

To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found [here](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md).

## Contributing

Expand All @@ -106,13 +109,13 @@ Interested in contributing? Check out the [contributing guidelines](CONTRIBUTING

## Credits

`sanityzeR` was created with [`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/) and the `py-pkgs-cookiecutter` [template](https://github.com/py-pkgs/py-pkgs-cookiecutter).
`sanityzeR` was created using **devtools** and **usethis** R packages.

## Quick Links
* [Documentation](https://ubc-mds.github.io/sanityze/)
* [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15)
* [Issues](https://github.com/UBC-MDS/sanityze/issues)
* [High Level Design](HighLevelDesign.md)
* [Contributing Guidelines](CONTRIBUTING.md)
* [Code of Conduct](CODE_OF_CONDUCT.md)
* [License](LICENSE)

- [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15)
- [Issues](https://github.com/UBC-MDS/sanityzeR/issues)
- [High Level Design](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md)
- [Contributing Guidelines](CONTRIBUTING.md)
- [Code of Conduct](CODE_OF_CONDUCT.md)
- [License](LICENSE.md)