Merge pull request #3 from UBC-MDS/tzoght-milestone2_1

ready for review
UBC-MDS · Jan 21, 2023 · efa01e7
2 parents 349ccd9 + 27b7ccd
commit efa01e7

Showing 13 changed files with 371 additions and 108 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,75 @@
+# Contributing
+
+Contributions are welcome, and they are greatly appreciated! Every little bit
+helps, and credit will always be given.
+
+## Types of Contributions
+
+### Report Bugs
+
+If you are reporting a bug, please include:
+
+* Your operating system name and version.
+* Any details about your local setup that might be helpful in troubleshooting.
+* Detailed steps to reproduce the bug.
+
+### Fix Bugs
+
+Look through the GitHub issues for bugs. Anything tagged with "bug" and "help
+wanted" is open to whoever wants to implement it.
+
+### Implement Features
+
+Look through the GitHub issues for features. Anything tagged with "enhancement"
+and "help wanted" is open to whoever wants to implement it.
+
+### Write Documentation
+
+You can never have enough documentation! Please feel free to contribute to any
+part of the documentation, such as the official docs, docstrings, or even
+on the web in blog posts, articles, and such.
+
+### Submit Feedback
+
+If you are proposing a feature:
+
+* Explain in detail how it would work.
+* Keep the scope as narrow as possible, to make it easier to implement.
+* Remember that this is a volunteer-driven project, and that contributions
+  are welcome.
+
+## Get Started!
+
+Ready to contribute? Here's how to set up `sanityzeR` for local development. 
+
+1. Fork `sanityzeR` and clone a copy locally.
+2. Install the package locally from RStudio:
+
+    ```r
+    library(devtools)
+    library(usethis)
+    load_all()
+    ```
+
+3. Use `git` (or similar) to create a branch for local development and make your changes:
+
+    ```console
+    $ git checkout -b name-of-your-bugfix-or-feature
+    ```
+
+4. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests.
+
+5. Commit your changes and open a pull request.
+
+## Pull Request Guidelines
+
+Before you submit a pull request, check that it meets these guidelines:
+
+1. The pull request should include additional tests if appropriate.
+2. If the pull request adds functionality, the docs should be updated.
+3. The pull request should work for all currently supported operating systems and versions of R.
+
+## Code of Conduct
+
+Please note that the `sanityzeR` project is released with a
+Code of Conduct. By contributing to this project you agree to abide by its terms.
diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -0,0 +1,11 @@
+# Contributors
+
+## Special thanks to all the people who have helped this project so far:
+
+-   [Tony Zoght](https://github.com/tzoght)
+-   [Caesar Wong](https://github.com/caesarw0)
+-   [Jonah Hamilton](https://github.com/xXJohamXx)
+
+## I would like to join this list. How can I help the project?
+
+For more information, please refer to our [CONTRIBUTING](CONTRIBUTING.md) guide.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -5,13 +5,14 @@ Authors@R:
     c(person(given = "Jonah", 
             family =  "Hamilton", 
             email = "jonah.hamilton@alumni.ubc.ca", 
-            role = c("aut", "cre")),
+            role = c("aut")),
       person(given = "Caesar",
              family = "Wong", 
-             role = "ctb"),
+             role = c("ctb")),
       person(given = "Tony",
              family = "Zoght",
-             role = "ctb")
+             email = "tony@zoght.com",
+             role = c("cre"))
            )
 Description: Data scientists often need to remove or redact Personal Identifiable 
              Information (PII) from their data. This package provides utilities 

diff --git a/NAMESPACE b/NAMESPACE
@@ -1,2 +1,5 @@
 # Generated by roxygen2: do not edit by hand
 
+export(clean_data_frame)
+export(redact_creditcardnumber)
+export(redact_email)
diff --git a/R/clean_data_frame.R b/R/clean_data_frame.R
@@ -0,0 +1,23 @@
+#' Cleans a data.frame by redacting PII from character vector columns
+#'
+#' @param df A data.frame to clean
+#' @param spotters_list A list containing lists of 3 elements each:
+#' 1. the redact function
+#' 2. hash_spotted value to pass or 0 to keep the default
+#' 3. the replace_with value or 0 to keep the default
+#'
+#'
+#' @return A deep copy of the cleaned data.frame.
+#' @export
+#'
+#' @examples
+#' df <- data.frame()
+#' spotters <- list()
+#' spotter_1 <- list(redact_email, TRUE, 0)
+#' spotters <- append(spotters, list(spotter_1))
+#' df_cleaned <- clean_data_frame(df, spotters)
+clean_data_frame <- function(df, spotters_list) {
+  # to be implemented in the next milestone
+  print(df)
+  print(spotters_list)
+}
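The stub above leaves the cleaning logic for the next milestone. As a rough illustration of what the documented contract implies (a list of 3-element spotters, with `0` meaning "keep the default"), here is a minimal sketch. The names `clean_data_frame_sketch` and `toy_redact` are hypothetical and not part of the package, and the email pattern is an assumption chosen only for the demo.

```r
# Hypothetical stand-in for a redact_* function (not the package's code).
toy_redact <- function(string, hash_spotted = FALSE, replace_with = "EMAILADDRS") {
  gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}", replace_with, string)
}

# Sketch of the documented behaviour: apply every spotter to every
# character column and return the modified copy.
clean_data_frame_sketch <- function(df, spotters_list) {
  stopifnot(is.data.frame(df), is.list(spotters_list))
  cleaned <- df  # R copies on modification, so the caller's df is untouched
  char_cols <- names(cleaned)[vapply(cleaned, is.character, logical(1))]
  for (spotter in spotters_list) {
    redact_fn <- spotter[[1]]
    # Only pass arguments that are not the 0 "use the default" sentinel
    args <- list()
    if (!identical(spotter[[2]], 0)) args$hash_spotted <- spotter[[2]]
    if (!identical(spotter[[3]], 0)) args$replace_with <- spotter[[3]]
    for (col in char_cols) {
      cleaned[[col]] <- vapply(
        cleaned[[col]],
        function(s) do.call(redact_fn, c(list(s), args)),
        character(1),
        USE.NAMES = FALSE
      )
    }
  }
  cleaned
}

df <- data.frame(id = 1:2,
                 note = c("mail foo@gaga.com", "no pii"),
                 stringsAsFactors = FALSE)
spotters <- list(list(toy_redact, 0, 0))
cleaned <- clean_data_frame_sketch(df, spotters)
cleaned$note
```

Using `identical(x, 0)` for the sentinel check keeps `FALSE` distinguishable from `0`, which matters because `hash_spotted = FALSE` is a legitimate explicit value.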
diff --git a/R/redact_creditcardnumber.R b/R/redact_creditcardnumber.R
@@ -0,0 +1,19 @@
+#' Redacts credit card numbers from a given string
+#'
+#' @param string A character vector with, at most, one element. The input string to redact credit card numbers from
+#' @param hash_spotted When TRUE, each redacted credit card number is replaced with a hash of its value (default FALSE)
+#' @param replace_with A character vector with, at most, one element. When hash_spotted is FALSE, this string replaces each redacted credit card number.
+#'
+#'
+#' @return A character vector.
+#' @export
+#'
+#' @examples
+#' x <- "You can use my 5567554868135971 here"
+#' redact_creditcardnumber(x)
+redact_creditcardnumber <- function(string, hash_spotted=FALSE, replace_with="CREDITCARD") {
+  # to be implemented in the next milestone
+  print(string)
+  print(hash_spotted)
+  print(replace_with)
+}
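For reference, the constant-replacement path described in the roxygen comments could look roughly like the sketch below. The function name and the pattern (13 to 16 digits, optionally separated by single spaces or hyphens) are assumptions for illustration; the eventual implementation may match card numbers differently, and the hash path is omitted here.

```r
# Hypothetical sketch; only the hash_spotted = FALSE path is covered.
redact_creditcardnumber_sketch <- function(string, replace_with = "CREDITCARD") {
  stopifnot(is.character(string), length(string) <= 1)
  # Assumed pattern: one digit followed by 12 to 15 more digits, each
  # optionally preceded by a single space or hyphen
  pattern <- "\\b[0-9](?:[ -]?[0-9]){12,15}\\b"
  gsub(pattern, replace_with, string, perl = TRUE)
}

x <- "You can use my 5567554868135971 here"
redact_creditcardnumber_sketch(x)
```

A pattern this simple will also match other long digit runs (order numbers, for example), so a real spotter would likely need additional validation.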
diff --git a/R/redact_email.R b/R/redact_email.R
@@ -0,0 +1,19 @@
+#' Redacts email addresses from a given string
+#'
+#' @param string A character vector with, at most, one element. The input string to redact email addresses from
+#' @param hash_spotted When TRUE, each redacted email address is replaced with a hash of its value (default FALSE)
+#' @param replace_with A character vector with, at most, one element. When hash_spotted is FALSE, this string replaces each redacted email address.
+#'
+#'
+#' @return A character vector.
+#' @export
+#'
+#' @examples
+#' x <- "my email address is foo@gaga.com"
+#' redact_email(x)
+redact_email <- function(string, hash_spotted=FALSE, replace_with="EMAILADDRS") {
+  # to be implemented in the next milestone
+  print(string)
+  print(hash_spotted)
+  print(replace_with)
+}
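The two redaction modes in the documentation above (constant string versus hash) might be sketched as follows. This is not the package's implementation: the function name and the email pattern are assumptions, and the hash branch assumes the `digest` package is installed, since the source only says the hash is MD5 without naming a library.

```r
# Hypothetical sketch of both redaction modes for email addresses.
redact_email_sketch <- function(string, hash_spotted = FALSE,
                                replace_with = "EMAILADDRS") {
  stopifnot(is.character(string), length(string) <= 1)
  if (length(string) == 0) return(string)
  # Assumed, deliberately simple pattern; real addresses are more varied
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  matches <- gregexpr(pattern, string)
  regmatches(string, matches) <- lapply(
    regmatches(string, matches),
    function(found) {
      if (!hash_spotted) return(rep(replace_with, length(found)))
      # MD5 of each matched address (assumes the digest package)
      vapply(found, digest::digest, character(1),
             algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
    }
  )
  string
}

redact_email_sketch("my email address is foo@gaga.com")
```

Hashing each match separately means the same address always maps to the same token, so redacted records can still be grouped or joined.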
diff --git a/README.Rmd b/README.Rmd
@@ -2,7 +2,6 @@
 output: github_document
 ---
 
-
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
@@ -12,24 +11,27 @@ knitr::opts_chunk$set(
 )
 ```
 
-** note this is a initial version of the README file which will be updated for the future milestones**
-
 # sanityzeR
 
 <!-- badges: start -->
+
 <!-- badges: end -->
 
+![](logo.png)
+
 The goal of sanityzeR:
 
 Data scientists often need to remove or redact Personal Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII from R data frames/tibbles.
 
 PII can be used to uniquely identify a person. It includes names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared and further processed.
 
-## Why `sanityzeR` ? 
+## Why `sanityzeR` ?
+
 Because it's a fun name, and it's a play on the word "sanitize", which is what we are doing to the data.
 
-## Similar R packages 
- **add to this**
+## Similar R packages
+
+The closest R package in functionality is [**anonymizer**](https://www.rdocumentation.org/packages/anonymizer/versions/0.2.0), which is a package for finding and removing PII from text. It is not designed to work with data frames directly, and we believe that our package will be more user-friendly and intuitive because it accepts data frames directly. In addition, sanityzeR lets users define new types of spotters to redact new types of PII.
 
 ## Installation
 
@@ -42,59 +44,60 @@ devtools::install_github("UBC-MDS/sanityzeR")
 
 ## Example
 
-This is a basic example which shows you how to solve a common problem: **edit this**
+This is a basic example which shows you how to solve a common problem:
 
 ```{r example}
 library(sanityzeR)
-## basic example code
+df <- data.frame()
+spotters <- list()
+spotter_1 <- list(redact_email, TRUE, 0)
+spotters <- append(spotters, list(spotter_1))
+
+df_cleaned <- clean_data_frame(df, spotters)
 ```
 
 ## Features and Usage
-Conceptually, `sanityzeR` is a package that provides a way to remove PII from Pandas data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact them. 
 
-The main entry point to the package is the `Cleanser` class. The `Cleanser` class is used to add `Spotter`s to the cleanser, which will be used to identify PII in the data. The cleanser can then be used to cleanse the data, and redact the PII from the given data frame (all future data structures that will be suppportd by the package, in the future). **edit this as needed**
+Conceptually, `sanityzeR` is a package that provides a way to remove PII from R data frames. The package provides a number of default spotters, which can be used to identify PII in the data and redact it.
 
-The package comes with a number of default spotters, as subclassess of `Spotter`:
-1. `CreditCardSpotter` - identifies credit card numbers
-2. `EmailSpotter` - identifies email addresses
+The library comes with two default redaction functions, `redact_creditcardnumber` and `redact_email`, each of which takes a character vector and redacts the corresponding PII using either a constant string replacement or a hash of the redacted value.
 
-Spotters can be added to it using the `add_spotter()` method. The cleanser can then be used to cleanse data using the `cleanse()` method which takes a Pandas data frame and returns a Pandas data frame with PII redacted.
+## Functions
 
-The redaction options provided by `sanityze`` are:
-1. Redact using a fixed string - The string in this case is the ID of the spotter. For example, if the spotter is an instance of `CreditCardSpotter`, the string will be `{{CREDITCARD}}`, or `{{EMAILADDRS}}` for an instance of `EmailSpotter`.
-2. Redact using a hash of the input - The hash is computed using the `hashlib` package, and the hash function is `md5`. For example, if the spotter is an instance of `CreditCardSpotter`, the string will be `{{6a8b8c6c8c62bc939a11f36089ac75dd}}`, if the input is contains a PII `1234-5678-9012-3456`.
+1.  `redact_creditcardnumber()`: a function that takes a character vector (string) and redacts credit card numbers contained within that string, replacing them with either:
+    1.  A constant string that the user can specify
 
+    2.  A hash of the redaction (using MD5)
+2.  `redact_email()`: a function that takes a character vector (string) and redacts email addresses contained within that string, replacing them with either:
+    1.  A constant string that the user can specify
 
-## Classes and Functions
-1. `Cleanser`: the main class of the package. It is used to add spotters to it, and then cleanse data using the spotters.
-   1. `add_spotter()`: adds a spotter to the cleanser
-   2. `remove_spotter()`: removes a spotter from the cleanser
-   3. `clean()`: cleanses the data in the given data frame, and returns a new data frame with PII redacted
-2. `EmailSpotter`: a spotter that identifies email addresses
-   1. `getUID()`: returns the unique ID of the spotter
-   2. `process()`: performs the PII matching and redaction
-3. `CreditCardSpotter`: a spotter that identifies credit card numbers
-   1. `getUID()`: returns the unique ID of the spotter
-   2. `process()`: performs the PII matching and redaction
+    2.  A hash of the redaction (using MD5)
+3.  `clean_data_frame()`: a function that takes the following arguments and returns a deep copy of the cleaned data.frame:
+    1.  An input data.frame `df` to clean
 
-> You can checkout detailed API Documentations [here](https://ubc-mds.github.io/sanityze/).
+    2.  A list of spotter information arguments. Each item in the list is a list of 3 elements:
 
-Below is a simple quick start example:
+        1.  The redact\_\* function to use (e.g. `redact_creditcardnumber`).
 
-```python
-import pandas as pd
-from sanityze import Cleanser, EmailSpotter
+        2.  The second argument of the redact\_\* function: `hash_spotted` (TRUE or FALSE) or 0 to use the default argument.
 
-# Create a cleanser, and don't add the default spotters
-cleanser = Cleanser(include_default_spotters=False)
-cleaner.add_spotter(from sanityze import Cleanser, EmailSpotter())
-cleaned_df = cleanser.clean(df)
-```
+        3.  The third argument of the redact\_\* function: `replace_with` (a redaction string) or 0 to use the default argument.
+
+Below is a simple quick start example:
 
+``` r
+library(sanityzeR)
+df <- data.frame()
+spotters <- list()
+spotter_1 <- list(redact_email, TRUE, 0)
+spotters <- append(spotters, list(spotter_1))
 
+df_cleaned <- clean_data_frame(df, spotters)
+```
 
 ## High-level Design
-To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found [here](HighLevelDesign.md).
+
+To better understand the design of the package, we have provided a high-level design document, which will be kept up to date as the package evolves. The document can be found [here](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md).
 
 ## Contributing
 
@@ -106,13 +109,13 @@ Interested in contributing? Check out the [contributing guidelines](CONTRIBUTING
 
 ## Credits
 
-`sanityzeR` was created with [`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/) and the `py-pkgs-cookiecutter` [template](https://github.com/py-pkgs/py-pkgs-cookiecutter).
+`sanityzeR` was created using the **devtools** and **usethis** R packages.
 
 ## Quick Links
-  * [Documentation](https://ubc-mds.github.io/sanityze/)
-  * [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15)
-  * [Issues](https://github.com/UBC-MDS/sanityze/issues)
-  * [High Level Design](HighLevelDesign.md) 
-  * [Contributing Guidelines](CONTRIBUTING.md)
-  * [Code of Conduct](CODE_OF_CONDUCT.md)
-  * [License](LICENSE)
+
+-   [Kanban Board](https://github.com/orgs/UBC-MDS/projects/15)
+-   [Issues](https://github.com/UBC-MDS/sanityzeR/issues)
+-   [High Level Design](https://github.com/UBC-MDS/sanityze/blob/main/HighLevelDesign.md)
+-   [Contributing Guidelines](CONTRIBUTING.md)
+-   [Code of Conduct](CODE_OF_CONDUCT.md)
+-   [License](LICENSE.md)
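A note on building the `spotters` list used in the README examples: `clean_data_frame()` is documented as taking a list whose items are themselves 3-element lists, and base R's `append()` concatenates lists rather than nesting them. The short sketch below shows the difference; `nchar` is only a placeholder standing in for a `redact_*` function.

```r
# nchar is a placeholder for a redact_* function in this illustration
spotter_1 <- list(nchar, TRUE, 0)

# Appending the spotter directly splices its three elements into the list
flattened <- append(list(), spotter_1)
length(flattened)      # 3

# Wrapping it in list() keeps the spotter together as a single item
nested <- append(list(), list(spotter_1))
length(nested)         # 1
length(nested[[1]])    # 3
```

Each spotter therefore needs to be wrapped in `list()` when appended, or the whole collection can be written directly as `list(spotter_1, spotter_2)`.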