This project is based on the Princeton-Leuven Privacy Policy dataset, a dataset of over a million privacy policies spanning 20+ years.
An efficient search engine would help practitioners and policymakers quickly answer questions about privacy policies like: "What kind of information did Facebook collect in 2010?", "When did Google include GDPR compliance?"
This repo demonstrates semantic search, a step towards RAG (Retrieval-Augmented Generation), to support answering natural-language questions about the privacy policy dataset.
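The retrieval step behind semantic search can be sketched as ranking passages by the cosine similarity of their embedding vectors to the query's embedding. The example below is a minimal illustration, not this repo's implementation: the three-dimensional vectors are made-up stand-ins for the output of a sentence-embedding model, and the names cosine_similarity and rank_passages are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passages, top_k=2):
    """Return the top_k (score, text) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values, for illustration only).
PASSAGES = [
    ("We collect your email address and location.", [0.9, 0.1, 0.0]),
    ("This policy complies with the GDPR.", [0.1, 0.9, 0.1]),
    ("Contact us by mail at our head office.", [0.0, 0.2, 0.9]),
]

if __name__ == "__main__":
    query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "What data is collected?"
    for score, text in rank_passages(query_vec, PASSAGES):
        print(f"{score:.3f}  {text}")
```

In a real setup the vectors would come from an embedding model and the ranking would typically be delegated to a vector index rather than a linear scan.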
The code and deployment setup are heavily inspired by this article.
The code can be installed with poetry:
git clone git@github.com:privacy-policy-search/semantic-search-poc.git
cd semantic-search-poc
poetry install --all-extras
For demonstration purposes, this repo uses HuggingFace's SentenceTransformers.
It also includes a small sample of the privacy policies. Request access to the full dataset to unlock the full functionality.
Run locally with:
poetry run python save_model.py
poetry run python handler.py
All configuration files are provided for deployment on AWS Lambda and ECR with the serverless framework.
Create ECR Repo
You must have an AWS account, the AWS CLI installed (e.g., brew install awscli), and be authenticated.
aws ecr create-repository --repository-name privacypolicyfinder-repo
Make note of the repository URI for the following steps.
You can use the justfile_example provided to authenticate into the newly created ECR repository. Just fill in the missing fields, rename it to justfile, and run:
just auth
Deployment
You must follow these steps at least once before running the deploy recipe:
- Run poetry run python save_model.py if you haven't yet.
- Build the docker image and then push it:
docker build -t privacypolicyfinder . --platform=linux/x86_64
docker tag privacypolicyfinder <repoUri>
docker push <repoUri>
- Make note of the digest for the next step.
- Rename serverless_example.yml to serverless.yml and fill in image in this format: @
- Run
serverless deploy
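The image value in the steps above pins the pushed image by its digest. A minimal sketch of composing and sanity-checking such a reference follows, assuming the intended format is the repository URI and the digest joined by @, which is the usual form for digest-pinned container images. The function name is hypothetical, and the result still has to be pasted into serverless.yml by hand.

```python
import re

# A sha256 digest as reported after a push: "sha256:" followed by 64 hex characters.
DIGEST_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def image_reference(repo_uri, digest):
    """Join a repository URI and a digest as <repo_uri>@<digest> (assumed format)."""
    if not DIGEST_PATTERN.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repo_uri.rstrip('/')}@{digest}"
```

Validating the digest up front catches a common slip, such as pasting an image tag where the digest belongs, before the deploy step fails on it.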
Query
You can send in a new event following the AWS Lambda example format. You can use and edit the provided event.json and run:
just query
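For orientation, an event in the API Gateway proxy style carries the request payload as a JSON string under body. The sketch below builds and parses such an event. The query field name and both helper names are assumptions for illustration; check the provided event.json for the fields this repo's handler actually expects.

```python
import json


def build_event(query):
    """Wrap a search query in a proxy-style event whose body is a JSON string."""
    return {"body": json.dumps({"query": query})}  # "query" is an assumed field name


def parse_event(event):
    """Extract the query text, accepting a body that is a JSON string or a dict."""
    body = event.get("body") or "{}"
    if isinstance(body, str):
        body = json.loads(body)
    return body.get("query", "")


if __name__ == "__main__":
    event = build_event("When did Google include GDPR compliance?")
    print(json.dumps(event, indent=2))
    print(parse_event(event))
```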