A LoRa-tuned deciphering runtime safety guardrail for LLM-powered software applications.
🚀 DecipherGuard is Available in Huggingface Model Hub 🚀
AutoTokenizer.from_pretrained("MickyMike/DecipherGuard")
AutoModelForCausalLM.from_pretrained("MickyMike/DecipherGuard")
Table of Contents
First of all, clone this repository to your local machine and access the main dir via the following command:
git clone https://github.com/awsm-research/DecipherGuard.git
cd DecipherGuard
Then, install the python dependencies via the following command:
pip install -r requirements.txt
This repo uses the following datasets:
The datasets have been compiled, transformed by jailbreak attack functions, split into 80% testing and 1-10% training, and stored at /data/split_attack_prompts
To replicate the experiment results, the following models are used:
- DecipherGuard
- Llama-Guard-3-8B
- OpenAI Moderation (omni-moderation-latest)
- PerspectiveAPI (v1alpha1)
- Perplexity (GPT2)
The models used can either be accessed from their huggingface pages, or as public, free APIs.
To replicate the empirical results of the experiment, please use the run the following commands to get the prediction of each model:
cd DecipherGuard
python -m evaluate.evaluation_decipherguard
python -m evaluation.evaluation_llamaguard
python -m evaluation.evaluation_openai_moderation
python -m evaluation.evaluation_perspectiveAPI
python -m evaluation.evaluation_perplexity
We recommend to use GPU with 16 GB up memory for inferencing since LlamaGuard is quite computational intensive.
To reproduce the RQ1 result, run the following commands (Inference only):
cd DecipherGuard
python -m evaluation.evaluation_llamaguard
python -m evaluation.evaluation_openai_moderation
python -m evaluation.evaluation_perspectiveAPI
python -m evaluation.evaluation_perplexity
To reproduce the RQ2&3 result, run the following commands (Inference only):
cd DecipherGuard
python -m evaluation.evaluation_decipherguard
To retrain the DecipherGuard model, run the following commands (Training + Inference):
cd DecipherGuard/train
python lora_decipher_main.py \
--training_proportion=ENTER YOUR VALUE HERE (e.g., 1, 3, 5, 7, 10) \
--do_train \
--batch_size=1 \
--data_dir=data \
--model_name_or_path=meta-llama/Llama-Guard-3-8B \
--saved_model_name=decipherguard \
--learning_rate=1e-4 \
--epochs=1 \
--max_grad_norm=1.0 \
--lora_r=8 \
--lora_alpha=32 \
--lora_dropout=0.1 \
--max_train_input_length=2048 \
--max_new_tokens=100 \
--seed 123456 2>&1 | tee decipher_lora.log
To reproduce the RQ4 result, run the following commands (Inference only):
cd DecipherGuard
python -m evaluation.evaluation_decipher_only
cd DecipherGuard
python -m lora.lora_testing_loop
This will produce the LoRa model results in in discussion section, specifically for the 6 different % of the training data used (1%,3%,5%,7%,10%,20%)
Model | Defence Success Rate (DSR) w/o jailbreak | Defence Success Rate (DSR) w/ jailbreak |
---|---|---|
LlamaGuard | 0.81 | 0.57 |
OpenAI Moderation | 0.76 | 0.39 |
PerspectiveAPI | 0.03 | 0.15 |
Perplexity | 0.15 | 0.28 |
Model | Defence Success Rate (DSR) w/ jailbreak |
---|---|
DecipherGuard | 0.94 |
LlamaGuard | 0.57 |
OpenAI Moderation | 0.39 |
Perplexity | 0.28 |
Model | Overall Guardrail Performance (OGP) w/ jailbreak |
---|---|
DecipherGuard | 0.96 |
LlamaGuard | 0.75 |
OpenAI Moderation | 0.62 |
Perplexity | 0.45 |
Model | Overall Guardrail Performance (OGP) w/ jailbreak | Defence Success Rate (DSR) w/ jailbreak |
---|---|---|
DecipherGuard | 0.96 | 0.94 |
LoRa + LLamaGuard | 0.95 | 0.92 |
Decipher + LlamaGuard | 0.67 | 0.76 |
LlamaGuard | 0.75 | 0.57 |
We would like to express our gratitude to the author of LlamaGuard for their foundational work and inspiration, as well as the creators of the datasets used in this repository: CategoricalHarmfulQA, do-not-answer, AdvBench, forbidden_question, and alpaca. Their efforts in curating and maintaining these resources were invaluable to this research.
Under Review at IEEE TSE