This repository contains the source code and synthetic datasets used in our research on scam detection using deep learning models trained on data generated by Large Language Models (LLMs). Our work demonstrates the effectiveness of synthetic data in training scam detection models and offers publicly available datasets and models for further research and development in fraud prevention.
This research presents a novel approach to training scam detection models using synthetic data generated by Large Language Models (LLMs). We propose single-agent and multi-agent methods for data generation and train six deep learning architectures—LSTM, BiLSTM, GRU, BiGRU, CNN, and BERT—to classify conversations as scam or non-scam. Our experiments demonstrate that models trained on synthetic data achieve high accuracy on both generated test sets and real-world scam conversations. The models perform well even with limited conversation turns and when analyzing only the suspect's messages, indicating potential for early scam detection and privacy-preserving applications. Our findings highlight the efficacy of synthetic data in overcoming real-world dataset limitations for scam detection.
- Synthetic datasets were generated using single-agent and multi-agent conversation generation.
- Models were trained to classify conversations as scam or non-scam, achieving high accuracy on synthetic and real-world data.
- Models were evaluated for early scam detection (using a limited number of conversation turns) and for privacy-preserving detection (using only the suspect's messages).
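The early-detection and suspect-only evaluations both come down to preprocessing a conversation before it reaches the classifier. The following is a minimal sketch of that step, not the repository's actual code. It assumes each conversation is a list of `(speaker, text)` pairs; the speaker labels `"suspect"` and `"victim"` and all function names are hypothetical.

```python
from typing import List, Tuple

# (speaker, text); the speaker labels are an assumption, not the dataset's schema.
Turn = Tuple[str, str]


def first_turns(conversation: List[Turn], max_turns: int) -> List[Turn]:
    """Keep only the first `max_turns` turns to simulate early detection."""
    if max_turns < 0:
        raise ValueError("max_turns must be non-negative")
    return conversation[:max_turns]


def suspect_only(conversation: List[Turn], suspect_label: str = "suspect") -> List[Turn]:
    """Drop every turn that was not written by the suspect."""
    return [turn for turn in conversation if turn[0] == suspect_label]


def to_model_input(conversation: List[Turn]) -> str:
    """Flatten the remaining turns into one string for a text classifier."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in conversation)


# Illustrative conversation written for this sketch, not taken from the datasets.
example = [
    ("suspect", "Hello, this is the refund department."),
    ("victim", "I was not expecting a refund."),
    ("suspect", "We need your bank details to process it."),
    ("victim", "That sounds suspicious."),
]

# Combine both restrictions: the suspect's messages within the first three turns.
early_suspect_view = suspect_only(first_turns(example, 3))
```

Applying the truncation before the filter means the turn budget counts both speakers, which is one possible reading of "limited conversation turns"; counting only the suspect's turns would be an equally reasonable variant.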
Two datasets were generated using LLMs:
- Single-Agent Dataset: A single LLM was used to simulate both the scammer’s and the victim’s sides of each conversation. This dataset includes 1,600 conversations split evenly between scam and non-scam categories.
- Multi-Agent Dataset: Two LLM instances were configured, one representing the scammer/non-scammer and the other acting as the victim with various personality traits (e.g., skeptical, trusting, aggressive). This dataset also includes 1,600 conversations.
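The multi-agent setup can be pictured as two agents taking turns over a shared history. The sketch below is only an illustration under stated assumptions: the LLM is abstracted as a callable that takes a system prompt and the history and returns text, the prompt wording is invented, and the names `generate_conversation`, `build_victim_prompt`, and `stub_reply` are hypothetical. It does not reproduce the prompts or API calls used in this research.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"speaker": ..., "text": ...}
# Assumed interface for an LLM call: (system_prompt, history) -> reply text.
ReplyFn = Callable[[str, List[Message]], str]


def build_victim_prompt(personality: str) -> str:
    """Hypothetical system prompt that gives the victim agent a personality trait."""
    return f"You are the recipient in this conversation. Respond as a {personality} person."


def generate_conversation(
    caller_reply: ReplyFn,
    victim_reply: ReplyFn,
    caller_prompt: str,
    victim_prompt: str,
    num_turns: int,
) -> List[Message]:
    """Alternate between the two agents, starting with the caller."""
    history: List[Message] = []
    for turn_index in range(num_turns):
        if turn_index % 2 == 0:
            speaker, reply_fn, prompt = "caller", caller_reply, caller_prompt
        else:
            speaker, reply_fn, prompt = "recipient", victim_reply, victim_prompt
        # Each agent sees the full history so far plus its own system prompt.
        text = reply_fn(prompt, history)
        history.append({"speaker": speaker, "text": text})
    return history


def stub_reply(system_prompt: str, history: List[Message]) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"(reply {len(history) + 1})"


conversation = generate_conversation(
    caller_reply=stub_reply,
    victim_reply=stub_reply,
    caller_prompt="You are calling about a refund.",  # invented example prompt
    victim_prompt=build_victim_prompt("skeptical"),
    num_turns=4,
)
```

In practice `stub_reply` would be replaced by a function that calls an LLM. Under this interface, the single-agent variant would instead ask one model to write the whole conversation in a single call.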
Scam Categories:
- Social Security Scams
- Refund Scams
- Technical Support Scams
- Reward Scams
Non-Scam Categories:
- Delivery Confirmations
- Insurance Sales
- Appointment Confirmations
- Wrong Number
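For binary classification, each of the eight categories above collapses to a scam or non-scam label. A small sketch of that mapping follows; the encoding (1 for scam, 0 for non-scam) and the function name `binary_label` are assumptions made for illustration, and the released datasets may store labels differently.

```python
SCAM_CATEGORIES = [
    "Social Security Scams",
    "Refund Scams",
    "Technical Support Scams",
    "Reward Scams",
]
NON_SCAM_CATEGORIES = [
    "Delivery Confirmations",
    "Insurance Sales",
    "Appointment Confirmations",
    "Wrong Number",
]

# Case-insensitive lookup tables built from the category names listed above.
_SCAM = {name.casefold() for name in SCAM_CATEGORIES}
_NON_SCAM = {name.casefold() for name in NON_SCAM_CATEGORIES}


def binary_label(category: str) -> int:
    """Return 1 for a scam category and 0 for a non-scam category (assumed encoding)."""
    key = category.strip().casefold()
    if key in _SCAM:
        return 1
    if key in _NON_SCAM:
        return 0
    raise ValueError(f"Unknown category: {category!r}")
```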
The dataset and models for this project are available on [Hugging Face]. You can download them directly using the links below:
- Dataset:
- Trained Model:
Plots of model performance cover the following evaluation settings:
- Single-Agent Dataset with only Suspect
- Multi-Agent Dataset with only Suspect
- YouTube Video Dataset (Trained on Single-Agent Dataset)
- YouTube Video Dataset (Trained on Multi-Agent Dataset)
- YouTube Video Dataset with only Suspect (Trained on Single-Agent Dataset)
- YouTube Video Dataset with only Suspect (Trained on Multi-Agent Dataset)
This research demonstrates that deep learning models trained entirely on synthetic data can effectively detect scam conversations, even from limited information such as only the suspect's messages. This privacy-preserving approach is highly applicable in real-world scenarios, allowing telecom companies and messaging platforms to detect potential fraud without processing full conversations. Future research should focus on expanding the diversity of synthetic datasets and further improving model architectures to support real-time scam detection systems.
git clone https://github.com/yourusername/scam-detection-using-synthetic-data.git
cd scam-detection-using-synthetic-data
pip install -r requirements.txt
If you use this repo, please cite:
@inproceedings{gumphusiri2024,
title={Synthetic Data for Scam Detection: Leveraging LLMs to Train Deep Learning Models},
author={Gumphusiri, Pitipat and Triyason, Tuul},
year={2024},
booktitle={IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (under review)},
}