Open Datagen is a data preparation tool designed to build controllable AI systems.
It offers improvements for:
RAG: Generate large Q&A datasets to improve your Retrieval strategies.
Evals: Create unique, “unseen” datasets to robustly test your models and avoid overfitting.
Fine-Tuning: Produce large, low-bias, and high-quality datasets to get better models after the fine-tuning process.
Guardrails: Generate red teaming datasets to strengthen the security and robustness of your Generative AI applications against attack.
- Use external sources to generate high-quality synthetic data (Local files, Hugging Face datasets and Internet)
- Data anonymization
- Open-source model support + local inference
- Decontamination
- Tree of thought
- Multimodality (Text, Audio and Image)
- No-code dataset generation with the Open DataGen UI
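To make the decontamination item above concrete: the general idea is to drop generated samples that overlap too heavily with a reference set, such as an evaluation benchmark. The sketch below illustrates one common approach, word n-gram overlap. It is not Open Datagen's implementation; the function names, the n-gram size, and the threshold are all assumptions chosen for illustration.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(samples, reference_texts, n=3, max_overlap=0.5):
    """Keep samples whose n-gram overlap with the references is at most `max_overlap`.

    Hypothetical helper for illustration only, not part of opendatagen.
    """
    reference_ngrams = set()
    for text in reference_texts:
        reference_ngrams |= word_ngrams(text, n)

    kept = []
    for sample in samples:
        grams = word_ngrams(sample, n)
        if not grams:
            kept.append(sample)  # too short to compare; kept by default
            continue
        overlap = len(grams & reference_ngrams) / len(grams)
        if overlap <= max_overlap:
            kept.append(sample)
    return kept


if __name__ == "__main__":
    benchmark = ["What is the capital of France? Paris is the capital of France."]
    generated = [
        "What is the capital of France? Paris is the capital.",
        "Which river flows through the city of Vienna?",
    ]
    print(decontaminate(generated, benchmark))
```

A lower `max_overlap` filters more aggressively; a larger `n` makes matches stricter, so fewer samples are flagged.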
conda create -n opendataenv python=3.9.6
conda activate opendataenv
pip install --upgrade opendatagen
export OPENAI_API_KEY='your_openai_api_key' #(using openai>=1.2)
export MISTRAL_API_KEY='your_mistral_api_key'
export TOGETHER_API_KEY='your_together_api_key'
export ANYSCALE_API_KEY='your_anyscale_api_key'
export ELEVENLABS_API_KEY='your_elevenlabs_api_key'
export SERPLY_API_KEY='your_serply_api_key' #Google Search API
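Before running a generation job, it can help to confirm that the keys for the providers you plan to use are actually set. The following is a minimal sketch using only the standard library. The helper name `missing_api_keys` is hypothetical and not part of opendatagen, and which keys are actually required depends on the models your template references.

```python
import os

# Environment variable names taken from the export commands above.
KNOWN_API_KEYS = [
    "OPENAI_API_KEY",
    "MISTRAL_API_KEY",
    "TOGETHER_API_KEY",
    "ANYSCALE_API_KEY",
    "ELEVENLABS_API_KEY",
    "SERPLY_API_KEY",
]


def missing_api_keys(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Assumption for this example: only OpenAI and Serply are needed.
    missing = missing_api_keys(["OPENAI_API_KEY", "SERPLY_API_KEY"])
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```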
Example: Generate a low-bias FAQ dataset based on Wikipedia content
from opendatagen.template import TemplateManager
from opendatagen.data_generator import DataGenerator
output_path = "opendatagen.csv"
template_name = "opendatagen"
manager = TemplateManager(template_file_path="faq_wikipedia.json")
template = manager.get_template(template_name=template_name)
if template:
    generator = DataGenerator(template=template)
    data, data_decontaminated = generator.generate_data(output_path=output_path, output_decontaminated_path=None)
Here, faq_wikipedia.json is the template file that contains the opendatagen template loaded above.
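Once generation finishes, the rows are written to `output_path`. A minimal sketch for inspecting that file with the standard library follows. It assumes the output is a regular CSV with a header row and makes no assumption about the column names, which depend on the template. The helper `preview_csv` is hypothetical and not part of opendatagen.

```python
import csv
import os
from itertools import islice


def preview_csv(path, limit=5):
    """Return (column_names, first `limit` rows as dicts) from a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, limit))
        columns = reader.fieldnames or []
    return list(columns), rows


if __name__ == "__main__":
    # Same path as output_path in the example above.
    if os.path.exists("opendatagen.csv"):
        columns, rows = preview_csv("opendatagen.csv")
        print("Columns:", columns)
        for row in rows:
            print(row)
```

Reading only the first few rows keeps the check cheap even for large generated datasets.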
We welcome contributions to Open Datagen! Whether you're looking to fix bugs, add templates or new features, or improve documentation, your help is greatly appreciated.
We would like to express our gratitude to the following open source projects and individuals that have inspired and helped us:
- Textbooks are all you need (Read the paper)
- Evol-Instruct Paper (Read the paper) by WizardLM_AI
- Textbook Generation by VikParuchuri
If you need help with your Generative AI strategy, implementation, and infrastructure, reach us on