This repository contains the scripts and notebooks used to create, process, and upload the ElectricalNER dataset, a named-entity recognition (NER) dataset tailored to the electrical engineering domain. The pipeline is divided into three stages, each handled by a specific script or notebook.
**Stage 1: Dataset Generation**

- Purpose: Generate annotated NER data using a large language model (LLM).
- Functionality:
  - Sends structured prompts to an LLM to generate sentences and their corresponding NER annotations.
  - Saves the generated data in batches to CSV files.
- Key Features:
  - Asynchronous API calls for efficient batch processing (illustrated in the sketch below).
  - Handles large-scale dataset generation with options for saving intermediate results.
- Output:
  - Raw CSV files containing sentence-level and token-level NER annotations.
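The actual generation code lives in the pipeline scripts; as a rough, hypothetical sketch, asynchronous batch generation with the OpenAI Python client (v1.x) might look like the following. The prompt wording, entity tags, helper names, and CSV layout are assumptions for illustration, not the repository's implementation.

```python
# Hypothetical sketch of the generation step: concurrent OpenAI calls that
# return NER-annotated sentences, saved in batches to CSV.
import asyncio
import csv

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one sentence about electrical engineering and tag each token with "
    "an NER label (e.g. COMPONENT, MATERIAL, O). "
    "Return: sentence<TAB>token1/tag1 token2/tag2 ..."
)

async def generate_one() -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT}],
    )
    return response.choices[0].message.content

async def generate_batch(batch_size: int = 50) -> list[str]:
    # Issue the API calls concurrently for efficient batch processing.
    return await asyncio.gather(*(generate_one() for _ in range(batch_size)))

def save_batch(rows: list[str], path: str) -> None:
    # Persist an intermediate batch so long runs can be resumed or inspected.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["raw_annotation"])
        writer.writerows([row] for row in rows)

if __name__ == "__main__":
    batch = asyncio.run(generate_batch(batch_size=10))
    save_batch(batch, "raw_batch_000.csv")
```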
**Stage 2: Dataset Processing**

- Purpose: Process the raw CSV files into a Hugging Face-compatible dataset.
- Functionality:
  - Reads the raw CSV files generated in the previous step.
  - Structures the data into a `DatasetDict` with splits for training, validation, and testing (see the sketch below).
  - Saves the dataset in Hugging Face's binary (Arrow) format for efficient loading.
- Output:
  - A Hugging Face dataset (`.arrow` files) ready for use with Hugging Face models and libraries.
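As a minimal sketch (assuming `pandas` and the `datasets` library; the file names, split ratios, and output directory are placeholders, not the notebook's actual choices), the processing step could look roughly like this:

```python
# Hypothetical sketch: combine the raw CSV batches into a Hugging Face
# DatasetDict with train/validation/test splits and save it to disk.
import glob

import pandas as pd
from datasets import Dataset, DatasetDict

# Gather every batch CSV produced by the generation step (paths are assumed).
frames = [pd.read_csv(path) for path in sorted(glob.glob("raw_batch_*.csv"))]
full = Dataset.from_pandas(pd.concat(frames, ignore_index=True))

# 80/10/10 split: carve off 20% for evaluation, then halve it.
split = full.train_test_split(test_size=0.2, seed=42)
eval_split = split["test"].train_test_split(test_size=0.5, seed=42)

dataset = DatasetDict(
    {
        "train": split["train"],
        "validation": eval_split["train"],
        "test": eval_split["test"],
    }
)

# Writes Arrow files that load quickly with datasets.load_from_disk().
dataset.save_to_disk("electrical_ner_dataset")
```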
**Stage 3: Dataset Upload**

- Purpose: Upload the processed dataset to the Hugging Face Hub for public sharing.
- Functionality:
  - Configures the Hugging Face `datasets` library.
  - Uses the Hugging Face API to create a dataset repository and upload the dataset files (see the sketch below).
  - Includes metadata such as the dataset card and license.
- Output:
  - The ElectricalNER dataset hosted on the Hugging Face Hub.
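A hedged sketch of the upload step using `huggingface_hub` and `datasets`; the repository id `your-username/electrical-ner` and the on-disk path are placeholders, and the actual upload script may structure this differently:

```python
# Hypothetical sketch: create a dataset repository and push the processed
# dataset to the Hugging Face Hub.
import os

from datasets import load_from_disk
from huggingface_hub import HfApi

token = os.environ["HF_TOKEN"]

# Create the dataset repository if it does not already exist.
api = HfApi(token=token)
api.create_repo("your-username/electrical-ner", repo_type="dataset", exist_ok=True)

# Upload the Arrow splits; the dataset card (README.md) and license metadata
# can be edited on the Hub afterwards.
dataset = load_from_disk("electrical_ner_dataset")
dataset.push_to_hub("your-username/electrical-ner", token=token)
```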
Clone the repository and set up a conda environment:

```bash
git clone <repository_url>
cd ner-electrical-engineering
conda create -n ner_ee python=3.12
conda activate ner_ee
```
Install the required Python libraries:

```bash
pip install -r requirements.txt
```
Set up environment variables for the OpenAI API key and the Hugging Face access token by creating a `.env` file in the root directory:

```
HF_TOKEN=<huggingface_access_token>
OPENAI_API_KEY=<your_openai_api_key>
```
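If the scripts load these variables with `python-dotenv` (an assumption; they may read the environment directly), a quick sanity check that the `.env` file is picked up looks like this:

```python
# Optional check that the .env values are visible to Python (uses python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
assert os.getenv("HF_TOKEN"), "HF_TOKEN is not set"
```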
For detailed usage instructions, see the README file in the `dataset_creation_pipeline` folder.
- The dataset is generated using GPT-4o-mini and may contain inaccuracies.
- Intended for research and educational purposes; not recommended for critical applications without validation.
- Contributions for refinement and expansion are welcome.
This project is licensed under the MIT License. See the LICENSE file for details.
- Report issues or suggest improvements via GitHub.
- Contributions to expand or refine the dataset are highly encouraged.
This project utilizes GPT-4o-mini for dataset generation and Hugging Face libraries for dataset processing and hosting.