This repository contains the scripts and notebooks used to create, process, and upload the ElectricalNER dataset, a named-entity recognition (NER) dataset tailored to the electrical engineering domain. The pipeline is divided into three stages, each handled by a specific script or notebook.
**Stage 1: Dataset Generation**

- Purpose: Generate annotated NER data using a large language model (LLM).
- Functionality:
  - Sends structured prompts to an LLM to generate sentences and their corresponding NER annotations.
  - Saves the generated data in batches to CSV files.
- Key Features:
  - Asynchronous API calls for efficient batch processing (illustrated in the sketch below).
  - Handles large-scale dataset generation with options for saving intermediate results.
- Output:
  - Raw CSV files containing sentence-level and token-level NER annotations.
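The actual generation code lives in the pipeline scripts; as a rough, hypothetical sketch, asynchronous batch generation with the OpenAI Python client (v1.x) might look like the following. The prompt wording, entity tags, helper names, and CSV layout are assumptions for illustration, not the repository's implementation.

```python
# Hypothetical sketch of the generation step: concurrent OpenAI calls that
# return NER-annotated sentences, saved in batches to CSV.
import asyncio
import csv

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one sentence about electrical engineering and tag each token with "
    "an NER label (e.g. COMPONENT, MATERIAL, O). "
    "Return: sentence<TAB>token1/tag1 token2/tag2 ..."
)

async def generate_one() -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT}],
    )
    return response.choices[0].message.content

async def generate_batch(batch_size: int = 50) -> list[str]:
    # Issue the API calls concurrently for efficient batch processing.
    return await asyncio.gather(*(generate_one() for _ in range(batch_size)))

def save_batch(rows: list[str], path: str) -> None:
    # Persist an intermediate batch so long runs can be resumed or inspected.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["raw_annotation"])
        writer.writerows([row] for row in rows)

if __name__ == "__main__":
    batch = asyncio.run(generate_batch(batch_size=10))
    save_batch(batch, "raw_batch_000.csv")
```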
**Stage 2: Dataset Processing**

- Purpose: Process the raw CSV files into a Hugging Face-compatible dataset.
- Functionality:
  - Reads the raw CSV files generated in the previous step.
  - Structures the data into a `DatasetDict` with splits for training, validation, and testing (see the sketch below).
  - Saves the dataset in Hugging Face's binary (Arrow) format for efficient loading.
- Output:
  - A Hugging Face dataset (`.arrow` files) ready for use with Hugging Face models and libraries.
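As a minimal sketch (assuming `pandas` and the `datasets` library; the file names, split ratios, and output directory are placeholders, not the notebook's actual choices), the processing step could look roughly like this:

```python
# Hypothetical sketch: combine the raw CSV batches into a Hugging Face
# DatasetDict with train/validation/test splits and save it to disk.
import glob

import pandas as pd
from datasets import Dataset, DatasetDict

# Gather every batch CSV produced by the generation step (paths are assumed).
frames = [pd.read_csv(path) for path in sorted(glob.glob("raw_batch_*.csv"))]
full = Dataset.from_pandas(pd.concat(frames, ignore_index=True))

# 80/10/10 split: carve off 20% for evaluation, then halve it.
split = full.train_test_split(test_size=0.2, seed=42)
eval_split = split["test"].train_test_split(test_size=0.5, seed=42)

dataset = DatasetDict(
    {
        "train": split["train"],
        "validation": eval_split["train"],
        "test": eval_split["test"],
    }
)

# Writes Arrow files that load quickly with datasets.load_from_disk().
dataset.save_to_disk("electrical_ner_dataset")
```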
**Stage 3: Dataset Upload**

- Purpose: Upload the processed dataset to the Hugging Face Hub for public sharing.
- Functionality:
  - Configures the Hugging Face `datasets` library.
  - Uses the Hugging Face API to create a dataset repository and upload the dataset files (see the sketch below).
  - Includes metadata such as the dataset card and license.
- Output:
  - The ElectricalNER dataset hosted on the Hugging Face Hub.
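A hedged sketch of the upload step using `huggingface_hub` and `datasets`; the repository id `your-username/electrical-ner` and the on-disk path are placeholders, and the actual upload script may structure this differently:

```python
# Hypothetical sketch: create a dataset repository and push the processed
# dataset to the Hugging Face Hub.
import os

from datasets import load_from_disk
from huggingface_hub import HfApi

token = os.environ["HF_TOKEN"]

# Create the dataset repository if it does not already exist.
api = HfApi(token=token)
api.create_repo("your-username/electrical-ner", repo_type="dataset", exist_ok=True)

# Upload the Arrow splits; the dataset card (README.md) and license metadata
# can be edited on the Hub afterwards.
dataset = load_from_disk("electrical_ner_dataset")
dataset.push_to_hub("your-username/electrical-ner", token=token)
```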
Clone the repository and set up a conda environment:

```bash
git clone <repository_url>
cd ner-electrical-engineering
conda create -n ner_ee python=3.12
conda activate ner_ee
```
Install the required Python libraries:

```bash
pip install -r requirements.txt
```
Set up environment variables for the OpenAI API key and the Hugging Face access token by creating a `.env` file in the root directory:

```
HF_TOKEN=<huggingface_access_token>
OPENAI_API_KEY=<your_openai_api_key>
```
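If the scripts load these variables with `python-dotenv` (an assumption; they may read the environment directly), a quick sanity check that the `.env` file is picked up looks like this:

```python
# Optional check that the .env values are visible to Python (uses python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
assert os.getenv("HF_TOKEN"), "HF_TOKEN is not set"
```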
For detailed usage instructions, see the README file in the `dataset_creation_pipeline` folder.
- The dataset is generated using GPT-4o-mini and may contain inaccuracies.
- Intended for research and educational purposes; not recommended for critical applications without validation.
- Contributions for refinement and expansion are welcome.
This project is licensed under the MIT License. See the LICENSE file for details.
- Report issues or suggest improvements via GitHub.
- Contributions to expand or refine the dataset are highly encouraged.
This project utilizes GPT-4o-mini for dataset generation and Hugging Face libraries for dataset processing and hosting.