This repository contains a set of scripts that I developed to classify text data with the OpenAI API. Usage examples are provided below.
If you are new to the OpenAI API, I recommend starting with its documentation; you may also find it helpful to consult the information flow diagram that I prepared.
Divide the data to be classified into batches in .jsonl format according to the API usage limits. Set your prompt, the desired maximum number of tokens, and the chosen model directly in the code.
create_jsonl(df, batch_size=30000, output_base_name="batch", output_dir="./input_batchs/")
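The Batch API expects a .jsonl file in which each line is a self-contained request. Below is a rough sketch of what one such line can look like for a chat-completions classification call; the model name, max_tokens value, system prompt, and column names are placeholders, and build_batch_line is an illustrative stand-in rather than the repository's create_jsonl.

```python
import json
import os

import pandas as pd

def build_batch_line(row_id, text):
    """Serialize one classification request in the Batch API .jsonl format."""
    request = {
        "custom_id": str(row_id),  # used later to match each result back to its row
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder model name
            "max_tokens": 300,       # placeholder token cap
            "messages": [
                {"role": "system", "content": "Classify the text into one of the predefined categories."},
                {"role": "user", "content": text},
            ],
        },
    }
    return json.dumps(request, ensure_ascii=False)

# Example: write one small batch from a DataFrame with "post_id" and "text" columns
df = pd.DataFrame({"post_id": [1, 2], "text": ["first post", "second post"]})
os.makedirs("./input_batchs/", exist_ok=True)
with open("./input_batchs/batch_1.jsonl", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        f.write(build_batch_line(row["post_id"], row["text"]) + "\n")
```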
Upload batches to the OpenAI platform
upload_jsonl_files("./input_batchs/")
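upload_jsonl_files presumably wraps the Files endpoint. A minimal sketch of the underlying call using the official openai Python client follows; the loop and the list of collected IDs are illustrative, not the repository's code.

```python
import glob

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload every batch file with purpose="batch" and keep the returned file IDs
uploaded = []
for path in sorted(glob.glob("./input_batchs/*.jsonl")):
    with open(path, "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")
    uploaded.append({"path": path, "file_id": batch_file.id})
    print(path, "->", batch_file.id)
```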
Start batch jobs, limiting them to the maximum processing capacity available (based on the number of calls and the size of each file)
create_batch_jobs("uploaded_file_ids.csv", start_file=1, end_file=15)
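Under the hood this corresponds to the Batches endpoint. A sketch with the official openai client, assuming a "file_id" column in uploaded_file_ids.csv and a "batch_id" column in the output CSV; the column names and the slicing are assumptions, not the repository's implementation.

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()

# Assumes "uploaded_file_ids.csv" holds one uploaded file ID per row in a "file_id" column
file_ids = pd.read_csv("uploaded_file_ids.csv")["file_id"]

batch_ids = []
for file_id in file_ids.iloc[0:15]:  # only start a limited range of jobs at a time
    batch = client.batches.create(
        input_file_id=file_id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    batch_ids.append(batch.id)

pd.DataFrame({"batch_id": batch_ids}).to_csv("batch_ids.csv", index=False)
```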
Update each batch record with the ID of its output file so that the results can be downloaded.
update_batch_output_file_id("batch_ids.csv")
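A sketch of how the output file ID can be looked up with the official openai client; the "batch_id", "status", and "output_file_id" column names are assumptions about the CSV layout, not necessarily what update_batch_output_file_id uses.

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()

# Assumes "batch_ids.csv" has a "batch_id" column; record the output file ID of completed jobs
records = pd.read_csv("batch_ids.csv")
statuses, output_ids = [], []
for batch_id in records["batch_id"]:
    batch = client.batches.retrieve(batch_id)
    statuses.append(batch.status)
    output_ids.append(batch.output_file_id if batch.status == "completed" else None)

records["status"] = statuses
records["output_file_id"] = output_ids
records.to_csv("batch_ids.csv", index=False)
```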
Download all successfully completed batches
csv_file = "batch_ids.csv"
download_folder = "./resulting_batches/"
download_and_process_multiple_files(csv_file, download_folder)
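A sketch of the download step with the official openai client, reusing the assumed column names from the previous step; it is illustrative, not the repository's implementation.

```python
import os

import pandas as pd
from openai import OpenAI

client = OpenAI()

csv_file = "batch_ids.csv"
download_folder = "./resulting_batches/"
os.makedirs(download_folder, exist_ok=True)

records = pd.read_csv(csv_file)
for _, row in records.iterrows():
    # Skip jobs that are still running, failed, or have no output file yet
    if row["status"] != "completed" or pd.isna(row["output_file_id"]):
        continue
    content = client.files.content(row["output_file_id"])
    out_path = os.path.join(download_folder, f"{row['batch_id']}.jsonl")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(content.text)
```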
Retrieve relevant information from downloaded .jsonl files
output_df = load_jsonl_to_dataframe("./resulting_batches/", id_column="post_id", content_column="reasoning")
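The Batch API writes one JSON object per request into the output file, with the model's reply nested under response.body. A sketch of the parsing step is shown below; results_to_dataframe is an illustrative stand-in for the repository's load_jsonl_to_dataframe.

```python
import glob
import json

import pandas as pd

def results_to_dataframe(folder, id_column, content_column):
    """Collect the custom_id and model output from each Batch API result line."""
    rows = []
    for path in glob.glob(f"{folder.rstrip('/')}/*.jsonl"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                body = record["response"]["body"]  # the embedded chat-completions response
                rows.append({
                    id_column: record["custom_id"],
                    content_column: body["choices"][0]["message"]["content"],
                })
    return pd.DataFrame(rows)

output_df = results_to_dataframe("./resulting_batches/", "post_id", "reasoning")
```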