Text Classification with Large Language Models - Llama3, Llama3:70B, Mixtral:8x7B, and GPT-3.5-Turbo
This project evaluates the effectiveness of large language models (LLMs) at multi-class text classification using few-shot prompting. Text classification is crucial for organizing information, enhancing search engines, and improving user interaction. Our study uses the BBC News Dataset, categorizing news articles into five classes: Business, Technology, Sports, Politics, and Entertainment.
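As a rough illustration of this few-shot setup (not the exact prompt or examples used in the notebooks; the helper functions and example articles below are placeholders), a prompt can be assembled from labelled examples and sent to a local model via Ollama:

```python
# Minimal sketch of few-shot news classification with a local Ollama model.
# The prompt wording, examples, and helper names are placeholders, not the
# exact ones used in this project's notebooks.
import ollama  # pip install ollama; requires a running Ollama server

CATEGORIES = ["business", "entertainment", "politics", "sport", "tech"]

FEW_SHOT_EXAMPLES = [
    # (article text, gold label) -- illustrative examples only
    ("Shares in the retailer rose 5% after strong quarterly earnings.", "business"),
    ("The striker scored twice as the home side won the cup final.", "sport"),
]

def build_prompt(article: str) -> str:
    lines = [
        "Classify the news article into exactly one of: " + ", ".join(CATEGORIES) + ".",
        "Answer with a single word (the category) only.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines += [f"Article: {text}", f"Category: {label}", ""]
    lines += [f"Article: {article}", "Category:"]
    return "\n".join(lines)

def classify(article: str, model: str = "llama3") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(article)}],
    )
    return response["message"]["content"].strip().lower()

print(classify("Parliament voted on the new budget bill on Tuesday."))
```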
Key Objectives:
- To benchmark the performance of various LLMs.
- To demonstrate the application of few-shot learning in real-world datasets.
We evaluated each model on a 500-sample test dataset. Below is a summary of our findings across several metrics:
Tabular Representation
| Model | Accuracy | Precision | Recall | F1 Score | Time Taken (s) |
|---|---|---|---|---|---|
| Llama3:8b | 0.886 | 0.796789 | 0.894913 | 0.800293 | 360 |
| Llama3:70b | 0.976 | 0.981302 | 0.977032 | 0.978721 | 720 |
| Mixtral:8x7b | 0.914 | 0.827455 | 0.821479 | 0.843914 | 540 |
| GPT-3.5-Turbo | 0.962 | 0.805873 | 0.802054 | 0.803609 | 900 |
Graphical Representation
Based on the results for the news classification task using the different LLMs on the 500-sample unseen test dataset, here are the interpretations:
- Accuracy:
  - The llama3:70b model achieves the highest accuracy of 0.976, correctly classifying 97.6% of the test samples.
  - The gpt-3.5-turbo model follows closely with an accuracy of 0.962, correctly classifying 96.2% of the samples.
  - The llama3:8b and mixtral:8x7b models have lower accuracies of 0.886 and 0.914, respectively.
- Precision:
  - The llama3:70b model has the highest precision of 0.981302, meaning that when it predicts a particular class, it is correct 98.13% of the time.
  - The llama3:8b model has the lowest precision of 0.796789, indicating a higher rate of false positive predictions.
- Recall:
  - The llama3:70b model achieves the highest recall of 0.977032, suggesting that it successfully identifies 97.70% of the samples from each class.
  - The mixtral:8x7b model has the lowest recall of 0.821479, indicating that it misses some instances of each class.
- F1 Score:
  - The llama3:70b model has the highest F1 score of 0.978721. The F1 score is the harmonic mean of precision and recall, so this suggests a good balance between the two.
  - The llama3:8b model has the lowest F1 score of 0.800293, indicating lower overall performance compared to the other models.
- Time Taken:
  - The llama3:8b model is the fastest, taking only 360 seconds to classify the 500 test samples.
  - The gpt-3.5-turbo model takes the longest at 900 seconds, though this time reflects calls to the OpenAI API rather than local inference.
  - The llama3:70b model takes 720 seconds, twice the time taken by llama3:8b, which could be due to the larger model size of llama3:70b.
Overall, the llama3:70b model demonstrates the best performance in terms of accuracy, precision, recall, and F1 score. It correctly classifies the highest percentage of samples, has the lowest false positive rate, successfully identifies the most instances of each class, and maintains a good balance between precision and recall.
The gpt-3.5-turbo model also shows strong performance, with accuracy and F1 score close to llama3:70b. The mixtral:8x7b and llama3:8b models have comparatively lower performance metrics.
However, it's important to note that llama3:70b takes twice as long as llama3:8b to classify the samples, and gpt-3.5-turbo (accessed via the OpenAI API) takes the longest overall.
In summary, llama3:70b exhibits the best overall performance for the news classification task across the given five categories, followed closely by gpt-3.5-turbo, while llama3:8b is the fastest model.
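For reference, metrics like those in the table can be computed with scikit-learn; this is a generic sketch with placeholder labels (the exact averaging mode used in the evaluation notebook is not specified here, so macro averaging is only an assumption):

```python
# Sketch: computing the table's metrics from lists of gold and predicted labels.
# y_true / y_pred are placeholder values; macro averaging is an assumption.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["business", "sport", "tech", "politics", "entertainment", "tech"]
y_pred = ["business", "sport", "business", "politics", "entertainment", "tech"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```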
Since gpt-3.5-turbo is proprietary and accessed through the OpenAI API, we cannot measure its GPU consumption. The other three models are open source and can be run offline on a local machine, so we captured a snapshot of each model's GPU usage.
Let's analyze each model's confusion matrix, focusing on misclassifications:
- Strengths:
  - The classifier performs well overall, with most instances correctly classified along the diagonal of the confusion matrix.
  - The "sport" class has a particularly high number of correct classifications and only one misclassification, indicating strong performance for this class.
- Weaknesses:
  - There are some misclassifications between similar classes, such as "business" and "politics", or "tech" and other classes, suggesting some confusion between these categories.
  - The "business" class has a few instances misclassified as "others" and "politics", indicating potential areas for improvement.
  - The "tech" class has misclassifications spread across "business", "entertainment", and "politics", suggesting that the classifier might struggle to distinguish "tech" from these other classes in some cases.
- Strengths:
  - The classifier has a strong overall performance, with most instances correctly classified along the diagonal of the confusion matrix.
  - The "sport" class performs exceptionally well, with all instances correctly classified and no misclassifications.
  - The "entertainment" and "politics" classes have a very high number of correctly classified instances compared to misclassifications.
- Weaknesses:
  - The "tech" class has a notable number of misclassifications, particularly with the "sport" class (12 instances), indicating potential confusion between these categories.
  - The "business" class has a few misclassifications spread across "politics", "sport", and "tech", suggesting some room for improvement in distinguishing between these classes.
  - Although the overall performance is good, there are still some misclassifications present for most classes, except for "sport" and "others".
- Strengths:
  - The classifier demonstrates good overall performance, with most instances correctly classified along the diagonal of the confusion matrix.
  - The "sport" class performs well, with 110 correctly classified instances and only 2 misclassifications.
  - The "politics" and "entertainment" classes have a high number of correctly classified instances compared to misclassifications.
- Weaknesses:
  - The "tech" class has a significant number of misclassifications, particularly with the "entertainment" class (18 instances), indicating potential confusion between these categories.
  - The "business" class has a notable number of misclassifications as "politics" (18 instances), suggesting difficulty in distinguishing between these two classes.
  - Although the overall performance is good, there are still some misclassifications present for most classes, except for "others".
- Strengths:
  - The classifier shows excellent overall performance, with most instances correctly classified along the diagonal of the confusion matrix and minimal misclassifications.
  - The "sport" class performs exceptionally well, with all instances correctly classified and no misclassifications.
  - The "entertainment" class has no misclassifications, indicating perfect performance for this category.
  - The "business", "politics", and "tech" classes have very few misclassifications, suggesting a strong ability to distinguish these classes from others.
- Weaknesses:
  - Although the performance is excellent overall, there are still a few misclassifications present for the "business", "politics", and "tech" classes.
  - The "tech" class has a few misclassifications spread across "business", "entertainment", and "politics", indicating some room for improvement in distinguishing this class from others.
1. GPT-3.5-Turbo: This model shows a balanced performance across categories with a strong inclination towards correct predictions in less technical categories like sports and politics. The confusion with tech indicates a potential area of improvement.
2. Mixtral:8x7B: Shows good overall accuracy but tends to confuse tech and sports with other categories.
3. Llama3:8B: While efficient in resource use, it shows significant confusion between business and tech, which could be problematic if precise category prediction in these areas is critical.
4. Llama3:70B: Provides the best overall accuracy and F1 score, and the confusion matrix supports this with high correct predictions across most categories. The minimal confusion between categories indicates a strong generalization capability.
- If high accuracy and generalization are crucial, Llama3:70b stands out as the best choice, despite its higher resource consumption.
- For a balance between efficiency and performance, Mixtral offers a good compromise, though the confusion between categories needs to be considered.
- For resource-constrained environments, Llama3:8b could be considered, though it has limitations in distinguishing some categories clearly.
We can see how well these models perform with few-shot prompting alone. To achieve even better results, we could do the following:
- We could fine-tune a somewhat smaller model, since it only needs to perform this single task.
- We could also go for dynamic few-shot prompting, i.e. select input/output examples similar to the user's news article. This can be done with the help of an embedding model such as `e5-mistral-7b-instruct` (check out the embedding model leaderboard) and a vector database such as `pinecone`, `chroma`, `qdrant`, etc. (see the sketch below).
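As a rough sketch of dynamic few-shot selection (the embedding model follows the e5-mistral-7b-instruct suggestion above, but the example pool, helper names, and in-memory similarity search are placeholders; in practice a vector database such as Pinecone, Chroma, or Qdrant would handle retrieval):

```python
# Sketch: dynamic few-shot example selection via embedding similarity.
# The model name follows the e5-mistral-7b-instruct suggestion; a smaller
# embedding model (or a vector database) could be substituted in practice.
# The example pool below is placeholder data.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

example_pool = [
    ("Shares in the retailer fell after weak holiday sales.", "business"),
    ("The midfielder signed a new three-year contract.", "sport"),
    ("MPs debated the proposed election reform bill.", "politics"),
]
pool_embeddings = encoder.encode(
    [text for text, _ in example_pool], normalize_embeddings=True
)

def select_examples(article: str, k: int = 2):
    """Return the k pool examples most similar to the input article."""
    query = encoder.encode([article], normalize_embeddings=True)[0]
    scores = pool_embeddings @ query  # cosine similarity (embeddings are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [example_pool[i] for i in top]

print(select_examples("The chancellor announced new corporate tax rules."))
```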
It is assumed that Ollama is already installed on your machine.
- Create a virtual environment: `conda create -n text-classification-using-llms python=3.11`
- Activate the virtual environment: `conda activate text-classification-using-llms`
- Install all the required packages: `pip install -r requirements.txt`
To reproduce this project, follow the notebooks in the `workflow` directory step by step.
- `01_Dataset_Preparation.ipynb`: Few-shot examples are selected and injected into the user prompt. The test dataset is carefully prepared so that the few-shot examples are excluded from it and class balance is maintained. The dataset is also split into 10 smaller datasets so that the experiment does not have to be repeated from scratch if anything goes wrong.
- `02_Text_Classification.ipynb`: Text classification using the different LLMs (Llama3, Llama3:70B, Mixtral:8x7B, and GPT-3.5-Turbo) on the 500-sample test dataset. It is recommended to run the `02_text_classification.py` script instead.
- `03_Data_Cleaning_of_Outputs.ipynb`: The inference logs show that most models, except GPT-3.5-Turbo, returned more than a one-word answer despite the prompt instructions. The classification was still present in the output, just accompanied by unnecessary explanation, so the required cleanup is done in this notebook (a rough sketch of this kind of cleanup follows this list).
- `04_Evaluation.ipynb`: Evaluation of the models on the 500-sample test dataset. All the required metrics are computed and the confusion matrices are plotted.
- The `utils` directory includes all the helper functions and constants.
- The `data` directory includes the `news_category_classification` directory, which contains the original dataset. The few-shot examples and the test dataset are also saved here, so that anyone using this repository can try a different classification use case (e.g., sentiment analysis).
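The exact cleanup logic lives in `03_Data_Cleaning_of_Outputs.ipynb`; as a rough, hypothetical illustration of the idea, a verbose response can be mapped to a label by scanning it for the first known category keyword:

```python
# Sketch: mapping a verbose model response to one of the five categories.
# This mirrors the idea behind the cleanup notebook, not its exact implementation.
CATEGORIES = ["business", "entertainment", "politics", "sport", "tech"]

def extract_label(raw_output: str) -> str | None:
    """Return the first category keyword mentioned in the model output, if any."""
    text = raw_output.lower()
    for category in CATEGORIES:
        if category in text:
            return category
    return None  # leave unresolved outputs for manual inspection

print(extract_label("The article is about Sport, because it covers a football match."))
```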