Offensive and problematic content, including insulting, hurtful, derogatory, or obscene user contributions, is pervasive on social media. Societies need to develop adequate response mechanisms in order to strike a balance between freedom of expression on one side and the ability to live without oppressive remarks on the other. A prerequisite for any response is robust technology for identifying problematic content automatically. HASOC provides a forum for developing and testing text classification systems for various languages.
This task focuses on hate speech and offensive language identification and is offered for English, German, and Hindi. Sub-task A is a coarse-grained binary classification task in which participating systems are required to classify tweets into two classes: Hate and Offensive (HOF) and Non-Hate and Offensive (NOT).
- NOT : Non Hate-Offensive - This post does not contain any hate speech or profane/offensive content.
- HOF : Hate and Offensive - This post contains hate speech, offensive, or profane content.
Model | Accuracy |
---|---|
Gaussian NB | 50% |
Logistic Regression | 80% |
KNN | 78% |
SVC | 84% |
Random Forest | 82% |
LSTM | 78% |
BERT | 78% |
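The classical models in the table above can be approximated with a standard TF-IDF + scikit-learn pipeline. The snippet below is only a minimal sketch of that setup, not the notebook code: the file path (`english_dataset.tsv`) and the column names (`text`, `task_1`) are assumptions, and the full experiments live in `ml_techniques_A.ipynb`.

```python
# Minimal sketch of a classical baseline for Sub-task A (HOF vs. NOT).
# The path and column names below are assumptions; the real preprocessing
# and experiments are in Preprocess.ipynb / ml_techniques_A.ipynb.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("english_dataset.tsv", sep="\t")        # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["task_1"], test_size=0.2, random_state=42, stratify=df["task_1"]
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word uni- and bi-grams
    LinearSVC(),                                    # comparable to the SVC row above
)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```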
Sub-task B is a fine-grained classification task, also offered for English, German, and Hindi. Hate speech and offensive posts from sub-task A are further classified into three categories:
- HATE : Hate speech - Posts under this class contain hate speech content.
- OFFN : Offensive - Posts under this class contain offensive content.
- PRFN : Profane - These posts contain profane words.
Model | Accuracy |
---|---|
Gaussian NB | 45% |
KNN | 64% |
SVC | 66% |
Decision Tree | 53% |
Random Forest | 69% |
LSTM | 60% |
BERT | 61% |
Proposed Model | 63% |
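For the LSTM row above, a compact Keras sketch of a three-class classifier is shown below. The placeholder texts, the label encoding, and the hyperparameters are assumptions for illustration only; the model actually trained is in `DL_LSTM for task B.ipynb`.

```python
# Minimal Keras LSTM sketch for Sub-task B (HATE / OFFN / PRFN).
# texts and labels are placeholders; the notebook uses the real HASOC data.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

texts = ["example offensive tweet", "example hateful tweet", "example profane tweet"]
labels = np.array([1, 0, 2])  # 0 = HATE, 1 = OFFN, 2 = PRFN (assumed encoding)

tokenizer = Tokenizer(num_words=20000, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=60)

model = Sequential([
    Embedding(input_dim=20000, output_dim=100),  # learned word embeddings
    LSTM(64),                                    # sequence encoder
    Dense(3, activation="softmax"),              # three fine-grained classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3, batch_size=32)
```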
Component | Accuracy |
---|---|
BERT_A | 78% |
BERT_OFFN/HATE | 79% (2 epochs) |
Profanity | 92% |
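The rows above are the building blocks of the proposed model: a binary HOF/NOT classifier (BERT_A), a classifier that separates OFFN from HATE, and a profanity check. The sketch below shows one plausible way to cascade them; this is an assumption about the composition, the actual implementation is in `ProposedModel.ipynb`, and `better_profanity` merely stands in for whichever checker the survey in `profanity_check.ipynb` selected.

```python
# Hedged sketch of a cascade built from the three components above.
# predict_hof() and predict_offn_hate() are hypothetical wrappers around the
# fine-tuned DistilBERT models; better_profanity stands in for the profanity check.
from better_profanity import profanity

profanity.load_censor_words()

def classify(tweet: str, predict_hof, predict_offn_hate) -> str:
    """Return one of NOT, HATE, OFFN, PRFN for a single tweet."""
    if predict_hof(tweet) == "NOT":          # Sub-task A: HOF vs. NOT
        return "NOT"
    if profanity.contains_profanity(tweet):  # profane wording -> PRFN
        return "PRFN"
    return predict_offn_hate(tweet)          # remaining HOF posts: HATE vs. OFFN
```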
Sub-Task | Classifier | Macro F1-score |
---|---|---|
A | DistilBERT | 75% |
B | DistilBERT | 57% |
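The submitted runs for both sub-tasks used DistilBERT. The snippet below is a minimal Hugging Face fine-tuning sketch for Sub-task A under assumed settings (checkpoint name, label encoding, placeholder data); the configuration that produced the scores above is in `BERT_A.ipynb`.

```python
# Minimal sketch of fine-tuning DistilBERT for Sub-task A with Hugging Face.
# train_texts / train_labels are placeholders for the preprocessed HASOC data.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

train_texts = ["example tweet one", "example tweet two"]
train_labels = [0, 1]                         # 0 = NOT, 1 = HOF (assumed encoding)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class HasocDataset(Dataset):
    """Wraps tokenized tweets and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="bert_a", num_train_epochs=2,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=HasocDataset(train_texts, train_labels)).train()
```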
S. Saseendran, S. R, S. V, S. Giri, Classification of Hate Speech and Offensive Content
using an approach based on DistilBERT, in: Forum for Information Retrieval Evaluation
(Working Notes) (FIRE), CEUR-WS.org, 2021.
- Preprocess.ipynb : Preprocessing the data (a minimal cleaning sketch is shown after this list).
- dataAnalysis.ipynb : Analysing the data.
- ml_techniques_A.ipynb : Applying ML models to classify the data and find the accuracy for Subtask A.
- ml_techniques_B.ipynb : Applying ML models to classify the data and find the accuracy for Subtask B.
- DL_LSTM for task A.ipynb : Applying DL LSTM model to classify the data and find the accuracy for Subtask A.
- DL_LSTM for task B.ipynb : Applying DL LSTM model to classify the data and find the accuracy for Subtask B.
- BERT_A.ipynb : Applying DistilBERT model to classify the data and find the accuracy for Subtask A.
- BERT_B.ipynb : Applying DistilBERT model to classify the data and find the accuracy for Subtask B.
- Proposed Model (Folder) : Files related to the Proposed Model:
  - profanity_check.ipynb : Survey of various pre-trained models/libraries for profanity check.
  - ProposedModel.ipynb : Implementation of the proposed model.
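As referenced above for `Preprocess.ipynb`, the sketch below illustrates typical tweet cleaning for this kind of task. It is only an assumed, minimal version; the exact steps used in the repository may differ.

```python
# Hedged sketch of typical tweet cleaning; the steps actually applied
# to the HASOC data are defined in Preprocess.ipynb.
import re

def clean_tweet(text: str) -> str:
    """Lowercase and strip URLs, mentions, hashtag symbols, and extra spaces."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)               # user mentions
    text = re.sub(r"#", " ", text)                  # keep hashtag words, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # punctuation / emoji
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("@user Check this out!!! https://t.co/xyz #HateSpeech"))
# -> "check this out hatespeech"
```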