In the dialogue prediction task, a model is trained to predict a character's name from a given line of dialogue.
- Phase 1 report file
- Phase 2 report file
- Scripts data file
- Raw dialogues data file
- Preprocessed files step by step
- Cleaned dataset
- Statistics before preprocessing
- Statistics after preprocessing
- Notebooks
- Model checkpoints
The scripts of the Friends TV series are used as the dataset. Friends is an American television sitcom that aired on NBC from September 22, 1994, to May 6, 2004.
There are 6 main characters (classes) in this show:
- Ross
- Rachel
- Joey
- Chandler
- Monica
- Phoebe
The scripts are gathered from Here.
Install the required Python packages:
pip install -r requirements.txt
To run the crawler and gather or update the dataset:
cd src/crawler
scrapy crawl scripts -t csv -o ../../data/raw/scripts.csv
scrapy crawl dialogues -t csv -o ../../data/raw/dialogues.csv
- Step 1: Remove white spaces
- Step 2: Lowercase all letters
- Step 3: Remove special characters
- Step 4: Remove short words
- Step 5: Remove stopwords
cd src/preprocessing
python preprocessor.py
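The five preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `preprocessor.py`: the function name `preprocess`, the tiny stopword set, and the minimum word length of 3 are all assumptions made for the example.

```python
import re

# Hypothetical stopword set for illustration only; the project's real list may differ.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "to", "of", "in", "it", "you", "that"}
# Assumed threshold: words shorter than this count as "short words".
MIN_WORD_LENGTH = 3


def preprocess(text, stopwords=STOPWORDS, min_len=MIN_WORD_LENGTH):
    """Apply the five cleaning steps to one line of dialogue."""
    # Step 1: collapse runs of whitespace and trim both ends
    text = " ".join(text.split())
    # Step 2: lowercase all letters
    text = text.lower()
    # Step 3: replace every character that is not a letter, digit, or space
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    # Step 4: drop words shorter than the minimum length
    words = [w for w in text.split() if len(w) >= min_len]
    # Step 5: drop stopwords
    words = [w for w in words if w not in stopwords]
    return " ".join(words)
```

Calling `preprocess("  We were ON a break!!  ")` with these assumed settings returns `"were break"`. Running the length filter before the stopword filter mirrors the step order in the list; with a different threshold or stopword list the output will differ.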
- Word count per character
- Type (distinct word) count per character
- Word cloud per character
- Histogram per character
These metrics are extracted over the entire set of scripts, both before and after preprocessing.
cd src/statistics
python main.py
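As a rough illustration of the word and type counts per character, the sketch below uses only the standard library. It is not the project's `main.py`: the function name `per_character_stats`, the `(character, dialogue)` input shape, and the toy sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict


def per_character_stats(rows):
    """Return word and type counts per character.

    rows: iterable of (character, dialogue) pairs (an assumed input shape).
    """
    counts = defaultdict(Counter)
    for character, dialogue in rows:
        # Whitespace tokenization; each token adds to that character's tally
        counts[character].update(dialogue.split())
    return {
        name: {
            "words": sum(counter.values()),  # total tokens spoken
            "types": len(counter),           # distinct tokens spoken
        }
        for name, counter in counts.items()
    }


# Toy input for illustration only, not taken from the dataset files
sample = [
    ("Ross", "we were on a break"),
    ("Ross", "we were on a break"),
    ("Joey", "how you doin"),
]
stats = per_character_stats(sample)
```

With this toy input, `stats["Ross"]` is `{"words": 10, "types": 5}` and `stats["Joey"]` is `{"words": 3, "types": 3}`. The per-character `Counter` objects could also feed a histogram or word cloud, though that part is not shown here.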