Natural Language Processing Project

Text processing & data retrieval system

The Reuters-21578 dataset is a collection of news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.

The goal of this project is to experiment with text processing using NLTK and Python.

The project is divided into 3 parts:

Part 1 - Developed a pipeline that reads the data, extracts the text, tokenizes it, lowercases it, applies the Porter Stemmer algorithm (reducing words to a common stem, e.g. jumping -> jump), and removes stop words. Each step of the pipeline exports its results to a .txt file for clarity. Each step is also a separate function, since modularity makes debugging easier.
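As a rough illustration of the Part 1 pipeline, the sketch below chains one function per step and optionally writes each intermediate result to a .txt file. It is not the code in part1.py: the function names, the `out_prefix` parameter, the regex tokenizer, and the tiny stop list are all assumptions made for this example. The only NLTK call used is `PorterStemmer().stem`, imported lazily so the other steps run without NLTK.

```python
import re

# Tiny illustrative stop list (an assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split raw text into alphabetic tokens (a simple regex stand-in for a tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stem(tokens, stemmer=None):
    """Stem every token; defaults to NLTK's PorterStemmer when none is given."""
    if stemmer is None:
        from nltk.stem import PorterStemmer  # lazy import: only needed for this step
        stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def export(tokens, path):
    """Write one token per line so each step's output can be inspected."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))


def run_pipeline(text, stemmer=None, out_prefix=None):
    """Run the steps in order; export each result if out_prefix is set."""
    steps = [
        ("tokenized", tokenize),
        ("lowercased", lowercase),
        ("stemmed", lambda toks: stem(toks, stemmer)),
        ("no_stopwords", remove_stop_words),
    ]
    data = text
    for name, step in steps:
        data = step(data)
        if out_prefix is not None:
            export(data, f"{out_prefix}_{name}.txt")
    return data
```

Calling `run_pipeline(article_text, out_prefix="doc1")` would write four files such as `doc1_tokenized.txt`. Note that this order filters stop words after stemming, as described above, so the stop list has to match the stemmed forms.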

Part 2 - Implemented a naive indexer that stores each word together with its locations, and a single-term query processing system that handles searches for individual words.
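A naive index of this kind can be sketched as a dictionary from each term to the documents that contain it. This is a minimal illustration, not the code in part2.py: it assumes that "locations" means document IDs, and the names `build_index` and `query` are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of the document IDs it appears in.

    docs: dict of doc_id -> list of already-processed tokens.
    """
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def query(index, term):
    """Single-term lookup: return the posting list, or an empty list if unseen."""
    return index.get(term, [])


docs = {
    1: ["oil", "price", "rise"],
    2: ["grain", "price"],
    3: ["oil", "trade"],
}
index = build_index(docs)
print(query(index, "oil"))
```

The query term has to go through the same preprocessing as the documents (lowercasing, stemming) before the lookup, otherwise it will not match the stored terms.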

Part 3 - Refined the indexing procedure and implemented ranking of the returned results.
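The ranking scheme used in Part 3 is not specified here. As one possible illustration, the sketch below scores documents with a plain TF-IDF sum; this is an assumed technique chosen for the example, and the function name `rank` is hypothetical.

```python
import math
from collections import Counter


def rank(docs, query_terms):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score.

    docs: dict of doc_id -> list of already-processed tokens.
    Score = sum over query terms of term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue  # term not in the collection, contributes nothing
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    1: ["oil", "oil", "price"],
    2: ["oil", "trade"],
    3: ["grain"],
}
ranked = rank(docs, ["oil"])
print([doc_id for doc_id, _ in ranked])
```

Here document 1 ranks above document 2 because it mentions the query term twice. Other schemes, such as BM25, would fit the same interface.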

