Document Retrieval System

Program for retrieving documents on the basis of a query

Author: Asad Raheem

Licensed under the GNU General Public License version 3.0

Installation

Python 2.7
nltk 3.2.1
beautifulsoup 4.3.2
numpy 1.9.2

Description

The system consists of three parts:

Tokenizer
Indexer
Index Reader

1. Tokenizer

Reads a document collection and creates documents containing indexable tokens. The tokenizer extracts text from HTML files and splits it into tokens. Stop words are removed. All tokens are converted to lower case (this is not always ideal and should be adjusted before using the code), and Porter stemming is then applied.
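The steps above can be sketched as follows. This is a minimal illustration, not the code in tokenize_doc.py: it uses only the Python 3 standard library instead of the project's Python 2.7 dependencies, the stop list is a tiny hypothetical one, and the names TextExtractor, STOP_WORDS, and tokenize are made up for the example. Stemming is left as a pluggable function; with nltk installed, `PorterStemmer().stem` from `nltk.stem` could be passed in.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def tokenize(html, stem=lambda token: token):
    """Extract text, lower-case it, split it, drop stop words, then stem."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    # The default `stem` is the identity; pass a real stemmer to change that.
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

Lower-casing happens before the stop-word check here so that "The" and "the" are treated alike; whether the original script does it in that order is an assumption.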

Outputs

docids.txt: Maps a document's file name to its document ID (DOCID).
termids.txt: Maps a token to its term ID (TERMID).
doc_index.txt: Forward index containing the positions of each term in each file.
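To make the relationship between the three outputs concrete, here is a small in-memory sketch. The function name build_forward_index, the 1-based IDs, and the 1-based positions are assumptions for illustration; the README does not specify the numbering or the on-disk layout.

```python
def build_forward_index(docs):
    """docs: iterable of (file_name, tokens) pairs with tokens already normalized."""
    doc_ids = {}   # file name -> DOCID   (what docids.txt records)
    term_ids = {}  # token -> TERMID      (what termids.txt records)
    forward = {}   # DOCID -> {TERMID: [positions]}  (what doc_index.txt records)
    for name, tokens in docs:
        # IDs are handed out in order of first appearance, starting at 1.
        doc_id = doc_ids.setdefault(name, len(doc_ids) + 1)
        postings = forward.setdefault(doc_id, {})
        for position, token in enumerate(tokens, start=1):
            term_id = term_ids.setdefault(token, len(term_ids) + 1)
            postings.setdefault(term_id, []).append(position)
    return doc_ids, term_ids, forward
```

Keeping the forward index keyed by DOCID first is what lets the indexer later regroup the same data by TERMID.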

Usage

In command prompt or terminal type: python tokenize_doc.py <directory_name>

2. Indexer

Reads a collection of tokenized documents and constructs an inverted index.

Outputs

term_index.txt: Inverted index containing the file position of each occurrence of each term in the collection. Each line contains the complete inverted list for a single term, i.e. a TERMID followed by a list of DOCID:POSITION values. Delta encoding is applied to each list.
term_info.txt: Gives the index reader fast access to the index. Each line contains a TERMID followed by the byte offset of its list in term_index.txt, the number of occurrences in the entire corpus, and the number of documents in which the term appears.
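One plausible reading of the delta encoding is sketched below. It assumes that postings are sorted by (DOCID, POSITION), that each DOCID is stored as the gap from the previous DOCID, and that each POSITION is stored as the gap from the previous position in the same document, restarting from zero when the document changes. The space separator and the function names are also assumptions; invert_index.py may differ on all of these points.

```python
def delta_encode(postings):
    """Turn sorted (doc_id, position) pairs into (doc_gap, pos_gap) pairs."""
    encoded = []
    prev_doc, prev_pos = 0, 0
    for doc_id, pos in sorted(postings):
        doc_gap = doc_id - prev_doc
        if doc_gap != 0:
            prev_pos = 0  # a new document restarts the position gaps
        encoded.append((doc_gap, pos - prev_pos))
        prev_doc, prev_pos = doc_id, pos
    return encoded


def delta_decode(encoded):
    """Inverse of delta_encode."""
    postings = []
    doc_id, pos = 0, 0
    for doc_gap, pos_gap in encoded:
        if doc_gap != 0:
            doc_id += doc_gap
            pos = 0
        pos += pos_gap
        postings.append((doc_id, pos))
    return postings


def format_line(term_id, postings):
    """Render one inverted list as 'TERMID DOCID:POSITION ...' with gaps."""
    pairs = " ".join("%d:%d" % pair for pair in delta_encode(postings))
    return "%d %s" % (term_id, pairs)
```

For example, the postings (1,3), (1,10), (4,2) for a hypothetical term 12 would be written as `12 1:3 0:7 3:2`. The gaps stay small even when DOCIDs and positions grow, which is the usual motivation for delta encoding.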

Usage

In command prompt or terminal type: python invert_index.py

3. Index Reader

Looks up the term's offset in term_info.txt and jumps straight to its list in term_index.txt.
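The offset lookup can be illustrated with a short sketch. It assumes one inverted list per line and byte offsets measured from the start of the file; write_index and read_list are hypothetical helper names, and an in-memory buffer stands in for term_index.txt.

```python
import io


def write_index(lines, index_file):
    """Write one inverted list per line and return {TERMID: byte offset}."""
    offsets = {}
    for term_id, line in lines:
        offsets[term_id] = index_file.tell()  # where this list starts
        index_file.write((line + "\n").encode("utf-8"))
    return offsets


def read_list(index_file, offsets, term_id):
    """Seek directly to a term's list instead of scanning the whole file."""
    index_file.seek(offsets[term_id])
    return index_file.readline().decode("utf-8").rstrip("\n")


# Usage with an in-memory buffer standing in for term_index.txt.
buf = io.BytesIO()
offsets = write_index([(1, "1 1:3 0:7"), (2, "2 2:1"), (3, "3 1:2 3:4")], buf)
second_list = read_list(buf, offsets, 2)
```

The file is opened in binary mode in this sketch so that `tell()` and `seek()` work on exact byte offsets.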

Usage

In command prompt or terminal type:

python read_index.py --doc DOCNAME
python read_index.py --doc DOCNAME --term TERM
python read_index.py --term TERM --doc DOCNAME
