Program for retrieving documents on the basis of a query
Author: Asad Raheem
Licensed under the GNU General Public License version 3.0
Requirements:
- Python 2.7
- nltk 3.2.1
- beautifulsoup 4.3.2
- numpy 1.9.2
The system consists of three parts:
- Tokenizer
- Indexer
- Index Reader
The system produces the following files:
- docids.txt: Maps a document's file name to its document ID (DOCID).
- termids.txt: Maps a token to its term ID (TERMID).
- doc_index.txt: Forward index containing the positions of each term in each document.
- term_index.txt: Inverted index containing the position of every occurrence of each term in the collection. Each line holds the complete inverted list for a single term, i.e. a TERMID followed by a list of DOCID:POSITION values. Delta encoding is applied to each list.
- term_info.txt: Gives the index reader fast access to term_index.txt. Each line contains a TERMID followed by the byte offset of that term's line in term_index.txt, the number of occurrences of the term in the entire corpus, and the number of documents in which the term appears.
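
The delta encoding of term_index.txt and the byte offsets stored in term_info.txt can be illustrated with a small sketch. This is not the code used by the indexer. It assumes one common scheme: each DOCID is stored as the gap from the previous DOCID, and each POSITION is stored as the gap from the previous position when the DOCID is unchanged, or as the absolute position otherwise. The function names (delta_encode, delta_decode, format_line, parse_line, build_index, lookup) are hypothetical, and an in-memory buffer stands in for the real files.

```python
import io


def delta_encode(postings):
    """Delta-encode (doc_id, position) pairs sorted by doc_id, then position."""
    encoded = []
    prev_doc, prev_pos = 0, 0
    for doc_id, pos in postings:
        doc_gap = doc_id - prev_doc
        # Assumed scheme: position gaps restart whenever the document changes.
        pos_gap = pos - prev_pos if doc_gap == 0 else pos
        encoded.append((doc_gap, pos_gap))
        prev_doc, prev_pos = doc_id, pos
    return encoded


def delta_decode(encoded):
    """Invert delta_encode and recover the absolute (doc_id, position) pairs."""
    postings = []
    doc_id, pos = 0, 0
    for doc_gap, pos_gap in encoded:
        doc_id += doc_gap
        pos = pos + pos_gap if doc_gap == 0 else pos_gap
        postings.append((doc_id, pos))
    return postings


def format_line(term_id, postings):
    """Build one line: TERMID followed by delta-encoded DOCID:POSITION values."""
    pairs = " ".join("%d:%d" % pair for pair in delta_encode(postings))
    return "%d %s\n" % (term_id, pairs)


def parse_line(line):
    """Parse one line back into (term_id, absolute postings)."""
    fields = line.split()
    encoded = [tuple(int(x) for x in field.split(":")) for field in fields[1:]]
    return int(fields[0]), delta_decode(encoded)


def build_index(index):
    """Serialize the index and record the byte offset of each term's line."""
    buf = io.BytesIO()
    offsets = {}
    for term_id in sorted(index):
        offsets[term_id] = buf.tell()  # the kind of offset term_info.txt stores
        buf.write(format_line(term_id, index[term_id]).encode("ascii"))
    return buf.getvalue(), offsets


def lookup(index_bytes, offsets, term_id):
    """Seek straight to a term's line instead of scanning the whole index."""
    f = io.BytesIO(index_bytes)
    f.seek(offsets[term_id])
    return parse_line(f.readline().decode("ascii"))[1]
```

Under this assumed scheme, the postings (1,3), (1,7), (4,2) for TERMID 7 would be written as "7 1:3 0:4 3:2". Storing gaps keeps the numbers small, and the byte offset lets a reader jump to a single term's list with one seek.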
Usage:
- python read_index.py --doc DOCNAME
- python read_index.py --doc DOCNAME --term TERM
- python read_index.py --term TERM --doc DOCNAME