
A file retrieval engine that indexes and searches large datasets. Version 2 uses multi-threading and parallel processing to speed up indexing and searching of documents.


reynaroyce12/File-Retrieval-Engine-V2


What This Project Does

This project implements a multithreaded file retrieval engine that indexes text files in an input folder and performs search operations over the indexed data. The program supports the following:

  • Multithreaded, parallel indexing of files to improve indexing performance.
  • Indexing a directory and reading the files within it.
  • Searching for terms within indexed files, with both single-term queries and AND-based multi-term queries.
  • Reporting the time taken for indexing and search operations, along with the total bytes read.
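As a rough illustration of the directory indexing and byte counting listed above, the sketch below walks a folder recursively and totals the bytes read. The helper names `listFiles` and `readFile` are hypothetical and not taken from the repository; the actual implementation may be organised differently.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns every regular file under `root`, or
// std::nullopt when `root` is not an existing directory.
// Note: the iterator may throw fs::filesystem_error (e.g. on permission errors).
std::optional<std::vector<fs::path>> listFiles(const fs::path& root) {
    std::error_code ec;
    if (!fs::is_directory(root, ec)) return std::nullopt;

    std::vector<fs::path> files;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file()) files.push_back(entry.path());
    }
    return files;
}

// Hypothetical helper: reads a whole file into memory and adds its size
// to the running `bytesRead` total.
std::string readFile(const fs::path& file, std::uintmax_t& bytesRead) {
    std::ifstream in(file, std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    bytesRead += contents.size();
    return contents;
}
```

Returning `std::nullopt` for a missing folder lets the caller print its own message instead of handling an exception.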

Enhancements in this version over the previous one (V1):

In this update, I've added multi-threading so that multiple documents are processed at the same time. This reduces the overall execution time of indexing and searching, especially on larger datasets.
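One common way to process documents in parallel is to have worker threads pull file paths from a shared, mutex-protected queue. The sketch below shows that pattern under stated assumptions: `WorkQueue` and `processInParallel` are hypothetical names, and the repository's own classes may look different.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue (not the repository's class).
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Signals that no more items will arrive, so idle workers can exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item is available; returns std::nullopt once the
    // queue is closed and fully drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Starts `numWorkers` threads that drain the queue and call `handle` for
// each path. `handle` must be safe to call from several threads at once.
// Returns the number of paths processed.
template <typename Handler>
std::size_t processInParallel(const std::vector<std::string>& paths,
                              unsigned numWorkers, Handler handle) {
    WorkQueue<std::string> queue;
    std::mutex countMutex;
    std::size_t processed = 0;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < numWorkers; ++i) {
        workers.emplace_back([&] {
            while (auto path = queue.pop()) {
                handle(*path);
                std::lock_guard<std::mutex> lock(countMutex);
                ++processed;
            }
        });
    }
    for (const auto& p : paths) queue.push(p);
    queue.close();
    for (auto& w : workers) w.join();
    return processed;
}
```

Closing the queue after the last push is what lets each worker leave its loop cleanly, so `join()` does not block forever.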

A Few Points to Consider:

  • The number of worker threads can be specified via a command-line argument when executing the program.
  • The multithreading implementation uses a thread-safe queue to coordinate work among the threads.
  • Search is case-sensitive and supports multi-term queries that use "AND" to find documents containing all the queried words.
  • The engine processes only alphanumeric characters and ignores short words (length ≤ 2).
  • If the query is an AND query, the results include only the documents that contain every term in the query.
  • The results are sorted by the accumulated number of occurrences of all terms in each document, and only the top 10 documents are printed.
  • The program validates its input and prints an appropriate message in each of the following scenarios:
    • No/invalid/negative thread count provided
    • Invalid folder path
    • Folder does not exist
    • No folder path specified
    • Missing search terms
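The tokenization and AND-search rules above can be sketched as follows. This is a minimal illustration, not the repository's code: the `Index` layout and the names `tokenize`, `addDocument`, and `searchAll` are assumptions, and breaking ties by path is added here only to make the ordering deterministic.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical index layout: term -> (document path -> occurrence count).
using Index =
    std::unordered_map<std::string, std::unordered_map<std::string, long>>;

// Splits text on non-alphanumeric characters, keeps the original case,
// and drops terms of length <= 2.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> terms;
    std::string current;
    auto flush = [&] {
        if (current.size() > 2) terms.push_back(current);
        current.clear();
    };
    for (unsigned char c : text) {
        if (std::isalnum(c)) current.push_back(static_cast<char>(c));
        else flush();
    }
    flush();
    return terms;
}

void addDocument(Index& index, const std::string& path,
                 const std::string& text) {
    for (const auto& term : tokenize(text)) ++index[term][path];
}

// Returns up to `limit` (path, combined frequency) pairs for documents that
// contain every term, highest combined frequency first.
std::vector<std::pair<std::string, long>> searchAll(
        const Index& index, const std::vector<std::string>& terms,
        std::size_t limit = 10) {
    std::unordered_map<std::string, long> totals;
    for (std::size_t i = 0; i < terms.size(); ++i) {
        auto it = index.find(terms[i]);
        if (it == index.end()) return {};  // a missing term rules out every document
        if (i == 0) {
            totals = it->second;
            continue;
        }
        // Keep only documents that also contain this term, summing counts.
        std::unordered_map<std::string, long> kept;
        for (const auto& [path, count] : totals) {
            auto hit = it->second.find(path);
            if (hit != it->second.end()) kept[path] = count + hit->second;
        }
        totals = std::move(kept);
    }
    std::vector<std::pair<std::string, long>> ranked(totals.begin(),
                                                     totals.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    if (ranked.size() > limit) ranked.resize(limit);
    return ranked;
}
```

Under these rules a query for a two-letter word such as "at" can never match, because terms that short are never added to the index.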

The program also assumes that your environment already has the following installed and configured:

  • GCC 14 C++ Compiler
  • CMake (For generating build files)
  • Git

Folder and File Structure

The repository follows the folder and file structure below:

CSC-435-PA3
├── app-cpp/
│   ├── include/
│   │   ├── AppInterface.hpp
│   │   ├── IndexStore.hpp
│   │   ├── ProcessingEngine.hpp
│   ├── src/
│   │   ├── AppInterface.cpp
│   │   ├── file-retrieval-engine.cpp
│   │   ├── IndexStore.cpp
│   │   ├── ProcessingEngine.cpp
├── CMakeLists.txt

How to build and run the program

1. Creating the build/ Directory

Navigate to the app-cpp folder and create the build folder with the commands below. This folder will contain the build files generated by CMake.

cd app-cpp
mkdir build

2. Run CMake commands to initiate the build process

Navigate into the build/ directory and run CMake to initialise the project.

cd build
cmake ..

3. Build the program

Navigate back into the app-cpp folder of the repository and run the build command.

cd ..
cmake --build build/

4. Execute the program

Once the build is complete, run the program with the following command from the app-cpp directory. The number of worker threads can be specified as a command-line argument.

./build/file-retrieval-engine <num_worker_threads>
  • <num_worker_threads> is the number of worker threads to be used for indexing. If not specified, the program will display an error.
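A check of this kind could be written as in the sketch below. The function name `parseThreadCount` is hypothetical, and rejecting a count of zero is an assumption, since the README only mentions missing, invalid, and negative values.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses a positive worker-thread count. Returns
// std::nullopt for empty, non-numeric, negative, zero (assumed invalid),
// or out-of-range input.
std::optional<unsigned> parseThreadCount(std::string_view arg) {
    if (arg.empty()) return std::nullopt;

    unsigned value = 0;
    const char* first = arg.data();
    const char* last = arg.data() + arg.size();
    // std::from_chars does not accept a sign for unsigned types, so inputs
    // such as "-3" or "+4" are reported as invalid.
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last || value == 0) return std::nullopt;
    return value;
}
```

In `main`, the program could call this on `argv[1]` when `argc > 1` and print an error message whenever the result is empty.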

Example:

The command below runs the program with 4 worker threads.

./build/file-retrieval-engine 4

Example session:

./build/file-retrieval-engine 4
> <index | search | quit>  index ../../datasets/dataset1
Indexing completed with a total execution time of 2 seconds with 4 worker threads.
Total bytes read: 134247377
> <index | search | quit>  search at

Search executed in 0 seconds.
No results found
> <index | search | quit>  search Worms

Search executed in 0 seconds.
Search Results: ( Top 10 out of 10)

../../datasets/dataset1/folder3/Document1043.txt (Frequency: 4)
../../datasets/dataset1/folder4/Document10553.txt (Frequency: 4)
../../datasets/dataset1/folder3/Document10383.txt (Frequency: 3)
../../datasets/dataset1/folder7/Document1091.txt (Frequency: 3)
../../datasets/dataset1/folder7/folderB/Document10991.txt (Frequency: 2)
../../datasets/dataset1/folder8/Document11116.txt (Frequency: 1)
../../datasets/dataset1/folder2/Document101.txt (Frequency: 1)
../../datasets/dataset1/folder2/folderA/Document10340.txt (Frequency: 1)
../../datasets/dataset1/folder4/Document10657.txt (Frequency: 1)
../../datasets/dataset1/folder4/Document1051.txt (Frequency: 1)
> <index | search | quit>  search distortion AND adaptation

Search executed in 0 seconds.
Search Results: ( Top 10 out of 4)

../../datasets/dataset1/folder7/folderC/Document10998.txt (Frequency: 6)
../../datasets/dataset1/folder4/Document10516.txt (Frequency: 3)
../../datasets/dataset1/folder8/Document11157.txt (Frequency: 2)
../../datasets/dataset1/folder8/Document11159.txt (Frequency: 2)
> <index | search | quit> 
quit
