This project implements a multithreaded file retrieval engine that indexes text files in an input folder and performs search operations over the indexed data. The program supports the following:
- Multithreading for parallel indexing of files to improve the indexing performance.
- Indexing a directory and reading files within it.
- Searching for terms within indexed files, supporting both single-term queries and AND-based multiple term queries.
- Reporting the time taken for indexing and search operations, along with the total bytes read.
In this update, I've added multithreading so that multiple documents are processed at the same time, which reduces overall execution time, especially on larger datasets.
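One way to process several documents at the same time is a shared work queue drained by a fixed pool of worker threads. The following is a minimal sketch under stated assumptions, not the project's actual code: `WorkQueue`, `processInParallel`, and the `handle` callback are hypothetical names, and the handler is assumed to be safe to call from several threads at once.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue; the name and interface are illustrative.
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Signals that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item is available; returns std::nullopt once the
    // queue has been closed and fully drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Hypothetical driver: calls `handle` once per path using `numThreads` workers.
template <typename Handler>
void processInParallel(const std::vector<std::string>& paths,
                       std::size_t numThreads, Handler handle) {
    WorkQueue<std::string> queue;
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < numThreads; ++i) {
        workers.emplace_back([&queue, &handle] {
            while (auto path = queue.pop()) handle(*path);
        });
    }
    for (const auto& p : paths) queue.push(p);
    queue.close();
    for (auto& w : workers) w.join();
}
```

A call such as `processInParallel(paths, 4, indexFile)` would run a hypothetical `indexFile` function on each path across four threads. Closing the queue after the last push lets idle workers exit instead of blocking forever.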
- The number of worker threads can be specified via a command-line argument when executing the program.
- The multithreading implementation uses a thread-safe queue to coordinate work among the threads.
- Search functionality is case-sensitive and supports multi-term search using "AND" to find documents containing all the queried words.
- The engine processes only alphanumeric characters and ignores short words (length ≤ 2).
- If the query joins terms with AND, the results include only the documents that contain every term in the query.
- The results are sorted by the number of accumulated occurrences of all terms in each document, and only the top 10 documents are printed.
- The program validates its input and prints an appropriate message to the user in the following cases:
- No/invalid/negative thread count provided
- Invalid folder path
- Folder does not exist
- No folder path specified
- Missing search terms
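
The tokenization, AND matching, and ranking rules listed above can be sketched as follows. This is an illustrative sketch, not the code in `IndexStore.cpp` or `ProcessingEngine.cpp`: the `Index` layout and the names `tokenize`, `addDocument`, and `searchAll` are hypothetical, and breaking ties by path is an assumption added only to make the ordering deterministic.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical index layout: term -> (document path -> occurrence count).
using Postings = std::unordered_map<std::string, long>;
using Index = std::unordered_map<std::string, Postings>;

// Splits text on non-alphanumeric characters and drops words of length <= 2.
// Case is preserved, so lookups stay case-sensitive.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    auto flush = [&] {
        if (current.size() > 2) words.push_back(current);
        current.clear();
    };
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) current.push_back(static_cast<char>(ch));
        else flush();
    }
    flush();
    return words;
}

void addDocument(Index& index, const std::string& path, const std::string& text) {
    for (const auto& word : tokenize(text)) ++index[word][path];
}

// Returns up to `limit` documents that contain every term, ordered by the
// summed occurrences of all terms (descending). Ties are broken by path,
// which is an assumption of this sketch.
std::vector<std::pair<std::string, long>> searchAll(
        const Index& index, const std::vector<std::string>& terms,
        std::size_t limit = 10) {
    std::vector<std::pair<std::string, long>> results;
    if (terms.empty()) return results;
    auto first = index.find(terms.front());
    if (first == index.end()) return results;
    for (const auto& [path, count] : first->second) {
        long total = count;
        bool inAll = true;
        for (std::size_t i = 1; i < terms.size() && inAll; ++i) {
            auto term = index.find(terms[i]);
            if (term == index.end()) { inAll = false; break; }
            auto doc = term->second.find(path);
            if (doc == term->second.end()) inAll = false;
            else total += doc->second;
        }
        if (inAll) results.emplace_back(path, total);
    }
    std::sort(results.begin(), results.end(), [](const auto& a, const auto& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    if (results.size() > limit) results.resize(limit);
    return results;
}
```

Under these rules a two-letter query such as `at` never matches, because words that short are not indexed in the first place.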
- GCC 14 C++ Compiler
- CMake (For generating build files)
- Git
The repository uses the following folder and file structure:
CSC-435-PA3
├── app-cpp/
│ ├── include/
│ │ ├── AppInterface.hpp
│ │ ├── IndexStore.hpp
│ │ ├── ProcessingEngine.hpp
│ ├── src/
│ │ ├── AppInterface.cpp
│ │ ├── file-retrieval-engine.cpp
│ │ ├── IndexStore.cpp
│ │ ├── ProcessingEngine.cpp
│ ├── CMakeLists.txt
To create the build folder, navigate to the app-cpp folder and run the commands below. This folder will contain the build files generated by CMake.
cd app-cpp
mkdir build
Navigate into the build/ directory and run CMake to initialise the project.
cd build
cmake ..
Navigate back to the app-cpp folder of the repository and run the build command.
cd ..
cmake --build build/
Once the build is complete, run the program with the following command from the app-cpp directory. The number of worker threads can be specified as a command-line argument.
./build/file-retrieval-engine <num_worker_threads>
<num_worker_threads> is the number of worker threads to be used for indexing. If it is not specified, the program displays an error.
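
A check on this argument might look like the sketch below. It is only an illustration of one way to reject missing, non-numeric, and negative values: `parseThreadCount` is a hypothetical helper, and treating `0` as invalid is an assumption rather than documented behaviour.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: returns the worker count if `arg` is a positive
// integer written entirely in decimal digits, otherwise std::nullopt.
std::optional<std::size_t> parseThreadCount(std::string_view arg) {
    std::size_t value = 0;
    const char* begin = arg.data();
    const char* end = begin + arg.size();
    // For unsigned types std::from_chars does not accept a leading '-',
    // so negative input is reported as an invalid argument.
    auto [ptr, ec] = std::from_chars(begin, end, value);
    if (ec != std::errc{} || ptr != end || value == 0) return std::nullopt;
    return value;
}
```

In `main`, the program could first check that `argc` is 2, then pass `argv[1]` to this helper and print an error message if it returns no value.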
The following command runs the program with 4 worker threads.
./build/file-retrieval-engine 4
./build/file-retrieval-engine 4
> <index | search | quit> index ../../datasets/dataset1
Indexing completed with a total execution time of 2 seconds with 4 worker threads.
Total bytes read: 134247377
> <index | search | quit> search at
Search executed in 0 seconds.
No results found
> <index | search | quit> search Worms
Search executed in 0 seconds.
Search Results: ( Top 10 out of 10)
../../datasets/dataset1/folder3/Document1043.txt (Frequency: 4)
../../datasets/dataset1/folder4/Document10553.txt (Frequency: 4)
../../datasets/dataset1/folder3/Document10383.txt (Frequency: 3)
../../datasets/dataset1/folder7/Document1091.txt (Frequency: 3)
../../datasets/dataset1/folder7/folderB/Document10991.txt (Frequency: 2)
../../datasets/dataset1/folder8/Document11116.txt (Frequency: 1)
../../datasets/dataset1/folder2/Document101.txt (Frequency: 1)
../../datasets/dataset1/folder2/folderA/Document10340.txt (Frequency: 1)
../../datasets/dataset1/folder4/Document10657.txt (Frequency: 1)
../../datasets/dataset1/folder4/Document1051.txt (Frequency: 1)
> <index | search | quit> search distortion AND adaptation
Search executed in 0 seconds.
Search Results: ( Top 10 out of 4)
../../datasets/dataset1/folder7/folderC/Document10998.txt (Frequency: 6)
../../datasets/dataset1/folder4/Document10516.txt (Frequency: 3)
../../datasets/dataset1/folder8/Document11157.txt (Frequency: 2)
../../datasets/dataset1/folder8/Document11159.txt (Frequency: 2)
> <index | search | quit> quit