sheikhomar/coresets-bench

Test Bench for Coreset Algorithms

The BICO code was downloaded from the BICO website.

Getting Started

Remember to install the prerequisite libraries and tools:

./install_prerequisites.sh

The BICO project can be built using the supplied Makefile in the bico/build directory:

make -C bico/build

The MT project can be built with Make:

make -C mt

The k-means++ tool can be built with Make:

make -C kmeans

The GS project can be built with CMake and Ninja:

sudo apt-get update
sudo apt-get install -y ninja-build
cmake -S gs -B gs/build -G "Ninja"
cmake --build gs/build

Datasets

Generate the nytimes100d dataset:

# Download file
wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz \
    -O data/input/docword.nytimes.txt.gz
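# Perform dimensionality reduction via random projection.
# CPATH and LIBRARY_PATH point at a local Boost 1.76.0 build; adjust both to your own installation.
export CPATH=/home/omar/apps/boost_1_76_0
export LIBRARY_PATH=/home/omar/apps/boost_1_76_0/stage/lib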
make -C rp && rp/bin/rp.exe \
    reduce-dim \
    data/input/docword.nytimes.txt.gz \
    8192,100 \
    0 \
    1704100552 \
    data/input/docword.nytimes.rp8192-100.txt.gz
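For readers unfamiliar with the technique, the following is a minimal Python sketch of dimensionality reduction by Gaussian random projection on data in the UCI bag-of-words layout (three header lines D, W, NNZ, then `docID wordID count` triples). It is not the `rp` tool's implementation and does not reproduce the meaning of its command-line arguments; the function names `read_docword` and `random_projection`, the 1-based ID handling, and the choice of a dense Gaussian matrix are assumptions made for illustration.

```python
import io
import math
import random


def read_docword(stream):
    """Parse a bag-of-words stream into one sparse dict per document.

    Assumed layout: three header lines (D, W, NNZ) followed by
    'docID wordID count' triples with 1-based IDs.
    """
    n_docs = int(stream.readline())
    n_words = int(stream.readline())
    int(stream.readline())  # NNZ is read but not needed for this sketch
    docs = [dict() for _ in range(n_docs)]
    for line in stream:
        parts = line.split()
        if not parts:
            continue
        doc_id, word_id, count = (int(p) for p in parts)
        docs[doc_id - 1][word_id - 1] = count
    return docs, n_words


def random_projection(docs, n_words, target_dim, seed):
    """Project sparse count vectors onto target_dim Gaussian directions."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(target_dim)
    # Dense n_words x target_dim matrix of scaled N(0, 1) entries.
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
        for _ in range(n_words)
    ]
    reduced = []
    for doc in docs:
        row = [0.0] * target_dim
        for word_id, count in doc.items():
            for j, r in enumerate(matrix[word_id]):
                row[j] += count * r
        reduced.append(row)
    return reduced


# Tiny in-memory example: 2 documents, 5 words, 4 non-zero entries.
sample = "2\n5\n4\n1 1 3\n1 4 1\n2 2 2\n2 5 7\n"
docs, n_words = read_docword(io.StringIO(sample))
reduced = random_projection(docs, n_words, target_dim=4, seed=42)
```

For a gzip-compressed file, the same parser can be fed a text stream from `gzip.open(path, "rt")`. Fixing the seed makes the projection repeatable. A pure-Python dense matrix is only practical for small vocabularies.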

Generate the nytimespcalowd dataset:

poetry run python -m xrun.data.tsvd -i data/input/docword.nytimes.txt.gz -d 10,20,30,40,50
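The module name suggests a truncated SVD, and the `-d` values look like a list of target dimensions, though neither is confirmed here. As a rough illustration of the technique only, the sketch below computes the top-k singular values and right singular vectors of a small dense matrix by power iteration with deflation, then projects the rows onto them. The names `truncated_svd` and `project` are hypothetical and unrelated to the `xrun` code.

```python
import math
import random


def matvec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def truncated_svd(rows, k, n_iter=200, seed=0):
    """Top-k singular values and right singular vectors of `rows`.

    Runs power iteration on A^T A, removing the components along
    vectors that were already found (deflation).
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    cols = [list(c) for c in zip(*rows)]  # transpose of A
    sigmas, vectors = [], []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for _ in range(n_iter):
            for u in vectors:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            w = matvec(cols, matvec(rows, v))  # A^T (A v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # remaining spectrum is zero
            v = [x / norm for x in w]
        sigma = math.sqrt(sum(x * x for x in matvec(rows, v)))
        sigmas.append(sigma)
        vectors.append(v)
    return sigmas, vectors


def project(rows, vectors):
    """Coordinates of each row in the basis of the given vectors."""
    return [[sum(a * b for a, b in zip(row, v)) for v in vectors] for row in rows]


# Small example with known singular values 3, 2 and 1.
matrix = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
sigmas, vectors = truncated_svd(matrix, k=2)
low_dim = project(matrix, vectors)
```

Power iteration is simple to follow but slow to converge when singular values are close together. For data at the scale of the NYTimes corpus, a sparse solver from a numerical library is the more realistic choice.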

Debugging

Segmentation fault

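Use AddressSanitizer (ASAN) to debug segmentation faults. ASAN instruments the binary at compile time and reports memory errors, such as out-of-bounds accesses and use-after-free, when they occur at runtime.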

sudo apt install libgcc-9-dev
g++ -ggdb -std=c++17 -fsanitize=address -o bin/rp.exe main.cpp

Running Experiments

pyenv install
poetry install
poetry run python -m xrun.go

Create the conda environment:

conda env create -f environment.yml