Preprocessy

Preprocessy is a library that provides data preprocessing pipelines for machine learning. It bundles the common preprocessing steps used to prepare data for machine learning models, and it aims to do so independently of the source and type of the dataset. To that end, it provides a set of functions that have been generalised to work with different types of data.

The pipelines are composed of these functions and are flexible: users can customise them by adding their own processing functions or by removing pipeline functions they do not need. The pipelines thus give users an abstract, high-level interface.
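
The idea of a pipeline as an editable list of functions can be illustrated with a minimal sketch. Everything here (SketchPipeline, its add, remove and run methods, and the step functions) is hypothetical and written only for illustration; it is not Preprocessy's actual API.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumed convention for this sketch: a step takes a params dict and returns it.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]


class SketchPipeline:
    """Hypothetical pipeline: an ordered, editable list of step functions."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)

    def add(self, step: Step, index: Optional[int] = None) -> None:
        # Append by default, or insert at a given position.
        if index is None:
            self.steps.append(step)
        else:
            self.steps.insert(index, step)

    def remove(self, step: Step) -> None:
        self.steps.remove(step)

    def run(self, params: Dict[str, Any]) -> Dict[str, Any]:
        # Apply each step in order, passing the result to the next one.
        for step in self.steps:
            params = step(params)
        return params


def drop_missing(params: Dict[str, Any]) -> Dict[str, Any]:
    params["rows"] = [r for r in params["rows"] if r is not None]
    return params


def double(params: Dict[str, Any]) -> Dict[str, Any]:
    params["rows"] = [r * 2 for r in params["rows"]]
    return params


def add_one(params: Dict[str, Any]) -> Dict[str, Any]:
    params["rows"] = [r + 1 for r in params["rows"]]
    return params


pipeline = SketchPipeline([drop_missing, double])
pipeline.add(add_one)    # add a user-defined step
pipeline.remove(double)  # remove a step that is not needed
result = pipeline.run({"rows": [1, None, 3]})
```

Keeping the steps in a plain list is what makes adding, removing and reordering cheap in this sketch; the real library may organise its pipelines differently.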

Pipeline Structure

The pipelines are divided into three logical stages:

Stage 1 - Pipeline Input

Input datasets with the following file extensions are supported: .csv, .tsv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt
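
As a rough illustration of how input with these extensions could be read using pandas, the sketch below picks a reader from the file extension. The helper name load_dataset and the grouping of extensions are assumptions made for this example, not Preprocessy's own reader. Note that pandas.read_excel needs an optional engine package for spreadsheet formats.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Extensions from the list above, grouped by the pandas reader used here.
DELIMITED = {".csv": ",", ".tsv": "\t"}
SPREADSHEET = {".xls", ".xlsx", ".xlsm", ".xlsb", ".odf", ".ods", ".odt"}


def load_dataset(path: str) -> pd.DataFrame:
    """Hypothetical helper: choose a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in DELIMITED:
        return pd.read_csv(path, sep=DELIMITED[suffix])
    if suffix in SPREADSHEET:
        # Requires an optional engine such as openpyxl, pyxlsb or odfpy.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file extension: {suffix!r}")


# Usage: write a small tab-separated file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.tsv"
    sample.write_text("a\tb\n1\t2\n3\t4\n")
    df = load_dataset(str(sample))
```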

Stage 2 - Processing

This is the main part of the pipeline and consists of the processing functions. The following functions are provided out of the box, both as standalone functions and as part of the pipelines:

Handling Null Values
Handling Outliers
Normalisation and Scaling
Label Encoding
Correlation and Feature Extraction
Training and Test set splitting
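
To make these steps concrete, here is a minimal sketch in plain pandas of five of them (correlation and feature extraction is left out). It does not use Preprocessy's functions; the sample data, column names and the specific choices (median fill, IQR clipping, min-max scaling) are assumptions picked for illustration.

```python
import pandas as pd

# Made-up sample data with one missing value and one outlier in "age".
df = pd.DataFrame(
    {
        "age": [22.0, 25.0, None, 24.0, 23.0, 90.0],
        "city": ["Pune", "Mumbai", "Pune", "Delhi", "Mumbai", "Delhi"],
        "label": [0, 1, 0, 1, 0, 1],
    }
)

# 1. Null values: fill missing numeric entries with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Outliers: clip values lying more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 3. Scaling: min-max normalisation to the range [0, 1].
age_min, age_max = df["age"].min(), df["age"].max()
df["age"] = (df["age"] - age_min) / (age_max - age_min)

# 4. Label encoding: replace each category with an integer code.
df["city"], city_categories = pd.factorize(df["city"])

# 5. Train/test split: hold out a random third of the rows as the test set.
test = df.sample(frac=1 / 3, random_state=0)
train = df.drop(test.index)
```

The order matters in this sketch: nulls are filled before the quartiles are computed, and outliers are clipped before scaling so that a single extreme value does not compress the rest of the range.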

Stage 3 - Pipeline Output

The output consists of the processed dataset and, depending on the verbosity required, the pipeline parameters.
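
One way to picture verbosity-dependent output is the sketch below. The function build_output, its verbose flag and the "data" key are all hypothetical and exist only for this example; the library's actual output format may differ.

```python
from typing import Any, Dict


def build_output(params: Dict[str, Any], verbose: bool = False) -> Dict[str, Any]:
    """Hypothetical: return the dataset alone, or the dataset plus parameters."""
    if verbose:
        # Return a copy of everything the pipeline tracked.
        return dict(params)
    # Return only the processed dataset.
    return {"data": params["data"]}


pipeline_params = {"data": [0.0, 0.5, 1.0], "test_size": 0.2}
quiet = build_output(pipeline_params)
detailed = build_output(pipeline_params, verbose=True)
```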

Project Structure

.
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── datasets
├── evaluations
├── preprocessy
├── requirements.txt
├── requirements_dev.txt
├── setup.py
├── tests
└── venv

5 directories, 6 files

preprocessy - Contains the pipeline and function classes
tests - Contains all the unit and integration tests
datasets - Contains sample datasets for development purposes
evaluations - Contains Jupyter notebooks with example implementations and performance measurements

Requirements

pandas
scikit-learn # required for feature selection

For development requirements, see the Contributing Guidelines.

Contributing

Please read our Contributing Guide before submitting a Pull Request to the project.

Support

Feel free to contact any of the maintainers. We're happy to help!

Roadmap

Check out our roadmap to stay informed of the latest features released and the upcoming ones. Feel free to give us your insights!

Documentation

Currently, documentation is under development. All contributions are welcome! Please see our Contributing Guide.

License

See the LICENSE file for licensing information.
