Optimal chunksizes #32

Open
mrocklin opened this issue Feb 25, 2017 · 1 comment

Comments

@mrocklin
Member

In some cases we may wish to rechunk our data prior to execution. This can help to balance between high scheduling overheads (too many tasks) and poor load balancing (too few tasks).

It appears that different algorithms have different optimum sizes. For example, algorithms with low task counts, like ADMM, benefit from smaller chunksizes, while algorithms with many small tasks, like gradient/proximal descent, benefit from larger chunksizes.
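
To make the tradeoff concrete, here is a rough sketch of how a chunk size translates into task count and worker utilization. The function `chunk_plan`, its parameters, and the "one chunk per worker per wave" model are assumptions made for illustration; this is not how any scheduler actually accounts for cost.

```python
import math


def chunk_plan(n_rows, chunk_rows, n_workers):
    """Summarize how a chunk size splits n_rows across n_workers.

    Hypothetical model: every chunk becomes one task, and each worker
    processes one chunk at a time in synchronized "waves".
    """
    n_chunks = math.ceil(n_rows / chunk_rows)
    # Waves needed if each worker handles one chunk per wave.
    waves = math.ceil(n_chunks / n_workers)
    # Fraction of worker slots that hold a chunk across all waves.
    utilization = n_chunks / (waves * n_workers)
    return {"n_chunks": n_chunks, "waves": waves, "utilization": utilization}


# Compare a few candidate chunk sizes for 1,000,000 rows on 8 workers.
for chunk_rows in (500_000, 10_000, 100):
    print(chunk_rows, chunk_plan(1_000_000, chunk_rows, 8))
```

With very large chunks most workers sit idle (poor load balancing), while very small chunks keep every worker busy but create thousands of tasks, each of which carries some scheduling overhead.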

@stoneyv

stoneyv commented Apr 6, 2017

Section 4.2, Cache-aware Access, of Tianqi Chen and Carlos Guestrin's XGBoost paper discusses the tradeoff between smaller and larger block sizes. Smaller blocks result in small workloads for each thread and inefficient parallelization. Larger blocks result in processor cache misses. They do this analysis for two data sets, Allstate 10M and Higgs 10M. Figure 9 plots time per thread versus number of threads for block sizes of 2^12, 2^16, 2^20, and 2^24. There is a significant difference between the performance of the 2^24 block size and the other block sizes.
https://arxiv.org/pdf/1603.02754.pdf

Maybe we could duplicate this experiment for ADMM on those two data sets. They used S3 instead of HDFS.
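
A block-size sweep of that kind could start from a small timing harness like the one below. This is only a single-threaded sketch under stated assumptions: `time_block_sizes` is a hypothetical helper, `sum` stands in for the real per-block work, and the block sizes are scaled down from the paper's 2^12 to 2^24 so it runs quickly. It does not reproduce the XGBoost setup or an ADMM solver.

```python
import time


def time_block_sizes(data, block_sizes, work):
    """Apply `work` block by block and time it for each block size.

    Returns {block_size: (elapsed_seconds, n_blocks)}.
    """
    timings = {}
    for size in block_sizes:
        start = time.perf_counter()
        n_blocks = 0
        for i in range(0, len(data), size):
            work(data[i:i + size])  # stand-in for the real per-block task
            n_blocks += 1
        timings[size] = (time.perf_counter() - start, n_blocks)
    return timings


# Scaled-down sweep: 2**16 elements, three candidate block sizes.
data = list(range(2 ** 16))
results = time_block_sizes(data, [2 ** 8, 2 ** 12, 2 ** 16], sum)
for size, (elapsed, n_blocks) in results.items():
    print(f"block size {size:>6}: {n_blocks:>4} blocks, {elapsed:.4f} s")
```

For a real comparison, the per-block work would be the ADMM update on actual chunks, run on multiple threads or workers and repeated several times to smooth out timing noise.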

Also, it is worth looking at Section 4.3, Blocks for Out-of-core Computation, which discusses block compression and sharding. You might also look at the Blosc/c-blosc package that hdf5 uses.
https://github.com/Blosc/c-blosc
