Make the quantized data shape compatible with original tensor shape #5483
base: master
Conversation
@microsoft-github-policy-service agree [company="Snowflake"]
@microsoft-github-policy-service agree company=Snowflake
…bs/deepspeed into fix-quantized-shape
The function was returning the optimizer name that was configured, not the optimizer actually in effect after this function's processing; the two are not always aligned. Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…oft#5159) Enhance testing: skip fused_optimizer tests if not supported. Added a condition check to skip fused_optimizer tests if FusedAdam and FusedLamb are not supported by the accelerator. This ensures that the tests are appropriately skipped when the hardware configuration does not support these optimizers, preventing potential issues. Details:
- Introduced a condition check to determine support for FusedAdam and FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test reliability.
- Improved compatibility and stability across different hardware configurations.

--------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>
Fixes microsoft#4989. In addition to this PR, the changes below are required to build the following extensions successfully. Please note that not all unit tests for these extensions will pass with this PR; more details on the unit test results are below. These unit tests are skipped in CI anyway, so they will not break the CI.
- transformer_inference
- quantizer
- random_ltd
- pytorch/pytorch#121030
- microsoft#5402

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on MI200:

**transformer_inference:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference
Before this PR: ==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s (0:02:03) =====
After this PR: ========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s ==========

**quantizer:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer
Before this PR: ==== 244 failed, 8 warnings in 48.02s ====
After this PR: ===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

--------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>
This PR enables building the extensions below for AMD GPUs with warp size 32.
- transformer_inference
- quantizer
- random_ltd

This PR works stand-alone for torch versions <= 2.0. For the latest versions, microsoft#5401 needs to be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x:

**transformer_inference:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference
Before this PR: ===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) =====
After this PR: ========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ==========

**quantizer:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer
Before this PR: ==== 244 failed, 8 warnings in 30.53s ====
After this PR: ====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes: microsoft#4753 microsoft#5474 ROCm#68

cc: @jithunnair-amd

--------- Co-authored-by: rraminen@amd.com <rraminen> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* Use all_reduce instead of all_gather to fetch module parameters. This improves performance by reducing the overhead of concatenation and slicing, which are no longer required.
* Instead, all tensor views are created prior to the collective (all_reduce), so upon its completion only the parameter status is updated.
* The behavior is enabled via a new boolean flag under the section "zero_optimization": { "stage3_use_all_reduce_for_fetch_params": true }
* By default the optimization is not enabled.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
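A minimal sketch of how the flag above could be enabled in a DeepSpeed config dict; only the flag name comes from this entry, the surrounding values are illustrative.

```python
# Hedged sketch: enable the all_reduce-based parameter fetch described above.
ds_config = {
    "train_batch_size": 8,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        "stage3_use_all_reduce_for_fetch_params": True,  # disabled by default
    },
}
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```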
Hi, please review the following changes. I added support for BF16 to CPU Adam. BF16, FP16 and float are supported at compilation time; the correct template is called at runtime according to the input params' dtype. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Fixes the following error in /datadisk2/wengshiy/llm.devkit/DeepSpeed/deepspeed/runtime/utils.py: `return get_accelerator().FloatTensor(float(v)).detach()` raises `TypeError: new(): data must be a sequence (got float)`. The CUDA accelerator modified the interface when fixing a warning: microsoft@177dc14 --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
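For context, a small sketch of why the constructor rejects a bare float and two ways around it; this is general PyTorch behavior, not necessarily the exact change made in the referenced commit.

```python
import torch

v = 1.5
# torch.FloatTensor(float(v)) raises "new(): data must be a sequence (got float)"
# because the legacy *Tensor constructors expect a size or a sequence of values.
t1 = torch.FloatTensor([float(v)])                 # wrapping the scalar in a list works
t2 = torch.tensor(float(v), dtype=torch.float32)   # the modern, preferred constructor
```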
…icrosoft#5519) The following error occurs on XPU while unit testing "DeepSpeed/tests/unit/moe/test_moe.py": DeepSpeed/deepspeed/moe/sharded_moe.py line 223, in top1gating: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, xpu:0 and cpu! Fixed by converting the tensor to the correct device. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
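A hedged illustration of the general device-mismatch pattern behind this fix; the tensor names are hypothetical and do not come from sharded_moe.py.

```python
import torch

gates = torch.rand(4, 2)              # imagine this tensor lives on xpu:0
capacity = torch.tensor(2.0)          # created on the CPU by default
capacity = capacity.to(gates.device)  # the device conversion applied by the fix
used = torch.minimum(gates.sum(dim=0), capacity)
```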
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Fix overwriting of the compiled wrapper class attributes by those of the wrapped class itself: copy only those attributes which are not already present in the wrapper.**

In the current implementation of the `CompiledModuleWrapper`, the wrapper attributes (e.g. the `forward` method) are overwritten by `self.__dict__ = module.__dict__.copy()`:
```
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):
    class wrapper(mod.__class__):
        def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
            self.__dict__ = module.__dict__.copy()
```
This causes the `wrapper`'s `forward` method not to be called and, consequently, the wrapped module is not compiled. Instead, the wrapped module's `forward` method is called, as illustrated in the diagram below (a real scenario from DeepSpeed-Chat):
![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)
The proposed fix copies only those attributes which are not present in the wrapper class, thus implementing the desired inheritance quality of the wrapper. Attached is a simple reproducer of the problem: [compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
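A minimal sketch of the copy-only-missing-attributes idea described above (a hypothetical helper, not the exact DeepSpeed code):

```python
def copy_missing_attributes(wrapper_obj, module):
    # Copy only attributes the wrapper does not already define, so wrapper
    # methods such as forward() keep taking precedence over the wrapped module's.
    for name, value in module.__dict__.items():
        if name not in wrapper_obj.__dict__:
            wrapper_obj.__dict__[name] = value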
change from "zero_like" to "zeros_like"
ditto Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Addresses the following warning: ``` /tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead. ``` and the code on the transformers side is [here](https://github.com/huggingface/transformers/blob/1a585c1222a56bcaecc070966d558d4a9d862e83/src/transformers/utils/hub.py#L86C1-L96C81).
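A hedged sketch of the substitution the warning asks for; the path shown is a placeholder, not a value from this change.

```python
import os

# Point HF_HOME at the desired cache root instead of the deprecated
# TRANSFORMERS_CACHE variable (hypothetical path, for illustration only).
os.environ["HF_HOME"] = "/tmp/hf_home"
```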
…5546) Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…osoft#5577) Fixes microsoft#5535. Todo: need to test.
- [x] Needs to be tested. Fixes microsoft#5494. Sample screenshot: <img width="1141" alt="image" src="https://github.com/microsoft/DeepSpeed/assets/114770087/f89f642b-bca1-4d45-b3f1-ec7943ab2ad4">
Until now, only the last layer (idx=-1) was considered, using FINAL_LAYER_NORM_INDEX, which is set to -1. This PR allows the user to pass a custom value for models where this default does not apply. See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository: https://github.com/HabanaAI/Megatron-DeepSpeed/blob/c9feb8cacabc6dd4da4266cff08db555a21122e2/tools/verify_checkpoint_non_tp_consistency.py#L296 --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com>
Add CUDA versions 12.4 and 12.5 to the list
Fixed the Windows build. Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace them with `shutil.copytree()`.
- Small fixes to make the C++ code compile.

Tested with Python 3.9 and CUDA 12.1.

--------- Co-authored-by: Costin Eseanu <costineseanu@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…osoft#5099) The estimate assumed 4 bytes per model parameter and 4 bytes per gradient; fixed it to 2 bytes under the assumption of FP16/BF16. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
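A worked example of the corrected accounting, using a hypothetical 1B-parameter model; this mirrors the reasoning in the entry above, not the estimator's exact formula.

```python
num_params = 1_000_000_000   # hypothetical model size
bytes_per_param = 2          # FP16/BF16 weights (previously counted as 4)
bytes_per_grad = 2           # FP16/BF16 gradients (previously counted as 4)
total_gib = num_params * (bytes_per_param + bytes_per_grad) / 1024**3
print(f"{total_gib:.1f} GiB")   # ~3.7 GiB, versus ~7.5 GiB with the old 4-byte counts
```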
This is a simple fix for inference woq part, changing from `'cuda'` to `get_accelerator().device_name()`. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
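A hedged sketch of the portability pattern this change applies: derive the device string from the accelerator abstraction instead of hard-coding `'cuda'`.

```python
import torch
from deepspeed.accelerator import get_accelerator

device = get_accelerator().device_name()  # e.g. 'cuda', 'xpu', 'npu', depending on the accelerator
t = torch.empty(4, device=device)         # works unmodified on non-CUDA accelerators
```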
Use dp_world_size for grad reduction, instead of seq_dp_world_size. Currently, for zero0, only sparse tensors use the correct world size. Grad-norm test with a tiny model and sp=4:

| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
| -- | -- | -- | -- | -- | -- | -- |
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
In the process of adding onebit optimizer support for XPU devices, we noticed that across accelerators, the main difference in the implementation of `compressed_allreduce` lies in `packbits` and `unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing these with XPU-only functions, we provide a CompressedBackend to do the `compressed_allreduce` work, where users can add their own packbits/unpackbits kernels, which is a general path for all kinds of accelerators. In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built in PackbitsBuilder
3. Add tests for onebit with CompressedBackend

--------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
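For illustration only, a NumPy stand-in for the accelerator-specific packbits/unpackbits kernels that the CompressedBackend plugs in; the actual kernels are cupy, torch_npu or SYCL based.

```python
import numpy as np

signs = (np.random.randn(64) > 0).astype(np.uint8)  # 0/1 sign bits of a gradient chunk
packed = np.packbits(signs)                          # 8x smaller payload to all-reduce
restored = np.unpackbits(packed)[: signs.size]       # unpack on the receiving side
assert np.array_equal(signs, restored)
```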
Updated hpu-gaudi2 tests content as quantizer module is not yet supported.
MII legacy tests use `from transformers import Conversation` [here](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/legacy/method_table.py#L8). Conversation was removed from transformers [here](huggingface/transformers#31165), so we pin to a version from before that change until we can unpin.
…icrosoft#5590) We encountered an accuracy issue when running torch compile + zero3 + activation checkpointing: specifically, some grads get zeroed (without torch compile, the issue does not occur). This issue was also reproduced by Umesh Chand from the DS team. We found that in the PyTorch repo, torch compile has been specifically disabled on checkpointing using the decorator @torch._disable_dynamo(); see the workaround in the PyTorch repo (https://github.com/pytorch/pytorch/blob/ec8b254ef49b4a057cf89c2ae64520fb7b423a3e/torch/utils/checkpoint.py#L324). This indicates that there is some issue with torch compile and checkpointing (not necessarily DS related). Given that the checkpointing function in DeepSpeed is based on the PyTorch function, we propose to adopt this workaround to ensure correct behavior (it can be removed later if the underlying issue is fixed). Note: this shouldn't impact non-torch-compile cases. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
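A minimal sketch of the adopted workaround, mirroring the PyTorch reference above; the wrapped function here is hypothetical, while the actual change decorates DeepSpeed's activation-checkpointing entry point.

```python
import torch

@torch._disable_dynamo()   # keep torch.compile from tracing into the checkpointing path
def checkpointed_region(fn, *args):
    return fn(*args)
```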
instead of "if" that causes host/device synchronization and introduces a bubble, while clamp is hapenning on the device
`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor` only sets the reduction stream to wait for the default stream. This is fine when the computation time is longer than the communication time, but when the communication time is longer, the ipg_buffer may be rewritten before the communication has completed.
![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)
To fix this bug, the easiest way is to also have the default stream wait for the reduction stream at the **same point**. For example, at point 1 the `reduction stream` needs to wait for '2', so we add a wait_stream to the `reduction stream` waiting for the `default stream`. Also, the `default stream` needs to wait for 'A', so we need to add a wait_stream to the `default stream` waiting for the `reduction stream` before 'B'.
![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)
Compared with the modification in microsoft#5523, wait_stream does not cause host synchronization. Compared with the modification in microsoft#5545, this change is simpler and the logic is the same: just wait for what needs to be waited for.

With this modification, the losses of Qwen-1.5 with and without overlap_comm are totally identical.
![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)
On the contrary, there is an obvious gap with a small sequence length, which means a short computation time.
![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)

Co-authored-by: gp513 <guopeng34@huawei.com> Co-authored-by: CurryRice233 <nmeia@qq.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
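A hedged CUDA-stream sketch of the two-way wait described above; the stream names are illustrative and this is not the stage_1_and_2.py code.

```python
import torch

default_stream = torch.cuda.current_stream()
reduction_stream = torch.cuda.Stream()

reduction_stream.wait_stream(default_stream)   # reduction waits for the producing kernels
with torch.cuda.stream(reduction_stream):
    pass                                       # launch the reduction/communication work here
default_stream.wait_stream(reduction_stream)   # producer waits before reusing the ipg_buffer
```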
…into fix-quantized-shape
@@ -45,6 +45,18 @@ at::Tensor quantize(torch::Tensor& out,
    return out;
}

at::Tensor get_scales(torch::Tensor& out, int num_groups)
This function is redefined at line 118.
@@ -62,7 +62,8 @@ def _ensure_quantized(self, tensor: torch.Tensor):
    tensor.data = self.quantizer.quantize(tensor.data,
                                          q_bits=self.quantization_config.q_bits,
                                          q_mantisa_bits=self.quantization_config.mantissa_bits)
    assert tensor.dtype == torch.uint8
    assert (tensor.dtype == torch.int8), \
Shouldn't it be `torch.uint8` instead?
This PR adds a simple modification to the FP-Quantizer module so that the quantized weight's shape is compatible with the original checkpoint weight's shape.
This addresses Snowflake-Labs/snowflake-arctic#16
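A heavily hedged sketch of the general idea (a hypothetical helper, not the FP-Quantizer API): keep the quantized byte tensor viewed in the original weight's shape whenever the element counts line up, so shape checks against the original checkpoint weight still pass.

```python
import torch

def quantize_preserving_shape(weight: torch.Tensor, quantize_fn):
    # quantize_fn is a placeholder for an 8-bit quantizer returning flat uint8 data.
    q = quantize_fn(weight)
    # Reshape only when one quantized byte corresponds to one original element.
    return q.reshape(weight.shape) if q.numel() == weight.numel() else q
```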