NEDeconvolutionLayer performance degradation in v24.11 #1150

Open
alvoron opened this issue Nov 21, 2024 · 19 comments
@alvoron

alvoron commented Nov 21, 2024

How ACL was built:

scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=1 cppthreads=0 os=macos data_layout_support=all  build=native --jobs=16 os=macos build=native --silent fixed_format_kernels=True

Platform:
Apple M2 Pro

Operating System:
macOS 13.4

Problem description:
NEDeconvolutionLayer performance in 24.09 is better than in 24.11.

Reproducer

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
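#include <chrono>   // std::chrono::high_resolution_clock
#include <cstdint>  // uint64_t
#include <cstdlib>  // exit
#include <iostream>
#include <vector>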

using namespace arm_compute;

int main(int argc, char *argv[]) {
  TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 200, 200), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 600, 600), 1, DataType::F16, DataLayout::NHWC);

  PadStrideInfo deconvInfo = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);
  bool fastMath = true;
  auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconvInfo, fastMath);
  if(status.error_code() != ErrorCode::OK) {
    std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
    exit(1);
  }
  std::cout << "PASSED VALIDATION" << std::endl;

  Tensor srcTensor;
  Tensor weiTensor;
  Tensor dstTensor;
  srcTensor.allocator()->init(srcTensorInfo);
  weiTensor.allocator()->init(weiTensorInfo);
  dstTensor.allocator()->init(dstTensorInfo);

  NEDeconvolutionLayer deconv;
  deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconvInfo, fastMath);
  std::cout << "PASSED CONFIGURATION" << std::endl;

  srcTensor.allocator()->allocate();
  weiTensor.allocator()->allocate();
  dstTensor.allocator()->allocate();

  //warm-up
  deconv.run();

  std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) deconv.run();
  std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
  uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
  std::cout << "time: " << total_duration / 100 << std::endl;

  return 0;
}

How reproducer was built

g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_deconv.cpp -L./ComputeLibrary/build/ -larm_compute -std=c++17

The reproducer reports an average of 7038 microseconds per run on 24.09 and 10669 microseconds per run on 24.11.
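
A mean over 100 iterations can be skewed by a few slow outliers, so a median is sometimes easier to compare across builds. The following is a minimal sketch of such a timing helper, not part of ACL or of the reproducer above: the name `median_run_time_us` and its parameters are hypothetical, and it accepts any callable.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an ACL API): calls `fn` `warmup` times untimed,
// then `iters` times timed, and returns the median per-call time in microseconds.
// For an even number of samples it returns the upper of the two middle values.
template <typename Fn>
std::int64_t median_run_time_us(Fn &&fn, std::size_t iters = 100, std::size_t warmup = 1) {
  if (iters == 0) return 0;
  for (std::size_t i = 0; i < warmup; ++i) fn();

  std::vector<std::int64_t> samples;
  samples.reserve(iters);
  for (std::size_t i = 0; i < iters; ++i) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto finish = std::chrono::steady_clock::now();
    samples.push_back(
        std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count());
  }

  // nth_element puts the median in the middle slot without sorting everything.
  auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
  std::nth_element(samples.begin(), mid, samples.end());
  return *mid;
}
```

In the reproducer this could be called as `median_run_time_us([&] { deconv.run(); })` in place of the hand-written timing loop.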

Could you please review potential performance issues in NEDeconvolutionLayer?
I also observe degradation in Convolution; the Deconvolution and Convolution issues probably have the same cause.

It's also worth mentioning that I haven't observed these degradations on Ampere.

@morgolock

Hi @alvoron

On macOS we don't support the option openmp=1 in ACL because libomp is not part of the OS and users need to install it as a third-party package. Can you double-check on your side? The build command you shared above (scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=1 cppthreads=0 os=macos data_layout_support=all build=native --jobs=16 os=macos build=native --silent fixed_format_kernels=True) does not work for me. How were you able to build ACL with openmp=1 on macOS?

@alvoron
Author

alvoron commented Nov 26, 2024

@morgolock Can you install OpenMP by running brew install libomp?

@morgolock

Hi @alvoron

Will do, but this is not something we normally test or support on macOS.

@alvoron
Author

alvoron commented Nov 27, 2024

@morgolock I've got almost the same results with cppthreads=1 openmp=0:
24.11 - 10612
24.09 - 7144
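
To put the reported timings on a common scale, the slowdown can be expressed as a percentage of the 24.09 baseline. This is a small sketch with a hypothetical helper name (`slowdown_percent`); the only inputs are the numbers quoted in this thread. With them, 24.11 comes out roughly 51.6% slower with openmp=1 (7038 vs 10669) and roughly 48.5% slower with cppthreads=1 (7144 vs 10612).

```cpp
// Hypothetical helper: relative slowdown of `current_us` versus `baseline_us`,
// in percent. Positive values mean the current build is slower than the baseline.
constexpr double slowdown_percent(double baseline_us, double current_us) {
  return (current_us - baseline_us) / baseline_us * 100.0;
}

// Timings quoted in this thread (average microseconds per run).
constexpr double kOpenMpSlowdown     = slowdown_percent(7038.0, 10669.0);
constexpr double kCppThreadsSlowdown = slowdown_percent(7144.0, 10612.0);
```

Both thread backends show a slowdown of about the same size, which is consistent with the regression not being specific to the OpenMP scheduler.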

@morgolock

Hi @alvoron

Thanks for the additional information. I reproduced the issue. We are looking into it.

@alvoron
Author

alvoron commented Dec 3, 2024

Hi @morgolock
I see that v24.11.1 has been released. Does it contain the fix?

@morgolock

Hi @alvoron

No, the regression has not been fixed yet.

Hope this helps

@morgolock

Hi @alvoron

This patch solves the problem and it will be included in the next release.

Hope this helps

@alvoron
Author

alvoron commented Dec 4, 2024

@morgolock
Thank you for the patch.
Let me check it on my side.

morgolock added this to the v25.02 milestone on Dec 4, 2024

@alvoron
Author

alvoron commented Dec 17, 2024

@morgolock
I applied the patch on top of v24.11 (f44f09d) and got the same results: 10400-10500.

Could you please double-check the fix? Or are additional patches required?

@morgolock

Hi @alvoron

I ran the test and I can confirm that it fixes the regression on Apple M2 Pro

I built the library with the following options:
scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all build=native --jobs=16 os=macos build=native validation_tests=0 examples=0 fixed_format_kernels=True logging=0 build_dir=./build/main -j8

See the results below

user@acl-mac-mini deconv % ./deconv_fix | grep time
time: 7067
user@acl-mac-mini deconv % ./deconv_24.09 | grep time
time: 7505

@morgolock

Hi @alvoron

Make sure you explicitly set the memory manager when you create the DeconvLayer, as shown below.

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/PoolManager.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/MemoryGroup.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include <chrono>   // std::chrono::high_resolution_clock
#include <cstdint>  // uint64_t
#include <cstdlib>  // exit
#include <iostream>
#include <memory>   // std::make_shared
#include <vector>

using namespace arm_compute;

int main(int argc, char *argv[]) {
  Allocator allocator{};                                                       // Allocator used for the backing memory
  auto lifetime_mgr = std::make_shared<BlobLifetimeManager>();                 // Create Lifetime Manager
  auto pool_mgr     = std::make_shared<PoolManager>();                         // Create Pool Manager
  auto mm = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr);   // Create Memory Manager
  MemoryGroup memory_group(mm);

  TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 200, 200), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 600, 600), 1, DataType::F16, DataLayout::NHWC);
  PadStrideInfo deconvInfo = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);
  bool fastMath = true;
  auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconvInfo, fastMath);
  if(status.error_code() != ErrorCode::OK) {
    std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
    exit(1);
  }
  std::cout << "PASSED VALIDATION" << std::endl;

  Tensor srcTensor;
  Tensor weiTensor;
  Tensor dstTensor;

  memory_group.manage(&srcTensor);  // Start managing srcTensor and start its lifetime
  memory_group.manage(&weiTensor);  // Start managing weiTensor and start its lifetime
  memory_group.manage(&dstTensor);  // Start managing dstTensor and start its lifetime

  srcTensor.allocator()->init(srcTensorInfo);
  weiTensor.allocator()->init(weiTensorInfo);
  dstTensor.allocator()->init(dstTensorInfo);

  // Pass the memory manager to the layer explicitly
  NEDeconvolutionLayer deconv(mm);
  deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconvInfo, fastMath);
  std::cout << "PASSED CONFIGURATION" << std::endl;

  srcTensor.allocator()->allocate();
  weiTensor.allocator()->allocate();
  dstTensor.allocator()->allocate();

  mm->populate(allocator, 1);
  memory_group.acquire();

  //warm-up
  deconv.run();

  std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) {
    deconv.run();
  }
  std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
  uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
  std::cout << "time: " << total_duration / 100 << std::endl;

  memory_group.release();
  mm->clear();

  return 0;
}

If you apply the patch and initialize NEDeconvolutionLayer with the MemoryManager, the performance issue should be resolved.

Hope this helps

@alvoron
Author

alvoron commented Dec 18, 2024

@morgolock
I've observed performance degradation on f32 convolutions (gemm_acl_f32) on mobilenet-v2-1.0-224 as well.

We're using the ACL Convolution kernel via oneDNN, so the fix needs to be made on the oneDNN side to recover Convolution performance.

Could you please check GEMM as well?

@morgolock

Hi @alvoron

> I've observed performance degradation on f32 convolutions (gemm_acl_f32) on mobilenet-v2-1.0-224 as well.
> We're using ACL Convolution kernel via oneDNN, so the fix needs to be done on oneDNN side to recover Convolution performance.
> Could you please check GEMM as well?

In v24.11 we introduced this patch to improve memory management in ACL. This patch considerably reduces memory usage in some models. A side effect of this change is that users of the library must explicitly set up and configure the memory manager, as shown in the reproducer above, to get the best performance; otherwise you will see a performance regression.

If you set up the memory manager as in the reproducer, you will see no performance regression in v24.12.

Hope this helps.

@alvoron
Author

alvoron commented Dec 18, 2024

@morgolock How do I configure the memory manager if I'm using oneDNN to call the ACL Convolution kernel? Does oneDNN provide an API to set up the ACL memory manager?

@theComputeKid
Member

> Does oneDNN provide an API to setup ACL memory manager?

@alvoron This sounds more like a oneDNN question. Are you asking as a oneDNN user or as a oneDNN contributor? The answer differs between the two cases. We can take this discussion to the oneDNN repo if it gets too technical.

cc: @Sqvid

@alvoron
Author

alvoron commented Dec 28, 2024

@morgolock @theComputeKid
The convolution issue I mentioned above is related to the stateless feature. I've created an issue in the oneDNN repo: oneapi-src/oneDNN#2324

@theComputeKid
Member

@alvoron Thanks, we will get back to you on the oneDNN side once our people are back from the holidays.

@morgolock It is likely that this is a bug in the way we implemented stateless conv (we have a workaround in oneDNN for winograd, which is probably slowing things down). We might need to apply some fixes to make it thread-safe, so that the oneDNN workaround is not required. Would you like to track this as part of this issue? If so, we should rename the issue to "stateless conv performance worse than NEConv".

@morgolock

Hi @alvoron

This patch solves the performance regression in NEDeconvLayer.

Hope this helps.
