
long-running fiber performance loss since 1.14.0 #15214

Open

zw963 opened this issue Nov 22, 2024 · 8 comments
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. performance topic:stdlib:concurrency

Comments

@zw963 (Contributor) commented Nov 22, 2024

require "wait_group"

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end

    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end

  steps
end

def calculate(total_seeds, batch_size, worker_size, batches)
  channel = Channel({Int64, Int64}).new(batches + 1)
  wg = WaitGroup.new(worker_size)

  p! batch_size
  p! batches

  worker_size.times do |i|
    spawn(name: "WORKER-#{i}") do
      while r = channel.receive?
        (r[0]...r[1]).each do |seed|
          steps = collatz(seed)

          if seed % 1_000_000 == 0
            print "Seed: #{seed} Steps: #{steps}\r"
          end
        end
      end
      wg.done
    end
  end

  start = Time.measure do
    r0 = 0_i64

    batches.times do
      r1 = r0 &+ batch_size
      channel.send({r0, r1})
      r0 = r1
    end

    if total_seeds - batch_size &* batches > 0
      channel.send({r0, total_seeds})
    end

    channel.close
    wg.wait
  end

  puts "\ncollatz took: #{start}"
end

# ---------------- common part ----------------

total_seeds = 1_000_000_000_i64
worker_size = ENV.fetch("CRYSTAL_WORKERS").to_i

p! total_seeds
p! worker_size

puts "-"*100

batch_size = 1000_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 2000_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 5000_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 500_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 100_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batches = worker_size
batch_size = (total_seeds // batches).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

You can run it with: CRYSTAL_WORKERS=16 crystal run --release -Dpreview_mt 1.cr

Following are the results when built with 1.13.2:

total_seeds # => 1000000000
worker_size # => 16
----------------------------------------------------------------------------------------------------
batch_size # => 1000
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.577703068
----------------------------------------------------------------------------------------------------
batch_size # => 2000
batches # => 500000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.815965558
----------------------------------------------------------------------------------------------------
batch_size # => 5000
batches # => 200000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.833879590
----------------------------------------------------------------------------------------------------
batch_size # => 500
batches # => 2000000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.944712438
----------------------------------------------------------------------------------------------------
batch_size # => 100
batches # => 10000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.189403820
----------------------------------------------------------------------------------------------------
batch_size # => 62500000
batches # => 16
Seed: 937000000 Steps: 224
collatz took: 00:00:14.090417122

Following are the results when built with 1.14.0:
total_seeds # => 1000000000
worker_size # => 16
----------------------------------------------------------------------------------------------------
batch_size # => 1000
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.265471911
----------------------------------------------------------------------------------------------------
batch_size # => 2000
batches # => 500000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.537970427
----------------------------------------------------------------------------------------------------
batch_size # => 5000
batches # => 200000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.500822759
----------------------------------------------------------------------------------------------------
batch_size # => 500
batches # => 2000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.484737884
----------------------------------------------------------------------------------------------------
batch_size # => 100
batches # => 10000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.781218646
----------------------------------------------------------------------------------------------------
batch_size # => 62500000
batches # => 16
Seed: 999000000 Steps: 162
collatz took: 00:00:13.329298756

Following are the results when built with latest master (879ec12):

total_seeds # => 1000000000
worker_size # => 16
----------------------------------------------------------------------------------------------------
batch_size # => 1000
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.540662927
----------------------------------------------------------------------------------------------------
batch_size # => 2000
batches # => 500000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.590167441
----------------------------------------------------------------------------------------------------
batch_size # => 5000
batches # => 200000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.690674863
----------------------------------------------------------------------------------------------------
batch_size # => 500
batches # => 2000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.750579132
----------------------------------------------------------------------------------------------------
batch_size # => 100
batches # => 10000000
Seed: 999000000 Steps: 162
collatz took: 00:00:14.017306410
----------------------------------------------------------------------------------------------------
batch_size # => 62500000
batches # => 16
Seed: 687000000 Steps: 239
collatz took: 00:00:22.480855790

On latest master, the elapsed time for the last configuration grows from about 13s to 22s, an almost 70% performance loss, and it is reproducible.

@zw963 zw963 added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Nov 22, 2024
@straight-shoota (Member)

Is this the same code you mentioned in https://forum.crystal-lang.org/t/there-is-a-way-to-optimize-this-program/6947/63?u=straight-shoota ? Please add at least references to related comments when cross-posting.

@zw963 (Contributor, Author) commented Nov 22, 2024

Is this the same code you mentioned in https://forum.crystal-lang.org/t/there-is-a-way-to-optimize-this-program/6947/63?u=straight-shoota ?

Yes, I mentioned it in several places on the forum, and I now consider this a regression.

@ysbaddaden (Contributor)

Thanks, I'm aware of your report. I haven't had time to dig into it yet.

@ysbaddaden (Contributor) commented Nov 22, 2024

I can reproduce on master (Linux), and it goes away with -Devloop=libevent. It also goes away when I drop the print "...\r" line.

It's harder to reproduce with tracing enabled, but printing more often eventually triggers it.

At some point STDOUT blocks and its fd gets added to an evloop, but then every print seems to trigger an epoll event (maybe the fd status is reset whenever there's enough space, or something?), which causes the thread owning that evloop to enter a loop of evloop.run; evloop.event fd=4.

@ysbaddaden (Contributor)

Creating a Log instance and using it instead of print "...\r" also avoids the issue, even when writing much more often (every 100 000 seeds), because the program never writes concurrently to STDOUT.
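The Log-based variant could look roughly like this. A minimal sketch, assuming the stdlib Log module; the logger name and reporting interval are illustrative, not the exact code that was benchmarked:

```crystal
require "log"

# One shared logger; Log's IO backend funnels records through a single
# backend (dispatched on a dedicated fiber by default), so worker fibers
# never write to STDOUT concurrently.
LOG = Log.for("collatz")

# Inside the worker loop, instead of `print "Seed: ... \r"`:
# if seed % 100_000 == 0
#   LOG.info { "Seed: #{seed} Steps: #{steps}" }
# end
```

The trade-off is losing the \r carriage-return overwrite behavior, since each record becomes its own log line.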

@ysbaddaden (Contributor) commented Nov 22, 2024

Same with a Channel and a Fiber dedicated to writing the report (so we keep the \r behavior), even when reporting every 1 000 seeds (i.e. a lot more often than the initial every 1 000 000).

So definitely: the problem exhibited here is parallel writes to a shared fd from different threads (the same could happen with reads).
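A minimal sketch of the Channel plus dedicated printer fiber approach described above (the shape is an assumption, not the exact code that was tested): workers send their progress over a channel, and a single fiber owns STDOUT, so the fd is never written from multiple threads at once.

```crystal
# Reporting channel: workers send {seed, steps} tuples instead of printing.
report = Channel({Int64, Int64}).new(64)

spawn(name: "REPORTER") do
  # The only fiber that ever touches STDOUT; the \r overwrite
  # behavior is preserved.
  while r = report.receive?
    print "Seed: #{r[0]} Steps: #{r[1]}\r"
  end
end

# In each worker, replace the direct print with something like:
#   report.send({seed, steps}) if seed % 1_000 == 0
# and close the channel once all workers have finished:
#   report.close
```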

@ysbaddaden (Contributor)

To sum up: I pinpointed the issue to multiple threads trying to write to the same fd, in this case stdout. Whenever the concurrent writes are removed (using Log or a Channel + Fiber), the performance is back.

What I extracted from tracing is that the fd blocks at some point, so it's added to epoll; then the evloop is notified that it became ready and the fiber continues... but the fd is left in the evloop.

Then said thread has nothing left to do, but other threads continue writing to the fd, which regularly triggers ready events, and the thread enters a loop { evloop.wait; evloop.ready fd=4 }.

I don't know why it has an impact on the overall performance. It shouldn't have any impact on the other threads.

@crysbot commented Dec 18, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/upcoming-release-1-15-0/7537/2
