
long-running fiber performance loss since 1.14.0 #15214

Open

zw963 opened this issue Nov 22, 2024 · 8 comments
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. performance topic:stdlib:concurrency

Comments

@zw963 (Contributor) commented Nov 22, 2024

require "wait_group"

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end

    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end

  steps
end

def calculate(total_seeds, batch_size, worker_size, batches)
  channel = Channel({Int64, Int64}).new(batches + 1)
  wg = WaitGroup.new(worker_size)

  p! batch_size
  p! batches

  worker_size.times do |i|
    spawn(name: "WORKER-#{i}") do
      while r = channel.receive?
        (r[0]...r[1]).each do |seed|
          steps = collatz(seed)

          if seed % 1_000_000 == 0
            print "Seed: #{seed} Steps: #{steps}\r"
          end
        end
      end
      wg.done
    end
  end

  start = Time.measure do
    r0 = 0_i64

    batches.times do
      r1 = r0 &+ batch_size
      channel.send({r0, r1})
      r0 = r1
    end

    if total_seeds - batch_size &* batches > 0
      channel.send({r0, total_seeds})
    end

    channel.close
    wg.wait
  end

  puts "\ncollatz took: #{start}"
end

# ---------------- common part ----------------

total_seeds = 1_000_000_000_i64
worker_size = ENV.fetch("CRYSTAL_WORKERS").to_i

p! total_seeds
p! worker_size

puts "-"*100

batch_size = 1000_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 2000_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 5000_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 500_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batch_size = 100_i64
batches = (total_seeds // batch_size).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batches = worker_size
batch_size = (total_seeds // batches).to_i32
calculate(total_seeds, batch_size, worker_size, batches)

You can run it with: CRYSTAL_WORKERS=16 crystal run --release -Dpreview_mt 1.cr

Following are the results when built with 1.13.2:

total_seeds # => 1000000000
worker_size # => 16
----------------------------------------------------------------------------------------------------
batch_size # => 1000
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.577703068
----------------------------------------------------------------------------------------------------
batch_size # => 2000
batches # => 500000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.815965558
----------------------------------------------------------------------------------------------------
batch_size # => 5000
batches # => 200000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.833879590
----------------------------------------------------------------------------------------------------
batch_size # => 500
batches # => 2000000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.944712438
----------------------------------------------------------------------------------------------------
batch_size # => 100
batches # => 10000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.189403820
----------------------------------------------------------------------------------------------------
batch_size # => 62500000
batches # => 16
Seed: 937000000 Steps: 224
collatz took: 00:00:14.090417122

Following are the results when built with 1.14.0:
total_seeds # => 1000000000
worker_size # => 16
----------------------------------------------------------------------------------------------------
batch_size # => 1000
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.265471911
----------------------------------------------------------------------------------------------------
batch_size # => 2000
batches # => 500000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.537970427
----------------------------------------------------------------------------------------------------
batch_size # => 5000
batches # => 200000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.500822759
----------------------------------------------------------------------------------------------------
batch_size # => 500
batches # => 2000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.484737884
----------------------------------------------------------------------------------------------------
batch_size # => 100
batches # => 10000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.781218646
----------------------------------------------------------------------------------------------------
batch_size # => 62500000
batches # => 16
Seed: 999000000 Steps: 162
collatz took: 00:00:13.329298756

Following are the results when built with latest master (879ec12):

total_seeds # => 1000000000
worker_size # => 16
----------------------------------------------------------------------------------------------------
batch_size # => 1000
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.540662927
----------------------------------------------------------------------------------------------------
batch_size # => 2000
batches # => 500000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.590167441
----------------------------------------------------------------------------------------------------
batch_size # => 5000
batches # => 200000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.690674863
----------------------------------------------------------------------------------------------------
batch_size # => 500
batches # => 2000000
Seed: 999000000 Steps: 162
collatz took: 00:00:13.750579132
----------------------------------------------------------------------------------------------------
batch_size # => 100
batches # => 10000000
Seed: 999000000 Steps: 162
collatz took: 00:00:14.017306410
----------------------------------------------------------------------------------------------------
batch_size # => 62500000
batches # => 16
Seed: 687000000 Steps: 239
collatz took: 00:00:22.480855790

On latest master, the elapsed time for the last configuration grows from about 13s to 22s, an almost 70% performance loss, and it is reproducible.

@zw963 zw963 added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Nov 22, 2024
@straight-shoota (Member)

Is this the same code you mentioned in https://forum.crystal-lang.org/t/there-is-a-way-to-optimize-this-program/6947/63?u=straight-shoota ? Please add at least references to related comments when cross-posting.

@zw963 (Contributor, Author) commented Nov 22, 2024

Is this the same code you mentioned in https://forum.crystal-lang.org/t/there-is-a-way-to-optimize-this-program/6947/63?u=straight-shoota ?

Yes, I mentioned it in several places on the forum, and I now consider this a regression.

@ysbaddaden (Contributor)

Thanks, I'm aware of your report. I haven't had time to dig into it yet.

@ysbaddaden (Contributor) commented Nov 22, 2024

I can reproduce on master (Linux), and it goes away with -Devloop=libevent. It also goes away when I drop the print "...\r" line.

It's harder to reproduce with tracing enabled, but printing more often eventually triggers it.

At some point STDOUT blocks and its fd gets added to an evloop, but then every print seems to trigger an epoll event (maybe the fd status is reset whenever there's enough space, or something?), which causes the thread owning that evloop to enter a loop of evloop.run; evloop.event fd=4.

@ysbaddaden (Contributor)

Creating a Log instance and using it instead of print "...\r" also avoids the issue, even when writing much more often (every 100 000 seeds), because the program never writes concurrently to STDOUT.
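The Log-based variant could look roughly like this. A minimal sketch, assuming the stdlib Log module; the logger name and reporting interval are illustrative, not the exact code that was benchmarked:

```crystal
require "log"

# One shared logger; Log's IO backend funnels records through a single
# backend (dispatched on a dedicated fiber by default), so worker fibers
# never write to STDOUT concurrently.
LOG = Log.for("collatz")

# Inside the worker loop, instead of `print "Seed: ... \r"`:
# if seed % 100_000 == 0
#   LOG.info { "Seed: #{seed} Steps: #{steps}" }
# end
```

The trade-off is losing the \r carriage-return overwrite behavior, since each record becomes its own log line.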

@ysbaddaden (Contributor) commented Nov 22, 2024

Same with a Channel and a Fiber dedicated to writing the report (so we keep the \r behavior), even when reporting every 1 000 seeds (i.e. a lot more often than the initial every 1 000 000).

So definitely: the problem exhibited here is parallel writes to a shared fd from different threads (the same could happen with reads).
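A minimal sketch of the Channel plus dedicated printer fiber approach described above (the shape is an assumption, not the exact code that was tested): workers send their progress over a channel, and a single fiber owns STDOUT, so the fd is never written from multiple threads at once.

```crystal
# Reporting channel: workers send {seed, steps} tuples instead of printing.
report = Channel({Int64, Int64}).new(64)

spawn(name: "REPORTER") do
  # The only fiber that ever touches STDOUT; the \r overwrite
  # behavior is preserved.
  while r = report.receive?
    print "Seed: #{r[0]} Steps: #{r[1]}\r"
  end
end

# In each worker, replace the direct print with something like:
#   report.send({seed, steps}) if seed % 1_000 == 0
# and close the channel once all workers have finished:
#   report.close
```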

@ysbaddaden (Contributor)

To sum up: I pinpointed the issue to multiple threads trying to write to the same fd, in this case stdout. Whenever the concurrent writes are removed (using Log or a Channel + Fiber), the performance is back.

What I extracted from tracing is that the fd blocks at some point, so it's added to epoll; then the evloop is notified that it became ready and the fiber continues... but the fd is left in the evloop.

Then said thread has nothing left to do, but other threads continue writing to the fd, which regularly triggers ready events, and the thread enters a loop { evloop.wait; evloop.ready fd=4 }.

I don't know why it has an impact on the overall performance. It shouldn't have any impact on the other threads.

@crysbot commented Dec 18, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/upcoming-release-1-15-0/7537/2
