Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ken started Quickjs backed indexer processes freeze #5342

Open
aduchate opened this issue Nov 25, 2024 · 6 comments
Open

Ken started Quickjs backed indexer processes freeze #5342

aduchate opened this issue Nov 25, 2024 · 6 comments
Labels
bug erlang-vm-compatibility Possible Erlang VM compatibility issue

Comments

@aduchate
Copy link

aduchate commented Nov 25, 2024

Description

On a cluster (3.4.2) of 6 nodes that has a fairly large amount of databases (~1500 of size between 1GB and 150GB each), we have recently added and modified about 20 design docs per database (total 30000 dds). We have setup Ken with a concurrency of 5 to let the indexation happen.

About every 10 minutes, we see one indexer process not being updated anymore. It basically stays stuck forever (we let a few linger for 24 hours). Killing all couchjs_mainjs has no impact on the stuck indexer. The only way to get rid of it is to issue, in remsh, an exit(Pid, kill). . Pid here is the pid field of /_active_task, not indexer_pid.

Steps to Reproduce

Create a lot of databases with a lot of data, create a few design documents per database, start ken.

Expected Behaviour

The indexers shouldn't get stuck

Your Environment

  • CouchDB version used: 3.4.2
  • Browser name and version: irrelevant
  • Operating system and version: ubuntu 22.04 (couchdb compiled with dockerfile below)
FROM ubuntu:22.04

# Create app directory
WORKDIR /root

# Install dependencies
RUN apt-get update
RUN ln -fs /usr/share/zoneinfo/Europe/Brussels /etc/localtime
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata
RUN dpkg-reconfigure --frontend noninteractive tzdata
RUN apt-get install -y gnupg2 wget curl vim git build-essential pkg-config libicu-dev libmozjs-78-dev libcurl4-openssl-dev libncurses-dev node-gyp npm libssl-dev help2man openjdk-21-jdk-headless
RUN git clone https://github.com/erlang/otp otp_src_27.1.2
WORKDIR /root/otp_src_27.1.2
RUN git checkout -b 27.1.2 44ffe8811dfcf3d2fe04d530c6e8fac5ca384e02
RUN bash -c 'export ERL_TOP=`pwd`; export LANG=C; ./configure; make; make release_tests; cd release/tests/test_server; /root/otp_src_27.1.2/bin/erl -s ts install -s ts smoke_test batch -s init stop; cd ..; tar zcvf otp-tests.tgz test_server; cd /root/otp_src_27.1.2; make install'
WORKDIR /root
RUN git clone https://github.com/apache/couchdb.git #9
WORKDIR /root/couchdb
RUN git checkout -b 3.4.2 6e5ad2a5c5479cb09722b4a7d13b3d59b7bb2a23
RUN bash -c 'curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash ; . /root/.nvm/nvm.sh; nvm install 18'
RUN ./configure  --disable-docs --spidermonkey-version 78 chdir=/media/data/src/couchdb
RUN bash -c '. /root/.nvm/nvm.sh; nvm use 18; make release'
WORKDIR /root/couchdb/rel
RUN tar zcvf couchdb.jammy-jellyfish.3.4.2.tgz couchdb

Additional Context

We can give you access to the infrastructure that causes the problem to happen if needed.

@aduchate aduchate changed the title Ken started Quickjs backed indexer processes freezes Ken started Quickjs backed indexer processes freeze Nov 25, 2024
@nickva
Copy link
Contributor

nickva commented Nov 25, 2024

Thanks for reaching out, Antoine.

  • Do you see any crashes or errors in the logs?

  • QuickJS can build indexes a bit faster but that could mean putting more pressure on the disk or other resources. Do you see increased cpu, disk(I/O quota reaching a limit?) or memory usage during that time?

  • If your design documents or documents are large sometimes it's possible to hit the maximum stack size in the JS engine. For QuickJS that can be adjusted with memory_limit_bytes. Though in that case I'd expect to see repeated crashes in the logs with segfault or a memory limit error.

  • Before killing the process try to gather some stats if you an remsh in:

    • That should give us some idea what the the process is running in the remsh erlang:process_info(Pid).
    • These three commands will show the top 3 busiest process, message queue lengths and memory users:
recon:proc_window(reductions, 3, 5000).
recon:proc_window(message_queue_len, 3, 5000).
recon:proc_window(memory, 3, 5000).
  • Another option if you see you have reached the os_process_limit on any node with the number of couchjs processes, could consider increasing that a bit.

@aduchate
Copy link
Author

Hi nicka,

No crash in the logs at all
The parallelism of ken was limited to 5 which insured that the CPU usage was staying very low on the 48 cores machine
I'll try raising memory_limit_bytes though I doubt it will improve things because at a subsequent indexations of the same view succeed.

Below, the request information from remsh

(couchdb@couchdb-02-rbx-02.vrack.icure.cloud)2> (couchdb@couchdb-02-rbx-02.vrack.icure.cloud)2> erlang:process_info(<0.83257742.0>). 
[{current_function,{gen,do_call,4}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.83311580.0>]},
 {dictionary,
     [{task_status_props,
          [{changes_done,622766},
           {database,
               <<"shards/a0000000-bfffffff/icure-ms-wgc-nieuwegent-614fc45"...>>},
           {design_document,<<"_design/Contact-DataOwner">>},
           {indexer_pid,<<"<0.4441283.0>">>},
           {progress,50},
           {started_on,1732624104},
           {total_changes,1223768},
           {type,indexer},
           {updated_on,1732624464}]},
      {io_priority,
          {view_update,
              <<"shards/a0000000-bfffffff/icure-ms-wgc-nieuwegent-614fc455-b5"...>>,
              <<"_design/Contact-DataOwner">>}},
      {task_status_update,{{0,0,0},500000}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.252.0>},
 {total_heap_size,833025},
 {heap_size,318187},
 {stack_size,16},
 {reductions,86130223},
 {garbage_collection,
     [{max_heap_size,
          #{error_logger => true,include_shared_binaries => false,
            kill => true,size => 0}},
      {min_bin_vheap_size,46422},
      {min_heap_size,233},
      {fullsweep_after,65535},
      {minor_gcs,1217}]},
 {suspending,[]}]
(couchdb@couchdb-02-rbx-02.vrack.icure.cloud)3> recon:proc_window(reductions, 3, 5000).
[{<0.172400993.0>,64192268,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_index_updater,update,3}}]},
 {<0.170530515.0>,61520321,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_index_updater,update,3}}]},
 {<0.172131409.0>,60832785,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_index_updater,update,3}}]}]
(couchdb@couchdb-02-rbx-02.vrack.icure.cloud)4> recon:proc_window(message_queue_len, 3, 5000).
[{<0.554738.0>,1,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.21585.0>,1,
  [{current_function,{erts_internal,dirty_nif_finalizer,1}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.3.0>,1,
  [{current_function,
       {erts_dirty_process_signal_handler,msg_loop,0}},
   {initial_call,
       {erts_dirty_process_signal_handler,start,0}}]}]
(couchdb@couchdb-02-rbx-02.vrack.icure.cloud)5> recon:proc_window(memory, 3, 5000).
[{<0.172416530.0>,4721072,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_bt_engine_compactor,start,4}}]},
 {<0.172600591.0>,4720504,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.4328.0>,4032976,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]}]

@nickva
Copy link
Contributor

nickva commented Nov 26, 2024

Thanks @aduchate.

Hmm, so far everything looks good.

couch_index_updater:update/3 processes seem to be the most busy on the cluster. But they seem to wait in 'do_call` so they are waiting on a response from another process.

  • Try to get erlang:process_info(Pid, current_stacktrace). along with erlang:process_info(Pid). for the stuck indexer, maybe the indexer updater process as well.

  • What about the {progress,50}, {total_changes,1223768}, {type,indexer}, {updated_on,1732624464} values on the stuck indexer, do they get updated periodically? What about disk IO and CPU usage does it seems like anything is being used significantly?

  • Is it always the same db/index that gets stuck or it's always a random one?

  • RUN git clone https://github.com/erlang/otp otp_src_27.1.2 we do support and test with Erlang 27. However our packages still ship with 25. Would it be possible to see if 25 would act the same?

  • Does running the QuickJS scanner show any discrepancies or crashes in the logs against the cluster, specifically for the stuck indexes? https://docs.couchdb.org/en/stable/best-practices/jsdevel.html#scanning-for-quickjs-incompatibilities

  • Do indexes build with Spidermonkey engine? (ensure to restart the nodes after toggling the config settings as in 3.4.2 the engine setting doesn't update at runtime)

@aduchate
Copy link
Author

pid:

erlang:process_info(Pid, current_stacktrace).

(couchdb@couchdb-02-lim-04.vrack.icure.cloud)1> erlang:process_info(<0.18820539.1>, current_stacktrace).
{current_stacktrace,[{gen,do_call,4,
                          [{file,"gen.erl"},{line,241}]},
                     {gen_server,call,3,[{file,"gen_server.erl"},{line,1218}]},
                     {couch_work_queue,dequeue,2,
                                       [{file,"src/couch_work_queue.erl"},{line,50}]},
                     {couch_mrview_updater,map_docs,2,
                                           [{file,"src/couch_mrview_updater.erl"},{line,154}]}]}

erlang:process_info(<0.18820539.1>).

[{current_function,{gen,do_call,4}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.1754697.1>]},
 {dictionary,[{task_status_update,{{0,0,0},500000}},
              {io_priority,{view_update,<<"shards/80000000-9fffffff/icure-ms-elsa-vancaster-cf8a1abf-ed"...>>,
                                        <<"_design/Message-DataOwner">>}},
              {task_status_props,[{changes_done,42824},
                                  {database,<<"shards/80000000-9fffffff/icure-ms-elsa-vancaster"...>>},
                                  {design_document,<<"_design/Message-DataOwner">>},
                                  {indexer_pid,<<"<0.180092364.0>">>},
                                  {progress,13},
                                  {started_on,1732686977},
                                  {total_changes,324883},
                                  {type,indexer},
                                  {updated_on,1732687030}]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.249.0>},
 {total_heap_size,1196281},
 {heap_size,196650},
 {stack_size,16},
 {reductions,12771499},
 {garbage_collection,[{max_heap_size,#{error_logger => true,include_shared_binaries => false,
                                       kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,220}]},
 {suspending,[]}]

indexer_pid:

erlang:process_info(<0.180092364.0>, current_stacktrace).

{current_stacktrace,[{gen_server,loop,7,
                                 [{file,"gen_server.erl"},{line,2250}]},
                     {proc_lib,init_p_do_apply,3,
                               [{file,"proc_lib.erl"},{line,329}]}]}

erlang:process_info(<0.180092364.0>).

[{current_function,{gen_server,loop,7}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.175818994.0>,<0.179769987.0>,<0.178322255.0>,
         <0.580.0>]},
 {dictionary,[{'$initial_call',{couch_index,init,1}},
              {io_priority,{view_update,<<"shards/80000000-9fffffff/icure-ms-elsa-vancaster-cf8a1abf-ed"...>>}},
              {'$ancestors',[<0.180269918.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.249.0>},
 {total_heap_size,17730},
 {heap_size,6772},
 {stack_size,11},
 {reductions,156876},
 {garbage_collection,[{max_heap_size,#{error_logger => true,include_shared_binaries => false,
                                       kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,144}]},
 {suspending,[]}]

A stuck indexer never moves anymore nothing changes in the values once it is stuck

The stuck indexers are completely random. After killing the stuck indexer, the index ends up building correctly.

QuickJS scanner show no problem. The indexes used to build correctly with spidermonkey but in this current build, spidermonkey is broken so, I cannot check if the problem also arises with spidermonkey.

I'll try building with otp 25 and let you know.

Thanks for the support
A.

@aduchate
Copy link
Author

aduchate commented Jan 7, 2025

I have some update in this one. Using OTP 25 instead of OTP 27 seems to fix the issue.

@nickva
Copy link
Contributor

nickva commented Jan 8, 2025

Thank you for the update, Antoine. That is an interesting find!

We have seen sometimes new Erlang version showing issues discovered by CouchDB code only, so that's certainly possibility. I guess it could be either OTP 26 or OTP 27.

Let's keep this issue open to serve as a record/reminder in case we narrow down the cause. If you get a chance it might be interesting at least to see if it's 26 or 27. In 26 I believe there were some OTP changes related to terminal IO processing which would be related to talking to the indexer couchjs processes (either Spidermonkey or QuickJS, it's all stdio basically).

@nickva nickva added erlang-vm-compatibility Possible Erlang VM compatibility issue and removed needs-triage labels Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug erlang-vm-compatibility Possible Erlang VM compatibility issue
Projects
None yet
Development

No branches or pull requests

2 participants