Nondeterminism in diameter reported by TLC #883
Comments
Issue: When running model checking with multiple workers, the maxLevel (or max depth) is sometimes larger than the maxLevel computed with a single worker. Some findings first:
The ModelChecker involved runs a BFS. I wanted to try the DFS one, but apparently it just doesn't work yet with multiple workers (#548 seems like a good issue to look into). Anyway, what happens is that when one worker is faster than another, a node can be discovered via a longer path. A sequential BFS would visit the tree in the following order:
E and D are effectively at depth 2. A concurrent visit with two threads (the prefix is the thread id):
Now node E is at depth 3, and boom. For the specification in question, I first reduced the constants of
Then in the codebase, I've patched the code here with:
To get the attached stack trace: stacktrace.txt. If you follow the stack trace against the graph above, you will find that it is a fair execution. This is possible because the next states are generated from the previous states within the worker itself. I haven't thought of a solution yet, but wanted to share my findings.
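For concreteness, here is a minimal, self-contained Java sketch of the race (not TLC code; the graph, class, and method names are invented for illustration, chosen to match the shape of the example above): each newly discovered state is stamped with its parent's depth plus one, and under one admissible two-worker schedule E is recorded at depth 3 even though a strict level-order BFS records it at depth 2.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical graph: A -> {B, C}, B -> {E}, C -> {D}, D -> {E}.
// A strict level-order BFS labels E with depth 2 (A -> B -> E). If a fast
// worker races ahead on the A -> C -> D branch, it can discover E first and
// label it with depth 3 (A -> C -> D -> E); the depth is never revised,
// because E is already in the "seen" map when the shorter path arrives.
public class NondeterministicDepth {

    static final Map<String, List<String>> SUCC = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("E"),
            "C", List.of("D"),
            "D", List.of("E"),
            "E", List.of());

    // Expand one state: record unseen successors at depth(state) + 1.
    static void expand(String state, Map<String, Integer> depth, Queue<String> queue) {
        for (String succ : SUCC.get(state)) {
            if (!depth.containsKey(succ)) {          // only the first discovery counts
                depth.put(succ, depth.get(state) + 1);
                queue.add(succ);
            }
        }
    }

    public static void main(String[] args) {
        // 1) Sequential BFS: states are expanded strictly in FIFO order.
        Map<String, Integer> seqDepth = new HashMap<>(Map.of("A", 0));
        Queue<String> seqQueue = new ArrayDeque<>(List.of("A"));
        while (!seqQueue.isEmpty()) {
            expand(seqQueue.poll(), seqDepth, seqQueue);
        }
        System.out.println("sequential BFS:          depth(E) = " + seqDepth.get("E")); // 2

        // 2) One admissible two-worker schedule, replayed deterministically:
        //    worker 2 races down A -> C -> D and discovers E at depth 3 before
        //    the slower worker 1 has produced B's successors. (The queue here
        //    only records discoveries; the expansion order models the schedule.)
        Map<String, Integer> parDepth = new HashMap<>(Map.of("A", 0));
        Queue<String> parQueue = new ArrayDeque<>();
        expand("A", parDepth, parQueue); // some worker expands A; B and C get depth 1
        expand("C", parDepth, parQueue); // fast worker 2 expands C; D gets depth 2
        expand("D", parDepth, parQueue); // fast worker 2 expands D; E gets depth 3
        expand("B", parDepth, parQueue); // slow worker 1 expands B; E is already seen
        System.out.println("one concurrent schedule: depth(E) = " + parDepth.get("E")); // 3
    }
}
```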
Is this deterministic depth feature even implemented for multiple workers, then? The algorithm should be to ensure all states with depth < N (where N is the depth of the error state) are visited before terminating, right? I can't imagine what it would be otherwise.
@FedericoPonzi Thanks for the refresher! No, a deterministic diameter is not what TLC implements, as that would require additional synchronization, limiting scalability. For small models, use a single worker. For large models, the diameter will almost always appear deterministic. Does anyone have the time to find and update the relevant documentation?
Unless @ahelwer wants to do it, I can try to find a good place to persist this information.
All yours!
There are the Toolbox help, the command-line help, and the current tools document (the markdown clone). The PDF is under Leslie's control.
Actually, I was mistaking this for an error-trace reporting issue in my description of the algorithm. There could be other methods. The simplest would be to ensure that workers finish exploring all nodes at distance N from the origin before any worker explores one at distance N+1. If work stealing from concurrent queues were implemented, we could avoid thread starvation. Would that be worth it?
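As an illustration of the level-synchronized idea described above (this is not TLC's implementation; all names and the toy graph are invented for the example), the workers in the following sketch drain the current frontier in parallel and meet at a barrier before any of them starts on the next level, so every state still receives its shortest-path depth:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Level-synchronized BFS sketch: no worker may expand a depth-(N+1) state
// before every depth-N state has been expanded, so depths stay deterministic.
public class LevelSyncBfs {

    static final Map<String, List<String>> SUCC = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("E"),
            "C", List.of("D"),
            "D", List.of("E"),
            "E", List.of());

    static final int WORKERS = 2;

    // Frontier of the current level, and the frontier being built for the next level.
    static volatile ConcurrentLinkedQueue<String> current = new ConcurrentLinkedQueue<>();
    static volatile ConcurrentLinkedQueue<String> next = new ConcurrentLinkedQueue<>();
    static final Set<String> seen = ConcurrentHashMap.newKeySet();
    static final AtomicInteger depth = new AtomicInteger(0);
    static volatile boolean done = false;

    // Every worker waits here after draining the current frontier; the barrier
    // action runs exactly once per level to swap frontiers and bump the depth.
    static final CyclicBarrier levelBarrier = new CyclicBarrier(WORKERS, () -> {
        if (next.isEmpty()) {
            done = true;                          // no new states: search is complete
        } else {
            current = next;
            next = new ConcurrentLinkedQueue<>();
            depth.incrementAndGet();
        }
    });

    static void work() {
        try {
            while (!done) {
                String s;
                while ((s = current.poll()) != null) {        // drain this level
                    for (String t : SUCC.get(s)) {
                        if (seen.add(t)) {                    // first discovery only
                            next.add(t);
                        }
                    }
                }
                levelBarrier.await();                         // wait for the whole level
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        seen.add("A");
        current.add("A");
        Thread[] workers = new Thread[WORKERS];
        for (int i = 0; i < WORKERS; i++) {
            workers[i] = new Thread(LevelSyncBfs::work);
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("max depth = " + depth.get());     // always 2 for this graph
    }
}
```

The cost is one global barrier per level, which is exactly the extra synchronization mentioned earlier; work stealing within a level would reduce idle time at the barrier but not remove the barrier itself.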
For larger, real-world specifications and models, a non-deterministic diameter is generally not an issue, except in pathological cases. The state queue remains the primary bottleneck in TLC's breadth-first search, so efforts to improve scalability should focus on replacing it. Replacing it is not a trivial task, because TLC workers globally synchronize on the state queue during checkpoints and liveness checking. A few years ago, I began designing a more scalable state queue based on dynamically sized sets of unexplored states.
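For the record, here is one toy shape such a design could take (purely illustrative; it is not derived from StateQueue.java or StateDeque.java, and all names are invented): each worker keeps its own deque of unexplored states and only steals from a peer when its own pool runs dry, so there is no single globally contended queue. Depths remain non-deterministic under this scheme; the point is scalability, not a deterministic diameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

// Toy work-stealing exploration: one deque of unexplored states per worker
// instead of a single globally synchronized queue. Illustrative only.
public class WorkStealingExploration {

    static final Map<String, List<String>> SUCC = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("E"),
            "C", List.of("D"),
            "D", List.of("E"),
            "E", List.of());

    static final int WORKERS = 2;
    static final List<ConcurrentLinkedDeque<String>> pools = new ArrayList<>();
    static final Set<String> seen = ConcurrentHashMap.newKeySet(); // shared "fingerprint" set

    static void work(int id) {
        int misses = 0;
        // Crude termination heuristic: a worker gives up after WORKERS consecutive
        // failed steal attempts. A real design needs proper termination detection.
        while (misses < WORKERS) {
            String s = pools.get(id).pollLast();               // prefer the local pool (LIFO)
            if (s == null) {
                s = pools.get((id + ++misses) % WORKERS).pollFirst(); // steal the oldest state from a peer
                if (s == null) {
                    continue;
                }
            }
            misses = 0;
            for (String t : SUCC.get(s)) {
                if (seen.add(t)) {                             // only the first discovery is kept
                    pools.get(id).addLast(t);
                }
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < WORKERS; i++) {
            pools.add(new ConcurrentLinkedDeque<>());
        }
        seen.add("A");
        pools.get(0).add("A");
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < WORKERS; i++) {
            final int id = i;
            Thread t = new Thread(() -> work(id));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("explored " + seen.size() + " states"); // 5 for this toy graph
    }
}
```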
More background: If I recall correctly, @Calvin-L once created a prototype of a priority queue. However, I am unsure whether this prototype replaced StateQueue.java or was derived from it. My recent work on StateDeque.java does derive from StateQueue.java, as I did not have the time to refactor the global synchronization out of StateQueue.java. There is also a verification of the deadlock-freedom of StateQueue.java. Note that our performance and scalability benchmarks are currently offline because we no longer have access to suitable hardware. These benchmarks are essential for any performance- or scalability-related work on TLC.
That sounds like an interesting concurrency problem! I also wonder what it would take to get a decent benchmarking system in place again. I honestly don't know much about scientifically valid benchmarking techniques.
We have a decent benchmarking system that can easily be extended with additional workloads, i.e., specs/models. However, the system currently lacks dedicated hardware since our previous sponsor withdrew. Perhaps it is possible to request resources from the TLA+ Foundation.
Persist the learning from issue #883. Diameter is non-deterministic when the BFS is run with multiple workers. Signed-off-by: Federico Ponzi <me@fponzi.me>
Done
Description
TLC occasionally reports varying diameter values with multiple workers. As discovered in this PR: tlaplus/Examples#122
Expected Behavior
Diameter is expected to be deterministic, per Markus.
Actual Behavior
TLC will occasionally report a varying state-graph diameter, although the total and distinct state counts remain the same.
Steps to Reproduce
Run TLC on specifications/glowingRaccoon/clean.tla or specifications/Paxos/MCVoting.cfg with -workers greater than 1.
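A sample invocation (a sketch only; it assumes tla2tools.jar is available locally, that the command is run from a clone of tlaplus/Examples, and that the spec's accompanying .cfg model file is picked up by default):

```sh
# Compare the diameter reported with one worker against a multi-worker run.
java -cp tla2tools.jar tlc2.TLC -workers 1 specifications/glowingRaccoon/clean.tla
java -cp tla2tools.jar tlc2.TLC -workers 4 specifications/glowingRaccoon/clean.tla
```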
Steps Taken to Fix
N/A
Possible Fix
N/A
Your Environment
Has been reproduced across multiple OSs and (recent) versions of TLC.