Explore every move twice before normal training self-play search #698
It sounds like for chess there are typically at most ~50 legal moves, so forcing 2 visits per move out of 800 might not be too bad at roughly 10% of visits. Whereas for go there could be ~300 legal moves, so even with 3200 visits the overhead is close to 20%. In particular, for training data, if the forced visits are not removed, a move that should have a ~100% prior would appear with only ~90% of visits in the training data if 10% were given to other moves.
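For reference, a quick back-of-the-envelope check of those percentages (the move counts are just the rough estimates above, not exact figures):

```python
# Rough overhead of forcing 2 visits per legal move, using the
# approximate move counts and visit budgets mentioned above.
def forced_visit_fraction(legal_moves, total_visits, forced_per_move=2):
    return (legal_moves * forced_per_move) / total_visits

print(forced_visit_fraction(50, 800))    # chess: 0.125, roughly 10% of an 800-visit search
print(forced_visit_fraction(300, 3200))  # go: 0.1875, close to 20% even at 3200 visits

# If ~10% of visits go to forced exploration and are not removed, a move
# that should have a ~100% prior shows up with only ~90% of the visits.
print(1.0 - 0.10)
```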
Here are some more CCLS games when running with CCLS Season 3 - Elite League /// Leela Chess Zero v0.10 ID 351 Gauntlet GTX
A slightly different case, where lczero didn't consider a better move for itself:
Hi, interesting idea! If I understand it correctly, the TL;DR version is to give each root move 2 forced visits, but subtract these visits from the training count. I think this might be a good idea. I also think, though, that this points to a larger issue: how can we improve the search? Basically, if forcing 2 visits on each root move is "worth it" within an 800-node search, then this is evidence that the default AlphaZero search is easy to improve. (And why wouldn't it be?) Alpha-beta search in traditional chess engines has gone through many, many tweaks and additions over the years (null move, reductions, extensions, quiescence search, and I am sure many other things). It seems self-evident that a foundation of PUCT search can also be improved upon immensely. In fact, didn't the Komodo team claim that they added 300 Elo by improving the PUCT search? I think this is something we should put more thought into, and I encourage further thinking and also testing of your idea to get things started. Sooner or later there will be too much Elo to pick up for us to ignore this because of a "zero purity" philosophy.
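For discussion's sake, here is a minimal sketch of how I read the mechanics -- plain Python with a toy rollout, not lc0 code; `toy_visit`, the node dicts, and the budgets are illustrative assumptions, not the actual implementation:

```python
import random  # only for the toy stand-in evaluation below

FORCED_VISITS = 2  # forced visits per root move, per the proposal

def toy_visit(node):
    """Stand-in for a real network eval + backup; accumulates a random value."""
    node["visits"] += 1
    node["value_sum"] += random.uniform(-1.0, 1.0)

def search_with_forced_root_visits(root_children, total_budget=800):
    # 1) Force a couple of visits on every root move, so a move with a
    #    near-zero prior still gets a chance to show a surprising eval.
    for child in root_children:
        for _ in range(FORCED_VISITS):
            toy_visit(child)

    # 2) Spend the remaining budget with the normal selection rule
    #    (here just "pick the best mean value" as a placeholder for PUCT).
    spent = FORCED_VISITS * len(root_children)
    for _ in range(max(0, total_budget - spent)):
        best = max(root_children, key=lambda c: c["value_sum"] / c["visits"])
        toy_visit(best)

    # 3) For the training target, subtract the forced visits so moves the
    #    search never chose on its own don't get an artificial boost.
    return {c["move"]: max(0, c["visits"] - FORCED_VISITS)
            for c in root_children}

# Example: three hypothetical root moves.
children = [{"move": m, "visits": 0, "value_sum": 0.0}
            for m in ("e4", "d4", "Rxh4")]
print(search_with_forced_root_visits(children, total_budget=100))
```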
Let me add that while we should certainly respect the expertise of the Google team, it would also be naive to think that they have already "figured everything out". Remember Deep Blue: it was also a rather primitive engine/search that was later surpassed by far by commercial engines with new search ideas.
Thinking some more, I see a problem that I think warrants some consideration. Say we have 10 moves that the policy thinks are "crap", and it gives them a crappy 0.8%. One of them, Rxh4, is actually strong, and a 2-visit check would reveal it, but the policy has not learned to identify this promising move. The other 9 moves really are crap, but they DO kinda warrant a 1-visit check to verify they are crap in an 800-node search. If we now give each move two visits for free, we would indeed help the net search this Rxh4 move more and learn to identify its promising features on its own over time. But we would also teach the network that it doesn't need to spend even 1 node on the other 9 crap moves (because the 2 free nodes already checked them out and showed they were not worth further search), even though they actually warrant a 1-visit check from a global "let's be sure" perspective. So we might speed up learning to identify rare tactical shots, but we would also teach the net not to check out the average random crap move often enough (unless we also used the same "free" 2-visit search in match settings).
That's where we are now, and that's MCTS working as intended. If a move is truly bad, search knows not to waste time even visiting it. If a move is only sometimes bad, the training data should increase the prior to levels that are searchable for match settings.
Let's say we have 10 moves that together warrant 15 visits: 1 each, and then 5 more on a move that "randomly" warrants a bit more checking out (we cannot teach the policy everything, and we assume the net cannot tell without a visit). If we give each move 2 free visits, we would teach the net that it only needs to spend on average 4-5 visits on these 10 moves instead of the 10-15 they actually warrant. So it would not even give them 1 visit.
I tried a lot of these schemes at the root and throughout the tree; unfortunately, in self-play they are always hugely inferior. Just try the selfplay option of LC0 to test your approach (minimum 1000 games). If you just want to find tactics, then this works; you can also tweak search by making these changes progressively less invasive as you go further down the tree - but then again, this almost never translates to an improvement in strength.
Inferior when playing against itself without the changes? That's expected, similar to how, in self-play, one side picking the most-visited moves beats another side picking moves proportionally to visits.
Why not simply increase the PUCT value?
I'm assuming you mean specifically increasing PUCT at the noised root -- yes, it will probably increase the likelihood of exploring a noised move when other moves look bad. Notably, PUCT scales up the prior for all moves and reduces the impact of the win rate eval. Any particular PUCT numbers you think I should run to report in #699?
For LC0, yes, 3.1 as the PUCT value. I have been running various time controls and conditions, and one value has been consistent: 3.1
That seems like it would work with the "vanilla" UCT formula (unbiased UCB term). But with the AlphaGo-like formula (AGZ/AZ, LZ and LCZ), the policy bias is injected into the UCB term in a multiplicative manner, so a high PUCT value cannot compensate for a very low prior.
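To make the multiplicative point concrete: the AlphaZero-style exploration term is (as I understand it) U(s,a) = c_puct * P(a) * sqrt(N_parent) / (1 + N(a)), so the prior scales the whole bonus. A small illustration with made-up numbers:

```python
import math

def puct_bonus(c_puct, prior, parent_visits, child_visits):
    """AlphaZero-style exploration term: c * P(a) * sqrt(N_parent) / (1 + N_child)."""
    return c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Compare an unvisited move with a healthy prior against one with a
# 0.33% prior, after 800 visits at the root (numbers made up).
for c in (0.85, 3.1):
    strong = puct_bonus(c, 0.30, 800, 0)
    rare = puct_bonus(c, 0.0033, 800, 0)
    print(f"c_puct={c}: bonus(P=0.30)={strong:.2f}  bonus(P=0.0033)={rare:.3f}")

# Raising c_puct scales both bonuses by the same factor, so the ~90x gap
# between them never closes; a tiny prior stays effectively unsearchable.
```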
Not alone, no. I ran a long CLOP tune with three settings, and it came up with the following values after 710 trials: I then ran a match against outside engines with both the default settings and these, and the latter showed a 63 Elo increase. Out of curiosity, I also tested them on a revised version of the WAC tactics suite: the default settings solved 109/200 and these solved 159/200. In other words, they are not only stronger in play, but are also vastly better at tactics.
This is a bit of a departure from AZ's Dirichlet noise, but it's still "zero": an additional kind of "noise" that hopes to better direct self-play's search by getting a better eval from a quick check of how, say, white would evaluate the board for each of the likeliest subsequent white moves -- i.e., 2 visits.
In the first CCLS SCTR vs id359 game (lczero as black), lczero, evaluating white SCTR's position, would not consider the winning move Rxh4:
https://clips.twitch.tv/NimbleLazyNewtPRChase
Even with noise, it's unlikely to find:
However, forcing 2 visits with something like:
…quickly finds the move:
At least for this position, the 2 visits on Rxh4 were enough for normal search to drive the majority of visits to the move, and in this case, the training data would boost the prior to closer to 80% instead of the current 0.33%.
@jkiliani has pointed out in leela-zero/leela-zero#1408 (comment) that with lower visit counts, we should definitely be careful about subtracting these forced visits from the training data if search determined it shouldn't put any more visits into the move. It's also unclear whether these forced visits should count towards the total visit limit that stops the search.