Regexes sorting is incorrect #33
According to the regex list, there are two (or maybe even more) regexes that match the 'specificUA'. After that, you hit another regex, at a "lower" index in the list, with a UA that doesn't match the regex from the first case. This causes the sort to "pop up" the regex that matched the broadUA to the top of the list. After that, you try to find a match for the specificUA once again, and now the regex that matches both specificUA and broadUA is at the top of the list, so that one matches. What's wrong here? To summarize, I can see the following scenario: Am I missing something?
You're right, regex sorting works this way.
This is because the regex '(Opera Mini)(?:/att)?/(\d+).(\d+)' (which is matched on the first call) is located at line 109 of regexes.yaml, and the regex '(Opera)/9.80.*Version/(\d+).(\d+)(?:.(\d+))?' (which is matched in the looped calls and affects the scoring) is located at line 110. As I described in my pull request, my code was supposed to reduce lookup time by "popping up" the most "usable" UA. And I think it does that well, especially on a heavily loaded cluster (4-6M requests/min).
Yes, that's exactly what sorting means. The result of the second call is indeed incorrect: the input user agent is Opera Mini, but it is detected as Opera (Opera != Opera Mini).
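The effect discussed above can be reproduced in isolation. The following is a minimal sketch, not the library's code: the `pattern` type, `firstMatch`, `sortByHits`, and both user agent strings are hypothetical, and the two regexes are copied as quoted in this thread. It assumes a first-match scan over a list that is re-sorted by hit count.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// pattern pairs a compiled regex with a hit counter (hypothetical type).
type pattern struct {
	name string
	re   *regexp.Regexp
	hits int
}

// firstMatch returns the name of the first pattern in list order that
// matches ua, and increments that pattern's hit counter.
func firstMatch(ps []*pattern, ua string) string {
	for _, p := range ps {
		if p.re.MatchString(ua) {
			p.hits++
			return p.name
		}
	}
	return "Other"
}

// sortByHits moves the most frequently matched patterns to the front.
func sortByHits(ps []*pattern) {
	sort.SliceStable(ps, func(i, j int) bool { return ps[i].hits > ps[j].hits })
}

// demo parses the specific UA before and after the list has been re-sorted.
func demo() (before, after string) {
	ps := []*pattern{
		{name: "Opera Mini", re: regexp.MustCompile(`(Opera Mini)(?:/att)?/(\d+).(\d+)`)},
		{name: "Opera", re: regexp.MustCompile(`(Opera)/9.80.*Version/(\d+).(\d+)(?:.(\d+))?`)},
	}
	// Illustrative user agents: the first matches both regexes, the second
	// matches only the broader Opera regex.
	specificUA := "Opera/9.80 (J2ME/MIDP; Opera Mini/7.1; U; en) Presto/2.8 Version/11.10"
	broadUA := "Opera/9.80 (Windows NT 6.1) Presto/2.12 Version/12.16"

	before = firstMatch(ps, specificUA)
	for i := 0; i < 3; i++ {
		firstMatch(ps, broadUA) // raises the score of the broad regex
	}
	sortByHits(ps)
	after = firstMatch(ps, specificUA)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Printf("before sort: %s, after sort: %s\n", before, after)
}
```

In this sketch the first lookup reports Opera Mini and the lookup after re-sorting reports Opera, because the broader regex has moved ahead of the more specific one.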
In conclusion, the sorting is correct, but in some cases, like the one described above, it can affect which regex matches. Unfortunately, Golang doesn't support negative (or positive) lookbehind assertions, so creating a regex that excludes (Opera Mini) is not an option. So currently I can see two solutions:
I'll try to implement some workaround... gkalabin, thank you for pointing out this issue.
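For reference, the limitation is easy to confirm: Go's standard `regexp` package returns an error when asked to compile a lookaround assertion. The sketch below also shows one possible direction, which is only an assumption and not necessarily either of the solutions mentioned above: pairing each regex with an optional exclusion regex. The `rule` type and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical include/exclude pair that stands in for a single
// regex with a negative assertion.
type rule struct {
	include *regexp.Regexp
	exclude *regexp.Regexp // nil means "no exclusion"
}

// matches reports whether ua matches include and does not match exclude.
func (r rule) matches(ua string) bool {
	if !r.include.MatchString(ua) {
		return false
	}
	return r.exclude == nil || !r.exclude.MatchString(ua)
}

// compiles reports whether Go's regexp package accepts expr.
func compiles(expr string) bool {
	_, err := regexp.Compile(expr)
	return err == nil
}

// operaRule matches plain Opera but rejects anything mentioning Opera Mini.
func operaRule() rule {
	return rule{
		include: regexp.MustCompile(`(Opera)/9.80.*Version/(\d+).(\d+)(?:.(\d+))?`),
		exclude: regexp.MustCompile(`Opera Mini`),
	}
}

func main() {
	// Both lookbehind and lookahead are rejected at compile time.
	fmt.Println("lookbehind compiles:", compiles(`(?<!Mini )Opera`))
	fmt.Println("lookahead compiles:", compiles(`Opera(?! Mini)`))

	r := operaRule()
	fmt.Println(r.matches("Opera/9.80 (J2ME/MIDP; Opera Mini/7.1; U; en) Presto/2.8 Version/11.10"))
	fmt.Println(r.matches("Opera/9.80 (Windows NT 6.1) Presto/2.12 Version/12.16"))
}
```

With an exclusion in place, the position of the broad rule in the list no longer matters for Opera Mini user agents, at the cost of a second regex evaluation per candidate.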
Cool - thanks for diving in. In the past when we've discussed tradeoffs I'm open to sorting things if they don't affect readability (e.g. a change :-)
I'm skeptical of the whole sorting feature in the first place. We just merged caching (#75), which should give a huge speedup for repeated queries of the same user agents, which is likely when parsing UAs from web traffic. So that might fill the performance need a lot better than this sorting feature, without the downside of potentially incorrect detections.
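To illustrate why caching avoids the ordering problem, here is a rough sketch of a cache placed in front of a fixed-order scan. It is not the implementation merged in #75: `cachedParser`, `family`, and the `scans` counter are hypothetical names, the map is unbounded for brevity, and the regexes are copied as quoted in this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// Illustrative user agents (assumptions, not taken from the thread).
const (
	specificUA = "Opera/9.80 (J2ME/MIDP; Opera Mini/7.1; U; en) Presto/2.8 Version/11.10"
	broadUA    = "Opera/9.80 (Windows NT 6.1) Presto/2.12 Version/12.16"
)

type namedPattern struct {
	name string
	re   *regexp.Regexp
}

// cachedParser keeps the regex list in its original order and memoizes
// results per user agent string.
type cachedParser struct {
	mu       sync.Mutex
	patterns []namedPattern    // fixed order, never re-sorted
	cache    map[string]string // UA string -> detected family
	scans    int               // full regex scans performed, for illustration
}

func newCachedParser() *cachedParser {
	return &cachedParser{
		patterns: []namedPattern{
			{"Opera Mini", regexp.MustCompile(`(Opera Mini)(?:/att)?/(\d+).(\d+)`)},
			{"Opera", regexp.MustCompile(`(Opera)/9.80.*Version/(\d+).(\d+)(?:.(\d+))?`)},
		},
		cache: make(map[string]string),
	}
}

// family returns the cached result for ua, or scans the list in order and
// stores the first match.
func (p *cachedParser) family(ua string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if name, ok := p.cache[ua]; ok {
		return name
	}
	p.scans++
	name := "Other"
	for _, np := range p.patterns {
		if np.re.MatchString(ua) {
			name = np.name
			break
		}
	}
	p.cache[ua] = name
	return name
}

func main() {
	p := newCachedParser()
	fmt.Println(p.family(specificUA))
	for i := 0; i < 3; i++ {
		p.family(broadUA)
	}
	fmt.Println(p.family(specificUA), "scans:", p.scans)
}
```

Because the list order never changes, repeated lookups return the same answer as the first one, and only distinct user agents pay for a full scan. A production cache would need a size bound or eviction policy.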
According to specification:
Here is the proof that sorting of regexes will cause wrong detection results:
Result: