Cut ripgrep some slack in benchmarks ;-), document how to run benchmarks #104

Open
gvansickle opened this issue Oct 26, 2016 · 10 comments
Comments

@gvansickle
Owner

@BurntSushi rightly reported that the 0.3.0 benchmarks do not pass '-u' to rg, thus giving the other utilities, which don't look at .gitignore files, an arguably unfair advantage. Address this.

Also, document how to obtain the corpora and run the benchmarks. This was slated for 0.3.0 (in my best intentions at least), but didn't make it. Maybe add rg's Linux corpus into the mix as well; it's a more typical use case.
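For anyone reproducing the comparison, a minimal timing sketch in Python might look like the following. It is not this project's benchmark suite: `time_command` and `build_commands` are hypothetical helper names, and the only tool-specific detail taken from this thread is passing `-u` to `rg`. A plain `grep -r` is included as an assumed baseline.

```python
import subprocess
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded. The exit status is ignored on purpose:
        # grep-like tools exit with 1 when nothing matches.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        best = min(best, time.perf_counter() - start)
    return best


def build_commands(pattern, corpus_dir):
    """Return the command lines to compare (hypothetical selection)."""
    return {
        # -u tells ripgrep not to honor .gitignore files, which puts it on
        # the same footing as tools that never read them.
        "rg -u": ["rg", "-u", pattern, corpus_dir],
        "grep -r": ["grep", "-r", pattern, corpus_dir],
    }
```

With both tools installed, `{name: time_command(argv) for name, argv in build_commands("PM_RESUME", "linux").items()}` would give one best-of-three time per tool. Taking the minimum rather than the mean is a design choice that reduces noise from a cold cache on the first run.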

@gvansickle
Owner Author

@BurntSushi : In early results, it does look like you're beating me with the '-u':

| Test Case | built_ucg | inst_ucg | inst_ag | inst_ripgrep | inst_pcre2grep | inst_system_grep | inst_gnu_grep_e |
|-|-|-|-|-|-|-|
| TC2 | 0.266699 | 0.26657 | 1.55662 | 0.181029 | 0.915668 | 0.324283 | 0.322742 |

... and tables don't work in GitHub issue comments, great. :-/ Anyway, this is your second benchmark, PM_RESUME against the built linux tree. ucg == 0.267, rg -u == 0.181. Except something's not right: you and everyone else are getting 5 hits, I'm getting 11 hits. As the wise Mark Freuder Knopfler, OBE once said, "Two men say they're Jesus/One of them must be wrong", and I'm guessing that may be me in this instance...

...well, wait again. I'm actually getting 5 hits, but also detecting 6 recursive directory loops due to symlinks (which are mistakenly being counted as hits). You're doing a physical traversal right? I'm defaulting to logical.
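The difference between the two traversal modes can be sketched in Python. This is only an illustration, not how ucg or ripgrep walk directories: `walk_files` is a hypothetical function, and detecting loops by comparing `(st_dev, st_ino)` against the directories on the current path is one assumed approach among several.

```python
import os


def walk_files(root, follow_symlinks=False):
    """Return (files, loops) found under root.

    follow_symlinks=False is a "physical" walk: symlinks are never followed.
    follow_symlinks=True is a "logical" walk: symlinked directories are
    entered, and a directory that is already on the current path is
    reported as a loop instead of being descended into again.
    """
    files, loops = [], []

    def visit(path, ancestors):
        st = os.stat(path)  # follows symlinks, so we see the real directory
        key = (st.st_dev, st.st_ino)
        if key in ancestors:
            loops.append(path)  # a loop is not a hit; record it separately
            return
        ancestors = ancestors | {key}
        with os.scandir(path) as it:
            entries = sorted(it, key=lambda e: e.name)
        for entry in entries:
            if entry.is_dir(follow_symlinks=follow_symlinks):
                visit(entry.path, ancestors)
            elif entry.is_file(follow_symlinks=follow_symlinks):
                files.append(entry.path)

    visit(root, frozenset())
    return files, loops
```

Keeping loops in their own list is what avoids the miscount described above, where detected loops were added to the hit total.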

@BurntSushi

Yeah, my impression is that standard behavior is to not follow symlinks, so ripgrep won't do it by default. If you pass -L, then that should do the trick.

@BurntSushi

Also, I'm kind of surprised at how slow ag is for you. Are you running on a VM?

@gvansickle
Owner Author

Yeah, the Fedora 24 numbers are on VirtualBox. If GitHub supported tables, I could post the system info my benchmark suite obtains for you here. I'm working on getting the results into HTML form suitable for posting (graphs and everything), but I'm not quite there yet. Let me try the table for that specific system:

Test System Details

| Parameter | Value |
|-|-|
| Distribution | Fedora 24 (Workstation Edition) |
| Kernel name | Linux |
| Kernel release | 4.7.9-200.fc24.x86_64 |
| Kernel build info | '#'1 SMP Thu Oct 20 14:26:16 UTC 2016 |
| CPU model name | Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz |
| CPU architecture | x86_64 |
| CPU number of sockets | 1 |
| CPU cores per socket | 4 |
| CPU threads per core | 1 |
| CPU ISA extensions | apic clflush cmov constant_tsc cx8 de eagerfpu fpu fxsr ht hypervisor lahf_lm lm mca mce mmx msr mtrr nonstop_tsc nopl nx pae pat pge pni pse pse36 rdtscp rep_good sep sse sse2 sse4_1 sse4_2 ssse3 syscall tsc vme xtopology |
| Hypervisor present | Yes |
| Hypervisor vendor | Oracle VM VirtualBox |
| Hypervisor type | full |

I guess that's almost readable. Like I said before, it's way past time for me to get a new rig. Never enough round tuits.....

@BurntSushi

BurntSushi commented Oct 27, 2016

Yeah, in my testing the silver searcher does much worse in a virtual machine than on a native system, and my current hypothesis is that memory maps are the cause. You can test it for yourself by passing the --mmap flag to ripgrep. I bet you'll see a noticeable slowdown. :-)

(That's not to say it invalidates your benchmark. Running these tools in a VM is a perfectly common and legitimate use case. But it's probably important to acknowledge or at least understand.)
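To try the memory-map hypothesis outside of any particular tool, the two access strategies can be put side by side in a few lines of Python. This is a rough sketch and does not reproduce either tool's internals: `count_read` and `count_mmap` are hypothetical names, and both simply count non-overlapping occurrences of a byte string.

```python
import mmap
import os


def count_read(path, needle):
    """Count matches after copying the file into memory with read()."""
    if not needle:
        raise ValueError("needle must be non-empty")
    with open(path, "rb") as f:
        return f.read().count(needle)


def count_mmap(path, needle):
    """Count matches by searching a read-only memory map of the file."""
    if not needle:
        raise ValueError("needle must be non-empty")
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(needle)
            while pos != -1:
                count += 1
                # Resume after the match so occurrences do not overlap.
                pos = m.find(needle, pos + len(needle))
            return count
```

Timing the two functions over the same tree, once natively and once inside a VM, is one way to check whether the mapped variant is the one that degrades.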

@gvansickle
Owner Author

gvansickle commented Oct 27, 2016

Yep, I've done the experiments too (ucg still has some dormant mmap code in it if you look hard enough). I can build for Cygwin, and even there it's a hit. Not knowing what all is going on under the hood, it sure does seem contrary to what one would expect. But yeah, with virtualization, it makes a bit more sense that there'd be a hit here.

Similar topic: Have you tried asynchronous I/O for reading in the files? I have not, but I'm curious if that's any better than just a read() loop.

@BurntSushi

BurntSushi commented Oct 27, 2016

@gvansickle I've always heard pretty terrible things about async I/O on Linux, so I've never tried it. See: http://stackoverflow.com/questions/8513663/linux-disk-file-aio

Note that ripgrep does I/O differently from ucg. When it doesn't use memory maps, it reads incrementally. I think ucg just slurps the entire file in at once and then searches it, right?

@gvansickle
Owner Author

Right. I try to read the entire file with one read() call. Honestly I was a bit surprised that worked as well as it does. Again it's probably due to the use-case: mostly smallish files. I should gather statistics on that.....
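The two reading strategies discussed here can be contrasted with a small Python sketch. Neither function is the actual ucg or ripgrep code; `count_slurp` and `count_incremental` are hypothetical names, and the 64 KiB default buffer size is an arbitrary assumption. The slurp version asks for the whole file in one `read()` and only loops if the read comes back short.

```python
import os


def count_slurp(path, needle):
    """Read the whole file, ideally with a single read(), then search it."""
    if not needle:
        raise ValueError("needle must be non-empty")
    fd = os.open(path, os.O_RDONLY)
    try:
        remaining = os.fstat(fd).st_size
        chunks = []
        while remaining > 0:
            chunk = os.read(fd, remaining)  # usually returns everything at once
            if not chunk:
                break
            chunks.append(chunk)
            remaining -= len(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks).count(needle)


def count_incremental(path, needle, bufsize=64 * 1024):
    """Search the file one fixed-size buffer at a time."""
    if not needle:
        raise ValueError("needle must be non-empty")
    keep = len(needle) - 1  # longest suffix that could begin a match
    count = 0
    tail = b""
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            window = tail + chunk
            pos = 0
            while True:
                i = window.find(needle, pos)
                if i == -1:
                    break
                count += 1
                pos = i + len(needle)
            # Carry over only bytes that are unmatched and could still start
            # a match that finishes in the next buffer.
            tail = window[max(pos, len(window) - keep):]
    return count
```

The carry-over of at most `len(needle) - 1` bytes is what lets the incremental version find matches that straddle a buffer boundary while keeping memory use bounded, which matters more as files get larger.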

@BurntSushi

> Honestly I was a bit surprised that worked as well as it does.

Me too. :P It actually holds up pretty well even on largeish files too. (Look at the subtitle benchmarks.)

@gvansickle
Owner Author

@BurntSushi : I just updated the one benchmark in the README.md. Sorry it took so long (in more ways than one: now you're winning! ;-))
I still have to document the "how to reproduce" better, but it's still little more than a "make check" away. Hopefully this long weekend.
