Interim. README and NEWS updates.
gvansickle committed Oct 23, 2016
1 parent 5553946 commit f69f9e5
Showing 4 changed files with 40 additions and 18 deletions.
12 changes: 8 additions & 4 deletions NEWS.md
@@ -1,16 +1,19 @@
# NEWS file for the UniversalCodeGrep project.

## [UNRELEASED]
## [0.3.0] - 2016-10-23

Major feature/bugfix release of UniversalCodeGrep (ucg).

### New Features
- More than 30% faster than ucg 0.2.2 on most benchmarks.
- New file inclusion/exclusion options:
- `ack`-style `--ignore-file=FILTER:FILTERARGS`: Files matching FILTER:FILTERARGS (e.g. "ext:txt,cpp") will be ignored.
- `grep`-style `--include=GLOB`: Only files matching GLOB will be searched.
- `grep`-style `--exclude=GLOB`: Files matching GLOB will be ignored.
- `ag`-style `--ignore=GLOB`: Files matching GLOB will be ignored. Note that unlike `ag`'s option, this does not apply to directories.
- Files and directories specified on the command line (including hidden files) are now scanned regardless of ignore settings, and in the case of files, whether they are recognized as text files.
- `--TYPE`- and `--noTYPE`-style options now support unique-prefix matching. E.g., `--py`, `--pyth`, and `--python` all select the Python file type.
- OSX and PC-BSD now supported.
- OS X and some *BSDs now supported. Builds and runs on Xcode6.1/OS X 10.9 through Xcode 8gm/OS X 10.11.
- Now compiles and links with either or both of libpcre and libpcre2, if available. Defaults to using libpcre2 for matching.
- Directory tree traversal now uses more than one thread (two by default). Can be overridden with new "--dirjobs" command-line parameter. Overall performance improvement on all platforms vs. 0.2.2 (e.g., ~25% on Fedora 23 with hot cache).
- New portable function multiversioning infrastructure. Currently used by the following features:
@@ -24,8 +27,8 @@
- Scanner threads now use a reusable buffer when reading in files, reducing memory allocations by ~10% (and ~40% fewer bytes allocated) compared to version 0.2.2.
- Refactored FileScanner to be a base class with derived classes handling the particulars of using libpcre or libpcre2 to do the scanning.
- Added a basic diagnostic/debug logging facility.
- ResizableArray now aligns allocations to 64-byte boundaries to match Core i7 cache line sizes in an effort to prevent false sharing.
- Performance/Benchmarking infrastructure and test expansion and improvements.
- ResizableArray now takes an alignment parameter. File buffer allocations are now done on max(ST_BLKSIZE,128k)-byte boundaries.
- Testing/Benchmarking infrastructure expansion and improvements.

### Fixed
- Cygwin now requires AC_USE_SYSTEM_EXTENSIONS for access to get_current_dir_name(). Resolves #76.
@@ -34,6 +37,7 @@
- Resolved segfaults on some systems due to dirname() modifying its parameter. Resolves #96.
- No longer treating PCRE2 reporting no JIT support as an error. Resolves #100.


## [0.2.2] - 2016-04-09

Minor feature/bugfix release of UniversalCodeGrep (ucg).
40 changes: 28 additions & 12 deletions README.md
@@ -44,22 +44,34 @@ UniversalCodeGrep (ucg) is an extremely fast grep-like tool specialized for sear

## Introduction

UniversalCodeGrep (ucg) is an extremely fast grep-like tool specialized for searching large bodies of source code. It is intended to be largely command-line compatible with [`Ack`](http://beyondgrep.com/), to some extent with [`ag`](http://geoff.greer.fm/ag/), and where appropriate with `grep`. Search patterns are specified as PCRE regexes.
UniversalCodeGrep (`ucg`) is an extremely fast grep-like tool specialized for searching large bodies of source code. It is intended to be largely command-line compatible with [`Ack`](http://beyondgrep.com/), to some extent with [`ag`](http://geoff.greer.fm/ag/), and where appropriate with `grep`. Search patterns are specified as PCRE regexes.

### Speed
`ucg` is intended to address the impatient programmer's code searching needs. `ucg` is written in C++11 and takes advantage of the concurrency (and other) support of the language to increase scanning speed while reducing reliance on third-party libraries and increasing portability. Regex scanning is provided by the [PCRE library](http://www.pcre.org/), with its [JIT compilation feature](http://www.pcre.org/original/doc/html/pcrejit.html) providing a huge performance gain on most platforms.
`ucg` is intended to address the impatient programmer's code searching needs. `ucg` is written in C++11 and takes advantage of the concurrency (and other) support of the language to increase scanning speed while reducing reliance on third-party libraries and increasing portability. Regex scanning is provided by the [PCRE2 library](http://www.pcre.org/), with its [JIT compilation feature](http://www.pcre.org/current/doc/html/pcre2jit.html) providing a huge performance gain on most platforms.

As a consequence of its use of these facilities and its overall design for maximum concurrency and speed, `ucg` is extremely fast. Under Fedora 23, scanning the Boost 1.58.0 source tree with `ucg` 0.2.2, [`ag`](http://geoff.greer.fm/ag/) 0.31.0, and `ack` 2.14 produces the following results:
As a consequence of its use of these facilities and its overall design for maximum concurrency and speed, `ucg` is extremely fast. Under Fedora 24, scanning the Boost 1.58.0 source tree with `ucg` 0.3.0, [`ag`](http://geoff.greer.fm/ag/) 0.31.0, and `ack` 2.14 produces the following results:

| Command | Elapsed Real Time, Average of 5 Runs |
| Command | Elapsed Real Time, Average of 10 Runs |
|---------|-----------------------|
| `time ucg --noenv --cpp 'BOOST.*HPP' ~/src/boost_1_58_0` | ~ 0.404 seconds |
| `time ag --cpp 'BOOST.*HPP' ~/src/boost_1_58_0` | ~ 5.8862 seconds |
| `time ack --noenv --cpp 'BOOST.*HPP' ~/src/boost_1_58_0` | ~ 12.0398 seconds |

#### Benchmark: '#include\s+".*"' on Boost source

| Command | Program Version | Elapsed Real Time, Average of 10 Runs | Num Matched Lines | Num Diff Chars |
|---------|-----------------|---------------------------------------|-------------------|----------------|
| `ucg --noenv --cpp '#include\s+.*' ../../../../../boost_1_58_0` | 0.3.0 | 0.212767 | 9511 | 189 |
| `/usr/bin/ucg --noenv --cpp '#include\s+.*' ../../../../../boost_1_58_0` | 0.2.2 | 0.262368 | 9511 | 189 |
| `/usr/bin/ag --cpp '#include\s+.*' ../../../../../boost_1_58_0` | 0.32.0 | 1.90161 | 9511 | 189 |
| `/usr/bin/rg -n -t cpp '#include\s+.*' ../../../../../boost_1_58_0` | 0.2.3 | 0.262967 | 9509 | 0 |


BurntSushi Oct 26, 2016

I suspect you need to pass -u here for this to be a fair comparison. (Which is done in my blog post.)

Could you also provide instructions on how to get the corpus you're benchmarking with?


gvansickle Oct 26, 2016

Author Owner

Hey @BurntSushi,
(read this, then I had a heck of a time finding it again. Is there a way on GitHub to find review comments like this without manually looking at each commit, do you know?)

> I suspect you need to pass -u here for this to be a fair comparison. (Which is done in my blog post.)

Good point, I missed that one. I'll create an issue for this and your next item. Though, as you noted in your analysis, it gets awful hard awful quick to determine what constitutes a "fair comparison". I've essentially been using a working definition along the lines of "the absolute minimum beyond the defaults to get comparable output".

> Could you also provide instructions on how to get the corpus you're benchmarking with?

Absolutely. Everything you need to reproduce my benchmarks is included in the tarball, with the exception of the Boost 1.58.0 tar.gz distro. Untar that alongside the ucg tarball tree, and do a "make check"; it should automatically find the boost tree and run the appropriate tests. Again, I'll get that info into the README or somewhere on the same issue mentioned above.

Good to hear from you,

GRVS


BurntSushi Oct 26, 2016

> (read this, then I had a heck of a time finding it again. Is there a way on GitHub to find review comments like this without manually looking at each commit, do you know?)

Hmm, I have no idea. Usually I just follow the link from an email. :-)

> Though, as you noted in your analysis, it gets awful hard awful quick to determine what constitutes a "fair comparison".

Indeed. That's why I tried to frame it as a task that the user is trying to perform. In this case, the difference is between "I want to search precisely a whitelist" and "I want to search a whitelist while respecting my gitignore files."

> with the exception of the Boost 1.58.0 tar.gz distro.

Ah! I think this is what I was looking for. I could probably guess how to get it, but a direct link (or instructions) would be helpful. Is it one of these? https://sourceforge.net/projects/boost/files/boost/1.58.0/

Thanks!


gvansickle Oct 26, 2016

Author Owner

> I could probably guess how to get it, but a direct link (or instructions) would be helpful.

Absolutely. I've opened #104 to document that better (== at all).

> Is it one of these? https://sourceforge.net/projects/boost/files/boost/1.58.0/

It is in fact (looking at my .travis.yml....) yep, this one: http://downloads.sourceforge.net/project/boost/boost/1.58.0/boost_1_58_0.tar.bz2. Like I said, just untar that alongside the ucg directory, and you should be good to go with "make check".

I'm probably going to add the capability to run against your built Linux tree corpus though, that's certainly closer to the average use-case for these types of tools.

GRVS
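
The corpus setup described in this thread (fetch the Boost 1.58.0 tarball, unpack it next to the ucg tree, run `make check`) could be sketched roughly as follows. This is only an illustration: the helper names `corpus_dir`, `fetch_corpus`, and `run_benchmarks` are hypothetical, and it assumes the ucg tree has already been configured and that the download URL quoted above is still reachable.

```python
import subprocess
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in the thread above; whether it still resolves is not verified here.
BOOST_URL = "http://downloads.sourceforge.net/project/boost/boost/1.58.0/boost_1_58_0.tar.bz2"


def corpus_dir(ucg_dir):
    """Return where the Boost tree should live: a sibling of the ucg directory."""
    return Path(ucg_dir).parent / "boost_1_58_0"


def fetch_corpus(ucg_dir):
    """Download and unpack Boost 1.58.0 alongside ucg_dir unless it is already there."""
    target = corpus_dir(ucg_dir)
    if not target.exists():
        archive, _headers = urllib.request.urlretrieve(BOOST_URL)
        with tarfile.open(archive, "r:bz2") as tar:
            # The archive is assumed to unpack into a top-level boost_1_58_0/ directory.
            tar.extractall(target.parent)
    return target


def run_benchmarks(ucg_dir):
    """Hypothetical driver: make sure the corpus exists, then run `make check`."""
    fetch_corpus(ucg_dir)
    # Assumes ./configure has already been run inside ucg_dir.
    subprocess.run(["make", "check"], cwd=ucg_dir, check=True)
```

Calling `run_benchmarks("path/to/ucg")` would then exercise the same steps by hand; nothing here is part of the ucg build system itself.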

| `/usr/bin/pcre2grep -rn --color '--exclude=^.*(?<!\.cpp|\.hpp|\.h|\.cc|\.cxx)$' '#include\s+.*' ../../../../../boost_1_58_0` | 10.21 2016-01-12 | 0.818627 | 9527 | 1386 |
| `grep -Ern --color --include=\*.cpp --include=\*.hpp --include=\*.h --include=\*.cc --include=\*.cxx '#include\s+.*' ../../../../../boost_1_58_0` | grep (GNU grep) 2.25 | 0.366634 | 9509 | 0 |
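
As a rough illustration of how an "average of N runs" figure like those in the table can be obtained, the sketch below times a command repeatedly and reports the mean wall-clock time. It is a stand-in, not the project's method: the actual test scripts use their own `portable_time` program, and the function name `average_elapsed` is hypothetical.

```python
import statistics
import subprocess
import sys
import time


def average_elapsed(cmd, runs=10):
    """Run cmd `runs` times and return the mean wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False: grep-like tools exit non-zero when nothing matches.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


if __name__ == "__main__":
    # Placeholder command; substitute e.g. ["ucg", "--noenv", "--cpp", PATTERN, CORPUS].
    cmd = [sys.executable, "-c", "pass"]
    print(f"{average_elapsed(cmd, runs=3):.6f} s")
```

A measurement like this includes process start-up cost and is sensitive to whether the file cache is hot, so a warm-up run before timing is usually worthwhile.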


UniversalCodeGrep is in fact somewhat faster than `grep` itself. Again under Fedora 23 and searching the Boost 1.58.0 source tree, `ucg` bests grep 2.22 not only in ease-of-use but in raw speed:

| Command | Elapsed Real Time, Average of 5 Runs |
| Command | Elapsed Real Time, Average of 10 Runs |
|---------|--------------------------------------|
| `time grep -Ern --color --include=\*.cpp --include=\*.hpp --include=\*.h --include=\*.cc --include=\*.cxx 'BOOST.*HPP' ~/src/boost_1_58_0` | ~ 0.9852 seconds |
| `time ucg --noenv --cpp 'BOOST.*HPP' ~/src/boost_1_58_0` | ~ 0.404 seconds |
@@ -72,6 +84,9 @@ The resulting matches are identical.

## Installation

UniversalCodeGrep binaries are currently available for Fedora 23/24/25/rawhide and Centos 7. Binaries for other platforms (Ubuntu, Arch, openSUSE) are coming soon.

<!-- COMING SOON
### Ubuntu PPA
If you are a Ubuntu user, the easiest way to install UniversalCodeGrep is from the Launchpad PPA [here](https://launchpad.net/~grvs/+archive/ubuntu/ucg). To install from the command line:
@@ -84,10 +99,11 @@ sudo apt-get update
# Install ucg:
sudo apt-get install universalcodegrep
```
-->

### Red Hat/Fedora/CentOS dnf/yum Repository
### Fedora/CentOS Copr Repository

If you are a Red Hat, Fedora, or CentOS user, the easiest way to install UniversalCodeGrep is from the Fedora Copr-hosted dnf/yum repository [here](https://copr.fedoraproject.org/coprs/grvs/UniversalCodeGrep). Installation is as simple as:
If you are a Fedora or CentOS user, the easiest way to install UniversalCodeGrep is from the Fedora Copr-hosted dnf/yum repository [here](https://copr.fedoraproject.org/coprs/grvs/UniversalCodeGrep). Installation is as simple as:

```sh
# Add the Copr repo to your system:
@@ -113,15 +129,15 @@ makepkg -sri

### openSUSE Binary RPMs

Binary RPMs for openSUSE are available [here](https://github.com/gvansickle/ucg/releases/tag/0.2.2).
Binary RPMs for openSUSE are available [here](https://github.com/gvansickle/ucg/releases/tag/0.3.0).

### Building the Source Tarball

UniversalCodeGrep can be built and installed from the distribution tarball (available [here](https://github.com/gvansickle/ucg/releases/download/0.2.2/universalcodegrep-0.2.2.tar.gz)) in the standard autotools manner:
UniversalCodeGrep can be built and installed from the distribution tarball (available [here](https://github.com/gvansickle/ucg/releases/download/0.3.0/universalcodegrep-0.3.0.tar.gz)) in the standard autotools manner:

```sh
tar -xaf universalcodegrep-0.2.2.tar.gz
cd universalcodegrep-0.2.2.tar.gz
tar -xaf universalcodegrep-0.3.0.tar.gz
cd universalcodegrep-0.3.0.tar.gz
./configure
make
make install
@@ -173,7 +189,7 @@ If no `FILES OR DIRECTORIES` are specified, searching starts in the current dire

### Command Line Options

Version 0.2.2 of `ucg` supports a significant subset of the options supported by `ack`. Future releases will have support for more options.
Version 0.3.0 of `ucg` supports a significant subset of the options supported by `ack`. Future releases will have support for more options.

#### Searching

2 changes: 1 addition & 1 deletion tests/benchmark_progs.csv
@@ -3,7 +3,7 @@ built_ucg,ucg,true,--noenv,--dirjobs=,-j,--ignore-dir=,only_cpp_ack
inst_ucg,${PROG_INST_UCG},true,--noenv,--dirjobs=,-j,--ignore-dir=,only_cpp_ack
fake_for_test,nosuchprog,true,-Abc,,,--exclude-dir=,only_cpp_ack
inst_ag,${PROG_INST_AG},true,,,,--ignore-dir=,only_cpp_ack
inst_ripgrep,${PROG_INST_RIPGREP},false,,,-j,--glob !,only_cpp_rg
inst_ripgrep,${PROG_INST_RIPGREP},false,-n,,-j,--glob !,only_cpp_rg
inst_pcre2grep,${PROG_PCRE2GREP},true,-rn --color,,,--exclude-dir=,only_cpp_pcre2grep
inst_system_grep,grep,false,-Ern --color,,,--exclude-dir=,only_cpp_grep
inst_gnu_grep_e,${PROG_GNU_GREP},false,-Ern --color,,,--exclude-dir=,only_cpp_grep
4 changes: 3 additions & 1 deletion tests/gen_test_script.py
@@ -40,7 +40,9 @@
### GENERATED FILE, DO NOT EDIT
###
if test "x$$NUM_ITERATIONS" = "x"; then
NUM_ITERATIONS=${num_iterations};
fi;
# Use our own time program so we don't have to worry about portability.
PROG_TIME="$$builddir/portable_time -p"
@@ -263,7 +265,7 @@ def GenerateTestScript(self, test_case_id, test_output_filename, options=None, f
)
test_cases += test_case + "\n"
script = test_script_template_1.substitute(
num_iterations=3,
num_iterations=10,
results_file=test_output_filename,
test_cases=test_cases
)
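
The `$$` and `${num_iterations}` markers in the hunks above look like Python `string.Template` syntax, where `$$` renders as a literal `$` and `${name}` is substituted. The sketch below uses a trimmed-down, hypothetical stand-in for the real template (the name `script_template` and its reduced body are assumptions) to show how the generated shell keeps any `NUM_ITERATIONS` already set in the environment and otherwise falls back to the substituted default.

```python
from string import Template

# Hypothetical, reduced version of the script template in gen_test_script.py.
script_template = Template("""\
if test "x$$NUM_ITERATIONS" = "x"; then
NUM_ITERATIONS=${num_iterations};
fi;
PROG_TIME="$$builddir/portable_time -p"
""")

# `$$` becomes a literal `$` for the shell; `${num_iterations}` is filled in by Python.
rendered = script_template.substitute(num_iterations=10)
print(rendered)
```

With this shape, the default of 10 applies only when the caller has not exported `NUM_ITERATIONS`, so a quick local run can presumably be requested by setting the variable before invoking the generated script.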