Initial implementation
Notes:

- uap is licensed under Apache 2.0 because that's the normal license
  for the project
- regex-filtered is licensed under BSD 3-clauses because it's largely
  a translation (with changes) of re2's FilteredRE2 and IANAL but it
  seems fairer (and safer) to match
masklinn committed Jun 17, 2024
0 parents commit 5ee1e9f
Showing 20 changed files with 3,193 additions and 0 deletions.
28 changes: 28 additions & 0 deletions .github/workflows/rust.yml
@@ -0,0 +1,28 @@
name: Rust

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

env:
  CARGO_TERM_COLOR: always

jobs:
  checks:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
      with:
        submodules: true
    - name: Build
      run: cargo build --verbose
    - name: Format
      run: cargo fmt --check
    - name: clippy
      run: cargo clippy
    - name: Run tests
      run: cargo test -r --verbose
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
/target
Cargo.lock
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "ua-parser/uap-core"]
path = ua-parser/uap-core
url = https://github.com/ua-parser/uap-core
3 changes: 3 additions & 0 deletions Cargo.toml
@@ -0,0 +1,3 @@
[workspace]
members = ["regex-filtered", "ua-parser"]
resolver = "2"
141 changes: 141 additions & 0 deletions README.md
@@ -0,0 +1,141 @@
# User Agent Parser

This module implements the [browserscope / uap
standard](https://github.com/ua-parser/uap-core) for rust, allowing
the extraction of various metadata from user agents.

The browserscope standard is data-oriented, with [`regexes.yaml`]
specifying the matching and extraction from user-agent strings. This
library implements the matching protocols and provides various types to
make loading the dataset easier. However, it does *not* provide the
data itself, to avoid depending on serialization libraries or
constraining how the data is loaded.

## Dataset loading

The crate does not provide any sort of precompiled data file, or
dedicated loader, however [`Regexes`] implements
[`serde::Deserialize`] and can load a [`regexes.yaml`] file or any
format-preserving conversion thereof (e.g. loading from json or cbor
might be preferred if the application already depends on one of
those):

```no_run
# let ua_str = "";
let f = std::fs::File::open("regexes.yaml")?;
let regexes: ua_parser::Regexes = serde_yaml::from_reader(f)?;
let extractor = ua_parser::Extractor::try_from(regexes)?;
# Ok::<(), Box<dyn std::error::Error>>(())
```

All the data-description structures are also Plain Old Data, so they
can be embedded in the application directly e.g. via a build script:

``` rust
let parsers = vec![
    ua_parser::user_agent::Parser {
        regex: "foo".into(),
        family_replacement: Some("bar".into()),
        ..Default::default()
    }
];
```

## Extraction

The crate can either extract individual information sets (user agent
i.e. browser, OS, and device) or extract all three in a single call.

The three infosets are independent and non-overlapping. The full
extractor may be convenient, but if only one infoset is needed a
complete extraction is unnecessary overhead, and the extractors
themselves are somewhat costly to create and take up memory.

### Complete Extractor

The complete extractor is simply converted from the [`Regexes`]
structure. The resulting [`Extractor`] embeds all three module-level
extractors as attributes, and [`Extractor::extract`] returns a
3-tuple of `ValueRef`s.


### Individual Extractors

The individual extractors are in the [`user_agent`], [`os`], and
[`device`] modules. The three modules follow the exact same model:

- a `Parser` struct which specifies individual parser configurations,
used as inputs to the `Builder`
- a `Builder`, into which the relevant parsers can be `push`-ed
- an `Extractor` created from the `Builder`, from which the user can
`extract` a `ValueRef`
- the `ValueRef` result of data extraction, which may borrow from (and
is thus lifetime-bound to) the `Parser` substitution data and the
user agent string it was extracted from
- for convenience, an owned `Value` variant of the `ValueRef`

``` rust
use ua_parser::os::{Builder, Parser, ValueRef};

let e = Builder::new()
    .push(Parser {
        regex: r"(Android)[ \-/](\d+)(?:\.(\d+)|)(?:[.\-]([a-z0-9]+)|)".into(),
        ..Default::default()
    })?
    .push(Parser {
        regex: r"(Android) Donut".into(),
        os_v1_replacement: Some("1".into()),
        os_v2_replacement: Some("2".into()),
        ..Default::default()
    })?
    .push(Parser {
        regex: r"(Android) Eclair".into(),
        os_v1_replacement: Some("2".into()),
        os_v2_replacement: Some("1".into()),
        ..Default::default()
    })?
    .push(Parser {
        regex: r"(Android) Froyo".into(),
        os_v1_replacement: Some("2".into()),
        os_v2_replacement: Some("2".into()),
        ..Default::default()
    })?
    .push(Parser {
        regex: r"(Android) Gingerbread".into(),
        os_v1_replacement: Some("2".into()),
        os_v2_replacement: Some("3".into()),
        ..Default::default()
    })?
    .push(Parser {
        regex: r"(Android) Honeycomb".into(),
        os_v1_replacement: Some("3".into()),
        ..Default::default()
    })?
    .push(Parser {
        regex: r"(Android) (\d+);".into(),
        ..Default::default()
    })?
    .build()?;

assert_eq!(
    e.extract("Android Donut"),
    Some(ValueRef {
        os: "Android".into(),
        major: Some("1".into()),
        minor: Some("2".into()),
        ..Default::default()
    }),
);
assert_eq!(
    e.extract("Android 15"),
    Some(ValueRef { os: "Android".into(), major: Some("15".into()), ..Default::default()}),
);
assert_eq!(
    e.extract("ZuneWP7"),
    None,
);
# Ok::<(), Box<dyn std::error::Error>>(())
```
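
The borrowed / owned split between `ValueRef` and `Value` described in
the list above can be sketched with the standard library alone. This is
a simplified illustration, not the crate's actual definitions: the
struct layouts, `into_owned` and the `resolve` helper are all
hypothetical, and it assumes a replacement simply takes priority over
the captured group.

``` rust
use std::borrow::Cow;

/// Hypothetical borrowed result: fields may point into the parser's
/// replacement data or into the user agent string.
#[derive(Debug, PartialEq, Default)]
struct ValueRef<'a> {
    os: Cow<'a, str>,
    major: Option<Cow<'a, str>>,
}

/// Hypothetical owned counterpart, free of any lifetime.
#[derive(Debug, PartialEq, Default)]
struct Value {
    os: String,
    major: Option<String>,
}

impl ValueRef<'_> {
    /// Copy every borrowed field so the result outlives its sources.
    fn into_owned(self) -> Value {
        Value {
            os: self.os.into_owned(),
            major: self.major.map(Cow::into_owned),
        }
    }
}

/// Assumed rule: use the parser's replacement if there is one,
/// otherwise fall back to the captured group. Nothing is copied.
fn resolve<'a>(replacement: Option<&'a str>, capture: Option<&'a str>) -> Option<Cow<'a, str>> {
    replacement.or(capture).map(Cow::Borrowed)
}

fn main() {
    let replacement = String::from("1"); // stands in for parser data
    let owned = {
        let ua = String::from("Android Donut");
        let v = ValueRef {
            os: Cow::Borrowed(&ua[..7]),
            major: resolve(Some(&replacement), None),
        };
        // `v` cannot leave this block, the owned copy can.
        v.into_owned()
    };
    println!("{owned:?}");
}
```

Borrowing avoids allocating on every extraction. The owned variant is
for callers that need to keep the result after the user agent string is
gone.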

[`regexes.yaml`]: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml
27 changes: 27 additions & 0 deletions regex-filtered/Cargo.toml
@@ -0,0 +1,27 @@
[package]
name = "regex-filtered"
version = "0.1.0"
edition = "2021"
description = "Efficiently check an input against a large number of patterns"
keywords = ["regex", "filter", "FilteredRE2", "multiple", "prefilter"]
license = "BSD-3-Clause"

documentation = "https://docs.rs/regex-filtered/"
homepage = "https://github.com/ua-parser/uap-rust/tree/main/regex-filtered"
repository = "https://github.com/ua-parser/uap-rust/"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
aho-corasick = "1.1.3"
indexmap = "2.2.6"
itertools = "0.13.0"
regex = "1.10.4"
regex-syntax = "0.8.3"

[dev-dependencies]
criterion = "0.5.1"

[[bench]]
name = "regex"
harness = false
28 changes: 28 additions & 0 deletions regex-filtered/LICENSE
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright (c) 2024, ua-parser project

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
97 changes: 97 additions & 0 deletions regex-filtered/README.md
@@ -0,0 +1,97 @@
# regex-filtered: FilteredRE2 for rust-regex

This crate implements the logic behind [`FilteredRE2`] on top of
[`regex`].

The purpose is to allow efficient selection of one or more regexes
matching an input from a *large* set without having to check every
regex linearly, by prefiltering candidate regexes and only matching
those against the input.

This should be preferred to [`regex::RegexSet`] if the regexes are
non-trivial (e.g. non-literal), as [`regex::RegexSet`] constructs a
single state machine which quickly grows huge and slow.

Linear matching does not have *that* issue and works fine with complex
regexes, but doesn't scale as the number of regexes increases and
match failures quickly get very expensive (as they require traversing
the entire set every time).

## Usage

``` rust
let matcher = regex_filtered::Builder::new()
.push("foo")?
.push("bar")?
.push("baz")?
.push("quux")?
.build()?;

assert!(matcher.is_match("bar"));
assert_eq!(matcher.matching("baz").count(), 1);
assert_eq!(matcher.matching("foo quux").count(), 2);
# Ok::<(), Box<dyn std::error::Error>>(())
```

[`Regexes::is_match`] returns whether *any* pattern in the set matches
the haystack. It is essentially equivalent to
`matcher.matching(...).next().is_some()`.

[`Regexes::matching`] returns an iterator of matching [`regex::Regex`]
and corresponding index. The index can be used to look up ancillary
data (e.g. replacement content), and the [`regex::Regex`] can be used
to [`regex::Regex::find`] or [`regex::Regex::captures`] data out of
the haystack.

## Notes

`regex-filtered` only returns the matching regexes (and their
indices), because capturing especially is *significantly* more
expensive than checking for a match. This slightly pessimises
situations where the prefilter prunes perfectly, but it is a large
gain as soon as that's not the case and the prefilter's candidates
have to be post-filtered.
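
The trade-off can be made concrete with the timings quoted in the
comment of `benches/regex.rs` (18ns / 33ns for a failed / successful
match, 44ns / 111ns for a failed / successful capture, on a single
machine). The following is a rough sketch, assuming costs add up
linearly and a consumer that wants the captures of the one regex that
matches after `n` candidates have failed. The function names are
illustrative only.

``` rust
/// Timings in nanoseconds, copied from the comment in `benches/regex.rs`.
const MATCH_FAIL: f64 = 18.0;
const MATCH_OK: f64 = 33.0;
const CAPTURE_FAIL: f64 = 44.0;
const CAPTURE_OK: f64 = 111.0;

/// Check every candidate with `is_match`, then capture on the one hit.
fn match_then_capture(failures: f64) -> f64 {
    failures * MATCH_FAIL + MATCH_OK + CAPTURE_OK
}

/// Call `captures` directly on every candidate.
fn capture_directly(failures: f64) -> f64 {
    failures * CAPTURE_FAIL + CAPTURE_OK
}

/// Number of failed candidates at which both strategies cost the same:
/// 18n + 33 + 111 = 44n + 111, hence n = 33 / (44 - 18).
fn cutoff() -> f64 {
    MATCH_OK / (CAPTURE_FAIL - MATCH_FAIL)
}

fn main() {
    println!("cutoff: {:.2} failures", cutoff());
    for n in [0.0, 1.0, 2.0, 5.0] {
        println!(
            "n={n}: match-first {} ns, capture-only {} ns",
            match_then_capture(n),
            capture_directly(n)
        );
    }
}
```

Under these assumptions the break-even point is about 1.27 failed
candidates per successful one, consistent with the figure given in the
benchmark comment: below it capturing directly is cheaper, above it
matching first wins.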

## Concepts

From a large set of regexes, extract distinguishing literal tokens,
match the tokens against the input, reverse-lookup which regexes the
matching tokens correspond to, and only run the corresponding regexes
on the input.

This extraction is done by gathering literal items, converting them to
content sets, then symbolically executing concatenations and
alternations (`|`) in order to find out what literal items *need* to
be present in the haystack for this regex to match. A reverse index is
then built from literal items to regexes.
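
As a rough illustration of that symbolic execution, the sketch below
walks a toy regex tree and derives which literals must be present.
Everything here is hypothetical (`Node`, `Requirement`, `required`,
`may_match`) and much simpler than the real analysis: it works on a
hand-built tree rather than on `regex-syntax` output, never merges
adjacent literals, and keeps only one literal set per alternation
branch.

``` rust
use std::collections::BTreeSet;

/// Toy regex tree, a stand-in for a parsed pattern.
enum Node {
    Literal(String),
    /// Anything that contributes no literal, e.g. `.*` or `\d+`.
    Wildcard,
    Concat(Vec<Node>),
    Alt(Vec<Node>),
}

/// Every set must have at least one of its literals in the haystack.
/// An empty list means "no constraint": the regex cannot be pruned.
type Requirement = Vec<BTreeSet<String>>;

fn required(node: &Node) -> Requirement {
    match node {
        Node::Literal(s) => vec![BTreeSet::from([s.clone()])],
        Node::Wildcard => Vec::new(),
        // Concatenation: the constraints of every part must all hold.
        Node::Concat(children) => children.iter().flat_map(required).collect(),
        // Alternation: some branch must hold, so union one set per
        // branch (lossy but safe). An unconstrained branch makes the
        // whole alternation unconstrained.
        Node::Alt(children) => {
            let mut union = BTreeSet::new();
            for child in children {
                match required(child).into_iter().next() {
                    Some(set) => union.extend(set),
                    None => return Vec::new(),
                }
            }
            if union.is_empty() { Vec::new() } else { vec![union] }
        }
    }
}

/// Necessary (not sufficient) condition for the regex to match.
fn may_match(req: &Requirement, haystack: &str) -> bool {
    req.iter()
        .all(|set| set.iter().any(|lit| haystack.contains(lit.as_str())))
}

fn main() {
    // Roughly `(foo|bar)baz\d+`.
    let node = Node::Concat(vec![
        Node::Alt(vec![Node::Literal("foo".into()), Node::Literal("bar".into())]),
        Node::Literal("baz".into()),
        Node::Wildcard,
    ]);
    let req = required(&node);
    println!("{req:?}");
    println!("{}", may_match(&req, "fooxbaz/1.2"));
}
```

Note that `fooxbaz/1.2` satisfies the requirement even though the
regex itself would not match it, which is why the candidates still
need to be checked against the real regex afterwards.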

At match time, a prefilter checks which literals are present in the
haystack and looks up which regexes they correspond to. Those regexes
are then matched against the haystack, so that only actually matching
regexes are returned.
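
The match-time flow can be sketched as follows. This is a toy model
under loud assumptions: plain `str::contains` stands in for the
[`aho-corasick`] literal scan, a function pointer stands in for the
compiled regex, and each pattern is a candidate as soon as *any* of its
literals is present. `Entry`, `Prefiltered` and `demo` are hypothetical
names, not this crate's API.

``` rust
use std::collections::{BTreeSet, HashMap};

struct Entry {
    /// Literals of which at least one must occur for a match; empty
    /// means the pattern cannot be prefiltered.
    atoms: Vec<&'static str>,
    /// Stand-in for running the compiled regex on the haystack.
    verify: fn(&str) -> bool,
}

struct Prefiltered {
    entries: Vec<Entry>,
    /// Reverse index: literal -> indices of the patterns needing it.
    index: HashMap<&'static str, Vec<usize>>,
    /// Patterns without literals are always candidates.
    unfiltered: Vec<usize>,
}

impl Prefiltered {
    fn new(entries: Vec<Entry>) -> Self {
        let mut index: HashMap<&'static str, Vec<usize>> = HashMap::new();
        let mut unfiltered = Vec::new();
        for (i, entry) in entries.iter().enumerate() {
            if entry.atoms.is_empty() {
                unfiltered.push(i);
            }
            for &atom in &entry.atoms {
                index.entry(atom).or_default().push(i);
            }
        }
        Self { entries, index, unfiltered }
    }

    /// Indices of the patterns that actually match, in ascending order.
    fn matching(&self, haystack: &str) -> Vec<usize> {
        // 1. prefilter: which literals occur, and whose are they?
        let mut candidates: BTreeSet<usize> = self.unfiltered.iter().copied().collect();
        for (atom, patterns) in &self.index {
            if haystack.contains(*atom) {
                candidates.extend(patterns.iter().copied());
            }
        }
        // 2. post-filter: only run the surviving patterns.
        candidates
            .into_iter()
            .filter(|&i| (self.entries[i].verify)(haystack))
            .collect()
    }
}

fn demo() -> Prefiltered {
    Prefiltered::new(vec![
        // roughly `(foo|bar)baz`
        Entry {
            atoms: vec!["foo", "bar"],
            verify: |h: &str| h.contains("foobaz") || h.contains("barbaz"),
        },
        // roughly `quux`
        Entry { atoms: vec!["quux"], verify: |h: &str| h.contains("quux") },
        // roughly `^\d*$`: no literal to filter on
        Entry { atoms: vec![], verify: |h: &str| h.chars().all(|c| c.is_ascii_digit()) },
    ])
}

fn main() {
    let set = demo();
    println!("{:?}", set.matching("foobaz quux"));
    println!("{:?}", set.matching("fooxbaz"));
}
```

In this model a haystack containing none of the literals only pays for
the literal scan plus the unfiltered patterns, instead of running every
pattern.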

## Divergences

While [`FilteredRE2`] requires the user to perform prefiltering,
`regex-filtered` handles this internally: [`aho-corasick`] is pretty
much ideal for that task and already a dependency of [`regex`], which
`regex-filtered` is based on.

## TODO

- add a stats feature to report various build-size infos e.g.

  - number of tokens
  - number of regexes
  - number of unfiltered regexes, this would be useful to know if
    prefiltering will be done or a naive sequential application would
    be a better idea.
  - ratio of checked regexes to successes (how does it work with lazy
    iterators?)
  - total / prefiltered (- unfiltered) so atom size impact can be
    evaluated
- also maybe mapper stats on the pruning stuff and whatever

[`aho-corasick`]: https://docs.rs/aho-corasick/
[`FilteredRE2`]: https://github.com/google/re2/blob/main/re2/filtered_re2.h
[`regex`]: https://docs.rs/regex/
[`regex-syntax`]: https://docs.rs/regex-syntax/
34 changes: 34 additions & 0 deletions regex-filtered/benches/regex.rs
@@ -0,0 +1,34 @@
use criterion::{criterion_group, criterion_main, Criterion};

use regex::Regex;

/// On this trivial synthetic test, the results on an M1P are:
///
/// * 18ns for a match failure
/// * 33ns for a match success
/// * 44ns for a capture failure
/// * 111ns for a capture success
///
/// The cutoff is at an average of n=1.27 failures, so it really
/// depends on how selective the prefilter is...
fn bench_regex(c: &mut Criterion) {
    let r = Regex::new(r"(foo|bar)baz/(\d+)\.(\d+)").unwrap();

    c.bench_function("has match - success", |b| {
        b.iter(|| r.is_match("foobaz/1.2"))
    });
    c.bench_function("has match - failure", |b| {
        b.iter(|| r.is_match("fooxbaz/1.2"))
    });

    c.bench_function("match - success", |b| b.iter(|| r.find("foobaz/1.2")));
    c.bench_function("match - failure", |b| b.iter(|| r.find("fooxbaz/1.2")));

    c.bench_function("capture - success", |b| b.iter(|| r.captures("foobaz/1.2")));
    c.bench_function("capture - failure", |b| {
        b.iter(|| r.captures("fooxbaz/1.2"))
    });
}

criterion_group!(benches, bench_regex);
criterion_main!(benches);