Notes:
- uap is licensed under Apache 2.0 because that's the normal license for the project
- regex-filtered is licensed under BSD 3-Clause because it's largely a translation (with changes) of re2's FilteredRE2; IANAL, but it seems fairer (and safer) to match
0 parents
commit 5cd3764
Showing 19 changed files with 3,184 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
name: Rust | ||
|
||
on: | ||
push: | ||
branches: [ "main" ] | ||
pull_request: | ||
branches: [ "main" ] | ||
|
||
env: | ||
CARGO_TERM_COLOR: always | ||
|
||
jobs: | ||
checks: | ||
|
||
runs-on: ubuntu-latest | ||
|
||
steps: | ||
- uses: actions/checkout@v4 | ||
with: | ||
submodules: true | ||
- name: Build | ||
run: cargo build --verbose | ||
- name: Format | ||
run: cargo fmt --check | ||
- name: clippy | ||
run: cargo clippy | ||
- name: Run tests | ||
run: cargo test -r --verbose |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
/target | ||
Cargo.lock |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[submodule "ua-parser/uap-core"] | ||
path = ua-parser/uap-core | ||
url = https://github.com/ua-parser/uap-core |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[workspace] | ||
members = ["regex-filtered", "ua-parser"] | ||
resolver = "2" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
# User Agent Parser | ||
|
||
This module implements the [browserscope / uap | ||
standard](https://github.com/ua-parser/uap-core) for rust, allowing | ||
the extraction of various metadata from user agents. | ||
|
||
The browserscope standard is data-oriented, with [`regexes.yaml`] | ||
specifying the matching and extraction from user-agent strings. This | ||
library implements the matching protocols and provides various types to | ||
make loading the dataset easier; however, it does *not* provide the | ||
data itself, to avoid depending on serialization libraries or | ||
constraining how the data is loaded. | ||
|
||
## Dataset loading | ||
|
||
The crate does not provide any sort of precompiled data file, or | ||
dedicated loader, however [`Regexes`] implements | ||
[`serde::Deserialize`] and can load a [`regexes.yaml`] file or any | ||
format-preserving conversion thereof (e.g. loading from json or cbor | ||
might be preferred if the application already depends on one of | ||
those): | ||
|
||
```no_run | ||
# let ua_str = ""; | ||
let f = std::fs::File::open("regexes.yaml")?; | ||
let regexes: ua_parser::Regexes = serde_yaml::from_reader(f)?; | ||
let extractor = ua_parser::Extractor::try_from(regexes)?; | ||
# Ok::<(), Box<dyn std::error::Error>>(()) | ||
``` | ||
|
||
All the data-description structures are also Plain Old Data, so they | ||
can be embedded in the application directly e.g. via a build script: | ||
|
||
``` rust | ||
let parsers = vec![ | ||
ua_parser::user_agent::Parser { | ||
regex: "foo".into(), | ||
family_replacement: Some("bar".into()), | ||
..Default::default() | ||
} | ||
]; | ||
``` | ||
## Extraction | ||
|
||
The crate provides the ability to either extract individual | ||
information sets (user agent, i.e. browser; OS; or device) or extract all | ||
three in a single call. | ||
|
||
The three infosets are independent and non-overlapping, so while | ||
the full extractor may be convenient, if only one infoset is needed a | ||
complete extraction is unnecessary overhead: the extractors themselves | ||
are somewhat costly to create and take up memory. | ||
|
||
### Complete Extractor | ||
|
||
The complete extractor is simply converted from the | ||
[`Regexes`] structure. The resulting [`Extractor`] embeds all three | ||
module-level extractors as attributes, and [`Extractor::extract`] | ||
returns a 3-tuple of `ValueRef`s. | ||
|
||
|
||
### Individual Extractors | ||
|
||
The individual extractors are in the [`user_agent`], [`os`], and | ||
[`device`] modules; the three modules follow the exact same model: | ||
|
||
- a `Parser` struct which specifies individual parser configurations, | ||
used as inputs to the `Builder` | ||
- a `Builder`, into which the relevant parsers can be `push`-ed | ||
- an `Extractor` created from the `Builder`, from which the user can | ||
`extract` a `ValueRef` | ||
- the `ValueRef` result of data extraction, which may borrow from (and | ||
is thus lifetime-bound to) the `Parser` substitution data and the | ||
user agent string it was extracted from | ||
- for convenience, an owned `Value` variant of the `ValueRef` | ||
|
||
``` rust | ||
use ua_parser::os::{Builder, Parser, ValueRef}; | ||
|
||
let e = Builder::new() | ||
.push(Parser { | ||
regex: r"(Android)[ \-/](\d+)(?:\.(\d+)|)(?:[.\-]([a-z0-9]+)|)".into(), | ||
..Default::default() | ||
})? | ||
.push(Parser { | ||
regex: r"(Android) Donut".into(), | ||
os_v1_replacement: Some("1".into()), | ||
os_v2_replacement: Some("2".into()), | ||
..Default::default() | ||
})? | ||
.push(Parser { | ||
regex: r"(Android) Eclair".into(), | ||
os_v1_replacement: Some("2".into()), | ||
os_v2_replacement: Some("1".into()), | ||
..Default::default() | ||
})? | ||
.push(Parser { | ||
regex: r"(Android) Froyo".into(), | ||
os_v1_replacement: Some("2".into()), | ||
os_v2_replacement: Some("2".into()), | ||
..Default::default() | ||
})? | ||
.push(Parser { | ||
regex: r"(Android) Gingerbread".into(), | ||
os_v1_replacement: Some("2".into()), | ||
os_v2_replacement: Some("3".into()), | ||
..Default::default() | ||
})? | ||
.push(Parser { | ||
regex: r"(Android) Honeycomb".into(), | ||
os_v1_replacement: Some("3".into()), | ||
..Default::default() | ||
})? | ||
.push(Parser { | ||
regex: r"(Android) (\d+);".into(), | ||
..Default::default() | ||
})? | ||
.build()?; | ||
|
||
assert_eq!( | ||
e.extract("Android Donut"), | ||
Some(ValueRef { | ||
os: "Android".into(), | ||
major: Some("1".into()), | ||
minor: Some("2".into()), | ||
..Default::default() | ||
}), | ||
); | ||
assert_eq!( | ||
e.extract("Android 15"), | ||
Some(ValueRef { os: "Android".into(), major: Some("15".into()), ..Default::default()}), | ||
); | ||
assert_eq!( | ||
e.extract("ZuneWP7"), | ||
None, | ||
); | ||
# Ok::<(), Box<dyn std::error::Error>>(()) | ||
``` | ||
|
||
[`regexes.yaml`]: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml |
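The Android example in the README above suggests how result fields get filled: a `*_replacement` configured on the `Parser` takes precedence, and otherwise the corresponding capture group is used. A minimal std-only sketch of that rule follows. The `resolve` helper and its signature are hypothetical names for illustration, not the crate's API, and the `$1`-style templating used by the uap data files is deliberately left out.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of ua-parser): pick the value of one output
/// field. A configured replacement wins; otherwise fall back to the capture
/// group. Both are borrowed rather than copied, which is why the result is
/// lifetime-bound to the parser data and the user agent string.
fn resolve<'a>(replacement: Option<&'a str>, capture: Option<&'a str>) -> Option<Cow<'a, str>> {
    replacement.or(capture).map(Cow::Borrowed)
}

fn main() {
    // "(Android) Donut" with os_v1_replacement = "1": the replacement is used.
    assert_eq!(resolve(Some("1"), None).as_deref(), Some("1"));
    // A version captured from the user agent string, with no replacement set.
    assert_eq!(resolve(None, Some("15")).as_deref(), Some("15"));
    // Neither a replacement nor a participating capture: the field stays empty.
    assert_eq!(resolve(None, None), None);
}
```

Returning a `Cow` keeps the door open for an owned value when a template has to be expanded, while the common cases stay allocation-free.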
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
[package] | ||
name = "regex-filtered" | ||
version = "0.1.0" | ||
edition = "2021" | ||
description = "Efficiently check an input against a large number of patterns" | ||
keywords = ["regex", "filter", "FilteredRE2", "multiple", "prefilter"] | ||
license = "BSD-3-Clause" | ||
|
||
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html | ||
|
||
[dependencies] | ||
aho-corasick = "1.1.3" | ||
indexmap = "2.2.6" | ||
itertools = "0.13.0" | ||
regex = "1.10.4" | ||
regex-syntax = "0.8.3" | ||
|
||
[dev-dependencies] | ||
criterion = "0.5.1" | ||
|
||
[[bench]] | ||
name = "regex" | ||
harness = false |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
BSD 3-Clause License | ||
|
||
Copyright (c) 2024, ua-parser project | ||
|
||
Redistribution and use in source and binary forms, with or without | ||
modification, are permitted provided that the following conditions are met: | ||
|
||
1. Redistributions of source code must retain the above copyright notice, this | ||
list of conditions and the following disclaimer. | ||
|
||
2. Redistributions in binary form must reproduce the above copyright notice, | ||
this list of conditions and the following disclaimer in the documentation | ||
and/or other materials provided with the distribution. | ||
|
||
3. Neither the name of the copyright holder nor the names of its | ||
contributors may be used to endorse or promote products derived from | ||
this software without specific prior written permission. | ||
|
||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | ||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE | ||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE | ||
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | ||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | ||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | ||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# regex-filtered: FilteredRE2 for rust-regex | ||
|
||
This crate implements the logic behind [`FilteredRE2`] on top of | ||
[`regex`]. | ||
|
||
The purpose is to allow efficient selection of one or more regexes | ||
matching an input from a *large* set without having to check every | ||
regex linearly, by prefiltering candidate regexes and only matching | ||
those against the input. | ||
|
||
This should be preferred to [`regex::RegexSet`] if the regexes are | ||
non-trivial (e.g. non-literal), as [`regex::RegexSet`] constructs a | ||
single state machine which quickly grows huge and slow. | ||
|
||
Linear matching does not have *that* issue and works fine with complex | ||
regexes, but doesn't scale as the number of regexes increases and | ||
match failures quickly get very expensive (as they require traversing | ||
the entire set every time). | ||
|
||
## Usage | ||
|
||
``` rust | ||
let matcher = regex_filtered::Builder::new() | ||
.push("foo")? | ||
.push("bar")? | ||
.push("baz")? | ||
.push("quux")? | ||
.build()?; | ||
|
||
assert!(matcher.is_match("bar")); | ||
assert_eq!(matcher.matching("baz").count(), 1); | ||
assert_eq!(matcher.matching("foo quux").count(), 2); | ||
# Ok::<(), Box<dyn std::error::Error>>(()) | ||
``` | ||
|
||
[`Regexes::is_match`] returns whether *any* pattern in the set matches | ||
the haystack. It is essentially equivalent to | ||
`matcher.matching(...).next().is_some()`. | ||
|
||
[`Regexes::matching`] returns an iterator of matching [`regex::Regex`] | ||
and corresponding index. The index can be used to look up ancillary | ||
data (e.g. replacement content), and the [`regex::Regex`] can be used | ||
to [`regex::Regex::find`] or [`regex::Regex::captures`] data out of | ||
the haystack. | ||
|
||
## Notes | ||
|
||
`regex-filtered` only returns the matching regexes (and their index), | ||
as capturing especially is *significantly* more expensive than | ||
checking for a match. This slightly pessimises situations where the | ||
prefilter prunes perfectly, but it is a large gain as soon as that's | ||
not the case and the prefilter has to be post-filtered. | ||
|
||
## Concepts | ||
|
||
From a large set of regexes, extract distinguishing literal tokens, | ||
match the tokens against the input, reverse-lookup which regexes the | ||
matching tokens correspond to, and only run the corresponding regexes | ||
on the input. | ||
|
||
This extraction is done by gathering literal items, converting them to | ||
content sets, then symbolically executing concatenations and | ||
alternations (`|`) in order to find out what literal items *need* to | ||
be present in the haystack for this regex to match. A reverse index is | ||
then built from literal items to regexes. | ||
|
||
At match time, a prefilter checks which literals are present | ||
in the haystack and finds out which regexes they correspond to, | ||
following which the regexes themselves are matched against the | ||
haystack to only return actual matching regexes. | ||
|
||
## Divergences | ||
|
||
While [`FilteredRE2`] requires the user to perform prefiltering, | ||
`regex-filtered` handles this internally: [`aho-corasick`] is pretty | ||
much ideal for that task and already a dependency of [`regex`] which | ||
`regex-filtered` is based on. | ||
|
||
## TODO | ||
|
||
- add a stats feature to report various build-size infos e.g. | ||
|
||
- number of tokens | ||
- number of regexes | ||
- number of unfiltered regexes, this would be useful to know if | ||
prefiltering will be done or a naive sequential application would | ||
be a better idea. | ||
- ratio of checked regexes to successes (how does it work with lazy | ||
iterators?) | ||
- total / prefiltered (- unfiltered) so atom size impact can be | ||
evaluated | ||
- also maybe mapper stats on the pruning stuff and whatever | ||
|
||
[`aho-corasick`]: https://docs.rs/aho-corasick/ | ||
[`FilteredRE2`]: https://github.com/google/re2/blob/main/re2/filtered_re2.h | ||
[`regex`]: https://docs.rs/regex/ | ||
[`regex-syntax`]: https://docs.rs/regex-syntax/ |
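The pipeline described in the Concepts section of the README above (required literals, reverse index, prefilter, then full match) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: `Pattern`, `Prefiltered`, and `demo_set` are hypothetical names, `str::contains` stands in for an Aho-Corasick pass, hand-written checks stand in for compiled regexes, and each pattern simply requires *all* of its literals, whereas the real extraction also handles alternations.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a compiled regex: the literals that must all be present for
/// a match to be possible, plus the full (expensive) check itself.
struct Pattern {
    required: Vec<&'static str>,
    full_match: fn(&str) -> bool,
}

struct Prefiltered {
    patterns: Vec<Pattern>,
    /// Reverse index: literal -> indices of the patterns requiring it.
    by_literal: HashMap<&'static str, Vec<usize>>,
    /// Patterns without any usable literal can never be pruned.
    unfiltered: Vec<usize>,
}

impl Prefiltered {
    fn new(patterns: Vec<Pattern>) -> Self {
        let mut by_literal: HashMap<&'static str, Vec<usize>> = HashMap::new();
        let mut unfiltered = Vec::new();
        for (idx, p) in patterns.iter().enumerate() {
            if p.required.is_empty() {
                unfiltered.push(idx);
            }
            for &lit in &p.required {
                by_literal.entry(lit).or_default().push(idx);
            }
        }
        Self { patterns, by_literal, unfiltered }
    }

    /// Indices of the patterns that actually match, in ascending order.
    fn matching(&self, haystack: &str) -> Vec<usize> {
        // 1. Prefilter: count, per pattern, how many required literals occur.
        let mut hits: HashMap<usize, usize> = HashMap::new();
        for (lit, owners) in &self.by_literal {
            if haystack.contains(*lit) {
                for &idx in owners {
                    *hits.entry(idx).or_insert(0) += 1;
                }
            }
        }
        // 2. Candidates: every required literal present, plus the unfiltered.
        let mut candidates: BTreeSet<usize> = self.unfiltered.iter().copied().collect();
        candidates.extend(
            hits.into_iter()
                .filter(|&(idx, n)| n == self.patterns[idx].required.len())
                .map(|(idx, _)| idx),
        );
        // 3. Post-filter: only candidates pay for the full check.
        candidates
            .into_iter()
            .filter(|&idx| (self.patterns[idx].full_match)(haystack))
            .collect()
    }
}

/// Three toy patterns: "starts with foo", "contains bar and ends with baz",
/// and "digits only" (which has no literal and is therefore unfiltered).
fn demo_set() -> Prefiltered {
    Prefiltered::new(vec![
        Pattern { required: vec!["foo"], full_match: |h: &str| h.starts_with("foo") },
        Pattern {
            required: vec!["bar", "baz"],
            full_match: |h: &str| h.contains("bar") && h.ends_with("baz"),
        },
        Pattern {
            required: vec![],
            full_match: |h: &str| !h.is_empty() && h.bytes().all(|b| b.is_ascii_digit()),
        },
    ])
}

fn main() {
    let set = demo_set();
    assert_eq!(set.matching("foo bar baz"), vec![0, 1]);
    // "foo" is present, so pattern 0 is a candidate, but the full check fails.
    assert_eq!(set.matching("a foo"), Vec::<usize>::new());
    assert_eq!(set.matching("12345"), vec![2]);
}
```

The `unfiltered` list is the reason the TODO above wants a count of unfiltered regexes: every such pattern is checked on every input, so a set dominated by them gains little from prefiltering.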
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
use criterion::{criterion_group, criterion_main, Criterion}; | ||
|
||
use regex::Regex; | ||
|
||
/// On this trivial synthetic test, the results on an M1P are: | ||
/// | ||
/// * 18ns for a match failure | ||
/// * 33ns for a match success | ||
/// * 44ns for a capture failure | ||
/// * 111ns for a capture success | ||
/// | ||
/// Cutoff is at n=1.27 failures average. So really depends how | ||
/// selective the prefilter is... | ||
fn bench_regex(c: &mut Criterion) { | ||
let r = Regex::new(r"(foo|bar)baz/(\d+)\.(\d+)").unwrap(); | ||
|
||
c.bench_function("has match - success", |b| { | ||
b.iter(|| r.is_match("foobaz/1.2")) | ||
}); | ||
c.bench_function("has match - failure", |b| { | ||
b.iter(|| r.is_match("fooxbaz/1.2")) | ||
}); | ||
|
||
c.bench_function("match - success", |b| b.iter(|| r.find("foobaz/1.2"))); | ||
c.bench_function("match - failure", |b| b.iter(|| r.find("fooxbaz/1.2"))); | ||
|
||
c.bench_function("capture - success", |b| b.iter(|| r.captures("foobaz/1.2"))); | ||
c.bench_function("capture - failure", |b| { | ||
b.iter(|| r.captures("fooxbaz/1.2")) | ||
}); | ||
} | ||
|
||
criterion_group!(benches, bench_regex); | ||
criterion_main!(benches); |
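The 1.27 cutoff quoted in the doc comment above can be reproduced from the four timings. The sketch below assumes the comparison is between capturing directly on every candidate and checking for a match first, then capturing only on success, with `n` failing candidates per successful one. That cost model is an assumption; the source only states the figures. The function names are illustrative.

```rust
// Timings in nanoseconds, copied from the benchmark's doc comment.
const MATCH_FAIL: f64 = 18.0;
const MATCH_OK: f64 = 33.0;
const CAPTURE_FAIL: f64 = 44.0;
const CAPTURE_OK: f64 = 111.0;

/// Cost of `n` failing candidates followed by one success when capturing
/// directly on every candidate.
fn capture_only(n: f64) -> f64 {
    n * CAPTURE_FAIL + CAPTURE_OK
}

/// Same workload, but checking for a match first and capturing only once
/// the match check has succeeded.
fn match_then_capture(n: f64) -> f64 {
    n * MATCH_FAIL + MATCH_OK + CAPTURE_OK
}

/// Break-even point: n * 44 + 111 = n * 18 + 33 + 111, so n = 33 / (44 - 18).
fn cutoff() -> f64 {
    MATCH_OK / (CAPTURE_FAIL - MATCH_FAIL)
}

fn main() {
    let n = cutoff();
    println!("cutoff at n = {n:.2}"); // prints "cutoff at n = 1.27"
    assert!((capture_only(n) - match_then_capture(n)).abs() < 1e-9);
    // Above the cutoff, matching first is cheaper; below it, it is not.
    assert!(match_then_capture(2.0) < capture_only(2.0));
    assert!(capture_only(1.0) < match_then_capture(1.0));
}
```

Under this model, matching first pays off once the prefilter lets through more than about 1.27 false candidates per real match, which is why the comment says the outcome depends on how selective the prefilter is.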