Initial implementation

Notes:

- uap is licensed under Apache 2.0 because that's the normal license for the project
- regex-filtered is licensed under BSD 3-clauses because it's largely a translation (with changes) of re2's FilteredRE2 and IANAL but it seems fairer (and safer) to match

ua-parser · Jun 16, 2024 · commit 5cd3764

Showing 19 changed files with 3,184 additions and 0 deletions.
diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
@@ -0,0 +1,28 @@
+name: Rust
+
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+
+env:
+  CARGO_TERM_COLOR: always
+
+jobs:
+  checks:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        submodules: true
+    - name: Build
+      run: cargo build --verbose
+    - name: Format
+      run: cargo fmt --check
+    - name: clippy
+      run: cargo clippy
+    - name: Run tests
+      run: cargo test -r --verbose
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+/target
+Cargo.lock
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "ua-parser/uap-core"]
+	path = ua-parser/uap-core
+	url = https://github.com/ua-parser/uap-core
diff --git a/Cargo.toml b/Cargo.toml
@@ -0,0 +1,3 @@
+[workspace]
+members = ["regex-filtered", "ua-parser"]
+resolver = "2"
diff --git a/README.md b/README.md
@@ -0,0 +1,141 @@
+# User Agent Parser
+
+This module implements the [browserscope / uap
+standard](https://github.com/ua-parser/uap-core) for rust, allowing
+the extraction of various metadata from user agents.
+
+The browserscope standard is data-oriented, with [`regexes.yaml`]
+specifying the matching and extraction from user-agent strings. This
+library implements the matching protocols and provides various types to
+make loading the dataset easier; however, it does *not* provide the
+data itself, to avoid depending on serialization libraries or
+constraining how the data is loaded.
+
+## Dataset loading
+
+The crate does not provide any sort of precompiled data file, or
+dedicated loader, however [`Regexes`] implements
+[`serde::Deserialize`] and can load a [`regexes.yaml`] file or any
+format-preserving conversion thereof (e.g. loading from json or cbor
+might be preferred if the application already depends on one of
+those):
+
+```no_run
+# let ua_str = "";
+let f = std::fs::File::open("regexes.yaml")?;
+let regexes: ua_parser::Regexes = serde_yaml::from_reader(f)?;
+let extractor = ua_parser::Extractor::try_from(regexes)?;
+
+# Ok::<(), Box<dyn std::error::Error>>(())
+```
+
+All the data-description structures are also Plain Old Data, so they
+can be embedded in the application directly e.g. via a build script:
+
+``` rust
+let parsers = vec![
+    ua_parser::user_agent::Parser {
+        regex: "foo".into(),
+        family_replacement: Some("bar".into()),
+        ..Default::default()
+    }
+];
+```
+## Extraction
+
+The crate provides the ability to either extract individual
+information sets (user agent i.e. browser, OS, and device) or extract
+all three in a single call.
+
+The three infosets are independent and non-overlapping, so while the
+full extractor may be convenient, if only one infoset is needed a
+complete extraction is unnecessary overhead: the extractors themselves
+are somewhat costly to create and take up memory.
+
+### Complete Extractor
+
+The complete extractor is simply converted from the [`Regexes`]
+structure. The resulting [`Extractor`] embeds all three module-level
+extractors as attributes, and [`Extractor::extract`] returns a 3-tuple
+of `ValueRef`s.
+
+
+### Individual Extractors
+
+The individual extractors are in the [`user_agent`], [`os`], and
+[`device`] modules; the three modules follow the exact same model:
+
+- a `Parser` struct which specifies individual parser configurations,
+  used as inputs to the `Builder`
+- a `Builder`, into which the relevant parsers can be `push`-ed
+- an `Extractor` created from the `Builder`, from which the user can
+  `extract` a `ValueRef`
+- the `ValueRef` result of data extraction, which may borrow from (and
+  is thus lifetime-bound to) the `Parser` substitution data and the
+  user agent string it was extracted from
+- for convenience, an owned `Value` variant of the `ValueRef`
+
+``` rust
+use ua_parser::os::{Builder, Parser, ValueRef};
+
+let e = Builder::new()
+    .push(Parser {
+        regex: r"(Android)[ \-/](\d+)(?:\.(\d+)|)(?:[.\-]([a-z0-9]+)|)".into(),
+        ..Default::default()
+    })?
+    .push(Parser {
+        regex: r"(Android) Donut".into(),
+        os_v1_replacement: Some("1".into()),
+        os_v2_replacement: Some("2".into()),
+        ..Default::default()
+    })?
+    .push(Parser {
+        regex: r"(Android) Eclair".into(),
+        os_v1_replacement: Some("2".into()),
+        os_v2_replacement: Some("1".into()),
+        ..Default::default()
+    })?
+    .push(Parser {
+        regex: r"(Android) Froyo".into(),
+        os_v1_replacement: Some("2".into()),
+        os_v2_replacement: Some("2".into()),
+        ..Default::default()
+    })?
+    .push(Parser {
+        regex: r"(Android) Gingerbread".into(),
+        os_v1_replacement: Some("2".into()),
+        os_v2_replacement: Some("3".into()),
+        ..Default::default()
+    })?
+    .push(Parser {
+        regex: r"(Android) Honeycomb".into(),
+        os_v1_replacement: Some("3".into()),
+        ..Default::default()
+    })?
+    .push(Parser {
+        regex: r"(Android) (\d+);".into(),
+        ..Default::default()
+    })?
+    .build()?;
+
+assert_eq!(
+    e.extract("Android Donut"),
+    Some(ValueRef {
+        os: "Android".into(),
+        major: Some("1".into()),
+        minor: Some("2".into()),
+        ..Default::default()
+    }),
+);
+assert_eq!(
+    e.extract("Android 15"),
+    Some(ValueRef { os: "Android".into(), major: Some("15".into()), ..Default::default()}),
+);
+assert_eq!(
+    e.extract("ZuneWP7"),
+    None,
+);
+# Ok::<(), Box<dyn std::error::Error>>(())
+```
+
+[`regexes.yaml`]: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml
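The `ValueRef` / `Value` split described under "Individual Extractors" in the README above can be illustrated with a minimal, standard-library-only sketch. Everything in it is hypothetical: `OsRef`, `OsOwned`, `extract` and `extract_owned` are illustrative names, not the crate's actual types or API, and the "extraction" is a toy string split. It only shows why a borrowed result is lifetime-bound to both the replacement data and the user agent string, and why an owned variant is convenient.

```rust
use std::borrow::Cow;

/// Hypothetical borrowed result: fields may point into the parser's
/// replacement data or into the user agent string.
#[derive(Debug, PartialEq, Default)]
struct OsRef<'a> {
    os: Cow<'a, str>,
    major: Option<Cow<'a, str>>,
}

/// Hypothetical owned counterpart, with no lifetime parameter.
#[derive(Debug, PartialEq, Default)]
struct OsOwned {
    os: String,
    major: Option<String>,
}

impl OsRef<'_> {
    /// Copy every borrowed field so the result can outlive its inputs.
    fn into_owned(self) -> OsOwned {
        OsOwned {
            os: self.os.into_owned(),
            major: self.major.map(Cow::into_owned),
        }
    }
}

/// Toy extraction: the family comes from the replacement data, the major
/// version is whatever follows the first space in the user agent.
fn extract<'a>(replacement: &'a str, ua: &'a str) -> OsRef<'a> {
    OsRef {
        os: Cow::Borrowed(replacement),
        major: ua.split_once(' ').map(|(_, v)| Cow::Borrowed(v)),
    }
}

/// Both inputs are dropped at the end of this function, so only the
/// owned variant can be returned from it.
fn extract_owned() -> OsOwned {
    let replacement = String::from("Android");
    let ua = String::from("Android 15");
    let owned = extract(&replacement, &ua).into_owned();
    owned
}
```

Returning the `OsRef` itself from `extract_owned` would not compile, since it borrows from the two local `String`s; that is the constraint the owned variant is meant to lift.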
diff --git a/regex-filtered/Cargo.toml b/regex-filtered/Cargo.toml
@@ -0,0 +1,23 @@
+[package]
+name = "regex-filtered"
+version = "0.1.0"
+edition = "2021"
+description = "Efficiently check an input against a large number of patterns"
+keywords = ["regex", "filter", "FilteredRE2", "multiple", "prefilter"]
+license = "BSD-3-Clause"
+
+# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
+
+[dependencies]
+aho-corasick = "1.1.3"
+indexmap = "2.2.6"
+itertools = "0.13.0"
+regex = "1.10.4"
+regex-syntax = "0.8.3"
+
+[dev-dependencies]
+criterion = "0.5.1"
+
+[[bench]]
+name = "regex"
+harness = false
diff --git a/regex-filtered/LICENSE b/regex-filtered/LICENSE
@@ -0,0 +1,28 @@
+BSD 3-Clause License
+
+Copyright (c) 2024, ua-parser project
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived from
+   this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/regex-filtered/README.md b/regex-filtered/README.md
@@ -0,0 +1,97 @@
+# regex-filtered: FilteredRE2 for rust-regex
+
+This crate implements the logic behind [`FilteredRE2`] on top of
+[`regex`].
+
+The purpose is to allow efficient selection of one or more regexes
+matching an input from a *large* set without having to check every
+regex linearly, by prefiltering candidate regexes and only matching
+those against the input.
+
+This should be preferred to [`regex::RegexSet`] if the regexes are
+non-trivial (e.g. non-literal), as [`regex::RegexSet`] constructs a
+single state machine which quickly grows huge and slow.
+
+Linear matching does not have *that* issue and works fine with complex
+regexes, but doesn't scale as the number of regexes increases and
+match failures quickly get very expensive (as they require traversing
+the entire set every time).
+
+## Usage
+
+``` rust
+let matcher = regex_filtered::Builder::new()
+    .push("foo")?
+    .push("bar")?
+    .push("baz")?
+    .push("quux")?
+    .build()?;
+
+assert!(matcher.is_match("bar"));
+assert_eq!(matcher.matching("baz").count(), 1);
+assert_eq!(matcher.matching("foo quux").count(), 2);
+# Ok::<(), Box<dyn std::error::Error>>(())
+```
+
+[`Regexes::is_match`] returns whether *any* pattern in the set matches
+the haystack. It is essentially equivalent to
+`matcher.matching(...).next().is_some()`.
+
+[`Regexes::matching`] returns an iterator of matching [`regex::Regex`]
+and corresponding index. The index can be used to look up ancillary
+data (e.g. replacement content), and the [`regex::Regex`] can be used
+to [`regex::Regex::find`] or [`regex::Regex::captures`] data out of
+the haystack.
+
+## Notes
+
+`regex-filtered` only returns the matching regexes (and their index)
+because capturing especially is *significantly* more expensive than
+checking for a match. This slightly pessimises situations where the
+prefilter prunes perfectly, but it is a large gain as soon as that's
+not the case and the prefilter's candidates have to be post-filtered.
+
+## Concepts
+
+From a large set of regexes, extract distinguishing literal tokens,
+match the tokens against the input, reverse-lookup which regexes the
+matching tokens correspond to, and only run the corresponding regexes
+on the input.
+
+This extraction is done by gathering literal items, converting them to
+content sets, then symbolically executing concatenations and
+alternations (`|`) in order to find out what literal items *need* to
+be present in the haystack for this regex to match. A reverse index is
+then built from literal items to regexes.
+
+At match time, a prefilter checks which literals are present in the
+haystack and looks up which regexes those correspond to, following
+which the candidate regexes themselves are matched against the
+haystack so that only actually matching regexes are returned.
+
+## Divergences
+
+While [`FilteredRE2`] requires the user to perform prefiltering,
+`regex-filtered` handles this internally: [`aho-corasick`] is pretty
+much ideal for that task and already a dependency of [`regex`], which
+`regex-filtered` is based on.
+
+## TODO
+
+- add a stats feature to report various build-size infos e.g.
+
+  - number of tokens
+  - number of regexes
+  - number of unfiltered regexes, this would be useful to know if
+    prefiltering will be done or a naive sequential application would
+    be a better idea.
+  - ratio of checked regexes to successes (how does it work with lazy
+    iterators?)
+  - total / prefiltered (- unfiltered) so atom size impact can be
+    evaluated
+  - also maybe mapper stats on the pruning stuff and whatever
+
+[`aho-corasick`]: https://docs.rs/aho-corasick/
+[`FilteredRE2`]: https://github.com/google/re2/blob/main/re2/filtered_re2.h
+[`regex`]: https://docs.rs/regex/
+[`regex-syntax`]: https://docs.rs/regex-syntax/
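The "Concepts" section of the README above (literal extraction, reverse index, prefilter, then post-filter) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the crate's implementation: `Pattern`, `Prefiltered` and the helper functions are hypothetical names, each pattern carries at most one hand-written required literal instead of atoms derived from the regex syntax tree, a plain function stands in for the real regex, and `str::contains` stands in for the single Aho-Corasick scan.

```rust
use std::collections::{BTreeSet, HashMap};

/// A toy "regex": `atom` is a literal that must occur in the haystack for
/// the pattern to match (`None` if no such literal could be extracted),
/// and `full` stands in for running the real regex.
struct Pattern {
    atom: Option<&'static str>,
    full: fn(&str) -> bool,
}

struct Prefiltered {
    patterns: Vec<Pattern>,
    /// Reverse index: literal atom -> indices of the patterns requiring it.
    by_atom: HashMap<&'static str, Vec<usize>>,
    /// Patterns without an atom cannot be pruned: they are always candidates.
    unfiltered: Vec<usize>,
}

impl Prefiltered {
    fn new(patterns: Vec<Pattern>) -> Self {
        let mut by_atom: HashMap<&'static str, Vec<usize>> = HashMap::new();
        let mut unfiltered = Vec::new();
        for (idx, p) in patterns.iter().enumerate() {
            match p.atom {
                Some(atom) => by_atom.entry(atom).or_default().push(idx),
                None => unfiltered.push(idx),
            }
        }
        Self { patterns, by_atom, unfiltered }
    }

    /// Indices of the patterns matching `haystack`, in ascending order.
    fn matching(&self, haystack: &str) -> Vec<usize> {
        // Prefilter: gather the candidates whose atom occurs in the haystack.
        let mut candidates: BTreeSet<usize> = self.unfiltered.iter().copied().collect();
        for (atom, indices) in &self.by_atom {
            if haystack.contains(*atom) {
                candidates.extend(indices.iter().copied());
            }
        }
        // Post-filter: only the candidates go through the full matcher.
        candidates
            .into_iter()
            .filter(|&idx| (self.patterns[idx].full)(haystack))
            .collect()
    }
}

/// Stands in for `foo\d`.
fn foo_then_digit(h: &str) -> bool {
    h.match_indices("foo")
        .any(|(i, _)| h[i + 3..].starts_with(|c: char| c.is_ascii_digit()))
}

/// Stands in for `bar$`.
fn ends_with_bar(h: &str) -> bool {
    h.ends_with("bar")
}

/// Stands in for `^\d+$`, from which no literal can be extracted.
fn all_digits(h: &str) -> bool {
    !h.is_empty() && h.chars().all(|c| c.is_ascii_digit())
}

fn demo() -> Prefiltered {
    Prefiltered::new(vec![
        Pattern { atom: Some("foo"), full: foo_then_digit },
        Pattern { atom: Some("bar"), full: ends_with_bar },
        Pattern { atom: None, full: all_digits },
    ])
}
```

With `demo()`, the haystack `"foo bar baz"` selects the first two patterns as candidates because both literals are present, but neither survives the post-filter; the third pattern is checked for every haystack since it has no atom to prune on.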
diff --git a/regex-filtered/benches/regex.rs b/regex-filtered/benches/regex.rs
@@ -0,0 +1,34 @@
+use criterion::{criterion_group, criterion_main, Criterion};
+
+use regex::Regex;
+
+/// On this trivial synthetic test, the results on an M1P are:
+///
+/// * 18ns for a match failure
+/// * 33ns for a match success
+/// * 44ns for a capture failure
+/// * 111ns for a capture success
+///
+/// Cutoff is at an average of n=1.27 failures. So it really depends
+/// how selective the prefilter is...
+fn bench_regex(c: &mut Criterion) {
+    let r = Regex::new(r"(foo|bar)baz/(\d+)\.(\d+)").unwrap();
+
+    c.bench_function("has match - success", |b| {
+        b.iter(|| r.is_match("foobaz/1.2"))
+    });
+    c.bench_function("has match - failure", |b| {
+        b.iter(|| r.is_match("fooxbaz/1.2"))
+    });
+
+    c.bench_function("match - success", |b| b.iter(|| r.find("foobaz/1.2")));
+    c.bench_function("match - failure", |b| b.iter(|| r.find("fooxbaz/1.2")));
+
+    c.bench_function("capture - success", |b| b.iter(|| r.captures("foobaz/1.2")));
+    c.bench_function("capture - failure", |b| {
+        b.iter(|| r.captures("fooxbaz/1.2"))
+    });
+}
+
+criterion_group!(benches, bench_regex);
+criterion_main!(benches);
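The `n=1.27` cutoff in the benchmark's doc comment is not derived there. One plausible reading, which is an assumption on my part but reproduces the figure, is that it compares two ways of handling `n` non-matching candidates followed by one matching candidate: calling `captures` on every candidate, versus calling `is_match` on every candidate and `captures` only on the one that matched. The sketch below plugs the quoted timings into that model; the function and constant names are illustrative.

```rust
/// Timings quoted in the benchmark's doc comment, in nanoseconds.
const MATCH_FAILURE: f64 = 18.0;
const MATCH_SUCCESS: f64 = 33.0;
const CAPTURE_FAILURE: f64 = 44.0;
const CAPTURE_SUCCESS: f64 = 111.0;

/// Cost of calling `captures` on `n` non-matching candidates, then on the
/// matching one.
fn capture_only(n: f64) -> f64 {
    n * CAPTURE_FAILURE + CAPTURE_SUCCESS
}

/// Cost of calling `is_match` on every candidate, then `captures` once on
/// the candidate that matched.
fn match_then_capture(n: f64) -> f64 {
    n * MATCH_FAILURE + MATCH_SUCCESS + CAPTURE_SUCCESS
}

/// Number of failures at which both strategies cost the same:
/// 44n + 111 = 18n + 33 + 111, hence n = 33 / 26.
fn break_even() -> f64 {
    MATCH_SUCCESS / (CAPTURE_FAILURE - MATCH_FAILURE)
}
```

Under this model, capturing directly is cheaper only while the prefilter lets through about one false candidate per match on average; beyond that, checking with `is_match` first wins.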