# Paraglob 2
#### A fairly quick data structure for matching a string against a large list of patterns.

For example, given a list of patterns
```
{*og, do*, ca*, plant}
```
and an input string `dog`, paraglob will return
```
{*og, do*}
```

## How it works
For any pattern, there exists a set of sub-strings that a string must contain in
order to have any hope of matching that pattern. We call these
meta-words. Here are some examples:

```
*og -> |og|
dog*fish -> |dog| |fish|
```

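As a rough illustration of the idea, the sketch below extracts meta-words in Python. It assumes that a meta-word is any maximal run of literal characters left after cutting the pattern at `*`, `?`, and `[...]` bracket expressions; the exact splitting rule Paraglob uses is not spelled out here, and `meta_words` is a hypothetical helper name, not part of the library.

```python
import re

# Assumption: wildcards are `*`, `?`, and simple bracket expressions like
# `[!wl]`. Everything between wildcards is treated as a meta-word.
_WILDCARD = re.compile(r"\*|\?|\[[^\]]*\]")


def meta_words(pattern: str) -> list[str]:
    """Return the literal sub-strings that any match of `pattern` must contain."""
    return [piece for piece in _WILDCARD.split(pattern) if piece]


print(meta_words("*og"))       # ['og']
print(meta_words("dog*fish"))  # ['dog', 'fish']
print(meta_words("d[!wl]g"))   # ['d', 'g']
```

A pattern such as `*` has no meta-words at all under this rule, so an implementation has to treat it as a candidate for every query.
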
When a pattern is added to a Paraglob, the pattern is stored and split into
its meta-words. Those meta-words are then added to an Aho-Corasick data
structure, which is provided by `multifast-ac`.

When Paraglob is given a query, it first uses `multifast-ac` to find the
meta-words contained in the query. It then builds the set of all patterns
associated with those meta-words and runs `fnmatch` on the query against each
of them. Finally, it returns a vector of all the patterns that match.

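The add and query steps described above can be modelled in a few lines of Python. This is a toy sketch, not the library's API: `TinyParaglob` and its attributes are hypothetical names, a plain substring scan stands in for the Aho-Corasick automaton from `multifast-ac`, Python's `fnmatch.fnmatchcase` stands in for the C `fnmatch`, and the result is sorted only to make the output deterministic.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# Assumption: meta-words are the literal runs between `*`, `?`, and `[...]`.
_WILDCARD = re.compile(r"\*|\?|\[[^\]]*\]")


class TinyParaglob:
    """Toy model of the add/query flow; not the real Paraglob interface."""

    def __init__(self, patterns):
        self.by_meta_word = defaultdict(set)  # meta-word -> patterns that need it
        self.no_meta_words = set()            # patterns like "*" with no literals
        for pattern in patterns:
            words = [w for w in _WILDCARD.split(pattern) if w]
            if not words:
                self.no_meta_words.add(pattern)
            for word in words:
                self.by_meta_word[word].add(pattern)

    def get(self, query):
        # Step 1: collect patterns whose meta-words occur in the query.
        # Aho-Corasick would find these in one pass over the query; this
        # loop checks every meta-word and is only for illustration.
        candidates = set(self.no_meta_words)
        for word, patterns in self.by_meta_word.items():
            if word in query:
                candidates |= patterns
        # Step 2: confirm each candidate with a full glob match.
        return sorted(p for p in candidates if fnmatchcase(query, p))


pg = TinyParaglob(["*og", "do*", "ca*", "plant"])
print(pg.get("dog"))  # ['*og', 'do*']
```

The point of the first step is to shrink the candidate set: for the query `dog`, only `*og` and `do*` reach the `fnmatch` stage, while `ca*` and `plant` are never tested.
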
## Installation
```
# ./configure && make && make test && make install
```

## How to use it
`paraglob-test` is a small benchmarking script that takes three parameters: the
number of patterns to generate, the number of queries to perform, and the
percentage of generated patterns that will match.

As an example, running `paraglob-test 10000 50 50` will add 10,000 patterns,
perform 50 queries on them (of which 50% should match), and then return the
results.

## Inside Zeek
Paraglob is integrated with Zeek and provides a simple API inside its
scripting language. In Zeek, paraglob is implemented as an
`OpaqueType`, and its syntax closely follows other similar constructs
inside Zeek. A paraglob can only be instantiated once from a vector of
patterns, and afterwards it supports only get operations, which return a vector
of all patterns matching an input string. These patterns differ from
the `pattern` type in Zeek in that they are just strings. The syntax is as
follows:

```
local v = vector("*", "d?g", "*og", "d?", "d[!wl]g");

local p = paraglob_init(v);

print paraglob_match(p, "dog");
```
out:
```
[*, *og, d?g, d[!wl]g]
```

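To see why `d?` is the only pattern left out, the same five patterns can be checked with Python's `fnmatch` module. This is only a cross-check under the assumption that Python's glob rules agree with the C `fnmatch` for these patterns (`?` matches exactly one character and `[!wl]` matches any single character other than `w` or `l`); it does not call Paraglob or Zeek.

```python
from fnmatch import fnmatchcase

patterns = ["*", "d?g", "*og", "d?", "d[!wl]g"]
query = "dog"

# Show the verdict for each pattern individually.
for pattern in patterns:
    print(f"{pattern!r:12} {fnmatchcase(query, pattern)}")

# Sorting is done here only so the list lines up with the output above.
matches = sorted(p for p in patterns if fnmatchcase(query, p))
print(matches)  # ['*', '*og', 'd?g', 'd[!wl]g']
```

`d?` fails because it requires a string of exactly two characters, while `dog` has three.
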
## Notes
Paraglob answers queries very quickly, but it does not build instantly. Building
takes about 1.5 seconds for 10,000 items, 3 seconds for 20,000, and so on.
This is because of the time required to build the Aho-Corasick structure.