.. _parsing:
|
|
|
|
=======
|
|
Parsing
|
|
=======
|
|
|
|
Basics
|
|
======
|
|
|
|
Type Declaration
|
|
^^^^^^^^^^^^^^^^
|
|
|
|
Spicy expresses units of data to parse through a type called,
|
|
appropriately, ``unit``. At a high level, a unit is similar to structs
|
|
or records in other languages: It defines an ordered set of fields,
|
|
each with a name and a type, that during runtime will store
|
|
corresponding values. Units can be instantiated, fields can be
|
|
assigned values, and these values can then be retrieved. Here's about
|
|
the most basic Spicy unit one can define:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
version: uint32;
|
|
};
|
|
|
|
We name the type ``Foo``, and it has just one field called
|
|
``version``, which stores a 32-bit unsigned integer.
|
|
|
|
Leaving parsing aside for a moment, we can indeed use this type
|
|
similar to a typical struct/record type:
|
|
|
|
.. spicy-code:: basic-unit-module.spicy
|
|
|
|
module Test;
|
|
|
|
type Foo = unit {
|
|
version: uint32;
|
|
};
|
|
|
|
global f: Foo;
|
|
f.version = 42;
|
|
print f;
|
|
|
|
This will print:
|
|
|
|
.. spicy-output:: basic-unit-module.spicy
|
|
:exec: spicyc -j %INPUT
|
|
|
|
Fields are initially unset, and attempting to read an unset field will
|
|
trigger a :ref:`runtime error <error_handling>`. You may, however,
|
|
provide a default value by adding a ``&default`` *attribute* to the
|
|
field, in which case that will be returned on access if no value has
|
|
been explicitly assigned:
|
|
|
|
.. spicy-code:: basic-unit-module-with-default.spicy
|
|
|
|
module Test;
|
|
|
|
type Foo = unit {
|
|
version: uint32 &default=42;
|
|
};
|
|
|
|
global f: Foo;
|
|
print f;
|
|
print "version is %s" % f.version;
|
|
|
|
This will print:
|
|
|
|
.. spicy-output:: basic-unit-module-with-default.spicy
|
|
:exec: spicyc -j %INPUT
|
|
|
|
Note how the field remains unset even with the default now specified,
|
|
while the access returns the expected value.
|
|
|
|
Parsing a Field
|
|
^^^^^^^^^^^^^^^
|
|
|
|
We can turn this minimal unit type into a starting point for parsing
|
|
data---in this case a 32-bit integer from four bytes of raw input.
|
|
First, we need to declare the unit as ``public`` to make it accessible
|
|
from outside of the current module---a requirement if a host
|
|
application wants to use the unit as a parsing entry point.
|
|
|
|
.. spicy-code:: basic-unit-parse.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
version: uint32;
|
|
|
|
on %done {
|
|
print "0x%x" % self.version;
|
|
}
|
|
};
|
|
|
|
Let's use :ref:`spicy-driver` to parse 4 bytes of input through this
|
|
unit:
|
|
|
|
.. spicy-output:: basic-unit-parse.spicy
|
|
:exec: printf '\01\02\03\04' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
The output comes of course from the ``print`` statement inside the
|
|
``%done`` hook, which executes once the unit has been fully parsed.
|
|
(We will discuss unit hooks further below.)
|
|
|
|
.. _attribute_order:
|
|
|
|
By default, Spicy assumes integers that it parses to be represented in
|
|
network byte order (i.e., big-endian), hence the output above.
|
|
Alternatively, we can tell the parser through an attribute that our
|
|
input is arriving in, say, little-endian instead. To do that, we
|
|
import the ``spicy`` library module, which provides an enum type
|
|
:ref:`spicy_byteorder` that we can give to a ``&byte-order`` field
|
|
attribute for fields that support it:
|
|
|
|
.. spicy-code:: basic-unit-parse-byte-order.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
public type Foo = unit {
|
|
version: uint32 &byte-order=spicy::ByteOrder::Little;
|
|
|
|
on %done {
|
|
print "0x%x" % self.version;
|
|
}
|
|
};
|
|
|
|
.. spicy-output:: basic-unit-parse-byte-order.spicy
|
|
:exec: printf '\01\02\03\04' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
We see that unpacking the value has now flipped the bytes before
|
|
storing it in the ``version`` field.
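
To see both byte orders side by side, consider the following sketch. It is
not one of the tested examples above, and the field names ``big`` and
``little`` are made up for illustration:

.. spicy-code::

    module Test;

    import spicy;

    public type Foo = unit {
        # No attribute: parsed in network byte order (big-endian).
        big: uint16;

        # Same width, but the two input bytes are read as little-endian.
        little: uint16 &byte-order=spicy::ByteOrder::Little;

        on %done {
            print "0x%x 0x%x" % (self.big, self.little);
        }
    };

Fed the four bytes ``\01\02\01\02``, ``big`` should end up holding
``0x0102`` and ``little`` holding ``0x0201``.
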
|
|
|
|
Similar to ``&byte-order``, Spicy offers a variety of further
|
|
attributes that control the specifics of how fields are parsed. We'll
|
|
discuss them in the relevant sections throughout the rest of this
|
|
chapter.
|
|
|
|
Non-type Fields
|
|
^^^^^^^^^^^^^^^
|
|
|
|
Unit fields always have a type. However, in some cases a field's type
|
|
is not explicitly declared, but derived from what's being parsed. The
|
|
main example of this is parsing a constant value: Instead of a type, a
|
|
field can specify a constant of a parseable type. The field's type
|
|
will then (usually) just correspond to the constant's type, and
|
|
parsing will expect to find the corresponding value in the input
|
|
stream. If a different value gets unpacked instead, parsing will abort
|
|
with an error. Example:
|
|
|
|
.. spicy-code:: constant-field.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
bar: b"bar";
|
|
on %done { print self.bar; }
|
|
};
|
|
|
|
.. spicy-output:: constant-field.spicy 1
|
|
:exec: printf 'bar' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. spicy-output:: constant-field.spicy 2
|
|
:exec: printf 'foo' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
:expect-failure:
|
|
|
|
:ref:`Regular expressions <parse_regexp>` extend this scheme a bit
|
|
further: If a field specifies a regular expression constant rather
|
|
than a type, the field will have type :ref:`type_bytes` and store
|
|
the data that ends up matching the regular expression:
|
|
|
|
.. spicy-code:: regexp.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: /Foo.*Bar/;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: regexp.spicy
|
|
:exec: printf 'Foo12345Bar' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
There's also a programmatic way to change a field's type to something
|
|
that's different from what's being parsed; see the
|
|
:ref:`&convert attribute <attribute_convert>`.
|
|
|
|
.. _attribute_size:
|
|
|
|
Parsing Fields With Known Size
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
You can limit the input that a field receives by attaching a
|
|
``&size=EXPR`` attribute that specifies the number of raw bytes to
|
|
make available. This works on top of any other attributes that control
|
|
the field's parsing. From the field's perspective, such a size limit
|
|
acts just like reaching the end of the input stream at the specified
|
|
position. Example:
|
|
|
|
.. spicy-code:: size.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: int16[] &size=6;
|
|
y: bytes &eod;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: size.spicy
|
|
:exec: printf '\000\001\000\002\000\003xyz' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
As you can see, ``x`` receives 6 bytes of input, which it then turns
|
|
into three 16-bit integers.
|
|
|
|
Normally, the field must consume all the bytes specified by ``&size``,
|
|
otherwise a parse error will be triggered. Some types support an
|
|
additional ``&eod`` attribute to lift this restriction; we discuss
|
|
that in the corresponding type's section where applicable.
|
|
|
|
After a field with a ``&size=EXPR`` attribute, parsing will always
|
|
move ahead by the full number of bytes, even if the field did not consume
|
|
them all.
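
Since ``EXPR`` may refer to fields parsed earlier, ``&size`` lends itself to
length-prefixed data. The following is a minimal sketch of that idiom; the
field names ``len`` and ``payload`` are chosen purely for illustration:

.. spicy-code::

    module Test;

    public type Foo = unit {
        # Number of payload bytes that follow.
        len: uint8;

        # Receives exactly `len` bytes of input.
        payload: bytes &size=self.len;

        on %done { print self; }
    };

With the input ``\03abc``, ``payload`` would receive the three bytes
``abc``.
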
|
|
|
|
.. todo::
|
|
|
|
Parsing a regular expression would make a nice example for
|
|
``&size`` as well.
|
|
|
|
Defensively Limiting Input Size
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
On their own, parsers place no intrinsic upper limit on the size of
|
|
variable-size fields or units. This can have negative effects like
|
|
out-of-memory errors, e.g., when available memory is constrained, or for
|
|
malformed input.
|
|
|
|
As a defensive mechanism you can put an upper limit on the data a field or unit
|
|
receives by attaching a ``&max-size=EXPR`` attribute where ``EXPR`` is an
|
|
unsigned integer specifying the maximum number of raw bytes a field or
|
|
unit should receive. If more than ``&max-size`` bytes are consumed during
|
|
parsing, an error will be triggered. This attribute works on top of any other
|
|
attributes that control parsing. Example:
|
|
|
|
.. spicy-code:: max-size.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: bytes &until=b"\x00" &max-size=1024;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: max-size.spicy
|
|
:exec: printf '\001\002\003\004\005\000' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
Here ``x`` will parse a ``NULL``-terminated byte sequence (excluding the
|
|
terminating ``NULL``), but never more than 1024 bytes.
|
|
|
|
``&max-size`` cannot be combined with ``&size``.
|
|
|
|
.. _anonymous_fields:
|
|
|
|
Anonymous Fields
|
|
^^^^^^^^^^^^^^^^
|
|
|
|
Field names are optional. If skipped, the field becomes an *anonymous*
|
|
field. These still participate in parsing as any other field, but they
|
|
won't store any value, nor is there a way to get access to them from
|
|
outside. You can, however, still get to the field's final value inside
|
|
a corresponding field hook (see :ref:`unit_hooks`) using the reserved
|
|
``$$`` identifier (see :ref:`id_dollardollar`).
|
|
|
|
.. spicy-code:: anonymous-field.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: int8;
|
|
: int8 { print $$; } # anonymous field
|
|
y: int8;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: anonymous-field.spicy
|
|
:exec: printf '\01\02\03' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
Anonymous fields can often be more efficient to process because the
|
|
parser doesn't need to retain their values. In particular for larger
|
|
``bytes`` fields, making them anonymous is recommended where possible
|
|
(unless, even better, they can be fully skipped over; see
|
|
:ref:`skip`).
|
|
|
|
.. _skip:
|
|
|
|
Skipping Input
|
|
^^^^^^^^^^^^^^
|
|
|
|
For cases where your parser just needs to skip over some data without
|
|
needing access to its content, Spicy provides a ``skip`` keyword to
|
|
prefix corresponding fields with:
|
|
|
|
.. spicy-code:: skip.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: int8;
|
|
: skip bytes &size=5;
|
|
y: int8;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: skip.spicy
|
|
:exec: printf '\01\02\03\04\05\06\07' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
``skip`` works for all kinds of fields but is particularly efficient
|
|
for fields of known size, for which optimized code will be generated,
|
|
avoiding the overhead of storing any data.
|
|
|
|
``skip`` fields may have conditions and hooks attached, like any other fields.
|
|
However, they do not support ``$$`` in expressions and hooks.
|
|
|
|
Since ``skip`` allows the compiler to optimize the field's parsing
|
|
code---including completely eliding most of it---it remains undefined if any
|
|
side effects associated with the field will take effect. For example,
|
|
``&requires`` attributes might be ignored, ``&convert`` expressions might not
|
|
be evaluated, and hooks could end up not being invoked.
|
|
|
|
For readability, a ``skip`` field may be named (e.g., ``padding: skip
|
|
bytes &size=3;``), but even with a name, its value cannot be accessed.
|
|
|
|
.. _id_dollardollar:
|
|
.. _id_self:
|
|
|
|
Reserved Identifiers
|
|
^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Inside units, two reserved identifiers provide access to values
|
|
currently being parsed:
|
|
|
|
``self``
|
|
Inside a unit's type definition, ``self`` refers to the unit
|
|
instance that's currently being processed. The instance is
|
|
writable and may be modified by assigning to any fields of
|
|
``self``.
|
|
|
|
``$$``
|
|
Inside field attributes, ``$$`` refers to the value as it was
|
|
parsed. Inside field hooks, ``$$`` refers to the final value
|
|
*after* any conversions are applied (see
|
|
:ref:`attribute_convert`). This applies even if the value is not
|
|
going to be directly stored in the field. The value of ``$$`` is
|
|
writable and may be modified.
|
|
|
|
.. note::
|
|
|
|
``$$`` has slightly different semantics in a field attribute and
|
|
in a hook. In an attribute, ``$$`` refers to the parsed value
|
|
*before* any conversions. In a hook, ``$$`` refers to the final
|
|
value *after* any conversions.
|
|
|
|
.. _attribute_convert:
|
|
|
|
On-the-fly Type Conversion with &convert
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Fields may use an attribute ``&convert=EXPR`` to transform the value
|
|
that was just parsed before storing it as the field's final
|
|
value. With the attribute being present, it's the value of ``EXPR``
|
|
that's stored in the field, not the parsed value. Accordingly, the
|
|
field's type also changes to the type of ``EXPR``.
|
|
|
|
Typically, ``EXPR`` will use ``$$`` to access the parsed value and
|
|
then transform it into the desired representation. For example, the
|
|
following stores an integer parsed in an ASCII representation as a
|
|
``uint64``:
|
|
|
|
.. spicy-code:: parse-convert.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
public type Foo = unit {
|
|
x: bytes &eod &convert=$$.to_uint();
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-convert.spicy
|
|
:exec: printf 12345 | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
``&convert`` also works at the unit level to transform a whole
|
|
instance into a different value after it has been parsed:
|
|
|
|
.. spicy-code:: parse-convert-unit.spicy
|
|
|
|
module Test;
|
|
|
|
type Data = unit {
|
|
data: bytes &size=2;
|
|
} &convert=self.data.to_int();
|
|
|
|
public type Foo = unit {
|
|
numbers: Data[];
|
|
|
|
on %done { print self.numbers; }
|
|
};
|
|
|
|
.. spicy-output:: parse-convert-unit.spicy
|
|
:exec: printf 12345678 | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
Note how the ``Data`` instances have been turned into integers.
|
|
Without the ``&convert`` attribute, the output would have looked like
|
|
this::
|
|
|
|
[[$data=b"12"], [$data=b"34"], [$data=b"56"], [$data=b"78"]]
|
|
|
|
.. _attribute_requires:
|
|
|
|
Enforcing Parsing Constraints
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Fields may use an attribute ``&requires=EXPR`` to enforce additional
|
|
constraints on their values. ``EXPR`` must yield a boolean value
|
|
and will be evaluated after the parsing for the field has finished,
|
|
but before any hooks execute. If ``EXPR`` returns ``False``, the
|
|
parsing process will abort with an error, just as if the field had
|
|
been unparsable in the first place (incl. executing any :ref:`%error
|
|
<on_error>` hooks). ``EXPR`` has access to the parsed value through
|
|
:ref:`$$ <id_dollardollar>`. It may also retrieve the field's final
|
|
value through ``self.<field>``, which can be helpful when
|
|
:ref:`&convert <attribute_convert>` is present.
|
|
|
|
Example:
|
|
|
|
.. spicy-code:: parse-requires.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
public type Foo = unit {
|
|
x: int8 &requires=($$ < 5);
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-requires.spicy 1
|
|
:exec: printf '\001' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. spicy-output:: parse-requires.spicy 2
|
|
:exec: printf '\010' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
:expect-failure:
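
When ``&convert`` is present, ``$$`` and ``self.<field>`` see different
values inside ``&requires``. The following sketch, which is not one of the
tested examples, assumes the semantics described above and checks both the
raw input and the converted result:

.. spicy-code::

    module Test;

    public type Foo = unit {
        # Inside &requires, $$ is the raw `bytes` value as parsed, while
        # self.x is the `uint64` produced by &convert.
        x: bytes &eod &convert=$$.to_uint()
            &requires=(|$$| <= 5 && self.x < 50000);

        on %done { print self; }
    };

Here the first half of the condition limits the number of input digits, and
the second half limits the resulting number.
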
|
|
|
|
.. versionadded:: 1.12 Custom error messages
|
|
|
|
Instead of computing a boolean value directly, ``EXPR`` can also
|
|
leverage the :ref:`condition test operator <operator_condition_test>`
|
|
to provide a custom error message when the condition fails. Example:
|
|
|
|
.. spicy-code:: parse-requires-with-error.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
public type Foo = unit {
|
|
x: int8 &requires=($$ < 5 : "x is too large"); # custom error message
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-requires-with-error.spicy
|
|
:exec: printf '\010' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
:expect-failure:
|
|
|
|
One can also enforce conditions globally at the unit level through an attribute
|
|
``&requires = EXPR``. ``EXPR`` will be evaluated once the unit has been fully
|
|
parsed, but before any ``%done`` hook executes. If ``EXPR`` returns ``False``,
|
|
the unit's parsing process will abort with an error. As usual, ``EXPR`` has
|
|
access to the parsed instance through ``self``. More than one ``&requires``
|
|
attribute may be specified.
|
|
|
|
Example:
|
|
|
|
.. spicy-code:: parse-requires-property.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
public type Foo = unit {
|
|
x: int8;
|
|
on %done { print self; }
|
|
} &requires = self.x < 5;
|
|
|
|
|
|
.. spicy-output:: parse-requires-property.spicy 1
|
|
:exec: printf '\001' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. spicy-output:: parse-requires-property.spicy 2
|
|
:exec: printf '\010' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
:expect-failure:
|
|
|
|
.. _unit_hooks:
|
|
|
|
Unit Hooks
|
|
===========
|
|
|
|
Unit hooks provide one of the most powerful Spicy tools to control
|
|
parsing, track state, and retrieve results. Generally, hooks are
|
|
blocks of code triggered to execute at certain points during parsing,
|
|
with access to the current unit instance.
|
|
|
|
Conceptually, unit hooks are somewhat similar to methods: They have
|
|
bodies that execute when triggered, and these bodies may receive a set
|
|
of parameters as input. Different from functions, however, a hook can
|
|
have more than one body. If multiple implementations are provided for
|
|
the same hook, all of them will execute successively. A hook may also
|
|
not have any body implemented at all, in which case there's nothing to
|
|
do when it executes.
|
|
|
|
The most commonly used hooks are:
|
|
|
|
``on %init() { ... }``
|
|
Executes just before unit parsing will start.
|
|
|
|
``on %done { ... }``
|
|
Executes just after unit parsing has completed successfully.
|
|
|
|
.. _on_error:
|
|
|
|
``on %error { ... }`` or ``on %error(msg: string) { ... }``
|
|
Executes when a parse error has been encountered, just before the
|
|
parser aborts processing. If the second form is used, a
|
|
description of the error will be provided through the string
|
|
argument.
|
|
|
|
``on %finally { ... }``
|
|
Executes once unit parsing has completed in any way. This hook is
|
|
most useful to modify global state that needs to be updated no
|
|
matter the success of the parsing process. Once ``%init`` triggers, this
|
|
hook is guaranteed to eventually execute as well. It will run
|
|
*after* either ``%done`` or ``%error``, respectively. (If a new
|
|
error occurs during execution of ``%finally``, that will not
|
|
trigger the unit's ``%error`` hook.)
|
|
|
|
``on %print { ... }``
|
|
Executes when a unit is about to be printed (and more generally:
|
|
when rendered into a string representation). By default, printing
|
|
a unit will produce a list of its fields with their current
|
|
values. Through this hook, a unit can customize its appearance by
|
|
returning the desired string.
|
|
|
|
``on <field name> { ... }`` (field hook)
|
|
Executes just after the given unit field has been parsed. The
|
|
final value is accessible through the ``$$`` identifier, potentially with
|
|
any relevant type conversion applied (see
|
|
:ref:`attribute_convert`). The same will also have been assigned
|
|
to the field already.
|
|
|
|
.. _foreach:
|
|
|
|
``on <field name> foreach { ... }`` (container hook)
|
|
Assuming the specified field is a container (e.g., a vector), this
|
|
executes each time a new container element has been parsed, and
|
|
just before it's been added to the container. The element's final
|
|
value is accessible through the ``$$`` identifier, although it
|
|
can be further modified before it's stored. The hook
|
|
implementation may also use the :ref:`statement_stop` statement to
|
|
abort container parsing, without the current element being added
|
|
anymore.
|
|
|
|
In addition, Spicy provides a set of hooks specific to the ``sink`` type which
|
|
are discussed in the :ref:`section on sinks <sinks>`, and hooks which are
|
|
executed during :ref:`error recovery <error_recovery_hooks>`.
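
As a sketch of how the life-cycle hooks listed above relate to each other,
the following unit implements ``%init``, ``%done``, ``%error``, and
``%finally`` together. The messages are made up for illustration:

.. spicy-code::

    module Test;

    public type Foo = unit {
        on %init { print "about to parse Foo"; }

        # Fails for values of 5 and above, triggering %error.
        x: uint8 &requires=($$ < 5);

        on %done { print "parsed x=%d" % self.x; }
        on %error(msg: string) { print "parse error: %s" % msg; }
        on %finally { print "finished, successfully or not"; }
    };

Going by the descriptions above, the input ``\001`` should run ``%init``,
``%done``, and ``%finally``, whereas ``\010`` should run ``%init``,
``%error``, and ``%finally``.
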
|
|
|
|
There are three locations where hooks can be implemented:
|
|
|
|
- Inside a unit, ``on <hook name> { ... }`` implements the hook of the
|
|
given name:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
x: uint32;
|
|
v: uint8[];
|
|
|
|
on %init { ... }
|
|
on x { ... }
|
|
on v foreach { ... }
|
|
on %done { ... }
|
|
};
|
|
|
|
- Field and container hooks may be directly attached to their field,
|
|
skipping the ``on ...`` part:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
x: uint32 { ... }
|
|
v: uint8[] foreach { ... }
|
|
};
|
|
|
|
- At the global module level, one can add hooks to any available unit
|
|
type through ``on <unit type>::<hook name> { ... }``. With the
|
|
definition of ``Foo`` above, this implements hooks externally:
|
|
|
|
.. spicy-code::
|
|
|
|
on Foo::%init { ... }
|
|
on Foo::x { ... }
|
|
on Foo::v foreach { ... }
|
|
on Foo::%done { ... }
|
|
|
|
External hooks work across module boundaries by qualifying the unit
|
|
type accordingly. They provide a powerful mechanism to extend a
|
|
predefined unit without changing any of its code.
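
To illustrate a container hook attached directly to its field, here is a
minimal sketch that uses ``stop`` to end a vector at the first zero byte.
The field names ``v`` and ``rest`` are illustrative only:

.. spicy-code::

    module Test;

    public type Foo = unit {
        # Collect bytes until a zero shows up; because of `stop`, the
        # zero itself is not added to the vector.
        v: uint8[] foreach {
            if ( $$ == 0 ) {
                stop;
            }
        }

        # Everything after the terminating zero.
        rest: bytes &eod;

        on %done { print self; }
    };
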
|
|
|
|
If multiple implementations are provided for the same hook, by default
|
|
it remains undefined in which order they will execute. If a particular
|
|
order is desired, you can specify priorities for your hook
|
|
implementations:
|
|
|
|
.. spicy-code::
|
|
|
|
on Foo::v priority=5 { ... }
|
|
on Foo::v priority=-5 { ... }
|
|
|
|
Implementations then execute in order of their priorities: The higher a
|
|
priority value, the earlier it will execute. If not specified, a
|
|
hook's priority is implicitly taken as zero.
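
The following sketch combines an in-unit hook with two external
implementations of the same hook. The priority values and messages are
arbitrary choices for illustration:

.. spicy-code::

    module Test;

    public type Foo = unit {
        x: uint8;

        # No priority given, so this one counts as priority 0.
        on %done { print "default priority"; }
    };

    on Foo::%done priority=5 { print "priority 5"; }
    on Foo::%done priority=-5 { print "priority -5"; }

Following the rule above, the three bodies should execute in the order
``priority 5``, ``default priority``, ``priority -5``.
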
|
|
|
|
.. note::
|
|
|
|
When a hook executes, it has access to the current unit instance
|
|
through the ``self`` identifier. The state of that instance will
|
|
reflect where parsing is at that time. In particular, any field
|
|
that hasn't been parsed yet, will remain unset. You can use the
|
|
``?.`` unit operator to test if a field has received a value yet.
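
A small sketch of that test, using a field hook that runs while only part of
the unit has been parsed (field names are illustrative):

.. spicy-code::

    module Test;

    public type Foo = unit {
        a: uint8 {
            # At this point `a` has just been parsed, but `b` has not.
            print "a set: %s, b set: %s" % (self?.a, self?.b);
        }

        b: uint8;
    };

Inside the hook for ``a``, ``self?.a`` should yield ``True`` and
``self?.b`` should yield ``False``.
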
|
|
|
|
Unit Variables
|
|
==============
|
|
|
|
In addition to unit fields for parsing, you can also add further instance
|
|
variables to a unit type to store arbitrary state:
|
|
|
|
.. spicy-code:: unit-vars.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
on %init { print self; }
|
|
x: int8 { self.a = "Our integer is %d" % $$; }
|
|
on %done { print self; }
|
|
|
|
var a: string;
|
|
};
|
|
|
|
.. spicy-output:: unit-vars.spicy
|
|
:exec: printf \05 | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
Here, we assign a string value to ``a`` once we have parsed ``x``. The
|
|
final ``print`` shows the expected value. As you can also see, before
|
|
we assign anything, the variable's value is just empty: Spicy
|
|
initializes unit variables with well-defined defaults. If you
|
|
would rather leave a variable unset by default, you can add
|
|
``&optional``:
|
|
|
|
.. spicy-code:: unit-vars-optional.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
on %init { print self; }
|
|
x: int8 { self.a = "Our integer is %d" % $$; }
|
|
on %done { print self; }
|
|
|
|
var a: string &optional;
|
|
};
|
|
|
|
.. spicy-output:: unit-vars-optional.spicy
|
|
:exec: printf \05 | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
You can use the ``?.`` unit operator to test if an optional unit variable
|
|
is set; e.g., ``self?.a`` would return ``True`` if variable ``a`` is set
|
|
and ``False`` otherwise.
|
|
|
|
Unit variables can also be initialized with custom expressions when being
|
|
defined. The initialization is performed just before the containing unit starts
|
|
parsing (implying that the expressions cannot access parse results
|
|
of the unit itself yet):
|
|
|
|
.. spicy-code:: unit-vars-init.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: int8;
|
|
var a: int8 = 123;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: unit-vars-init.spicy
|
|
:exec: printf \05 | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. _unit_parameters:
|
|
|
|
Unit Parameters
|
|
===============
|
|
|
|
Unit types can receive parameters upon instantiation, which will then be
|
|
available to any code inside the type's declaration:
|
|
|
|
.. spicy-code:: unit-params.spicy
|
|
|
|
module Test;
|
|
|
|
type Bar = unit(msg: string, mult: int8) {
|
|
x: int8 &convert=($$ * mult);
|
|
on %done { print "%s: %d" % (msg, self.x); }
|
|
};
|
|
|
|
public type Foo = unit {
|
|
y: Bar("My multiplied integer", 5);
|
|
};
|
|
|
|
.. spicy-output:: unit-params.spicy
|
|
:exec: printf '\05' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
This example shows a typical idiom: We're handing parameters down to a
|
|
subunit through parameters it receives. Inside the subunit, we then
|
|
have access to the values passed in.
|
|
|
|
.. note:: It's usually not very useful to define a top-level parsing
|
|
unit with parameters because we don't have a way to pass anything
|
|
in through ``spicy-driver``. A custom host application could make
|
|
use of them, though.
|
|
|
|
This works with subunits inside containers as well:
|
|
|
|
.. spicy-code:: unit-params-vector.spicy
|
|
|
|
module Test;
|
|
|
|
type Bar = unit(mult: int8) {
|
|
x: int8 &convert=($$ * mult);
|
|
on %done { print self.x; }
|
|
};
|
|
|
|
public type Foo = unit {
|
|
x: int8;
|
|
y: Bar(self.x)[];
|
|
};
|
|
|
|
.. spicy-output:: unit-params-vector.spicy
|
|
:exec: printf '\05\01\02\03' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
A common use-case for unit parameters is passing the ``self`` of a
|
|
higher-level unit down into a subunit:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
...
|
|
b: Bar(self);
|
|
...
|
|
};
|
|
|
|
type Bar = unit(foo: Foo) {
|
|
# We now have access to any state in "foo".
|
|
};
|
|
|
|
That way, the subunit can for example store state directly in the
|
|
parent. If you declare the ``foo`` parameter as ``inout``, the subunit
|
|
can also modify its members.
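
A fuller sketch of the ``inout`` variant might look as follows. The unit
variable ``total`` and the field names are hypothetical. Each ``Bar``
instance adds the byte it parsed to a running total kept in its parent:

.. spicy-code::

    module Test;

    public type Foo = unit {
        a: Bar(self);
        b: Bar(self);

        on %done { print self.total; }

        # Starts out at its type's default value.
        var total: uint8;
    };

    type Bar = unit(inout foo: Foo) {
        # `inout` lets the subunit write to the parent's members.
        n: uint8 { foo.total = foo.total + $$; }
    };
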
|
|
|
|
Unit parameters generally follow the same passing conventions as
|
|
:ref:`function parameters <functions>`, yet with some restrictions.
|
|
Just like with functions, parameters are read-only by
|
|
default. If you want the receiving unit to be able to modify the
|
|
value, there are two options:
|
|
|
|
1. If the parameter itself is a unit, you can declare it as ``inout``
|
|
as described above.
|
|
|
|
2. For all other types, you instead need to pass the parameter as a
|
|
:ref:`reference <type_reference>`. Here's an example passing a
|
|
string so that it can be modified by the subunit:
|
|
|
|
.. spicy-code:: unit-params-string.spicy
|
|
|
|
module Test;
|
|
|
|
type X = unit(s: string&) {
|
|
n: uint8 {
|
|
*s = "Hello, world!";
|
|
}
|
|
};
|
|
|
|
public type Y = unit {
|
|
x: X(self.s);
|
|
|
|
on %done { print self.s; }
|
|
|
|
var s: string& = new string;
|
|
};
|
|
|
|
.. spicy-output:: unit-params-string.spicy
|
|
:exec: printf '\x2a' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. **
|
|
|
|
.. note::
|
|
|
|
While this lack of support for ``inout`` may seem like a
|
|
surprising restriction at first, it follows from Spicy's safety
|
|
guarantees: since a subunit may access its parameters during its
|
|
entire lifetime, generally Spicy couldn't guarantee that a
|
|
parameter passed as ``inout`` at initialization time would
|
|
actually remain around for modification the whole time. References
|
|
do not have that problem: their wrapped values are guaranteed to
|
|
remain valid as long as necessary. (Units happen to share that
|
|
behaviour, too, which is why Spicy can support ``inout`` for
|
|
them.)
|
|
|
|
.. _unit_attributes:
|
|
|
|
Unit Attributes
|
|
===============
|
|
|
|
Unit types support the following type attributes:
|
|
|
|
``&byte-order=ORDER``
|
|
Specifies a byte order to use for parsing the unit where ``ORDER`` is of
|
|
type :ref:`spicy_ByteOrder`. This overrides the byte order specified for the
|
|
module. Individual fields can override this value by specifying their own
|
|
byte-order. Example:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
version: uint32;
|
|
} &byte-order=spicy::ByteOrder::Little;
|
|
|
|
``&convert=EXPR``
|
|
Replaces a unit instance with the result of the expression
|
|
``EXPR`` after parsing it from inside a parent unit. See
|
|
:ref:`attribute_convert` for an example. ``EXPR`` has access to
|
|
``self`` to retrieve state from the unit.
|
|
|
|
``&requires=EXPR``
|
|
Enforces post-conditions on the parsed unit. ``EXPR`` must be a boolean
|
|
expression that will be evaluated after the parsing for the unit has
|
|
finished, but before any hooks execute. More than one ``&requires``
|
|
attribute may be specified. Example:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
a: int8;
|
|
b: int8;
|
|
} &requires=self.a==self.b;
|
|
|
|
See the :ref:`section on parsing constraints <attribute_requires>` for more
|
|
details.
|
|
|
|
``&size=N``
|
|
Limits the unit's input to ``N`` bytes, which it must fully
|
|
consume. Example:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
a: int8;
|
|
b: bytes &eod;
|
|
} &size=5;
|
|
|
|
This expects 5 bytes of input when parsing an instance of ``Foo``.
|
|
The unit will store the first byte into ``a``, and then fill ``b``
|
|
with the remaining 4 bytes.
|
|
|
|
The expression ``N`` has access to ``self`` as well as to the
|
|
unit's parameters.
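
As a sketch of a unit-level ``&size`` that depends on a parameter, the
following passes a previously parsed length down into a subunit. The type
and field names are made up for illustration:

.. spicy-code::

    module Test;

    type Record = unit(n: uint8) {
        tag: uint8;

        # Takes whatever is left of the `n` bytes granted to the unit.
        data: bytes &eod;
    } &size=n;

    public type Foo = unit {
        len: uint8;
        rec: Record(self.len);

        on %done { print self; }
    };
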
|
|
|
|
.. _unit_meta_data:
|
|
|
|
Meta data
|
|
=========
|
|
|
|
Units can provide meta data about their semantics through *properties*
|
|
that both Spicy itself and host applications can access. One defines
|
|
properties inside the unit's type through either a ``%<property> =
|
|
<value>;`` tuple, or just as ``%<property>;`` if the property does not
|
|
take an argument. Currently, units support the following meta data
|
|
properties:
|
|
|
|
``%mime-type = STRING``
|
|
A string of the form ``"<type>/<subtype>"`` that defines the MIME
|
|
type for content the unit knows how to parse. This may include a
|
|
``*`` wildcard for either the type or subtype. We use a
|
|
generalized notion of MIME types here that can include custom
|
|
meanings. See :ref:`sinks` for more on how these MIME types are
|
|
used to select parsers dynamically during runtime.
|
|
|
|
You can specify this property more than once to associate a unit
|
|
with multiple types.
|
|
|
|
``%description = STRING``
|
|
A short textual description of the unit type (i.e., the parser
|
|
that it defines). Host applications have access to this property,
|
|
and ``spicy-driver`` includes the information in the list of
|
|
available parsers that it prints with the ``--list-parsers``
|
|
option.
|
|
|
|
``%port = PORT_VALUE [&originator|&responder]``
|
|
A :ref:`type_port` to associate this unit with, optionally
|
|
including a direction to limit its use to the corresponding side.
|
|
This property has no built-in effect, but host applications may
|
|
make use of the information to decide which unit type to use for
|
|
parsing a connection's payload.
|
|
|
|
``%skip = ( REGEXP | Null );``
|
|
Specifies a pattern which should be skipped when encountered in the input
|
|
stream in between parsing of unit fields. This overwrites a value set at
|
|
the module level; use ``Null`` to reset the property, i.e., not skip
|
|
anything.
|
|
|
|
``%skip-pre = ( REGEXP | Null );``
|
|
Specifies a pattern which should be skipped when encountered in the input
|
|
stream before parsing of a unit begins. This overwrites a value set at the
|
|
module level; use ``Null`` to reset the property, i.e., not skip anything.
|
|
|
|
``%skip-post = ( REGEXP | Null );``
|
|
Specifies a pattern which should be skipped when encountered in the input
|
|
stream after parsing of a unit has finished. This overwrites a value set at
|
|
the module level; use ``Null`` to reset the property, i.e., not skip
|
|
anything.
|
|
|
|
.. _synchronize-at:
|
|
|
|
``%synchronize-at = EXPR;``
|
|
Specifies a literal to synchronize on if the unit is used as a
|
|
synchronization point during :ref:`error recovery <error_recovery>`.
|
|
The literal is left in the input stream.
|
|
|
|
.. _synchronize-after:
|
|
|
|
``%synchronize-after = EXPR;``
|
|
Specifies a literal to synchronize on if the unit is used as a
|
|
synchronization point during :ref:`error recovery <error_recovery>`.
|
|
The literal is consumed and will not be present in the input stream after
|
|
successful synchronization.
|
|
|
|
Units support some further properties for other purposes, which we
|
|
introduce in the corresponding sections.
|
|
|
|
Parsing Types
|
|
=============
|
|
|
|
Several, but not all, of Spicy's :ref:`data types <types>` can be
|
|
parsed from binary data. In the following we summarize the types that
|
|
can, along with any options they support to control specifics of how
|
|
they unpack binary representations.
|
|
|
|
.. _parse_address:
|
|
|
|
Address
|
|
^^^^^^^
|
|
|
|
Spicy parses :ref:`addresses <type_address>` from either 4 bytes of
|
|
input for IPv4 addresses, or 16 bytes for IPv6 addresses. To select
|
|
the type, a unit field of type ``addr`` must come with either an
|
|
``&ipv4`` or ``&ipv6`` attribute.
|
|
|
|
By default, addresses are assumed to be represented in network byte
|
|
order. Alternatively, a different byte order can be specified through
|
|
a ``&byte-order`` attribute specifying the desired
|
|
:ref:`spicy_byteorder`.
|
|
|
|
Example:
|
|
|
|
.. spicy-code:: parse-address.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
public type Foo = unit {
|
|
ip: addr &ipv6 &byte-order=spicy::ByteOrder::Little;
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-address.spicy
|
|
:exec: printf '1234567890123456' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. _parse_bitfield:
|
|
|
|
Bitfield
|
|
^^^^^^^^
|
|
|
|
:ref:`Bitfields <type_bitfield>` parse an integer value of a given
|
|
size, and then make selected smaller bit ranges within that value
|
|
available individually through dedicated identifiers. For example, the
|
|
following unit parses 4 bytes as a ``uint32`` and then makes the
|
|
value of bit 0 available as ``f.x1``, bits 1 to 2 as ``f.x2``, and
|
|
bits 3 to 4 as ``f.x3``, respectively:
|
|
|
|
.. spicy-code:: parse-bitfield.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
f: bitfield(32) {
|
|
x1: 0;
|
|
x2: 1..2;
|
|
x3: 3..4;
|
|
};
|
|
|
|
on %done {
|
|
print self.f.x1, self.f.x2, self.f.x3;
|
|
print self;
|
|
}
|
|
};
|
|
|
|
.. spicy-output:: parse-bitfield.spicy
|
|
:exec: printf '\01\02\03\04' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
Generally, a ``bitfield(N)`` field is parsed like a
|
|
``uint<N>``. The field then supports dereferencing individual bit
|
|
ranges through their labels. The corresponding expressions
|
|
(``self.x.<id>``) have the same ``uint<N>`` type as the parsed value
|
|
itself, with the value shifted to the right so that the least significant
|
|
extracted bit becomes the least significant bit of the returned value. As you can see in
|
|
the example, the type of the field itself becomes a tuple composed of
|
|
the values of the individual bit ranges.
|
|
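The shift-and-mask step described above can be modeled in a few lines of Python. This is only a sketch of the arithmetic under the default settings (network byte order, ``LSB0`` bit numbering), not Spicy's implementation; the helper name ``extract_bits`` is made up for this illustration. It uses the same four input bytes as the example above.

```python
def extract_bits(value: int, lower: int, upper: int) -> int:
    """Return bits lower..upper (inclusive, LSB0 numbering) of value,
    shifted right so that bit `lower` becomes bit 0 of the result."""
    width = upper - lower + 1
    return (value >> lower) & ((1 << width) - 1)


# Interpret 4 bytes as a 32-bit unsigned integer in network byte order.
raw = bytes([0x01, 0x02, 0x03, 0x04])
value = int.from_bytes(raw, "big")  # 0x01020304

x1 = extract_bits(value, 0, 0)  # bit 0
x2 = extract_bits(value, 1, 2)  # bits 1 to 2
x3 = extract_bits(value, 3, 4)  # bits 3 to 4
print(x1, x2, x3)  # prints: 0 2 0
```

Only the low byte ``0x04`` (binary ``00000100``) contributes here, which is why just the range covering bit 2 yields a non-zero value.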
|
|
By default, a bitfield assumes the underlying integer comes in network
|
|
byte order. You can specify a ``&byte-order`` attribute to change that
|
|
(e.g., ``bitfield(32) { ... } &byte-order=spicy::ByteOrder::Little``).
|
|
|
|
When parsing a ``bitfield(16)`` in network byte order and with bit order
|
|
``spicy::BitOrder::LSB0`` (default value of ``&bit-order``), bits are
|
|
numbered 0 to 15 from right to left.
|
|
|
|
.. code::
|
|
|
|
    MSB                           LSB
           <-- 1               <-- 0
     5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
    +---------------+---------------+
    |               |               |
    +-------------------------------+
|
|
|
|
|
|
This default bit numbering may be surprising given that some RFCs use the inverse
|
|
as documented in `RFC 1700 <https://www.rfc-editor.org/rfc/rfc1700.html>`_.
|
|
Here, the most significant bit is numbered 0 on the left with higher
|
|
bit numbers representing less significant bits to the right.
|
|
Concrete examples would be the `WebSocket framing <https://datatracker.ietf.org/doc/html/rfc6455#section-5.2>`_
|
|
or `IPv4 header <https://datatracker.ietf.org/doc/html/rfc791#section-3.1>`_
|
|
notations.
|
|
|
|
.. code::
|
|
|
|
    MSB                           LSB
     0 -->               1 -->
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
    +-+-+-+-+-------+-+-------------+
    |F|R|R|R| opcode|M| Payload len |
    |I|S|S|S|  (4)  |A|     (7)     |
    |N|V|V|V|       |S|             |
    | |1|2|3|       |K|             |
    +-+-+-+-+-------+-+-------------+
|
|
|
|
To express such bitfields more naturally in Spicy, use ``&bit-order=spicy::BitOrder::MSB0``
|
|
on the whole bitfield:
|
|
|
|
.. spicy-code:: parse-websocket-bitfield.spicy
|
|
|
|
module WebSocket;
|
|
|
|
import spicy;
|
|
|
|
public type Header = unit {
|
|
: bitfield(16) {
|
|
fin: 0;
|
|
rsv: 1..3;
|
|
opcode: 4..7;
|
|
mask: 8;
|
|
payload_len: 9..15;
|
|
} &bit-order=spicy::BitOrder::MSB0;
|
|
};
|
|
|
|
The way to think about this is that the most significant bit of an integer in
network byte order is always the leftmost bit and the least significant bit
the rightmost one. Specifying the bit order as ``LSB0`` or ``MSB0`` essentially
sets the bit numbering direction by specifying the location of bit 0.
|
|
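To make the relationship between the two numbering schemes concrete, here is a small Python sketch (an illustration only, not Spicy's implementation). The helper ``extract_bits_msb0`` and the two sample header bytes are assumptions chosen for the example; the bit ranges are the ones from the WebSocket unit above, applied to a 16-bit value in network byte order.

```python
def extract_bits_msb0(value: int, width: int, lower: int, upper: int) -> int:
    """Return bits lower..upper (inclusive, MSB0 numbering) of a width-bit value."""
    # Bit i in MSB0 numbering is bit (width - 1 - i) in LSB0 numbering.
    shift = width - 1 - upper
    nbits = upper - lower + 1
    return (value >> shift) & ((1 << nbits) - 1)


# Hypothetical first two bytes of a frame:
# FIN=1, RSV=0, opcode=1, MASK=1, payload length=5.
header = int.from_bytes(bytes([0x81, 0x85]), "big")

fields = {
    "fin": extract_bits_msb0(header, 16, 0, 0),
    "rsv": extract_bits_msb0(header, 16, 1, 3),
    "opcode": extract_bits_msb0(header, 16, 4, 7),
    "mask": extract_bits_msb0(header, 16, 8, 8),
    "payload_len": extract_bits_msb0(header, 16, 9, 15),
}
print(fields)
# prints: {'fin': 1, 'rsv': 0, 'opcode': 1, 'mask': 1, 'payload_len': 5}
```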
|
|
With little endian byte order, the bits are numbered zigzag-wise and
|
|
``MSB0`` and ``LSB0`` can again be used to change the direction of the bit
|
|
numbering. The following example uses ``spicy::ByteOrder::Little`` and
|
|
the default ``LSB0`` bit order for ``bitfield(16)``. Notice how the most
|
|
significant and least significant bit for a 2 byte little endian integer
|
|
are next to each other.
|
|
|
|
.. code::
|
|
|
|
    f: bitfield(16) {
        ...
    } &byte-order=spicy::ByteOrder::Little;

    LSB                           MSB
               <-- 0       <-- 1
     7 6 5 4 3 2 1 0 5 4 3 2 1 0 9 8
    +---------------+---------------+
    |               |               |
    +-------------------------------+
|
|
|
|
With ``MSB0`` as bit order, the bit numbering direction is from left to right, instead:
|
|
|
|
.. code::
|
|
|
|
    f: bitfield(16) {
        ...
    } &byte-order=spicy::ByteOrder::Little &bit-order=spicy::BitOrder::MSB0;

    LSB                           MSB
         1 -->       0 -->
     8 9 0 1 2 3 4 5 0 1 2 3 4 5 6 7
    +---------------+---------------+
    |               |               |
    +-------------------------------+
|
|
|
|
|
|
Bit numbering for larger bitfields in little endian only gets more
|
|
confusing. Prefer network byte ordered bitfields unless it makes sense given
|
|
the spec you're working with.
|
|
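The following Python sketch illustrates why little-endian numbering appears to zigzag. It interprets the same two bytes in both byte orders and then reads bits 0 and 15 using ``LSB0`` numbering; the sample bytes and the helper ``bit_lsb0`` are assumptions made for this illustration.

```python
def bit_lsb0(value: int, index: int) -> int:
    """Return the bit at position `index`, counting from the least significant bit."""
    return (value >> index) & 1


raw = bytes([0x01, 0x80])  # bytes in the order they appear in the input

as_big = int.from_bytes(raw, "big")        # 0x0180
as_little = int.from_bytes(raw, "little")  # 0x8001

print(bit_lsb0(as_big, 0), bit_lsb0(as_big, 15))        # prints: 0 0
print(bit_lsb0(as_little, 0), bit_lsb0(as_little, 15))  # prints: 1 1
```

In the little-endian reading, bit 0 comes from the last bit of the first byte and bit 15 from the first bit of the second byte, which are neighbors in the input.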
|
|
The individual bit ranges support the ``&convert`` attribute and will
|
|
adjust their types accordingly, just like a regular unit field (see
|
|
:ref:`attribute_convert`). For example, that allows for mapping a bit
|
|
range to an enum, using ``$$`` to access the parsed value:
|
|
|
|
.. spicy-code:: parse-bitfield-enum.spicy
|
|
|
|
module Test;
|
|
|
|
import spicy;
|
|
|
|
type X = enum { A = 1, B = 2 };
|
|
|
|
public type Foo = unit {
|
|
f: bitfield(8) {
|
|
x1: 0..3 &convert=X($$);
|
|
x2: 4..7 &convert=X($$);
|
|
} { print self.f.x1, self.f.x2; }
|
|
};
|
|
|
|
.. spicy-output:: parse-bitfield-enum.spicy
|
|
:exec: printf '\x21' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
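As a rough analogy, the same mapping can be written in Python with the standard ``enum`` module. This only mirrors the logic of the example above for the input byte ``0x21``; how Spicy treats values without a matching label is not modeled here (this sketch would raise ``ValueError`` in that case).

```python
from enum import Enum


class X(Enum):
    A = 1
    B = 2


byte = 0x21

x1 = X(byte & 0x0F)         # bits 0..3 hold 1, which maps to X.A
x2 = X((byte >> 4) & 0x0F)  # bits 4..7 hold 2, which maps to X.B
print(x1.name, x2.name)  # prints: A B
```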
|
|
When parsing a bitfield, you can enforce expected values for some
|
|
or all of the bit ranges through an assignment-style syntax:
|
|
|
|
.. spicy-code::
|
|
|
|
type Foo = unit {
|
|
f: bitfield(8) {
|
|
x1: 0..3 = 2;
|
|
x2: 4..5;
|
|
x3: 6..7 = 3;
|
|
}
|
|
};
|
|
|
|
Now parsing will fail if values of ``x1`` and ``x3`` aren't ``2`` and
|
|
``3``, respectively. Internally, Spicy treats bitfields with such
|
|
expected values similar to constants of other types, meaning they
|
|
operate as valid look-ahead symbols as well (see
|
|
:ref:`parse_lookahead`).
|
|
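Conceptually, the expected values act as a check performed while parsing. A minimal Python model of the unit above might look as follows; ``ParseError`` and ``parse_foo`` are hypothetical names, and the sketch ignores look-ahead entirely.

```python
class ParseError(Exception):
    """Raised when the input does not match the expected values."""


def parse_foo(byte: int) -> dict:
    x1 = byte & 0x0F         # bits 0..3
    x2 = (byte >> 4) & 0x03  # bits 4..5
    x3 = (byte >> 6) & 0x03  # bits 6..7
    if x1 != 2:
        raise ParseError(f"x1 is {x1}, expected 2")
    if x3 != 3:
        raise ParseError(f"x3 is {x3}, expected 3")
    return {"x1": x1, "x2": x2, "x3": x3}


print(parse_foo(0xD2))  # prints: {'x1': 2, 'x2': 1, 'x3': 3}
```

A byte such as ``0x12`` fails the check because its top two bits are ``0`` rather than ``3``.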
|
|
.. _parse_bytes:
|
|
|
|
Bytes
|
|
^^^^^
|
|
|
|
When parsing a field of type :ref:`type_bytes`, Spicy will consume raw
|
|
input bytes according to a specified attribute that determines when to
|
|
stop. The following attributes are supported:
|
|
|
|
``&eod``
|
|
Consumes all subsequent data until the end of the input is reached.
|
|
|
|
``&size=N``
|
|
Consumes exactly ``N`` bytes. The attribute may be combined with
|
|
``&eod`` to consume up to ``N`` bytes instead (i.e., permit
|
|
running out of input before the size limit is reached).
|
|
|
|
(This attribute :ref:`works for fields of all types
|
|
<attribute_size>`. We list it here because it's particularly
|
|
common to use it with ``bytes``.)
|
|
|
|
``&until=DELIM``
|
|
Consumes bytes until the specified delimiter is found. ``DELIM``
|
|
must be of type ``bytes`` itself. The delimiter will not be
|
|
included in the resulting value, but it is still consumed.
|
|
|
|
``&until-including=DELIM``
|
|
Similar to ``&until``, but this does include the delimiter
|
|
``DELIM`` into the resulting value.
|
|
|
|
At least one of these attributes must be provided.
|
|
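The differences between these attributes are easiest to see side by side. The Python sketch below models each one as a function over a complete input buffer that returns the field value and the remaining input. The function names are made up for this illustration, and a real parser works incrementally on a stream rather than on a fully buffered ``bytes`` object.

```python
from typing import Tuple


def take_eod(data: bytes) -> Tuple[bytes, bytes]:
    """Model of &eod: consume everything."""
    return data, b""


def take_size(data: bytes, n: int, eod: bool = False) -> Tuple[bytes, bytes]:
    """Model of &size=N; with eod=True, at most N bytes are consumed."""
    if len(data) < n and not eod:
        raise ValueError("not enough input to satisfy the size")
    return data[:n], data[n:]


def take_until(data: bytes, delim: bytes, including: bool = False) -> Tuple[bytes, bytes]:
    """Model of &until / &until-including: the delimiter is always consumed."""
    idx = data.find(delim)
    if idx < 0:
        raise ValueError("delimiter not found")
    end = idx + len(delim)
    value = data[:end] if including else data[:idx]
    return value, data[end:]


print(take_until(b"GET /\r\nrest", b"\r\n"))
# prints: (b'GET /', b'rest')
print(take_until(b"GET /\r\nrest", b"\r\n", including=True))
# prints: (b'GET /\r\n', b'rest')
```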
|
|
On top of that, bytes fields support the attribute ``&chunked`` to
|
|
change how the parsed data is processed and stored. Normally, a bytes
|
|
field will first accumulate all desired data and then store the final,
|
|
complete value in the field. With ``&chunked``, if the data arrives
|
|
incrementally in pieces, the field instead processes just whatever is
|
|
available at a time, storing each piece directly, and individually, in
|
|
the field. Each time a piece gets stored, any associated field hooks
|
|
execute with the new part as their ``$$``. Parsing with ``&chunked``
|
|
will eventually still consume the same number of bytes overall, but it
|
|
avoids buffering everything in cases where that's either infeasible or
|
|
simply not needed.
|
|
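The effect of ``&chunked`` on hooks can be pictured with the following Python sketch. It is a simplified model, not Spicy's runtime: ``parse_bytes`` is a hypothetical function, the list of chunks stands in for data arriving incrementally, and ``hook`` plays the role of a field hook receiving ``$$``.

```python
from typing import Callable, Iterable, List


def parse_bytes(chunks: Iterable[bytes], hook: Callable[[bytes], None], chunked: bool) -> int:
    """Feed chunks to a bytes field and return the number of bytes consumed."""
    total = 0
    buffered: List[bytes] = []
    for chunk in chunks:
        total += len(chunk)
        if chunked:
            hook(chunk)  # the hook sees each piece as it arrives
        else:
            buffered.append(chunk)
    if not chunked:
        hook(b"".join(buffered))  # the hook sees the complete value once
    return total


pieces: List[bytes] = []
consumed = parse_bytes([b"ab", b"cd", b"e"], pieces.append, chunked=True)
print(pieces, consumed)  # prints: [b'ab', b'cd', b'e'] 5
```

Both modes consume the same five bytes; only the number of hook invocations and the amount of buffering differ.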
|
|
Bytes fields support parsing constants: If a ``bytes`` constant is
|
|
specified instead of a field type, parsing will expect to find the
|
|
corresponding value in the input stream.
|
|
|
|
.. _parse_integer:
|
|
|
|
Integer
|
|
^^^^^^^
|
|
|
|
Fields of :ref:`integer type <type_integer>` can be either signed
|
|
(``intN``) or unsigned (``uintN``). In either case, the bit length
|
|
``N`` determines the number of bytes being parsed. By default,
|
|
integers are expected to come in network byte order. You can specify a
|
|
different order through the ``&byte-order=ORDER`` attribute, where
|
|
``ORDER`` is of type :ref:`spicy_ByteOrder`.
|
|
|
|
Integer fields support parsing constants: If an integer constant is
|
|
specified instead of a field type, parsing will expect to
|
|
find the corresponding value in the input stream. Since the exact type
|
|
of the integer constant is important, you should use their constructor
|
|
syntax to make that explicit (e.g., ``uint32(42)``, ``int8(-1)``; vs.
|
|
using just ``42`` or ``-1``).
|
|
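For readers who want to check byte-order and signedness by hand, the following Python sketch mirrors the behavior described above. ``parse_int`` and ``expect_const`` are hypothetical helpers written for this illustration; they return the remaining input so that calls can be chained.

```python
from typing import Tuple


def parse_int(data: bytes, bits: int, signed: bool, byteorder: str = "big") -> Tuple[int, bytes]:
    """Parse a bits-wide integer; "big" corresponds to network byte order."""
    size = bits // 8
    if len(data) < size:
        raise ValueError("not enough input")
    return int.from_bytes(data[:size], byteorder, signed=signed), data[size:]


def expect_const(data: bytes, expected: int, bits: int, signed: bool, byteorder: str = "big") -> bytes:
    """Model of a constant field: parsing fails unless the value matches."""
    value, rest = parse_int(data, bits, signed, byteorder)
    if value != expected:
        raise ValueError(f"expected {expected}, found {value}")
    return rest


print(parse_int(b"\x00\x00\x00\x2a", 32, signed=False))      # prints: (42, b'')
print(parse_int(b"\xff", 8, signed=True))                    # prints: (-1, b'')
print(parse_int(b"\x01\x02", 16, signed=False, byteorder="little"))  # prints: (513, b'')
```

The width matters for constants as well: matching ``42`` as a 32-bit value consumes four bytes, whereas an 8-bit ``42`` would consume only one.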
|
|
.. _parse_real:
|
|
|
|
Real
|
|
^^^^
|
|
|
|
Real values are parsed as either single or double precision values in
|
|
IEEE754 format, depending on the value of their ``&type=T`` attribute,
|
|
where ``T`` is one of :ref:`spicy_RealType`.
|
|
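As a point of comparison, Python's ``struct`` module decodes the same two IEEE 754 formats. The sketch below assumes network byte order, which this section does not state explicitly, and uses the made-up keys ``"single"`` and ``"double"`` in place of the actual ``&type`` values.

```python
import struct

# Hypothetical mapping from precision to struct format and size in bytes.
FORMATS = {"single": (">f", 4), "double": (">d", 8)}


def parse_real(data: bytes, kind: str) -> float:
    fmt, size = FORMATS[kind]
    (value,) = struct.unpack(fmt, data[:size])
    return value


print(parse_real(bytes.fromhex("3fc00000"), "single"))          # prints: 1.5
print(parse_real(bytes.fromhex("3ff8000000000000"), "double"))  # prints: 1.5
```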
|
|
.. _parse_regexp:
|
|
|
|
Regular Expression
|
|
^^^^^^^^^^^^^^^^^^
|
|
|
|
When parsing a field through a :ref:`type_regexp`, the expression is
|
|
expected to match at the current position of the input stream. The
|
|
field's type becomes ``bytes``, and it will store the matching data.
|
|
|
|
Inside hooks for fields with regular expressions, you can access
|
|
capture groups through ``$1``, ``$2``, ``$3``, etc. For example:
|
|
|
|
.. spicy-code::
|
|
|
|
x : /(a.c)(de*f)(h.j)/ {
|
|
print $1, $2, $3;
|
|
}
|
|
|
|
This will print out the relevant pieces of the data matching the
|
|
corresponding set of parentheses. (There's no ``$0``, just use ``$$``
|
|
as normal to get the full match.)
|
|
|
|
Matching a regular expression is more expensive if you need it to
|
|
capture groups. If you are using groups inside your expression but don't
|
|
need the actual captures, add ``&nosub`` to the field to remove that
|
|
overhead.
|
|
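The capture-group behavior is comparable to matching a bytes pattern at a fixed position with Python's ``re`` module. The sketch below is an analogy only: Python's regular expression engine is not the one Spicy uses, and the sample input is made up for the illustration.

```python
import re

pattern = re.compile(rb"(a.c)(de*f)(h.j)")
data = b"abcdeefhijXYZ"

# match() anchors at the start, similar to matching at the current input position.
m = pattern.match(data)
assert m is not None

full = m.group(0)      # whole match, comparable to $$
groups = m.groups()    # comparable to $1, $2, $3
rest = data[m.end():]  # input left for subsequent fields

print(full)    # prints: b'abcdeefhij'
print(groups)  # prints: (b'abc', b'deef', b'hij')
print(rest)    # prints: b'XYZ'
```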
|
|
.. _parse_unit:
|
|
|
|
Unit
|
|
^^^^
|
|
|
|
Fields can have the type of another unit, in which case parsing will
|
|
descend into that subunit's grammar until that instance has been fully
|
|
parsed. Field initialization and hooks work as usual.
|
|
|
|
If the subunit receives parameters, they must be given right after the
|
|
type.
|
|
|
|
.. spicy-code:: parse-unit-params.spicy
|
|
|
|
module Test;
|
|
|
|
type Bar = unit(a: string) {
|
|
x: uint8 { print "%s: %u" % (a, self.x); }
|
|
};
|
|
|
|
public type Foo = unit {
|
|
y: Bar("Spicy");
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-unit-params.spicy
|
|
:exec: printf '\01\02' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
See :ref:`unit_parameters` for more.
|
|
|
|
.. _parse_vector:
|
|
|
|
Vector
|
|
^^^^^^
|
|
|
|
Parsing a :ref:`vector <type_vector>` creates a loop that repeatedly
|
|
parses elements of the specified type from the input stream until an
|
|
end condition is reached. The field's value accumulates all the
|
|
elements into the final vector.
|
|
|
|
Spicy uses a specific syntax to define fields of type vector::
|
|
|
|
NAME : ELEM_TYPE[SIZE]
|
|
|
|
``NAME`` is the field name as usual. ``ELEM_TYPE`` is the type of the
|
|
vector's elements, i.e., the type that will be repeatedly parsed.
|
|
``SIZE`` is the number of elements to parse into the vector; this is
|
|
an arbitrary Spicy expression yielding an integer value. The resulting
|
|
field type then will be ``vector<ELEM_TYPE>``. Here's a simple example
|
|
parsing five ``uint8``:
|
|
|
|
.. spicy-code:: parse-vector.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: uint8[5];
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-vector.spicy
|
|
:exec: printf '\01\02\03\04\05' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
It is possible to skip the ``SIZE`` (e.g., ``x: uint8[]``) and instead
|
|
use another kind of end condition to terminate a vector's parsing
|
|
loop. To that end, vectors support the following attributes:
|
|
|
|
``&eod``
|
|
Parses elements until the end of the input stream is reached.
|
|
|
|
``&size=N``
|
|
Parses the vector from the subsequent ``N`` bytes of input data.
|
|
This effectively limits the available input to the corresponding
|
|
window, letting the vector parse elements until it runs out of
|
|
data. (This attribute :ref:`works for fields of all types
|
|
<attribute_size>`. We list it here because it's particularly
|
|
common to use it with vectors.)
|
|
|
|
``&until=EXPR``
|
|
Vector elements are parsed in a loop with ``EXPR`` being evaluated
|
|
as a boolean expression after each parsed element, and before
|
|
adding the element to the vector. Once ``EXPR`` evaluates to true,
|
|
parsing stops *without* adding the element that was just
|
|
parsed. Inside ``EXPR``, ``$$`` refers to the element most recently
|
|
parsed.
|
|
|
|
``&until-including=EXPR``
|
|
Similar to ``&until``, but does include the final element (the one for
which ``EXPR`` evaluated to true) in the field's vector when parsing
stops. Inside ``EXPR``, ``$$`` refers to the element most recently parsed.
|
|
|
|
``&while=EXPR``
|
|
Continues parsing as long as the boolean expression ``EXPR``
|
|
evaluates to true. Inside ``EXPR``, ``$$`` refers to the element
|
|
most recently parsed.
|
|
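The difference between ``&until`` and ``&until-including`` comes down to whether the terminating element is kept. The Python sketch below models both for a vector of ``uint8`` elements; ``parse_until`` is a hypothetical helper, ``stop`` stands in for ``EXPR`` with its argument playing the role of ``$$``, and running out of input is simply treated as an error here.

```python
from typing import Callable, List, Tuple


def parse_until(data: bytes, stop: Callable[[int], bool], including: bool = False) -> Tuple[List[int], bytes]:
    """Parse uint8 elements until stop(element) is true; return (vector, rest)."""
    elements: List[int] = []
    pos = 0
    while pos < len(data):
        element = data[pos]  # parse one uint8
        pos += 1
        if stop(element):    # evaluated before the element is added
            if including:
                elements.append(element)
            return elements, data[pos:]
        elements.append(element)
    raise ValueError("input ended before the end condition was met")


data = b"\x01\x02\x00\x03"
print(parse_until(data, lambda e: e == 0))
# prints: ([1, 2], b'\x03')
print(parse_until(data, lambda e: e == 0, including=True))
# prints: ([1, 2, 0], b'\x03')
```

In both cases the terminating element is consumed from the input; only its presence in the resulting vector differs.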
|
|
If neither a size nor an attribute is given, Spicy will attempt to use
|
|
:ref:`look-ahead parsing <parse_lookahead>` to determine the end of
|
|
the vector based on the next expected token. Depending on the unit's
|
|
fields, this may not be possible, in which case Spicy will decline to
|
|
compile the unit.
|
|
|
|
The syntax shown above generally works for all element types,
|
|
including subunits (e.g., ``x: MyUnit[]``).
|
|
|
|
.. note::
|
|
|
|
The ``x: (<T>)[]`` syntax is quite flexible. In fact, ``<T>`` is
|
|
not limited to subunits, but allows for any standard field
|
|
specification defining how to parse the vector elements. For
|
|
example, ``x: (bytes &size=5)[];`` parses a vector of 5-character
|
|
``bytes`` instances.
|
|
|
|
.. _hook_foreach:
|
|
|
|
When parsing a vector, Spicy supports using a special kind of field
|
|
hook, ``foreach``, that executes for each parsed element individually.
|
|
Inside that hook, ``$$`` refers to the element's final value:
|
|
|
|
.. spicy-code:: parse-vector-foreach.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: uint8[5] foreach { print $$, self.x; }
|
|
};
|
|
|
|
.. spicy-output:: parse-vector-foreach.spicy
|
|
:exec: printf '\01\02\03\04\05' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
As you can see, when a ``foreach`` hook executes the element has not yet
|
|
been added to the vector. You may indeed use a ``stop`` statement
|
|
inside a ``foreach`` hook to abort the vector's parsing without adding
|
|
the current element anymore. See :ref:`unit_hooks` for more on hooks.
|
|
|
|
.. _parse_void:
|
|
|
|
Void
|
|
^^^^
|
|
|
|
The :ref:`type_void` type can be used as a placeholder in fields not
|
|
meant to consume any data. This can be useful in some situations, such
|
|
as providing a branch in :ref:`switch <parse_switch>` constructs
|
|
that foregoes any parsing, or attaching a :ref:`&requires
|
|
<attribute_requires>` attribute to enforce a condition.
|
|
|
|
Fields of type ``void`` do not have any accessible value.
|
|
|
|
Controlling Parsing
|
|
===================
|
|
|
|
Spicy offers a few additional constructs inside a unit's declaration
|
|
for steering the parsing process. We discuss them in the following.
|
|
|
|
Conditional Parsing
|
|
^^^^^^^^^^^^^^^^^^^
|
|
|
|
A unit field may be conditionally skipped for parsing by adding an
|
|
``if ( COND )`` clause, where ``COND`` is a boolean expression. The
|
|
field will be only parsed if the expression evaluates to true at the
|
|
time the field is next in line.
|
|
|
|
.. spicy-code:: parse-if.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
a: int8;
|
|
b: int8 if ( self.a == 1 );
|
|
c: int8 if ( self.a % 2 == 0 );
|
|
d: int8;
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-if.spicy
|
|
:exec: printf '\01\02\03\04' | spicy-driver %INPUT; printf '\02\02\03\04' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
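The control flow of the ``parse-if.spicy`` example can be traced with a small Python model. This is a sketch for following the two sample inputs by hand, not generated parser code; ``parse_foo`` is a hypothetical function and unset fields are represented as ``None``.

```python
import struct


def parse_foo(data: bytes) -> dict:
    pos = 0
    result = {"a": None, "b": None, "c": None, "d": None}

    def next_int8() -> int:
        nonlocal pos
        (value,) = struct.unpack_from("b", data, pos)
        pos += 1
        return value

    result["a"] = next_int8()
    if result["a"] == 1:      # b: int8 if ( self.a == 1 )
        result["b"] = next_int8()
    if result["a"] % 2 == 0:  # c: int8 if ( self.a % 2 == 0 )
        result["c"] = next_int8()
    result["d"] = next_int8()
    return result


print(parse_foo(b"\x01\x02\x03\x04"))
# prints: {'a': 1, 'b': 2, 'c': None, 'd': 3}
print(parse_foo(b"\x02\x02\x03\x04"))
# prints: {'a': 2, 'b': None, 'c': 2, 'd': 3}
```

A skipped field consumes no input, so ``d`` receives the third byte in both runs.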
|
|
.. versionadded:: 1.12 Conditional blocks
|
|
|
|
If the same condition applies to multiple subsequent fields, they can
|
|
be grouped together into a single conditional block:
|
|
|
|
.. spicy-code:: parse-if-block.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
a: int8;
|
|
|
|
if ( self.a == 1 ) {
|
|
b: int8;
|
|
c: int8;
|
|
}; # note the trailing semicolon
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
|
|
The syntax supports an optional ``else``-block as well:
|
|
|
|
.. spicy-code:: parse-if-block-with-else.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
a: int8;
|
|
|
|
if ( self.a == 1 ) {
|
|
b: int8;
|
|
}
|
|
else {
|
|
c: int8;
|
|
}; # note the trailing semicolon
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
|
|
For repeated cases of conditional parsing where a single expression
|
|
evaluates to one of several values, unit :ref:`parse_switch`
|
|
statements might allow for more compact and easier to maintain code.
|
|
|
|
.. _parse_lookahead:
|
|
|
|
Look-Ahead
|
|
^^^^^^^^^^
|
|
|
|
Internally, Spicy builds an LR(1) grammar for each unit that it
|
|
parses, meaning that it can actually look *ahead* in the parsing
|
|
stream to determine how to process the current input location. Roughly
|
|
speaking, if (1) the current construct does not have a clear end
|
|
condition defined (such as a specific length), and (2) a specific value
|
|
is expected to be found next; then the parser will keep looking for
|
|
that value and end the current construct once it finds it.
|
|
|
|
"Construct" deliberately remains a bit of a fuzzy term here, but think
|
|
of vector parsing as the most common instance of this: If you don't
|
|
give a vector an explicit termination condition (as discussed in
|
|
:ref:`parse_vector`), Spicy will look at what's expected to come
|
|
*after* the container. As long as that's something clearly
|
|
recognizable (e.g., a specific value of an atomic type, or a match for
|
|
a regular expression), it'll terminate the vector accordingly.
|
|
|
|
Here's an example:
|
|
|
|
.. spicy-code:: parse-look-ahead.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
data: uint8[];
|
|
: /EOD/;
|
|
x : int8;
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-look-ahead.spicy
|
|
:exec: printf '\01\02\03EOD\04' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
For vectors, Spicy attempts look-ahead parsing automatically as a last
|
|
resort when it doesn't find more explicit instructions. However, it
|
|
will reject a unit if it can't find a suitable look-ahead symbol to
|
|
work with. If we had written ``int32`` in the example above, that
|
|
would not have worked as the parser can't recognize when there's an
|
|
``int32`` coming; it would need to be a concrete value, such as
|
|
``int32(42)``.
|
|
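To illustrate the idea, here is a hand-written Python model of the ``parse-look-ahead.spicy`` example. It is a deliberately naive sketch, not how Spicy implements look-ahead: ``parse_foo`` is a hypothetical function that checks before each element whether the literal ``EOD`` comes next.

```python
def parse_foo(data: bytes) -> dict:
    token = b"EOD"  # the look-ahead symbol following the vector
    pos = 0
    elements = []

    # Keep parsing uint8 elements until the token is next in the input.
    while not data.startswith(token, pos):
        if pos >= len(data):
            raise ValueError("look-ahead token never found")
        elements.append(data[pos])
        pos += 1

    pos += len(token)  # consume the anonymous /EOD/ field
    if pos >= len(data):
        raise ValueError("missing input for x")
    x = int.from_bytes(data[pos:pos + 1], "big", signed=True)
    return {"data": elements, "x": x}


print(parse_foo(b"\x01\x02\x03EOD\x04"))
# prints: {'data': [1, 2, 3], 'x': 4}
```

The model also shows the inherent limitation: if the element data itself contained the bytes ``EOD``, the vector would end early.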
|
|
See the :ref:`parse_switch` construct for another instance of
|
|
look-ahead parsing.
|
|
|
|
.. _parse_switch:
|
|
|
|
``switch``
|
|
^^^^^^^^^^
|
|
|
|
Spicy supports a ``switch`` construct as a way to branch into one
|
|
of several parsing alternatives. There are two variants of this, an
|
|
explicit branch and one driven by look-ahead:
|
|
|
|
.. rubric:: Branch by expression
|
|
|
|
The most basic form of switching by expression looks like this:
|
|
|
|
.. spicy-code::
|
|
|
|
switch ( EXPR ) {
|
|
VALUE_1 -> FIELD_1;
|
|
VALUE_2 -> FIELD_2;
|
|
...
|
|
VALUE_N -> FIELD_N;
|
|
};
|
|
|
|
This evaluates ``EXPR`` at the time parsing reaches the ``switch``. If
|
|
there's a ``VALUE`` matching the result, parsing continues with the
|
|
corresponding field, and then proceeds with whatever comes after the
|
|
switch. Example:
|
|
|
|
.. spicy-code:: parse-switch.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: bytes &size=1;
|
|
switch ( self.x ) {
|
|
b"A" -> a8: int8;
|
|
b"B" -> a16: int16;
|
|
b"C" -> a32: int32;
|
|
};
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-switch.spicy
|
|
:exec: printf 'A\01' | spicy-driver %INPUT; printf 'B\01\02' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
We see in the output that all of the alternatives turn into normal
|
|
unit members, with all but the one for the branch that was taken left
|
|
unset.
|
|
|
|
If none of the values match the expression, that's considered a
|
|
parsing error and processing will abort. Alternatively, one can add a
|
|
default alternative by using ``*`` as the value. The branch will then
|
|
be taken whenever no other value matches.
|
|
|
|
A few additional notes about the fields inside an alternative:
|
|
|
|
- In our example, the fields of the alternatives all have
|
|
different names, and they all show up in the output. One can
|
|
also reuse names across alternatives as long as the types
|
|
exactly match. In that case, the unit will end up with only a
|
|
single instance of that member.
|
|
|
|
- An alternative can match against more than one value by
|
|
separating them with commas (e.g., ``b"A", b"B" -> x: int8;``).
|
|
|
|
- Alternatives can have more than one field attached by enclosing
|
|
them in braces, i.e., ``VALUE -> { FIELD_1a; FIELD_1b; ...;
|
|
FIELD_1n; }``.
|
|
|
|
- Sometimes one really just needs the branching capability, but
|
|
doesn't have any field values to store. In that case an
|
|
anonymous ``void`` field may be helpful (e.g., ``b"A" -> : void
{ DoSomethingHere(); }``).
|
|
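Branching by expression behaves much like a table lookup. The following Python sketch models the ``parse-switch.spicy`` example above; the ``ALTERNATIVES`` table and ``parse_foo`` are names invented for this illustration, and there is no default (``*``) case, so an unknown tag is an error.

```python
import struct

# Tag value -> (field name, struct format in network byte order).
ALTERNATIVES = {
    b"A": ("a8", ">b"),
    b"B": ("a16", ">h"),
    b"C": ("a32", ">i"),
}


def parse_foo(data: bytes) -> dict:
    x = data[:1]
    # All alternatives become members; only the taken branch gets a value.
    result = {"x": x, "a8": None, "a16": None, "a32": None}
    if x not in ALTERNATIVES:
        raise ValueError("no matching case in switch")
    name, fmt = ALTERNATIVES[x]
    (result[name],) = struct.unpack_from(fmt, data, 1)
    return result


print(parse_foo(b"A\x01"))
# prints: {'x': b'A', 'a8': 1, 'a16': None, 'a32': None}
print(parse_foo(b"B\x01\x02"))
# prints: {'x': b'B', 'a8': None, 'a16': 258, 'a32': None}
```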
|
|
.. rubric:: Branch by look-ahead
|
|
|
|
``switch`` also works without any expression as long as the presence
|
|
of all the alternatives can be reliably recognized by looking ahead in
|
|
the input stream:
|
|
|
|
.. spicy-code:: parse-switch-lhead.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
switch {
|
|
-> a: b"A";
|
|
-> b: b"B";
|
|
-> c: b"C";
|
|
};
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-switch-lhead.spicy
|
|
:exec: printf 'A' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
While this example is a bit contrived, the mechanism becomes powerful
|
|
once you have subunits that are recognizable by how they start:
|
|
|
|
.. spicy-code:: parse-switch-lhead-2.spicy
|
|
|
|
module Test;
|
|
|
|
type A = unit {
|
|
a: b"A";
|
|
};
|
|
|
|
type B = unit {
|
|
b: uint16(0xffff);
|
|
};
|
|
|
|
public type Foo = unit {
|
|
switch {
|
|
-> a: A;
|
|
-> b: B;
|
|
};
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-switch-lhead-2.spicy
|
|
:exec: printf 'A ' | spicy-driver %INPUT; printf '\377\377' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. rubric:: Switching Over Fields With Common Size
|
|
|
|
You can limit the input any field in a unit switch receives by attaching an
|
|
optional ``&size=EXPR`` attribute that specifies the number of raw bytes to
|
|
make available. This is analogous to the :ref:`field size attribute <attribute_size>`
|
|
and especially useful to remove duplication when each case is subject to the
|
|
same constraint.
|
|
|
|
.. spicy-code:: parse-switch-size.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
tag: uint8;
|
|
switch ( self.tag ) {
|
|
1 -> b1: bytes &eod;
|
|
2 -> b2: bytes &eod &convert=$$.lower();
|
|
} &size=3;
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-switch-size.spicy
|
|
:exec: printf '\01ABC' | spicy-driver %INPUT; printf '\02ABC' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
.. _backtracking:
|
|
|
|
Backtracking
|
|
^^^^^^^^^^^^
|
|
|
|
Spicy supports a simple form of manual backtracking. If a field is
|
|
marked with ``&try``, a later call to the unit's ``backtrack()``
|
|
method anywhere down in the parse tree originating at that field will
|
|
immediately transfer control over to the field following the ``&try``.
|
|
When doing so, the data position inside the input stream will be reset
|
|
to where it was when the ``&try`` field started its processing. Units
|
|
along the original path will be left in whatever state they were at
|
|
the time ``backtrack()`` executed (i.e., they will probably remain
|
|
just partially initialized). When ``backtrack()`` is called on a path
|
|
that involves multiple ``&try`` fields, control continues after the
|
|
most recent.
|
|
|
|
Example:
|
|
|
|
.. spicy-code:: parse-backtrack.spicy
|
|
|
|
module Test;
|
|
|
|
public type test = unit {
|
|
foo: Foo &try;
|
|
bar: Bar;
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
type Foo = unit {
|
|
a: int8 {
|
|
if ( $$ != 1 )
|
|
self.backtrack();
|
|
}
|
|
b: int8;
|
|
};
|
|
|
|
type Bar = unit {
|
|
a: int8;
|
|
b: int8;
|
|
};
|
|
|
|
|
|
.. spicy-output:: parse-backtrack.spicy
|
|
:exec: printf '\001\002\003\004' | spicy-driver %INPUT; printf '\003\004' | spicy-driver %INPUT
|
|
:show-with: backtrack.spicy
|
|
|
|
``backtrack()`` can be called from inside :ref:`%error hooks
|
|
<on_error>`, so this provides a simple form of error recovery
|
|
as well.
|
|
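The control transfer performed by ``backtrack()`` resembles raising an exception that is caught at the ``&try`` field. The Python sketch below models the ``parse-backtrack.spicy`` example under that analogy. All names are hypothetical, and as a simplification the abandoned ``foo`` is reported as ``None``, whereas Spicy leaves the unit in whatever partial state it had reached.

```python
class Backtrack(Exception):
    """Stands in for a call to backtrack()."""


def parse_int8(data: bytes, pos: int):
    if pos >= len(data):
        raise ValueError("not enough input")
    return int.from_bytes(data[pos:pos + 1], "big", signed=True), pos + 1


def parse_foo_unit(data: bytes, pos: int):
    a, pos = parse_int8(data, pos)
    if a != 1:
        raise Backtrack()
    b, pos = parse_int8(data, pos)
    return {"a": a, "b": b}, pos


def parse_bar_unit(data: bytes, pos: int):
    a, pos = parse_int8(data, pos)
    b, pos = parse_int8(data, pos)
    return {"a": a, "b": b}, pos


def parse_test(data: bytes) -> dict:
    pos = 0
    start = pos  # position at which the &try field begins
    try:
        foo, pos = parse_foo_unit(data, pos)
    except Backtrack:
        foo, pos = None, start  # reset the input and continue with the next field
    bar, pos = parse_bar_unit(data, pos)
    return {"foo": foo, "bar": bar}


print(parse_test(b"\x01\x02\x03\x04"))
# prints: {'foo': {'a': 1, 'b': 2}, 'bar': {'a': 3, 'b': 4}}
print(parse_test(b"\x03\x04"))
# prints: {'foo': None, 'bar': {'a': 3, 'b': 4}}
```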
|
|
.. note::
|
|
|
|
This mechanism is preliminary and will probably see refinement
|
|
over time, both in terms of more automated backtracking and by
|
|
providing better control where to continue after backtracking.
|
|
|
|
Changing Input
|
|
==============
|
|
|
|
By default, a Spicy parser proceeds linearly through its inputs,
|
|
parsing as much as it can and yielding back to the host application
|
|
once it runs out of input. There are two ways to change this linear
|
|
model: diverting parsing to a different input, and random access
|
|
within the current unit's data.
|
|
|
|
.. rubric:: Parsing custom data
|
|
|
|
A unit field can have either ``&parse-from=EXPR`` or
|
|
``&parse-at=EXPR`` attached to it to change where it's receiving its
|
|
data to parse from. ``EXPR`` is evaluated at the time the field is
|
|
reached. For ``&parse-from`` it must produce a value of type
|
|
``bytes``, which will then constitute the input for the field. This
|
|
can, e.g., be used to reparse previously received input:
|
|
|
|
.. spicy-code:: parse-parse.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
x: bytes &size=2;
|
|
y: uint16 &parse-from=self.x;
|
|
z: bytes &size=2;
|
|
|
|
on %done { print self; }
|
|
};
|
|
|
|
.. spicy-output:: parse-parse.spicy
|
|
:exec: printf '\x01\x02\x03\04' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
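A way to picture ``&parse-from`` is that the field reads from a separate buffer while the unit's own input position stays where it was. The Python sketch below models the example above under that assumption; ``parse_foo`` is a hypothetical function.

```python
def parse_foo(data: bytes) -> dict:
    if len(data) < 4:
        raise ValueError("not enough input")
    x = data[0:2]                # x: bytes &size=2
    y = int.from_bytes(x, "big") # y: uint16 parsed from x, not from the main input
    z = data[2:4]                # z continues in the main input right after x
    return {"x": x, "y": y, "z": z}


print(parse_foo(b"\x01\x02\x03\x04"))
# prints: {'x': b'\x01\x02', 'y': 258, 'z': b'\x03\x04'}
```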
|
|
For ``&parse-at``, ``EXPR`` must yield an iterator pointing to (a
|
|
still valid) position of the current unit's input stream (such as
|
|
retrieved through :spicy:method:`unit::input`). The field will then be
|
|
parsed from the data starting at that location.
|
|
|
|
.. _random_access:
|
|
|
|
.. rubric:: Random access
|
|
|
|
While a unit is being parsed, you may revert the current input
|
|
position backwards to any location between the first byte the unit has
|
|
seen and the current position. You can use a set of built-in unit methods to
|
|
control the current position:
|
|
|
|
:spicy:method:`unit::input`
|
|
Returns a stream iterator pointing to the current input position.
|
|
|
|
:spicy:method:`unit::set_input`
|
|
Sets the current input position to the location of the specified
|
|
stream iterator. Per above, the new position needs to reside
|
|
between the beginning of the current unit's data and the current
|
|
position; otherwise an exception will be generated at runtime.
|
|
|
|
:spicy:method:`unit::offset`
|
|
Returns the numerical offset of the current input position
|
|
relative to position of the first byte fed into this unit.
|
|
|
|
:spicy:method:`unit::position`
|
|
Returns an iterator to the current input position in the stream fed
|
|
into this unit.
|
|
|
|
You can achieve random access by saving an iterator from ``input()``
|
|
in a unit variable, then later return to that position (or one derived
|
|
from it) by calling ``set_input()`` with that variable. Here's an
|
|
example that parses input data twice with different subunits:
|
|
|
|
.. spicy-code:: parse-random-access.spicy
|
|
|
|
module Test;
|
|
|
|
public type Foo = unit {
|
|
on %init() { self.start = self.input(); }
|
|
|
|
a: A { self.set_input(self.start); }
|
|
b: B;
|
|
|
|
on %done() { print self; }
|
|
|
|
var start: iterator<stream>;
|
|
};
|
|
|
|
type A = unit {
|
|
x: uint32;
|
|
};
|
|
|
|
type B = unit {
|
|
y: bytes &size=4;
|
|
};
|
|
|
|
|
|
.. spicy-output:: parse-random-access.spicy
|
|
:exec: printf '\00\00\00\01' | spicy-driver %INPUT
|
|
:show-with: foo.spicy
|
|
|
|
If you look at the output, you see that the ``start`` iterator remembers its
|
|
offset, relative to the global input stream. It would also show the
|
|
data at that offset if the parser had not already discarded that at
|
|
the time we print it out.
|
|
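The save-and-restore pattern used in this example has a close analog in ordinary file-like objects. The Python sketch below models it with ``io.BytesIO``, where ``tell()`` and ``seek()`` loosely play the roles of ``input()`` and ``set_input()``; ``parse_foo`` is a hypothetical function and the whole input is assumed to be buffered.

```python
import io


def parse_foo(data: bytes) -> dict:
    stream = io.BytesIO(data)
    start = stream.tell()                      # remember where the unit began
    x = int.from_bytes(stream.read(4), "big")  # subunit A: uint32
    stream.seek(start)                         # rewind to the saved position
    y = stream.read(4)                         # subunit B: bytes &size=4
    return {"a": {"x": x}, "b": {"y": y}}


print(parse_foo(b"\x00\x00\x00\x01"))
# prints: {'a': {'x': 1}, 'b': {'y': b'\x00\x00\x00\x01'}}
```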
|
|
.. note::
|
|
|
|
Spicy parsers discard input data as quickly as possible as parsing
|
|
moves through the input stream. Indeed, that's why using random
|
|
access may come with a performance penalty as the parser now needs
|
|
to buffer all of the unit's data until it has been fully processed.
|
|
|
|
.. _filters:

Filters
=======

Spicy supports attaching *filters* to units that get to preprocess and
transform a unit's input before its parser gets to see it. A typical
use case for this is stripping off a data encoding, such as
compression or Base64.

A filter is itself just a ``unit`` that comes with an additional property
``%filter`` marking it as such. The filter unit's input represents the
original input to be transformed. The filter calls an internally
provided unit method :spicy:method:`unit::forward` to pass any
transformed data on to the main unit that it's attached to. The filter
can call ``forward`` arbitrarily many times, each time forwarding a
subsequent chunk of input. To attach a filter to a unit, one calls the
method :spicy:method:`unit::connect_filter` with an instance of the
filter's type. Putting that all together, this is an example of a simple
filter that upper-cases all input before the main parsing unit gets
to see it:

.. spicy-code:: parse-filter.spicy

    module Test;

    type Filter = unit {
        %filter;

        : bytes &eod &chunked {
            self.forward($$.upper());
        }
    };

    public type Foo = unit {
        on %init { self.connect_filter(new Filter); }
        x: bytes &size=5 { print self.x; }
    };

.. spicy-output:: parse-filter.spicy
    :exec: printf 'aBcDe' | spicy-driver %INPUT
    :show-with: foo.spicy

There are a couple of predefined filters coming with Spicy that become
available by importing the ``filter`` library module:

``filter::Zlib``
    Provides zlib decompression.

``filter::Base64Decode``
    Provides base64 decoding.

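Attaching one of the predefined filters works the same way as attaching
a custom one. The following sketch is an illustration only (the unit
``Foo`` and its field ``x`` are hypothetical); it decodes Base64 input
before the unit's own field gets to see the data:

.. code-block:: spicy

    module Test;

    import filter;

    public type Foo = unit {
        # Decode the raw input before any field of this unit parses it.
        on %init { self.connect_filter(new filter::Base64Decode); }

        # Receives the decoded bytes, not the Base64 text.
        x: bytes &eod { print self.x; }
    };
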
.. _sinks:

Sinks
=====

Sinks provide a powerful mechanism to chain multiple units together
into a layered stack, each processing the output of its predecessor. A
sink is the connector here that links two unit instances: one side
writing and one side reading, like a Unix pipe. As additional
functionality, the sink can internally reassemble data chunks that are
arriving out of order before passing anything on.

Here's a basic example of two unit types chained through a sink:

.. spicy-code:: parse-sink.spicy

    module Test;

    public type A = unit {
        on %init { self.b.connect(new B); }

        length: uint8;
        data: bytes &size=self.length { self.b.write($$); }

        on %done { print "A", self; }

        sink b;
    };

    public type B = unit {
        : /GET /;
        path: /[^\n]+/;

        on %done { print "B", self; }
    };

.. spicy-output:: parse-sink.spicy
    :exec: printf '\13GET /a/b/c\n' | spicy-driver -p Test::A %INPUT
    :show-with: foo.spicy

Let's see what's going on here. First, there's ``sink b`` inside the
declaration of ``A``. That's the connector, kept as state inside
``A``. When parsing for ``A`` is about to begin, the ``%init`` hook
connects the sink to a :ref:`new instance <operator_new>` of ``B``; that'll be the receiver
for data that ``A`` is going to write into the sink. That writing
happens inside the field hook for ``data``: once we have parsed that
field, we write its value to the sink using the sink's built-in
:spicy:method:`sink::write` method. With that write operation, the
data will emerge as input for the instance of ``B`` that we created
earlier, and that will just proceed parsing it normally. As the output
shows, in the end both unit instances end up having their fields set.

As an alternative to calling :spicy:method:`sink::write` in the
example, there's some syntactic sugar for fields of type ``bytes``
(like ``data`` here): We can just replace the hook with a ``->``
operator to have the parsed data automatically be forwarded to the
sink: ``data: bytes &size=self.length -> self.b``.

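For reference, this is how unit ``A`` from the example looks when it is
rewritten to use the ``->`` operator. It is a sketch derived directly
from the code above, with ``B`` left unchanged:

.. code-block:: spicy

    module Test;

    public type A = unit {
        on %init { self.b.connect(new B); }

        length: uint8;

        # Forward the parsed bytes to the sink; no explicit write() needed.
        data: bytes &size=self.length -> self.b;

        on %done { print "A", self; }

        sink b;
    };

    public type B = unit {
        : /GET /;
        path: /[^\n]+/;

        on %done { print "B", self; }
    };
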
Sinks have a number of further methods; see :ref:`type_sink` for the
complete reference. We will encounter most of them in the following
when discussing additional functionality that sinks provide.

.. note::

    Because sinks are meant to decouple processing between two units, a
    unit connected to a sink will *not* pass any parse errors back up
    to the sink's parent. If you want to catch them, install an
    :ref:`%error <on_error>` hook inside the connected unit.

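As a sketch of what the note suggests, the connected unit ``B`` from the
example above could report its own parse errors like this. The message
text is made up for illustration:

.. code-block:: spicy

    module Test;

    public type B = unit {
        : /GET /;
        path: /[^\n]+/;

        # Runs if B itself fails to parse; the error is not passed up to A.
        on %error(msg: string) { print "B failed to parse: %s" % msg; }

        on %done { print "B", self; }
    };
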
Using Filters
^^^^^^^^^^^^^

Sinks also support :ref:`filters <filters>` to preprocess any data
they receive before forwarding it on. This works just like for units
by calling the built-in sink method
:spicy:method:`sink::connect_filter`. For example, if ``data`` in the
example above had been gzip-compressed, we could have
instructed the sink to automatically decompress it by calling
``self.b.connect_filter(new filter::Zlib)`` (leveraging the
Spicy-provided ``Zlib`` filter).

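Put together with the earlier sink example, that could look as follows.
This is a sketch under the assumption that the ``data`` field carries
compressed content; it attaches the filter in ``%init`` so that it is in
place before anything is written to the sink:

.. code-block:: spicy

    module Test;

    import filter;

    public type A = unit {
        on %init {
            # Attach the filter before any data reaches the sink.
            self.b.connect_filter(new filter::Zlib);
            self.b.connect(new B);
        }

        length: uint8;
        data: bytes &size=self.length -> self.b;

        sink b;
    };

    public type B = unit {
        # B sees the decompressed data.
        : /GET /;
        path: /[^\n]+/;

        on %done { print "B", self; }
    };
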
Leveraging MIME Types
^^^^^^^^^^^^^^^^^^^^^

In our example above we knew which type of unit we wanted to connect.
In practice, that may or may not be the case. Often, it only becomes
clear at runtime what the choice for the next layer should be, such as
when using well-known ports to determine the appropriate
application-layer analyzer for a TCP stream. Spicy supports dynamic
selection through a generalized notion of MIME types: Units can
declare which MIME types they know how to parse (see
:ref:`unit_meta_data`), and sinks have a
:spicy:method:`sink::connect_mime_type` method that will instantiate and
connect any units that match its argument (if multiple units match, all
will be connected and all will receive the same data).

"MIME type" can mean actual MIME types, such as ``text/html``.
Applications can, however, also define their own notion of
``<type>/<subtype>`` to model other semantics. For example, one could
use ``x-port/443`` as a convention to trigger parsers by well-known
port. An SSL unit would then declare ``%mime-type = "x-port/443"``, and
the connection would be established through the equivalent of
``connect_mime_type("x-port/%d" % resp_port_of_connection)``.

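The following sketch illustrates that convention with a fixed port. The
unit names ``Outer`` and ``Payload`` are hypothetical, and the example
assumes that a unit must be ``public`` to be found through its MIME type:

.. code-block:: spicy

    module Test;

    # Declares that it handles the application-defined type "x-port/443".
    public type Payload = unit {
        %mime-type = "x-port/443";

        data: bytes &eod;

        on %done { print "x-port/443 parser received", self.data; }
    };

    public type Outer = unit {
        # Instantiate and connect every unit registered for this type.
        on %init { self.next.connect_mime_type("x-port/443"); }

        payload: bytes &eod -> self.next;

        sink next;
    };
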
.. todo::

    For this specific example, there's a better solution: We also have
    the ``%port`` property and should just build up a table index on
    that.

Reassembly
^^^^^^^^^^

Reassembly (or defragmentation) of out-of-order data chunks is a common
requirement for many protocols. Sinks have that functionality built in:
they allow you to associate a position inside a virtual sequence space
with each chunk of data. Sinks will then pass their data on to connected
units only once they have collected a contiguous, in-order range of
bytes.

The easiest way to leverage this is to simply associate sequence
numbers with each :spicy:method:`sink::write` operation:

.. spicy-code:: parse-reassembly.spicy

    module Test;

    public type Foo = unit {

        sink data;

        on %init {
            self.data.connect(new Bar);
            self.data.write(b"567", 5);
            self.data.write(b"89", 8);
            self.data.write(b"012", 0);
            self.data.write(b"34", 3);
        }
    };

    public type Bar = unit {
        s: bytes &eod;
        on %done { print self.s; }
    };

.. spicy-output:: parse-reassembly.spicy
    :exec: spicy-driver -p Test::Foo %INPUT </dev/null
    :show-with: foo.spicy


By default, Spicy expects the sequence space to start at zero, so the
first byte of the input stream needs to be passed in with sequence
number zero. You can change that base number by calling the
sink method :spicy:method:`sink::set_initial_sequence_number`. You can
also control Spicy's gap handling, including when to stop buffering data
because you know that nothing further will arrive. Spicy can also
notify you about unsuccessful reassembly through a series of built-in unit hooks.
See :ref:`type_sink` for a reference of the available functionality.


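As a sketch of a non-zero base, the following variant of the example
declares that the sequence space starts at 100 (an arbitrary value
picked for illustration) and then writes its chunks relative to that:

.. code-block:: spicy

    module Test;

    public type Foo = unit {

        sink data;

        on %init {
            self.data.connect(new Bar);

            # The first byte of the stream now carries sequence number 100.
            self.data.set_initial_sequence_number(100);

            self.data.write(b"34", 103);
            self.data.write(b"012", 100);
        }
    };

    public type Bar = unit {
        s: bytes &eod;
        on %done { print self.s; }
    };
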
.. _unit_context:

Contexts
========

Parsing may need to retain state beyond any specific unit's lifetime.
For example, a UDP protocol may want to remember information across
individual packets (and hence units), or a bi-directional protocol may
need to correlate the request side with the response side. One option
for implementing this in Spicy is managing such state manually in
:ref:`global variables <variables>`, for example by maintaining a
global map that ties a unique connection ID to the information that
needs to be retained. However, doing so is cumbersome and
error-prone. As an alternative, a unit can make use of a dedicated
*context* value, which is an instance of a custom type that has its
lifetime determined by the host application running the parser. For
example, Zeek will tie the context to the underlying connection.

Any public unit can declare a context through a unit-level property
called ``%context``, which takes an arbitrary type as its argument.
For example:

.. spicy-code::

    public type Foo = unit {
        %context = bytes;
        [...]
    };

When used as a top-level entry point to parsing, the unit will then,
by default, receive a unique context value of that type. That context
value can be accessed through the :spicy:method:`unit::context`
method, which will return a :ref:`reference <type_reference>` to it:

.. spicy-code:: context-empty.spicy

    module Test;

    public type Foo = unit {
        %context = int64;

        on %init { print self.context(); }
    };

.. spicy-output:: context-empty.spicy
    :exec: spicy-driver %INPUT </dev/null
    :show-with: foo.spicy

By itself, this is not very useful. However, host applications can
control how contexts are maintained, and they may assign the same
context value to multiple units. For example, when parsing a protocol,
the :zeek:`Zeek integration <devel/spicy/index.html>` always creates a single context
value shared by all top-level units belonging to the same connection,
enabling parsers to maintain bi-directional, per-connection state.
The batch mode of :ref:`spicy-driver <spicy-driver>` does the same.

.. note::

    A unit's context value gets set only when a host application uses
    it as the top-level starting point for parsing. If in the above
    example ``Foo`` wasn't the entry point, but was used inside another
    unit further down during the parsing process, its context would
    remain unset.

As an example, the following grammar, mimicking a request/reply-style
protocol, maintains a queue of outstanding textual commands to then
associate numerical result codes with them as the responses come in:

.. spicy-code:: context-pipelining.spicy

    module Test;

    # We wrap the state into a tuple to make it easy to add more attributes if needed later.
    type Pending = tuple<pending: vector<bytes>>;

    public type Requests = unit {
        %context = Pending;

        : Request[] foreach { self.context().pending.push_back($$.cmd); }
    };

    public type Replies = unit {
        %context = Pending;

        : Reply[] foreach {
            if ( |self.context().pending| ) {
                print "%s -> %s" % (self.context().pending.back(), $$.response);
                self.context().pending.pop_back();
            }
            else
                print "<missing request> -> %s" % $$.response;
        }
    };

    type Request = unit {
        cmd: /[A-Za-z]+/;
        : b"\n";
    };

    type Reply = unit {
        response: /[0-9]+/;
        : b"\n";
    };

.. spicy-output:: context-pipelining.spicy
    :exec: spicy-driver -F programming/examples/context-input.dat %INPUT
    :show-as: spicy-driver -F input.dat context.spicy

The output is produced from :download:`this input batch file
<examples/context-input.dat>`. This would work the same when used with
Zeek on a corresponding packet trace.

Note that the units for the two sides of the connection need to
declare the same ``%context`` type. Processing will abort at
runtime with a type mismatch error if that's not the case.

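A context does not have to be a tuple. The following sketch, which is
not part of the original example, keeps a simple counter in a
hypothetical ``State`` struct instead. It assumes that struct fields
accept ``&default`` and that fields can be assigned through the
reference returned by ``context()``:

.. code-block:: spicy

    module Test;

    # Hypothetical per-connection state with one counter.
    type State = struct {
        num_requests: uint64 &default=0;
    };

    public type Requests = unit {
        %context = State;

        # Count each parsed request in the shared context.
        : Request[] foreach {
            self.context().num_requests = self.context().num_requests + 1;
        }

        on %done { print "requests seen: %d" % self.context().num_requests; }
    };

    type Request = unit {
        cmd: /[A-Za-z]+/;
        : b"\n";
    };
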
.. _error_handling:

Error Handling
==============

Whenever a parser encounters an unexpected situation during
processing, it triggers a runtime error. This includes parsing errors
due to input that does not match the current unit, failing
:ref:`&requires <attribute_requires>` conditions, and also any logic
errors in hooks, such as attempting to read an unset unit field or
accessing an invalid vector index.

By default, any runtime error will cause the parsing to terminate
immediately, with a corresponding error message reported back to the
host application. The Spicy parser will not be able to continue
processing afterwards. However, there are a couple of ways to catch
*parsing errors* (but not other runtime errors) and potentially
recover from them, which we discuss in the following.

.. _parsing_errors:

A unit can provide special :ref:`%error hooks <unit_hooks>` that will
execute when a parsing error is encountered. A unit-wide ``%error``
hook will catch all parsing errors occurring anywhere inside the unit,
including any sub-units (if not otherwise handled by the sub-unit
itself already). Example:

.. code-block:: spicy

    module MyModule;

    type MyType = unit {
        magic: b"MAGIC";

        on %error(msg: string) {
            print "Error when parsing MyType: ", msg;
        }
    };

The ``msg`` parameter is optional. If it's specified, it will contain
an error message describing the issue.

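As a further sketch, the following hypothetical unit shows two of the
error sources mentioned above, a mismatching literal and a failing
``&requires`` condition, both handled by a single ``%error`` hook that
omits the ``msg`` parameter:

.. code-block:: spicy

    module Test;

    public type Header = unit {
        # Input that doesn't start with these bytes raises a parse error.
        magic: b"MAGIC";

        # So does a value that fails the &requires condition.
        version: uint8 &requires=($$ <= 2);

        # No parameter: we only need to know that parsing failed.
        on %error { print "malformed header"; }
    };
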
By default, even with an ``%error`` hook in place, the parser will
still terminate after executing the hook. To change that, the hook may
use :ref:`backtracking` to specify where to continue parsing after the
error. Alternatively, if :ref:`automatic error recovery
<error_recovery>` is in place, the parser will attempt recovery after
the error hooks have executed.

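The following is a minimal sketch of the backtracking variant. The unit
names are hypothetical, and it assumes the semantics described in
:ref:`backtracking`: a field marked ``&try`` records a position, and a
later call to ``self.backtrack()`` returns to that position and
continues with the field following it.

.. code-block:: spicy

    module Test;

    public type Message = unit {
        # &try marks the position to return to if parsing below backtracks.
        hdr: Header &try;

        # After backtracking, parsing continues here, starting again at
        # the input position where hdr began.
        rest: bytes &eod;

        on %done { print self; }
    };

    type Header = unit {
        magic: b"MAGIC";
        version: uint8;

        # On a parse error, give up on this unit instead of aborting.
        on %error { self.backtrack(); }
    };
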
.. versionadded:: 1.12 Per-field ``%error`` handler

Rather than defining a unit-wide ``%error`` hook, it is also possible
to just have an individual field catch its own parsing errors. The
easiest way to do this is to attach an ``%error`` attribute to an
inline hook:

.. code-block:: spicy

    module My;

    type MyType = unit {
        magic: b"MAGIC" %error { # will run if magic cannot be parsed
            print "magic not found";
        }
    };

To get access to the error message as well, define it out of line like this:

.. code-block:: spicy

    module MyUnit;

    type MyType = unit {
        magic: b"MAGIC";

        on magic(msg: string) %error {
            print "Error when parsing magic: ", msg;
        }
    };


.. _error_recovery:

Error Recovery
==============

Real-world input does not always look like what parsers expect:
endpoints may not conform to the protocol's specification, a parser's
grammar might not fully cover all of the protocol, or some input may
be missing due to packet loss or stepping into the middle of a
conversation. By default, if a Spicy parser encounters such
situations, it will abort parsing altogether and issue an error
message. Alternatively, however, Spicy allows grammar writers to
specify heuristics to recover from errors. The main challenge here is
finding a spot in the subsequent input where parsing can reliably
resume.

Spicy employs a two-phase approach to such recovery: it first searches
for a possible point in the input stream where it seems promising to
attempt to resume parsing; and then it confirms that choice by trying
to parse a few fields at that location according to the grammar
to see if that's successful. We say that during the first part
of this process, the Spicy parser is in *synchronization mode*;
during the second, it is in *trial mode*.

.. rubric:: Phase 1: Synchronization

To identify locations where parsing can attempt to pick up again after
an error, a grammar can add ``&synchronize`` attributes to selected unit
fields, marking them as *synchronization points*. Whenever an error
occurs during parsing, Spicy will determine the closest
synchronization point in the grammar following the error's location,
and then attempt to continue processing there by skipping ahead in the
input data until it aligns with what that field is looking for.

A synchronization point may be any of the following:

- A field for which parsing begins with a constant literal (e.g., a specific
  sequence of bytes). To realign the input stream, the parser will search the
  input for the next occurrence of this literal, discarding any data in
  between. Example::

      type X = unit { ... };

      type Y = unit {
          a: b"begin-of-Y";
          b: bytes &size=10;
      };

      type Foo = unit {
          x: X;
          y: Y &synchronize;
      };

  If a parse error occurs during ``Foo::x``, Spicy will move ahead to ``Foo::y``,
  switch into synchronization mode, and start searching the input for the bytes
  ``begin-of-Y``. If found, it'll continue with parsing ``Foo::y`` at that location
  in trial mode (see below).

  .. note::

      Behind the scenes, synchronization through literals uses the same machinery
      as :ref:`look-ahead parsing <parse_lookahead>`, meaning that it works
      across sub-units, vector content, ``switch`` statements, etc. No matter how
      complex the field, as long as there's one or more literals that always
      *must* come first when parsing it, the field may be used as a
      synchronization point.

- A field whose type specifies :ref:`%synchronize-at <synchronize-at>`
  or :ref:`%synchronize-after <synchronize-after>`. The parser will search the
  input for the next occurrence of the given literal, discarding any data in
  between. If the search is successful, ``%synchronize-at`` will leave the
  input at the position of the search literal for later extraction, while
  ``%synchronize-after`` will discard the search literal.

  If either of these unit properties is specified, it will always overrule any
  other potential synchronization points in the unit. Example::

      type X = unit {
          ...
          : /END/;
      };

      type Y = unit {
          %synchronize-after = /END/;
          a: bytes &size=10;
      };

      type Foo = unit {
          x: X;
          y: Y &synchronize;
      };

- A field that's located inside the input stream at a fixed offset relative to
  the field triggering the error. The parser will then be able to skip ahead to
  that offset. Example::

      type X = unit { ... };
      type Y = unit { ... };

      type Foo = unit {
          x: X &size=512;
          y: Y &synchronize;
      };

  Here, when parsing ``Foo::x`` triggers an error, Spicy will know that it can
  continue with ``Foo::y`` at offset ``<beginning of Foo::x> + 512``.

  .. todo::

      This synchronization strategy is not yet implemented.

- When :ref:`parsing a vector <parse_vector>`, the inner elements may provide
  synchronization points as well. Example::

      type X = unit {
          a: b"begin-of-X";
          b: bytes &size=10;
      };

      type Foo = unit {
          xs: (X &synchronize)[];
      };

  If one element of the vector ``Foo::xs`` fails to parse, Spicy will attempt
  to find the beginning of the next ``X`` in the input stream and continue
  there. For this to work, the vector's elements must themselves represent
  valid synchronization points (e.g., start with a literal). If the vector is
  of fixed size, after successful synchronization, it will contain the expected
  number of entries, but some of them may remain (fully or partially)
  uninitialized if they encountered errors.

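The list above shows ``%synchronize-after`` in use. As a complementary
sketch, this is how the same grammar might look with
``%synchronize-at``. The marker ``BEGIN`` and the field names are made
up for illustration. Because the property leaves the literal in the
input, ``Y`` can still parse it as a regular field:

.. code-block:: spicy

    module Test;

    type X = unit {
        hdr: b"HDR";
    };

    type Y = unit {
        # Search for the literal, but leave it in the input.
        %synchronize-at = /BEGIN/;

        marker: /BEGIN/;
        b: bytes &size=10;
    };

    public type Foo = unit {
        x: X;
        y: Y &synchronize;
    };
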
.. rubric:: Phase 2: Trial parsing

Once input has been realigned with a synchronization point, parsing
switches from synchronization mode into trial mode, in which the
parser will attempt to confirm that it has indeed found a viable place
to continue. It does so by proceeding to parse subsequent input from
the synchronization point onwards, until one of the following occurs:

- A unit hook explicitly acknowledges that synchronization has been successful
  by executing Spicy's :ref:`statement_confirm` statement. Typically, a grammar
  will do so once it has been able to correctly parse a few fields following
  the synchronization point, that is, whatever it needs to be sufficiently
  certain that it's indeed seeing the expected structure.

- A unit hook explicitly declines the synchronization by executing Spicy's
  :ref:`statement_reject` statement. This will abandon the current
  synchronization attempt, and switch back into the original synchronization
  mode again to find another location to try.

- Parsing reaches the end of the grammar without either ``confirm`` or
  ``reject`` already called. In this case, the parser will abort with a fatal
  parse error.

Note that during trial mode, any fields between the synchronization point and
the eventual ``confirm``/``reject`` location will already be processed as
usual, including the execution of any hooks other than ``%error``. This may
leave the unit in a partially initialized state if trial parsing eventually
fails. Trial mode will also consume any input along the way, with any further
synchronization attempts proceeding only on subsequent, not yet seen, data.

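As a sketch of a conditional decision, the following hypothetical unit
does not confirm right at the synchronization point. Instead, it first
parses one more field and accepts the location only if that field's
value looks plausible. The threshold of 16 is an arbitrary choice for
illustration. Note that the hook also runs during regular parsing; see
:ref:`statement_confirm` and :ref:`statement_reject` for how the two
statements behave outside of trial mode.

.. code-block:: spicy

    module Test;

    public type Example = unit {
        start_a: /SEC_A/;
        a: uint8;

        start_b: /SEC_B/ &synchronize;

        # Accept the synchronization point only if the length is plausible.
        len: uint8 {
            if ( $$ <= 16 )
                confirm;
            else
                reject;
        }

        b: bytes &size=self.len;

        on %done { print self; }
    };
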
.. _error_recovery_hooks:

.. rubric:: Synchronization Hooks

For customization, Spicy provides a set of hooks executing at
different points during the synchronization process:

``on %synced { ...}``
    Executes when a synchronization point has been found and parsing
    resumes there, just before the parser begins processing the
    corresponding field in trial mode.

``on %confirmed { ...}``
    Executes when trial mode ends successfully with :ref:`statement_confirm`.

``on %rejected { ...}``
    Executes when trial mode fails with :ref:`statement_reject`.

``on %sync_advance(offset: uint64)``
    Executes regularly (see below) while the parser is searching for a
    synchronization point. The ``offset`` parameter gives the current
    position inside the input stream.

    This hook can be used to check if the parser is skipping too much
    data for the analysis to remain useful. For example, a protocol
    analyzer could decide to bail out if the input stream consists
    mainly of gaps, as reported by
    :spicy:method:`self.stream().statistics() <stream::statistics>`.

    By default, the hook executes every 4KB of input data skipped
    while searching for a synchronization point. It may not
    necessarily trigger immediately at the 4KB mark, but soon after
    when parsing gets a chance to check the input stream's position.

    You may change the trigger volume by defining a unit property
    ``%sync-advance-block-size = <VALUE>`` where ``<VALUE>`` is an
    alternative size value in bytes. As usual, this property can also
    be set at the module level to apply to all units.

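The following sketch combines the ``%sync_advance`` hook with a custom
block size. The value ``1024`` and the printed message are arbitrary
choices for illustration:

.. code-block:: spicy

    module Test;

    public type Example = unit {
        # Run %sync_advance about every 1024 skipped bytes.
        %sync-advance-block-size = 1024;

        start_a: /SEC_A/;
        a: uint8;
        start_b: /SEC_B/ &synchronize;
        b: bytes &eod;

        on %synced { confirm; }

        # Report progress while the parser is still searching.
        on %sync_advance(offset: uint64) {
            print "still searching for a synchronization point at offset %d" % offset;
        }
    };
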
.. rubric:: Example Synchronization Process

As an example, let's consider a grammar consisting of two sections
where each section is started with a section header literal (``SEC_A``
and ``SEC_B`` here).

We want to allow for inputs which miss parts or all of the first
section. For such inputs, we can still synchronize the input stream by
looking for the start of the second section. (For simplicity, we just
use a single unit, even though typically one would probably have
separate units for the two sections.)

.. spicy-code:: parse-synchronized.spicy

    module Test;

    public type Example = unit {
        start_a: /SEC_A/;
        a: uint8;

        # If we fail to find, e.g., 'SEC_A' in the input, try to synchronize on this literal.
        start_b: /SEC_B/ &synchronize;
        b: bytes &eod;

        # In this example, confirm unconditionally.
        on %synced {
            print "Synced: %s" % self;
            confirm;
        }

        # Log when synchronization is confirmed or rejected.
        on %confirmed { print "Confirmed: %s" % self; }
        on %rejected { print "Rejected: %s" % self; }

        on %done { print "Done %s" % self; }
    };

Suppose this parser encounters the input ``\xFFSEC_Babc``, which lacks
the ``SEC_A`` section marker:

- ``start_a`` missing,
- ``a=255``,
- ``start_b=SEC_B`` as expected, and
- ``b=abc``.

For such an input, parsing will encounter an initial error when it sees
``\xFF`` where ``SEC_A`` would have been expected.

1. Since ``start_b`` is marked as a synchronization point, the parser
   enters synchronization mode, and jumps over the field ``a`` to
   ``start_b``, to now search for ``SEC_B``.

2. At this point the input still contains the unexpected ``\xFF`` and
   remains ``\xFFSEC_Babc``. While searching for ``SEC_B``, ``\xFF``
   is skipped over, and then the expected token is found. The input
   is now ``SEC_Babc``.

3. The parser has successfully synchronized and enters trial mode. All
   ``%synced`` hooks are invoked.

4. The unit's ``%synced`` hook executes ``confirm`` and the parser
   leaves trial mode. All ``%confirmed`` hooks are invoked.

5. Regular parsing continues at ``start_b``. The input was ``SEC_Babc`` so
   ``start_b`` is set to ``SEC_B`` and ``b`` to ``abc``.

Since parsing for ``start_a`` was unsuccessful and ``a`` was jumped
over, their fields remain unset.

.. spicy-output:: parse-synchronized.spicy
    :exec: printf '\xFFSEC_Babc' | spicy-driver %INPUT
    :show-with: foo.spicy