.. _parsing:

=======
Parsing
=======

Basics
======

Type Declaration
^^^^^^^^^^^^^^^^

Spicy expresses units of data to parse through a type called,
appropriately, ``unit``. At a high level, a unit is similar to structs
or records in other languages: It defines an ordered set of fields,
each with a name and a type, that store corresponding values at
runtime. Units can be instantiated, fields can be assigned values, and
these values can then be retrieved. Here's about the most basic Spicy
unit one can define:

.. spicy-code::

    type Foo = unit {
        version: uint32;
    };

We name the type ``Foo``, and it has just one field called
``version``, which stores a 32-bit unsigned integer. Leaving parsing
aside for a moment, we can indeed use this type much like a typical
struct/record type:

.. spicy-code:: basic-unit-module.spicy

    module Test;

    type Foo = unit {
        version: uint32;
    };

    global f: Foo;
    f.version = 42;
    print f;

This will print:

.. spicy-output:: basic-unit-module.spicy
    :exec: spicyc -j %INPUT

Fields are initially unset, and attempting to read an unset field will
trigger a :ref:`runtime error `. You may, however, provide a default
value by adding a ``&default`` *attribute* to the field, in which case
that value will be returned on access if no value has been explicitly
assigned:

.. spicy-code:: basic-unit-module-with-default.spicy

    module Test;

    type Foo = unit {
        version: uint32 &default=42;
    };

    global f: Foo;
    print f;
    print "version is %s" % f.version;

This will print:

.. spicy-output:: basic-unit-module-with-default.spicy
    :exec: spicyc -j %INPUT

Note how the field remains unset even with the default now specified,
while the access returns the expected value.

Parsing a Field
^^^^^^^^^^^^^^^

We can turn this minimal unit type into a starting point for parsing
data---in this case a 32-bit integer from four bytes of raw input.
First, we need to declare the unit as ``public`` to make it accessible
from outside of the current module---a requirement if a host
application wants to use the unit as a parsing entry point.

.. spicy-code:: basic-unit-parse.spicy

    module Test;

    public type Foo = unit {
        version: uint32;

        on %done {
            print "0x%x" % self.version;
        }
    };

Let's use :ref:`spicy-driver` to parse 4 bytes of input through this
unit:

.. spicy-output:: basic-unit-parse.spicy
    :exec: printf '\01\02\03\04' | spicy-driver %INPUT
    :show-with: foo.spicy

The output comes, of course, from the ``print`` statement inside the
``%done`` hook, which executes once the unit has been fully parsed.
(We will discuss unit hooks further below.)

.. _attribute_order:

By default, Spicy assumes integers that it parses to be represented in
network byte order (i.e., big-endian), hence the output above.
Alternatively, we can tell the parser through an attribute that our
input is arriving in, say, little-endian instead. To do that, we
import the ``spicy`` library module, which provides an enum type
:ref:`spicy_byteorder` that we can give to a ``&byte-order`` field
attribute for fields that support it:

.. spicy-code:: basic-unit-parse-byte-order.spicy

    module Test;

    import spicy;

    public type Foo = unit {
        version: uint32 &byte-order=spicy::ByteOrder::Little;

        on %done {
            print "0x%x" % self.version;
        }
    };

.. spicy-output:: basic-unit-parse-byte-order.spicy
    :exec: printf '\01\02\03\04' | spicy-driver %INPUT
    :show-with: foo.spicy

We see that unpacking the value has now flipped the bytes before
storing it in the ``version`` field.

Similar to ``&byte-order``, Spicy offers a variety of further
attributes that control the specifics of how fields are parsed. We'll
discuss them in the relevant sections throughout the rest of this
chapter.

Non-type Fields
^^^^^^^^^^^^^^^

Unit fields always have a type. However, in some cases a field's type
is not explicitly declared, but derived from what's being parsed.
The main example of this is parsing a constant value: Instead of a
type, a field can specify a constant of a parseable type. The field's
type will then (usually) just correspond to the constant's type, and
parsing will expect to find the corresponding value in the input
stream. If a different value gets unpacked instead, parsing will abort
with an error. Example:

.. spicy-code:: constant-field.spicy

    module Test;

    public type Foo = unit {
        bar: b"bar";
        on %done { print self.bar; }
    };

.. spicy-output:: constant-field.spicy 1
    :exec: printf 'bar' | spicy-driver %INPUT
    :show-with: foo.spicy

.. spicy-output:: constant-field.spicy 2
    :exec: printf 'foo' | spicy-driver %INPUT
    :show-with: foo.spicy
    :expect-failure:

:ref:`Regular expressions ` extend this scheme a bit further: If a
field specifies a regular expression constant rather than a type, the
field will have type :ref:`type_bytes` and store the data that ends up
matching the regular expression:

.. spicy-code:: regexp.spicy

    module Test;

    public type Foo = unit {
        x: /Foo.*Bar/;
        on %done { print self; }
    };

.. spicy-output:: regexp.spicy
    :exec: printf 'Foo12345Bar' | spicy-driver %INPUT
    :show-with: foo.spicy

There's also a programmatic way to change a field's type to something
different from what's being parsed; see the
:ref:`&convert attribute `.

.. _attribute_size:

Parsing Fields With Known Size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can limit the input that a field receives by attaching a
``&size=EXPR`` attribute that specifies the number of raw bytes to
make available. This works on top of any other attributes that control
the field's parsing. From the field's perspective, such a size limit
acts just like reaching the end of the input stream at the specified
position. Example:

.. spicy-code:: size.spicy

    module Test;

    public type Foo = unit {
        x: int16[] &size=6;
        y: bytes &eod;
        on %done { print self; }
    };

.. spicy-output:: size.spicy
    :exec: printf '\000\001\000\002\000\003xyz' | spicy-driver %INPUT
    :show-with: foo.spicy

As you can see, ``x`` receives 6 bytes of input, which it then turns
into three 16-bit integers.

Normally, the field must consume all the bytes specified by ``&size``,
otherwise a parse error will be triggered. Some types support an
additional ``&eod`` attribute to lift this restriction; we discuss
that in the corresponding type's section where applicable.

After a field with a ``&size=EXPR`` attribute, parsing will always
move ahead the full number of bytes, even if the field did not consume
them all.

.. todo:: Parsing a regular expression would make a nice example for
   ``&size`` as well.

Defensively Limiting Input Size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On their own, parsers place no intrinsic upper limit on the size of
variable-size fields or units. This can have negative effects like
out-of-memory errors, e.g., when available memory is constrained, or
for malformed input.

As a defensive mechanism you can put an upper limit on the data a
field or unit receives by attaching a ``&max-size=EXPR`` attribute,
where ``EXPR`` is an unsigned integer specifying the maximum number of
raw bytes a field or unit should receive. If more than ``&max-size``
bytes are consumed during parsing, an error will be triggered. This
attribute works on top of any other attributes that control parsing.
Example:

.. spicy-code:: max-size.spicy

    module Test;

    public type Foo = unit {
        x: bytes &until=b"\x00" &max-size=1024;
        on %done { print self; }
    };

.. spicy-output:: max-size.spicy
    :exec: printf '\001\002\003\004\005\000' | spicy-driver %INPUT
    :show-with: foo.spicy

Here ``x`` will parse a ``NULL``-terminated byte sequence (excluding
the terminating ``NULL``), but never more than 1024 bytes.

``&max-size`` cannot be combined with ``&size``.

.. _anonymous_fields:

Anonymous Fields
^^^^^^^^^^^^^^^^

Field names are optional. If skipped, the field becomes an *anonymous*
field.
These still participate in parsing like any other field, but they
won't store any value, nor is there a way to get access to them from
outside. You can, however, still get to the field's final value inside
a corresponding field hook (see :ref:`unit_hooks`) using the reserved
``$$`` identifier (see :ref:`id_dollardollar`).

.. spicy-code:: anonymous-field.spicy

    module Test;

    public type Foo = unit {
        x: int8;
         : int8 { print $$; } # anonymous field
        y: int8;
        on %done { print self; }
    };

.. spicy-output:: anonymous-field.spicy
    :exec: printf '\01\02\03' | spicy-driver %INPUT
    :show-with: foo.spicy

Anonymous fields can often be more efficient to process because the
parser doesn't need to retain their values. In particular for larger
``bytes`` fields, making them anonymous is recommended where possible
(unless, even better, they can be fully skipped over; see
:ref:`skip`).

.. _skip:

Skipping Input
^^^^^^^^^^^^^^

For cases where your parser just needs to skip over some data without
needing access to its content, Spicy provides a ``skip`` keyword to
prefix corresponding fields with:

.. spicy-code:: skip.spicy

    module Test;

    public type Foo = unit {
        x: int8;
         : skip bytes &size=5;
        y: int8;
        on %done { print self; }
    };

.. spicy-output:: skip.spicy
    :exec: printf '\01\02\03\04\05\06\07' | spicy-driver %INPUT
    :show-with: foo.spicy

``skip`` works for all kinds of fields but is particularly efficient
for fields of known size, for which optimized code will be generated,
avoiding the overhead of storing any data.

``skip`` fields may have conditions and hooks attached, like any other
fields. However, they do not support ``$$`` in expressions and hooks.

Since ``skip`` allows the compiler to optimize the field's parsing
code---including completely eliding most of it---it remains undefined
if any side effects associated with the field will take effect. For
example, ``&requires`` attributes might be ignored, ``&convert``
expressions might not be evaluated, and hooks could end up not being
invoked.
For readability, a ``skip`` field may be named (e.g., ``padding: skip
bytes &size=3;``), but even with a name, its value cannot be accessed.

.. _id_dollardollar:
.. _id_self:

Reserved Identifiers
^^^^^^^^^^^^^^^^^^^^

Inside units, two reserved identifiers provide access to values
currently being parsed:

``self``
    Inside a unit's type definition, ``self`` refers to the unit
    instance that's currently being processed. The instance is
    writable and may be modified by assigning to any fields of
    ``self``.

``$$``
    Inside field attributes, ``$$`` refers to the value as it was
    parsed. Inside field hooks, ``$$`` refers to the final value
    *after* any conversions are applied (see
    :ref:`attribute_convert`). This applies even if the value is not
    going to be directly stored in the field. The value of ``$$`` is
    writable and may be modified.

.. note::

    ``$$`` has slightly different semantics in a field attribute and
    in a hook. In an attribute, ``$$`` refers to the parsed value
    *before* any conversions. In a hook, ``$$`` refers to the final
    value *after* any conversions.

.. _attribute_convert:

On-the-fly Type Conversion with &convert
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fields may use an attribute ``&convert=EXPR`` to transform the value
that was just parsed before storing it as the field's final value.
With the attribute being present, it's the value of ``EXPR`` that's
stored in the field, not the parsed value. Accordingly, the field's
type also changes to the type of ``EXPR``.

Typically, ``EXPR`` will use ``$$`` to access the parsed value and
then transform it into the desired representation. For example, the
following stores an integer parsed in an ASCII representation as a
``uint64``:

.. spicy-code:: parse-convert.spicy

    module Test;

    import spicy;

    public type Foo = unit {
        x: bytes &eod &convert=$$.to_uint();
        on %done { print self; }
    };

.. spicy-output:: parse-convert.spicy
    :exec: printf 12345 | spicy-driver %INPUT
    :show-with: foo.spicy

``&convert`` also works at the unit level to transform a whole
instance into a different value after it has been parsed:

.. spicy-code:: parse-convert-unit.spicy

    module Test;

    type Data = unit {
        data: bytes &size=2;
    } &convert=self.data.to_int();

    public type Foo = unit {
        numbers: Data[];

        on %done { print self.numbers; }
    };

.. spicy-output:: parse-convert-unit.spicy
    :exec: printf 12345678 | spicy-driver %INPUT
    :show-with: foo.spicy

Note how the ``Data`` instances have been turned into integers.
Without the ``&convert`` attribute, the output would have looked like
this::

    [[$data=b"12"], [$data=b"34"], [$data=b"56"], [$data=b"78"]]

.. _attribute_requires:

Enforcing Parsing Constraints
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fields may use an attribute ``&requires=EXPR`` to enforce additional
constraints on their values. ``EXPR`` must yield a boolean value and
will be evaluated after the parsing for the field has finished, but
before any hooks execute. If ``EXPR`` returns ``False``, the parsing
process will abort with an error, just as if the field had been
unparsable in the first place (including executing any
:ref:`%error ` hooks). ``EXPR`` has access to the parsed value through
:ref:`$$ `. It may also retrieve the field's final value through
``self.``, which can be helpful when :ref:`&convert ` is present.

Example:

.. spicy-code:: parse-requires.spicy

    module Test;

    import spicy;

    public type Foo = unit {
        x: int8 &requires=($$ < 5);
        on %done { print self; }
    };

.. spicy-output:: parse-requires.spicy 1
    :exec: printf '\001' | spicy-driver %INPUT
    :show-with: foo.spicy

.. spicy-output:: parse-requires.spicy 2
    :exec: printf '\010' | spicy-driver %INPUT
    :show-with: foo.spicy
    :expect-failure:

.. versionadded:: 1.12 Custom error messages

Instead of computing a boolean value directly, ``EXPR`` can also
leverage the :ref:`condition test operator ` to provide a custom error
message when the condition fails. Example:

.. spicy-code:: parse-requires-with-error.spicy

    module Test;

    import spicy;

    public type Foo = unit {
        x: int8 &requires=($$ < 5 : "x is too large"); # custom error message
        on %done { print self; }
    };

.. spicy-output:: parse-requires-with-error.spicy
    :exec: printf '\010' | spicy-driver %INPUT
    :show-with: foo.spicy
    :expect-failure:

One can also enforce conditions globally at the unit level through an
attribute ``&requires = EXPR``. ``EXPR`` will be evaluated once the
unit has been fully parsed, but before any ``%done`` hook executes. If
``EXPR`` returns ``False``, the unit's parsing process will abort with
an error. As usual, ``EXPR`` has access to the parsed instance through
``self``. More than one ``&requires`` attribute may be specified.

Example:

.. spicy-code:: parse-requires-property.spicy

    module Test;

    import spicy;

    public type Foo = unit {
        x: int8;
        on %done { print self; }
    } &requires = self.x < 5;

.. spicy-output:: parse-requires-property.spicy 1
    :exec: printf '\001' | spicy-driver %INPUT
    :show-with: foo.spicy

.. spicy-output:: parse-requires-property.spicy 2
    :exec: printf '\010' | spicy-driver %INPUT
    :show-with: foo.spicy
    :expect-failure:

.. _unit_hooks:

Unit Hooks
==========

Unit hooks provide one of the most powerful Spicy tools to control
parsing, track state, and retrieve results. Generally, hooks are
blocks of code triggered to execute at certain points during parsing,
with access to the current unit instance.

Conceptually, unit hooks are somewhat similar to methods: They have
bodies that execute when triggered, and these bodies may receive a set
of parameters as input. Unlike functions, however, a hook can have
more than one body. If multiple implementations are provided for the
same hook, all of them will execute successively.
A hook may also not have any body implemented at all, in which case
there's nothing to do when it executes.

The most commonly used hooks are:

``on %init() { ... }``
    Executes just before unit parsing starts.

``on %done { ... }``
    Executes just after unit parsing has completed successfully.

.. _on_error:

``on %error { ... }`` or ``on %error(msg: string) { ... }``
    Executes when a parse error has been encountered, just before the
    parser aborts processing. If the second form is used, a
    description of the error will be provided through the string
    argument.

``on %finally { ... }``
    Executes once unit parsing has completed in any way. This hook is
    most useful to modify global state that needs to be updated no
    matter the success of the parsing process. Once ``%init``
    triggers, this hook is guaranteed to eventually execute as well.
    It will run *after* either ``%done`` or ``%error``, respectively.
    (If a new error occurs during execution of ``%finally``, that will
    not trigger the unit's ``%error`` hook.)

``on %print { ... }``
    Executes when a unit is about to be printed (and more generally:
    when rendered into a string representation). By default, printing
    a unit will produce a list of its fields with their current
    values. Through this hook, a unit can customize its appearance by
    returning the desired string.

``on { ... }`` (field hook)
    Executes just after the given unit field has been parsed. The
    final value is accessible through ``$$``, potentially with any
    relevant type conversion applied (see :ref:`attribute_convert`).
    The same value will also have been assigned to the field already.

.. _foreach:

``on foreach { ... }`` (container hook)
    Assuming the specified field is a container (e.g., a vector), this
    executes each time a new container element has been parsed, and
    just before it's been added to the container. The element's final
    value is accessible through the ``$$`` identifier, although it can
    be further modified before it's stored.
    The hook implementation may also use the :ref:`statement_stop`
    statement to abort container parsing, without the current element
    being added anymore.

In addition, Spicy provides a set of hooks specific to the ``sink``
type which are discussed in the :ref:`section on sinks `, and hooks
which are executed during :ref:`error recovery `.

There are three locations where hooks can be implemented:

- Inside a unit, ``on { ... }`` implements the hook of the given
  name:

  .. spicy-code::

      type Foo = unit {
          x: uint32;
          v: uint8[];

          on %init { ... }
          on x { ... }
          on v foreach { ... }
          on %done { ... }
      }

- Field and container hooks may be directly attached to their field,
  skipping the ``on ...`` part:

  .. spicy-code::

      type Foo = unit {
          x: uint32 { ... }
          v: uint8[] foreach { ... }
      }

- At the global module level, one can add hooks to any available unit
  type through ``on :: { ... }``. With the definition of ``Foo``
  above, this implements hooks externally:

  .. spicy-code::

      on Foo::%init { ... }
      on Foo::x { ... }
      on Foo::v foreach { ... }
      on Foo::%done { ... }

  External hooks work across module boundaries by qualifying the unit
  type accordingly. They provide a powerful mechanism to extend a
  predefined unit without changing any of its code.

If multiple implementations are provided for the same hook, by default
it remains undefined in which order they will execute. If a particular
order is desired, you can specify priorities for your hook
implementations:

.. spicy-code::

    on Foo::v priority=5 { ... }
    on Foo::v priority=-5 { ... }

Implementations then execute in order of their priorities: The higher
a priority value, the earlier it will execute. If not specified, a
hook's priority is implicitly taken as zero.

.. note::

    When a hook executes, it has access to the current unit instance
    through the ``self`` identifier. The state of that instance will
    reflect where parsing is at that time. In particular, any field
    that hasn't been parsed yet will remain unset.
    You can use the ``?.`` unit operator to test if a field has
    received a value yet.

Unit Variables
==============

In addition to unit fields for parsing, you can also add further
instance variables to a unit type to store arbitrary state:

.. spicy-code:: unit-vars.spicy

    module Test;

    public type Foo = unit {
        on %init { print self; }
        x: int8 { self.a = "Our integer is %d" % $$; }
        on %done { print self; }

        var a: string;
    };

.. spicy-output:: unit-vars.spicy
    :exec: printf \05 | spicy-driver %INPUT
    :show-with: foo.spicy

Here, we assign a string value to ``a`` once we have parsed ``x``. The
final ``print`` shows the expected value. As you can also see, before
we assign anything, the variable's value is just empty: Spicy
initializes unit variables with well-defined defaults. If you would
rather leave a variable unset by default, you can add ``&optional``:

.. spicy-code:: unit-vars-optional.spicy

    module Test;

    public type Foo = unit {
        on %init { print self; }
        x: int8 { self.a = "Our integer is %d" % $$; }
        on %done { print self; }

        var a: string &optional;
    };

.. spicy-output:: unit-vars-optional.spicy
    :exec: printf \05 | spicy-driver %INPUT
    :show-with: foo.spicy

You can use the ``?.`` unit operator to test if an optional unit
variable remains unset, e.g., ``self?.x`` would return ``True`` if
field ``x`` is set and ``False`` otherwise.

Unit variables can also be initialized with custom expressions when
being defined. The initialization is performed just before the
containing unit starts parsing (implying that the expressions cannot
access parse results of the unit itself yet):

.. spicy-code:: unit-vars-init.spicy

    module Test;

    public type Foo = unit {
        x: int8;
        var a: int8 = 123;
        on %done { print self; }
    };

.. spicy-output:: unit-vars-init.spicy
    :exec: printf \05 | spicy-driver %INPUT
    :show-with: foo.spicy

.. _unit_parameters:

Unit Parameters
===============

Unit types can receive parameters upon instantiation, which will then
be available to any code inside the type's declaration:

.. spicy-code:: unit-params.spicy

    module Test;

    type Bar = unit(msg: string, mult: int8) {
        x: int8 &convert=($$ * mult);
        on %done { print "%s: %d" % (msg, self.x); }
    };

    public type Foo = unit {
        y: Bar("My multiplied integer", 5);
    };

.. spicy-output:: unit-params.spicy
    :exec: printf '\05' | spicy-driver %INPUT
    :show-with: foo.spicy

This example shows a typical idiom: We're handing values down to a
subunit through the parameters it receives. Inside the subunit, we
then have access to the values passed in.

.. note::

    It's usually not very useful to define a top-level parsing unit
    with parameters because we don't have a way to pass anything in
    through ``spicy-driver``. A custom host application could make use
    of them, though.

This works with subunits inside containers as well:

.. spicy-code:: unit-params-vector.spicy

    module Test;

    type Bar = unit(mult: int8) {
        x: int8 &convert=($$ * mult);
        on %done { print self.x; }
    };

    public type Foo = unit {
        x: int8;
        y: Bar(self.x)[];
    };

.. spicy-output:: unit-params-vector.spicy
    :exec: printf '\05\01\02\03' | spicy-driver %INPUT
    :show-with: foo.spicy

A common use-case for unit parameters is passing the ``self`` of a
higher-level unit down into a subunit:

.. spicy-code::

    type Foo = unit {
        ...
        b: Bar(self);
        ...
    }

    type Bar = unit(foo: Foo) {
        # We now have access to any state in "foo".
    }

That way, the subunit can for example store state directly in the
parent. If you declare the ``foo`` parameter as ``inout``, the subunit
can also modify its members.

Unit parameters generally follow the same passing conventions as
:ref:`function parameters `, yet with some restrictions. Just like
with functions, parameters are read-only by default. If you want the
receiving unit to be able to modify the value, there are two options:

1. If the parameter itself is a unit, you can declare it as ``inout``
   as described above.

2. For all other types, you instead need to pass the parameter as a
   :ref:`reference `.

Here's an example passing a string so that it can be modified by the
subunit:

.. spicy-code:: unit-params-string.spicy

    module Test;

    type X = unit(s: string&) {
        n: uint8 { *s = "Hello, world!"; }
    };

    public type Y = unit {
        x: X(self.s);

        on %done { print self.s; }

        var s: string& = new string;
    };

.. spicy-output:: unit-params-string.spicy
    :exec: printf '\x2a' | spicy-driver %INPUT
    :show-with: foo.spicy

.. **

.. note::

    While this lack of support for ``inout`` may seem like a
    surprising restriction at first, it follows from Spicy's safety
    guarantees: since a subunit may access its parameters during its
    entire lifetime, generally Spicy couldn't guarantee that a
    parameter passed as ``inout`` at initialization time would
    actually remain around for modification the whole time. References
    do not have that problem: their wrapped values are guaranteed to
    remain valid as long as necessary. (Units happen to share that
    behaviour, too, which is why Spicy can support ``inout`` for
    them.)

.. _unit_attributes:

Unit Attributes
===============

Unit types support the following type attributes:

``&byte-order=ORDER``
    Specifies a byte order to use for parsing the unit where ``ORDER``
    is of type :ref:`spicy_ByteOrder`. This overrides the byte order
    specified for the module. Individual fields can override this
    value by specifying their own byte order. Example:

    .. spicy-code::

        type Foo = unit {
            version: uint32;
        } &byte-order=spicy::ByteOrder::Little;

``&convert=EXPR``
    Replaces a unit instance with the result of the expression
    ``EXPR`` after parsing it from inside a parent unit. See
    :ref:`attribute_convert` for an example. ``EXPR`` has access to
    ``self`` to retrieve state from the unit.

``&requires=EXPR``
    Enforces post-conditions on the parsed unit.
    ``EXPR`` must be a boolean expression that will be evaluated after
    the parsing for the unit has finished, but before any hooks
    execute. More than one ``&requires`` attribute may be specified.
    Example:

    .. spicy-code::

        type Foo = unit {
            a: int8;
            b: int8;
        } &requires=self.a==self.b;

    See the :ref:`section on parsing constraints ` for more details.

``&size=N``
    Limits the unit's input to ``N`` bytes, which it must fully
    consume. Example:

    .. spicy-code::

        type Foo = unit {
            a: int8;
            b: bytes &eod;
        } &size=5;

    This expects 5 bytes of input when parsing an instance of ``Foo``.
    The unit will store the first byte into ``a``, and then fill ``b``
    with the remaining 4 bytes.

    The expression ``N`` has access to ``self`` as well as to the
    unit's parameters.

.. _unit_meta_data:

Meta data
=========

Units can provide meta data about their semantics through *properties*
that both Spicy itself and host applications can access. One defines
properties inside the unit's type through either a ``% = ;`` tuple, or
just as ``%;`` if the property does not take an argument. Currently,
units support the following meta data properties:

``%mime-type = STRING``
    A string of the form ``"/"`` that defines the MIME type for
    content the unit knows how to parse. This may include a ``*``
    wildcard for either the type or subtype. We use a generalized
    notion of MIME types here that can include custom meanings. See
    :ref:`sinks` for more on how these MIME types are used to select
    parsers dynamically during runtime.

    You can specify this property more than once to associate a unit
    with multiple types.

``%description = STRING``
    A short textual description of the unit type (i.e., the parser
    that it defines). Host applications have access to this property,
    and ``spicy-driver`` includes the information in the list of
    available parsers that it prints with the ``--list-parsers``
    option.
``%port = PORT_VALUE [&originator|&responder]``
    A :ref:`type_port` to associate this unit with, optionally
    including a direction to limit its use to the corresponding side.
    This property has no built-in effect, but host applications may
    make use of the information to decide which unit type to use for
    parsing a connection's payload.

``%skip = ( REGEXP | Null );``
    Specifies a pattern which should be skipped when encountered in
    the input stream in between parsing of unit fields. This
    overwrites a value set at the module level; use ``Null`` to reset
    the property, i.e., not skip anything.

``%skip-pre = ( REGEXP | Null );``
    Specifies a pattern which should be skipped when encountered in
    the input stream before parsing of a unit begins. This overwrites
    a value set at the module level; use ``Null`` to reset the
    property, i.e., not skip anything.

``%skip-post = ( REGEXP | Null );``
    Specifies a pattern which should be skipped when encountered in
    the input stream after parsing of a unit has finished. This
    overwrites a value set at the module level; use ``Null`` to reset
    the property, i.e., not skip anything.

.. _synchronize-at:

``%synchronize-at = EXPR;``
    Specifies a literal to synchronize on if the unit is used as a
    synchronization point during :ref:`error recovery `. The literal
    is left in the input stream.

.. _synchronize-after:

``%synchronize-after = EXPR;``
    Specifies a literal to synchronize on if the unit is used as a
    synchronization point during :ref:`error recovery `. The literal
    is consumed and will not be present in the input stream after
    successful synchronization.

Units support some further properties for other purposes, which we
introduce in the corresponding sections.

Parsing Types
=============

Several, but not all, of Spicy's :ref:`data types ` can be parsed from
binary data. In the following we summarize the types that can, along
with any options they support to control specifics of how they unpack
binary representations.

.. _parse_address:

Address
^^^^^^^

Spicy parses :ref:`addresses ` from either 4 bytes of input for IPv4
addresses, or 16 bytes for IPv6 addresses. To select the type, a unit
field of type ``addr`` must come with either an ``&ipv4`` or ``&ipv6``
attribute.

By default, addresses are assumed to be represented in network byte
order. Alternatively, a different byte order can be specified through
a ``&byte-order`` attribute specifying the desired
:ref:`spicy_byteorder`.

Example:

.. spicy-code:: parse-address.spicy

    module Test;

    import spicy;

    public type Foo = unit {
        ip: addr &ipv6 &byte-order=spicy::ByteOrder::Little;
        on %done { print self; }
    };

.. spicy-output:: parse-address.spicy
    :exec: printf '1234567890123456' | spicy-driver %INPUT
    :show-with: foo.spicy

.. _parse_bitfield:

Bitfield
^^^^^^^^

:ref:`Bitfields ` parse an integer value of a given size, and then
make selected smaller bit ranges within that value available
individually through dedicated identifiers. For example, the following
unit parses 4 bytes as a ``uint32`` and then makes the value of bit 0
available as ``f.x1``, bits 1 to 2 as ``f.x2``, and bits 3 to 4 as
``f.x3``, respectively:

.. spicy-code:: parse-bitfield.spicy

    module Test;

    public type Foo = unit {
        f: bitfield(32) {
            x1: 0;
            x2: 1..2;
            x3: 3..4;
        };

        on %done {
            print self.f.x1, self.f.x2, self.f.x3;
            print self;
        }
    };

.. spicy-output:: parse-bitfield.spicy
    :exec: printf '\01\02\03\04' | spicy-driver %INPUT
    :show-with: foo.spicy

Generally, a ``bitfield(N)`` field is parsed like a ``uint``. The
field then supports dereferencing individual bit ranges through their
labels. The corresponding expressions (``self.x.``) have the same
``uint`` type as the parsed value itself, with the value shifted to
the right so that the least significant extracted bit becomes the
least significant bit of the returned value. As you can see in the
example, the type of the field itself becomes a tuple composed of the
values of the individual bit ranges.
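
The shift-and-mask rule described above can also be sketched outside
of Spicy. The following Python snippet is only an illustration of the
arithmetic, not how Spicy implements bitfields; the helper name
``extract_bits`` is hypothetical. It assumes network byte order and a
bit numbering in which bit 0 is the least significant bit, and it
reuses the four input bytes from the example above:

.. code:: python

    import struct


    def extract_bits(value: int, lower: int, upper: int) -> int:
        """Return bits lower..upper (inclusive) of value.

        Bit 0 is the least significant bit. The result is shifted
        right so that bit `lower` becomes bit 0 of the return value.
        """
        width = upper - lower + 1
        return (value >> lower) & ((1 << width) - 1)


    raw = b"\x01\x02\x03\x04"

    # ">I" unpacks an unsigned 32-bit integer in big-endian
    # (network) byte order, yielding 0x01020304.
    (value,) = struct.unpack(">I", raw)

    x1 = extract_bits(value, 0, 0)  # bit 0
    x2 = extract_bits(value, 1, 2)  # bits 1 to 2
    x3 = extract_bits(value, 3, 4)  # bits 3 to 4

    print(x1, x2, x3)  # prints: 0 2 0

Only the last input byte (``0x04``, binary ``00000100``) contributes
here, because all three ranges lie within the lowest five bits of the
integer.
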
By default, a bitfield assumes the underlying integer comes in network
byte order. You can specify a ``&byte-order`` attribute to change that
(e.g., ``bitfield(32) { ... } &byte-order=spicy::ByteOrder::Little``).

When parsing a ``bitfield(16)`` in network byte order and with bit
order ``spicy::BitOrder::LSB0`` (the default value of ``&bit-order``),
bits are numbered 0 to 15 from right to left.

.. code::

    MSB                           LSB
           <-- 1               <-- 0
     5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
    +---------------+---------------+
    |               |               |
    +-------------------------------+

This default bit numbering may be surprising given that some RFCs use
the inverse as documented in `RFC 1700 `_. Here, the most significant
bit is numbered 0 on the left with higher bit numbers representing
less significant bits to the right. Concrete examples would be the
`WebSocket framing `_ or `IPv4 header `_ notations.

.. code::

    MSB                           LSB
     0 -->               1 -->
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
    +-+-+-+-+-------+-+-------------+
    |F|R|R|R| opcode|M| Payload len |
    |I|S|S|S|  (4)  |A|     (7)     |
    |N|V|V|V|       |S|             |
    | |1|2|3|       |K|             |
    +-+-+-+-+-------+-+-------------+

To express such bitfields more naturally in Spicy, use
``&bit-order=spicy::BitOrder::MSB0`` on the whole bitfield:

.. spicy-code:: parse-websocket-bitfield.spicy

    module WebSocket;

    import spicy;

    public type Header = unit {
        : bitfield(32) {
            fin: 0;
            rsv: 1..3;
            opcode: 4..7;
            mask: 8;
            payload_len: 9..15;
        } &bit-order=spicy::BitOrder::MSB0;
    };

The way to think about this is that the most significant bit of an
integer in network byte order is always the leftmost bit and the least
significant bit the rightmost one. Specifying the bit order as
``LSB0`` or ``MSB0`` essentially sets the bit numbering direction by
specifying the location of bit 0.

With little endian byte order, the bits are numbered zigzag-wise and
``MSB0`` and ``LSB0`` can again be used to change the direction of the
bit numbering. The following example uses
``spicy::ByteOrder::Little`` and the default ``LSB0`` bit order for
``bitfield(16)``.
Notice how the most significant and least significant bit for a 2 byte little endian integer are next to each other. .. code:: f: bitfield(16) { ... } &byte-order=spicy::ByteOrder::Little; LSB MSB <-- 0 <-- 1 7 6 5 4 3 2 1 0 5 4 3 2 1 0 9 8 +---------------+---------------+ | | | +-------------------------------+ With ``MSB0`` as bit order, the bit numbering direction is from left to right, instead: .. code:: f: bitfield(16) { ... } &byte-order=spicy::ByteOrder::Little &bit-order=spicy::BitOrder::MSB0; LSB MSB 1 --> 0 --> 8 9 0 1 2 3 4 5 0 1 2 3 4 5 6 7 +---------------+---------------+ | | | +-------------------------------+ Bit numbering with larger sized bitfields in little endian gets only more confusing. Prefer network byte ordered bitfields unless it makes sense given the spec you're working with. The individual bit ranges support the ``&convert`` attribute and will adjust their types accordingly, just like a regular unit field (see :ref:`attribute_convert`). For example, that allows for mapping a bit range to an enum, using ``$$`` to access the parsed value: .. spicy-code:: parse-bitfield-enum.spicy module Test; import spicy; type X = enum { A = 1, B = 2 }; public type Foo = unit { f: bitfield(8) { x1: 0..3 &convert=X($$); x2: 4..7 &convert=X($$); } { print self.f.x1, self.f.x2; } }; .. spicy-output:: parse-bitfield-enum.spicy :exec: printf '\x21' | spicy-driver %INPUT :show-with: foo.spicy When parsing a bitfield, you can enforce expected values for some or all of the bitranges through an assignment-style syntax: .. spicy-code:: type Foo = unit { f: bitfield(8) { x1: 0..3 = 2; x2: 4..5; x3: 6..7 = 3; } }; Now parsing will fail if values of ``x1`` and ``x3`` aren't ``2`` and ``3``, respectively. Internally, Spicy treats bitfields with such expected values similar to constants of other types, meaning they operate as valid look-ahead symbols as well (see :ref:`parse_lookahead`). .. 
_parse_bytes: Bytes ^^^^^ When parsing a field of type :ref:`type_bytes`, Spicy will consume raw input bytes according to a specified attribute that determines when to stop. The following attributes are supported: ``&eod`` Consumes all subsequent data until the end of the input is reached. ``&size=N`` Consumes exactly ``N`` bytes. The attribute may be combined with ``&eod`` to consume up to ``N`` bytes instead (i.e., permit running out of input before the size limit is reached). (This attribute :ref:`works for fields of all types `. We list it here because it's particularly common to use it with ``bytes``.) ``&until=DELIM`` Consumes bytes until the specified delimiter is found. ``DELIM`` must be of type ``bytes`` itself. The delimiter will not be included in the resulting value, but consumed. ``&until-including=DELIM`` Similar to ``&until``, but this does include the delimiter ``DELIM`` in the resulting value. At least one of these attributes must be provided. On top of that, bytes fields support the attribute ``&chunked`` to change how the parsed data is processed and stored. Normally, a bytes field will first accumulate all desired data and then store the final, complete value in the field. With ``&chunked``, if the data arrives incrementally in pieces, the field instead processes just whatever is available at a time, storing each piece directly, and individually, in the field. Each time a piece gets stored, any associated field hooks execute with the new part as their ``$$``. Parsing with ``&chunked`` will eventually still consume the same number of bytes overall, but it avoids buffering everything in cases where that's either infeasible or simply not needed. Bytes fields support parsing constants: If a ``bytes`` constant is specified instead of a field type, parsing will expect to find the corresponding value in the input stream. .. _parse_integer: Integer ^^^^^^^ Fields of :ref:`integer type ` can be either signed (``intN``) or unsigned (``uintN``).
In either case, the bit length ``N`` determines the number of bytes being parsed. By default, integers are expected to come in network byte order. You can specify a different order through the ``&byte-order=ORDER`` attribute, where ``ORDER`` is of type :ref:`spicy_ByteOrder`. Integer fields support parsing constants: If an integer constant is specified instead of a field type, parsing will expect to find the corresponding value in the input stream. Since the exact type of the integer constant is important, you should use the constructor syntax to make that explicit (e.g., ``uint32(42)``, ``int8(-1)``; vs. using just ``42`` or ``-1``). .. _parse_real: Real ^^^^ Real values are parsed as either single or double precision values in IEEE754 format, depending on the value of their ``&type=T`` attribute, where ``T`` is one of :ref:`spicy_RealType`. .. _parse_regexp: Regular Expression ^^^^^^^^^^^^^^^^^^ When parsing a field through a :ref:`type_regexp`, the expression is expected to match at the current position of the input stream. The field's type becomes ``bytes``, and it will store the matching data. Inside hooks for fields with regular expressions, you can access capture groups through ``$1``, ``$2``, ``$3``, etc. For example: .. spicy-code:: x : /(a.c)(de*f)(h.j)/ { print $1, $2, $3; } This will print out the relevant pieces of the data matching the corresponding set of parentheses. (There's no ``$0``, just use ``$$`` as normal to get the full match.) Matching a regular expression is more expensive if you need it to capture groups. If you are using groups inside your expression but don't need the actual captures, add ``&nosub`` to the field to remove that overhead. .. _parse_unit: Unit ^^^^ Fields can have the type of another unit, in which case parsing will descend into that subunit's grammar until that instance has been fully parsed. Field initialization and hooks work as usual.
If the subunit receives parameters, they must be given right after the type. .. spicy-code:: parse-unit-params.spicy module Test; type Bar = unit(a: string) { x: uint8 { print "%s: %u" % (a, self.x); } }; public type Foo = unit { y: Bar("Spicy"); on %done { print self; } }; .. spicy-output:: parse-unit-params.spicy :exec: printf '\01\02' | spicy-driver %INPUT :show-with: foo.spicy See :ref:`unit_parameters` for more. .. _parse_vector: Vector ^^^^^^ Parsing a :ref:`vector ` creates a loop that repeatedly parses elements of the specified type from the input stream until an end condition is reached. The field's value accumulates all the elements into the final vector. Spicy uses a specific syntax to define fields of type vector:: NAME : ELEM_TYPE[SIZE] ``NAME`` is the field name as usual. ``ELEM_TYPE`` is the type of the vector's elements, i.e., the type that will be repeatedly parsed. ``SIZE`` is the number of elements to parse into the vector; this is an arbitrary Spicy expression yielding an integer value. The resulting field type then will be ``vector<ELEM_TYPE>``. Here's a simple example parsing five ``uint8``: .. spicy-code:: parse-vector.spicy module Test; public type Foo = unit { x: uint8[5]; on %done { print self; } }; .. spicy-output:: parse-vector.spicy :exec: printf '\01\02\03\04\05' | spicy-driver %INPUT :show-with: foo.spicy It is possible to skip the ``SIZE`` (e.g., ``x: uint8[]``) and instead use another kind of end condition to terminate a vector's parsing loop. To that end, vectors support the following attributes: ``&eod`` Parses elements until the end of the input stream is reached. ``&size=N`` Parses the vector from the subsequent ``N`` bytes of input data. This effectively limits the available input to the corresponding window, letting the vector parse elements until it runs out of data. (This attribute :ref:`works for fields of all types `. We list it here because it's particularly common to use it with vectors.)
``&until=EXPR`` Vector elements are parsed in a loop with ``EXPR`` being evaluated as a boolean expression after each parsed element, and before adding the element to the vector. Once ``EXPR`` evaluates to true, parsing stops *without* adding the element that was just parsed. Inside ``EXPR``, ``$$`` refers to the element most recently parsed. ``&until-including=EXPR`` Similar to ``&until``, but does include the final element ``EXPR`` into the field's vector when stopping parsing. Inside ``EXPR``, ``$$`` refers to the element most recently parsed. ``&while=EXPR`` Continues parsing as long as the boolean expression ``EXPR`` evaluates to true. Inside ``EXPR``, ``$$`` refers to the element most recently parsed. If neither a size nor an attribute is given, Spicy will attempt to use :ref:`look-ahead parsing ` to determine the end of the vector based on the next expected token. Depending on the unit's field, this may not be possible, in which case Spicy will decline to compile the unit. The syntax shown above generally works for all element types, including subunits (e.g., ``x: MyUnit[]``). .. note:: The ``x: (<T>)[]`` syntax is quite flexible. In fact, ``<T>`` is not limited to subunits, but allows for any standard field specification defining how to parse the vector elements. For example, ``x: (bytes &size=5)[];`` parses a vector of 5-character ``bytes`` instances. .. _hook_foreach: When parsing a vector, Spicy supports using a special kind of field hook, ``foreach``, that executes for each parsed element individually. Inside that hook, ``$$`` refers to the element's final value: .. spicy-code:: parse-vector-foreach.spicy module Test; public type Foo = unit { x: uint8[5] foreach { print $$, self.x; } }; .. spicy-output:: parse-vector-foreach.spicy :exec: printf '\01\02\03\04\05' | spicy-driver %INPUT :show-with: foo.spicy As you can see, when a ``foreach`` hook executes the element has not yet been added to the vector.
You may indeed use a ``stop`` statement inside a ``foreach`` hook to abort the vector's parsing without adding the current element anymore. See :ref:`unit_hooks` for more on hooks. .. _parse_void: Void ^^^^ The :ref:`type_void` type can be used as a placeholder in fields not meant to consume any data. This can be useful in some situations, such as providing a branch in :ref:`switch ` constructs that forgoes any parsing, or attaching a :ref:`&requires ` attribute to enforce a condition. Fields of type ``void`` do not have any accessible value. Controlling Parsing =================== Spicy offers a few additional constructs inside a unit's declaration for steering the parsing process. We discuss them in the following. Conditional Parsing ^^^^^^^^^^^^^^^^^^^ A unit field may be conditionally skipped for parsing by adding an ``if ( COND )`` clause, where ``COND`` is a boolean expression. The field will be only parsed if the expression evaluates to true at the time the field is next in line. .. spicy-code:: parse-if.spicy module Test; public type Foo = unit { a: int8; b: int8 if ( self.a == 1 ); c: int8 if ( self.a % 2 == 0 ); d: int8; on %done { print self; } }; .. spicy-output:: parse-if.spicy :exec: printf '\01\02\03\04' | spicy-driver %INPUT; printf '\02\02\03\04' | spicy-driver %INPUT :show-with: foo.spicy .. versionadded:: 1.12 Conditional blocks If the same condition applies to multiple subsequent fields, they can be grouped together into a single conditional block: .. spicy-code:: parse-if-block.spicy module Test; public type Foo = unit { a: int8; if ( self.a == 1 ) { b: int8; c: int8; }; # note the trailing semicolon on %done { print self; } }; The syntax supports an optional ``else``-block as well: ..
spicy-code:: parse-if-block-with-else.spicy module Test; public type Foo = unit { a: int8; if ( self.a == 1 ) { b: int8; } else { c: int8; }; # note the trailing semicolon on %done { print self; } }; For repeated cases of conditional parsing where a single expression evaluates to one of several values, unit :ref:`parse_switch` statements might allow for more compact and easier to maintain code. .. _parse_lookahead: Look-Ahead ^^^^^^^^^^ Internally, Spicy builds an LR(1) grammar for each unit that it parses, meaning that it can actually look *ahead* in the parsing stream to determine how to process the current input location. Roughly speaking, if (1) the current construct does not have a clear end condition defined (such as a specific length), and (2) a specific value is expected to be found next; then the parser will keep looking for that value and end the current construct once it finds it. "Construct" deliberately remains a bit of a fuzzy term here, but think of vector parsing as the most common instance of this: If you don't give a vector an explicit termination condition (as discussed in :ref:`parse_vector`), Spicy will look at what's expected to come *after* the container. As long as that's something clearly recognizable (e.g., a specific value of an atomic type, or a match for a regular expression), it'll terminate the vector accordingly. Here's an example: .. spicy-code:: parse-look-ahead.spicy module Test; public type Foo = unit { data: uint8[]; : /EOD/; x : int8; on %done { print self; } }; .. spicy-output:: parse-look-ahead.spicy :exec: printf '\01\02\03EOD\04' | spicy-driver %INPUT :show-with: foo.spicy For vectors, Spicy attempts look-ahead parsing automatically as a last resort when it doesn't find more explicit instructions. However, it will reject a unit if it can't find a suitable look-ahead symbol to work with. 
If we had written ``int32`` in the example above, that would not have worked as the parser can't recognize when there's an ``int32`` coming; it would need to be a concrete value, such as ``int32(42)``. See the :ref:`parse_switch` construct for another instance of look-ahead parsing. .. _parse_switch: ``switch`` ^^^^^^^^^^ Spicy supports a ``switch`` construct as a way to branch into one of several parsing alternatives. There are two variants of this, an explicit branch and one driven by look-ahead: .. rubric:: Branch by expression The most basic form of switching by expression looks like this: .. spicy-code:: switch ( EXPR ) { VALUE_1 -> FIELD_1; VALUE_2 -> FIELD_2; ... VALUE_N -> FIELD_N; }; This evaluates ``EXPR`` at the time parsing reaches the ``switch``. If there's a ``VALUE`` matching the result, parsing continues with the corresponding field, and then proceeds with whatever comes after the switch. Example: .. spicy-code:: parse-switch.spicy module Test; public type Foo = unit { x: bytes &size=1; switch ( self.x ) { b"A" -> a8: int8; b"B" -> a16: int16; b"C" -> a32: int32; }; on %done { print self; } }; .. spicy-output:: parse-switch.spicy :exec: printf 'A\01' | spicy-driver %INPUT; printf 'B\01\02' | spicy-driver %INPUT :show-with: foo.spicy We see in the output that all of the alternatives turn into normal unit members, with all but the one for the branch that was taken left unset. If none of the values match the expression, that's considered a parsing error and processing will abort. Alternatively, one can add a default alternative by using ``*`` as the value. The branch will then be taken whenever no other value matches. A couple of additional notes about the fields inside an alternative: - In our example, the fields of all alternatives have different names, and they all show up in the output. One can also reuse names across alternatives as long as the types exactly match. In that case, the unit will end up with only a single instance of that member.
- An alternative can match against more than one value by separating them with commas (e.g., ``b"A", b"B" -> x: int8;``). - Alternatives can have more than one field attached by enclosing them in braces, i.e., ``VALUE -> { FIELD_1a; FIELD_1b; ...; FIELD_1n; }``. - Sometimes one really just needs the branching capability, but doesn't have any field values to store. In that case an anonymous ``void`` field may be helpful (e.g., ``b"A" -> : void { DoSomethingHere(); }``). .. rubric:: Branch by look-ahead ``switch`` also works without any expression as long as the presence of all the alternatives can be reliably recognized by looking ahead in the input stream: .. spicy-code:: parse-switch-lhead.spicy module Test; public type Foo = unit { switch { -> a: b"A"; -> b: b"B"; -> c: b"C"; }; on %done { print self; } }; .. spicy-output:: parse-switch-lhead.spicy :exec: printf 'A' | spicy-driver %INPUT :show-with: foo.spicy While this example is a bit contrived, the mechanism becomes powerful once you have subunits that are recognizable by how they start: .. spicy-code:: parse-switch-lhead-2.spicy module Test; type A = unit { a: b"A"; }; type B = unit { b: uint16(0xffff); }; public type Foo = unit { switch { -> a: A; -> b: B; }; on %done { print self; } }; .. spicy-output:: parse-switch-lhead-2.spicy :exec: printf 'A ' | spicy-driver %INPUT; printf '\377\377' | spicy-driver %INPUT :show-with: foo.spicy .. rubric:: Switching Over Fields With Common Size You can limit the input any field in a unit switch receives by attaching an optional ``&size=EXPR`` attribute that specifies the number of raw bytes to make available. This is analogous to the :ref:`field size attribute ` and especially useful to remove duplication when each case is subject to the same constraint. .. spicy-code:: parse-switch-size.spicy module Test; public type Foo = unit { tag: uint8; switch ( self.tag ) { 1 -> b1: bytes &eod; 2 -> b2: bytes &eod &convert=$$.lower(); } &size=3; on %done { print self; } }; ..
spicy-output:: parse-switch-size.spicy :exec: printf '\01ABC' | spicy-driver %INPUT; printf '\02ABC' | spicy-driver %INPUT :show-with: foo.spicy .. _backtracking: Backtracking ^^^^^^^^^^^^ Spicy supports a simple form of manual backtracking. If a field is marked with ``&try``, a later call to the unit's ``backtrack()`` method anywhere down in the parse tree originating at that field will immediately transfer control over to the field following the ``&try``. When doing so, the data position inside the input stream will be reset to where it was when the ``&try`` field started its processing. Units along the original path will be left in whatever state they were at the time ``backtrack()`` executed (i.e., they will probably remain just partially initialized). When ``backtrack()`` is called on a path that involves multiple ``&try`` fields, control continues after the most recent. Example: .. spicy-code:: parse-backtrack.spicy module Test; public type test = unit { foo: Foo &try; bar: Bar; on %done { print self; } }; type Foo = unit { a: int8 { if ( $$ != 1 ) self.backtrack(); } b: int8; }; type Bar = unit { a: int8; b: int8; }; .. spicy-output:: parse-backtrack.spicy :exec: printf '\001\002\003\004' | spicy-driver %INPUT; printf '\003\004' | spicy-driver %INPUT :show-with: backtrack.spicy ``backtrack()`` can be called from inside :ref:`%error hooks `, so this provides a simple form of error recovery as well. .. note:: This mechanism is preliminary and will probably see refinement over time, both in terms of more automated backtracking and by providing better control where to continue after backtracking. Changing Input ============== By default, a Spicy parser proceeds linearly through its inputs, parsing as much as it can and yielding back to the host application once it runs out of input. There are two ways to change this linear model: diverting parsing to a different input, and random access within the current unit's data. .. 
rubric:: Parsing custom data A unit field can have either ``&parse-from=EXPR`` or ``&parse-at=EXPR`` attached to it to change where it's receiving its data to parse from. ``EXPR`` is evaluated at the time the field is reached. For ``&parse-from`` it must produce a value of type ``bytes``, which will then constitute the input for the field. This can, e.g., be used to reparse previously received input: .. spicy-code:: parse-parse.spicy module Test; public type Foo = unit { x: bytes &size=2; y: uint16 &parse-from=self.x; z: bytes &size=2; on %done { print self; } }; .. spicy-output:: parse-parse.spicy :exec: printf '\x01\x02\x03\04' | spicy-driver %INPUT :show-with: foo.spicy For ``&parse-at``, ``EXPR`` must yield an iterator pointing to (a still valid) position of the current unit's input stream (such as retrieved through :spicy:method:`unit::input`). The field will then be parsed from the data starting at that location. .. _random_access: .. rubric:: Random access While a unit is being parsed, you may revert the current input position backwards to any location between the first byte the unit has seen and the current position. You can use a set of built-in unit methods to control the current position: :spicy:method:`unit::input` Returns a stream iterator pointing to the current input position. :spicy:method:`unit::set_input` Sets the current input position to the location of the specified stream iterator. Per above, the new position needs to reside between the beginning of the current unit's data and the current position; otherwise an exception will be generated at runtime. :spicy:method:`unit::offset` Returns the numerical offset of the current input position relative to position of the first byte fed into this unit. :spicy:method:`unit::position` Returns iterator to the current input position in the stream fed into this unit. 
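As a minimal sketch of ``&parse-at``, the following variation of the ``&parse-from`` example above passes the iterator returned by ``input()`` straight to the attribute. It assumes that ``&parse-at``, like ``&parse-from``, leaves the unit's regular input position untouched, so that ``y`` would be parsed from the same bytes that ``z`` consumes afterwards. Treat it as an illustration of the syntax rather than a reference for the exact semantics.

.. spicy-code::

    module Test;

    public type Foo = unit {
        x: bytes &size=2;

        # Parse ``y`` from the stream position we are currently at. The
        # expression must yield an iterator into this unit's input.
        y: uint16 &parse-at=self.input();

        # Assumption: regular parsing continues right after ``x``.
        z: bytes &size=2;

        on %done { print self; }
    };

Any other iterator that is still valid for the unit's input would work the same way, for example one saved earlier in a unit variable.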
You can achieve random access by saving an iterator from ``input()`` in a unit variable, then later return to that position (or one derived from it) by calling ``set_input()`` with that variable. Here's an example that parses input data twice with different subunits: .. spicy-code:: parse-random-access.spicy module Test; public type Foo = unit { on %init() { self.start = self.input(); } a: A { self.set_input(self.start); } b: B; on %done() { print self; } var start: iterator<stream>; }; type A = unit { x: uint32; }; type B = unit { y: bytes &size=4; }; .. spicy-output:: parse-random-access.spicy :exec: printf '\00\00\00\01' | spicy-driver %INPUT :show-with: foo.spicy If you look at the output, you see that the ``start`` iterator remembers its offset, relative to the global input stream. It would also show the data at that offset if the parser had not already discarded that at the time we print it out. .. note:: Spicy parsers discard input data as quickly as possible as parsing moves through the input stream. Indeed, that's why using random access may come with a performance penalty as the parser now needs to buffer all of the unit's data until it has been fully processed. .. _filters: Filters ======= Spicy supports attaching *filters* to units that get to preprocess and transform a unit's input before its parser gets to see it. A typical use case for this is stripping off a data encoding, such as compression or Base64. A filter is itself just a ``unit`` that comes with an additional property ``%filter`` marking it as such. The filter unit's input represents the original input to be transformed. The filter calls an internally provided unit method :spicy:method:`unit::forward` to pass any transformed data on to the main unit that it's attached to. The filter can call ``forward`` arbitrarily many times, each time forwarding a subsequent chunk of input. To attach a filter to a unit, one calls the method :spicy:method:`unit::connect_filter` with an instance of the filter's type.
Putting that all together, this is an example of a simple filter that upper-cases all input before the main parsing unit gets to see it: .. spicy-code:: parse-filter.spicy module Test; type Filter = unit { %filter; : bytes &eod &chunked { self.forward($$.upper()); } }; public type Foo = unit { on %init { self.connect_filter(new Filter); } x: bytes &size=5 { print self.x; } }; .. spicy-output:: parse-filter.spicy :exec: printf 'aBcDe' | spicy-driver %INPUT :show-with: foo.spicy There are a couple of predefined filters coming with Spicy that become available by importing the ``filter`` library module: ``filter::Zlib`` Provides zlib decompression. ``filter::Base64Decode`` Provides base64 decoding. .. _sinks: Sinks ===== Sinks provide a powerful mechanism to chain multiple units together into a layered stack, each processing the output of its predecessor. A sink is the connector here that links two unit instances: one side writing and one side reading, like a Unix pipe. As additional functionality, the sink can internally reassemble data chunks that are arriving out of order before passing anything on. Here's a basic example of two unit types chained through a sink: .. spicy-code:: parse-sink.spicy module Test; public type A = unit { on %init { self.b.connect(new B); } length: uint8; data: bytes &size=self.length { self.b.write($$); } on %done { print "A", self; } sink b; }; public type B = unit { : /GET /; path: /[^\n]+/; on %done { print "B", self; } }; .. spicy-output:: parse-sink.spicy :exec: printf '\13GET /a/b/c\n' | spicy-driver -p Test::A %INPUT :show-with: foo.spicy Let's see what's going on here. First, there's ``sink b`` inside the declaration of ``A``. That's the connector, kept as state inside ``A``. When parsing for ``A`` is about to begin, the ``%init`` hook connects the sink to a :ref:`new instance ` of ``B``; that'll be the receiver for data that ``A`` is going to write into the sink.
That writing happens inside the field hook for ``data``: once we have parsed that field, we write what will go to the sink using its built-in :spicy:method:`sink::write` method. With that write operation, the data will emerge as input for the instance of ``B`` that we created earlier, and that will just proceed parsing it normally. As the output shows, in the end both unit instances end up having their fields set. As an alternative for using the :spicy:method:`sink::write` in the example, there's some syntactic sugar for fields of type ``bytes`` (like ``data`` here): We can just replace the hook with a ``->`` operator to have the parsed data automatically be forwarded to the sink: ``data: bytes &size=self.length -> self.b``. Sinks have a number of further methods, see :ref:`type_sink` for the complete reference. Most of them we will also encounter in the following when discussing additional functionality that sinks provide. .. note:: Because sinks are meant to decouple processing between two units, a unit connected to a sink will *not* pass any parse errors back up to the sink's parent. If you want to catch them, install an :ref:`%error ` hook inside the connected unit. Using Filters ^^^^^^^^^^^^^ Sinks also support :ref:`filters ` to preprocess any data they receive before forwarding it on. This works just like for units by calling the built-in sink method :spicy:method:`sink::connect_filter`. For example, if in the example above, ``data`` would have been gzip compressed, we could have instructed the sink to automatically decompress it by calling ``self.b.connect_filter(new filter::Zlib)`` (leveraging the Spicy-provided ``Zlib`` filter). Leveraging MIME Types ^^^^^^^^^^^^^^^^^^^^^ In our example above we knew which type of unit we wanted to connect. In practice, that may or may not be the case. 
Often, it only becomes clear at runtime what the choice for the next layer should be, such as when using well-known ports to determine the appropriate application-layer analyzer for a TCP stream. Spicy supports dynamic selection through a generalized notion of MIME types: Units can declare which MIME types they know how to parse (see :ref:`unit_meta_data`), and sinks have a :spicy:method:`sink::connect_mime_type` method that will instantiate and connect any that match their argument (if that's multiple, all will be connected and all will receive the same data). "MIME type" can mean actual MIME types, such as ``text/html``. Applications can, however, also define their own notion of ``<type>/<subtype>`` to model other semantics. For example, one could use ``x-port/443`` as a convention to trigger parsers by well-known port. An SSL unit would then declare ``%mime-type = "x-port/443"``, and the connection would be established through the equivalent of ``connect_mime_type("x-port/%d" % resp_port_of_connection)``. .. todo:: For this specific example, there's a better solution: We also have the ``%port`` property and should just build up a table index on that. Reassembly ^^^^^^^^^^ Reassembly (or defragmentation) of out-of-order data chunks is a common requirement for many protocols. Sinks have that functionality built-in by allowing you to associate a position inside a virtual sequence space with each chunk of data. Sinks will then pass their data on to connected units only once they have collected a continuous, in-order range of bytes. The easiest way to leverage this is to simply associate sequence numbers with each :spicy:method:`sink::write` operation: .. spicy-code:: parse-reassembly.spicy module Test; public type Foo = unit { sink data; on %init { self.data.connect(new Bar); self.data.write(b"567", 5); self.data.write(b"89", 8); self.data.write(b"012", 0); self.data.write(b"34", 3); } }; public type Bar = unit { s: bytes &eod; on %done { print self.s; } }; ..
spicy-output:: parse-reassembly.spicy :exec: spicy-driver -p Test::Foo %INPUT `, for example by maintaining a global map that ties a unique connection ID to the information that needs to be retained. However, doing so is clearly cumbersome and error prone. As an alternative, a unit can make use of a dedicated *context* value, which is an instance of a custom type that has its lifetime determined by the host application running the parser. For example, Zeek will tie the context to the underlying connection. Any public unit can declare a context through a unit-level property called ``%context``, which takes an arbitrary type as its argument. For example: .. spicy-code:: public type Foo = unit { %context = bytes; [...] }; When used as a top-level entry point to parsing, the unit will then, by default, receive a unique context value of that type. That context value can be accessed through the :spicy:method:`unit::context` method, which will return a :ref:`reference ` to it: .. spicy-code:: context-empty.spicy module Test; public type Foo = unit { %context = int64; on %init { print self.context(); } }; .. spicy-output:: context-empty.spicy :exec: spicy-driver %INPUT ` always creates a single context value shared by all top-level units belonging to the same connection, enabling parsers to maintain bi-directional, per-connection state. The batch mode of :ref:`spicy-driver ` does the same. .. note:: A unit's context value gets set only when a host application uses it as the top-level starting point for parsing. If in the above example `Foo` wasn't the entry point, but used inside another unit further down during the parsing process, its context would remain unset. As an example, the following grammar---mimicking a request/reply-style protocol---maintains a queue of outstanding textual commands to then associate numerical result codes with them as the responses come in: .. 
spicy-code:: context-pipelining.spicy module Test; # We wrap the state into a tuple to make it easy to add more attributes if needed later. type Pending = tuple<pending: vector<bytes>>; public type Requests = unit { %context = Pending; : Request[] foreach { self.context().pending.push_back($$.cmd); } }; public type Replies = unit { %context = Pending; : Reply[] foreach { if ( |self.context().pending| ) { print "%s -> %s" % (self.context().pending.back(), $$.response); self.context().pending.pop_back(); } else print " -> %s", $$.response; } }; type Request = unit { cmd: /[A-Za-z]+/; : b"\n"; }; type Reply = unit { response: /[0-9]+/; : b"\n"; }; .. spicy-output:: context-pipelining.spicy :exec: spicy-driver -F programming/examples/context-input.dat %INPUT :show-as: spicy-driver -F input.dat context.spicy The output is produced from :download:`this input batch file `. This would work the same when used with Zeek on a corresponding packet trace. Note that the units for the two sides of the connection need to declare the same ``%context`` type. Processing will abort at runtime with a type mismatch error if that's not the case. .. _error_handling: Error Handling =============== Whenever a parser encounters an unexpected situation during processing, it triggers a runtime error. This includes parsing errors due to input that does not match the current unit, failing :ref:`&requires ` conditions, and also any logic errors in hooks, such as attempting to read an unset unit field or accessing an invalid vector index. By default, any runtime error will cause the parsing to terminate immediately, with a corresponding error message reported back to the host application. The Spicy parser will not be able to continue processing afterwards. However, there are a couple of ways to catch *parsing errors* (but not other runtime errors) and potentially recover from them, which we discuss in the following. ..
_parsing_errors:

A unit can provide special :ref:`%error hooks ` that will execute when a parsing error is encountered. A unit-wide ``%error`` hook will catch all parsing errors occurring anywhere inside the unit, including any sub-units (unless the sub-unit has already handled the error itself). Example:

.. code-block:: spicy

    module MyModule;

    type MyType = unit {
        magic: b"MAGIC";

        on %error(msg: string) {
            print "Error when parsing MyType: ", msg;
        }
    };

The ``msg`` parameter is optional. If it's specified, it will contain an error message describing the issue.

By default, even with an ``%error`` hook in place, the parser will still terminate after executing the hook. To change that, the hook may use :ref:`backtracking` to specify where to continue parsing after the error. Alternatively, if :ref:`automatic error recovery ` is in place, the parser will attempt recovery after the error hooks have executed.

.. versionadded:: 1.12 Per-field ``%error`` handler

Rather than defining a unit-wide ``%error`` hook, it is also possible to have an individual field catch just its own parsing errors. The easiest way to do this is to attach an ``%error`` attribute to an inline hook:

.. code-block:: spicy

    module My;

    type MyType = unit {
        magic: b"MAGIC" %error {
            # will run if magic cannot be parsed
            print "magic not found";
        }
    };

To get access to the error message as well, define the hook out of line like this:

.. code-block:: spicy

    module MyUnit;

    type MyType = unit {
        magic: b"MAGIC";

        on magic(msg: string) %error {
            print "Error when parsing magic: ", msg;
        }
    };

.. _error_recovery:

Error Recovery
==============

Real-world input does not always look like what parsers expect: endpoints may not conform to the protocol's specification, a parser's grammar might not fully cover all of the protocol, or some input may be missing due to packet loss or stepping into the middle of a conversation.
By default, if a Spicy parser encounters such situations, it will abort parsing altogether and issue an error message. Alternatively, however, Spicy allows grammar writers to specify heuristics to recover from errors. The main challenge here is finding a spot in the subsequent input where parsing can reliably resume.

Spicy employs a two-phase approach to such recovery: it first searches for a possible point in the input stream where it seems promising to attempt to resume parsing; and then it confirms that choice by trying to parse a few fields at that location according to the grammar to see if that's successful. We say that during the first part of this process, the Spicy parser is in *synchronization mode*; during the second, it is in *trial mode*.

.. rubric:: Phase 1: Synchronization

To identify locations where parsing can attempt to pick up again after an error, a grammar can add ``&synchronize`` attributes to selected unit fields, marking them as *synchronization points*. Whenever an error occurs during parsing, Spicy will determine the closest synchronization point in the grammar following the error's location, and then attempt to continue processing there by skipping ahead in the input data until it aligns with what that field is looking for.

A synchronization point may be any of the following:

- A field for which parsing begins with a constant literal (e.g., a specific sequence of bytes). To realign the input stream, the parser will search the input for the next occurrence of this literal, discarding any data in between. Example::

      type X = unit { ... };

      type Y = unit {
          a: b"begin-of-Y";
          b: bytes &size=10;
      };

      type Foo = unit {
          x: X;
          y: Y &synchronize;
      };

  If a parse error occurs during ``Foo::x``, Spicy will move ahead to ``Foo::y``, switch into synchronization mode, and start searching the input for the bytes ``begin-of-Y``. If found, it'll continue with parsing ``Foo::y`` at that location in trial mode (see below).

  .. 
note:: Behind the scenes, synchronization through literals uses the same machinery as :ref:`look-ahead parsing `, meaning that it works across sub-units, vector content, ``switch`` statements, etc. No matter how complex the field, as long as there are one or more literals that always *must* come first when parsing it, the field may be used as a synchronization point.

- A field with a type which specifies :ref:`%synchronize-at ` or :ref:`%synchronize-after `. The parser will search the input for the next occurrence of the given literal, discarding any data in between. If the search is successful, ``%synchronize-at`` will leave the input at the position of the search literal for later extraction, while ``%synchronize-after`` will discard the search literal. If either of these unit properties is specified, it will always overrule any other potential synchronization points in the unit. Example::

      type X = unit {
          ...
          : /END/;
      };

      type Y = unit {
          %synchronize-after = /END/;
          a: bytes &size=10;
      };

      type Foo = unit {
          x: X;
          y: Y &synchronize;
      };

- A field that's located inside the input stream at a fixed offset relative to the field triggering the error. The parser will then be able to skip ahead to that offset. Example::

      type X = unit { ... };
      type Y = unit { ... };

      type Foo = unit {
          x: X &size=512;
          y: Y &synchronize;
      };

  Here, when parsing ``Foo::x`` triggers an error, Spicy will know that it can continue with ``Foo::y`` at an offset 512 bytes past the start of ``Foo::x``.

  .. todo:: This synchronization strategy is not yet implemented.

- When :ref:`parsing a vector `, the inner elements may provide synchronization points as well. Example::

      type X = unit {
          a: b"begin-of-X";
          b: bytes &size=10;
      };

      type Foo = unit {
          xs: (X &synchronize)[];
      };

  If one element of the vector ``Foo::xs`` fails to parse, Spicy will attempt to find the beginning of the next ``X`` in the input stream and continue there.
  For this to work, the vector's elements must themselves represent valid synchronization points (e.g., start with a literal). If the vector is of fixed size, after successful synchronization it will contain the expected number of entries, but some of them may remain (fully or partially) uninitialized if they encountered errors.

.. rubric:: Phase 2: Trial parsing

Once input has been realigned with a synchronization point, parsing switches from synchronization mode into trial mode, in which the parser will attempt to confirm that it has indeed found a viable place to continue. It does so by proceeding to parse subsequent input from the synchronization point onwards, until one of the following occurs:

- A unit hook explicitly acknowledges that synchronization has been successful by executing Spicy's :ref:`statement_confirm` statement. Typically, a grammar will do so once it has been able to correctly parse a few fields following the synchronization point, that is, whatever it needs to be sufficiently certain that it's indeed seeing the expected structure.

- A unit hook explicitly declines the synchronization by executing Spicy's :ref:`statement_reject` statement. This will abandon the current synchronization attempt and switch back into the original synchronization mode to find another location to try.

- Parsing reaches the end of the grammar without either ``confirm`` or ``reject`` having been called. In this case, the parser will abort with a fatal parse error.

Note that during trial mode, any fields between the synchronization point and the eventual ``confirm``/``reject`` location will already be processed as usual, including the execution of any hooks except ``%error``. This may leave the unit in a partially initialized state if trial parsing eventually fails. Trial mode will also consume any input along the way, with any further synchronization attempts proceeding only on subsequent, not yet seen, data.

.. _error_recovery_hooks:

.. 
rubric:: Synchronization Hooks

For customization, Spicy provides a set of hooks executing at different points during the synchronization process:

``on %synced { ...}``
    Executes when a synchronization point has been found and parsing resumes there, just before the parser begins processing the corresponding field in trial mode.

``on %confirmed { ...}``
    Executes when trial mode ends successfully with :ref:`statement_confirm`.

``on %rejected { ...}``
    Executes when trial mode fails with :ref:`statement_reject`.

``on %sync_advance(offset: uint64)``
    Executes regularly (see below) while the parser is searching for a synchronization point. The ``offset`` parameter holds the current position inside the input stream. This hook can be used to check if the parser is skipping too much data for the analysis to remain useful. For example, a protocol analyzer could decide to bail out if the input stream consists mainly of gaps, as reported by :spicy:method:`self.stream().statistics() `.

    By default, the hook executes every 4KB of input data skipped while searching for a synchronization point. It may not necessarily trigger immediately at the 4KB mark, but soon after, when parsing gets a chance to check the input stream's position. You may change the trigger volume by defining a unit property ``%sync-advance-block-size = N``, where ``N`` is an alternative size value in bytes. As usual, this property can also be set at the module level to apply to all units.

.. rubric:: Example Synchronization Process

As an example, let's consider a grammar consisting of two sections where each section starts with a section header literal (``SEC_A`` and ``SEC_B`` here).

We want to allow for inputs which miss parts or all of the first section. For such inputs, we can still synchronize the input stream by looking for the start of the second section. (For simplicity, we just use a single unit, even though typically one would probably have separate units for the two sections.)

.. 
spicy-code:: parse-synchronized.spicy

    module Test;

    public type Example = unit {
        start_a: /SEC_A/;
        a: uint8;

        # If we fail to find e.g., 'SEC_A' in the input, try to synchronize on this literal.
        start_b: /SEC_B/ &synchronize;
        b: bytes &eod;

        # In this example confirm unconditionally.
        on %synced {
            print "Synced: %s" % self;
            confirm;
        }

        # Perform logging for %confirmed and %rejected.
        on %confirmed { print "Confirmed: %s" % self; }
        on %rejected { print "Rejected: %s" % self; }

        on %done { print "Done %s" % self; }
    };

Suppose this parser encounters the input ``\xFFSEC_Babc``, which is missing the ``SEC_A`` section marker:

- ``start_a`` missing,
- ``a=255``,
- ``start_b=SEC_B`` as expected, and
- ``b=abc``.

For such an input, parsing will encounter an initial error when it sees ``\xFF`` where ``SEC_A`` would have been expected.

1. Since ``start_b`` is marked as a synchronization point, the parser enters synchronization mode and jumps over the field ``a`` to ``start_b``, to now search for ``SEC_B``.

2. At this point the input still contains the unexpected ``\xFF`` and remains ``\xFFSEC_Babc``. While searching for ``SEC_B``, the parser skips over ``\xFF`` and then finds the expected token. The input is now ``SEC_Babc``.

3. The parser has successfully synchronized and enters trial mode. All ``%synced`` hooks are invoked.

4. The unit's ``%synced`` hook executes ``confirm`` and the parser leaves trial mode. All ``%confirmed`` hooks are invoked.

5. Regular parsing continues at ``start_b``. The input was ``SEC_Babc``, so ``start_b`` is set to ``SEC_B`` and ``b`` to ``abc``. Since parsing for ``start_a`` was unsuccessful and ``a`` was jumped over, both fields remain unset.

.. spicy-output:: parse-synchronized.spicy
    :exec: printf '\xFFSEC_Babc' | spicy-driver %INPUT
    :show-with: foo.spicy
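
The example above confirms unconditionally as soon as the parser has synchronized. As a minimal sketch of the more cautious pattern described for trial mode, the following variant delays ``confirm`` until a field after the synchronization point has parsed and looks plausible. The ``len`` field, the 64-byte limit, and the changed layout of the second section are hypothetical and serve only as illustration. The sketch also assumes that ``confirm`` has no effect when the parser is not in trial mode, so that the ``len`` hook is harmless for well-formed input.

.. code-block:: spicy

    module Test;

    public type Example = unit {
        start_a: /SEC_A/;
        a: uint8;

        start_b: /SEC_B/ &synchronize;

        # Hypothetical length prefix for the payload of the second section.
        len: uint8;
        b: bytes &size=self.len;

        on %synced { print "Synced, entering trial mode"; }

        # Confirm only once the field following the synchronization point
        # has parsed and its value looks plausible (the limit is arbitrary).
        on len {
            if ( self.len <= 64 )
                confirm;
        }

        on %confirmed { print "Confirmed: %s" % self; }
        on %done { print "Done %s" % self; }
    };

If ``len`` exceeds the limit, the hook neither confirms nor rejects. Following the rules for trial mode given earlier, parsing would then end with a fatal parse error once it reaches the end of the grammar.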