.. _metrics:

Metrics
=======

Building and testing an application (or microservice) is merely the first step
in its life cycle. Once you enter production and start deploying your software,
you constantly need to monitor it. Is it still running? How many actors do we
have? How many requests can our system handle? Where are potential bottlenecks?
Do we have resources to spare or do we need to allocate more? Are we keeping our
SLAs?

In order to answer such high-level questions, powerful tools like
`Prometheus <https://prometheus.io/>`_ have emerged. However, such monitoring
systems are only as good as the data you feed them.

The metrics API in CAF enables you to instrument your code for generating
performance data. The API is vendor-neutral, but borrows many concepts as well
as terminology from Prometheus.

Currently, CAF can only export metrics to Prometheus. However, the API allows
users to collect the metrics manually for writing custom integrations.

.. note::

  All classes for instrumenting code live in the namespace ``caf::telemetry``.

Metric Names and Labels
-----------------------

Each metric is uniquely identified by:

- A prefix. This acts as a namespace for grouping metrics together. All metrics
  that CAF collects by itself use the prefix ``caf``.
- A name. This identifies the metric within the prefix. By convention, these
  names are all-lowercase and hyphenated. For example, ``running-actors``.
- Any number of label dimensions. Labels are key-value pairs that divide a
  metric into useful categories. For example, a metric that counts HTTP
  requests could split into ``method=get``, ``method=put``, ``method=post``,
  etc. Summing over all values of ``method`` then yields the total number of
  requests.

Metrics that share prefix, name, and label names form a *metric family*. This is
also directly reflected in the API: the class ``metric_family`` bundles all
shared attributes and stores all instances as children. A metric family without
labels always contains exactly one child.
Hence, CAF calls such a metric a *singleton* in its API.

.. note::

  CAF identifies metrics by prefix and name. Hence, families with the same
  prefix and name but different label names are prohibited.

Metric Types
------------

CAF knows these types of metrics:

#. **Counters**. A counter represents a monotonically increasing value. For
   example, the total number of messages received by all actors, the total
   number of errors since starting the system, etc.
#. **Gauges**. A gauge represents a numerical value that can arbitrarily
   increase or decrease. For example, the current number of messages in all
   mailboxes, the number of running actors, etc.
#. **Histograms**. A histogram observes numerical values and counts them in
   (configurable) buckets. For example, sampling the processing time of
   messages ``t`` with buckets for ``0ms ≤ t ≤ 1ms``, ``1ms < t ≤ 10ms``,
   ``10ms < t ≤ 100ms``, and so on gives information on the usual response time
   and outliers. Histograms internally consist of counters and provide a
   relatively lightweight sampling mechanism. However, providing the right
   boundaries for the buckets can require some experimentation or experience.

Further, CAF provides two implementations for each metric type: one using
``int64_t`` as internal representation and one using ``double``. Both
implementations use atomic operations, but the former is usually more efficient
on platforms such as x86.
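To make the bucket semantics of histograms more concrete, the following
self-contained sketch sorts observations into buckets with inclusive upper
bounds and keeps a running sum. It is only an illustration under simplifying
assumptions (single-threaded, no atomics) and not CAF's implementation. The
class ``toy_histogram`` and its members are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a histogram; NOT the CAF implementation.
class toy_histogram {
public:
  // `upper_bounds` must be sorted in ascending order.
  explicit toy_histogram(std::vector<int64_t> upper_bounds)
    : bounds_(std::move(upper_bounds)), counts_(bounds_.size() + 1, 0) {
    assert(std::is_sorted(bounds_.begin(), bounds_.end()));
  }

  // Counts `value` in the first bucket whose upper bound is >= `value` and
  // adds `value` to the sum. Values above the last bound go to the extra
  // bucket at the end.
  void observe(int64_t value) {
    auto pos = std::lower_bound(bounds_.begin(), bounds_.end(), value);
    ++counts_[static_cast<std::size_t>(pos - bounds_.begin())];
    sum_ += value;
  }

  // Returns how many observations fell into the bucket at `index`.
  int64_t count(std::size_t index) const {
    return counts_.at(index);
  }

  // Returns the sum of all observed values.
  int64_t sum() const {
    return sum_;
  }

private:
  std::vector<int64_t> bounds_; // inclusive upper bounds, ascending
  std::vector<int64_t> counts_; // one counter per bound plus one overflow bucket
  int64_t sum_ = 0;
};
```

With the upper bounds ``1``, ``10`` and ``100`` (milliseconds), observing the
values ``0``, ``1``, ``7``, ``100`` and ``250`` puts two samples into the first
bucket and one sample into each of the remaining three buckets.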
In user code, we recommend only using these type definitions:

- ``dbl_counter`` for monotonically increasing floating point numbers
- ``int_counter`` for monotonically increasing 64-bit integers
- ``dbl_gauge`` for arbitrary floating point numbers
- ``int_gauge`` for arbitrary 64-bit integers
- ``dbl_histogram`` for sampling floating point numbers
- ``int_histogram`` for sampling 64-bit integers

The associated headers are:

- ``caf/telemetry/counter.hpp``
- ``caf/telemetry/gauge.hpp``
- ``caf/telemetry/histogram.hpp``

Counters
~~~~~~~~

Counters wrap an atomic count but only allow incrementing it. The class
provides the following member functions:

.. code-block:: C++

  /// Increments the counter by 1.
  void inc() noexcept;

  /// Increments the counter by `amount`.
  /// @pre `amount > 0`
  void inc(value_type amount) noexcept;

  /// Returns the current value of the counter.
  value_type value() const noexcept;

  /// Increments the counter by 1.
  /// @note only available if value_type == int64_t
  value_type operator++() noexcept;

Gauges
~~~~~~

Like counters, gauges also wrap an atomic count. However, gauges are more
permissive and allow decrementing as well.

.. code-block:: C++

  /// Increments the gauge by 1.
  void inc() noexcept;

  /// Increments the gauge by `amount`.
  void inc(value_type amount) noexcept;

  /// Decrements the gauge by 1.
  void dec() noexcept;

  /// Decrements the gauge by `amount`.
  void dec(value_type amount) noexcept;

  /// Sets the gauge to `x`.
  void value(value_type x) noexcept;

  /// Increments the gauge by 1.
  /// @returns The new value of the gauge.
  /// @note only available if value_type == int64_t
  value_type operator++() noexcept;

  /// Decrements the gauge by 1.
  /// @returns The new value of the gauge.
  /// @note only available if value_type == int64_t
  value_type operator--() noexcept;

  /// Returns the current value of the gauge.
  value_type value() const noexcept;

Histogram
~~~~~~~~~

Histograms consist of one counter per bucket as well as a gauge for the sum of
all observed values (values may be negative).

.. code-block:: C++

  /// Increments the bucket that the observed value falls into and adds the
  /// value to the sum of all observed values.
  void observe(value_type value);

  /// Returns the sum of all observed values.
  value_type sum() const noexcept;

Metric Units and Flags
----------------------

All metric types store numerical values, either as ``double`` or as
``int64_t``. To give this number additional semantics, CAF allows assigning
*units* (of measurement) to metrics. The default unit is ``1``, which denotes
dimensionless counts such as the number of messages in a mailbox. The unit can
be any string, but we recommend using only *base units* such as ``seconds`` or
``bytes`` to make processing of these metrics with monitoring systems easier.

Each metric also carries one flag: ``is-sum``. Setting this to ``true`` (the
default is ``false``) indicates that this metric adds something up to a total
where only the total value is of interest. For example, the total number of
HTTP requests. CAF itself does not care about the flag, but it can give extra
information to collectors or exporters. For example, the Prometheus exporter
will add a ``_total`` suffix to the exported metric name.

The Metric Registry
-------------------

All metrics of an actor system are managed by a single registry to make sure
only one metric instance exists per prefix and name combination. Further,
keeping all metrics in one place allows *collectors* to iterate over them.

A minimal custom collector class requires providing ``operator()`` overloads as
shown below:

.. code-block:: C++

  class my_collector {
  public:
    void operator()(const metric_family* family, const metric* instance,
                    const dbl_counter* impl);

    void operator()(const metric_family* family, const metric* instance,
                    const int_counter* impl);

    void operator()(const metric_family* family, const metric* instance,
                    const dbl_gauge* impl);

    void operator()(const metric_family* family, const metric* instance,
                    const int_gauge* impl);

    void operator()(const metric_family* family, const metric* instance,
                    const dbl_histogram* impl);

    void operator()(const metric_family* family, const metric* instance,
                    const int_histogram* impl);
  };

Applying the collector to the registry looks as follows (with ``sys`` being a
reference to an ``actor_system``):

.. code-block:: C++

  my_collector f;
  sys.metrics().collect(f);

The associated header is ``caf/telemetry/metric_registry.hpp``.

Accessing Metrics
-----------------

Accessing a metric is a three-step process:

1. Get the ``metric_registry`` from the actor system.
2. Get the ``metric_family`` from the registry.
3. Call ``get_or_add`` on the family to get a pointer to the counter, gauge, or
   histogram.

The pointer remains valid until the actor system gets destroyed. Hence, holding
on to the pointer in an actor is always safe.

The registry creates metrics lazily (to be more precise, it creates families
lazily that in turn create metric instances lazily). Since this requires
synchronization via mutexes, we recommend accessing the registry only once per
metric and then storing the pointer.

Accessing Counters and Gauges
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Counters and gauges are very similar in their API. Hence, all functions that
work on gauges only require replacing ``gauge`` with ``counter`` to work with
counters instead.

Gauges are owned (and created) by a gauge family object. We can either get the
family object explicitly by calling ``gauge_family``, or we can use one of the
two shortcut functions ``gauge_instance`` or ``gauge_singleton``.
The C++ prototypes for the registry member functions look as follows:

.. code-block:: C++

  template <class ValueType = int64_t>
  auto* gauge_family(string_view prefix, string_view name,
                     span<const string_view> labels, string_view helptext,
                     string_view unit = "1", bool is_sum = false);

  template <class ValueType = int64_t>
  auto* gauge_instance(string_view prefix, string_view name,
                       span<const label_view> labels, string_view helptext,
                       string_view unit = "1", bool is_sum = false);

  template <class ValueType = int64_t>
  auto* gauge_singleton(string_view prefix, string_view name,
                        string_view helptext, string_view unit = "1",
                        bool is_sum = false);

.. note::

  All functions that take a ``span<const T>`` also provide an overload that
  accepts a ``std::initializer_list<T>`` instead to make working with constants
  easier.

The function ``gauge_family`` returns a type-specific metric family object,
while the other two functions return the gauge directly. The family objects
only have a single noteworthy member function, ``get_or_add``:

.. code-block:: C++

  // A request count is dimensionless, hence the unit "1".
  auto fptr = registry.counter_family("http", "requests", {"method"},
                                      "Number of HTTP requests.", "1", true);
  auto count = fptr->get_or_add({{"method", "put"}});

If we only get a single counter from the family, we can use
``counter_instance`` instead:

.. code-block:: C++

  auto count = registry.counter_instance("http", "requests",
                                         {{"method", "put"}},
                                         "Number of HTTP requests.", "1", true);

Accessing Histograms
~~~~~~~~~~~~~~~~~~~~

The member functions for accessing histogram families and histograms follow the
same pattern as the member functions for counters and gauges.

.. code-block:: C++

  template <class ValueType = int64_t>
  auto* histogram_family(string_view prefix, string_view name,
                         span<const string_view> label_names,
                         span<const ValueType> default_upper_bounds,
                         string_view helptext, string_view unit = "1",
                         bool is_sum = false);

  template <class ValueType = int64_t>
  auto* histogram_instance(string_view prefix, string_view name,
                           span<const label_view> label_names,
                           span<const ValueType> default_upper_bounds,
                           string_view helptext, string_view unit = "1",
                           bool is_sum = false);

  template <class ValueType = int64_t>
  auto* histogram_singleton(string_view prefix, string_view name,
                            span<const ValueType> default_upper_bounds,
                            string_view helptext, string_view unit = "1",
                            bool is_sum = false);

Compared to the member functions for counters and gauges, histograms require
one additional argument for the default bucket upper bounds.

.. warning::

  The ``default_upper_bounds`` parameter **must** be sorted!

CAF automatically adds one additional bucket for observing all values between
the last upper bound and *infinity* (``double``) or *INT_MAX* (``int64_t``).
For example, passing ``[10, 100, 1000]`` as upper bounds creates four buckets
in total. The first bucket captures all values with ``x ≤ 10``. The second
bucket captures all values with ``10 < x ≤ 100``. The third bucket captures all
values with ``100 < x ≤ 1000``. Finally, the fourth bucket (added
automatically) captures all values with ``1000 < x ≤ INT_MAX``.

Configuration Parameters
------------------------

Histograms use the actor system configuration to enable users to override
hard-coded default bucket settings. On construction, the histogram family
checks whether a key ``caf.metrics.${prefix}.${name}.buckets`` exists. Further,
the metric instance also checks on construction whether a more specific bucket
setting for one of its label dimensions exists.

For example, consider we add a histogram family with prefix ``http``, name
``request-duration``, and label dimension ``method`` to the registry. The
family first tries to read ``caf.metrics.http.request-duration.buckets`` from
the configuration and otherwise falls back to the hard-coded defaults.
When creating a histogram instance from the family with the label
``method=put``, the constructor first tries to read
``caf.metrics.http.request-duration.method=put.buckets`` from the configuration
and otherwise uses the default for the family.

In a configuration file, users may provide bucket settings like this:

.. code-block:: none

  caf {
    metrics {
      http {
        # measures the duration per HTTP request in seconds
        request-duration {
          buckets = [
            0.001, # ≤ 1ms
            0.01,  # ≤ 10ms
            0.05,  # ≤ 50ms
            0.1,   # ≤ 100ms
            0.25,  # ≤ 250ms
            0.5,   # ≤ 500ms
            0.75,  # ≤ 750ms
          ]
          # use different settings for put requests
          "method=put" {
            buckets = [
              0.007, # ≤ 7ms
              0.012, # ≤ 12ms
              0.025, # ≤ 25ms
              0.05,  # ≤ 50ms
              0.1,   # ≤ 100ms
            ]
          }
        }
      }
    }
  }

.. note::

  Ambiguous settings for metrics with multiple label dimensions will result in
  CAF picking the first match from an unspecified order. Hence, prefer using
  only one label dimension for configuring buckets or otherwise make sure there
  is always exactly one match for instance labels.

Performance Considerations
--------------------------

Instrumenting code should affect the performance as little as possible. Keep in
mind that each member function on the registry has to acquire a lock. Ideally,
applications call functions such as ``gauge_family`` *once* during setup and
then store the family pointer to create metric instances later. In the best
case, the code contains a single call site for getting the family object from
the registry and a single call site for getting the gauge/counter/histogram
object from the family (``get_or_add`` also has to acquire a lock).

All operations on gauges, counters, and histograms use atomic operations.
Depending on the type, CAF internally uses ``std::atomic<int64_t>`` or
``std::atomic<double>``. Adding a sample to a histogram requires two atomic
operations: one for the bucket and one for the sum. Atomic operations are
reasonably fast, but we still recommend avoiding them in tight loops.
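One way to keep atomic operations out of a tight loop is to accumulate into a
local variable and publish the result once. The sketch below illustrates the
idea with a plain ``std::atomic<int64_t>`` standing in for a counter. The names
``toy_counter``, ``count_bytes_naive`` and ``count_bytes_batched`` are
hypothetical and not part of CAF. With an actual ``int_counter``, the single
update after the loop would be a call to ``inc(amount)``, which requires
``amount > 0``.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an integer counter: a plain atomic, NOT a CAF type.
using toy_counter = std::atomic<int64_t>;

// Naive variant: performs one atomic read-modify-write per element.
inline void count_bytes_naive(toy_counter& total,
                              const std::vector<int64_t>& sizes) {
  for (auto n : sizes)
    total.fetch_add(n, std::memory_order_relaxed);
}

// Batched variant: sums up locally and then publishes the result with a
// single atomic operation after the loop.
inline void count_bytes_batched(toy_counter& total,
                                const std::vector<int64_t>& sizes) {
  int64_t local = 0;
  for (auto n : sizes)
    local += n;
  total.fetch_add(local, std::memory_order_relaxed);
}
```

Both variants arrive at the same total. The batched variant trades freshness
for speed: a collector that reads the counter while the loop is still running
does not see the partial sum.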
Builtin Metrics
---------------

CAF collects a set of builtin metrics in order to provide insights into the
actor system and its modules. Some are always collected while others require
configuration by the user.

Base Metrics
~~~~~~~~~~~~

The actor system always collects this set of metrics (note that all
``caf.middleman`` metrics only appear when loading the I/O module).

caf.system.running-actors
  - Tracks the current number of running actors in the system.
  - **Type**: ``int_gauge``
  - **Label dimensions**: none.

caf.system.processed-messages
  - Counts the total number of processed messages.
  - **Type**: ``int_counter``
  - **Label dimensions**: none.

caf.system.rejected-messages
  - Counts the number of messages that were rejected because the target mailbox
    was closed or did not exist.
  - **Type**: ``int_counter``
  - **Label dimensions**: none.

caf.middleman.inbound-messages-size
  - Samples the size of inbound messages before deserializing them.
  - **Type**: ``int_histogram``
  - **Unit**: ``bytes``
  - **Label dimensions**: none.

caf.middleman.outbound-messages-size
  - Samples the size of outbound messages after serializing them.
  - **Type**: ``int_histogram``
  - **Unit**: ``bytes``
  - **Label dimensions**: none.

caf.middleman.deserialization-time
  - Samples how long the middleman needs to deserialize inbound messages.
  - **Type**: ``dbl_histogram``
  - **Unit**: ``seconds``
  - **Label dimensions**: none.

caf.middleman.serialization-time
  - Samples how long the middleman needs to serialize outbound messages.
  - **Type**: ``dbl_histogram``
  - **Unit**: ``seconds``
  - **Label dimensions**: none.

Actor Metrics and Filters
~~~~~~~~~~~~~~~~~~~~~~~~~

Unlike the base metrics, actor metrics are *off* by default. Applications can
spawn thousands of actors, with many only existing for a brief time. Hence,
blindly collecting data from all actors in the system can impact the
performance and also produce a lot of irrelevant noise.
To make sure CAF only collects actor metrics that are relevant to the user, the
actor system configuration provides two lists:
``caf.metrics-filters.actors.includes`` and
``caf.metrics-filters.actors.excludes``. CAF collects metrics for all actors
whose names are selected by the ``includes`` list and are not selected by the
``excludes`` list. Entries in the lists can use glob-style syntax, in
particular ``*``-wildcards. For example:

.. code-block:: none

  caf {
    metrics-filters {
      actors {
        includes = [ "foo.*" ]
        excludes = [ "foo.bar" ]
      }
    }
  }

The configuration above would select all actors with names that start with
``foo.`` except for actors named ``foo.bar``.

.. note::

  Names belong to actor *types*. CAF assigns default names such as
  ``user.scheduled-actor``. To provide a custom name, either override the
  member function ``const char* name() const`` when implementing class-based
  actors or add a *static* member variable
  ``static inline const char* name = "..."`` to your state class when using
  stateful actors.

CAF uses a hierarchical, hyphenated naming scheme with ``.`` as the separator
and all-lowercase name components. For example, ``caf.system.spawn-server``.
Users may follow this naming scheme for consistency, but CAF does not enforce
any structure on the names. However, we do recommend avoiding whitespace and
special characters that the glob engine recognizes, such as ``*``, ``/``, etc.

For all actors that are selected by the user-defined filters, CAF collects this
set of metrics:

caf.actor.processing-time
  - Samples how long the actor needs to process messages.
  - **Type**: ``dbl_histogram``
  - **Unit**: ``seconds``
  - **Label dimensions**: name.

caf.actor.mailbox-time
  - Samples how long messages wait in the mailbox before being processed.
  - **Type**: ``dbl_histogram``
  - **Unit**: ``seconds``
  - **Label dimensions**: name.

caf.actor.mailbox-size
  - Counts how many messages are currently waiting in the mailbox.
  - **Type**: ``int_gauge``
  - **Label dimensions**: name.

caf.actor.stream.processed-elements
  - Counts the total number of processed stream elements from upstream.
  - **Type**: ``int_counter``
  - **Label dimensions**: name, type.

caf.actor.stream.input-buffer-size
  - Tracks how many stream elements from upstream are currently buffered.
  - **Type**: ``int_gauge``
  - **Label dimensions**: name, type.

caf.stream.pushed-elements
  - Counts the total number of elements that have been pushed downstream.
  - **Type**: ``int_counter``
  - **Label dimensions**: name, type.

caf.stream.output-buffer-size
  - Tracks how many stream elements are currently waiting in the output buffer.
  - **Type**: ``int_gauge``
  - **Label dimensions**: name, type.

Exporting Metrics to Prometheus
-------------------------------

The network module in CAF comes with builtin support for exporting metrics to
Prometheus via HTTP. However, this feature is off by default since CAF
generally avoids opening ports without explicit user input.

During startup, the middleman enables the export of metrics when the
configuration provides a valid value (0 to 65535) for
``caf.middleman.prometheus-http.port`` as shown in the example config file
below.

.. code-block:: none

  caf {
    middleman {
      prometheus-http {
        # listen for incoming HTTP requests on port 8080 (required parameter)
        port = 8080
        # the bind address (optional parameter; default is 0.0.0.0)
        address = "0.0.0.0"
      }
    }
  }
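Prometheus only accepts metric names made of letters, digits, underscores and
colons, so an exporter has to translate hyphenated CAF names before serving
them. The following sketch shows one plausible translation. It is an assumption
for illustration and not necessarily what the CAF exporter does: it replaces
every non-alphanumeric character with ``_``, appends the unit unless it is
``1``, and appends the ``_total`` suffix for metrics with the ``is-sum`` flag.
The function ``to_prometheus_name`` is a hypothetical helper.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper; an illustration only, NOT the CAF exporter's code.
inline std::string to_prometheus_name(std::string_view prefix,
                                      std::string_view name,
                                      std::string_view unit = "1",
                                      bool is_sum = false) {
  std::string result;
  // Copies `part` to the result, replacing everything except ASCII letters
  // and digits with an underscore.
  auto append = [&result](std::string_view part) {
    for (unsigned char c : part)
      result += std::isalnum(c) ? static_cast<char>(c) : '_';
  };
  append(prefix);
  result += '_';
  append(name);
  // Assumption: dimensionless metrics (unit "1") get no unit suffix.
  if (unit != "1") {
    result += '_';
    append(unit);
  }
  // The `_total` suffix marks metrics that carry the is-sum flag.
  if (is_sum)
    result += "_total";
  return result;
}
```

Under these assumptions, a histogram with prefix ``http``, name
``request-duration`` and unit ``seconds`` would appear as
``http_request_duration_seconds``, and a counter with prefix ``http``, name
``requests`` and the ``is-sum`` flag would appear as ``http_requests_total``.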