588 lines
21 KiB
ReStructuredText
588 lines
21 KiB
ReStructuredText
.. _metrics:
|
|
|
|
Metrics
|
|
=======
|
|
|
|
Building and testing an application (or microservice) is merely the first step
|
|
in its lifetime cycle. Once you enter production and start deploying your
|
|
software, you constantly need to monitor it. Is it still running? How many
|
|
actors do we have? How much requests can our system handle? Where are potential
|
|
bottlenecks? Do we have resources to spare or do we need to allocate more? Are
|
|
we keeping our SLAs?
|
|
|
|
In order to answer such high-level questions, powerful tools like `Prometheus
|
|
<https://prometheus.io>`_ have emerged. However, such monitoring systems are
|
|
only as good as the data you feed it.
|
|
|
|
The metrics API in CAF enables you to instrument your code for generating
|
|
performance data. The API is vendor-neutral, but borrows many concepts as well
|
|
as terminology from Prometheus. Currently, CAF can only export metrics to
|
|
Prometheus. However, the API allows users to collect the metrics manually for
|
|
writing custom integrations.
|
|
|
|
.. note::
|
|
|
|
All classes for instrumenting code live in the namespace ``caf::telemetry``.
|
|
|
|
Metric Names and Labels
|
|
-----------------------
|
|
|
|
Each metric is uniquely identified by:
|
|
|
|
- A prefix. This acts as a namespace for grouping metrics together. All metrics
|
|
that CAF collects by itself use the prefix ``caf``.
|
|
- A name. This identifies the metric within the prefix. By convention, these
|
|
names are all-lowercase and hyphenated. For example, ``running-actors``.
|
|
- Any number of label dimensions. Labels are key-value pairs that divide a
|
|
metric into useful categories. For example, a metric that counts HTTP requests
|
|
could split into ``method=get``, ``method=put``, ``method=post``, etc.
|
|
Aggregating all metrics by ``method`` would then yield the total amount.
|
|
|
|
Metrics that share prefix, name and label names form a *metric family*. This is
|
|
also directly reflected in the API: the class ``metric_family`` bundles all
|
|
shared attributes and stores all instances as children.
|
|
|
|
A metric family without labels always contains exactly one child. Hence, CAF
|
|
calls this metric *singleton* in its API.
|
|
|
|
.. note::
|
|
|
|
CAF identifies metrics by prefix and name. Hence, families with the same
|
|
prefix and name but different label names are prohibited.
|
|
|
|
Metric Types
|
|
------------
|
|
|
|
CAF knows these types of metrics:
|
|
|
|
#. **Counters**. A counter represents a monotonically increasing value. For
|
|
example, the total number of messages received by all actors, the total
|
|
number of errors since starting the system, etc.
|
|
#. **Gauges**. A gauge represents a numerical value that can arbitrarily
|
|
increase or decrease. For example, the current number of messages in all
|
|
mailboxes, the number of running actors, etc.
|
|
#. **Histograms**. A histogram observes numerical values and counts them in
|
|
(configurable) buckets. For example, sampling the processing time of messages
|
|
``t`` with buckets for ``0ms ≤ t ≤ 1ms``, ``1ms < t ≤ 10ms``, ``10ms < t ≤
|
|
100ms``, and so on gives information on the usual response time and outliers.
|
|
Histograms internally consist of counters and provide a relatively
|
|
lightweight sampling mechanism. However, providing the right boundaries for
|
|
the buckets can require some experimentation or experience.
|
|
|
|
Further, CAF provides two implementations for each metric type: one using
|
|
``int64_t`` as internal representation and one using ``double``. Both
|
|
implementations use atomic operations, but the former is usually more efficient
|
|
on platforms such as x86. In user code, we recommend only using these type
|
|
definitions:
|
|
|
|
- ``dbl_counter`` for monotonically increasing floating point numbers
|
|
- ``int_counter`` for monotonically increasing 64-bit integers
|
|
- ``dbl_gauge`` for arbitrary floating point numbers
|
|
- ``int_gauge`` for arbitrary 64-bit integers
|
|
- ``dbl_histogram`` for sampling floating point numbers
|
|
- ``int_histogram`` for sampling 64-bit integers
|
|
|
|
The associated headers are:
|
|
|
|
- ``caf/telemetry/counter.hpp``
|
|
- ``caf/telemetry/gauge.hpp``
|
|
- ``caf/telemetry/histogram.hpp``
|
|
|
|
Counters
|
|
~~~~~~~~
|
|
|
|
Counters wrap an atomic count but only allows incrementing it. The class
|
|
provides the following member functions:
|
|
|
|
.. code-block:: C++
|
|
|
|
/// Increments the counter by 1.
|
|
void inc() noexcept;
|
|
|
|
/// Increments the counter by `amount`.
|
|
/// @pre `amount > 0`
|
|
void inc(value_type amount) noexcept;
|
|
|
|
/// Returns the current value of the counter.
|
|
value_type value() const noexcept;
|
|
|
|
/// Increments the counter by 1.
|
|
/// @note only available if value_type == int64_t
|
|
value_type operator++() noexcept;
|
|
|
|
Gauges
|
|
~~~~~~
|
|
|
|
Like counters, gauges also wrap an atomic count. However, gauges are less
|
|
permissive and allow decrementing as well.
|
|
|
|
.. code-block:: C++
|
|
|
|
/// Increments the gauge by 1.
|
|
void inc() noexcept;
|
|
|
|
/// Increments the gauge by `amount`.
|
|
void inc(value_type amount) noexcept;
|
|
|
|
/// Decrements the gauge by 1.
|
|
void dec() noexcept;
|
|
|
|
/// Decrements the gauge by `amount`.
|
|
void dec(value_type amount) noexcept;
|
|
|
|
/// Sets the gauge to `x`.
|
|
void value(value_type x) noexcept;
|
|
|
|
/// Increments the gauge by 1.
|
|
/// @returns The new value of the gauge.
|
|
/// @note only available if value_type == int64_t
|
|
value_type operator++() noexcept;
|
|
|
|
/// Decrements the gauge by 1.
|
|
/// @returns The new value of the gauge.
|
|
/// @note only available if value_type == int64_t
|
|
value_type operator--() noexcept;
|
|
|
|
/// Returns the current value of the gauge.
|
|
value_type value() const noexcept;
|
|
|
|
Histogram
|
|
~~~~~~~~~
|
|
|
|
Histograms consist of one counter per bucket as well as a gauge for the sum of
|
|
all observed values (values may be negative).
|
|
|
|
.. code-block:: C++
|
|
|
|
/// Increments the bucket where the observed value falls into and increments
|
|
/// the sum of all observed values.
|
|
void observe(value_type value);
|
|
|
|
/// Returns the sum of all observed values.
|
|
value_type sum() const noexcept;
|
|
|
|
Metric Units and Flags
|
|
----------------------
|
|
|
|
All metric types store numerical values, either as ``double`` or as ``int64_t``.
|
|
For giving this number additional semantics, CAF allows assigning *units* (of
|
|
measurement) to metrics. The default unit is ``1``, which denotes dimensionless
|
|
counts such as the number of messages in a mailbox.
|
|
|
|
The unit can be any string, but we recommend using only *base units* such as
|
|
``seconds`` or ``bytes`` to make processing of these metrics with monitoring
|
|
systems easier.
|
|
|
|
Each metric also carries one flag: ``is-sum``. Setting this to ``true`` (the
|
|
default is ``false``) indicates that this metric adds something up to a total
|
|
where only the total value is of interest. For example, the total number of HTTP
|
|
requests. CAF itself does not care about the flag, but it can give extra
|
|
information to collectors or exporters. For example, the Prometheus exporter
|
|
will add a ``_total`` suffix to the exported metric name.
|
|
|
|
The Metric Registry
|
|
-------------------
|
|
|
|
All metrics of an actor system are managed by a single registry to make sure
|
|
only one metric instance exists per prefix and name combination. Further, the
|
|
registry stores all metrics in a single place to allow *collectors* to iterate
|
|
over all metrics in a single place.
|
|
|
|
A minimal custom collector class requires providing ``operator()`` overloads as
|
|
shown below:
|
|
|
|
.. code-block:: C++
|
|
|
|
class my_collector {
|
|
public:
|
|
void operator()(const metric_family* family, const metric* instance,
|
|
const dbl_counter* impl);
|
|
|
|
void operator()(const metric_family* family, const metric* instance,
|
|
const int_counter* impl);
|
|
|
|
void operator()(const metric_family* family, const metric* instance,
|
|
const dbl_gauge* impl);
|
|
|
|
void operator()(const metric_family* family, const metric* instance,
|
|
const int_gauge* impl);
|
|
|
|
void operator()(const metric_family* family, const metric* instance,
|
|
const dbl_histogram* impl);
|
|
|
|
void operator()(const metric_family* family, const metric* instance,
|
|
const int_histogram* impl);
|
|
};
|
|
|
|
Applying the collector to the registry looks as follows (with ``sys`` being a
|
|
reference to an ``actor_system``):
|
|
|
|
.. code-block:: C++
|
|
|
|
my_collector f;
|
|
sys.metrics().collect(f);
|
|
|
|
The associated headers is ``caf/telemetry/metric_registry.hpp``.
|
|
|
|
|
|
Accessing Metrics
|
|
-----------------
|
|
|
|
Accessing a metric is a three-step process:
|
|
|
|
1. Get the ``metric_registry`` from the actor system.
|
|
2. Get the ``metric_family`` from the registry.
|
|
3. Call ``get_or_add`` on the family to get a pointer to the counter, gauge, or
|
|
histogram.
|
|
|
|
The pointer remains valid until the actor system gets destroyed. Hence, holding
|
|
on to the pointer in an actor is always safe.
|
|
|
|
The registry creates metrics lazily (to be more precise, it creates families
|
|
lazily that in turn create metric instances lazily). Since this requires
|
|
synchronization via mutexes, we recommend to only access the registry once per
|
|
metric and then store the pointer.
|
|
|
|
Accessing Counters and Gauges
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Counters and gauges are very similar in their API. Hence, all functions that
|
|
work on gauges only require replacing ``gauge`` with ``counter`` to work with
|
|
counters instead.
|
|
|
|
Gauges are owned (and created) by a gauge family object. We can either get the
|
|
family object explicitly by calling ``gauge_family``, or we can use one of the
|
|
two shortcut functions ``gauge_instance`` or ``gauge_singleton``. The C++
|
|
prototypes for the registry member functions look as follows:
|
|
|
|
.. code-block:: C++
|
|
|
|
template <class ValueType = int64_t>
|
|
auto* gauge_family(string_view prefix, string_view name,
|
|
span<const string_view> labels, string_view helptext,
|
|
string_view unit = "1", bool is_sum = false);
|
|
|
|
template <class ValueType = int64_t>
|
|
auto* gauge_instance(string_view prefix, string_view name,
|
|
span<const label_view> labels, string_view helptext,
|
|
string_view unit = "1", bool is_sum = false);
|
|
|
|
template <class ValueType = int64_t>
|
|
auto* gauge_singleton(string_view prefix, string_view name,
|
|
string_view helptext, string_view unit = "1",
|
|
bool is_sum = false);
|
|
|
|
.. note::
|
|
|
|
All functions that take a ``span`` also provide an overload that accepts a
|
|
``std::initializer_list`` instead to make working with constants easier.
|
|
|
|
The function ``gauge_family`` returns a type-specific metric family object,
|
|
while the other two functions return the gauge directly.
|
|
|
|
The family objects only have a single noteworthy member function,
|
|
``get_or_add``:
|
|
|
|
.. code-block:: C++
|
|
|
|
|
|
auto fptr = registry.counter_family("http", "requests", {"method"},
|
|
"Number of HTTP requests.", "seconds",
|
|
true);
|
|
auto count = fptr->get_or_add({{"method", "put"}});
|
|
|
|
If we only get a single counter from the family, we can use ``counter_instance``
|
|
instead:
|
|
|
|
.. code-block:: C++
|
|
|
|
auto count = registry.counter_instance("http", "requests",
|
|
{{"method", "put"}},
|
|
"Number of HTTP requests.",
|
|
"seconds", true);
|
|
|
|
Accessing Histograms
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The member functions for accessing histogram families and histograms follow the
|
|
same pattern as the member functions for counters and gauges.
|
|
|
|
.. code-block:: C++
|
|
|
|
template <class ValueType = int64_t>
|
|
auto* histogram_family(string_view prefix, string_view name,
|
|
span<const string_view> label_names,
|
|
span<const ValueType> default_upper_bounds,
|
|
string_view helptext, string_view unit = "1",
|
|
bool is_sum = false);
|
|
|
|
template <class ValueType = int64_t>
|
|
auto* histogram_instance(string_view prefix, string_view name,
|
|
span<const label_view> label_names,
|
|
span<const ValueType> default_upper_bounds,
|
|
string_view helptext, string_view unit = "1",
|
|
bool is_sum = false);
|
|
|
|
template <class ValueType = int64_t>
|
|
auto* histogram_singleton(string_view prefix, string_view name,
|
|
span<const ValueType> default_upper_bounds,
|
|
string_view helptext, string_view unit = "1",
|
|
bool is_sum = false);
|
|
|
|
Compared to the member functions for counters and guages, histograms require one
|
|
addition argument for the default bucket upper bounds.
|
|
|
|
.. warning::
|
|
|
|
The ``default_upper_bounds`` parameter **must** be sorted!
|
|
|
|
CAF automatically adds one additional bucket for observing all values between
|
|
the last upper bound and *infinity* (``double``) or *INT_MAX* (``int64_t``). For
|
|
example, passing ``[10, 100, 1000]`` as upper bounds creates four buckets in
|
|
total. The first bucket captues all values with ``x ≤ 10``. The second bucket
|
|
captues all values with ``10 < x ≤ 100``. The third bucket captures all values
|
|
with ``100 < x ≤ 1000``. Finally, the fourth bucket (added automatically)
|
|
captures all values with ``1000 < x ≤ INT_MAX``.
|
|
|
|
Configuration Parameters
|
|
------------------------
|
|
|
|
Histograms use the actor system configuration to enable users to override
|
|
hard-coded default bucket settings. On construction, the histogram family check
|
|
whether a key ``caf.metrics.${prefix}.${name}.buckets`` exists. Further, the
|
|
metric instance also checks on construction whether a more specific bucket
|
|
setting for one of its label dimensions exist.
|
|
|
|
For example, consider we add a histogram family with prefix ``http``, name
|
|
``request-duration``, and label dimension ``method`` to the registry. The family
|
|
first tries to read ``caf.metrics.http.request-duration.buckets`` from the
|
|
configuration and otherwise falls back to the hard-coded defaults. When creating
|
|
a histogram instance from the family with the label ``method=put``, the
|
|
construct first tries to read
|
|
``caf.metrics.http.request-duration.method=put.buckets`` from the configuration
|
|
and otherwise uses the default for the family.
|
|
|
|
In a configuration file, users may provide bucket settings like this:
|
|
|
|
.. code-block:: none
|
|
|
|
caf {
|
|
metrics {
|
|
http {
|
|
# measures the duration per HTTP request in seconds
|
|
request-duration {
|
|
buckets = [
|
|
0.001, # ≤ 1ms
|
|
0.01, # ≤ 10ms
|
|
0.05, # ≤ 50ms
|
|
0.1, # ≤ 100ms
|
|
0.25, # ≤ 250ms
|
|
0.5, # ≤ 500ms
|
|
0.75, # ≤ 750ms
|
|
]
|
|
# use different settings for get requests
|
|
"method=put" {
|
|
buckets = [
|
|
0.007, # ≤ 7ms
|
|
0.012, # ≤ 12ms
|
|
0.025, # ≤ 25ms
|
|
0.05, # ≤ 50ms
|
|
0.1, # ≤ 100ms
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
.. note::
|
|
|
|
Ambiguous settings for metrics with multiple label dimensions will result in
|
|
CAF picking the first match from an unspecified order. Hence, prefer using
|
|
only one label dimension for configuring buckets or otherwise make sure there
|
|
is always exactly one match for instance labels.
|
|
|
|
Performance Considerations
|
|
--------------------------
|
|
|
|
Instrumenting code should affect the performance as little as possible. Keep in
|
|
mind that each member function on the registry has to acquire a lock. Ideally,
|
|
applications call functions such as ``gauge_family`` *once* during setup and
|
|
then store the family pointer to create metric instances later.
|
|
|
|
Ideally, there is a single occurrence in the code for getting the family object
|
|
from the registry and a single occurrence in the code for getting the
|
|
gauge/counter/histogram object from the family (``get_or_add`` also has to
|
|
acquire a lock).
|
|
|
|
All operations on gauges, counters and histograms use atomic operations.
|
|
Depending on the type, CAF internally uses ``std::atomic<int64_t>`` or
|
|
``std::atomic<double>``. Adding a sample to a histogram requires two atomic
|
|
operations: one for the bucket and one for the sum.
|
|
|
|
Atomic operations are reasonably fast, but we still recommend to avoid them in
|
|
tight loops.
|
|
|
|
Builtin Metrics
|
|
---------------
|
|
|
|
CAF collects a set of builtin metrics in order to provide insights into the
|
|
actor system and its modules. Some are always collect while others require
|
|
configuration by the user.
|
|
|
|
Base Metrics
|
|
~~~~~~~~~~~~
|
|
|
|
The actor system collects this set of metrics always by default (note that all
|
|
``caf.middleman`` metrics only appear when loading the I/O module).
|
|
|
|
caf.system.running-actors
|
|
- Tracks the current number of running actors in the system.
|
|
- **Type**: ``int_gauge``
|
|
- **Label dimensions**: none.
|
|
|
|
caf.system.processed-messages
|
|
- Counts the total number of processed messages.
|
|
- **Type**: ``int_counter``
|
|
- **Label dimensions**: none.
|
|
|
|
caf.system.rejected-messages
|
|
- Counts the number of messages that where rejected because the target mailbox
|
|
was closed or did not exist.
|
|
- **Type**: ``int_counter``
|
|
- **Label dimensions**: none.
|
|
|
|
caf.middleman.inbound-messages-size
|
|
- Samples the size of inbound messages before deserializing them.
|
|
- **Type**: ``int_histogram``
|
|
- **Unit**: ``bytes``
|
|
- **Label dimensions**: none.
|
|
|
|
caf.middleman.outbound-messages-size
|
|
- Samples the size of outbound messages after serializing them.
|
|
- **Type**: ``int_histogram``
|
|
- **Unit**: ``bytes``
|
|
- **Label dimensions**: none.
|
|
|
|
caf.middleman.deserialization-time
|
|
- Samples how long the middleman needs to deserialize inbound messages.
|
|
- **Type**: ``dbl_histogram``
|
|
- **Unit**: ``seconds``
|
|
- **Label dimensions**: none.
|
|
|
|
caf.middleman.serialization-time
|
|
- Samples how long the middleman needs to serialize outbound messages.
|
|
- **Type**: ``dbl_histogram``
|
|
- **Unit**: ``seconds``
|
|
- **Label dimensions**: none.
|
|
|
|
Actor Metrics and Filters
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Unlike the base metrics, actor metrics are *off* by default. Applications can
|
|
spawn thousands of actors, with many only existing for a brief time. Hence,
|
|
blindly collecting data from all actors in the system can impact the performance
|
|
and also produce a lot of irrelevant noise.
|
|
|
|
To make sure CAF only collects actor metrics that are relevant to the user, the
|
|
actor system configuration provides two lists:
|
|
``caf.metrics-filters.actors.includes`` and
|
|
``caf.metrics-filters.actors.excludes``. CAF collects metrics for all actors
|
|
that have names that are selected by the ``includes`` list and are not selected
|
|
by the ``excludes`` list. Entries in the list can use glob-style syntax, in
|
|
particular ``*``-wildcards. For example:
|
|
|
|
.. code-block:: none
|
|
|
|
caf {
|
|
metrics-filters {
|
|
actors {
|
|
includes = [ "foo.*" ]
|
|
excludes = [ "foo.bar" ]
|
|
}
|
|
}
|
|
}
|
|
|
|
The configuration above would select all actors with names that start with
|
|
``foo.`` except for actors named ``foo.bar``.
|
|
|
|
.. note::
|
|
|
|
Names belong to actor *types*. CAF assigns default names such as
|
|
``user.scheduled-actor`` by default. To provide a custom name, either override
|
|
the member function ``const char* name() const`` when implementing class-based
|
|
actors or add a *static* member variable
|
|
``static inline const char* name = "..."`` to your state class when using
|
|
stateful actors.
|
|
|
|
CAF uses a hierarchical, hyphenated naming scheme with ``.`` as the separator
|
|
and all-lowercase name components. For example, ``caf.system.spawn-server``.
|
|
Users may follow this naming scheme for consistency, but CAF does not enforce
|
|
any structure on the names. However, we do recommend to avoid whitespaces and
|
|
special characters that the glob engine recognizes, such as ``*``, ``/``, etc.
|
|
|
|
For all actors that are selected by the user-defined filters, CAF collects this
|
|
set of metrics:
|
|
|
|
caf.actor.processing-time
|
|
- Samples how long the actor needs to process messages.
|
|
- **Type**: ``dbl_histogram``
|
|
- **Unit**: ``seconds``
|
|
- **Label dimensions**: name.
|
|
|
|
caf.actor.mailbox-time
|
|
- Samples how long messages wait in the mailbox before being processed.
|
|
- **Type**: ``dbl_histogram``
|
|
- **Unit**: ``seconds``
|
|
- **Label dimensions**: name.
|
|
|
|
caf.actor.mailbox-size
|
|
- Counts how many messages are currently waiting in the mailbox.
|
|
- **Type**: ``int_gauge``
|
|
- **Label dimensions**: name.
|
|
|
|
caf.actor.stream.processed-elements
|
|
- Counts the total number of processed stream elements from upstream.
|
|
- **Type**: ``int_counter``
|
|
- **Label dimensions**: name, type.
|
|
|
|
caf.actor.stream.input-buffer-size
|
|
- Tracks how many stream elements from upstream are currently buffered.
|
|
- **Type**: ``int_gauge``
|
|
- **Label dimensions**: name, type.
|
|
|
|
caf.stream.pushed-elements
|
|
- Counts the total number of elements that have been pushed downstream.
|
|
- **Type**: ``int_counter``
|
|
- **Label dimensions**: name, type.
|
|
|
|
caf.stream.output-buffer-size
|
|
- Tracks how many stream elements are currently waiting in the output buffer.
|
|
- **Type**: ``int_gauge``
|
|
- **Label dimensions**: name, type.
|
|
|
|
Exporting Metrics to Prometheus
|
|
-------------------------------
|
|
|
|
The network module in CAF comes with builtin support for exporting metrics to
|
|
Prometheus via HTTP. However, this feature is off by default since CAF generally
|
|
avoids opening ports without explicit user input.
|
|
|
|
During startup, the middleman enables the export of metrics when the
|
|
configuration provides a valid value (0 to 65536) for
|
|
``caf.middleman.prometheus-http.port`` as shown in the example config file
|
|
below.
|
|
|
|
.. code-block:: none
|
|
|
|
caf {
|
|
middleman {
|
|
prometheus-http {
|
|
# listen for incoming HTTP requests on port 8080 (required parameter)
|
|
port = 8080
|
|
# the bind address (optional parameter; default is 0.0.0.0)
|
|
address = "0.0.0.0"
|
|
}
|
|
}
|
|
}
|