mbox

[RFC,0/1] A Distributed Software Event Device

Message ID 20180711212154.5807-1-mattias.ronnblom@ericsson.com (mailing list archive)
Headers

Message

Mattias Rönnblom July 11, 2018, 9:21 p.m. UTC
This is the Distributed Software (DSW) event device, which distributes
the task of scheduling events among all the eventdev ports and their
lcore threads.

DSW is primarily designed for atomic-only queues, but also supports
single-link and parallel queues.

(DSW would be more accurately described as 'parallel', but since that
term is used to describe an eventdev queue type, it's referred to as
'distributed', to avoid suggesting it's somehow biased toward parallel
queues.)

Event Scheduling
================

Internally, DSW hashes an eventdev flow id to a 15-bit "flow
hash". For each eventdev queue, there's a table mapping a flow hash to
an eventdev port. That port is considered the owner of the
flow. Owners are randomly picked at initialization time, among the
ports serving (i.e. are linked to) that queue.

The scheduling of an event to a port is done (by the sender port) at
time of the enqueue operation, and in most cases simply consists of
hashing the flow id and performing a lookup in the destination queue's
table. Each port has an MP/SC event ring to which the events are
enqueued. This means events go directly port-to-port, typically
meaning core-to-core.

Port Load Measurement
=====================

DSW includes a concept of port load. The event device keeps track of
transitions between "idle" and "busy" (or vice versa) on a per-port
basis, compares this to the wall time passed, and computes to what
extent the port was busy (for a certain interval). A port transitions
to "busy" on a non-zero dequeue, and again back to "idle" at the point
it performs a dequeue operation returning zero events.

Flow Migration
==============

Periodically, hidden to the API user and as a part of a normal
enqueue/dequeue operations, a port updates its load estimate, and in
case the load has reached a certain threshold, considers moving one of
its flow to a different, more lightly loaded, port. This process is
called migration.

Migration Strategy
------------------

The DSW migration strategy is to move a small, but yet active flow. To
quickly find which are the active flows (w/o resorting to scanning
through the tables and/or keeping per-event counters), each port
maintains a list of the last 128 events it has dequeued. If there are
lightly-loaded enough target ports, it will attempt to migrate one of
those flows, starting with the smallest. The size is estimated by the
number of events seen on that flow, in that small sample of events.

A good migration strategy, based on reasonably good estimates of port
and current flow event rates, is key for proper load balancing in a
DSW-style event device.

Migration Process
-----------------

If the prerequisites are met, and a migration target flow and port is
found, the owning (source) port will initiate the migration
process. For parallel queues it's a very straightforward operation -
simply a table update. For atomic queues, in order to maintain their
semantics, it's a fair bit more elaborate a procedure.

A high-level view the migration process is available[1] in the form a
sequence diagram.

Much simplified, it consist of the source port sending messages to all
ports configured, asking them to "pause" the to-be-migrated flow. Such
ports will flush their output buffers and provide a confirmation back
to the source port.

Each port holds a list of which flows are paused. Upon the enqueue of
an event belonging to a paused flow, it will be accepted into the
machinery, but kept in a paused-events buffer located on the sending
port.

After receiving confirmations from all ports, the source port will
make sure its application-level user has finished processing of all
events related to the migrating flow, update the relevant queue's
table, and forward all unprocessed events (in its input event ring) to
the new target port.

The source port will then send out a request to "unpause" the flow to
all ports. Upon receiving such a request, the port will flush any
buffered (paused) events related to the paused flow, and provide a
confirmation.

All the signaling are done on regular DPDK rings (separate from the
event-carrying rings), and are pulled as a part of normal
enqueue/dequeue operation.

The migrations can be made fairly rapidly (in the range of a couple
hundred us, or even faster), but the algorithm, load measurement and
migration interval parameters must be carefully chosen not to cause
the system to oscillate or otherwise misbehave.

The migration rate is primarily limited by eventdev enqueue/dequeue
function call rate, which in turn in the typical application is
limited by event burst sizes and event processing latency.

Migration API Implications
--------------------------

The migration process puts an addition requirement on the application
beyond the regular eventdev API, which is to not leave ports
'unattended'. Unattended here means a port on that neither enqueue nor
dequeue operations are performed within a reasonable time frame. What
is 'reasonable' depends on the migration latency requirements, which
in turns depends on the degree of variation in the workload. For
enqueue-only ports, which might well come into situations where no
events are enqueued for long durations of time, DSW includes an
less-than-elegant solution, allowing zero-sized enqueue operations,
which serve no other purpose that to drive the migration machinery.

Workload Suitability
====================

DSW operates under the assumption that an active flow will remain so
for a duration which is significantly longer than the migration
latency.

DSW should do well with a larger number of small flows, and also large
flows that increase their rates at a pace which is low-enough for the
migration process to move away smaller flows to make room on that
port. TCP slow-start kind of traffic, with end-to-end latencies on the
ms level, should be possible to handle, even though their expontential
nature - but all of this is speculation.

DSW won't be able to properly load balance workloads with few, highly
bursty, and high intensity flows.

Compared to the SW event device, DSW allows scaling to higher-core
count machines, with its substantially higher throughput and avoiding
a single bottleneck core, especially for long pipelines, or systems
with very short pipeline stages.

In addition, it also scales down to configurations with very few or
even just a single core, avoiding the issue with SW has with running
application work and event scheduling on the same core.

The downsides is that DSW doesn't have SW's near-immediate load
balancing flow-rerouting capability, but instead relies on flows
changing their inter-event time at a pace which isn't too high for the
migration process to handle.

Backpressure
============

Like any event device, DSW provides a backpressure mechanism to
prevent event producers flooding it.

DSW employs a credit system, with a central pool equal to the
configured max number of in-flight events. The ports are allowed to
take loans from this central pool, and may also return credits, so
that consumer-heavy ports don't end up draining the pool.

A port will, at the time of enqueue, make sure it has enough credits
(one per event) to allow the events into DSW. If not, the port will
attempt to retrieve more from the central pool. If this fails, the
enqueue operation fails. For efficiency reasons, at least 64 credits
are take from the pool (even if fewer might be needed).

A port will, at the time of dequeue, gain as many credits as the
number of events it dequeued. A port will not return credits until
they reach 128, and will always keep 64 credits.

All this in a similar although not identical manner to the SW event
device.

Output Buffering
================

Upon a successful enqueue operation, DSW will not immediately send the
events to their destination ports' event input rings. Events will
however - unless paused - be assigned a destination port and enqueued
on a buffer on the sending port. Such buffered events are considered
accepted into the event device, and is so handled from a migration and
in-flight credit system point of view.

Upon reaching a certain threshold, buffered events will be flushed,
and enqueued on the destination port's input ring.

The output buffers make the DSW ports use longer bursts against the
receiving port rings, much improving event ring efficiency.

To avoid having buffered events lingering too long (or even endlessly)
in these buffers, DSW has a schema where it only allows a certain
number of enqueue/dequeue operations ('ops') to be performed, before
the buffers are flushed.

A side effect of how 'ops' are counted is that in case a port goes
idle, it will likely perform many dequeue operations to pull new work,
and thus quickly up the 'ops' to a level it's output buffers are
flushed. That will cause lower ring efficiency, but this is of no
major concern since the worker is idling anyways.

This allows for single-event enqueue operations to be efficient,
although credit system and statistics update overhead will still make
them slower than burst enqueues.

Output Buffering API Implications
---------------------------------

The output buffering schema puts the same requirement on the
application as the migration process in that it disallows unattended ports.

In addition, DSW also implement a schema (maybe more accurately
described as a hack) where the application can force a buffer flush by
doing a zero-sized enqueue.

Alternative Approach
--------------------

An alternative to the DSW-internal output buffering is to have the
application to use burst enqueues, preferably with very large bursts
(the more cores, the larger bursts are preferred).

Depending on how the application is structured, this might well lead
to it having an internal buffer to which it does efficient,
single-event enqueue operations to, and then flushes it on a regular
basis.

However, since the DSW output buffering happens after the scheduling
is performed, the buffers can actually be flushed earlier than if
buffering happens in the application, if a large fraction of the
events are scheduled to a particular port (since the output buffer
limit is on a per-destination port basis).

In addition, since events in the output buffer are considered accepted
into DSW, migration can happen somewhat more quickly, since those
events can be flushed on migrations, as oppose to an
application-controlled buffer.

Statistics
==========

DSW supports the eventdev 'xstats' interface. It provides a large,
most port-based, set of counters, including things like port load,
number of migrations and migration latency, number of events dequeued
and enqueued, and on which queue, the average event processing latency
and a timestamp to allow the detection of unattended ports.

DSW xstats also allows reading the current global total and port
credits, making it possible to give a rough estimate of how many
events are in flight.

Performance Indications
=======================

The primary source of performance metrics comes from a test
application implementing a simulate pipeline. With zero work in each
pipe line stage, running on single socket x86_64 system, fourteen 2.4
GHz worker cores can sustain 300-400 million event/s. With a pipeline
with 1000 clock cycles of work per stage, the average event device
overhead is somewhere 50-150 clock cycles/event.

The benchmark is run when the system is fully loaded (i.e. there are
always events available on the pipeline ingress), and thus the event
device will benefit from batching effects, which are crucial for
performance. Also beneficial for DSW efficiency is the fact that the
"dummy" application work cycles has a very small memory working set,
leaving all the caches to DSW.

The simulated load has flows with a fixed event rate, causing very few
migrations - and more importantly - allows DSW to provide near-ideal
load balancing. So inefficienes due to imperfect load balancing is
also not accounted for.

The flow-to-port/thread/core affinity of DSW should provide for some
caching benefits for the application, for flow-related data
structures, compared to an event device where the flows move around
the ports in a more stochastic manner.

Short-term TODO
===============

o Figure out which DSW parameters needs to be runtime configurable.
o Consider adding support for event priority.
o Add relevant test cases to eventdev unit tests.
o Convert this massive cover letter into proper DPDK documentation.

[1] http://www.lysator.liu.se/~hofors/dsw/migration-sequence.svg

Mattias Rönnblom (1):
  eventdev: add distributed software (DSW) event device

 config/common_base                            |    5 +
 drivers/event/Makefile                        |    1 +
 drivers/event/dsw/Makefile                    |   28 +
 drivers/event/dsw/dsw_evdev.c                 |  361 +++++
 drivers/event/dsw/dsw_evdev.h                 |  296 ++++
 drivers/event/dsw/dsw_event.c                 | 1285 +++++++++++++++++
 drivers/event/dsw/dsw_sort.h                  |   47 +
 drivers/event/dsw/dsw_xstats.c                |  284 ++++
 .../event/dsw/rte_pmd_evdev_dsw_version.map   |    3 +
 mk/rte.app.mk                                 |    1 +
 10 files changed, 2311 insertions(+)
 create mode 100644 drivers/event/dsw/Makefile
 create mode 100644 drivers/event/dsw/dsw_evdev.c
 create mode 100644 drivers/event/dsw/dsw_evdev.h
 create mode 100644 drivers/event/dsw/dsw_event.c
 create mode 100644 drivers/event/dsw/dsw_sort.h
 create mode 100644 drivers/event/dsw/dsw_xstats.c
 create mode 100644 drivers/event/dsw/rte_pmd_evdev_dsw_version.map