[05/27] event/dlb: add DLB documentation

Message ID 1593232671-5690-6-git-send-email-timothy.mcdaniel@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Jerin Jacob
Series: event/dlb Intel DLB PMD

Checks

Context               Check     Description
ci/checkpatch         warning   coding style issues
ci/Intel-compilation  fail      Compilation issues

Commit Message

Timothy McDaniel June 27, 2020, 4:37 a.m. UTC
  From: "McDaniel, Timothy" <timothy.mcdaniel@intel.com>

Signed-off-by: McDaniel, Timothy <timothy.mcdaniel@intel.com>
---
 doc/guides/eventdevs/dlb.rst |  497 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 497 insertions(+)
 create mode 100644 doc/guides/eventdevs/dlb.rst
  

Comments

Eads, Gage July 9, 2020, 3:29 a.m. UTC | #1
Hi Tim,

>  doc/guides/eventdevs/dlb.rst |  497 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 497 insertions(+)
>  create mode 100644 doc/guides/eventdevs/dlb.rst
> 
> diff --git a/doc/guides/eventdevs/dlb.rst b/doc/guides/eventdevs/dlb.rst
> new file mode 100644
> index 0000000..21e48fe
> --- /dev/null
> +++ b/doc/guides/eventdevs/dlb.rst
> @@ -0,0 +1,497 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2020 Intel Corporation.
> +
> +Driver for the Intel® Dynamic Load Balancer (DLB)
> +==================================================
> +
> +The DPDK dlb poll mode driver supports the Intel® Dynamic Load Balancer.
> +
> +.. note::
> +
> +    This PMD is disabled by default in the build configuration files, owing to
> +    an external dependency on the `Netlink Protocol Library Suite
> +    <http://www.infradead.org/~tgr/libnl/>`_ (libnl-3 and libnl-genl-3) which
> +    must be installed on the board.  Once the Netlink libraries are installed,
> +    the PMD can be enabled by setting CONFIG_RTE_LIBRTE_PMD_DLB_QM=y and
> +    recompiling the DPDK.
> +

This description appears to be out-of-date.

> +Prerequisites
> +-------------
> +
> +- Follow the DPDK :ref:`Getting Started Guide for Linux <linux_gsg>` to setup
> +  the basic DPDK environment.
> +
> +- Learn about the DLB device and its capabilities at `Intel Support
> +  <http://www.intel.com/support>`_. FIXME: Add real link when documentation
> +  becomes available.

Leftover FIXME

> +
> +- The DLB kernel module. If it is not included in the machine's OS
> +  distribution, download it from <FIXME: Add 01.org link when available> and
> +  follow the build instructions.
> +

Leftover FIXME

<snip>

> +The hybrid timeout data structures are currently located in
> +drivers/event/dlb/dlb_timeout.h:
> +
> +.. code-block:: c
> +
> +        struct rte_hybrid_timeout_ticks_64 {
> +                RTE_STD_C11
> +                union {
> +                        uint64_t val64;
> +                        struct {
> +                                uint64_t poll_ticks:62;
> +                                uint64_t umonitor_wait:1;
> +                                uint64_t interrupt_wait:1;
> +                        };
> +                };
> +        };
> +        struct rte_hybrid_timeout_ns_32 {
> +                RTE_STD_C11
> +                union {
> +                        uint32_t val32;
> +                        struct {
> +                                uint32_t poll_ns:30;
> +                                uint32_t umonitor_wait:1;
> +                                uint32_t interrupt_wait:1;
> +                        };
> +                };
> +        };

Is this description correct? dlb_timeout.h isn't introduced in this patchset.

> +
> +VAS Configuration
> +~~~~~~~~~~~~~~~~~
> +
> +A VAS is a scheduling domain, of which there are 32 in the DLB.
> +(Producer ports in one VAS cannot enqueue events to a different VAS,
> +except through the `Data Mover`_.) When a VAS is configured, it

I believe this cross-VAS comment is out-of-date.

> +allocates load-balanced and directed queues, ports, credits, and other
> +hardware resources. Some VAS resource allocations are user-controlled
> +-- the number of queues, for example
> +-- and others, like credit pools (one directed and one load-balanced
> +pool per VAS), are not.
> +
> +The dlb PMD creates a single VAS per DLB device. Supporting multiple
> +VASes per DLB device is a planned feature, where each VAS will be
> +represented as a separate event device.

Is this comment up-to-date? Patch 16 ("event/dlb: add infos_get and configure") indicates that multiple event devices are supported.

<snip>

> +Hardware Credits
> +~~~~~~~~~~~~~~~~
> +
> +DLB uses a hardware credit scheme to prevent software from overflowing
> +hardware event storage, with each unit of storage represented by a
> +credit. A port spends a credit to enqueue an event, and hardware
> +refills the ports with credits as the events are scheduled to ports.
> +Refills come from credit pools, and each port is a member of a
> +load-balanced credit pool and a directed credit pool. The load-balanced
> +credits are used to enqueue to load-balanced queues, and directed credits
> are used for directed queues.
> +
> +An dlb eventdev contains one load-balanced and one directed credit

"An dlb" -> "A dlb"

> +pool. These pools' sizes are controlled by the nb_events_limit field in
> +struct rte_event_dev_config. The load-balanced pool is sized to contain
> +nb_events_limit credits, and the directed pool is sized to contain
> +nb_events_limit/4 credits. The directed pool size can be overriden with
> +the num_dir_credits vdev argument, like so:
> +
> +    .. code-block:: console
> +
> +       --vdev=dlb1_event,num_dir_credits=<value>
> +
> +This can be used if the default allocation is too low or too high for
> +the specific application needs. The PMD also supports a vdev arg that
> +limits the max_num_events reported by rte_event_dev_info_get():
> +
> +    .. code-block:: console
> +
> +       --vdev=dlb1_event,max_num_events=<value>
> +
> +By default, max_num_events is reported as the total available
> +load-balanced credits. If multiple DLB-based applications are being
> +used, it may be desirable to control how many load-balanced credits
> +each application uses, particularly when application(s) are written to
> +configure nb_events_limit equal to the reported max_num_events.
> +
> +Each port is a member of both credit pools. A port's credit allocation
> +is defined by its low watermark, high watermark, and refill quanta.
> +These three parameters are calculated by the dlb PMD like so:
> +
> +- The load-balanced high watermark is set to the port's enqueue_depth.
> +  The directed high watermark is set to the minimum of the enqueue_depth and
> +  the directed pool size divided by the total number of ports.
> +- The refill quanta is set to half the high watermark.
> +- The low watermark is set to the minimum of 8 and the refill quanta.

From patch 19 ("event/dlb: add port_setup"), this should be 16 instead of 8:

+       cfg.ldb_credit_quantum = cfg.ldb_credit_high_watermark / 2;
+       cfg.ldb_credit_low_watermark = RTE_MIN(16, cfg.ldb_credit_quantum);
+
+       cfg.dir_credit_quantum = cfg.dir_credit_high_watermark / 2;
+       cfg.dir_credit_low_watermark = RTE_MIN(16, cfg.dir_credit_quantum);

> +
> +When the eventdev is started, each port is pre-allocated a high
> +watermark's worth of credits. For example, if an eventdev contains four
> +ports with enqueue depths of 32 and a load-balanced credit pool size of
> +4096, each port will start with 32 load-balanced credits, and there
> +will be 3968 credits available to replenish the ports. Thus, a single
> +port is not capable of enqueueing up to the nb_events_limit (without
> +any events being dequeued), since the other ports are retaining their
> +initial credit allocation; in short, all ports must enqueue in order to reach
> the limit.
> +
> +If a port attempts to enqueue and has no credits available, the enqueue
> +operation will fail and the application must retry the enqueue. Credits
> +are replenished asynchronously by the DLB hardware.
> +

<snip>

> +
> +Ordered Fragments
> +~~~~~~~~~~~~~~~~~
> +
> +The DLB has a fourth enqueue type: partial enqueue. When a thread is
> +processing an ordered event, it can perform up to 16 "partial"
> +enqueues, which allows a single received ordered event to result in multiple
> reordered events.
> +
> +For example, consider the case where three events (A, then B, then C)
> +are enqueued with ordered scheduling and are received by three different
> ports.
> +The ports that receive A and C forward events A' and C', while the port
> +that receives B generates three partial enqueues -- B1', B2', and B3'
> +-- followed by a release operation. The DLB will reorder the events in the
> following order:
> +
> +A', B1', B2', B3', C'
> +
> +This functionality is not available explicitly through the eventdev
> +API, but the dlb PMD provides it through an additional (DLB-specific)
> +event operation, RTE_EVENT_DLB_OP_FRAG.

I don't believe this OP type appears in this patchset.

> +
> +Deferred Scheduling
> +~~~~~~~~~~~~~~~~~~~
> +
> +The DLB PMD's default behavior for managing a CQ is to "pop" the CQ
> +once per dequeued event before returning from
> +rte_event_dequeue_burst(). This frees the corresponding entries in the
> +CQ, which enables the DLB to schedule more events to it.
> +
> +To support applications seeking finer-grained scheduling control -- for
> +example deferring scheduling to get the best possible priority
> +scheduling and load-balancing -- the PMD supports a deferred scheduling
> +mode. In this mode, the CQ entry is not popped until the *subsequent*
> +rte_event_dequeue_burst() call. This mode only applies to load-balanced
> +event ports with dequeue depth of 1.
> +
> +To enable deferred scheduling, use the defer_sched vdev argument like so:
> +
> +    .. code-block:: console
> +
> +       --vdev=dlb1_event,defer_sched=on
> +
> +Atomic Inflights Allocation
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In the last stage prior to scheduling an atomic event to a CQ, DLB
> +holds the inflight event in a temporary buffer that is divided among
> +load-balanced queues. If a queue's atomic buffer storage fills up, this
> +can result in head-of-line-blocking. For example:
> +- An LDB queue allocated N atomic buffer entries
> +- All N entries are filled with events from flow X, which is pinned to CQ 0.
> +
> +Until CQ 0 releases 1+ events, no other atomic flows for that LDB queue
> +can be scheduled. The likelihood of this case depends on the eventdev
> +configuration, traffic behavior, event processing latency, potential
> +for a worker to be interrupted or otherwise delayed, etc.
> +
> +By default, the PMD allocates 16 buffer entries for each load-balanced
> +queue, which provides an even division across all 128 queues but
> +potentially wastes buffer space (e.g. if not all queues are used, or
> +aren't used for atomic scheduling).
> +
> +The PMD provides a dev arg to override the default per-queue
> +allocation. To increase a vdev's per-queue atomic-inflight allocation to (for
> example) 64:
> +
> +    .. code-block:: console
> +
> +       --vdev=dlb1_event,atm_inflights=64
> +

This section is duplicated below.

> +Atomic Inflights Allocation
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In the last stage prior to scheduling an atomic event to a CQ, DLB
> +holds the inflight event in a temporary buffer that is divided among
> +load-balanced queues. If a queue's atomic buffer storage fills up, this
> +can result in head-of-line-blocking. For example:
> +- An LDB queue allocated N atomic buffer entries
> +- All N entries are filled with events from flow X, which is pinned to CQ 0.
> +
> +Until CQ 0 releases 1+ events, no other atomic flows for that LDB queue
> +can be scheduled. The likelihood of this case depends on the eventdev
> +configuration, traffic behavior, event processing latency, potential
> +for a worker to be interrupted or otherwise delayed, etc.
> +
> +By default, the PMD allocates 16 buffer entries for each load-balanced
> +queue, which provides an even division across all 128 queues but
> +potentially wastes buffer space (e.g. if not all queues are used, or
> +aren't used for atomic scheduling).
> +
> +The PMD provides a dev arg to override the default per-queue
> +allocation. To increase a vdev's per-queue atomic-inflight allocation to (for
> example) 64:
> +
> +    .. code-block:: console
> +
> +       --vdev=dlb1_event,atm_inflights=64
> --
> 1.7.10

Thanks,
Gage
  

Patch

diff --git a/doc/guides/eventdevs/dlb.rst b/doc/guides/eventdevs/dlb.rst
new file mode 100644
index 0000000..21e48fe
--- /dev/null
+++ b/doc/guides/eventdevs/dlb.rst
@@ -0,0 +1,497 @@ 
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Intel Corporation.
+
+Driver for the Intel® Dynamic Load Balancer (DLB)
+==================================================
+
+The DPDK dlb poll mode driver supports the Intel® Dynamic Load Balancer.
+
+.. note::
+
+    This PMD is disabled by default in the build configuration files, owing to
+    an external dependency on the `Netlink Protocol Library Suite
+    <http://www.infradead.org/~tgr/libnl/>`_ (libnl-3 and libnl-genl-3) which
+    must be installed on the board.  Once the Netlink libraries are installed,
+    the PMD can be enabled by setting CONFIG_RTE_LIBRTE_PMD_DLB_QM=y and
+    recompiling the DPDK.
+
+Prerequisites
+-------------
+
+- Follow the DPDK :ref:`Getting Started Guide for Linux <linux_gsg>` to setup
+  the basic DPDK environment.
+
+- Learn about the DLB device and its capabilities at `Intel Support
+  <http://www.intel.com/support>`_. FIXME: Add real link when documentation
+  becomes available.
+
+- The DLB kernel module. If it is not included in the machine's OS
+  distribution, download it from <FIXME: Add 01.org link when available> and
+  follow the build instructions.
+
+Configuration
+-------------
+
+The DLB eventdev supports two modes of operation:
+
+* Bifurcated mode: the PMD is created as a vdev device and depends on the Linux
+  DLB kernel driver for device access. The bifurcated PMD's configuration
+  accesses are performed through the kernel driver, and (performance-critical)
+  datapath functions execute entirely in user-space.
+
+  This mode supports both PF and VF devices, but is supported on Linux only.
+
+* PF PMD mode: the PF PMD is a user-space PMD that uses VFIO to gain direct
+  device access. To use this operation mode, the PCIe PF device must be bound
+  to a DPDK-compatible VFIO driver, such as vfio-pci. The PF PMD does not work
+  with PCIe VFs, but is portable to all environments (Linux, FreeBSD, etc.)
+  that DPDK supports. (Note: PF PMD testing has been limited to Linux at this
+  time.)
+
+The vdev device can be created from the application code or from the EAL
+command line like so:
+
+* Call ``rte_vdev_init("dlb1_event")`` from the application.
+
+* Use ``--vdev="dlb1_event"`` in the EAL options, which will call
+  rte_vdev_init() internally.
+
+Example:
+
+.. code-block:: console
+
+    ./your_eventdev_application --vdev="dlb1_event"
+
+Note: The dlb vdev can be instantiated with the name "event_dlb" as well.
+
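+As an illustration, the vdev can also be created and located from application
+code roughly as follows (the device name and error handling are examples
+only):
+
+.. code-block:: c
+
+        int dev_id;
+
+        /* Create the dlb vdev, then resolve its event device ID. */
+        if (rte_vdev_init("dlb1_event", NULL) < 0)
+                return -1; /* vdev creation failed */
+
+        dev_id = rte_event_dev_get_dev_id("dlb1_event");
+        if (dev_id < 0)
+                return -1; /* eventdev not found */
+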
+Eventdev API Notes
+------------------
+
+The DLB provides the functions of a DPDK event device; specifically, it
+supports atomic, ordered, and parallel scheduling events from queues to ports.
+However, the DLB hardware is not a perfect match to the eventdev API. Some DLB
+features are abstracted by the PMD (e.g. directed ports), some are only
+accessible as vdev command-line parameters, and certain eventdev features are
+not supported (e.g. the event flow ID is not maintained during scheduling).
+
+In general the dlb PMD is designed for ease-of-use and doesn't require a
+detailed understanding of the hardware, but these details are important when
+writing high-performance code. This section describes the places where the
+eventdev API and DLB misalign.
+
+Wait (timeout_ticks) Parameter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The eventdev API rte_event_dequeue_burst(..) can wait for an event to
+arrive. Three different forms of waiting are supported by the dlb PMD:
+polling, blocking on a hardware interrupt, and waiting using umonitor/umwait.
+The form of wait to use is specified with the hybrid timeout data structures
+below.
+
+If RTE_EVENT_DEV_CFG_PER_DEQUEUE_TIMEOUT is set, then the timeout_ticks
+parameter supplied to rte_event_dequeue_burst(..) is used to control if and how
+to wait, and dequeue_timeout_ns is ignored.
+
+If RTE_EVENT_DEV_CFG_PER_DEQUEUE_TIMEOUT is not set, then dequeue_timeout_ns
+supplied to the rte_event_dev_configure API is used to control if and how to
+wait, and the timeout_ticks value is ignored.
+
+The application should use the appropriate hybrid timeout struct below and
+cast it to uint32_t (for rte_event_dev_configure) or uint64_t (for
+rte_event_dequeue_burst), as appropriate.
+
+Hybrid timeout_ticks
+^^^^^^^^^^^^^^^^^^^^
+
+#. If poll_ticks is not 0 and neither interrupt_wait nor umonitor_wait is set,
+   the PMD busy polls for up to poll_ticks time.
+#. If the interrupt_wait bit is set and the CQ is empty, the PMD enters the
+   kernel to wait for an interrupt after busy polling for poll_ticks time.
+   There is no guarantee how much time is spent in the API when using
+   interrupt_wait.
+#. If umonitor_wait is set, the PMD repeatedly issues a umwait instruction
+   until the requested number of events have been dequeued, or until
+   poll_ticks has expired.
+
+Note: It is invalid to set both interrupt_wait and umonitor_wait.
+
+The hybrid timeout data structures are currently located in
+drivers/event/dlb/dlb_timeout.h:
+
+.. code-block:: c
+
+        struct rte_hybrid_timeout_ticks_64 {
+                RTE_STD_C11
+                union {
+                        uint64_t val64;
+                        struct {
+                                uint64_t poll_ticks:62;
+                                uint64_t umonitor_wait:1;
+                                uint64_t interrupt_wait:1;
+                        };
+                };
+        };
+        struct rte_hybrid_timeout_ns_32 {
+                RTE_STD_C11
+                union {
+                        uint32_t val32;
+                        struct {
+                                uint32_t poll_ns:30;
+                                uint32_t umonitor_wait:1;
+                                uint32_t interrupt_wait:1;
+                        };
+                };
+        };
+
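+Assuming the structures above are available to the application, a dequeue that
+busy polls and then blocks on a hardware interrupt could be requested roughly
+as follows (``dev_id`` and ``port_id`` refer to an already configured device
+and port, and the tick count is illustrative):
+
+.. code-block:: c
+
+        struct rte_event ev[32];
+        struct rte_hybrid_timeout_ticks_64 timeout = { 0 };
+        uint16_t nb;
+
+        /* Busy poll for up to 1000 ticks, then wait on an interrupt. */
+        timeout.poll_ticks = 1000;
+        timeout.interrupt_wait = 1;
+
+        nb = rte_event_dequeue_burst(dev_id, port_id, ev, RTE_DIM(ev),
+                                     timeout.val64);
+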
+VAS Configuration
+~~~~~~~~~~~~~~~~~
+
+A VAS is a scheduling domain, of which there are 32 in the DLB. (Producer
+ports in one VAS cannot enqueue events to a different VAS, except through the
+`Data Mover`_.) When a VAS is configured, it allocates load-balanced and
+directed queues, ports, credits, and other hardware resources. Some VAS
+resource allocations are user-controlled -- the number of queues, for example
+-- and others, like credit pools (one directed and one load-balanced pool per
+VAS), are not.
+
+The dlb PMD creates a single VAS per DLB device. Supporting multiple VASes
+per DLB device is a planned feature, where each VAS will be represented as a
+separate event device.
+
+The DLB is a closed system eventdev, and as such the ``nb_events_limit`` device
+setup argument and the per-port ``new_event_threshold`` argument apply as
+defined in the eventdev header file. The limit is applied to all enqueues,
+regardless of whether it will consume a directed or load-balanced credit.
+
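+For example, the device-level limit and a port's ``new_event_threshold`` could
+be set along these lines (all sizes are illustrative):
+
+.. code-block:: c
+
+        struct rte_event_dev_config config = { 0 };
+        struct rte_event_port_conf port_conf;
+
+        /* nb_events_limit sizes both credit pools for this eventdev. */
+        config.nb_events_limit = 4096;
+        /* ... fill in the remaining device configuration fields ... */
+        rte_event_dev_configure(dev_id, &config);
+
+        /* new_event_threshold caps OP_NEW injection per port. */
+        rte_event_port_default_conf_get(dev_id, 0, &port_conf);
+        port_conf.new_event_threshold = 1024;
+        rte_event_port_setup(dev_id, 0, &port_conf);
+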
+Load-balanced and Directed Ports
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DLB ports come in two flavors: load-balanced and directed. The eventdev API
+does not have the same concept, but it has a similar one: ports and queues that
+are singly-linked (i.e. linked to a single queue or port, respectively).
+
+The ``rte_event_dev_info_get()`` function reports the number of available
+event ports and queues (among other things). For the DLB PMD, max_event_ports
+and max_event_queues report the number of available load-balanced ports and
+queues, and max_single_link_event_port_queue_pairs reports the number of
+available directed ports and queues.
+
+When a VAS is created in ``rte_event_dev_configure()``, the user specifies
+``nb_event_ports`` and ``nb_single_link_event_port_queues``, which control the
+total number of ports (load-balanced and directed) and the number of directed
+ports. Hence, the number of requested load-balanced ports is ``nb_event_ports
+- nb_single_link_event_port_queues``. The ``nb_event_queues`` field specifies the
+total number of queues (load-balanced and directed). The number of directed
+queues comes from ``nb_single_link_event_port_queues``, since directed ports
+and queues come in pairs.
+
+When a port is setup, the ``RTE_EVENT_PORT_CFG_SINGLE_LINK`` flag determines
+whether it should be configured as a directed (the flag is set) or a
+load-balanced (the flag is unset) port. Similarly, the
+``RTE_EVENT_QUEUE_CFG_SINGLE_LINK`` queue configuration flag controls
+whether it is a directed or load-balanced queue.
+
+Load-balanced ports can only be linked to load-balanced queues, and directed
+ports can only be linked to directed queues. Furthermore, directed ports can
+only be linked to a single directed queue (and vice versa), and that link
+cannot change after the eventdev is started.
+
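+As a sketch of the above (the queue/port IDs are illustrative, and the
+``RTE_EVENT_PORT_CFG_SINGLE_LINK`` flag is assumed to be carried in the port
+configuration's ``event_port_cfg`` field), a directed queue/port pair could be
+configured and linked like so:
+
+.. code-block:: c
+
+        struct rte_event_queue_conf queue_conf;
+        struct rte_event_port_conf port_conf;
+        uint8_t dir_queue_id = 3, dir_port_id = 3;
+
+        /* Directed (singly-linked) queue. */
+        rte_event_queue_default_conf_get(dev_id, dir_queue_id, &queue_conf);
+        queue_conf.event_queue_cfg |= RTE_EVENT_QUEUE_CFG_SINGLE_LINK;
+        rte_event_queue_setup(dev_id, dir_queue_id, &queue_conf);
+
+        /* Directed (singly-linked) port. */
+        rte_event_port_default_conf_get(dev_id, dir_port_id, &port_conf);
+        port_conf.event_port_cfg |= RTE_EVENT_PORT_CFG_SINGLE_LINK;
+        rte_event_port_setup(dev_id, dir_port_id, &port_conf);
+
+        /* A directed port links to exactly one directed queue. */
+        rte_event_port_link(dev_id, dir_port_id, &dir_queue_id, NULL, 1);
+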
+The eventdev API doesn't have a directed scheduling type. To support directed
+traffic, the dlb PMD detects when an event is being sent to a directed queue
+and overrides its scheduling type. Note that the originally selected scheduling
+type (atomic, ordered, or parallel) is not preserved, and an event's sched_type
+will be set to ``RTE_SCHED_TYPE_ATOMIC`` when it is dequeued from a directed
+port.
+
+Flow ID
+~~~~~~~
+
+The flow ID field is not preserved in the event when it is scheduled in the
+DLB, because the DLB hardware control word format does not have sufficient
+space to preserve every event field. As a result, the flow ID specified with
+the enqueued event will not be in the dequeued event. If this field is
+required, the application should pass it through an out-of-band path (for
+example in the mbuf's udata64 field, if the event points to an mbuf) or
+reconstruct the flow ID after receiving the event.
+
+Also, the DLB hardware control word supports a 16-bit flow ID. Since struct
+rte_event's flow_id field is 20 bits, the DLB PMD drops the most significant
+four bits from the event's flow ID.
+
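+For instance, if the event payload is an mbuf, the flow ID could be carried
+out-of-band roughly as follows (using udata64 is just one option):
+
+.. code-block:: c
+
+        /* Producer: stash the flow ID in the mbuf before enqueueing. */
+        ev.mbuf->udata64 = ev.flow_id;
+        rte_event_enqueue_burst(dev_id, port_id, &ev, 1);
+
+        /* Consumer: the dequeued event's flow_id is not meaningful, so
+         * recover it from the mbuf (flow IDs are 20 bits wide).
+         */
+        rte_event_dequeue_burst(dev_id, port_id, &ev, 1, 0);
+        ev.flow_id = ev.mbuf->udata64 & 0xFFFFF;
+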
+Hardware Credits
+~~~~~~~~~~~~~~~~
+
+DLB uses a hardware credit scheme to prevent software from overflowing hardware
+event storage, with each unit of storage represented by a credit. A port spends
+a credit to enqueue an event, and hardware refills the ports with credits as the
+events are scheduled to ports. Refills come from credit pools, and each port is
+a member of a load-balanced credit pool and a directed credit pool. The
+load-balanced credits are used to enqueue to load-balanced queues, and directed
+credits are used for directed queues.
+
+An dlb eventdev contains one load-balanced and one directed credit pool. These
+pools' sizes are controlled by the nb_events_limit field in struct
+rte_event_dev_config. The load-balanced pool is sized to contain
+nb_events_limit credits, and the directed pool is sized to contain
+nb_events_limit/4 credits. The directed pool size can be overridden with the
+num_dir_credits vdev argument, like so:
+
+    .. code-block:: console
+
+       --vdev=dlb1_event,num_dir_credits=<value>
+
+This can be used if the default allocation is too low or too high for the
+specific application needs. The PMD also supports a vdev arg that limits the
+max_num_events reported by rte_event_dev_info_get():
+
+    .. code-block:: console
+
+       --vdev=dlb1_event,max_num_events=<value>
+
+By default, max_num_events is reported as the total available load-balanced
+credits. If multiple DLB-based applications are being used, it may be desirable
+to control how many load-balanced credits each application uses, particularly
+when application(s) are written to configure nb_events_limit equal to the
+reported max_num_events.
+
+Each port is a member of both credit pools. A port's credit allocation is
+defined by its low watermark, high watermark, and refill quanta. These three
+parameters are calculated by the dlb PMD like so:
+
+- The load-balanced high watermark is set to the port's enqueue_depth.
+  The directed high watermark is set to the minimum of the enqueue_depth and
+  the directed pool size divided by the total number of ports.
+- The refill quanta is set to half the high watermark.
+- The low watermark is set to the minimum of 8 and the refill quanta.
+
+When the eventdev is started, each port is pre-allocated a high watermark's
+worth of credits. For example, if an eventdev contains four ports with enqueue
+depths of 32 and a load-balanced credit pool size of 4096, each port will start
+with 32 load-balanced credits, and there will be 3968 credits available to
+replenish the ports. Thus, a single port is not capable of enqueueing up to the
+nb_events_limit (without any events being dequeued), since the other ports are
+retaining their initial credit allocation; in short, all ports must enqueue in
+order to reach the limit.
+
+If a port attempts to enqueue and has no credits available, the enqueue
+operation will fail and the application must retry the enqueue. Credits are
+replenished asynchronously by the DLB hardware.
+
+Software Credits
+~~~~~~~~~~~~~~~~
+
+The DLB is a "closed system" event dev, and the DLB PMD layers a software
+credit scheme on top of the hardware credit scheme in order to comply with
+the per-port backpressure described in the eventdev API.
+
+The DLB's hardware scheme is local to a queue/pipeline stage: a port spends a
+credit when it enqueues to a queue, and credits are later replenished after the
+events are dequeued and released.
+
+In the software credit scheme, a credit is consumed when a new (.op =
+RTE_EVENT_OP_NEW) event is injected into the system, and the credit is
+replenished when the event is released from the system (either explicitly with
+RTE_EVENT_OP_RELEASE or implicitly in dequeue_burst()).
+
+In this model, an event is "in the system" from its first enqueue into eventdev
+until it is last dequeued. If the event goes through multiple event queues, it
+is still considered "in the system" while a worker thread is processing it.
+
+A port will fail to enqueue if the number of events in the system exceeds its
+``new_event_threshold`` (specified at port setup time). A port will also fail
+to enqueue if it lacks enough hardware credits to enqueue; load-balanced
+credits are used to enqueue to a load-balanced queue, and directed credits are
+used to enqueue to a directed queue.
+
+The out-of-credit situations are typically transient, and an eventdev
+application using the DLB ought to retry its enqueues if they fail.
+If an enqueue fails, the DLB PMD sets rte_errno as follows:
+
+- -ENOSPC: Credit exhaustion (either hardware or software)
+- -EINVAL: Invalid argument, such as port ID, queue ID, or sched_type.
+
+Depending on the pipeline the application has constructed, it's possible to
+enter a credit deadlock scenario wherein the worker thread lacks the credit
+to enqueue an event, and it must dequeue an event before it can recover the
+credit. If the worker thread retries its enqueue indefinitely, it will not
+make forward progress. Such deadlock is possible if the application has event
+"loops", in which an event in dequeued from queue A and later enqueued back to
+queue A.
+
+Due to this, workers should stop retrying after a time, release the events
+they are attempting to enqueue, and dequeue more events. It is important that
+the worker release the events and not simply set them aside to retry the
+enqueue again later, because the port has a limited history list size (by
+default, twice the port's dequeue_depth).
+
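+A worker handling this situation might be structured roughly as follows
+(``MAX_ENQ_RETRIES`` is a placeholder for an application-chosen bound):
+
+.. code-block:: c
+
+        unsigned int retries = 0;
+
+        /* Retry on credit exhaustion for a bounded time, then release. */
+        while (rte_event_enqueue_burst(dev_id, port_id, &ev, 1) == 0 &&
+               rte_errno == -ENOSPC) {
+                if (++retries == MAX_ENQ_RETRIES) {
+                        /* Release the event rather than setting it aside. */
+                        ev.op = RTE_EVENT_OP_RELEASE;
+                        rte_event_enqueue_burst(dev_id, port_id, &ev, 1);
+                        break;
+                }
+        }
+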
+Priority
+~~~~~~~~
+
+The DLB supports event priority and per-port queue service priority, as
+described in the eventdev header file. The DLB does not support 'global' event
+queue priority established at queue creation time.
+
+DLB supports 8 event and queue service priority levels. For both priority
+types, the PMD uses the upper three bits of the priority field to determine the
+DLB priority, discarding the 5 least significant bits. The 5 least significant
+event priority bits are not preserved when an event is enqueued.
+
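+As a sketch of the truncation described above (the PMD performs this mapping
+internally):
+
+.. code-block:: c
+
+        /* 8-bit eventdev priority -> one of the 8 DLB priority levels. */
+        uint8_t dlb_priority = ev.priority >> 5;
+
+        /* e.g. RTE_EVENT_DEV_PRIORITY_NORMAL (128) maps to DLB level 4. */
+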
+Load-Balanced Queues
+~~~~~~~~~~~~~~~~~~~~
+
+A load-balanced queue can support atomic and ordered scheduling, or atomic and
+unordered scheduling, but not atomic, unordered, and ordered scheduling
+simultaneously. A queue's scheduling types are controlled by the event queue
+configuration.
+
+If the user sets the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag, the
+``nb_atomic_order_sequences`` field determines the supported scheduling types.
+With non-zero ``nb_atomic_order_sequences``, the queue is configured for atomic
+and ordered scheduling. In this case, ``RTE_SCHED_TYPE_PARALLEL`` scheduling is
+supported by scheduling those events as ordered events. Note that when such an
+event is dequeued, its sched_type will be ``RTE_SCHED_TYPE_ORDERED``. If
+``nb_atomic_order_sequences`` is zero, the queue is configured for atomic and
+unordered scheduling. In this case, ``RTE_SCHED_TYPE_ORDERED`` is unsupported.
+
+If the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag is not set, schedule_type
+dictates the queue's scheduling type.
+
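+For example (the sequence count is illustrative), a queue supporting both
+atomic and ordered scheduling could be configured as follows:
+
+.. code-block:: c
+
+        struct rte_event_queue_conf queue_conf;
+        uint8_t queue_id = 0;
+
+        rte_event_queue_default_conf_get(dev_id, queue_id, &queue_conf);
+
+        /* Atomic + ordered: ALL_TYPES with a non-zero sequence count. */
+        queue_conf.event_queue_cfg = RTE_EVENT_QUEUE_CFG_ALL_TYPES;
+        queue_conf.nb_atomic_order_sequences = 64;
+
+        rte_event_queue_setup(dev_id, queue_id, &queue_conf);
+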
+The ``nb_atomic_order_sequences`` queue configuration field sets the ordered
+queue's reorder buffer size. DLB has 4 groups of ordered queues, where each
+group is configured to contain 1 queue with 1024 reorder entries, 2 queues
+with 512 reorder entries, and so on down to 32 queues with 32 entries.
+
+When a load-balanced queue is created, the PMD will configure a new sequence
+number group on-demand if num_sequence_numbers does not match a pre-existing
+group with available reorder buffer entries. If all sequence number groups are
+in use, no new group will be created and queue configuration will fail. (Note
+that when the PMD is used with a virtual DLB device, it cannot change the
+sequence number configuration.)
+
+The queue's ``nb_atomic_flows`` parameter is ignored by the DLB PMD, because
+the DLB doesn't limit the number of flows a queue can track. In the DLB, all
+load-balanced queues can use the full 16-bit flow ID range.
+
+Reconfiguration
+~~~~~~~~~~~~~~~
+
+The Eventdev API allows one to reconfigure a device, its ports, and its queues
+by first stopping the device, calling the configuration function(s), then
+restarting the device. The DLB doesn't support configuring an individual queue
+or port without first reconfiguring the entire device, however, so there are
+certain reconfiguration sequences that are valid in the eventdev API but not
+supported by the PMD.
+
+Specifically, the PMD supports the following configuration sequence:
+1. Configure and start the device
+2. Stop the device
+3. (Optional) Reconfigure the device
+4. (Optional) If step 3 is run:
+
+   a. Setup queue(s). The reconfigured queue(s) lose their previous port links.
+   b. The reconfigured port(s) lose their previous queue links.
+
+5. (Optional, only if steps 4a and 4b are run) Link port(s) to queue(s)
+6. Restart the device. If the device is reconfigured in step 3 but one or more
+   of its ports or queues are not, the PMD will apply their previous
+   configuration (including port->queue links) at this time.
+
+The PMD does not support the following configuration sequences:
+1. Configure and start the device
+2. Stop the device
+3. Setup queue or setup port
+4. Start the device
+
+This sequence is not supported because the event device must be reconfigured
+before its ports or queues can be.
+
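+A valid reconfiguration flow, following the supported sequence above, might
+look roughly like this sketch (``new_config`` and the queue/port IDs and
+configurations are placeholders):
+
+.. code-block:: c
+
+        /* 1-2. The device was configured and started earlier; stop it. */
+        rte_event_dev_stop(dev_id);
+
+        /* 3. Reconfigure the device first... */
+        rte_event_dev_configure(dev_id, &new_config);
+
+        /* 4-5. ...then set up queues/ports again and re-link them. */
+        rte_event_queue_setup(dev_id, queue_id, &queue_conf);
+        rte_event_port_setup(dev_id, port_id, &port_conf);
+        rte_event_port_link(dev_id, port_id, &queue_id, NULL, 1);
+
+        /* 6. Restart the device. */
+        rte_event_dev_start(dev_id);
+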
+Ordered Fragments
+~~~~~~~~~~~~~~~~~
+
+The DLB has a fourth enqueue type: partial enqueue. When a thread is processing
+an ordered event, it can perform up to 16 "partial" enqueues, which allows a
+single received ordered event to result in multiple reordered events.
+
+For example, consider the case where three events (A, then B, then C) are
+enqueued with ordered scheduling and are received by three different ports.
+The ports that receive A and C forward events A' and C', while the port that
+receives B generates three partial enqueues -- B1', B2', and B3' -- followed by
+a release operation. The DLB will reorder the events in the following order:
+
+A', B1', B2', B3', C'
+
+This functionality is not available explicitly through the eventdev API, but
+the dlb PMD provides it through an additional (DLB-specific) event operation,
+RTE_EVENT_DLB_OP_FRAG.
+
+Deferred Scheduling
+~~~~~~~~~~~~~~~~~~~
+
+The DLB PMD's default behavior for managing a CQ is to "pop" the CQ once per
+dequeued event before returning from rte_event_dequeue_burst(). This frees the
+corresponding entries in the CQ, which enables the DLB to schedule more events
+to it.
+
+To support applications seeking finer-grained scheduling control -- for example
+deferring scheduling to get the best possible priority scheduling and
+load-balancing -- the PMD supports a deferred scheduling mode. In this mode,
+the CQ entry is not popped until the *subsequent* rte_event_dequeue_burst()
+call. This mode only applies to load-balanced event ports with dequeue depth of
+1.
+
+To enable deferred scheduling, use the defer_sched vdev argument like so:
+
+    .. code-block:: console
+
+       --vdev=dlb1_event,defer_sched=on
+
+Atomic Inflights Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In the last stage prior to scheduling an atomic event to a CQ, DLB holds the
+inflight event in a temporary buffer that is divided among load-balanced
+queues. If a queue's atomic buffer storage fills up, this can result in
+head-of-line-blocking. For example:
+- An LDB queue is allocated N atomic buffer entries.
+- All N entries are filled with events from flow X, which is pinned to CQ 0.
+
+Until CQ 0 releases 1+ events, no other atomic flows for that LDB queue can be
+scheduled. The likelihood of this case depends on the eventdev configuration,
+traffic behavior, event processing latency, potential for a worker to be
+interrupted or otherwise delayed, etc.
+
+By default, the PMD allocates 16 buffer entries for each load-balanced queue,
+which provides an even division across all 128 queues but potentially wastes
+buffer space (e.g. if not all queues are used, or aren't used for atomic
+scheduling).
+
+The PMD provides a dev arg to override the default per-queue allocation. To
+increase a vdev's per-queue atomic-inflight allocation to (for example) 64:
+
+    .. code-block:: console
+
+       --vdev=dlb1_event,atm_inflights=64
+
+Atomic Inflights Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In the last stage prior to scheduling an atomic event to a CQ, DLB holds the
+inflight event in a temporary buffer that is divided among load-balanced
+queues. If a queue's atomic buffer storage fills up, this can result in
+head-of-line-blocking. For example:
+- An LDB queue is allocated N atomic buffer entries.
+- All N entries are filled with events from flow X, which is pinned to CQ 0.
+
+Until CQ 0 releases 1+ events, no other atomic flows for that LDB queue can be
+scheduled. The likelihood of this case depends on the eventdev configuration,
+traffic behavior, event processing latency, potential for a worker to be
+interrupted or otherwise delayed, etc.
+
+By default, the PMD allocates 16 buffer entries for each load-balanced queue,
+which provides an even division across all 128 queues but potentially wastes
+buffer space (e.g. if not all queues are used, or aren't used for atomic
+scheduling).
+
+The PMD provides a dev arg to override the default per-queue allocation. To
+increase a vdev's per-queue atomic-inflight allocation to (for example) 64:
+
+    .. code-block:: console
+
+       --vdev=dlb1_event,atm_inflights=64