[v5,6/6] doc/guides/: provide IOAT sample app guide

Message ID 20190920073714.1314-7-marcinx.baran@intel.com (mailing list archive)
State Superseded, archived
Series examples/ioat: sample app on ioat driver usage

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

Marcin Baran Sept. 20, 2019, 7:37 a.m. UTC
  Added guide for IOAT sample app usage and
code description.

Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
---
 doc/guides/sample_app_ug/index.rst |   1 +
 doc/guides/sample_app_ug/intro.rst |   4 +
 doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
 3 files changed, 769 insertions(+)
 create mode 100644 doc/guides/sample_app_ug/ioat.rst
  

Comments

Bruce Richardson Sept. 27, 2019, 10:36 a.m. UTC | #1
On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and
> code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 
> diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
> index f23f8f59e..a6a1d9e7a 100644
> --- a/doc/guides/sample_app_ug/index.rst
> +++ b/doc/guides/sample_app_ug/index.rst
> @@ -23,6 +23,7 @@ Sample Applications User Guides
>      ip_reassembly
>      kernel_nic_interface
>      keep_alive
> +    ioat
>      l2_forward_crypto
>      l2_forward_job_stats
>      l2_forward_real_virtual
> diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
> index 90704194a..74462312f 100644
> --- a/doc/guides/sample_app_ug/intro.rst
> +++ b/doc/guides/sample_app_ug/intro.rst
> @@ -91,6 +91,10 @@ examples are highlighted below.
>    forwarding, or ``l3fwd`` application does forwarding based on Internet
>    Protocol, IPv4 or IPv6 like a simple router.
>  
> +* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
> +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev driver for
> +  copying packets between two threads.
> +
>  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
>    demonstrates how to distribute packets arriving on an Rx port to different
>    cores for processing and transmission.
> diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
> new file mode 100644
> index 000000000..69621673b
> --- /dev/null
> +++ b/doc/guides/sample_app_ug/ioat.rst
> @@ -0,0 +1,764 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2019 Intel Corporation.
> +
> +Sample Application of packet copying using Intel\ |reg| QuickData Technology
> +============================================================================
> +

Title is too long, you can drop the "Sample Application" at minimum, since
this is part of the example applications guide document. Call the section
"Packet Copying Using ..."

In order to get the proper (R) symbol, you also need to add an include to
the file. Therefore add the following line just after the copyright:

.. include:: <isonum.txt>

/Bruce
  
Bruce Richardson Sept. 27, 2019, 10:37 a.m. UTC | #2
Adding John and Marko on CC as documentation maintainers.

On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and
> code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 
> diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
> index f23f8f59e..a6a1d9e7a 100644
> --- a/doc/guides/sample_app_ug/index.rst
> +++ b/doc/guides/sample_app_ug/index.rst
> @@ -23,6 +23,7 @@ Sample Applications User Guides
>      ip_reassembly
>      kernel_nic_interface
>      keep_alive
> +    ioat
>      l2_forward_crypto
>      l2_forward_job_stats
>      l2_forward_real_virtual
> diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
> index 90704194a..74462312f 100644
> --- a/doc/guides/sample_app_ug/intro.rst
> +++ b/doc/guides/sample_app_ug/intro.rst
> @@ -91,6 +91,10 @@ examples are highlighted below.
>    forwarding, or ``l3fwd`` application does forwarding based on Internet
>    Protocol, IPv4 or IPv6 like a simple router.
>  
> +* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
> +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev driver for
> +  copying packets between two threads.
> +
>  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
>    demonstrates how to distribute packets arriving on an Rx port to different
>    cores for processing and transmission.
> diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
> new file mode 100644
> index 000000000..69621673b
> --- /dev/null
> +++ b/doc/guides/sample_app_ug/ioat.rst
> @@ -0,0 +1,764 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2019 Intel Corporation.
> +
> +Sample Application of packet copying using Intel\ |reg| QuickData Technology
> +============================================================================
> +
> +Overview
> +--------
> +
> +This sample is intended as a demonstration of the basic components of a DPDK
> +forwarding application and an example of how to use the IOAT driver API to
> +copy packets.
> +
> +While forwarding, the MAC addresses are also updated as follows:
> +
> +*   The source MAC address is replaced by the TX port MAC address
> +
> +*   The destination MAC address is replaced by 02:00:00:00:00:TX_PORT_ID
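For readers following the thread, the two rewrite rules above can be sketched in standalone C. This is only an illustration under stated assumptions, not the sample app's code: `toy_ether_hdr` and `toy_update_mac_addrs()` are hypothetical names, and the port id is assumed to fit in one byte.

```c
#include <stdint.h>
#include <string.h>

#define TOY_ETHER_ADDR_LEN 6

/* Hypothetical minimal Ethernet header; not the DPDK header type. */
struct toy_ether_hdr {
    uint8_t  dst[TOY_ETHER_ADDR_LEN];
    uint8_t  src[TOY_ETHER_ADDR_LEN];
    uint16_t ether_type;
};

/*
 * Rewrite both addresses of one frame:
 *   dst becomes 02:00:00:00:00:<tx_port_id>
 *   src becomes the MAC address of the TX port
 * Assumption: the port id fits in a single byte.
 */
static void
toy_update_mac_addrs(struct toy_ether_hdr *hdr, uint8_t tx_port_id,
        const uint8_t tx_port_mac[TOY_ETHER_ADDR_LEN])
{
    const uint8_t dst[TOY_ETHER_ADDR_LEN] = { 0x02, 0, 0, 0, 0, tx_port_id };

    memcpy(hdr->dst, dst, TOY_ETHER_ADDR_LEN);
    memcpy(hdr->src, tx_port_mac, TOY_ETHER_ADDR_LEN);
}
```

The EtherType and payload are left untouched, so only the first 12 bytes of the frame change.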
> +
> +This application can be used to compare the performance of software packet
> +copying with copying done by a DMA device for different packet sizes.
> +The example prints statistics every second. The statistics show the number
> +of packets received and sent, and the packets dropped or failed to copy.
> +
> +Compiling the Application
> +-------------------------
> +
> +To compile the sample application see :doc:`compiling`.
> +
> +The application is located in the ``ioat`` sub-directory.
> +
> +
> +Running the Application
> +-----------------------
> +
> +In order to run the hardware copy application, the copying device
> +needs to be bound to a user-space IO driver.
> +
> +Refer to the *IOAT Rawdev Driver for Intel\ |reg| QuickData Technology*
> +guide for information on using the driver.
> +
> +The application requires a number of command line options:
> +
> +.. code-block:: console
> +
> +    ./build/ioatfwd [EAL options] -- -p MASK [-q NQ] [-s RS] [-c <sw|hw>]
> +        [--[no-]mac-updating]
> +
> +where,
> +
> +*   p MASK: A hexadecimal bitmask of the ports to configure
> +
> +*   q NQ: Number of Rx queues used per port, equivalent to the number of
> +    CBDMA channels per port
> +
> +*   c CT: Type of packet copy to perform: software (sw) or hardware using
> +    DMA (hw)
> +
> +*   s RS: Size of the IOAT rawdev ring for hardware copy mode or of the
> +    rte_ring for software copy mode
> +
> +*   --[no-]mac-updating: Whether the MAC addresses of packets should be
> +    updated or not
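As a side note on the `-p MASK` option, here is a minimal standalone sketch of how a hexadecimal port mask can be parsed and tested per port. The helper names `toy_parse_portmask()` and `toy_port_enabled()` are hypothetical and this is not the parser used by the sample app; rejecting an empty mask is a choice made for this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a hexadecimal port mask such as "0x3".
 * Returns 0 and stores the mask in *mask on success, -1 on malformed
 * input. Treating a mask of 0 as an error is an assumption of this sketch.
 */
static int
toy_parse_portmask(const char *arg, uint32_t *mask)
{
    char *end = NULL;
    unsigned long val;

    errno = 0;
    val = strtoul(arg, &end, 16);   /* base 16 accepts an optional 0x prefix */
    if (errno != 0 || end == arg || *end != '\0' || val == 0 ||
            val > UINT32_MAX)
        return -1;

    *mask = (uint32_t)val;
    return 0;
}

/* A port is enabled when its bit is set in the mask. */
static int
toy_port_enabled(uint32_t mask, unsigned int portid)
{
    return portid < 32 && (mask & (UINT32_C(1) << portid)) != 0;
}
```

With a mask of 0x3, ports 0 and 1 are enabled and every other port is skipped, which matches the second example command in the guide.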
> +
> +The application can be launched in various configurations depending on the
> +provided parameters. Each port can use up to 2 lcores: one of them receives
> +incoming traffic and makes a copy of each packet. The second lcore then
> +updates the MAC address and sends the copy. If one lcore per port is used,
> +both operations are done sequentially. For each configuration an additional
> +lcore is needed, because the master lcore does not forward packets: it is
> +responsible for configuration, statistics printing and safe deinitialization
> +of all ports and devices.
> +
> +The application can use a maximum of 8 ports.
> +
> +To run the application in a Linux environment with 3 lcores (one of them
> +is the master lcore), 1 port (port 0), software copying and MAC updating,
> +issue the command:
> +
> +.. code-block:: console
> +
> +    $ ./build/ioatfwd -l 0-2 -n 2 -- -p 0x1 --mac-updating -c sw
> +
> +To run the application in a Linux environment with 2 lcores (one of them
> +is the master lcore), 2 ports (ports 0 and 1), hardware copying and no MAC
> +updating, issue the command:
> +
> +.. code-block:: console
> +
> +    $ ./build/ioatfwd -l 0-1 -n 1 -- -p 0x3 --no-mac-updating -c hw
> +
> +Refer to the *DPDK Getting Started Guide* for general information on
> +running applications and the Environment Abstraction Layer (EAL) options.
> +
> +Explanation
> +-----------
> +
> +The following sections provide an explanation of the main components of the
> +code.
> +
> +All DPDK library functions used in the sample code are prefixed with
> +``rte_`` and are explained in detail in the *DPDK API Documentation*.
> +
> +
> +The Main Function
> +~~~~~~~~~~~~~~~~~
> +
> +The ``main()`` function performs the initialization and calls the execution
> +threads for each lcore.
> +
> +The first task is to initialize the Environment Abstraction Layer (EAL).
> +The ``argc`` and ``argv`` arguments are provided to the ``rte_eal_init()``
> +function. The value returned is the number of parsed arguments:
> +
> +.. code-block:: c
> +
> +    /* init EAL */
> +    ret = rte_eal_init(argc, argv);
> +    if (ret < 0)
> +        rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
> +
> +
> +The ``main()`` function also allocates a mempool to hold the mbufs (Message Buffers)
> +used by the application:
> +
> +.. code-block:: c
> +
> +    nb_mbufs = RTE_MAX(rte_eth_dev_count_avail() * (nb_rxd + nb_txd
> +        + MAX_PKT_BURST + rte_lcore_count() * MEMPOOL_CACHE_SIZE),
> +        MIN_POOL_SIZE);
> +
> +    /* Create the mbuf pool */
> +    ioat_pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", nb_mbufs,
> +        MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
> +        rte_socket_id());
> +    if (ioat_pktmbuf_pool == NULL)
> +        rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n");
> +
> +Mbufs are the packet buffer structure used by DPDK. They are explained in
> +detail in the "Mbuf Library" section of the *DPDK Programmer's Guide*.
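To make the pool sizing above easier to follow, here is the same formula as a standalone function. `toy_nb_mbufs()` is a hypothetical helper, and the numbers in the note below are illustrative assumptions, since the patch text quoted here does not show the values of the constants.

```c
#include <stdint.h>

/*
 * Pool size = max(ports * (rx descriptors + tx descriptors + one burst
 *                          + per-lcore cache for every lcore),
 *                 minimum pool size)
 * All parameters are passed in explicitly; the sample app uses globals
 * and macros instead.
 */
static uint32_t
toy_nb_mbufs(uint32_t nb_ports, uint32_t nb_rxd, uint32_t nb_txd,
        uint32_t max_pkt_burst, uint32_t nb_lcores,
        uint32_t cache_size, uint32_t min_pool_size)
{
    uint32_t wanted = nb_ports * (nb_rxd + nb_txd + max_pkt_burst +
            nb_lcores * cache_size);

    return wanted > min_pool_size ? wanted : min_pool_size;
}
```

For example, assuming 2 ports, 1024 Rx and 1024 Tx descriptors, a burst of 32, 3 lcores and a cache of 512, the per-port term is 1024 + 1024 + 32 + 3 * 512 = 3616, so 7232 mbufs are wanted. With an assumed minimum of 65536 the minimum wins, so the pool is sized by the floor rather than by the port count.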
> +
> +The ``main()`` function also initializes the ports:
> +
> +.. code-block:: c
> +
> +    /* Initialise each port */
> +    RTE_ETH_FOREACH_DEV(portid) {
> +        port_init(portid, ioat_pktmbuf_pool, nb_queues);
> +    }
> +
> +Each port is configured using ``port_init()``:
> +
> +.. code-block:: c
> +
> +    /*
> +     * Initializes a given port using global settings and with the RX buffers
> +     * coming from the mbuf_pool passed as a parameter.
> +     */
> +    static inline void
> +    port_init(uint16_t portid, struct rte_mempool *mbuf_pool, uint16_t nb_queues)
> +    {
> +        /* configuring port to use RSS for multiple RX queues */
> +        static const struct rte_eth_conf port_conf = {
> +            .rxmode = {
> +                .mq_mode        = ETH_MQ_RX_RSS,
> +                .max_rx_pkt_len = RTE_ETHER_MAX_LEN
> +            },
> +            .rx_adv_conf = {
> +                .rss_conf = {
> +                    .rss_key = NULL,
> +                    .rss_hf = ETH_RSS_PROTO_MASK,
> +                }
> +            }
> +        };
> +
> +        struct rte_eth_rxconf rxq_conf;
> +        struct rte_eth_txconf txq_conf;
> +        struct rte_eth_conf local_port_conf = port_conf;
> +        struct rte_eth_dev_info dev_info;
> +        int ret, i;
> +
> +        /* Skip ports that are not enabled */
> +        if ((ioat_enabled_port_mask & (1 << portid)) == 0) {
> +            printf("Skipping disabled port %u\n", portid);
> +            return;
> +        }
> +
> +        /* Init port */
> +        printf("Initializing port %u... ", portid);
> +        fflush(stdout);
> +        rte_eth_dev_info_get(portid, &dev_info);
> +        local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
> +            dev_info.flow_type_rss_offloads;
> +        if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
> +            local_port_conf.txmode.offloads |=
> +                DEV_TX_OFFLOAD_MBUF_FAST_FREE;
> +        ret = rte_eth_dev_configure(portid, nb_queues, 1, &local_port_conf);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE, "Cannot configure device:"
> +                " err=%d, port=%u\n", ret, portid);
> +
> +        ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd,
> +                            &nb_txd);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot adjust number of descriptors: err=%d, port=%u\n",
> +                ret, portid);
> +
> +        rte_eth_macaddr_get(portid, &ioat_ports_eth_addr[portid]);
> +
> +        /* Init Rx queues */
> +        rxq_conf = dev_info.default_rxconf;
> +        rxq_conf.offloads = local_port_conf.rxmode.offloads;
> +        for (i = 0; i < nb_queues; i++) {
> +            ret = rte_eth_rx_queue_setup(portid, i, nb_rxd,
> +                rte_eth_dev_socket_id(portid), &rxq_conf,
> +                mbuf_pool);
> +            if (ret < 0)
> +                rte_exit(EXIT_FAILURE,
> +                    "rte_eth_rx_queue_setup:err=%d,port=%u, queue_id=%u\n",
> +                    ret, portid, i);
> +        }
> +
> +        /* Init one TX queue on each port */
> +        txq_conf = dev_info.default_txconf;
> +        txq_conf.offloads = local_port_conf.txmode.offloads;
> +        ret = rte_eth_tx_queue_setup(portid, 0, nb_txd,
> +                rte_eth_dev_socket_id(portid),
> +                &txq_conf);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "rte_eth_tx_queue_setup:err=%d,port=%u\n",
> +                ret, portid);
> +
> +        /* Initialize TX buffers */
> +        tx_buffer[portid] = rte_zmalloc_socket("tx_buffer",
> +                RTE_ETH_TX_BUFFER_SIZE(MAX_PKT_BURST), 0,
> +                rte_eth_dev_socket_id(portid));
> +        if (tx_buffer[portid] == NULL)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot allocate buffer for tx on port %u\n",
> +                portid);
> +
> +        rte_eth_tx_buffer_init(tx_buffer[portid], MAX_PKT_BURST);
> +
> +        ret = rte_eth_tx_buffer_set_err_callback(tx_buffer[portid],
> +                rte_eth_tx_buffer_count_callback,
> +                &port_statistics.tx_dropped[portid]);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot set error callback for tx buffer on port %u\n",
> +                portid);
> +
> +        /* Start device */
> +        ret = rte_eth_dev_start(portid);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "rte_eth_dev_start:err=%d, port=%u\n",
> +                ret, portid);
> +
> +        rte_eth_promiscuous_enable(portid);
> +
> +        printf("Port %u, MAC address: %02X:%02X:%02X:%02X:%02X:%02X\n\n",
> +                portid,
> +                ioat_ports_eth_addr[portid].addr_bytes[0],
> +                ioat_ports_eth_addr[portid].addr_bytes[1],
> +                ioat_ports_eth_addr[portid].addr_bytes[2],
> +                ioat_ports_eth_addr[portid].addr_bytes[3],
> +                ioat_ports_eth_addr[portid].addr_bytes[4],
> +                ioat_ports_eth_addr[portid].addr_bytes[5]);
> +
> +        cfg.ports[cfg.nb_ports].rxtx_port = portid;
> +        cfg.ports[cfg.nb_ports++].nb_queues = nb_queues;
> +    }
> +
> +The Ethernet ports are configured with local settings using the
> +``rte_eth_dev_configure()`` function and the ``port_conf`` struct.
> +RSS is enabled so that multiple Rx queues can be used for receiving
> +packets and copying them with multiple CBDMA channels per port:
> +
> +.. code-block:: c
> +
> +    /* configuring port to use RSS for multiple RX queues */
> +    static const struct rte_eth_conf port_conf = {
> +        .rxmode = {
> +            .mq_mode        = ETH_MQ_RX_RSS,
> +            .max_rx_pkt_len = RTE_ETHER_MAX_LEN
> +        },
> +        .rx_adv_conf = {
> +            .rss_conf = {
> +                .rss_key = NULL,
> +                .rss_hf = ETH_RSS_PROTO_MASK,
> +            }
> +        }
> +    };
> +
> +For this example the ports are set up with the number of Rx queues given
> +by the ``-q`` option and 1 Tx queue, using the ``rte_eth_rx_queue_setup()``
> +and ``rte_eth_tx_queue_setup()`` functions.
> +
> +The Ethernet port is then started:
> +
> +.. code-block:: c
> +
> +    ret = rte_eth_dev_start(portid);
> +    if (ret < 0)
> +        rte_exit(EXIT_FAILURE, "rte_eth_dev_start:err=%d, port=%u\n",
> +            ret, portid);
> +
> +
> +Finally the Rx port is set in promiscuous mode:
> +
> +.. code-block:: c
> +
> +    rte_eth_promiscuous_enable(portid);
> +
> +
> +After that, the application assigns the resources needed by each port.
> +
> +.. code-block:: c
> +
> +    check_link_status(ioat_enabled_port_mask);
> +
> +    if (!cfg.nb_ports) {
> +        rte_exit(EXIT_FAILURE,
> +            "All available ports are disabled. Please set portmask.\n");
> +    }
> +
> +    /* Check if there is enough lcores for all ports. */
> +    cfg.nb_lcores = rte_lcore_count() - 1;
> +    if (cfg.nb_lcores < 1)
> +        rte_exit(EXIT_FAILURE,
> +            "There should be at least one slave lcore.\n");
> +
> +    ret = 0;
> +
> +    if (copy_mode == COPY_MODE_IOAT_NUM) {
> +        assign_rawdevs();
> +    } else /* copy_mode == COPY_MODE_SW_NUM */ {
> +        assign_rings();
> +    }
> +
> +The link status of each port enabled by the port mask is checked
> +using the ``check_link_status()`` function.
> +
> +.. code-block:: c
> +
> +    /* check link status, return true if at least one port is up */
> +    static int
> +    check_link_status(uint32_t port_mask)
> +    {
> +        uint16_t portid;
> +        struct rte_eth_link link;
> +        int retval = 0;
> +
> +        printf("\nChecking link status\n");
> +        RTE_ETH_FOREACH_DEV(portid) {
> +            if ((port_mask & (1 << portid)) == 0)
> +                continue;
> +
> +            memset(&link, 0, sizeof(link));
> +            rte_eth_link_get(portid, &link);
> +
> +            /* Print link status */
> +            if (link.link_status) {
> +                printf(
> +                    "Port %d Link Up. Speed %u Mbps - %s\n",
> +                    portid, link.link_speed,
> +                    (link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
> +                    ("full-duplex") : ("half-duplex"));
> +                retval = 1;
> +            } else
> +                printf("Port %d Link Down\n", portid);
> +        }
> +        return retval;
> +    }
> +
> +Depending on the mode set (whether the copy is done by software or by
> +hardware), special structures are assigned to each port. If software copy
> +was chosen, the application has to assign ring structures for exchanging
> +packets between the lcores assigned to ports.
> +
> +.. code-block:: c
> +
> +    static void
> +    assign_rings(void)
> +    {
> +        uint32_t i;
> +
> +        for (i = 0; i < cfg.nb_ports; i++) {
> +            char ring_name[20];
> +
> +            snprintf(ring_name, 20, "rx_to_tx_ring_%u", i);
> +            /* Create ring for inter core communication */
> +            cfg.ports[i].rx_to_tx_ring = rte_ring_create(
> +                    ring_name, ring_size,
> +                    rte_socket_id(), RING_F_SP_ENQ);
> +
> +            if (cfg.ports[i].rx_to_tx_ring == NULL)
> +                rte_exit(EXIT_FAILURE, "%s\n",
> +                        rte_strerror(rte_errno));
> +        }
> +    }
> +
> +
> +When using hardware copy each Rx queue of the port is assigned an
> +IOAT device (``assign_rawdevs()``) using IOAT Rawdev Driver API
> +functions:
> +
> +.. code-block:: c
> +
> +    static void
> +    assign_rawdevs(void)
> +    {
> +        uint16_t nb_rawdev = 0, rdev_id = 0;
> +        uint32_t i, j;
> +
> +        for (i = 0; i < cfg.nb_ports; i++) {
> +            for (j = 0; j < cfg.ports[i].nb_queues; j++) {
> +                struct rte_rawdev_info rdev_info = { 0 };
> +
> +                do {
> +                    if (rdev_id == rte_rawdev_count())
> +                        goto end;
> +                    rte_rawdev_info_get(rdev_id++, &rdev_info);
> +                } while (strcmp(rdev_info.driver_name,
> +                    IOAT_PMD_RAWDEV_NAME_STR) != 0);
> +
> +                cfg.ports[i].ioat_ids[j] = rdev_id - 1;
> +                configure_rawdev_queue(cfg.ports[i].ioat_ids[j]);
> +                ++nb_rawdev;
> +            }
> +        }
> +    end:
> +        if (nb_rawdev < cfg.nb_ports * cfg.ports[0].nb_queues)
> +            rte_exit(EXIT_FAILURE,
> +                "Not enough IOAT rawdevs (%u) for all queues (%u).\n",
> +                nb_rawdev, cfg.nb_ports * cfg.ports[0].nb_queues);
> +        RTE_LOG(INFO, IOAT, "Number of used rawdevs: %u.\n", nb_rawdev);
> +    }
> +
> +
> +The hardware device is initialized using the ``rte_rawdev_configure()``
> +function and the ``rte_rawdev_info`` struct. After configuration the device
> +is started using the ``rte_rawdev_start()`` function. Both operations are
> +done in ``configure_rawdev_queue()``.
> +
> +.. code-block:: c
> +
> +    static void
> +    configure_rawdev_queue(uint32_t dev_id)
> +    {
> +        struct rte_rawdev_info info = { .dev_private = &dev_config };
> +
> +        /* Configure hardware copy device */
> +        dev_config.ring_size = ring_size;
> +
> +        if (rte_rawdev_configure(dev_id, &info) != 0) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error with rte_rawdev_configure()\n");
> +        }
> +        rte_rawdev_info_get(dev_id, &info);
> +        if (dev_config.ring_size != ring_size) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error, ring size is not %d (%d)\n",
> +                ring_size, (int)dev_config.ring_size);
> +        }
> +        if (rte_rawdev_start(dev_id) != 0) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error with rte_rawdev_start()\n");
> +        }
> +    }
> +
> +If initialization is successful, memory for hardware device
> +statistics is allocated.
> +
> +Finally the ``main()`` function starts all processing lcores and starts
> +printing stats in a loop on the master lcore. The application can be
> +interrupted and closed using ``Ctrl-C``. The master lcore waits for
> +all slave lcores to finish, deallocates resources and exits.
> +
> +The functions that launch the processing lcores are described below.
> +
> +The Lcores Launching Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +As described above, the ``main()`` function invokes the
> +``start_forwarding_cores()`` function to start processing on each lcore:
> +
> +.. code-block:: c
> +
> +    static void start_forwarding_cores(void)
> +    {
> +        uint32_t lcore_id = rte_lcore_id();
> +
> +        RTE_LOG(INFO, IOAT, "Entering %s on lcore %u\n",
> +                __func__, rte_lcore_id());
> +
> +        if (cfg.nb_lcores == 1) {
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)rxtx_main_loop,
> +                NULL, lcore_id);
> +        } else if (cfg.nb_lcores > 1) {
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)rx_main_loop,
> +                NULL, lcore_id);
> +
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)tx_main_loop, NULL,
> +                lcore_id);
> +        }
> +    }
> +
> +The function launches Rx/Tx processing functions on configured lcores
> +for each port using ``rte_eal_remote_launch()``. The configured ports,
> +their number and number of assigned lcores are stored in user-defined
> +``rxtx_transmission_config`` struct that is initialized before launching
> +tasks:
> +
> +.. code-block:: c
> +
> +    struct rxtx_transmission_config {
> +        struct rxtx_port_config ports[RTE_MAX_ETHPORTS];
> +        uint16_t nb_ports;
> +        uint16_t nb_lcores;
> +    };
> +
> +The Lcores Processing Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Packets are received on each port by the ``ioat_rx_port()`` function.
> +The function receives packets on each configured Rx queue. Depending on the
> +mode the user chose, it either enqueues packets to IOAT rawdev channels and
> +then starts the copy process (hardware copy), or copies each packet in software
> +using ``pktmbuf_sw_copy()`` and enqueues the copies to an rte_ring:
> +
> +.. code-block:: c
> +
> +    /* Receive packets on one port and enqueue to IOAT rawdev or rte_ring. */
> +    static void
> +    ioat_rx_port(struct rxtx_port_config *rx_config)
> +    {
> +        uint32_t nb_rx, nb_enq, i, j;
> +        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +        for (i = 0; i < rx_config->nb_queues; i++) {
> +
> +            nb_rx = rte_eth_rx_burst(rx_config->rxtx_port, i,
> +                pkts_burst, MAX_PKT_BURST);
> +
> +            if (nb_rx == 0)
> +                continue;
> +
> +            port_statistics.rx[rx_config->rxtx_port] += nb_rx;
> +
> +            if (copy_mode == COPY_MODE_IOAT_NUM) {
> +                /* Perform packet hardware copy */
> +                nb_enq = ioat_enqueue_packets(pkts_burst,
> +                    nb_rx, rx_config->ioat_ids[i]);
> +                if (nb_enq > 0)
> +                    rte_ioat_do_copies(rx_config->ioat_ids[i]);
> +            } else {
> +                /* Perform packet software copy, free source packets */
> +                int ret;
> +                struct rte_mbuf *pkts_burst_copy[MAX_PKT_BURST];
> +
> +                ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
> +                    (void *)pkts_burst_copy, nb_rx);
> +
> +                if (unlikely(ret < 0))
> +                    rte_exit(EXIT_FAILURE,
> +                        "Unable to allocate memory.\n");
> +
> +                for (j = 0; j < nb_rx; j++)
> +                    pktmbuf_sw_copy(pkts_burst[j],
> +                        pkts_burst_copy[j]);
> +
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)pkts_burst, nb_rx);
> +
> +                nb_enq = rte_ring_enqueue_burst(
> +                    rx_config->rx_to_tx_ring,
> +                    (void *)pkts_burst_copy, nb_rx, NULL);
> +
> +                /* Free any not enqueued packets. */
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&pkts_burst_copy[nb_enq],
> +                    nb_rx - nb_enq);
> +            }
> +
> +            port_statistics.copy_dropped[rx_config->rxtx_port] +=
> +                (nb_rx - nb_enq);
> +        }
> +    }
> +
> +The packets are received in burst mode using the ``rte_eth_rx_burst()``
> +function. In hardware copy mode the packets are enqueued in the copying
> +device's buffer using ``ioat_enqueue_packets()``, which calls
> +``rte_ioat_enqueue_copy()``. When all received packets are in the
> +buffer, the copy operations are started by calling ``rte_ioat_do_copies()``.
> +The ``rte_ioat_enqueue_copy()`` function operates on the physical address
> +of the packet. The ``rte_mbuf`` structure contains only the physical address
> +of the start of the data buffer (``buf_iova``). The address is therefore
> +shifted by the ``addr_offset`` value to get the address of the ``rearm_data``
> +member of ``rte_mbuf``. That way the packet is copied all at once
> +(with data and metadata).
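The address shift described above can be illustrated with a toy layout. This is a sketch under a strong assumption: header and data buffer form one contiguous object, so the distance between two fields is the same in virtual and bus address space. `toy_mbuf` and both helpers are hypothetical and much simpler than the real `rte_mbuf`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf header followed by its data buffer. */
struct toy_mbuf {
    void    *buf_addr;    /* virtual address of the data buffer */
    uint64_t buf_iova;    /* bus address of the data buffer */
    uint64_t rearm_data;  /* first metadata field that should be copied */
    uint16_t data_len;
    uint8_t  buf[256];    /* data buffer, laid out right after the header */
};

/* Distance in bytes from rearm_data to the start of the data buffer. */
static uint64_t
toy_addr_offset(const struct toy_mbuf *m)
{
    return (uint64_t)((const uint8_t *)m->buf_addr -
            (const uint8_t *)&m->rearm_data);
}

/*
 * Bus address of rearm_data: only buf_iova is known, so step back by the
 * same distance that separates the two fields in virtual address space.
 */
static uint64_t
toy_rearm_iova(const struct toy_mbuf *m)
{
    return m->buf_iova - toy_addr_offset(m);
}
```

Starting the copy at that address and extending the length by the same offset is what lets a single DMA operation cover metadata and packet data together.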
> +
> +.. code-block:: c
> +
> +    static uint32_t
> +    ioat_enqueue_packets(struct rte_mbuf **pkts,
> +        uint32_t nb_rx, uint16_t dev_id)
> +    {
> +        int ret;
> +        uint32_t i;
> +        struct rte_mbuf *pkts_copy[MAX_PKT_BURST];
> +
> +        const uint64_t addr_offset = RTE_PTR_DIFF(pkts[0]->buf_addr,
> +            &pkts[0]->rearm_data);
> +
> +        ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
> +                (void *)pkts_copy, nb_rx);
> +
> +        if (unlikely(ret < 0))
> +            rte_exit(EXIT_FAILURE, "Unable to allocate memory.\n");
> +
> +        for (i = 0; i < nb_rx; i++) {
> +            /* Perform data copy */
> +            ret = rte_ioat_enqueue_copy(dev_id,
> +                pkts[i]->buf_iova
> +                    - addr_offset,
> +                pkts_copy[i]->buf_iova
> +                    - addr_offset,
> +                rte_pktmbuf_data_len(pkts[i])
> +                    + addr_offset,
> +                (uintptr_t)pkts[i],
> +                (uintptr_t)pkts_copy[i],
> +                0 /* nofence */);
> +
> +            if (ret != 1)
> +                break;
> +        }
> +
> +        ret = i;
> +        /* Free any not enqueued packets. */
> +        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts[i], nb_rx - i);
> +        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts_copy[i],
> +            nb_rx - i);
> +
> +        return ret;
> +    }
> +
> +
> +All completed copies are processed by the ``ioat_tx_port()`` function. In
> +hardware copy mode the function invokes ``rte_ioat_completed_copies()``
> +on each assigned IOAT channel to gather the copied packets. In software copy
> +mode the function dequeues the copied packets from the rte_ring. Then the
> +MAC addresses of each packet are updated, if MAC updating is enabled. Finally
> +the copies are sent in burst mode using ``rte_eth_tx_burst()``.
> +
> +
> +.. code-block:: c
> +
> +    /* Transmit packets from IOAT rawdev/rte_ring for one port. */
> +    static void
> +    ioat_tx_port(struct rxtx_port_config *tx_config)
> +    {
> +        uint32_t i, j, nb_dq = 0;
> +        struct rte_mbuf *mbufs_src[MAX_PKT_BURST];
> +        struct rte_mbuf *mbufs_dst[MAX_PKT_BURST];
> +
> +        if (copy_mode == COPY_MODE_IOAT_NUM) {
> +            for (i = 0; i < tx_config->nb_queues; i++) {
> +                /* Dequeue the mbufs from IOAT device. */
> +                nb_dq = rte_ioat_completed_copies(
> +                    tx_config->ioat_ids[i], MAX_PKT_BURST,
> +                    (void *)mbufs_src, (void *)mbufs_dst);
> +
> +                if (nb_dq == 0)
> +                    break;
> +
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)mbufs_src, nb_dq);
> +
> +                /* Update macs if enabled */
> +                if (mac_updating) {
> +                    for (j = 0; j < nb_dq; j++)
> +                        update_mac_addrs(mbufs_dst[j],
> +                            tx_config->rxtx_port);
> +                }
> +
> +                const uint16_t nb_tx = rte_eth_tx_burst(
> +                    tx_config->rxtx_port, 0,
> +                    (void *)mbufs_dst, nb_dq);
> +
> +                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
> +
> +                /* Free any unsent packets. */
> +                if (unlikely(nb_tx < nb_dq))
> +                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&mbufs_dst[nb_tx],
> +                        nb_dq - nb_tx);
> +            }
> +        }
> +        else {
> +            for (i = 0; i < tx_config->nb_queues; i++) {
> +                /* Dequeue the mbufs from rx_to_tx_ring. */
> +                nb_dq = rte_ring_dequeue_burst(tx_config->rx_to_tx_ring,
> +                    (void *)mbufs_dst, MAX_PKT_BURST, NULL);
> +
> +                if (nb_dq == 0)
> +                    return;
> +
> +                /* Update macs if enabled */
> +                if (mac_updating) {
> +                    for (j = 0; j < nb_dq; j++)
> +                        update_mac_addrs(mbufs_dst[j],
> +                            tx_config->rxtx_port);
> +                }
> +
> +                const uint16_t nb_tx = rte_eth_tx_burst(tx_config->rxtx_port,
> +                    0, (void *)mbufs_dst, nb_dq);
> +
> +                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
> +
> +                /* Free any unsent packets. */
> +                if (unlikely(nb_tx < nb_dq))
> +                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&mbufs_dst[nb_tx],
> +                        nb_dq - nb_tx);
> +            }
> +        }
> +    }
> +
> +The Packet Copying Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In order to perform packet copy there is a user-defined function
> +``pktmbuf_sw_copy()`` used. It copies a whole packet by copying
> +metadata from source packet to new mbuf, and then copying a data
> +chunk of source packet. Both memory copies are done using
> +``rte_memcpy()``:
> +
> +.. code-block:: c
> +
> +    static inline void
> +    pktmbuf_sw_copy(struct rte_mbuf *src, struct rte_mbuf *dst)
> +    {
> +        /* Copy packet metadata */
> +        rte_memcpy(&dst->rearm_data,
> +            &src->rearm_data,
> +            offsetof(struct rte_mbuf, cacheline1)
> +                - offsetof(struct rte_mbuf, rearm_data));
> +
> +        /* Copy packet data */
> +        rte_memcpy(rte_pktmbuf_mtod(dst, char *),
> +            rte_pktmbuf_mtod(src, char *), src->data_len);
> +    }
> +
> +The metadata in this example is copied from ``rearm_data`` member of
> +``rte_mbuf`` struct up to ``cacheline1``.
> +
> +In order to understand why software packet copying is done as shown
> +above please refer to the "Mbuf Library" section of the
> +*DPDK Programmer's Guide*.
> \ No newline at end of file
> -- 
> 2.22.0.windows.1
>
  
Bruce Richardson Sept. 27, 2019, 11:01 a.m. UTC | #3
On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and
> code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 
> diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
> index f23f8f59e..a6a1d9e7a 100644
> --- a/doc/guides/sample_app_ug/index.rst
> +++ b/doc/guides/sample_app_ug/index.rst
> @@ -23,6 +23,7 @@ Sample Applications User Guides
>      ip_reassembly
>      kernel_nic_interface
>      keep_alive
> +    ioat
>      l2_forward_crypto
>      l2_forward_job_stats
>      l2_forward_real_virtual
> diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
> index 90704194a..74462312f 100644
> --- a/doc/guides/sample_app_ug/intro.rst
> +++ b/doc/guides/sample_app_ug/intro.rst
> @@ -91,6 +91,10 @@ examples are highlighted below.
>    forwarding, or ``l3fwd`` application does forwarding based on Internet
>    Protocol, IPv4 or IPv6 like a simple router.
>  
> +* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
> +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev driver for
> +  copying packets between two threads.
> +
>  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
>    demonstrates how to distribute packets arriving on an Rx port to different
>    cores for processing and transmission.
> diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
> new file mode 100644
> index 000000000..69621673b
> --- /dev/null
> +++ b/doc/guides/sample_app_ug/ioat.rst
> @@ -0,0 +1,764 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2019 Intel Corporation.
> +
> +Sample Application of packet copying using Intel\ |reg| QuickData Technology
> +============================================================================
> +
> +Overview
> +--------
> +
> +This sample is intended as a demonstration of the basic components of a DPDK
> +forwarding application and example of how to use IOAT driver API to make
> +packets copies.
> +
> +Also while forwarding, the MAC addresses are affected as follows:
> +
> +*   The source MAC address is replaced by the TX port MAC address
> +
> +*   The destination MAC address is replaced by  02:00:00:00:00:TX_PORT_ID
> +
> +This application can be used to compare performance of using software packet
> +copy with copy done using a DMA device for different sizes of packets.
> +The example will print out statistics each second. The stats shows
> +received/send packets and packets dropped or failed to copy.
> +
> +Compiling the Application
> +-------------------------
> +
> +To compile the sample application see :doc:`compiling`.
> +
> +The application is located in the ``ioat`` sub-directory.
> +
> +
> +Running the Application
> +-----------------------
> +
> +In order to run the hardware copy application, the copying device
> +needs to be bound to user-space IO driver.
> +
> +Refer to the *IOAT Rawdev Driver for Intel\ |reg| QuickData Technology*
> +guide for information on using the driver.
> +

The document is not called that, as the IOAT guide is just part of the
overall rawdev document. So I suggest you just reference the rawdev guide.

> +The application requires a number of command line options:
> +
> +.. code-block:: console
> +
> +    ./build/ioatfwd [EAL options] -- -p MASK [-q NQ] [-s RS] [-c <sw|hw>]
> +        [--[no-]mac-updating]
> +
> +where,
> +
> +*   p MASK: A hexadecimal bitmask of the ports to configure

Is this a mandatory parameter, or does the app use all detected ports by
default, e.g. like testpmd?
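
For context, DPDK sample apps commonly parse such a mask along the lines of
the sketch below. This is not taken from the patch: the helper names
``example_parse_portmask()`` and ``example_port_enabled()`` are hypothetical,
and whether ioatfwd treats -p as mandatory is exactly the open question.
The sketch assumes a zero or malformed mask should be rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not from the patch: parse a hexadecimal port mask
 * such as "0x3". Returns the mask, or -1 if the string is empty, is not
 * valid hex, is zero, or does not fit in 32 bits.
 */
static int64_t
example_parse_portmask(const char *arg)
{
    char *end = NULL;
    unsigned long long mask;

    assert(arg != NULL);
    if (arg[0] == '\0' || arg[0] == '-')
        return -1;

    errno = 0;
    mask = strtoull(arg, &end, 16);
    /* Reject overflow, no digits at all, or trailing garbage. */
    if (errno != 0 || end == arg || *end != '\0')
        return -1;
    if (mask == 0 || mask > UINT32_MAX)
        return -1;

    return (int64_t)mask;
}

/* Returns non-zero if the given port is enabled in the mask. */
static int
example_port_enabled(int64_t mask, uint16_t portid)
{
    return portid < 32 &&
        ((uint64_t)mask & (UINT64_C(1) << portid)) != 0;
}
```

With a check like this, a mask of 0x3 enables ports 0 and 1 only, which
matches the second example command in the guide.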

> +
> +*   q NQ: Number of Rx queues used per port equivalent to CBDMA channels
> +    per port
> +
> +*   c CT: Performed packet copy type: software (sw) or hardware using
> +    DMA (hw)

What is the default? Same for next two parameters.

> +
> +*   s RS: Size of IOAT rawdev ring for hardware copy mode or rte_ring for
> +    software copy mode
> +
> +*   --[no-]mac-updating: Whether MAC address of packets should be changed
> +    or not
> +
> +The application can be launched in various configurations depending on
> +provided parameters. Each port can use up to 2 lcores: one of them receives

The app uses 2 data plane cores in total, rather than 2 per port, I believe.
It would be good to explain the difference here: with 2 cores, the copies
are done on one core and the MAC updates on the second one.
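
For illustration only, the MAC update that the guide's overview describes
(source set to the TX port MAC, destination set to 02:00:00:00:00:TX_PORT_ID)
amounts to something like the sketch below. The struct and function names
are hypothetical stand-ins, not the DPDK ``rte_ether`` types, and this is
not the patch's ``update_mac_addrs()``.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_ETHER_ADDR_LEN 6

/* Hypothetical stand-in for an Ethernet header. */
struct example_ether_hdr {
    uint8_t dst[EXAMPLE_ETHER_ADDR_LEN];
    uint8_t src[EXAMPLE_ETHER_ADDR_LEN];
    uint16_t ether_type;
};

/*
 * Rewrite both MAC addresses of one frame:
 *   destination -> 02:00:00:00:00:<tx_port_id>
 *   source      -> MAC address of the TX port
 * Only the low 8 bits of tx_port_id fit into the last address byte.
 */
static void
example_update_mac_addrs(struct example_ether_hdr *eth,
    const uint8_t *port_mac, uint16_t tx_port_id)
{
    assert(eth != NULL && port_mac != NULL);

    memset(eth->dst, 0, EXAMPLE_ETHER_ADDR_LEN);
    eth->dst[0] = 0x02;
    eth->dst[EXAMPLE_ETHER_ADDR_LEN - 1] = (uint8_t)tx_port_id;

    memcpy(eth->src, port_mac, EXAMPLE_ETHER_ADDR_LEN);
}
```

In the two-core layout this step would run on the second core, after the
copy has completed and before the burst is transmitted.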

> +incoming traffic and makes a copy of each packet. The second lcore then
> +updates MAC address and sends the copy. If one lcore per port is used,
> +both operations are done sequentially. For each configuration an additional
> +lcore is needed since master lcore in use which is responsible for

... since the master lcore does not handle traffic but is responsible for

> +configuration, statistics printing and safe deinitialization of all ports
> +and devices.

s/deinitialization/shutdown/

> +
> +The application can use a maximum of 8 ports.

Is this a hard limit in the app? If so, explain why. I see the stats
arrays are limited by "RTE_MAX_ETHPORTS".

> +
> +To run the application in a Linux environment with 3 lcores (one of them
> +is master lcore), 1 port (port 0), software copying and MAC updating issue
> +the command:

s/1 port/a single port/

s/one of them is master lcore/the master lcore, plus two forwarding cores/

Similar comments would apply to text immediately below too.

> +
> +.. code-block:: console
> +
> +    $ ./build/ioatfwd -l 0-2 -n 2 -- -p 0x1 --mac-updating -c sw
> +
> +To run the application in a Linux environment with 2 lcores (one of them
> +is master lcore), 2 ports (ports 0 and 1), hardware copying and no MAC
> +updating issue the command:
> +
> +.. code-block:: console
> +
> +    $ ./build/ioatfwd -l 0-1 -n 1 -- -p 0x3 --no-mac-updating -c hw
> +
> +Refer to the *DPDK Getting Started Guide* for general information on
> +running applications and the Environment Abstraction Layer (EAL) options.
> +
> +Explanation
> +-----------
> +
> +The following sections provide an explanation of the main components of the
> +code.
> +
> +All DPDK library functions used in the sample code are prefixed with
> +``rte_`` and are explained in detail in the *DPDK API Documentation*.
> +
> +
> +The Main Function
> +~~~~~~~~~~~~~~~~~
> +
> +The ``main()`` function performs the initialization and calls the execution
> +threads for each lcore.
> +
> +The first task is to initialize the Environment Abstraction Layer (EAL).
> +The ``argc`` and ``argv`` arguments are provided to the ``rte_eal_init()``
> +function. The value returned is the number of parsed arguments:
> +
> +.. code-block:: c
> +
> +    /* init EAL */
> +    ret = rte_eal_init(argc, argv);
> +    if (ret < 0)
> +        rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
> +
> +
> +The ``main()`` also allocates a mempool to hold the mbufs (Message Buffers)
> +used by the application:
> +
> +.. code-block:: c
> +
> +    nb_mbufs = RTE_MAX(rte_eth_dev_count_avail() * (nb_rxd + nb_txd
> +        + MAX_PKT_BURST + rte_lcore_count() * MEMPOOL_CACHE_SIZE),
> +        MIN_POOL_SIZE);
> +
> +    /* Create the mbuf pool */
> +    ioat_pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", nb_mbufs,
> +        MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
> +        rte_socket_id());
> +    if (ioat_pktmbuf_pool == NULL)
> +        rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n");
> +
> +Mbufs are the packet buffer structure used by DPDK. They are explained in
> +detail in the "Mbuf Library" section of the *DPDK Programmer's Guide*.
> +
> +The ``main()`` function also initializes the ports:
> +
> +.. code-block:: c
> +
> +    /* Initialise each port */
> +    RTE_ETH_FOREACH_DEV(portid) {
> +        port_init(portid, ioat_pktmbuf_pool);
> +    }
> +
> +Each port is configured using ``port_init()``:
> +
> +.. code-block:: c
> +
> +     /*
> +     * Initializes a given port using global settings and with the RX buffers
> +     * coming from the mbuf_pool passed as a parameter.
> +     */
> +    static inline void
> +    port_init(uint16_t portid, struct rte_mempool *mbuf_pool, uint16_t nb_queues)
> +    {
> +        /* configuring port to use RSS for multiple RX queues */
> +        static const struct rte_eth_conf port_conf = {
> +            .rxmode = {
> +                .mq_mode        = ETH_MQ_RX_RSS,
> +                .max_rx_pkt_len = RTE_ETHER_MAX_LEN
> +            },
> +            .rx_adv_conf = {
> +                .rss_conf = {
> +                    .rss_key = NULL,
> +                    .rss_hf = ETH_RSS_PROTO_MASK,
> +                }
> +            }
> +        };
> +
> +        struct rte_eth_rxconf rxq_conf;
> +        struct rte_eth_txconf txq_conf;
> +        struct rte_eth_conf local_port_conf = port_conf;
> +        struct rte_eth_dev_info dev_info;
> +        int ret, i;
> +
> +        /* Skip ports that are not enabled */
> +        if ((ioat_enabled_port_mask & (1 << portid)) == 0) {
> +            printf("Skipping disabled port %u\n", portid);
> +            return;
> +        }
> +
> +        /* Init port */
> +        printf("Initializing port %u... ", portid);
> +        fflush(stdout);
> +        rte_eth_dev_info_get(portid, &dev_info);
> +        local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
> +            dev_info.flow_type_rss_offloads;
> +        if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
> +            local_port_conf.txmode.offloads |=
> +                DEV_TX_OFFLOAD_MBUF_FAST_FREE;
> +        ret = rte_eth_dev_configure(portid, nb_queues, 1, &local_port_conf);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE, "Cannot configure device:"
> +                " err=%d, port=%u\n", ret, portid);
> +
> +        ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd,
> +                            &nb_txd);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot adjust number of descriptors: err=%d, port=%u\n",
> +                ret, portid);
> +
> +        rte_eth_macaddr_get(portid, &ioat_ports_eth_addr[portid]);
> +
> +        /* Init Rx queues */
> +        rxq_conf = dev_info.default_rxconf;
> +        rxq_conf.offloads = local_port_conf.rxmode.offloads;
> +        for (i = 0; i < nb_queues; i++) {
> +            ret = rte_eth_rx_queue_setup(portid, i, nb_rxd,
> +                rte_eth_dev_socket_id(portid), &rxq_conf,
> +                mbuf_pool);
> +            if (ret < 0)
> +                rte_exit(EXIT_FAILURE,
> +                    "rte_eth_rx_queue_setup:err=%d,port=%u, queue_id=%u\n",
> +                    ret, portid, i);
> +        }
> +
> +        /* Init one TX queue on each port */
> +        txq_conf = dev_info.default_txconf;
> +        txq_conf.offloads = local_port_conf.txmode.offloads;
> +        ret = rte_eth_tx_queue_setup(portid, 0, nb_txd,
> +                rte_eth_dev_socket_id(portid),
> +                &txq_conf);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "rte_eth_tx_queue_setup:err=%d,port=%u\n",
> +                ret, portid);
> +
> +        /* Initialize TX buffers */
> +        tx_buffer[portid] = rte_zmalloc_socket("tx_buffer",
> +                RTE_ETH_TX_BUFFER_SIZE(MAX_PKT_BURST), 0,
> +                rte_eth_dev_socket_id(portid));
> +        if (tx_buffer[portid] == NULL)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot allocate buffer for tx on port %u\n",
> +                portid);
> +
> +        rte_eth_tx_buffer_init(tx_buffer[portid], MAX_PKT_BURST);
> +
> +        ret = rte_eth_tx_buffer_set_err_callback(tx_buffer[portid],
> +                rte_eth_tx_buffer_count_callback,
> +                &port_statistics.tx_dropped[portid]);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot set error callback for tx buffer on port %u\n",
> +                portid);
> +
> +        /* Start device */
> +        ret = rte_eth_dev_start(portid);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "rte_eth_dev_start:err=%d, port=%u\n",
> +                ret, portid);
> +
> +        rte_eth_promiscuous_enable(portid);
> +
> +        printf("Port %u, MAC address: %02X:%02X:%02X:%02X:%02X:%02X\n\n",
> +                portid,
> +                ioat_ports_eth_addr[portid].addr_bytes[0],
> +                ioat_ports_eth_addr[portid].addr_bytes[1],
> +                ioat_ports_eth_addr[portid].addr_bytes[2],
> +                ioat_ports_eth_addr[portid].addr_bytes[3],
> +                ioat_ports_eth_addr[portid].addr_bytes[4],
> +                ioat_ports_eth_addr[portid].addr_bytes[5]);
> +
> +        cfg.ports[cfg.nb_ports].rxtx_port = portid;
> +        cfg.ports[cfg.nb_ports++].nb_queues = nb_queues;
> +    }
> +

This code is probably quite similar to that in other sample apps, so I
don't think we need to include the full function here. It makes updating
the code more difficult, so just refer to the function as doing the port
init and leave it at that, I think. The snippets below give enough detail.

> +The Ethernet ports are configured with local settings using the
> +``rte_eth_dev_configure()`` function and the ``port_conf`` struct.
> +The RSS is enabled so that multiple Rx queues could be used for
> +packet receiving and copying by multiple CBDMA channels per port:
> +
> +.. code-block:: c
> +
> +    /* configuring port to use RSS for multiple RX queues */
> +    static const struct rte_eth_conf port_conf = {
> +        .rxmode = {
> +            .mq_mode        = ETH_MQ_RX_RSS,
> +            .max_rx_pkt_len = RTE_ETHER_MAX_LEN
> +        },
> +        .rx_adv_conf = {
> +            .rss_conf = {
> +                .rss_key = NULL,
> +                .rss_hf = ETH_RSS_PROTO_MASK,
> +            }
> +        }
> +    };
> +
> +For this example the ports are set up with the number of Rx queues provided
> +with -q option and 1 Tx queue using the ``rte_eth_rx_queue_setup()``
> +and ``rte_eth_tx_queue_setup()`` functions.
> +
> +The Ethernet port is then started:
> +
> +.. code-block:: c
> +
> +    ret = rte_eth_dev_start(portid);
> +    if (ret < 0)
> +        rte_exit(EXIT_FAILURE, "rte_eth_dev_start:err=%d, port=%u\n",
> +            ret, portid);
> +
> +
> +Finally the Rx port is set in promiscuous mode:
> +
> +.. code-block:: c
> +
> +    rte_eth_promiscuous_enable(portid);
> +
> +
> +After that each port application assigns resources needed.
> +
> +.. code-block:: c
> +
> +    check_link_status(ioat_enabled_port_mask);
> +
> +    if (!cfg.nb_ports) {
> +        rte_exit(EXIT_FAILURE,
> +            "All available ports are disabled. Please set portmask.\n");
> +    }
> +
> +    /* Check if there is enough lcores for all ports. */
> +    cfg.nb_lcores = rte_lcore_count() - 1;
> +    if (cfg.nb_lcores < 1)
> +        rte_exit(EXIT_FAILURE,
> +            "There should be at least one slave lcore.\n");
> +
> +    ret = 0;
> +
> +    if (copy_mode == COPY_MODE_IOAT_NUM) {
> +        assign_rawdevs();
> +    } else /* copy_mode == COPY_MODE_SW_NUM */ {
> +        assign_rings();
> +    }
> +
> +A link status is checked of each port enabled by port mask
> +using ``check_link_status()`` function.
> +

I don't think this block needs to be covered. No need to go into everything
in detail, just focus on the key parts of the app that are unique to it,
i.e. the copying and passing mbufs between threads parts.

> +.. code-block:: c
> +
> +    /* check link status, return true if at least one port is up */
> +    static int
> +    check_link_status(uint32_t port_mask)
> +    {
> +        uint16_t portid;
> +        struct rte_eth_link link;
> +        int retval = 0;
> +
> +        printf("\nChecking link status\n");
> +        RTE_ETH_FOREACH_DEV(portid) {
> +            if ((port_mask & (1 << portid)) == 0)
> +                continue;
> +
> +            memset(&link, 0, sizeof(link));
> +            rte_eth_link_get(portid, &link);
> +
> +            /* Print link status */
> +            if (link.link_status) {
> +                printf(
> +                    "Port %d Link Up. Speed %u Mbps - %s\n",
> +                    portid, link.link_speed,
> +                    (link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
> +                    ("full-duplex") : ("half-duplex\n"));
> +                retval = 1;
> +            } else
> +                printf("Port %d Link Down\n", portid);
> +        }
> +        return retval;
> +    }
> +
  
Bruce Richardson Sept. 27, 2019, 1:22 p.m. UTC | #4
On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and
> code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 

<snip>

> +Depending on mode set (whether copy should be done by software or by hardware)
> +special structures are assigned to each port. If software copy was chosen,
> +application have to assign ring structures for packet exchanging between lcores
> +assigned to ports.
> +
> +.. code-block:: c
> +
> +    static void
> +    assign_rings(void)
> +    {
> +        uint32_t i;
> +
> +        for (i = 0; i < cfg.nb_ports; i++) {
> +            char ring_name[20];
> +
> +            snprintf(ring_name, 20, "rx_to_tx_ring_%u", i);
> +            /* Create ring for inter core communication */
> +            cfg.ports[i].rx_to_tx_ring = rte_ring_create(
> +                    ring_name, ring_size,
> +                    rte_socket_id(), RING_F_SP_ENQ);
> +
> +            if (cfg.ports[i].rx_to_tx_ring == NULL)
> +                rte_exit(EXIT_FAILURE, "%s\n",
> +                        rte_strerror(rte_errno));
> +        }
> +    }
> +
> +
> +When using hardware copy each Rx queue of the port is assigned an
> +IOAT device (``assign_rawdevs()``) using IOAT Rawdev Driver API
> +functions:
> +
> +.. code-block:: c
> +
> +    static void
> +    assign_rawdevs(void)
> +    {
> +        uint16_t nb_rawdev = 0, rdev_id = 0;
> +        uint32_t i, j;
> +
> +        for (i = 0; i < cfg.nb_ports; i++) {
> +            for (j = 0; j < cfg.ports[i].nb_queues; j++) {
> +                struct rte_rawdev_info rdev_info = { 0 };
> +
> +                do {
> +                    if (rdev_id == rte_rawdev_count())
> +                        goto end;
> +                    rte_rawdev_info_get(rdev_id++, &rdev_info);
> +                } while (strcmp(rdev_info.driver_name,
> +                    IOAT_PMD_RAWDEV_NAME_STR) != 0);
> +
> +                cfg.ports[i].ioat_ids[j] = rdev_id - 1;
> +                configure_rawdev_queue(cfg.ports[i].ioat_ids[j]);
> +                ++nb_rawdev;
> +            }
> +        }
> +    end:
> +        if (nb_rawdev < cfg.nb_ports * cfg.ports[0].nb_queues)
> +            rte_exit(EXIT_FAILURE,
> +                "Not enough IOAT rawdevs (%u) for all queues (%u).\n",
> +                nb_rawdev, cfg.nb_ports * cfg.ports[0].nb_queues);
> +        RTE_LOG(INFO, IOAT, "Number of used rawdevs: %u.\n", nb_rawdev);
> +    }
> +
> +
> +The initialization of hardware device is done by ``rte_rawdev_configure()``
> +function and ``rte_rawdev_info`` struct.

... using ``rte_rawdev_info`` struct

> After configuration the device is
> +started using ``rte_rawdev_start()`` function. Each of the above operations
> +is done in ``configure_rawdev_queue()``.

In the block below, there is no mention of where the dev_config structure
comes from. I presume it's a global variable, so maybe mention that in the
text.

> +
> +.. code-block:: c
> +
> +    static void
> +    configure_rawdev_queue(uint32_t dev_id)
> +    {
> +        struct rte_rawdev_info info = { .dev_private = &dev_config };
> +
> +        /* Configure hardware copy device */
> +        dev_config.ring_size = ring_size;
> +
> +        if (rte_rawdev_configure(dev_id, &info) != 0) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error with rte_rawdev_configure()\n");
> +        }
> +        rte_rawdev_info_get(dev_id, &info);
> +        if (dev_config.ring_size != ring_size) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error, ring size is not %d (%d)\n",
> +                ring_size, (int)dev_config.ring_size);
> +        }
> +        if (rte_rawdev_start(dev_id) != 0) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error with rte_rawdev_start()\n");
> +        }
> +    }
> +
> +If initialization is successful memory for hardware device
> +statistics is allocated.

Missing "," after successful.
Where is this memory allocated? Is it done in main or elsewhere?
> +
> +Finally ``main()`` functions starts all processing lcores and starts

s/functions/function/
s/processing lcores/packet handling lcores/

> +printing stats in a loop on master lcore. The application can be

s/master lcore/the master lcore/

> +interrupted and closed using ``Ctrl-C``. The master lcore waits for
> +all slave processes to finish, deallocates resources and exits.
> +
> +The processing lcores launching function are described below.
> +
> +The Lcores Launching Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +As described above ``main()`` function invokes ``start_forwarding_cores()``

Missing "," after above.

> +function in order to start processing for each lcore:
> +
> +.. code-block:: c
> +
> +    static void start_forwarding_cores(void)
> +    {
> +        uint32_t lcore_id = rte_lcore_id();
> +
> +        RTE_LOG(INFO, IOAT, "Entering %s on lcore %u\n",
> +                __func__, rte_lcore_id());
> +
> +        if (cfg.nb_lcores == 1) {
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)rxtx_main_loop,
> +                NULL, lcore_id);
> +        } else if (cfg.nb_lcores > 1) {
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)rx_main_loop,
> +                NULL, lcore_id);
> +
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)tx_main_loop, NULL,
> +                lcore_id);
> +        }
> +    }
> +
> +The function launches Rx/Tx processing functions on configured lcores
> +for each port using ``rte_eal_remote_launch()``. The configured ports,

Remove "for each port"

> +their number and number of assigned lcores are stored in user-defined
> +``rxtx_transmission_config`` struct that is initialized before launching

s/is/has been/
Did you describe how that structure was set up previously?

> +tasks:
> +
> +.. code-block:: c
> +
> +    struct rxtx_transmission_config {
> +        struct rxtx_port_config ports[RTE_MAX_ETHPORTS];
> +        uint16_t nb_ports;
> +        uint16_t nb_lcores;
> +    };
> +
> +The Lcores Processing Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For receiving packets on each port an ``ioat_rx_port()`` function is used.

Missing "," after port.
s/an/the/

> +The function receives packets on each configured Rx queue. Depending on mode

s/mode/the mode/

> +the user chose, it will enqueue packets to IOAT rawdev channels and then invoke
> +copy process (hardware copy), or perform software copy of each packet using
> +``pktmbuf_sw_copy()`` function and enqueue them to 1 rte_ring:

s/1 rte_ring/an rte_ring/

> +
> +.. code-block:: c
> +
> +    /* Receive packets on one port and enqueue to IOAT rawdev or rte_ring. */
> +    static void
> +    ioat_rx_port(struct rxtx_port_config *rx_config)
> +    {
> +        uint32_t nb_rx, nb_enq, i, j;
> +        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +        for (i = 0; i < rx_config->nb_queues; i++) {
> +
> +            nb_rx = rte_eth_rx_burst(rx_config->rxtx_port, i,
> +                pkts_burst, MAX_PKT_BURST);
> +
> +            if (nb_rx == 0)
> +                continue;
> +
> +            port_statistics.rx[rx_config->rxtx_port] += nb_rx;
> +
> +            if (copy_mode == COPY_MODE_IOAT_NUM) {
> +                /* Perform packet hardware copy */
> +                nb_enq = ioat_enqueue_packets(pkts_burst,
> +                    nb_rx, rx_config->ioat_ids[i]);
> +                if (nb_enq > 0)
> +                    rte_ioat_do_copies(rx_config->ioat_ids[i]);
> +            } else {
> +                /* Perform packet software copy, free source packets */
> +                int ret;
> +                struct rte_mbuf *pkts_burst_copy[MAX_PKT_BURST];
> +
> +                ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
> +                    (void *)pkts_burst_copy, nb_rx);
> +
> +                if (unlikely(ret < 0))
> +                    rte_exit(EXIT_FAILURE,
> +                        "Unable to allocate memory.\n");
> +
> +                for (j = 0; j < nb_rx; j++)
> +                    pktmbuf_sw_copy(pkts_burst[j],
> +                        pkts_burst_copy[j]);
> +
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)pkts_burst, nb_rx);
> +
> +                nb_enq = rte_ring_enqueue_burst(
> +                    rx_config->rx_to_tx_ring,
> +                    (void *)pkts_burst_copy, nb_rx, NULL);
> +
> +                /* Free any not enqueued packets. */
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&pkts_burst_copy[nb_enq],
> +                    nb_rx - nb_enq);
> +            }
> +
> +            port_statistics.copy_dropped[rx_config->rxtx_port] +=
> +                (nb_rx - nb_enq);
> +        }
> +    }
> +
> +The packets are received in burst mode using ``rte_eth_rx_burst()``
> +function. When using hardware copy mode the packets are enqueued in
> +copying device's buffer using ``ioat_enqueue_packets()`` which calls
> +``rte_ioat_enqueue_copy()``. When all received packets are in the
> +buffer the copies are invoked by calling ``rte_ioat_do_copies()``.

s/copies are invoked/copy operations are started/

> +Function ``rte_ioat_enqueue_copy()`` operates on physical address of
> +the packet. Structure ``rte_mbuf`` contains only physical address to
> +start of the data buffer (``buf_iova``). Thus the address is shifted

s/shifted/adjusted/

> +by ``addr_offset`` value in order to get pointer to ``rearm_data``

s/pointer to/the address of/

> +member of ``rte_mbuf``. That way the packet is copied all at once
> +(with data and metadata).

"That way both the packet data and metadata can be copied in a single
operation".
Should also note that this shortcut can be used because the mbufs are
"direct" mbufs allocated by the app. If another app uses external buffers,
or indirect mbufs, then multiple copy operations must be used.
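
A rough sketch of why the single-operation copy works for direct mbufs is
below. It uses a hypothetical ``example_mbuf`` struct rather than the real
``rte_mbuf`` layout, an integer ``buf_iova`` that simply holds the buffer's
virtual address, and ``memcpy()`` standing in for the DMA engine. The only
assumption that matters is that the data buffer lives in the same
allocation as the metadata, at a fixed distance after ``rearm_data``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_BUF_LEN 256

/* Hypothetical stand-in for a direct mbuf: metadata, then its own buffer. */
struct example_mbuf {
    void *buf_addr;       /* virtual address of the data buffer */
    uintptr_t buf_iova;   /* mock "IO address" of the data buffer */
    uint64_t rearm_data;  /* first metadata field to be copied */
    uint16_t data_len;    /* number of valid bytes in buf */
    uint8_t buf[EXAMPLE_BUF_LEN];
};

static void
example_mbuf_init(struct example_mbuf *m)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
    m->buf_addr = m->buf;
    m->buf_iova = (uintptr_t)m->buf;
}

/* Distance from rearm_data to the start of the data buffer. */
static size_t
example_addr_offset(const struct example_mbuf *m)
{
    return (size_t)((uintptr_t)m->buf_addr - (uintptr_t)&m->rearm_data);
}

/*
 * Copy metadata and packet data in one operation by starting the copy
 * addr_offset bytes before the data buffer. The buffer addresses of dst
 * sit before rearm_data, so they are left untouched.
 */
static void
example_copy_single_op(struct example_mbuf *dst,
    const struct example_mbuf *src)
{
    const size_t off = example_addr_offset(src);

    assert(src->data_len <= EXAMPLE_BUF_LEN);
    memcpy((void *)(dst->buf_iova - off),
        (const void *)(src->buf_iova - off),
        (size_t)src->data_len + off);
}
```

If the buffer were external or the mbuf indirect, ``buf_addr`` would no
longer sit at a fixed distance from ``rearm_data`` and the offset trick
would not apply, hence the need for separate copy operations.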

> +
> +.. code-block:: c
> +
> +    static uint32_t
> +    ioat_enqueue_packets(struct rte_mbuf **pkts,
> +        uint32_t nb_rx, uint16_t dev_id)
> +    {
> +        int ret;
> +        uint32_t i;
> +        struct rte_mbuf *pkts_copy[MAX_PKT_BURST];
> +
> +        const uint64_t addr_offset = RTE_PTR_DIFF(pkts[0]->buf_addr,
> +            &pkts[0]->rearm_data);
> +
> +        ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
> +                (void *)pkts_copy, nb_rx);
> +
> +        if (unlikely(ret < 0))
> +            rte_exit(EXIT_FAILURE, "Unable to allocate memory.\n");
> +
> +        for (i = 0; i < nb_rx; i++) {
> +            /* Perform data copy */
> +            ret = rte_ioat_enqueue_copy(dev_id,
> +                pkts[i]->buf_iova
> +                    - addr_offset,
> +                pkts_copy[i]->buf_iova
> +                    - addr_offset,
> +                rte_pktmbuf_data_len(pkts[i])
> +                    + addr_offset,
> +                (uintptr_t)pkts[i],
> +                (uintptr_t)pkts_copy[i],
> +                0 /* nofence */);
> +
> +            if (ret != 1)
> +                break;
> +        }
> +
> +        ret = i;
> +        /* Free any not enqueued packets. */
> +        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts[i], nb_rx - i);
> +        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts_copy[i],
> +            nb_rx - i);
> +
> +        return ret;
> +    }
> +
> +
> +All done copies are processed by ``ioat_tx_port()`` function. When using

s/done/completed/

> +hardware copy mode the function invokes ``rte_ioat_completed_copies()``
> +on each assigned IOAT channel to gather copied packets. If software copy
> +mode is used the function dequeues copied packets from the rte_ring. Then each
> +packet MAC address is changed if it was enabled. After that copies are sent
> +in burst mode using `` rte_eth_tx_burst()``.
> +
> +
> +.. code-block:: c
> +
> +    /* Transmit packets from IOAT rawdev/rte_ring for one port. */
> +    static void
> +    ioat_tx_port(struct rxtx_port_config *tx_config)
> +    {
> +        uint32_t i, j, nb_dq = 0;
> +        struct rte_mbuf *mbufs_src[MAX_PKT_BURST];
> +        struct rte_mbuf *mbufs_dst[MAX_PKT_BURST];
> +
> +        if (copy_mode == COPY_MODE_IOAT_NUM) {
> +            for (i = 0; i < tx_config->nb_queues; i++) {
> +                /* Deque the mbufs from IOAT device. */
> +                nb_dq = rte_ioat_completed_copies(
> +                    tx_config->ioat_ids[i], MAX_PKT_BURST,
> +                    (void *)mbufs_src, (void *)mbufs_dst);
> +
> +                if (nb_dq == 0)
> +                    break;
> +
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)mbufs_src, nb_dq);
> +
> +                /* Update macs if enabled */
> +                if (mac_updating) {
> +                    for (j = 0; j < nb_dq; j++)
> +                        update_mac_addrs(mbufs_dst[j],
> +                            tx_config->rxtx_port);
> +                }
> +
> +                const uint16_t nb_tx = rte_eth_tx_burst(
> +                    tx_config->rxtx_port, 0,
> +                    (void *)mbufs_dst, nb_dq);
> +
> +                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
> +
> +                /* Free any unsent packets. */
> +                if (unlikely(nb_tx < nb_dq))
> +                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&mbufs_dst[nb_tx],
> +                        nb_dq - nb_tx);
> +            }
> +        }
> +        else {
> +            for (i = 0; i < tx_config->nb_queues; i++) {
> +                /* Dequeue the mbufs from the rte_ring. */
> +                nb_dq = rte_ring_dequeue_burst(tx_config->rx_to_tx_ring,
> +                    (void *)mbufs_dst, MAX_PKT_BURST, NULL);
> +
> +                if (nb_dq == 0)
> +                    return;
> +
> +                /* Update macs if enabled */
> +                if (mac_updating) {
> +                    for (j = 0; j < nb_dq; j++)
> +                        update_mac_addrs(mbufs_dst[j],
> +                            tx_config->rxtx_port);
> +                }
> +
> +                const uint16_t nb_tx = rte_eth_tx_burst(tx_config->rxtx_port,
> +                    0, (void *)mbufs_dst, nb_dq);
> +
> +                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
> +
> +                /* Free any unsent packets. */
> +                if (unlikely(nb_tx < nb_dq))
> +                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&mbufs_dst[nb_tx],
> +                        nb_dq - nb_tx);
> +            }
> +        }
> +    }
> +
> +The Packet Copying Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Software packet copy is performed by the user-defined function
> +``pktmbuf_sw_copy()``. It copies a whole packet by copying the
> +metadata from the source packet to a new mbuf, and then copying the data
> +chunk of the source packet. Both memory copies are done using
> +``rte_memcpy()``:
> +
> +.. code-block:: c
> +
> +    static inline void
> +    pktmbuf_sw_copy(struct rte_mbuf *src, struct rte_mbuf *dst)
> +    {
> +        /* Copy packet metadata */
> +        rte_memcpy(&dst->rearm_data,
> +            &src->rearm_data,
> +            offsetof(struct rte_mbuf, cacheline1)
> +                - offsetof(struct rte_mbuf, rearm_data));
> +
> +        /* Copy packet data */
> +        rte_memcpy(rte_pktmbuf_mtod(dst, char *),
> +            rte_pktmbuf_mtod(src, char *), src->data_len);
> +    }
> +
> +The metadata in this example is copied from ``rearm_data`` member of
> +``rte_mbuf`` struct up to ``cacheline1``.
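For readers of the thread, the "copy a range of fields" idea can be tried outside DPDK. The following is only a minimal sketch under stated assumptions: ``struct mock_mbuf`` and ``mock_copy_metadata()`` are hypothetical names, the layout is a simplified stand-in and not the real ``rte_mbuf``, and plain ``memcpy()`` replaces ``rte_memcpy()``. It shows how ``offsetof()`` bounds the copied region so that fields before ``rearm_data`` and from ``cacheline1`` onwards stay untouched in the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_mbuf; the field names and layout are
 * illustrative only and do not match the real DPDK definition. */
struct mock_mbuf {
    void *buf_addr;       /* per-mbuf buffer pointer: outside the copied range */
    uint64_t rearm_data;  /* first field inside the copied range */
    uint32_t pkt_len;
    uint16_t data_len;
    uint16_t vlan_tci;
    uint64_t cacheline1;  /* first field past the copied range */
    void *pool;           /* owning pool: outside the copied range */
};

/* Copy every byte from rearm_data up to, but not including, cacheline1. */
static void
mock_copy_metadata(struct mock_mbuf *dst, const struct mock_mbuf *src)
{
    const size_t start = offsetof(struct mock_mbuf, rearm_data);
    const size_t end = offsetof(struct mock_mbuf, cacheline1);

    memcpy((char *)dst + start, (const char *)src + start, end - start);
}
```

The destination keeps its own buffer pointer and pool, which is presumably why the copy in the guide starts at ``rearm_data`` rather than at the beginning of the struct.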
> +
> +In order to understand why software packet copying is done as shown
> +above please refer to the "Mbuf Library" section of the
> +*DPDK Programmer's Guide*.
> \ No newline at end of file

Use a text editor that adds a newline automatically :-)

/Bruce
  
Marcin Baran Sept. 27, 2019, 2:14 p.m. UTC | #5
-----Original Message-----
From: Bruce Richardson <bruce.richardson@intel.com> 
Sent: Friday, September 27, 2019 12:36 PM
To: Baran, MarcinX <marcinx.baran@intel.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v5 6/6] doc/guides/: provide IOAT sample app guide

On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 
> diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
> index f23f8f59e..a6a1d9e7a 100644
> --- a/doc/guides/sample_app_ug/index.rst
> +++ b/doc/guides/sample_app_ug/index.rst
> @@ -23,6 +23,7 @@ Sample Applications User Guides
>      ip_reassembly
>      kernel_nic_interface
>      keep_alive
> +    ioat
>      l2_forward_crypto
>      l2_forward_job_stats
>      l2_forward_real_virtual
> diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
> index 90704194a..74462312f 100644
> --- a/doc/guides/sample_app_ug/intro.rst
> +++ b/doc/guides/sample_app_ug/intro.rst
> @@ -91,6 +91,10 @@ examples are highlighted below.
>    forwarding, or ``l3fwd`` application does forwarding based on Internet
>    Protocol, IPv4 or IPv6 like a simple router.
>  
> +* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
> +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev driver for
> +  copying packets between two threads.
> +
>  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
>    demonstrates how to distribute packets arriving on an Rx port to different
>    cores for processing and transmission.
> diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
> new file mode 100644
> index 000000000..69621673b
> --- /dev/null
> +++ b/doc/guides/sample_app_ug/ioat.rst
> @@ -0,0 +1,764 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2019 Intel Corporation.
> +
> +Sample Application of packet copying using Intel\ |reg| QuickData Technology
> +============================================================================
> +

Title is too long, you can drop the "Sample Application" at minimum, since this is part of the example applications guide document. Call the section "Packet Copying Using ..."

In order to get the proper (R) symbol, you also need to add an include to the file. Therefore add the following line just after the copyright:

.. include:: <isonum.txt>

/Bruce
[Marcin] Dropped "Sample Application" and added include.
  
Marcin Baran Sept. 27, 2019, 2:51 p.m. UTC | #6
-----Original Message-----
From: Bruce Richardson <bruce.richardson@intel.com> 
Sent: Friday, September 27, 2019 1:02 PM
To: Baran, MarcinX <marcinx.baran@intel.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v5 6/6] doc/guides/: provide IOAT sample app guide

On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 
> diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
> index f23f8f59e..a6a1d9e7a 100644
> --- a/doc/guides/sample_app_ug/index.rst
> +++ b/doc/guides/sample_app_ug/index.rst
> @@ -23,6 +23,7 @@ Sample Applications User Guides
>      ip_reassembly
>      kernel_nic_interface
>      keep_alive
> +    ioat
>      l2_forward_crypto
>      l2_forward_job_stats
>      l2_forward_real_virtual
> diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
> index 90704194a..74462312f 100644
> --- a/doc/guides/sample_app_ug/intro.rst
> +++ b/doc/guides/sample_app_ug/intro.rst
> @@ -91,6 +91,10 @@ examples are highlighted below.
>    forwarding, or ``l3fwd`` application does forwarding based on Internet
>    Protocol, IPv4 or IPv6 like a simple router.
>  
> +* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
> +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev driver for
> +  copying packets between two threads.
> +
>  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
>    demonstrates how to distribute packets arriving on an Rx port to different
>    cores for processing and transmission.
> diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
> new file mode 100644
> index 000000000..69621673b
> --- /dev/null
> +++ b/doc/guides/sample_app_ug/ioat.rst
> @@ -0,0 +1,764 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2019 Intel Corporation.
> +
> +Sample Application of packet copying using Intel\ |reg| QuickData Technology
> +============================================================================
> +
> +Overview
> +--------
> +
> +This sample is intended as a demonstration of the basic components of
> +a DPDK forwarding application and an example of how to use the IOAT
> +driver API to make packet copies.
> +
> +Also while forwarding, the MAC addresses are affected as follows:
> +
> +*   The source MAC address is replaced by the TX port MAC address
> +
> +*   The destination MAC address is replaced by  02:00:00:00:00:TX_PORT_ID
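The two rewriting rules listed above can be sketched in plain C. This is a hedged illustration, not the application's ``update_mac_addrs()``: ``struct mock_ether_hdr`` and ``mock_update_mac_addrs()`` are hypothetical names, and the header is a simplified stand-in for the DPDK Ethernet header type. It assumes the last byte of the destination address carries the TX port ID, as the 02:00:00:00:00:TX_PORT_ID pattern suggests.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MOCK_ETHER_ADDR_LEN 6

/* Hypothetical stand-in for an Ethernet header (not the DPDK type). */
struct mock_ether_hdr {
    uint8_t d_addr[MOCK_ETHER_ADDR_LEN];
    uint8_t s_addr[MOCK_ETHER_ADDR_LEN];
    uint16_t ether_type;
};

/* Rewrite both MAC addresses of a frame that is about to leave tx_port_id. */
static void
mock_update_mac_addrs(struct mock_ether_hdr *eth, const uint8_t *port_mac,
    uint16_t tx_port_id)
{
    static const uint8_t dst_prefix[] = { 0x02, 0x00, 0x00, 0x00, 0x00 };

    /* Destination becomes 02:00:00:00:00:TX_PORT_ID. */
    memcpy(eth->d_addr, dst_prefix, sizeof(dst_prefix));
    eth->d_addr[MOCK_ETHER_ADDR_LEN - 1] = (uint8_t)tx_port_id;

    /* Source becomes the MAC address of the TX port. */
    memcpy(eth->s_addr, port_mac, MOCK_ETHER_ADDR_LEN);
}
```

With a port MAC of AA:BB:CC:DD:EE:FF and TX port 3, the frame would leave with destination 02:00:00:00:00:03 and source AA:BB:CC:DD:EE:FF; the EtherType is left alone.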
> +
> +This application can be used to compare the performance of software
> +packet copy with copy done using a DMA device for different sizes of packets.
> +The example prints out statistics each second. The stats show the number of
> +received and sent packets, and of packets dropped or failed to copy.
> +
> +Compiling the Application
> +-------------------------
> +
> +To compile the sample application see :doc:`compiling`.
> +
> +The application is located in the ``ioat`` sub-directory.
> +
> +
> +Running the Application
> +-----------------------
> +
> +In order to run the hardware copy application, the copying device 
> +needs to be bound to user-space IO driver.
> +
> +Refer to the *IOAT Rawdev Driver for Intel\ |reg| QuickData 
> +Technology* guide for information on using the driver.
> +

The document is not called that, as the IOAT guide is just part of the overall rawdev document. So I suggest you just reference the rawdev guide.
[Marcin] I wanted to refer to the ioat guide in /doc/guides/rawdevs/ioat.rst, which has that title. Is there another document,
or did I reference this one incorrectly?

> +The application requires a number of command line options:
> +
> +.. code-block:: console
> +
> +    ./build/ioatfwd [EAL options] -- -p MASK [-q NQ] [-s RS] [-c <sw|hw>]
> +        [--[no-]mac-updating]
> +
> +where,
> +
> +*   p MASK: A hexadecimal bitmask of the ports to configure

Is this a mandatory parameter, or does the app use all detected ports by default, e.g. like testpmd?
[Marcin] Optional; the app uses all detected ports by default. Added a default value comment and tagged -p as
optional.

> +
> +*   q NQ: Number of Rx queues used per port equivalent to CBDMA channels
> +    per port
> +
> +*   c CT: Performed packet copy type: software (sw) or hardware using
> +    DMA (hw)

What is the default? Same for next two parameters.
[Marcin] Added default values description.

> +
> +*   s RS: Size of IOAT rawdev ring for hardware copy mode or rte_ring for
> +    software copy mode
> +
> +*   --[no-]mac-updating: Whether MAC address of packets should be changed
> +    or not
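As an illustration of how a hexadecimal port mask such as ``-p 0x3`` maps to ports 0 and 1, here is a small sketch. The helper names ``mock_parse_portmask()`` and ``mock_port_enabled()`` are hypothetical and are not taken from the ioatfwd sources; the parsing relies only on the standard ``strtoul()`` and treats any malformed string as an empty mask, which is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a hexadecimal port mask such as "0x3".
 * Returns 0 when the string is empty, malformed or out of range. */
static uint32_t
mock_parse_portmask(const char *arg)
{
    char *end = NULL;
    unsigned long mask;

    errno = 0;
    mask = strtoul(arg, &end, 16);
    if (errno != 0 || end == arg || *end != '\0' || mask > UINT32_MAX)
        return 0;
    return (uint32_t)mask;
}

/* Hypothetical helper: non-zero if bit port_id is set in the mask. */
static int
mock_port_enabled(uint32_t mask, unsigned int port_id)
{
    return port_id < 32 && (mask & (UINT32_C(1) << port_id)) != 0;
}
```

A mask of ``0x3`` enables ports 0 and 1 and leaves port 2 disabled, matching the two-port command line shown later in the guide.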
> +
> +The application can be launched in various configurations depending 
> +on provided parameters. Each port can use up to 2 lcores: one of them 
> +receives

The app uses 2 data plane cores, total, rather than 2 per-port, I believe.
It would be good to explain the difference here that with 2 cores the copies are done on one core, and the mac updates on the second one.
[Marcin] Changed the description accordingly.

> +incoming traffic and makes a copy of each packet. The second lcore 
> +then updates MAC address and sends the copy. If one lcore per port is 
> +used, both operations are done sequentially. For each configuration 
> +an additional lcore is needed since master lcore in use which is 
> +responsible for

... since the master lcore does not handle traffic but is responsible for
[Marcin] Changed the description accordingly.

> +configuration, statistics printing and safe deinitialization of all 
> +ports and devices.

s/deinitialization/shutdown/
[Marcin] Changed the description accordingly.

> +
> +The application can use a maximum of 8 ports.

Is this a hard limit in the app, if so explain why. I see the stats arrays are limited by "RTE_MAX_ETHPORTS".
[Marcin] The limit was set for simplicity, but also because the testing board had 16 CBDMA channels in total,
so when 8 ports are used, each of them can still be set to work with more than one Rx queue.

> +
> +To run the application in a Linux environment with 3 lcores (one of 
> +them is master lcore), 1 port (port 0), software copying and MAC 
> +updating issue the command:

s/1 port/a single port/

s/one of them is master lcore/the master lcore, plus two forwarding cores/

Similar comments would apply to text immediately below too.
[Marcin] Changed the description accordingly.

> +
> +.. code-block:: console
> +
> +    $ ./build/ioatfwd -l 0-2 -n 2 -- -p 0x1 --mac-updating -c sw
> +
> +To run the application in a Linux environment with 2 lcores (one of 
> +them is master lcore), 2 ports (ports 0 and 1), hardware copying and 
> +no MAC updating issue the command:
> +
> +.. code-block:: console
> +
> +    $ ./build/ioatfwd -l 0-1 -n 1 -- -p 0x3 --no-mac-updating -c hw
> +
> +Refer to the *DPDK Getting Started Guide* for general information on 
> +running applications and the Environment Abstraction Layer (EAL) options.
> +
> +Explanation
> +-----------
> +
> +The following sections provide an explanation of the main components 
> +of the code.
> +
> +All DPDK library functions used in the sample code are prefixed with 
> +``rte_`` and are explained in detail in the *DPDK API Documentation*.
> +
> +
> +The Main Function
> +~~~~~~~~~~~~~~~~~
> +
> +The ``main()`` function performs the initialization and calls the 
> +execution threads for each lcore.
> +
> +The first task is to initialize the Environment Abstraction Layer (EAL).
> +The ``argc`` and ``argv`` arguments are provided to the 
> +``rte_eal_init()`` function. The value returned is the number of parsed arguments:
> +
> +.. code-block:: c
> +
> +    /* init EAL */
> +    ret = rte_eal_init(argc, argv);
> +    if (ret < 0)
> +        rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
> +
> +
> +The ``main()`` also allocates a mempool to hold the mbufs (Message 
> +Buffers) used by the application:
> +
> +.. code-block:: c
> +
> +    nb_mbufs = RTE_MAX(rte_eth_dev_count_avail() * (nb_rxd + nb_txd
> +        + MAX_PKT_BURST + rte_lcore_count() * MEMPOOL_CACHE_SIZE),
> +        MIN_POOL_SIZE);
> +
> +    /* Create the mbuf pool */
> +    ioat_pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", nb_mbufs,
> +        MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
> +        rte_socket_id());
> +    if (ioat_pktmbuf_pool == NULL)
> +        rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n");
> +
> +Mbufs are the packet buffer structure used by DPDK. They are 
> +explained in detail in the "Mbuf Library" section of the *DPDK Programmer's Guide*.
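The pool sizing expression above is simple arithmetic and can be checked by hand. The sketch below restates it as a function; ``mock_pool_size()`` is a hypothetical name, and the numbers used in the follow-up note are illustrative assumptions, not the constants defined in the application.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restatement of the nb_mbufs computation: per-port demand
 * (Rx and Tx descriptors, one burst, per-lcore caches) multiplied by the
 * number of ports, with a lower bound of min_pool_size. */
static uint32_t
mock_pool_size(uint32_t nb_ports, uint32_t nb_rxd, uint32_t nb_txd,
    uint32_t max_pkt_burst, uint32_t nb_lcores, uint32_t cache_size,
    uint32_t min_pool_size)
{
    const uint32_t wanted = nb_ports * (nb_rxd + nb_txd + max_pkt_burst +
        nb_lcores * cache_size);

    return wanted > min_pool_size ? wanted : min_pool_size;
}
```

For example, with 2 ports, 1024 Rx and 1024 Tx descriptors, a burst of 32, 3 lcores and a cache of 512, the per-port demand is 3616 mbufs and the total is 7232; if the minimum pool size were 65536, the minimum would win.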
> +
> +The ``main()`` function also initializes the ports:
> +
> +.. code-block:: c
> +
> +    /* Initialise each port */
> +    RTE_ETH_FOREACH_DEV(portid) {
> +        port_init(portid, ioat_pktmbuf_pool);
> +    }
> +
> +Each port is configured using ``port_init()``:
> +
> +.. code-block:: c
> +
> +     /*
> +     * Initializes a given port using global settings and with the RX buffers
> +     * coming from the mbuf_pool passed as a parameter.
> +     */
> +    static inline void
> +    port_init(uint16_t portid, struct rte_mempool *mbuf_pool, uint16_t nb_queues)
> +    {
> +        /* configuring port to use RSS for multiple RX queues */
> +        static const struct rte_eth_conf port_conf = {
> +            .rxmode = {
> +                .mq_mode        = ETH_MQ_RX_RSS,
> +                .max_rx_pkt_len = RTE_ETHER_MAX_LEN
> +            },
> +            .rx_adv_conf = {
> +                .rss_conf = {
> +                    .rss_key = NULL,
> +                    .rss_hf = ETH_RSS_PROTO_MASK,
> +                }
> +            }
> +        };
> +
> +        struct rte_eth_rxconf rxq_conf;
> +        struct rte_eth_txconf txq_conf;
> +        struct rte_eth_conf local_port_conf = port_conf;
> +        struct rte_eth_dev_info dev_info;
> +        int ret, i;
> +
> +        /* Skip ports that are not enabled */
> +        if ((ioat_enabled_port_mask & (1 << portid)) == 0) {
> +            printf("Skipping disabled port %u\n", portid);
> +            return;
> +        }
> +
> +        /* Init port */
> +        printf("Initializing port %u... ", portid);
> +        fflush(stdout);
> +        rte_eth_dev_info_get(portid, &dev_info);
> +        local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
> +            dev_info.flow_type_rss_offloads;
> +        if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
> +            local_port_conf.txmode.offloads |=
> +                DEV_TX_OFFLOAD_MBUF_FAST_FREE;
> +        ret = rte_eth_dev_configure(portid, nb_queues, 1, &local_port_conf);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE, "Cannot configure device:"
> +                " err=%d, port=%u\n", ret, portid);
> +
> +        ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd,
> +                            &nb_txd);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot adjust number of descriptors: err=%d, port=%u\n",
> +                ret, portid);
> +
> +        rte_eth_macaddr_get(portid, &ioat_ports_eth_addr[portid]);
> +
> +        /* Init Rx queues */
> +        rxq_conf = dev_info.default_rxconf;
> +        rxq_conf.offloads = local_port_conf.rxmode.offloads;
> +        for (i = 0; i < nb_queues; i++) {
> +            ret = rte_eth_rx_queue_setup(portid, i, nb_rxd,
> +                rte_eth_dev_socket_id(portid), &rxq_conf,
> +                mbuf_pool);
> +            if (ret < 0)
> +                rte_exit(EXIT_FAILURE,
> +                    "rte_eth_rx_queue_setup:err=%d,port=%u, queue_id=%u\n",
> +                    ret, portid, i);
> +        }
> +
> +        /* Init one TX queue on each port */
> +        txq_conf = dev_info.default_txconf;
> +        txq_conf.offloads = local_port_conf.txmode.offloads;
> +        ret = rte_eth_tx_queue_setup(portid, 0, nb_txd,
> +                rte_eth_dev_socket_id(portid),
> +                &txq_conf);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "rte_eth_tx_queue_setup:err=%d,port=%u\n",
> +                ret, portid);
> +
> +        /* Initialize TX buffers */
> +        tx_buffer[portid] = rte_zmalloc_socket("tx_buffer",
> +                RTE_ETH_TX_BUFFER_SIZE(MAX_PKT_BURST), 0,
> +                rte_eth_dev_socket_id(portid));
> +        if (tx_buffer[portid] == NULL)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot allocate buffer for tx on port %u\n",
> +                portid);
> +
> +        rte_eth_tx_buffer_init(tx_buffer[portid], MAX_PKT_BURST);
> +
> +        ret = rte_eth_tx_buffer_set_err_callback(tx_buffer[portid],
> +                rte_eth_tx_buffer_count_callback,
> +                &port_statistics.tx_dropped[portid]);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "Cannot set error callback for tx buffer on port %u\n",
> +                portid);
> +
> +        /* Start device */
> +        ret = rte_eth_dev_start(portid);
> +        if (ret < 0)
> +            rte_exit(EXIT_FAILURE,
> +                "rte_eth_dev_start:err=%d, port=%u\n",
> +                ret, portid);
> +
> +        rte_eth_promiscuous_enable(portid);
> +
> +        printf("Port %u, MAC address: %02X:%02X:%02X:%02X:%02X:%02X\n\n",
> +                portid,
> +                ioat_ports_eth_addr[portid].addr_bytes[0],
> +                ioat_ports_eth_addr[portid].addr_bytes[1],
> +                ioat_ports_eth_addr[portid].addr_bytes[2],
> +                ioat_ports_eth_addr[portid].addr_bytes[3],
> +                ioat_ports_eth_addr[portid].addr_bytes[4],
> +                ioat_ports_eth_addr[portid].addr_bytes[5]);
> +
> +        cfg.ports[cfg.nb_ports].rxtx_port = portid;
> +        cfg.ports[cfg.nb_ports++].nb_queues = nb_queues;
> +    }
> +

This code is probably quite similar to that in other sample apps, so I don't think we need to include the full function here. It makes updating the code more difficult, so just refer to the function as doing the port init and leave it at that, I think. The snippets below give enough detail.
[Marcin] Changed the description accordingly.

> +The Ethernet ports are configured with local settings using the 
> +``rte_eth_dev_configure()`` function and the ``port_conf`` struct.
> +The RSS is enabled so that multiple Rx queues could be used for 
> +packet receiving and copying by multiple CBDMA channels per port:
> +
> +.. code-block:: c
> +
> +    /* configuring port to use RSS for multiple RX queues */
> +    static const struct rte_eth_conf port_conf = {
> +        .rxmode = {
> +            .mq_mode        = ETH_MQ_RX_RSS,
> +            .max_rx_pkt_len = RTE_ETHER_MAX_LEN
> +        },
> +        .rx_adv_conf = {
> +            .rss_conf = {
> +                .rss_key = NULL,
> +                .rss_hf = ETH_RSS_PROTO_MASK,
> +            }
> +        }
> +    };
> +
> +For this example the ports are set up with the number of Rx queues 
> +provided with -q option and 1 Tx queue using the 
> +``rte_eth_rx_queue_setup()`` and ``rte_eth_tx_queue_setup()`` functions.
> +
> +The Ethernet port is then started:
> +
> +.. code-block:: c
> +
> +    ret = rte_eth_dev_start(portid);
> +    if (ret < 0)
> +        rte_exit(EXIT_FAILURE, "rte_eth_dev_start:err=%d, port=%u\n",
> +            ret, portid);
> +
> +
> +Finally the Rx port is set in promiscuous mode:
> +
> +.. code-block:: c
> +
> +    rte_eth_promiscuous_enable(portid);
> +
> +
> +After that each port application assigns resources needed.
> +
> +.. code-block:: c
> +
> +    check_link_status(ioat_enabled_port_mask);
> +
> +    if (!cfg.nb_ports) {
> +        rte_exit(EXIT_FAILURE,
> +            "All available ports are disabled. Please set portmask.\n");
> +    }
> +
> +    /* Check if there is enough lcores for all ports. */
> +    cfg.nb_lcores = rte_lcore_count() - 1;
> +    if (cfg.nb_lcores < 1)
> +        rte_exit(EXIT_FAILURE,
> +            "There should be at least one slave lcore.\n");
> +
> +    ret = 0;
> +
> +    if (copy_mode == COPY_MODE_IOAT_NUM) {
> +        assign_rawdevs();
> +    } else /* copy_mode == COPY_MODE_SW_NUM */ {
> +        assign_rings();
> +    }
> +
> +A link status is checked of each port enabled by port mask using 
> +``check_link_status()`` function.
> +

I don't think this block needs to be covered. No need to go into everything in detail, just focus on the key parts of the app that are unique to it, i.e. the copying and passing mbufs between threads parts.
[Marcin] Changed the description accordingly.

> +.. code-block:: c
> +
> +    /* check link status, return true if at least one port is up */
> +    static int
> +    check_link_status(uint32_t port_mask)
> +    {
> +        uint16_t portid;
> +        struct rte_eth_link link;
> +        int retval = 0;
> +
> +        printf("\nChecking link status\n");
> +        RTE_ETH_FOREACH_DEV(portid) {
> +            if ((port_mask & (1 << portid)) == 0)
> +                continue;
> +
> +            memset(&link, 0, sizeof(link));
> +            rte_eth_link_get(portid, &link);
> +
> +            /* Print link status */
> +            if (link.link_status) {
> +                printf(
> +                    "Port %d Link Up. Speed %u Mbps - %s\n",
> +                    portid, link.link_speed,
> +                    (link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
> +                    ("full-duplex") : ("half-duplex\n"));
> +                retval = 1;
> +            } else
> +                printf("Port %d Link Down\n", portid);
> +        }
> +        return retval;
> +    }
> +
  
Bruce Richardson Sept. 27, 2019, 3 p.m. UTC | #7
On Fri, Sep 27, 2019 at 03:51:48PM +0100, Baran, MarcinX wrote:
> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com> 
> Sent: Friday, September 27, 2019 1:02 PM
> To: Baran, MarcinX <marcinx.baran@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 6/6] doc/guides/: provide IOAT sample app guide
> 
> On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> > Added guide for IOAT sample app usage and code description.
> > 
> > Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> > ---
> >  doc/guides/sample_app_ug/index.rst |   1 +
> >  doc/guides/sample_app_ug/intro.rst |   4 +
> >  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
> >  3 files changed, 769 insertions(+)
> >  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> > 
> > diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
> > index f23f8f59e..a6a1d9e7a 100644
> > --- a/doc/guides/sample_app_ug/index.rst
> > +++ b/doc/guides/sample_app_ug/index.rst
> > @@ -23,6 +23,7 @@ Sample Applications User Guides
> >      ip_reassembly
> >      kernel_nic_interface
> >      keep_alive
> > +    ioat
> >      l2_forward_crypto
> >      l2_forward_job_stats
> >      l2_forward_real_virtual
> > diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
> > index 90704194a..74462312f 100644
> > --- a/doc/guides/sample_app_ug/intro.rst
> > +++ b/doc/guides/sample_app_ug/intro.rst
> > @@ -91,6 +91,10 @@ examples are highlighted below.
> >    forwarding, or ``l3fwd`` application does forwarding based on Internet
> >    Protocol, IPv4 or IPv6 like a simple router.
> >  
> > +* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
> > +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev driver for
> > +  copying packets between two threads.
> > +
> >  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
> >    demonstrates how to distribute packets arriving on an Rx port to different
> >    cores for processing and transmission.
> > diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
> > new file mode 100644
> > index 000000000..69621673b
> > --- /dev/null
> > +++ b/doc/guides/sample_app_ug/ioat.rst
> > @@ -0,0 +1,764 @@
> > +..  SPDX-License-Identifier: BSD-3-Clause
> > +    Copyright(c) 2019 Intel Corporation.
> > +
> > +Sample Application of packet copying using Intel\ |reg| QuickData Technology
> > +============================================================================
> > +
> > +Overview
> > +--------
> > +
> > +This sample is intended as a demonstration of the basic components of 
> > +a DPDK forwarding application and example of how to use IOAT driver 
> > +API to make packets copies.
> > +
> > +Also while forwarding, the MAC addresses are affected as follows:
> > +
> > +*   The source MAC address is replaced by the TX port MAC address
> > +
> > +*   The destination MAC address is replaced by  02:00:00:00:00:TX_PORT_ID
> > +
> > +This application can be used to compare performance of using software 
> > +packet copy with copy done using a DMA device for different sizes of packets.
> > +The example will print out statistics each second. The stats shows 
> > +received/send packets and packets dropped or failed to copy.
> > +
> > +Compiling the Application
> > +-------------------------
> > +
> > +To compile the sample application see :doc:`compiling`.
> > +
> > +The application is located in the ``ioat`` sub-directory.
> > +
> > +
> > +Running the Application
> > +-----------------------
> > +
> > +In order to run the hardware copy application, the copying device 
> > +needs to be bound to user-space IO driver.
> > +
> > +Refer to the *IOAT Rawdev Driver for Intel\ |reg| QuickData 
> > +Technology* guide for information on using the driver.
> > +
> 
> The document is not called that, as the IOAT guide is just part of the overall rawdev document. So I suggest you just reference the rawdev guide.
> [Marcin] I wanted to refer to the ioat guide in /doc/guides/rawdevs/ioat.rst which has that title. Is there another document or I referenced this one
> incorrectly?
> 

It's one chapter in the overall "Rawdev Drivers" document:
https://doc.dpdk.org/guides/rawdevs/index.html

Since the (R) symbol doesn't seem to show up correctly in what you have
above (though it looks correct to me), I suggest just referring to the
"IOAT Rawdev Driver" chapter in the "Rawdev Drivers" document.
  
Marcin Baran Sept. 27, 2019, 3:13 p.m. UTC | #8
-----Original Message-----
From: Bruce Richardson <bruce.richardson@intel.com> 
Sent: Friday, September 27, 2019 3:23 PM
To: Baran, MarcinX <marcinx.baran@intel.com>
Cc: dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; Kovacevic, Marko <marko.kovacevic@intel.com>
Subject: Re: [dpdk-dev] [PATCH v5 6/6] doc/guides/: provide IOAT sample app guide

On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> Added guide for IOAT sample app usage and code description.
> 
> Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> ---
>  doc/guides/sample_app_ug/index.rst |   1 +
>  doc/guides/sample_app_ug/intro.rst |   4 +
>  doc/guides/sample_app_ug/ioat.rst  | 764 +++++++++++++++++++++++++++++
>  3 files changed, 769 insertions(+)
>  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> 

<snip>

> +Depending on the mode set (whether the copy is done by software or by
> +hardware), special structures are assigned to each port. If software
> +copy was chosen, the application has to assign ring structures for
> +exchanging packets between the lcores assigned to ports.
> +
> +.. code-block:: c
> +
> +    static void
> +    assign_rings(void)
> +    {
> +        uint32_t i;
> +
> +        for (i = 0; i < cfg.nb_ports; i++) {
> +            char ring_name[20];
> +
> +            snprintf(ring_name, 20, "rx_to_tx_ring_%u", i);
> +            /* Create ring for inter core communication */
> +            cfg.ports[i].rx_to_tx_ring = rte_ring_create(
> +                    ring_name, ring_size,
> +                    rte_socket_id(), RING_F_SP_ENQ);
> +
> +            if (cfg.ports[i].rx_to_tx_ring == NULL)
> +                rte_exit(EXIT_FAILURE, "%s\n",
> +                        rte_strerror(rte_errno));
> +        }
> +    }
> +
> +
> +When using hardware copy each Rx queue of the port is assigned an 
> +IOAT device (``assign_rawdevs()``) using IOAT Rawdev Driver API
> +functions:
> +
> +.. code-block:: c
> +
> +    static void
> +    assign_rawdevs(void)
> +    {
> +        uint16_t nb_rawdev = 0, rdev_id = 0;
> +        uint32_t i, j;
> +
> +        for (i = 0; i < cfg.nb_ports; i++) {
> +            for (j = 0; j < cfg.ports[i].nb_queues; j++) {
> +                struct rte_rawdev_info rdev_info = { 0 };
> +
> +                do {
> +                    if (rdev_id == rte_rawdev_count())
> +                        goto end;
> +                    rte_rawdev_info_get(rdev_id++, &rdev_info);
> +                } while (strcmp(rdev_info.driver_name,
> +                    IOAT_PMD_RAWDEV_NAME_STR) != 0);
> +
> +                cfg.ports[i].ioat_ids[j] = rdev_id - 1;
> +                configure_rawdev_queue(cfg.ports[i].ioat_ids[j]);
> +                ++nb_rawdev;
> +            }
> +        }
> +    end:
> +        if (nb_rawdev < cfg.nb_ports * cfg.ports[0].nb_queues)
> +            rte_exit(EXIT_FAILURE,
> +                "Not enough IOAT rawdevs (%u) for all queues (%u).\n",
> +                nb_rawdev, cfg.nb_ports * cfg.ports[0].nb_queues);
> +        RTE_LOG(INFO, IOAT, "Number of used rawdevs: %u.\n", nb_rawdev);
> +    }
> +
> +
> +The initialization of hardware device is done by 
> +``rte_rawdev_configure()`` function and ``rte_rawdev_info`` struct.

... using ``rte_rawdev_info`` struct
[Marcin] Changed the description accordingly.

> After configuration the device is
> +started using ``rte_rawdev_start()`` function. Each of the above 
> +operations is done in ``configure_rawdev_queue()``.

In the block below, there is no mention of where dev_config structure comes from. Presume it's a global variable, so maybe mention that in the text.
[Marcin] Actually this code snippet was not updated; it now uses a local dev_config variable of type struct rte_ioat_rawdev_config, so I updated the snippet.

> +
> +.. code-block:: c
> +
> +    static void
> +    configure_rawdev_queue(uint32_t dev_id)
> +    {
> +        struct rte_rawdev_info info = { .dev_private = &dev_config };
> +
> +        /* Configure hardware copy device */
> +        dev_config.ring_size = ring_size;
> +
> +        if (rte_rawdev_configure(dev_id, &info) != 0) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error with rte_rawdev_configure()\n");
> +        }
> +        rte_rawdev_info_get(dev_id, &info);
> +        if (dev_config.ring_size != ring_size) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error, ring size is not %d (%d)\n",
> +                ring_size, (int)dev_config.ring_size);
> +        }
> +        if (rte_rawdev_start(dev_id) != 0) {
> +            rte_exit(EXIT_FAILURE,
> +                "Error with rte_rawdev_start()\n");
> +        }
> +    }
> +
> +If initialization is successful memory for hardware device statistics 
> +is allocated.

Missing "," after successful.
Where is this memory allocated? Is it done in main or elsewhere?
[Marcin] This comment is invalid now as the code has been updated, so I removed it.
> +
> +Finally ``main()`` functions starts all processing lcores and starts

s/functions/function/
s/processing lcores/packet handling lcores/
[Marcin] Changed the description accordingly

> +printing stats in a loop on master lcore. The application can be

s/master lcore/the master lcore/
[Marcin] Changed the description accordingly

> +interrupted and closed using ``Ctrl-C``. The master lcore waits for 
> +all slave processes to finish, deallocates resources and exits.
> +
> +The processing lcores launching function are described below.
> +
> +The Lcores Launching Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +As described above ``main()`` function invokes 
> +``start_forwarding_cores()``

Missing "," after above.
[Marcin] Changed the description accordingly

> +function in order to start processing for each lcore:
> +
> +.. code-block:: c
> +
> +    static void start_forwarding_cores(void)
> +    {
> +        uint32_t lcore_id = rte_lcore_id();
> +
> +        RTE_LOG(INFO, IOAT, "Entering %s on lcore %u\n",
> +                __func__, rte_lcore_id());
> +
> +        if (cfg.nb_lcores == 1) {
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)rxtx_main_loop,
> +                NULL, lcore_id);
> +        } else if (cfg.nb_lcores > 1) {
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)rx_main_loop,
> +                NULL, lcore_id);
> +
> +            lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +            rte_eal_remote_launch((lcore_function_t *)tx_main_loop, NULL,
> +                lcore_id);
> +        }
> +    }
> +
> +The function launches Rx/Tx processing functions on configured lcores 
> +for each port using ``rte_eal_remote_launch()``. The configured 
> +ports,

Remove "for each port"
[Marcin] Removed
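
To illustrate why "for each port" does not fit here: in the quoted ``start_forwarding_cores()`` the choice of loops depends only on the number of worker lcores, not on the number of ports. The following is a minimal sketch of that decision. The names ``toy_launch_plan``, ``toy_choose_launch_plan()`` and ``toy_workers_launched()`` are hypothetical and are not part of the sample app or of DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the launch choices in the quoted function. */
enum toy_launch_plan {
    TOY_LAUNCH_NONE,      /* no worker lcore available */
    TOY_LAUNCH_RXTX,      /* one worker runs the combined Rx/Tx loop */
    TOY_LAUNCH_RX_AND_TX  /* one worker runs Rx, another runs Tx */
};

/* Pick the plan from the worker lcore count alone, ports play no part. */
enum toy_launch_plan
toy_choose_launch_plan(uint16_t nb_worker_lcores)
{
    if (nb_worker_lcores == 0)
        return TOY_LAUNCH_NONE;
    if (nb_worker_lcores == 1)
        return TOY_LAUNCH_RXTX;
    return TOY_LAUNCH_RX_AND_TX;
}

/* Number of lcores a plan actually starts. */
unsigned int
toy_workers_launched(enum toy_launch_plan plan)
{
    switch (plan) {
    case TOY_LAUNCH_NONE:
        return 0;
    case TOY_LAUNCH_RXTX:
        return 1;
    case TOY_LAUNCH_RX_AND_TX:
        return 2;
    }
    assert(0 && "unknown launch plan");
    return 0;
}
```

With more than two worker lcores the sketch still starts only two loops, which matches the ``else if (cfg.nb_lcores > 1)`` branch quoted above.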

> +their number and number of assigned lcores are stored in user-defined 
> +``rxtx_transmission_config`` struct that is initialized before 
> +launching

s/is/has been/
Did you describe how that structure was set up previously?
[Marcin] Added description as to how and when it is set up.
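
For reference, a minimal sketch of how such a structure could be filled in before the lcores are launched. It mirrors the two assignments at the end of ``port_init()`` and the ``rte_lcore_count() - 1`` computation in ``main()``. The struct layouts are simplified, and ``TOY_MAX_ETHPORTS``, ``toy_cfg_add_port()`` and ``toy_cfg_set_lcores()`` are hypothetical names used only for illustration, not functions of the sample app.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ETHPORTS 32 /* stand-in for RTE_MAX_ETHPORTS */

/* Simplified: only the fields set at the end of port_init() are kept. */
struct rxtx_port_config {
    uint16_t rxtx_port;
    uint16_t nb_queues;
};

struct rxtx_transmission_config {
    struct rxtx_port_config ports[TOY_MAX_ETHPORTS];
    uint16_t nb_ports;
    uint16_t nb_lcores;
};

/* Record one initialized port. Returns 0, or -1 if the table is full. */
int
toy_cfg_add_port(struct rxtx_transmission_config *cfg, uint16_t portid,
    uint16_t nb_queues)
{
    assert(cfg != NULL);
    if (cfg->nb_ports >= TOY_MAX_ETHPORTS)
        return -1;
    cfg->ports[cfg->nb_ports].rxtx_port = portid;
    cfg->ports[cfg->nb_ports].nb_queues = nb_queues;
    cfg->nb_ports++;
    return 0;
}

/* Worker lcores are all lcores except the master. Returns -1 if none. */
int
toy_cfg_set_lcores(struct rxtx_transmission_config *cfg,
    unsigned int lcore_count)
{
    assert(cfg != NULL);
    if (lcore_count < 2)
        return -1;
    cfg->nb_lcores = (uint16_t)(lcore_count - 1);
    return 0;
}
```

The point of the sketch is ordering: both helpers would have to run for every enabled port and for the lcore count before any worker reads the structure.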

> +tasks:
> +
> +.. code-block:: c
> +
> +    struct rxtx_transmission_config {
> +        struct rxtx_port_config ports[RTE_MAX_ETHPORTS];
> +        uint16_t nb_ports;
> +        uint16_t nb_lcores;
> +    };
> +
> +The Lcores Processing Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For receiving packets on each port an ``ioat_rx_port()`` function is used.

Missing "," after port.
s/an/the/
[Marcin] Changed the description accordingly

> +The function receives packets on each configured Rx queue. Depending 
> +on mode

s/mode/the mode/
[Marcin] Changed the description accordingly

> +the user chose, it will enqueue packets to IOAT rawdev channels and 
> +then invoke copy process (hardware copy), or perform software copy of 
> +each packet using ``pktmbuf_sw_copy()`` function and enqueue them to 1 rte_ring:

s/1 rte_ring/an rte_ring/
[Marcin] Changed the description accordingly

> +
> +.. code-block:: c
> +
> +    /* Receive packets on one port and enqueue to IOAT rawdev or rte_ring. */
> +    static void
> +    ioat_rx_port(struct rxtx_port_config *rx_config)
> +    {
> +        uint32_t nb_rx, nb_enq, i, j;
> +        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +        for (i = 0; i < rx_config->nb_queues; i++) {
> +
> +            nb_rx = rte_eth_rx_burst(rx_config->rxtx_port, i,
> +                pkts_burst, MAX_PKT_BURST);
> +
> +            if (nb_rx == 0)
> +                continue;
> +
> +            port_statistics.rx[rx_config->rxtx_port] += nb_rx;
> +
> +            if (copy_mode == COPY_MODE_IOAT_NUM) {
> +                /* Perform packet hardware copy */
> +                nb_enq = ioat_enqueue_packets(pkts_burst,
> +                    nb_rx, rx_config->ioat_ids[i]);
> +                if (nb_enq > 0)
> +                    rte_ioat_do_copies(rx_config->ioat_ids[i]);
> +            } else {
> +                /* Perform packet software copy, free source packets */
> +                int ret;
> +                struct rte_mbuf *pkts_burst_copy[MAX_PKT_BURST];
> +
> +                ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
> +                    (void *)pkts_burst_copy, nb_rx);
> +
> +                if (unlikely(ret < 0))
> +                    rte_exit(EXIT_FAILURE,
> +                        "Unable to allocate memory.\n");
> +
> +                for (j = 0; j < nb_rx; j++)
> +                    pktmbuf_sw_copy(pkts_burst[j],
> +                        pkts_burst_copy[j]);
> +
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)pkts_burst, nb_rx);
> +
> +                nb_enq = rte_ring_enqueue_burst(
> +                    rx_config->rx_to_tx_ring,
> +                    (void *)pkts_burst_copy, nb_rx, NULL);
> +
> +                /* Free any not enqueued packets. */
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&pkts_burst_copy[nb_enq],
> +                    nb_rx - nb_enq);
> +            }
> +
> +            port_statistics.copy_dropped[rx_config->rxtx_port] +=
> +                (nb_rx - nb_enq);
> +        }
> +    }
> +
> +The packets are received in burst mode using ``rte_eth_rx_burst()`` 
> +function. When using hardware copy mode the packets are enqueued in 
> +copying device's buffer using ``ioat_enqueue_packets()`` which calls 
> +``rte_ioat_enqueue_copy()``. When all received packets are in the 
> +buffer the copies are invoked by calling ``rte_ioat_do_copies()``.

s/copies are invoked/copy operations are started/
[Marcin] Changed the description accordingly

> +Function ``rte_ioat_enqueue_copy()`` operates on physical address of 
> +the packet. Structure ``rte_mbuf`` contains only physical address to 
> +start of the data buffer (``buf_iova``). Thus the address is shifted

s/shifted/adjusted/
[Marcin] Changed the description accordingly

> +by ``addr_offset`` value in order to get pointer to ``rearm_data``

s/pointer to/the address of/
[Marcin] Changed the description accordingly

> +member of ``rte_mbuf``. That way the packet is copied all at once 
> +(with data and metadata).

"That way both the packet data and metadata can be copied in a single operation".
Should also note that this shortcut can be used because the mbufs are "direct" mbufs allocated by the app. If another app uses external buffers, or indirect mbufs, then multiple copy operations must be used.
[Marcin] Changed the description accordingly and added additional information.
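
The address arithmetic behind the single-copy shortcut can be sketched as follows. This is only an illustration under stated assumptions: ``toy_mbuf`` is a hypothetical, simplified stand-in for ``rte_mbuf`` with an embedded buffer and no headroom, and none of the ``toy_`` names exist in DPDK. It shows why the shortcut needs a direct mbuf: metadata and data must sit in one contiguous object for one copy to cover both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for struct rte_mbuf. Metadata and the
 * data buffer share one contiguous object, as in a direct mbuf. The toy has
 * no headroom, so packet data starts at buf_addr.
 */
struct toy_mbuf {
    void *buf_addr;         /* virtual address of the data buffer */
    uint64_t buf_iova;      /* IO address of the data buffer */
    uint8_t rearm_data[16]; /* first metadata field included in the copy */
    uint16_t data_len;      /* bytes of packet data in the buffer */
    uint8_t buf[128];       /* embedded data buffer */
};

/* One contiguous region for the copy engine: start IO address and length. */
struct toy_copy_span {
    uint64_t start;
    uint64_t len;
};

/* Set up a direct toy mbuf whose object starts at IO address base_iova. */
void
toy_mbuf_init(struct toy_mbuf *m, uint64_t base_iova, uint16_t data_len)
{
    assert(m != NULL && data_len <= sizeof(m->buf));
    m->buf_addr = m->buf;
    m->buf_iova = base_iova + offsetof(struct toy_mbuf, buf);
    m->data_len = data_len;
}

/* A direct mbuf owns its buffer. Indirect or external ones do not. */
int
toy_mbuf_is_direct(const struct toy_mbuf *m)
{
    return m->buf_addr == (const void *)m->buf;
}

/* Span covering metadata (from rearm_data) and packet data in one copy. */
struct toy_copy_span
toy_single_copy_span(const struct toy_mbuf *m)
{
    /* Bytes between the first metadata field and the data buffer. */
    const uint64_t addr_offset = (uint64_t)((uintptr_t)m->buf_addr -
        (uintptr_t)m->rearm_data);
    struct toy_copy_span span;

    assert(toy_mbuf_is_direct(m));
    span.start = m->buf_iova - addr_offset;
    span.len = (uint64_t)m->data_len + addr_offset;
    return span;
}
```

For an indirect mbuf or one with an external buffer, ``buf_addr`` points outside the object, so the offset is meaningless and the metadata and data would need separate copy operations.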

> +
> +.. code-block:: c
> +
> +    static uint32_t
> +    ioat_enqueue_packets(struct rte_mbuf **pkts,
> +        uint32_t nb_rx, uint16_t dev_id)
> +    {
> +        int ret;
> +        uint32_t i;
> +        struct rte_mbuf *pkts_copy[MAX_PKT_BURST];
> +
> +        const uint64_t addr_offset = RTE_PTR_DIFF(pkts[0]->buf_addr,
> +            &pkts[0]->rearm_data);
> +
> +        ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
> +                (void *)pkts_copy, nb_rx);
> +
> +        if (unlikely(ret < 0))
> +            rte_exit(EXIT_FAILURE, "Unable to allocate memory.\n");
> +
> +        for (i = 0; i < nb_rx; i++) {
> +            /* Perform data copy */
> +            ret = rte_ioat_enqueue_copy(dev_id,
> +                pkts[i]->buf_iova
> +                    - addr_offset,
> +                pkts_copy[i]->buf_iova
> +                    - addr_offset,
> +                rte_pktmbuf_data_len(pkts[i])
> +                    + addr_offset,
> +                (uintptr_t)pkts[i],
> +                (uintptr_t)pkts_copy[i],
> +                0 /* nofence */);
> +
> +            if (ret != 1)
> +                break;
> +        }
> +
> +        ret = i;
> +        /* Free any not enqueued packets. */
> +        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts[i], nb_rx - i);
> +        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts_copy[i],
> +            nb_rx - i);
> +
> +        return ret;
> +    }
> +
> +
> +All done copies are processed by ``ioat_tx_port()`` function. When 
> +using

s/done/completed/
[Marcin] Changed the description accordingly

> +hardware copy mode the function invokes 
> +``rte_ioat_completed_copies()`` on each assigned IOAT channel to 
> +gather copied packets. If software copy mode is used the function 
> +dequeues copied packets from the rte_ring. Then each packet MAC 
> +address is changed if it was enabled. After that copies are sent in burst mode using ``rte_eth_tx_burst()``.
> +
> +
> +.. code-block:: c
> +
> +    /* Transmit packets from IOAT rawdev/rte_ring for one port. */
> +    static void
> +    ioat_tx_port(struct rxtx_port_config *tx_config)
> +    {
> +        uint32_t i, j, nb_dq = 0;
> +        struct rte_mbuf *mbufs_src[MAX_PKT_BURST];
> +        struct rte_mbuf *mbufs_dst[MAX_PKT_BURST];
> +
> +        if (copy_mode == COPY_MODE_IOAT_NUM) {
> +            for (i = 0; i < tx_config->nb_queues; i++) {
> +                /* Dequeue the mbufs from IOAT device. */
> +                nb_dq = rte_ioat_completed_copies(
> +                    tx_config->ioat_ids[i], MAX_PKT_BURST,
> +                    (void *)mbufs_src, (void *)mbufs_dst);
> +
> +                if (nb_dq == 0)
> +                    break;
> +
> +                rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)mbufs_src, nb_dq);
> +
> +                /* Update macs if enabled */
> +                if (mac_updating) {
> +                    for (j = 0; j < nb_dq; j++)
> +                        update_mac_addrs(mbufs_dst[j],
> +                            tx_config->rxtx_port);
> +                }
> +
> +                const uint16_t nb_tx = rte_eth_tx_burst(
> +                    tx_config->rxtx_port, 0,
> +                    (void *)mbufs_dst, nb_dq);
> +
> +                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
> +
> +                /* Free any unsent packets. */
> +                if (unlikely(nb_tx < nb_dq))
> +                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&mbufs_dst[nb_tx],
> +                        nb_dq - nb_tx);
> +            }
> +        }
> +        else {
> +            for (i = 0; i < tx_config->nb_queues; i++) {
> +                /* Dequeue the mbufs from rte_ring. */
> +                nb_dq = rte_ring_dequeue_burst(tx_config->rx_to_tx_ring,
> +                    (void *)mbufs_dst, MAX_PKT_BURST, NULL);
> +
> +                if (nb_dq == 0)
> +                    return;
> +
> +                /* Update macs if enabled */
> +                if (mac_updating) {
> +                    for (j = 0; j < nb_dq; j++)
> +                        update_mac_addrs(mbufs_dst[j],
> +                            tx_config->rxtx_port);
> +                }
> +
> +                const uint16_t nb_tx = rte_eth_tx_burst(tx_config->rxtx_port,
> +                    0, (void *)mbufs_dst, nb_dq);
> +
> +                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
> +
> +                /* Free any unsent packets. */
> +                if (unlikely(nb_tx < nb_dq))
> +                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
> +                    (void *)&mbufs_dst[nb_tx],
> +                        nb_dq - nb_tx);
> +            }
> +        }
> +    }
> +
> +The Packet Copying Functions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In order to perform packet copy there is a user-defined function 
> +``pktmbuf_sw_copy()`` used. It copies a whole packet by copying 
> +metadata from source packet to new mbuf, and then copying a data 
> +chunk of source packet. Both memory copies are done using
> +``rte_memcpy()``:
> +
> +.. code-block:: c
> +
> +    static inline void
> +    pktmbuf_sw_copy(struct rte_mbuf *src, struct rte_mbuf *dst)
> +    {
> +        /* Copy packet metadata */
> +        rte_memcpy(&dst->rearm_data,
> +            &src->rearm_data,
> +            offsetof(struct rte_mbuf, cacheline1)
> +                - offsetof(struct rte_mbuf, rearm_data));
> +
> +        /* Copy packet data */
> +        rte_memcpy(rte_pktmbuf_mtod(dst, char *),
> +            rte_pktmbuf_mtod(src, char *), src->data_len);
> +    }
> +
> +The metadata in this example is copied from ``rearm_data`` member of 
> +``rte_mbuf`` struct up to ``cacheline1``.
> +
> +In order to understand why software packet copying is done as shown 
> +above please refer to the "Mbuf Library" section of the *DPDK 
> +Programmer's Guide*.
> \ No newline at end of file

Use a text editor that adds a newline automatically :-)
[Marcin] Added new line at end of file.

/Bruce
  
Marcin Baran Sept. 27, 2019, 3:16 p.m. UTC | #9
-----Original Message-----
From: Bruce Richardson <bruce.richardson@intel.com> 
Sent: Friday, September 27, 2019 5:01 PM
To: Baran, MarcinX <marcinx.baran@intel.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v5 6/6] doc/guides/: provide IOAT sample app guide

On Fri, Sep 27, 2019 at 03:51:48PM +0100, Baran, MarcinX wrote:
> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com>
> Sent: Friday, September 27, 2019 1:02 PM
> To: Baran, MarcinX <marcinx.baran@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 6/6] doc/guides/: provide IOAT 
> sample app guide
> 
> On Fri, Sep 20, 2019 at 09:37:14AM +0200, Marcin Baran wrote:
> > Added guide for IOAT sample app usage and code description.
> > 
> > Signed-off-by: Marcin Baran <marcinx.baran@intel.com>
> > ---
> >  doc/guides/sample_app_ug/index.rst |   1 +
> >  doc/guides/sample_app_ug/intro.rst |   4 +
> >  doc/guides/sample_app_ug/ioat.rst  | 764
> > +++++++++++++++++++++++++++++
> >  3 files changed, 769 insertions(+)
> >  create mode 100644 doc/guides/sample_app_ug/ioat.rst
> > 
> > diff --git a/doc/guides/sample_app_ug/index.rst
> > b/doc/guides/sample_app_ug/index.rst
> > index f23f8f59e..a6a1d9e7a 100644
> > --- a/doc/guides/sample_app_ug/index.rst
> > +++ b/doc/guides/sample_app_ug/index.rst
> > @@ -23,6 +23,7 @@ Sample Applications User Guides
> >      ip_reassembly
> >      kernel_nic_interface
> >      keep_alive
> > +    ioat
> >      l2_forward_crypto
> >      l2_forward_job_stats
> >      l2_forward_real_virtual
> > diff --git a/doc/guides/sample_app_ug/intro.rst
> > b/doc/guides/sample_app_ug/intro.rst
> > index 90704194a..74462312f 100644
> > --- a/doc/guides/sample_app_ug/intro.rst
> > +++ b/doc/guides/sample_app_ug/intro.rst
> > @@ -91,6 +91,10 @@ examples are highlighted below.
> >    forwarding, or ``l3fwd`` application does forwarding based on Internet
> >    Protocol, IPv4 or IPv6 like a simple router.
> >  
> > +* :doc:`Hardware packet copying<ioat>`: The Hardware packet 
> > +copying,
> > +  or ``ioatfwd`` application demonstrates how to use IOAT rawdev 
> > +driver for
> > +  copying packets between two threads.
> > +
> >  * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
> >    demonstrates how to distribute packets arriving on an Rx port to different
> >    cores for processing and transmission.
> > diff --git a/doc/guides/sample_app_ug/ioat.rst
> > b/doc/guides/sample_app_ug/ioat.rst
> > new file mode 100644
> > index 000000000..69621673b
> > --- /dev/null
> > +++ b/doc/guides/sample_app_ug/ioat.rst
> > @@ -0,0 +1,764 @@
> > +..  SPDX-License-Identifier: BSD-3-Clause
> > +    Copyright(c) 2019 Intel Corporation.
> > +
> > +Sample Application of packet copying using Intel\ |reg| QuickData 
> > +Technology 
> > +===================================================================
> > +==
> > +=======
> > +
> > +Overview
> > +--------
> > +
> > +This sample is intended as a demonstration of the basic components 
> > +of a DPDK forwarding application and example of how to use IOAT 
> > +driver API to make packets copies.
> > +
> > +Also while forwarding, the MAC addresses are affected as follows:
> > +
> > +*   The source MAC address is replaced by the TX port MAC address
> > +
> > +*   The destination MAC address is replaced by  02:00:00:00:00:TX_PORT_ID
> > +
> > +This application can be used to compare performance of using 
> > +software packet copy with copy done using a DMA device for different sizes of packets.
> > +The example will print out statistics each second. The stats shows 
> > +received/send packets and packets dropped or failed to copy.
> > +
> > +Compiling the Application
> > +-------------------------
> > +
> > +To compile the sample application see :doc:`compiling`.
> > +
> > +The application is located in the ``ioat`` sub-directory.
> > +
> > +
> > +Running the Application
> > +-----------------------
> > +
> > +In order to run the hardware copy application, the copying device 
> > +needs to be bound to user-space IO driver.
> > +
> > +Refer to the *IOAT Rawdev Driver for Intel\ |reg| QuickData
> > +Technology* guide for information on using the driver.
> > +
> 
> The document is not called that, as the IOAT guide is just part of the overall rawdev document. So I suggest you just reference the rawdev guide.
> [Marcin] I wanted to refer to the ioat guide in 
> /doc/guides/rawdevs/ioat.rst which has that title. Is there another document, or did I reference this one incorrectly?
> 

It's one chapter in the overall "Rawdev Drivers" document:
https://doc.dpdk.org/guides/rawdevs/index.html

Since the (R) symbol doesn't seem to show up correctly in what you have above (though it looks correct to me), I suggest just referring to the "IOAT Rawdev Driver" chapter in the "Rawdev Drivers" document.
[Marcin] Ok, I will change it like that.
  

Patch

diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
index f23f8f59e..a6a1d9e7a 100644
--- a/doc/guides/sample_app_ug/index.rst
+++ b/doc/guides/sample_app_ug/index.rst
@@ -23,6 +23,7 @@  Sample Applications User Guides
     ip_reassembly
     kernel_nic_interface
     keep_alive
+    ioat
     l2_forward_crypto
     l2_forward_job_stats
     l2_forward_real_virtual
diff --git a/doc/guides/sample_app_ug/intro.rst b/doc/guides/sample_app_ug/intro.rst
index 90704194a..74462312f 100644
--- a/doc/guides/sample_app_ug/intro.rst
+++ b/doc/guides/sample_app_ug/intro.rst
@@ -91,6 +91,10 @@  examples are highlighted below.
   forwarding, or ``l3fwd`` application does forwarding based on Internet
   Protocol, IPv4 or IPv6 like a simple router.
 
+* :doc:`Hardware packet copying<ioat>`: The Hardware packet copying,
+  or ``ioatfwd``, application demonstrates how to use the IOAT rawdev driver
+  for copying packets between two threads.
+
 * :doc:`Packet Distributor<dist_app>`: The Packet Distributor
   demonstrates how to distribute packets arriving on an Rx port to different
   cores for processing and transmission.
diff --git a/doc/guides/sample_app_ug/ioat.rst b/doc/guides/sample_app_ug/ioat.rst
new file mode 100644
index 000000000..69621673b
--- /dev/null
+++ b/doc/guides/sample_app_ug/ioat.rst
@@ -0,0 +1,764 @@ 
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2019 Intel Corporation.
+
+Sample Application of packet copying using Intel\ |reg| QuickData Technology
+============================================================================
+
+Overview
+--------
+
+This sample is intended as a demonstration of the basic components of a DPDK
+forwarding application and an example of how to use the IOAT driver API to
+make packet copies.
+
+Also while forwarding, the MAC addresses are affected as follows:
+
+*   The source MAC address is replaced by the TX port MAC address
+
+*   The destination MAC address is replaced by  02:00:00:00:00:TX_PORT_ID
+
+This application can be used to compare the performance of software packet
+copy with copy done using a DMA device for different packet sizes.
+The example prints out statistics each second. The stats show
+received/sent packets and packets dropped or failed to copy.
+
+Compiling the Application
+-------------------------
+
+To compile the sample application see :doc:`compiling`.
+
+The application is located in the ``ioat`` sub-directory.
+
+
+Running the Application
+-----------------------
+
+In order to run the hardware copy application, the copying device
+needs to be bound to a user-space IO driver.
+
+Refer to the *IOAT Rawdev Driver for Intel\ |reg| QuickData Technology*
+guide for information on using the driver.
+
+The application requires a number of command line options:
+
+.. code-block:: console
+
+    ./build/ioatfwd [EAL options] -- -p MASK [-q NQ] [-s RS] [-c <sw|hw>]
+        [--[no-]mac-updating]
+
+where,
+
+*   p MASK: A hexadecimal bitmask of the ports to configure
+
+*   q NQ: Number of Rx queues used per port equivalent to CBDMA channels
+    per port
+
+*   c CT: Type of packet copy to perform: software (sw) or hardware using
+    DMA (hw)
+
+*   s RS: Size of IOAT rawdev ring for hardware copy mode or rte_ring for
+    software copy mode
+
+*   --[no-]mac-updating: Whether MAC address of packets should be changed
+    or not
+
+The application can be launched in various configurations depending on
+provided parameters. Each port can use up to 2 lcores: one of them receives
+incoming traffic and makes a copy of each packet. The second lcore then
+updates the MAC address and sends the copy. If one lcore per port is used,
+both operations are done sequentially. For each configuration an additional
+lcore is needed, since the master lcore is used for
+configuration, statistics printing and safe deinitialization of all ports
+and devices.
+
+The application can use a maximum of 8 ports.
+
+To run the application in a Linux environment with 3 lcores (one of them
+is the master lcore), 1 port (port 0), software copying and MAC updating,
+issue the command:
+
+.. code-block:: console
+
+    $ ./build/ioatfwd -l 0-2 -n 2 -- -p 0x1 --mac-updating -c sw
+
+To run the application in a Linux environment with 2 lcores (one of them
+is the master lcore), 2 ports (ports 0 and 1), hardware copying and no MAC
+updating, issue the command:
+
+.. code-block:: console
+
+    $ ./build/ioatfwd -l 0-1 -n 1 -- -p 0x3 --no-mac-updating -c hw
+
+Refer to the *DPDK Getting Started Guide* for general information on
+running applications and the Environment Abstraction Layer (EAL) options.
+
+Explanation
+-----------
+
+The following sections provide an explanation of the main components of the
+code.
+
+All DPDK library functions used in the sample code are prefixed with
+``rte_`` and are explained in detail in the *DPDK API Documentation*.
+
+
+The Main Function
+~~~~~~~~~~~~~~~~~
+
+The ``main()`` function performs the initialization and calls the execution
+threads for each lcore.
+
+The first task is to initialize the Environment Abstraction Layer (EAL).
+The ``argc`` and ``argv`` arguments are provided to the ``rte_eal_init()``
+function. The value returned is the number of parsed arguments:
+
+.. code-block:: c
+
+    /* init EAL */
+    ret = rte_eal_init(argc, argv);
+    if (ret < 0)
+        rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
+
+
+The ``main()`` function also allocates a mempool to hold the mbufs (Message
+Buffers) used by the application:
+
+.. code-block:: c
+
+    nb_mbufs = RTE_MAX(rte_eth_dev_count_avail() * (nb_rxd + nb_txd
+        + MAX_PKT_BURST + rte_lcore_count() * MEMPOOL_CACHE_SIZE),
+        MIN_POOL_SIZE);
+
+    /* Create the mbuf pool */
+    ioat_pktmbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", nb_mbufs,
+        MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
+        rte_socket_id());
+    if (ioat_pktmbuf_pool == NULL)
+        rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n");
+
+Mbufs are the packet buffer structure used by DPDK. They are explained in
+detail in the "Mbuf Library" section of the *DPDK Programmer's Guide*.
+
+The ``main()`` function also initializes the ports:
+
+.. code-block:: c
+
+    /* Initialise each port */
+    RTE_ETH_FOREACH_DEV(portid) {
+        port_init(portid, ioat_pktmbuf_pool, nb_queues);
+    }
+
+Each port is configured using ``port_init()``:
+
+.. code-block:: c
+
+     /*
+     * Initializes a given port using global settings and with the RX buffers
+     * coming from the mbuf_pool passed as a parameter.
+     */
+    static inline void
+    port_init(uint16_t portid, struct rte_mempool *mbuf_pool, uint16_t nb_queues)
+    {
+        /* configuring port to use RSS for multiple RX queues */
+        static const struct rte_eth_conf port_conf = {
+            .rxmode = {
+                .mq_mode        = ETH_MQ_RX_RSS,
+                .max_rx_pkt_len = RTE_ETHER_MAX_LEN
+            },
+            .rx_adv_conf = {
+                .rss_conf = {
+                    .rss_key = NULL,
+                    .rss_hf = ETH_RSS_PROTO_MASK,
+                }
+            }
+        };
+
+        struct rte_eth_rxconf rxq_conf;
+        struct rte_eth_txconf txq_conf;
+        struct rte_eth_conf local_port_conf = port_conf;
+        struct rte_eth_dev_info dev_info;
+        int ret, i;
+
+        /* Skip ports that are not enabled */
+        if ((ioat_enabled_port_mask & (1 << portid)) == 0) {
+            printf("Skipping disabled port %u\n", portid);
+            return;
+        }
+
+        /* Init port */
+        printf("Initializing port %u... ", portid);
+        fflush(stdout);
+        rte_eth_dev_info_get(portid, &dev_info);
+        local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
+            dev_info.flow_type_rss_offloads;
+        if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
+            local_port_conf.txmode.offloads |=
+                DEV_TX_OFFLOAD_MBUF_FAST_FREE;
+        ret = rte_eth_dev_configure(portid, nb_queues, 1, &local_port_conf);
+        if (ret < 0)
+            rte_exit(EXIT_FAILURE, "Cannot configure device:"
+                " err=%d, port=%u\n", ret, portid);
+
+        ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd,
+                            &nb_txd);
+        if (ret < 0)
+            rte_exit(EXIT_FAILURE,
+                "Cannot adjust number of descriptors: err=%d, port=%u\n",
+                ret, portid);
+
+        rte_eth_macaddr_get(portid, &ioat_ports_eth_addr[portid]);
+
+        /* Init Rx queues */
+        rxq_conf = dev_info.default_rxconf;
+        rxq_conf.offloads = local_port_conf.rxmode.offloads;
+        for (i = 0; i < nb_queues; i++) {
+            ret = rte_eth_rx_queue_setup(portid, i, nb_rxd,
+                rte_eth_dev_socket_id(portid), &rxq_conf,
+                mbuf_pool);
+            if (ret < 0)
+                rte_exit(EXIT_FAILURE,
+                    "rte_eth_rx_queue_setup:err=%d,port=%u, queue_id=%u\n",
+                    ret, portid, i);
+        }
+
+        /* Init one TX queue on each port */
+        txq_conf = dev_info.default_txconf;
+        txq_conf.offloads = local_port_conf.txmode.offloads;
+        ret = rte_eth_tx_queue_setup(portid, 0, nb_txd,
+                rte_eth_dev_socket_id(portid),
+                &txq_conf);
+        if (ret < 0)
+            rte_exit(EXIT_FAILURE,
+                "rte_eth_tx_queue_setup:err=%d,port=%u\n",
+                ret, portid);
+
+        /* Initialize TX buffers */
+        tx_buffer[portid] = rte_zmalloc_socket("tx_buffer",
+                RTE_ETH_TX_BUFFER_SIZE(MAX_PKT_BURST), 0,
+                rte_eth_dev_socket_id(portid));
+        if (tx_buffer[portid] == NULL)
+            rte_exit(EXIT_FAILURE,
+                "Cannot allocate buffer for tx on port %u\n",
+                portid);
+
+        rte_eth_tx_buffer_init(tx_buffer[portid], MAX_PKT_BURST);
+
+        ret = rte_eth_tx_buffer_set_err_callback(tx_buffer[portid],
+                rte_eth_tx_buffer_count_callback,
+                &port_statistics.tx_dropped[portid]);
+        if (ret < 0)
+            rte_exit(EXIT_FAILURE,
+                "Cannot set error callback for tx buffer on port %u\n",
+                portid);
+
+        /* Start device */
+        ret = rte_eth_dev_start(portid);
+        if (ret < 0)
+            rte_exit(EXIT_FAILURE,
+                "rte_eth_dev_start:err=%d, port=%u\n",
+                ret, portid);
+
+        rte_eth_promiscuous_enable(portid);
+
+        printf("Port %u, MAC address: %02X:%02X:%02X:%02X:%02X:%02X\n\n",
+                portid,
+                ioat_ports_eth_addr[portid].addr_bytes[0],
+                ioat_ports_eth_addr[portid].addr_bytes[1],
+                ioat_ports_eth_addr[portid].addr_bytes[2],
+                ioat_ports_eth_addr[portid].addr_bytes[3],
+                ioat_ports_eth_addr[portid].addr_bytes[4],
+                ioat_ports_eth_addr[portid].addr_bytes[5]);
+
+        cfg.ports[cfg.nb_ports].rxtx_port = portid;
+        cfg.ports[cfg.nb_ports++].nb_queues = nb_queues;
+    }
+
+The Ethernet ports are configured with local settings using the
+``rte_eth_dev_configure()`` function and the ``port_conf`` struct.
+RSS is enabled so that multiple Rx queues can be used for
+packet receiving and copying by multiple CBDMA channels per port:
+
+.. code-block:: c
+
+    /* configuring port to use RSS for multiple RX queues */
+    static const struct rte_eth_conf port_conf = {
+        .rxmode = {
+            .mq_mode        = ETH_MQ_RX_RSS,
+            .max_rx_pkt_len = RTE_ETHER_MAX_LEN
+        },
+        .rx_adv_conf = {
+            .rss_conf = {
+                .rss_key = NULL,
+                .rss_hf = ETH_RSS_PROTO_MASK,
+            }
+        }
+    };
+
+For this example the ports are set up with the number of Rx queues provided
+with the ``-q`` option and 1 Tx queue using the ``rte_eth_rx_queue_setup()``
+and ``rte_eth_tx_queue_setup()`` functions.
+
+The Ethernet port is then started:
+
+.. code-block:: c
+
+    ret = rte_eth_dev_start(portid);
+    if (ret < 0)
+        rte_exit(EXIT_FAILURE, "rte_eth_dev_start:err=%d, port=%u\n",
+            ret, portid);
+
+
+Finally, the Rx port is set to promiscuous mode:
+
+.. code-block:: c
+
+    rte_eth_promiscuous_enable(portid);
+
+
+After that, the application assigns the resources needed by each port.
+
+.. code-block:: c
+
+    check_link_status(ioat_enabled_port_mask);
+
+    if (!cfg.nb_ports) {
+        rte_exit(EXIT_FAILURE,
+            "All available ports are disabled. Please set portmask.\n");
+    }
+
+    /* Check if there is enough lcores for all ports. */
+    cfg.nb_lcores = rte_lcore_count() - 1;
+    if (cfg.nb_lcores < 1)
+        rte_exit(EXIT_FAILURE,
+            "There should be at least one slave lcore.\n");
+
+    ret = 0;
+
+    if (copy_mode == COPY_MODE_IOAT_NUM) {
+        assign_rawdevs();
+    } else /* copy_mode == COPY_MODE_SW_NUM */ {
+        assign_rings();
+    }
+
+The link status of each port enabled by the port mask is checked
+using the ``check_link_status()`` function.
+
+.. code-block:: c
+
+    /* check link status, return true if at least one port is up */
+    static int
+    check_link_status(uint32_t port_mask)
+    {
+        uint16_t portid;
+        struct rte_eth_link link;
+        int retval = 0;
+
+        printf("\nChecking link status\n");
+        RTE_ETH_FOREACH_DEV(portid) {
+            if ((port_mask & (1u << portid)) == 0)
+                continue;
+
+            memset(&link, 0, sizeof(link));
+            rte_eth_link_get(portid, &link);
+
+            /* Print link status */
+            if (link.link_status) {
+                printf(
+                    "Port %d Link Up. Speed %u Mbps - %s\n",
+                    portid, link.link_speed,
+                    (link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
+                    ("full-duplex") : ("half-duplex"));
+                retval = 1;
+            } else
+                printf("Port %d Link Down\n", portid);
+        }
+        return retval;
+    }
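+
+The port mask test used in the loop above can be looked at in isolation.
+The following is a minimal sketch and not part of the sample application:
+``port_is_enabled()`` is a hypothetical helper name, and the sketch assumes
+a 32-bit mask in which bit ``n`` selects port ``n``.
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    /* Hypothetical helper (not in the sample app): returns 1 if bit
+     * `portid` is set in `port_mask`, 0 otherwise. */
+    static int
+    port_is_enabled(uint32_t port_mask, uint16_t portid)
+    {
+        /* A 32-bit mask can only describe ports 0..31. */
+        if (portid >= 32)
+            return 0;
+        /* Unsigned shift avoids overflow when portid is 31. */
+        return (port_mask & (UINT32_C(1) << portid)) != 0;
+    }
+
+With a mask of ``0x5`` this sketch reports ports 0 and 2 as enabled and
+port 1 as disabled.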
+
+Depending on the mode set (whether the copy is done by software or by
+hardware), special structures are assigned to each port. If software copy
+was chosen, the application has to assign ring structures for exchanging
+packets between the lcores assigned to ports.
+
+.. code-block:: c
+
+    static void
+    assign_rings(void)
+    {
+        uint32_t i;
+
+        for (i = 0; i < cfg.nb_ports; i++) {
+            char ring_name[20];
+
+            snprintf(ring_name, sizeof(ring_name), "rx_to_tx_ring_%u", i);
+            /* Create ring for inter core communication */
+            cfg.ports[i].rx_to_tx_ring = rte_ring_create(
+                    ring_name, ring_size,
+                    rte_socket_id(), RING_F_SP_ENQ);
+
+            if (cfg.ports[i].rx_to_tx_ring == NULL)
+                rte_exit(EXIT_FAILURE, "%s\n",
+                        rte_strerror(rte_errno));
+        }
+    }
+
+
+When using hardware copy, each Rx queue of the port is assigned an
+IOAT device (``assign_rawdevs()``) using the IOAT Rawdev Driver API
+functions:
+
+.. code-block:: c
+
+    static void
+    assign_rawdevs(void)
+    {
+        uint16_t nb_rawdev = 0, rdev_id = 0;
+        uint32_t i, j;
+
+        for (i = 0; i < cfg.nb_ports; i++) {
+            for (j = 0; j < cfg.ports[i].nb_queues; j++) {
+                struct rte_rawdev_info rdev_info = { 0 };
+
+                do {
+                    if (rdev_id == rte_rawdev_count())
+                        goto end;
+                    rte_rawdev_info_get(rdev_id++, &rdev_info);
+                } while (strcmp(rdev_info.driver_name,
+                    IOAT_PMD_RAWDEV_NAME_STR) != 0);
+
+                cfg.ports[i].ioat_ids[j] = rdev_id - 1;
+                configure_rawdev_queue(cfg.ports[i].ioat_ids[j]);
+                ++nb_rawdev;
+            }
+        }
+    end:
+        if (nb_rawdev < cfg.nb_ports * cfg.ports[0].nb_queues)
+            rte_exit(EXIT_FAILURE,
+                "Not enough IOAT rawdevs (%u) for all queues (%u).\n",
+                nb_rawdev, cfg.nb_ports * cfg.ports[0].nb_queues);
+        RTE_LOG(INFO, IOAT, "Number of used rawdevs: %u.\n", nb_rawdev);
+    }
+
+
+The hardware device is initialized using the ``rte_rawdev_configure()``
+function and the ``rte_rawdev_info`` struct. After configuration the device
+is started using the ``rte_rawdev_start()`` function. Both operations are
+done in ``configure_rawdev_queue()``.
+
+.. code-block:: c
+
+    static void
+    configure_rawdev_queue(uint32_t dev_id)
+    {
+        struct rte_rawdev_info info = { .dev_private = &dev_config };
+
+        /* Configure hardware copy device */
+        dev_config.ring_size = ring_size;
+
+        if (rte_rawdev_configure(dev_id, &info) != 0) {
+            rte_exit(EXIT_FAILURE,
+                "Error with rte_rawdev_configure()\n");
+        }
+        rte_rawdev_info_get(dev_id, &info);
+        if (dev_config.ring_size != ring_size) {
+            rte_exit(EXIT_FAILURE,
+                "Error, ring size is not %d (%d)\n",
+                ring_size, (int)dev_config.ring_size);
+        }
+        if (rte_rawdev_start(dev_id) != 0) {
+            rte_exit(EXIT_FAILURE,
+                "Error with rte_rawdev_start()\n");
+        }
+    }
+
+If initialization is successful, memory for hardware device
+statistics is allocated.
+
+Finally, the ``main()`` function starts all processing lcores and then
+prints statistics in a loop on the master lcore. The application can be
+interrupted and closed using ``Ctrl-C``. The master lcore waits for
+all slave lcores to finish, deallocates resources and exits.
+
+The functions that launch the processing lcores are described below.
+
+The Lcores Launching Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As described above, the ``main()`` function invokes the
+``start_forwarding_cores()`` function to start processing on each lcore:
+
+.. code-block:: c
+
+    static void start_forwarding_cores(void)
+    {
+        uint32_t lcore_id = rte_lcore_id();
+
+        RTE_LOG(INFO, IOAT, "Entering %s on lcore %u\n",
+                __func__, rte_lcore_id());
+
+        if (cfg.nb_lcores == 1) {
+            lcore_id = rte_get_next_lcore(lcore_id, true, true);
+            rte_eal_remote_launch((lcore_function_t *)rxtx_main_loop,
+                NULL, lcore_id);
+        } else if (cfg.nb_lcores > 1) {
+            lcore_id = rte_get_next_lcore(lcore_id, true, true);
+            rte_eal_remote_launch((lcore_function_t *)rx_main_loop,
+                NULL, lcore_id);
+
+            lcore_id = rte_get_next_lcore(lcore_id, true, true);
+            rte_eal_remote_launch((lcore_function_t *)tx_main_loop, NULL,
+                lcore_id);
+        }
+    }
+
+The function launches the Rx/Tx processing functions on the configured
+lcores using ``rte_eal_remote_launch()``. The configured ports, their
+number and the number of assigned lcores are stored in the user-defined
+``rxtx_transmission_config`` struct, which is initialized before the tasks
+are launched:
+
+.. code-block:: c
+
+    struct rxtx_transmission_config {
+        struct rxtx_port_config ports[RTE_MAX_ETHPORTS];
+        uint16_t nb_ports;
+        uint16_t nb_lcores;
+    };
+
+The Lcores Processing Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Packets are received on each port by the ``ioat_rx_port()`` function.
+The function receives packets on each configured Rx queue. Depending on the
+mode the user chose, it either enqueues the packets to IOAT rawdev channels
+and then starts the copy process (hardware copy), or performs a software copy
+of each packet using ``pktmbuf_sw_copy()`` and enqueues the copies to an
+``rte_ring``:
+
+.. code-block:: c
+
+    /* Receive packets on one port and enqueue to IOAT rawdev or rte_ring. */
+    static void
+    ioat_rx_port(struct rxtx_port_config *rx_config)
+    {
+        uint32_t nb_rx, nb_enq, i, j;
+        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+        for (i = 0; i < rx_config->nb_queues; i++) {
+
+            nb_rx = rte_eth_rx_burst(rx_config->rxtx_port, i,
+                pkts_burst, MAX_PKT_BURST);
+
+            if (nb_rx == 0)
+                continue;
+
+            port_statistics.rx[rx_config->rxtx_port] += nb_rx;
+
+            if (copy_mode == COPY_MODE_IOAT_NUM) {
+                /* Perform packet hardware copy */
+                nb_enq = ioat_enqueue_packets(pkts_burst,
+                    nb_rx, rx_config->ioat_ids[i]);
+                if (nb_enq > 0)
+                    rte_ioat_do_copies(rx_config->ioat_ids[i]);
+            } else {
+                /* Perform packet software copy, free source packets */
+                int ret;
+                struct rte_mbuf *pkts_burst_copy[MAX_PKT_BURST];
+
+                ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
+                    (void *)pkts_burst_copy, nb_rx);
+
+                if (unlikely(ret < 0))
+                    rte_exit(EXIT_FAILURE,
+                        "Unable to allocate memory.\n");
+
+                for (j = 0; j < nb_rx; j++)
+                    pktmbuf_sw_copy(pkts_burst[j],
+                        pkts_burst_copy[j]);
+
+                rte_mempool_put_bulk(ioat_pktmbuf_pool,
+                    (void *)pkts_burst, nb_rx);
+
+                nb_enq = rte_ring_enqueue_burst(
+                    rx_config->rx_to_tx_ring,
+                    (void *)pkts_burst_copy, nb_rx, NULL);
+
+                /* Free any not enqueued packets. */
+                rte_mempool_put_bulk(ioat_pktmbuf_pool,
+                    (void *)&pkts_burst_copy[nb_enq],
+                    nb_rx - nb_enq);
+            }
+
+            port_statistics.copy_dropped[rx_config->rxtx_port] +=
+                (nb_rx - nb_enq);
+        }
+    }
+
+The packets are received in burst mode using the ``rte_eth_rx_burst()``
+function. In hardware copy mode the packets are enqueued in the
+copying device's buffer using ``ioat_enqueue_packets()``, which calls
+``rte_ioat_enqueue_copy()``. Once all received packets are in the
+buffer, the copies are started by calling ``rte_ioat_do_copies()``.
+The ``rte_ioat_enqueue_copy()`` function operates on the physical address
+of the packet. The ``rte_mbuf`` structure holds only the physical address
+of the start of the data buffer (``buf_iova``), so that address is shifted
+back by ``addr_offset`` to obtain the address of the ``rearm_data``
+member of ``rte_mbuf``. That way the packet is copied all at once
+(data together with metadata).
+
+.. code-block:: c
+
+    static uint32_t
+    ioat_enqueue_packets(struct rte_mbuf **pkts,
+        uint32_t nb_rx, uint16_t dev_id)
+    {
+        int ret;
+        uint32_t i;
+        struct rte_mbuf *pkts_copy[MAX_PKT_BURST];
+
+        const uint64_t addr_offset = RTE_PTR_DIFF(pkts[0]->buf_addr,
+            &pkts[0]->rearm_data);
+
+        ret = rte_mempool_get_bulk(ioat_pktmbuf_pool,
+                (void *)pkts_copy, nb_rx);
+
+        if (unlikely(ret < 0))
+            rte_exit(EXIT_FAILURE, "Unable to allocate memory.\n");
+
+        for (i = 0; i < nb_rx; i++) {
+            /* Perform data copy */
+            ret = rte_ioat_enqueue_copy(dev_id,
+                pkts[i]->buf_iova
+                    - addr_offset,
+                pkts_copy[i]->buf_iova
+                    - addr_offset,
+                rte_pktmbuf_data_len(pkts[i])
+                    + addr_offset,
+                (uintptr_t)pkts[i],
+                (uintptr_t)pkts_copy[i],
+                0 /* nofence */);
+
+            if (ret != 1)
+                break;
+        }
+
+        ret = i;
+        /* Free any not enqueued packets. */
+        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts[i], nb_rx - i);
+        rte_mempool_put_bulk(ioat_pktmbuf_pool, (void *)&pkts_copy[i],
+            nb_rx - i);
+
+        return ret;
+    }
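+
+The address arithmetic used above can be illustrated without DPDK. The
+following is a minimal sketch under stated assumptions: ``struct mock_mbuf``
+and the ``mock_*`` helpers are hypothetical stand-ins, the metadata and the
+data buffer are assumed to be contiguous, and the packet data is assumed to
+start at the beginning of the buffer (no headroom).
+
+.. code-block:: c
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    /* Hypothetical, simplified stand-in for struct rte_mbuf: metadata
+     * followed directly by the data buffer in one allocation. */
+    struct mock_mbuf {
+        void *buf_addr;      /* virtual address of the data buffer */
+        uint64_t buf_iova;   /* bus address of the data buffer */
+        uint64_t rearm_data; /* first metadata field to be copied */
+        uint16_t data_len;   /* bytes of packet data in the buffer */
+        uint8_t buf[64];     /* the data buffer itself */
+    };
+
+    /* Distance in bytes from the rearm_data field to the data buffer. */
+    static uint64_t
+    mock_addr_offset(const struct mock_mbuf *m)
+    {
+        return (uint64_t)((uintptr_t)m->buf_addr -
+            (uintptr_t)&m->rearm_data);
+    }
+
+    /* Bus address where the copy starts: the address of rearm_data. */
+    static uint64_t
+    mock_copy_src(const struct mock_mbuf *m)
+    {
+        return m->buf_iova - mock_addr_offset(m);
+    }
+
+    /* Number of bytes copied: metadata span plus packet data. */
+    static uint64_t
+    mock_copy_len(const struct mock_mbuf *m)
+    {
+        return m->data_len + mock_addr_offset(m);
+    }
+
+Because the same offset is subtracted from the source and the destination
+``buf_iova``, a single copy of ``mock_copy_len()`` bytes in this sketch
+covers both the metadata and the packet data.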
+
+
+All completed copies are processed by the ``ioat_tx_port()`` function. In
+hardware copy mode the function invokes ``rte_ioat_completed_copies()``
+on each assigned IOAT channel to gather the copied packets. In software copy
+mode the function dequeues the copied packets from the rte_ring. Then the
+MAC address of each packet is updated, if MAC updating is enabled. After
+that the copies are sent in burst mode using ``rte_eth_tx_burst()``.
+
+
+.. code-block:: c
+
+    /* Transmit packets from IOAT rawdev/rte_ring for one port. */
+    static void
+    ioat_tx_port(struct rxtx_port_config *tx_config)
+    {
+        uint32_t i, j, nb_dq = 0;
+        struct rte_mbuf *mbufs_src[MAX_PKT_BURST];
+        struct rte_mbuf *mbufs_dst[MAX_PKT_BURST];
+
+        if (copy_mode == COPY_MODE_IOAT_NUM) {
+            for (i = 0; i < tx_config->nb_queues; i++) {
+                /* Dequeue the mbufs from IOAT device. */
+                nb_dq = rte_ioat_completed_copies(
+                    tx_config->ioat_ids[i], MAX_PKT_BURST,
+                    (void *)mbufs_src, (void *)mbufs_dst);
+
+                if (nb_dq == 0)
+                    break;
+
+                rte_mempool_put_bulk(ioat_pktmbuf_pool,
+                    (void *)mbufs_src, nb_dq);
+
+                /* Update macs if enabled */
+                if (mac_updating) {
+                    for (j = 0; j < nb_dq; j++)
+                        update_mac_addrs(mbufs_dst[j],
+                            tx_config->rxtx_port);
+                }
+
+                const uint16_t nb_tx = rte_eth_tx_burst(
+                    tx_config->rxtx_port, 0,
+                    (void *)mbufs_dst, nb_dq);
+
+                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
+
+                /* Free any unsent packets. */
+                if (unlikely(nb_tx < nb_dq))
+                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
+                        (void *)&mbufs_dst[nb_tx],
+                        nb_dq - nb_tx);
+            }
+        } else {
+            for (i = 0; i < tx_config->nb_queues; i++) {
+                /* Dequeue the mbufs from rx_to_tx_ring. */
+                nb_dq = rte_ring_dequeue_burst(tx_config->rx_to_tx_ring,
+                    (void *)mbufs_dst, MAX_PKT_BURST, NULL);
+
+                if (nb_dq == 0)
+                    return;
+
+                /* Update macs if enabled */
+                if (mac_updating) {
+                    for (j = 0; j < nb_dq; j++)
+                        update_mac_addrs(mbufs_dst[j],
+                            tx_config->rxtx_port);
+                }
+
+                const uint16_t nb_tx = rte_eth_tx_burst(tx_config->rxtx_port,
+                    0, (void *)mbufs_dst, nb_dq);
+
+                port_statistics.tx[tx_config->rxtx_port] += nb_tx;
+
+                /* Free any unsent packets. */
+                if (unlikely(nb_tx < nb_dq))
+                    rte_mempool_put_bulk(ioat_pktmbuf_pool,
+                        (void *)&mbufs_dst[nb_tx],
+                        nb_dq - nb_tx);
+            }
+        }
+    }
+
+The Packet Copying Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Software packet copies are performed by the user-defined function
+``pktmbuf_sw_copy()``. It copies a whole packet by first copying the
+metadata from the source packet to the new mbuf, and then copying the
+data chunk of the source packet. Both memory copies are done using
+``rte_memcpy()``:
+
+.. code-block:: c
+
+    static inline void
+    pktmbuf_sw_copy(struct rte_mbuf *src, struct rte_mbuf *dst)
+    {
+        /* Copy packet metadata */
+        rte_memcpy(&dst->rearm_data,
+            &src->rearm_data,
+            offsetof(struct rte_mbuf, cacheline1)
+                - offsetof(struct rte_mbuf, rearm_data));
+
+        /* Copy packet data */
+        rte_memcpy(rte_pktmbuf_mtod(dst, char *),
+            rte_pktmbuf_mtod(src, char *), src->data_len);
+    }
+
+The metadata in this example is copied from the ``rearm_data`` member of
+the ``rte_mbuf`` struct up to, but not including, ``cacheline1``.
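+
+The span copy can be tried out on a simplified structure. The following is
+a minimal sketch, not the DPDK definition: ``struct demo_mbuf`` and
+``demo_copy_metadata()`` are hypothetical names, the field layout is an
+assumption chosen for illustration, and plain ``memcpy()`` stands in for
+``rte_memcpy()``.
+
+.. code-block:: c
+
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    /* Hypothetical, simplified stand-in for struct rte_mbuf. */
+    struct demo_mbuf {
+        void *buf_addr;      /* not copied: each mbuf keeps its buffer */
+        uint64_t rearm_data; /* first field of the copied span */
+        uint32_t pkt_len;
+        uint16_t data_len;
+        uint16_t vlan_tci;
+        void *cacheline1;    /* marks the end of the copied span */
+        void *pool;          /* not copied */
+    };
+
+    /* Copy every field from rearm_data up to, but not including,
+     * cacheline1. Fields outside that span are left untouched. */
+    static void
+    demo_copy_metadata(struct demo_mbuf *dst, const struct demo_mbuf *src)
+    {
+        memcpy(&dst->rearm_data, &src->rearm_data,
+            offsetof(struct demo_mbuf, cacheline1)
+                - offsetof(struct demo_mbuf, rearm_data));
+    }
+
+In this sketch the destination keeps its own ``buf_addr`` and ``pool``
+values, which is why the copy starts at ``rearm_data`` rather than at the
+beginning of the structure.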
+
+To understand why software packet copying is done as shown
+above, please refer to the "Mbuf Library" section of the
+*DPDK Programmer's Guide*.
\ No newline at end of file