Message ID | 20210730135533.417611-1-thomas@monjalon.net (mailing list archive) |
---|---|
Headers |
From: Thomas Monjalon <thomas@monjalon.net>
To: dev@dpdk.org
Cc: Stephen Hemminger <stephen@networkplumber.org>, David Marchand <david.marchand@redhat.com>, Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>, Haiyue Wang <haiyue.wang@intel.com>, Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>, Jerin Jacob <jerinj@marvell.com>, Ferruh Yigit <ferruh.yigit@intel.com>
Date: Fri, 30 Jul 2021 15:55:26 +0200
Message-Id: <20210730135533.417611-1-thomas@monjalon.net>
In-Reply-To: <20210602203531.2288645-1-thomas@monjalon.net>
References: <20210602203531.2288645-1-thomas@monjalon.net>
Subject: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library
List-Id: DPDK patches and discussions <dev.dpdk.org> |
Series |
heterogeneous computing library
|
|
Message
Thomas Monjalon
July 30, 2021, 1:55 p.m. UTC
From: Elena Agostini <eagostini@nvidia.com>
In a heterogeneous computing system, processing is not done only by the CPU.
Some tasks can be delegated to devices working in parallel.
The goal of this new library is to enhance the collaboration between
DPDK, which is primarily a CPU framework, and other types of devices like GPUs.
When mixing network activity with task processing on a non-CPU device,
the CPU may need to communicate with the device
in order to manage memory, synchronize operations, exchange info, etc.
This library provides a number of new features:
- Interoperability with device-specific libraries through generic handlers
- Ability to allocate and free memory on the device
- Ability to allocate and free memory on the CPU that is visible from the device
- Communication functions to enhance the dialog between the CPU and the device
The infrastructure is prepared to welcome drivers in drivers/hc/,
such as the upcoming NVIDIA one, implementing the hcdev API.
Some parts are not complete:
- locks
- memory allocation table
- memory freeing
- guide documentation
- integration in devtools/check-doc-vs-code.sh
- unit tests
- integration in testpmd to enable Rx/Tx to/from GPU memory.
Below is pseudo-code giving an example of how to use the functions
in this library in a CUDA application.
Elena Agostini (4):
hcdev: introduce heterogeneous computing device library
hcdev: add memory API
hcdev: add communication flag
hcdev: add communication list
Thomas Monjalon (3):
hcdev: add event notification
hcdev: add child device representing a device context
hcdev: support multi-process
.gitignore | 1 +
MAINTAINERS | 6 +
doc/api/doxy-api-index.md | 1 +
doc/api/doxy-api.conf.in | 1 +
doc/guides/conf.py | 8 +
doc/guides/hcdevs/features/default.ini | 13 +
doc/guides/hcdevs/index.rst | 11 +
doc/guides/hcdevs/overview.rst | 11 +
doc/guides/index.rst | 1 +
doc/guides/prog_guide/hcdev.rst | 5 +
doc/guides/prog_guide/index.rst | 1 +
doc/guides/rel_notes/release_21_08.rst | 5 +
drivers/hc/meson.build | 4 +
drivers/meson.build | 1 +
lib/hcdev/hcdev.c | 789 +++++++++++++++++++++++++
lib/hcdev/hcdev_driver.h | 96 +++
lib/hcdev/meson.build | 12 +
lib/hcdev/rte_hcdev.h | 592 +++++++++++++++++++
lib/hcdev/version.map | 35 ++
lib/meson.build | 1 +
20 files changed, 1594 insertions(+)
create mode 100644 doc/guides/hcdevs/features/default.ini
create mode 100644 doc/guides/hcdevs/index.rst
create mode 100644 doc/guides/hcdevs/overview.rst
create mode 100644 doc/guides/prog_guide/hcdev.rst
create mode 100644 drivers/hc/meson.build
create mode 100644 lib/hcdev/hcdev.c
create mode 100644 lib/hcdev/hcdev_driver.h
create mode 100644 lib/hcdev/meson.build
create mode 100644 lib/hcdev/rte_hcdev.h
create mode 100644 lib/hcdev/version.map
////////////////////////////////////////////////////////////////////////
///// HCDEV library + CUDA functions
////////////////////////////////////////////////////////////////////////
#define GPU_PAGE_SHIFT 16
#define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT)
int main() {
	struct rte_hcdev_flag quit_flag;
	struct rte_hcdev_comm_list *comm_list;
	int nb_rx = 0;
	int comm_list_entry = 0;
	struct rte_mbuf *rx_mbufs[max_rx_mbufs];
	cudaStream_t cstream;
	struct rte_mempool *mpool_payload, *mpool_header;
	struct rte_pktmbuf_extmem ext_mem;
	int16_t dev_id;

	/* Initialize CUDA objects (cstream, context, etc.). */
	/* Use the hcdev library to register a new CUDA context if any. */
	/* Let's assume the application wants to use the default context of GPU device 0. */
	dev_id = 0;

	/* Create an external memory mempool using memory allocated on the GPU. */
	ext_mem.elt_size = mbufs_headroom_size;
	ext_mem.buf_len = RTE_ALIGN_CEIL(mbufs_num * ext_mem.elt_size, GPU_PAGE_SIZE);
	ext_mem.buf_iova = RTE_BAD_IOVA;
	ext_mem.buf_ptr = rte_hcdev_malloc(dev_id, ext_mem.buf_len, 0);
	rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, NULL, ext_mem.buf_iova, GPU_PAGE_SIZE);
	rte_dev_dma_map(rte_eth_devices[l2fwd_port_id].device, ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len);
	mpool_payload = rte_pktmbuf_pool_create_extbuf("gpu_mempool", mbufs_num,
						       0, 0, ext_mem.elt_size,
						       rte_socket_id(), &ext_mem, 1);

	/*
	 * Create the CPU - device communication flag. With this flag, the CPU can tell
	 * the CUDA kernel to exit from its main loop.
	 */
	rte_hcdev_comm_create_flag(dev_id, &quit_flag, RTE_HCDEV_COMM_FLAG_CPU);
	rte_hcdev_comm_set_flag(&quit_flag, 0);

	/*
	 * Create the CPU - device communication list. Each entry of this list will be
	 * populated by the CPU with a new set of received mbufs that the CUDA kernel
	 * has to process.
	 */
	comm_list = rte_hcdev_comm_create_list(dev_id, num_entries);

	/* A very simple CUDA kernel with just 1 CUDA block and RTE_HCDEV_COMM_LIST_PKTS_MAX CUDA threads. */
	cuda_kernel_packet_processing<<<1, RTE_HCDEV_COMM_LIST_PKTS_MAX, 0, cstream>>>(quit_flag.ptr, comm_list, num_entries, ...);

	/*
	 * For simplicity, the CPU here receives only 2 bursts of mbufs.
	 * In a real application, network activity and device processing should overlap.
	 */
	nb_rx = rte_eth_rx_burst(port_id, queue_id, &rx_mbufs[0], max_rx_mbufs);
	rte_hcdev_comm_populate_list_pkts(&comm_list[0], rx_mbufs, nb_rx);
	nb_rx = rte_eth_rx_burst(port_id, queue_id, &rx_mbufs[0], max_rx_mbufs);
	rte_hcdev_comm_populate_list_pkts(&comm_list[1], rx_mbufs, nb_rx);

	/*
	 * The CPU waits for the completion of the packets' processing by the CUDA kernel
	 * and then cleans up the received mbufs.
	 */
	while (rte_hcdev_comm_cleanup_list(&comm_list[0]));
	while (rte_hcdev_comm_cleanup_list(&comm_list[1]));

	/* The CPU notifies the CUDA kernel that it has to terminate. */
	rte_hcdev_comm_set_flag(&quit_flag, 1);

	/* hcdev objects cleanup/destruction */
	/* CUDA cleanup */
	/* DPDK cleanup */

	return 0;
}
////////////////////////////////////////////////////////////////////////
///// CUDA kernel
////////////////////////////////////////////////////////////////////////
__global__ void cuda_kernel_packet_processing(uint32_t *quit_flag_ptr, struct rte_hcdev_comm_list *comm_list, int comm_list_entries) {
	int comm_list_index = 0;
	struct rte_hcdev_comm_pkt *pkt_list = NULL;

	/* Do some pre-processing operations. */

	/* The GPU kernel keeps checking this flag to know if it has to quit or wait for more packets. */
	while (*quit_flag_ptr == 0)
	{
		if (comm_list[comm_list_index].status != RTE_HCDEV_COMM_LIST_READY)
			continue;

		if (threadIdx.x < comm_list[comm_list_index].num_pkts)
		{
			/* Each CUDA thread processes a different packet. */
			pkt_list = comm_list[comm_list_index].pkt_list;
			packet_processing(pkt_list[threadIdx.x].addr, pkt_list[threadIdx.x].size, ...);
		}
		__threadfence();
		__syncthreads();

		/* Thread 0 marks the entry as processed so the CPU-side cleanup can succeed. */
		if (threadIdx.x == 0)
			comm_list[comm_list_index].status = RTE_HCDEV_COMM_LIST_DONE;

		/* Wait for new packets on the next communication list entry. */
		comm_list_index = (comm_list_index + 1) % comm_list_entries;
	}

	/* Do some post-processing operations. */
}
Comments
On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> [...]
> Some parts are not complete:
> - locks
> - memory allocation table
> - memory freeing
> - guide documentation
> - integration in devtools/check-doc-vs-code.sh
> - unit tests
> - integration in testpmd to enable Rx/Tx to/from GPU memory.

Since the above line is the crux of the following text, I will start from this point.

+ Techboard

I can give my honest feedback on this.

I can map similar stuff in Marvell HW, where we do machine learning as a compute offload on a different class of CPU.

In terms of RFC patch features:
1) memory API - use cases are aligned
2) communication flag and communication list - our structure is completely different, and we are using a HW-ring kind of interface to post the job to the compute interface; the job completion result happens through the event device. Kind of similar to the DMA API that has been discussed on the mailing list.

Now the bigger question is: why Tx and then Rx something to the compute device? Isn't it offload of something? If so, why not add those offloads in the respective subsystems to improve the subsystem (ethdev, cryptodev etc.) feature set to adapt new features, or introduce a new subsystem (like ML, inline baseband processing), so that it will be an opportunity to implement the same in HW or a compute device. For example, if we take this path, ML offloading will be application code like testpmd, which deals with "specific" device commands (aka glorified rawdev) to deal with specific computing device offload "COMMANDS" (the commands will be specific to the offload device; the same code won't run on another compute device).

Just my _personal_ preference is to have specific subsystems to improve DPDK instead of a raw-device kind of path. If we decide on another path as a community it is _fine_ too (from a _project manager_ point of view it will be an easy path to dump SDK stuff into DPDK without introducing the pain of the subsystem nor improving DPDK).
31/07/2021 09:06, Jerin Jacob:
> [...]
> 2) communication flag and communication list
> Our structure is completely different and we are using HW ring kind of
> interface to post the job to compute interface and
> the job completion result happens through the event device.
> Kind of similar to the DMA API that has been discussed on the mailing list.

Interesting.

> Now the bigger question is why need to Tx and then Rx something to
> compute the device [...]
> (The commands will be specific to offload device, the same code wont
> run on other compute device)

Having specific feature APIs is convenient for compatibility between devices, yes, for the set of defined features. Our approach is to start with a flexible API that the application can use to implement any processing, because with GPU programming there is no restriction on what can be achieved. This approach does not contradict what you propose; it does not prevent extending existing classes.

> Just my _personal_ preference is to have specific subsystems to
> improve the DPDK instead of raw device kind of path. [...]

Adding a new class API is also improving DPDK.
On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> [...]
> Interesting.

It is hard to generalize the communication mechanism. Do other GPU vendors have a similar communication mechanism? AMD, Intel?

> Having specific features API is convenient for compatibility
> between devices, yes, for the set of defined features.
> Our approach is to start with a flexible API that the application
> can use to implement any processing because with GPU programming,
> there is no restriction on what can be achieved.
> This approach does not contradict what you propose,
> it does not prevent extending existing classes.

It does prevent extending the existing classes: no one is going to extend them if there is the path of not doing so. If an application can run only on a specific device, it is similar to a raw device, where the device definition is not defined (i.e. job metadata is not defined and is specific to the device).

> Adding a new class API is also improving DPDK.

But the class is similar to a rawdev class. The reason I say that: job submission and response can be abstracted as enqueue/dequeue APIs, while task/job metadata is specific to compute devices (and cannot be generalized). If we generalize it, it makes sense to have a new class that does a "specific function".
31/07/2021 15:42, Jerin Jacob:
> [...]
> It is hard to generalize the communication mechanism.
> Do other GPU vendors have a similar communication mechanism? AMD, Intel?

I don't know who to ask in AMD & Intel. Any ideas?

> It does prevent extending the existing classes: no one is going to
> extend them if there is the path of not doing so.

I disagree. A specific API is more convenient for some tasks, so there is an incentive to define or extend specific device class APIs. But it should not forbid doing custom processing.

> But the class is similar to a rawdev class. The reason I say that:
> job submission and response can be abstracted as enqueue/dequeue APIs,
> while task/job metadata is specific to compute devices (and cannot be
> generalized). If we generalize it, it makes sense to have a new class
> that does a "specific function".

Computing device programming is already generalized with languages like OpenCL. We should not try to reinvent the same. We are just trying to properly integrate the concept in DPDK and allow building on top of it.
On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > 31/07/2021 15:42, Jerin Jacob: > > On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > > 31/07/2021 09:06, Jerin Jacob: > > > > On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > > From: Elena Agostini <eagostini@nvidia.com> > > > > > > > > > > In heterogeneous computing system, processing is not only in the CPU. > > > > > Some tasks can be delegated to devices working in parallel. > > > > > > > > > > The goal of this new library is to enhance the collaboration between > > > > > DPDK, that's primarily a CPU framework, and other type of devices like GPUs. > > > > > > > > > > When mixing network activity with task processing on a non-CPU device, > > > > > there may be the need to put in communication the CPU with the device > > > > > in order to manage the memory, synchronize operations, exchange info, etc.. > > > > > > > > > > This library provides a number of new features: > > > > > - Interoperability with device specific library with generic handlers > > > > > - Possibility to allocate and free memory on the device > > > > > - Possibility to allocate and free memory on the CPU but visible from the device > > > > > - Communication functions to enhance the dialog between the CPU and the device > > > > > > > > > > The infrastructure is prepared to welcome drivers in drivers/hc/ > > > > > as the upcoming NVIDIA one, implementing the hcdev API. > > > > > > > > > > Some parts are not complete: > > > > > - locks > > > > > - memory allocation table > > > > > - memory freeing > > > > > - guide documentation > > > > > - integration in devtools/check-doc-vs-code.sh > > > > > - unit tests > > > > > - integration in testpmd to enable Rx/Tx to/from GPU memory. > > > > > > > > Since the above line is the crux of the following text, I will start > > > > from this point. 
> > > > > > > > + Techboard > > > > > > > > I can give my honest feedback on this. > > > > > > > > I can map similar stuff in Marvell HW, where we do machine learning > > > > as compute offload > > > > on a different class of CPU. > > > > > > > > In terms of RFC patch features > > > > > > > > 1) memory API - Use cases are aligned > > > > 2) communication flag and communication list > > > > Our structure is completely different and we are using HW ring kind of > > > > interface to post the job to compute interface and > > > > the job completion result happens through the event device. > > > > Kind of similar to the DMA API that has been discussed on the mailing list. > > > > > > Interesting. > > > > It is hard to generalize the communication mechanism. > > Is other GPU vendors have a similar communication mechanism? AMD, Intel ?? > > I don't know who to ask in AMD & Intel. Any ideas? Good question. At least in Marvell HW, our structure for the communication flag and communication list is completely different: we use a HW-ring kind of interface to post the job to the compute interface, and the job completion result arrives through the event device, somewhat similar to the DMA API that has been discussed on the mailing list. > > > > > Now the bigger question is why need to Tx and then Rx something to > > > > > compute the device > > > > > Isn't ot offload something? If so, why not add the those offload in > > > > > respective subsystem > > > > > to improve the subsystem(ethdev, cryptiodev etc) features set to adapt > > > > > new features or > > > > > introduce new subsystem (like ML, Inline Baseband processing) so that > > > > > it will be an opportunity to > > > > > implement the same in HW or compute device.
For example, if we take > > > > this path, ML offloading will > > > > be application code like testpmd, which deals with "specific" device > > > > commands(aka glorified rawdev) > > > > to deal with specific computing device offload "COMMANDS" > > > > (The commands will be specific to offload device, the same code wont > > > > run on other compute device) > > > > > > Having specific features API is convenient for compatibility > > > between devices, yes, for the set of defined features. > > > Our approach is to start with a flexible API that the application > > > can use to implement any processing because with GPU programming, > > > there is no restriction on what can be achieved. > > > This approach does not contradict what you propose, > > > it does not prevent extending existing classes. > > > > It does prevent extending the existing classes as no one is going to > > extent it there is the path of not doing do. > > I disagree. Specific API is more convenient for some tasks, > so there is an incentive to define or extend specific device class APIs. > But it should not forbid doing custom processing. This is the same as the raw device in DPDK, where the device personality is not defined. Even if we define another API without defining the personality, it ends up similar to the raw device, with rawdev-style enqueue and dequeue. To summarize:
1) My _personal_ preference is to have specific subsystems to improve DPDK instead of the raw-device kind of path.
2) If the device personality is not defined, use rawdev.
3) Not all computing devices use a "communication flag" and "communication list" kind of structure, so if we are targeting a generic computing device, that is not a portable scheme. If "communication flag" and "communication list" are the right mechanism for GPU abstraction, then we can have a separate library specific and explicit to the GPU <-> DPDK communication needs.
I think generic DPDK applications like testpmd should not be polluted with device-specific functions, i.e. calling device-specific messages from the application, which makes the application run on only one device. I don't have a strong opinion (except on standardizing "communication flag" and "communication list" as a generic computing-device communication mechanism) if others think it is OK to do it that way in DPDK. > > > If an application can run only on a specific device, it is similar to > > a raw device, > > where the device definition is not defined. (i.e JOB metadata is not defined and > > it is specific to the device). > > > > > > Just my _personal_ preference is to have specific subsystems to > > > > improve the DPDK instead of raw device kind of > > > > path. If we decide another path as a community it is _fine_ too(as a > > > > _project manager_ point of view it will be an easy path to dump SDK > > > > stuff to DPDK without introducing the pain of the subsystem nor > > > > improving the DPDK). > > > > > > Adding a new class API is also improving DPDK. > > > > But the class is similar as raw dev class. The reason I say, > > Job submission and response is can be abstracted as queue/dequeue APIs. > > Taks/Job metadata is specific to compute devices (and it can not be > > generalized). > > If we generalize it makes sense to have a new class that does > > "specific function". > > Computing device programming is already generalized with languages like OpenCL. > We should not try to reinvent the same. > We are just trying to properly integrate the concept in DPDK > and allow building on top of it. See above. > >
> -----Original Message----- > From: Jerin Jacob <jerinjacobk@gmail.com> > Sent: Friday, August 27, 2021 20:19 > To: Thomas Monjalon <thomas@monjalon.net> > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen Hemminger > <stephen@networkplumber.org>; David Marchand <david.marchand@redhat.com>; Andrew Rybchenko > <andrew.rybchenko@oktetlabs.ru>; Wang, Haiyue <haiyue.wang@intel.com>; Honnappa Nagarahalli > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; techboard@dpdk.org; Elena > Agostini <eagostini@nvidia.com> > Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > 31/07/2021 15:42, Jerin Jacob: > > > On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > 31/07/2021 09:06, Jerin Jacob: > > > > > On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon <thomas@monjalon.net> wrote: > > > > > > From: Elena Agostini <eagostini@nvidia.com> > > > > > > > > > > > > In heterogeneous computing system, processing is not only in the CPU. > > > > > > Some tasks can be delegated to devices working in parallel. > > > > > > > > > > > > The goal of this new library is to enhance the collaboration between > > > > > > DPDK, that's primarily a CPU framework, and other type of devices like GPUs. > > > > > > > > > > > > When mixing network activity with task processing on a non-CPU device, > > > > > > there may be the need to put in communication the CPU with the device > > > > > > in order to manage the memory, synchronize operations, exchange info, etc.. 
> > > > > > > > > > > > This library provides a number of new features: > > > > > > - Interoperability with device specific library with generic handlers > > > > > > - Possibility to allocate and free memory on the device > > > > > > - Possibility to allocate and free memory on the CPU but visible from the device > > > > > > - Communication functions to enhance the dialog between the CPU and the device > > > > > > > > > > > > The infrastructure is prepared to welcome drivers in drivers/hc/ > > > > > > as the upcoming NVIDIA one, implementing the hcdev API. > > > > > > > > > > > > Some parts are not complete: > > > > > > - locks > > > > > > - memory allocation table > > > > > > - memory freeing > > > > > > - guide documentation > > > > > > - integration in devtools/check-doc-vs-code.sh > > > > > > - unit tests > > > > > > - integration in testpmd to enable Rx/Tx to/from GPU memory. > > > > > > > > > > Since the above line is the crux of the following text, I will start > > > > > from this point. > > > > > > > > > > + Techboard > > > > > > > > > > I can give my honest feedback on this. > > > > > > > > > > I can map similar stuff in Marvell HW, where we do machine learning > > > > > as compute offload > > > > > on a different class of CPU. > > > > > > > > > > In terms of RFC patch features > > > > > > > > > > 1) memory API - Use cases are aligned > > > > > 2) communication flag and communication list > > > > > Our structure is completely different and we are using HW ring kind of > > > > > interface to post the job to compute interface and > > > > > the job completion result happens through the event device. > > > > > Kind of similar to the DMA API that has been discussed on the mailing list. > > > > > > > > Interesting. > > > > > > It is hard to generalize the communication mechanism. > > > Is other GPU vendors have a similar communication mechanism? AMD, Intel ?? > > > > I don't know who to ask in AMD & Intel. Any ideas? > > Good question. 
> > At least in Marvell HW, the communication flag and communication list is > our structure is completely different and we are using HW ring kind of > interface to post the job to compute interface and > the job completion result happens through the event device. > kind of similar to the DMA API that has been discussed on the mailing list. > > > > > > > > Now the bigger question is why need to Tx and then Rx something to > > > > > compute the device > > > > > Isn't ot offload something? If so, why not add the those offload in > > > > > respective subsystem > > > > > to improve the subsystem(ethdev, cryptiodev etc) features set to adapt > > > > > new features or > > > > > introduce new subsystem (like ML, Inline Baseband processing) so that > > > > > it will be an opportunity to > > > > > implement the same in HW or compute device. For example, if we take > > > > > this path, ML offloading will > > > > > be application code like testpmd, which deals with "specific" device > > > > > commands(aka glorified rawdev) > > > > > to deal with specific computing device offload "COMMANDS" > > > > > (The commands will be specific to offload device, the same code wont > > > > > run on other compute device) > > > > > > > > Having specific features API is convenient for compatibility > > > > between devices, yes, for the set of defined features. > > > > Our approach is to start with a flexible API that the application > > > > can use to implement any processing because with GPU programming, > > > > there is no restriction on what can be achieved. > > > > This approach does not contradict what you propose, > > > > it does not prevent extending existing classes. > > > > > > It does prevent extending the existing classes as no one is going to > > > extent it there is the path of not doing do. > > > > I disagree. Specific API is more convenient for some tasks, > > so there is an incentive to define or extend specific device class APIs. 
> > But it should not forbid doing custom processing. > > This is the same as the raw device is in DPDK where the device > personality is not defined. > > Even if define another API and if the personality is not defined, > it comes similar to the raw device as similar > to rawdev enqueue and dequeue. > > To summarize, > > 1) My _personal_ preference is to have specific subsystems > to improve the DPDK instead of the raw device kind of path. Something like rte_memdev, focused on device (GPU) memory management? The new DPDK auxiliary bus may make it easier to build this complex heterogeneous computing library. ;-) > 2) If the device personality is not defined, use rawdev > 3) All computing devices do not use "communication flag" and > "communication list" > kind of structure. If are targeting a generic computing device then > that is not a portable scheme. > For GPU abstraction if "communication flag" and "communication list" > is the right kind of mechanism > then we can have a separate library for GPU communication specific to GPU <-> > DPDK communication needs and explicit for GPU. > > I think generic DPDK applications like testpmd should not > pollute with device-specific functions. Like, call device-specific > messages from the application > which makes the application runs only one device. I don't have a > strong opinion(expect > standardizing "communication flag" and "communication list" as > generic computing device > communication mechanism) of others think it is OK to do that way in DPDK. > > > > > > If an application can run only on a specific device, it is similar to > > > a raw device, > > > where the device definition is not defined. (i.e JOB metadata is not defined and > > > it is specific to the device). > > > > > > > > Just my _personal_ preference is to have specific subsystems to > > > > > improve the DPDK instead of raw device kind of > > > > > path.
If we decide another path as a community it is _fine_ too(as a > > > > > _project manager_ point of view it will be an easy path to dump SDK > > > > > stuff to DPDK without introducing the pain of the subsystem nor > > > > > improving the DPDK). > > > > > > > > Adding a new class API is also improving DPDK. > > > > > > But the class is similar as raw dev class. The reason I say, > > > Job submission and response is can be abstracted as queue/dequeue APIs. > > > Taks/Job metadata is specific to compute devices (and it can not be > > > generalized). > > > If we generalize it makes sense to have a new class that does > > > "specific function". > > > > Computing device programming is already generalized with languages like OpenCL. > > We should not try to reinvent the same. > > We are just trying to properly integrate the concept in DPDK > > and allow building on top of it. > > See above. > > > > >
> -----Original Message----- > From: Wang, Haiyue <haiyue.wang@intel.com> > Sent: Sunday, August 29, 2021 7:33 AM > To: Jerin Jacob <jerinjacobk@gmail.com>; NBU-Contact-Thomas Monjalon > <thomas@monjalon.net> > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen > Hemminger <stephen@networkplumber.org>; David Marchand > <david.marchand@redhat.com>; Andrew Rybchenko > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > techboard@dpdk.org; Elena Agostini <eagostini@nvidia.com> > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > > > > -----Original Message----- > > From: Jerin Jacob <jerinjacobk@gmail.com> > > Sent: Friday, August 27, 2021 20:19 > > To: Thomas Monjalon <thomas@monjalon.net> > > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen > Hemminger > > <stephen@networkplumber.org>; David Marchand > <david.marchand@redhat.com>; Andrew Rybchenko > > <andrew.rybchenko@oktetlabs.ru>; Wang, Haiyue <haiyue.wang@intel.com>; > Honnappa Nagarahalli > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > techboard@dpdk.org; Elena > > Agostini <eagostini@nvidia.com> > > Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > > > On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon <thomas@monjalon.net> > wrote: > > > > > > 31/07/2021 15:42, Jerin Jacob: > > > > On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon > <thomas@monjalon.net> wrote: > > > > > 31/07/2021 09:06, Jerin Jacob: > > > > > > On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon > <thomas@monjalon.net> wrote: > > > > > > > From: Elena Agostini <eagostini@nvidia.com> > > > > > > > > > > > > > > In heterogeneous computing system, processing is not only in the > CPU. > > > > > > > Some tasks can be delegated to devices working in parallel. 
> > > > > > > > > > > > > > The goal of this new library is to enhance the collaboration between > > > > > > > DPDK, that's primarily a CPU framework, and other type of devices > like GPUs. > > > > > > > > > > > > > > When mixing network activity with task processing on a non-CPU > device, > > > > > > > there may be the need to put in communication the CPU with the > device > > > > > > > in order to manage the memory, synchronize operations, exchange > info, etc.. > > > > > > > > > > > > > > This library provides a number of new features: > > > > > > > - Interoperability with device specific library with generic handlers > > > > > > > - Possibility to allocate and free memory on the device > > > > > > > - Possibility to allocate and free memory on the CPU but visible from > the device > > > > > > > - Communication functions to enhance the dialog between the CPU > and the device > > > > > > > > > > > > > > The infrastructure is prepared to welcome drivers in drivers/hc/ > > > > > > > as the upcoming NVIDIA one, implementing the hcdev API. > > > > > > > > > > > > > > Some parts are not complete: > > > > > > > - locks > > > > > > > - memory allocation table > > > > > > > - memory freeing > > > > > > > - guide documentation > > > > > > > - integration in devtools/check-doc-vs-code.sh > > > > > > > - unit tests > > > > > > > - integration in testpmd to enable Rx/Tx to/from GPU memory. > > > > > > > > > > > > Since the above line is the crux of the following text, I will start > > > > > > from this point. > > > > > > > > > > > > + Techboard > > > > > > > > > > > > I can give my honest feedback on this. > > > > > > > > > > > > I can map similar stuff in Marvell HW, where we do machine learning > > > > > > as compute offload > > > > > > on a different class of CPU. 
> > > > > > > > > > > > > > In terms of RFC patch features > > > > > > > > > > > > > > 1) memory API - Use cases are aligned > > > > > > > 2) communication flag and communication list > > > > > > > Our structure is completely different and we are using HW ring kind of > > > > > > > interface to post the job to compute interface and > > > > > > > the job completion result happens through the event device. > > > > > > > Kind of similar to the DMA API that has been discussed on the mailing > list. > > > > > > > > > > > > Interesting. > > > > > > > > > > It is hard to generalize the communication mechanism. > > > > > Is other GPU vendors have a similar communication mechanism? AMD, > Intel ?? > > > > > > > > I don't know who to ask in AMD & Intel. Any ideas? > > > > Good question. > > > > At least in Marvell HW, the communication flag and communication list is > > our structure is completely different and we are using HW ring kind of > > interface to post the job to compute interface and > > the job completion result happens through the event device. > > kind of similar to the DMA API that has been discussed on the mailing list. Please correct me if I'm wrong, but what you are describing is a specific way to submit work to the device. The communication flag/list here is a direct data communication between the CPU and some kind of workload (e.g. a GPU kernel) that's already running on the device. The rationale here is that:
- some work has already been submitted to the device and is running
- the CPU needs a real-time direct interaction with the device through memory
- the workload on the device needs some info from the CPU that it can't get at submission time
This is good enough for NVIDIA and AMD GPUs. > > > > > > > > > > > > > > > Now the bigger question is why need to Tx and then Rx something to > > > > > > compute the device > > > > > > Isn't ot offload something?
If so, why not add the those offload in > > > > > > respective subsystem > > > > > > to improve the subsystem(ethdev, cryptiodev etc) features set to adapt > > > > > > new features or > > > > > > introduce new subsystem (like ML, Inline Baseband processing) so that > > > > > > it will be an opportunity to > > > > > > implement the same in HW or compute device. For example, if we take > > > > > > this path, ML offloading will > > > > > > be application code like testpmd, which deals with "specific" device > > > > > > commands(aka glorified rawdev) > > > > > > to deal with specific computing device offload "COMMANDS" > > > > > > (The commands will be specific to offload device, the same code wont > > > > > > run on other compute device) > > > > > > > > > > Having specific features API is convenient for compatibility > > > > > between devices, yes, for the set of defined features. > > > > > Our approach is to start with a flexible API that the application > > > > > can use to implement any processing because with GPU programming, > > > > > there is no restriction on what can be achieved. > > > > > This approach does not contradict what you propose, > > > > > it does not prevent extending existing classes. > > > > > > > > It does prevent extending the existing classes as no one is going to > > > > extent it there is the path of not doing do. > > > > > > I disagree. Specific API is more convenient for some tasks, > > > so there is an incentive to define or extend specific device class APIs. > > > But it should not forbid doing custom processing. > > > > This is the same as the raw device is in DPDK where the device > > personality is not defined. > > > > Even if define another API and if the personality is not defined, > > it comes similar to the raw device as similar > > to rawdev enqueue and dequeue. > > > > To summarize, > > > > 1) My _personal_ preference is to have specific subsystems > > to improve the DPDK instead of the raw device kind of path. 
> > Something like rte_memdev to focus on device (GPU) memory management ? > > The new DPDK auxiliary bus maybe make life easier to solve the complex > heterogeneous computing library. ;-) To get a concrete idea of the best and most comprehensive approach, we should start with something that's flexible and simple enough. A dedicated library is a good starting point: it is easy to implement and embed in DPDK applications, it is isolated from other components, and users can play with it, learning from the code. As a second step we can think about embedding the functionality in some other way within DPDK (e.g. splitting memory management and communication features). > > > 2) If the device personality is not defined, use rawdev > > 3) All computing devices do not use "communication flag" and > > "communication list" > > kind of structure. If are targeting a generic computing device then > > that is not a portable scheme. > > For GPU abstraction if "communication flag" and "communication list" > > is the right kind of mechanism > > then we can have a separate library for GPU communication specific to GPU <- > > > > DPDK communication needs and explicit for GPU. > > > > I think generic DPDK applications like testpmd should not > > pollute with device-specific functions. Like, call device-specific > > messages from the application > > which makes the application runs only one device. I don't have a > > strong opinion(expect > > standardizing "communication flag" and "communication list" as > > generic computing device > > communication mechanism) of others think it is OK to do that way in DPDK. I'd like to introduce (behind a dedicated option) the memory API in testpmd to provide an example of how to TX/RX packets using device memory. I agree not to embed the communication flag/list features. > > > > > > > > > If an application can run only on a specific device, it is similar to > > > > a raw device, > > > > where the device definition is not defined.
(i.e JOB metadata is not defined > and > > > > it is specific to the device). > > > > > > > > > > Just my _personal_ preference is to have specific subsystems to > > > > > > improve the DPDK instead of raw device kind of > > > > > > path. If we decide another path as a community it is _fine_ too(as a > > > > > > _project manager_ point of view it will be an easy path to dump SDK > > > > > > stuff to DPDK without introducing the pain of the subsystem nor > > > > > > improving the DPDK). > > > > > > > > > > Adding a new class API is also improving DPDK. > > > > > > > > But the class is similar as raw dev class. The reason I say, > > > > Job submission and response is can be abstracted as queue/dequeue APIs. > > > > Taks/Job metadata is specific to compute devices (and it can not be > > > > generalized). > > > > If we generalize it makes sense to have a new class that does > > > > "specific function". > > > > > > Computing device programming is already generalized with languages like > OpenCL. > > > We should not try to reinvent the same. > > > We are just trying to properly integrate the concept in DPDK > > > and allow building on top of it. Agree. > > > > See above. > > > > > > > >
On Wed, Sep 1, 2021 at 9:05 PM Elena Agostini <eagostini@nvidia.com> wrote: > > > > -----Original Message----- > > From: Wang, Haiyue <haiyue.wang@intel.com> > > Sent: Sunday, August 29, 2021 7:33 AM > > To: Jerin Jacob <jerinjacobk@gmail.com>; NBU-Contact-Thomas Monjalon > > <thomas@monjalon.net> > > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen > > Hemminger <stephen@networkplumber.org>; David Marchand > > <david.marchand@redhat.com>; Andrew Rybchenko > > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > > techboard@dpdk.org; Elena Agostini <eagostini@nvidia.com> > > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > > > > > > > > -----Original Message----- > > > From: Jerin Jacob <jerinjacobk@gmail.com> > > > Sent: Friday, August 27, 2021 20:19 > > > To: Thomas Monjalon <thomas@monjalon.net> > > > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen > > Hemminger > > > <stephen@networkplumber.org>; David Marchand > > <david.marchand@redhat.com>; Andrew Rybchenko > > > <andrew.rybchenko@oktetlabs.ru>; Wang, Haiyue <haiyue.wang@intel.com>; > > Honnappa Nagarahalli > > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > > techboard@dpdk.org; Elena > > > Agostini <eagostini@nvidia.com> > > > Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > > > > > On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon <thomas@monjalon.net> > > wrote: > > > > > > > > 31/07/2021 15:42, Jerin Jacob: > > > > > On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon > > <thomas@monjalon.net> wrote: > > > > > > 31/07/2021 09:06, Jerin Jacob: > > > > > > > On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon > > <thomas@monjalon.net> wrote: > > > > > > > > From: Elena Agostini <eagostini@nvidia.com> > > > > > > > > > > > > > > > > In heterogeneous computing system, processing is not only in the > > CPU. 
> > > > > > > > Some tasks can be delegated to devices working in parallel. > > > > > > > > > > > > > > > > The goal of this new library is to enhance the collaboration between > > > > > > > > DPDK, that's primarily a CPU framework, and other type of devices > > like GPUs. > > > > > > > > > > > > > > > > When mixing network activity with task processing on a non-CPU > > device, > > > > > > > > there may be the need to put in communication the CPU with the > > device > > > > > > > > in order to manage the memory, synchronize operations, exchange > > info, etc.. > > > > > > > > > > > > > > > > This library provides a number of new features: > > > > > > > > - Interoperability with device specific library with generic handlers > > > > > > > > - Possibility to allocate and free memory on the device > > > > > > > > - Possibility to allocate and free memory on the CPU but visible from > > the device > > > > > > > > - Communication functions to enhance the dialog between the CPU > > and the device > > > > > > > > > > > > > > > > The infrastructure is prepared to welcome drivers in drivers/hc/ > > > > > > > > as the upcoming NVIDIA one, implementing the hcdev API. > > > > > > > > > > > > > > > > Some parts are not complete: > > > > > > > > - locks > > > > > > > > - memory allocation table > > > > > > > > - memory freeing > > > > > > > > - guide documentation > > > > > > > > - integration in devtools/check-doc-vs-code.sh > > > > > > > > - unit tests > > > > > > > > - integration in testpmd to enable Rx/Tx to/from GPU memory. > > > > > > > > > > > > > > Since the above line is the crux of the following text, I will start > > > > > > > from this point. > > > > > > > > > > > > > > + Techboard > > > > > > > > > > > > > > I can give my honest feedback on this. > > > > > > > > > > > > > > I can map similar stuff in Marvell HW, where we do machine learning > > > > > > > as compute offload > > > > > > > on a different class of CPU. 
> > > > > > > > > > > > > > In terms of RFC patch features > > > > > > > > > > > > > > 1) memory API - Use cases are aligned > > > > > > > 2) communication flag and communication list > > > > > > > Our structure is completely different and we are using HW ring kind of > > > > > > > interface to post the job to compute interface and > > > > > > > the job completion result happens through the event device. > > > > > > > Kind of similar to the DMA API that has been discussed on the mailing > > list. > > > > > > > > > > > > Interesting. > > > > > > > > > > It is hard to generalize the communication mechanism. > > > > > Is other GPU vendors have a similar communication mechanism? AMD, > > Intel ?? > > > > > > > > I don't know who to ask in AMD & Intel. Any ideas? > > > > > > Good question. > > > > > > At least in Marvell HW, the communication flag and communication list is > > > our structure is completely different and we are using HW ring kind of > > > interface to post the job to compute interface and > > > the job completion result happens through the event device. > > > kind of similar to the DMA API that has been discussed on the mailing list. > > Please correct me if I'm wrong but what you are describing is a specific way > to submit work on the device. Communication flag/list here is a direct data > communication between the CPU and some kind of workload (e.g. GPU kernel) > that's already running on the device. Exactly. What I meant is that communication flag/list is not generic enough to express a generic compute device. If all GPUs work this way, we could make the library name GPU-specific and add a GPU-specific communication mechanism. > > The rationale here is that: > - some work has been already submitted on the device and it's running > - CPU needs a real-time direct interaction through memory with the device > - the workload on the device needs some info from the CPU it can't get at submission time > > This is good enough for NVIDIA and AMD GPU.
> Need to double check for Intel GPU.
>
> > > > > > > Now the bigger question is: why need to Tx and then Rx something to
> > > > > > > the compute device?
> > > > > > > Isn't it offloading something? If so, why not add those offloads in the
> > > > > > > respective subsystem,
> > > > > > > to improve the subsystem (ethdev, cryptodev etc.) feature set to adapt
> > > > > > > new features, or
> > > > > > > introduce a new subsystem (like ML, inline baseband processing) so that
> > > > > > > it will be an opportunity to
> > > > > > > implement the same in HW or a compute device. For example, if we take
> > > > > > > this path, ML offloading will
> > > > > > > be application code like testpmd, which deals with "specific" device
> > > > > > > commands (aka glorified rawdev)
> > > > > > > to deal with specific computing device offload "COMMANDS"
> > > > > > > (the commands will be specific to the offload device; the same code won't
> > > > > > > run on another compute device).
> > > > > >
> > > > > > Having specific feature APIs is convenient for compatibility
> > > > > > between devices, yes, for the set of defined features.
> > > > > > Our approach is to start with a flexible API that the application
> > > > > > can use to implement any processing, because with GPU programming
> > > > > > there is no restriction on what can be achieved.
> > > > > > This approach does not contradict what you propose;
> > > > > > it does not prevent extending existing classes.
> > > > >
> > > > > It does prevent extending the existing classes, as no one is going to
> > > > > extend them if there is the path of not doing so.
> > > >
> > > > I disagree. A specific API is more convenient for some tasks,
> > > > so there is an incentive to define or extend specific device class APIs.
> > > > But it should not forbid doing custom processing.
> > >
> > > This is the same as the raw device is in DPDK, where the device
> > > personality is not defined.
> > > Even if we define another API and the personality is not defined,
> > > it comes out similar to the raw device, similar
> > > to rawdev enqueue and dequeue.
> > >
> > > To summarize,
> > >
> > > 1) My _personal_ preference is to have specific subsystems
> > > to improve the DPDK instead of the raw device kind of path.
> >
> > Something like rte_memdev to focus on device (GPU) memory management?
> >
> > The new DPDK auxiliary bus may make life easier to solve the complex
> > heterogeneous computing library. ;-)
>
> To get a concrete idea about what's the best and most comprehensive
> approach, we should start with something that's flexible and simple enough.
>
> A dedicated library is a good starting point: easy to implement and embed in
> DPDK applications, isolated from other components, and users can play with
> it, learning from the code.
> As a second step we can think about embedding the functionality in some
> other way within DPDK (e.g. splitting memory management and communication
> features).
>
> > > 2) If the device personality is not defined, use rawdev
> > > 3) All computing devices do not use a "communication flag" and
> > > "communication list"
> > > kind of structure. If we are targeting a generic computing device, then
> > > that is not a portable scheme.
> > > For GPU abstraction, if "communication flag" and "communication list"
> > > is the right kind of mechanism,
> > > then we can have a separate library for GPU communication, specific to
> > > GPU <-> DPDK communication needs and explicit for GPU.
> > >
> > > I think generic DPDK applications like testpmd should not be
> > > polluted with device-specific functions; like calling device-specific
> > > messages from the application,
> > > which makes the application run on only one device. I don't have a
> > > strong opinion (except
> > > standardizing "communication flag" and "communication list" as a
> > > generic computing device
> > > communication mechanism) if others think it is OK to do it that way in DPDK.
> > I'd like to introduce (with a dedicated option) the memory API in testpmd to
> > provide an example of how to TX/RX packets using device memory.

Not sure, without embedding a sideband communication mechanism, how it can
notify the GPU and back to the CPU. If you could share the example API
sequence, that would help us understand the level of coupling with testpmd.

> > I agree to not embed communication flag/list features.
>
> > > > > If an application can run only on a specific device, it is similar to
> > > > > a raw device,
> > > > > where the device definition is not defined (i.e. job metadata is not
> > > > > defined and it is specific to the device).
> > > > >
> > > > > > > Just my _personal_ preference is to have specific subsystems to
> > > > > > > improve the DPDK instead of a raw device kind of path.
> > > > > > > If we decide another path as a community it is _fine_ too (from a
> > > > > > > _project manager_ point of view it will be an easy path to dump SDK
> > > > > > > stuff to DPDK without introducing the pain of the subsystem nor
> > > > > > > improving the DPDK).
> > > > > >
> > > > > > Adding a new class API is also improving DPDK.
> > > > >
> > > > > But the class is similar to the rawdev class. The reason I say that:
> > > > > job submission and response can be abstracted as enqueue/dequeue APIs.
> > > > > Task/job metadata is specific to compute devices (and it can not be
> > > > > generalized).
> > > > > If we generalize it, it makes sense to have a new class that does a
> > > > > "specific function".
> > > >
> > > > Computing device programming is already generalized with languages like OpenCL.
> > > > We should not try to reinvent the same.
> > > > We are just trying to properly integrate the concept in DPDK
> > > > and allow building on top of it.
> >
> > Agree.
>
> > > See above.
> -----Original Message----- > From: Jerin Jacob <jerinjacobk@gmail.com> > Sent: Thursday, September 2, 2021 3:12 PM > To: Elena Agostini <eagostini@nvidia.com> > Cc: Wang, Haiyue <haiyue.wang@intel.com>; NBU-Contact-Thomas > Monjalon <thomas@monjalon.net>; Jerin Jacob <jerinj@marvell.com>; > dpdk-dev <dev@dpdk.org>; Stephen Hemminger > <stephen@networkplumber.org>; David Marchand > <david.marchand@redhat.com>; Andrew Rybchenko > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > techboard@dpdk.org > Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > library > > > On Wed, Sep 1, 2021 at 9:05 PM Elena Agostini <eagostini@nvidia.com> > wrote: > > > > > > > -----Original Message----- > > > From: Wang, Haiyue <haiyue.wang@intel.com> > > > Sent: Sunday, August 29, 2021 7:33 AM > > > To: Jerin Jacob <jerinjacobk@gmail.com>; NBU-Contact-Thomas > Monjalon > > > <thomas@monjalon.net> > > > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; > > > Stephen Hemminger <stephen@networkplumber.org>; David Marchand > > > <david.marchand@redhat.com>; Andrew Rybchenko > > > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh > > > <ferruh.yigit@intel.com>; techboard@dpdk.org; Elena Agostini > > > <eagostini@nvidia.com> > > > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > > > library > > > > > > > > > > > > > -----Original Message----- > > > > From: Jerin Jacob <jerinjacobk@gmail.com> > > > > Sent: Friday, August 27, 2021 20:19 > > > > To: Thomas Monjalon <thomas@monjalon.net> > > > > Cc: Jerin Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; > > > > Stephen > > > Hemminger > > > > <stephen@networkplumber.org>; David Marchand > > > <david.marchand@redhat.com>; Andrew Rybchenko > > > > <andrew.rybchenko@oktetlabs.ru>; Wang, Haiyue > > > > <haiyue.wang@intel.com>; > > > Honnappa Nagarahalli > > > > 
<honnappa.nagarahalli@arm.com>; Yigit, Ferruh > > > > <ferruh.yigit@intel.com>; > > > techboard@dpdk.org; Elena > > > > Agostini <eagostini@nvidia.com> > > > > Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > > > > library > > > > > > > > On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon > > > > <thomas@monjalon.net> > > > wrote: > > > > > > > > > > 31/07/2021 15:42, Jerin Jacob: > > > > > > On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon > > > <thomas@monjalon.net> wrote: > > > > > > > 31/07/2021 09:06, Jerin Jacob: > > > > > > > > On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon > > > <thomas@monjalon.net> wrote: > > > > > > > > > From: Elena Agostini <eagostini@nvidia.com> > > > > > > > > > > > > > > > > > > In heterogeneous computing system, processing is not > > > > > > > > > only in the > > > CPU. > > > > > > > > > Some tasks can be delegated to devices working in parallel. > > > > > > > > > > > > > > > > > > The goal of this new library is to enhance the > > > > > > > > > collaboration between DPDK, that's primarily a CPU > > > > > > > > > framework, and other type of devices > > > like GPUs. > > > > > > > > > > > > > > > > > > When mixing network activity with task processing on a > > > > > > > > > non-CPU > > > device, > > > > > > > > > there may be the need to put in communication the CPU > > > > > > > > > with the > > > device > > > > > > > > > in order to manage the memory, synchronize operations, > > > > > > > > > exchange > > > info, etc.. 
> > > > > > > > > > > > > > > > > > This library provides a number of new features: > > > > > > > > > - Interoperability with device specific library with > > > > > > > > > generic handlers > > > > > > > > > - Possibility to allocate and free memory on the device > > > > > > > > > - Possibility to allocate and free memory on the CPU but > > > > > > > > > visible from > > > the device > > > > > > > > > - Communication functions to enhance the dialog between > > > > > > > > > the CPU > > > and the device > > > > > > > > > > > > > > > > > > The infrastructure is prepared to welcome drivers in > > > > > > > > > drivers/hc/ as the upcoming NVIDIA one, implementing the > hcdev API. > > > > > > > > > > > > > > > > > > Some parts are not complete: > > > > > > > > > - locks > > > > > > > > > - memory allocation table > > > > > > > > > - memory freeing > > > > > > > > > - guide documentation > > > > > > > > > - integration in devtools/check-doc-vs-code.sh > > > > > > > > > - unit tests > > > > > > > > > - integration in testpmd to enable Rx/Tx to/from GPU > memory. > > > > > > > > > > > > > > > > Since the above line is the crux of the following text, I > > > > > > > > will start from this point. > > > > > > > > > > > > > > > > + Techboard > > > > > > > > > > > > > > > > I can give my honest feedback on this. > > > > > > > > > > > > > > > > I can map similar stuff in Marvell HW, where we do > > > > > > > > machine learning as compute offload on a different class > > > > > > > > of CPU. > > > > > > > > > > > > > > > > In terms of RFC patch features > > > > > > > > > > > > > > > > 1) memory API - Use cases are aligned > > > > > > > > 2) communication flag and communication list Our structure > > > > > > > > is completely different and we are using HW ring kind of > > > > > > > > interface to post the job to compute interface and the job > > > > > > > > completion result happens through the event device. 
> > > > > > > > Kind of similar to the DMA API that has been discussed on
> > > > > > > > the mailing list.
> > > > > > >
> > > > > > > Interesting.
> > > > > >
> > > > > > It is hard to generalize the communication mechanism.
> > > > > > Do other GPU vendors have a similar communication mechanism?
> > > > > > AMD, Intel?
> > > > >
> > > > > I don't know who to ask in AMD & Intel. Any ideas?
> > > >
> > > > Good question.
> > > >
> > > > At least in Marvell HW, the structure of the communication flag and
> > > > communication list is completely different: we are using a HW-ring
> > > > kind of interface to post the job to the compute interface, and
> > > > the job completion result happens through the event device.
> > > > Kind of similar to the DMA API that has been discussed on the mailing list.
> >
> > Please correct me if I'm wrong, but what you are describing is a
> > specific way to submit work on the device. Communication flag/list
> > here is a direct data communication between the CPU and some kind of
> > workload (e.g. GPU kernel) that's already running on the device.
>
> Exactly. What I meant is that communication flag/list is not generic enough
> to express a generic compute device. If all GPUs work in this way, we could
> make the library name GPU-specific and add a GPU-specific communication
> mechanism.

I'm in favor of reverting the name of the library to the more specific gpudev
instead of hcdev. This library (both memory allocations and fancy features
like communication lists) can be tested on various GPUs, but I'm not sure
about other types of devices.

Again, as an initial step, I would not complicate things.
Let's have a GPU-oriented library for now.
> >
> > The rationale here is that:
> > - some work has been already submitted on the device and it's running
> > - CPU needs a real-time direct interaction through memory with the
> > device
> > - the workload on the device needs some info from the CPU it can't get
> > at submission time
> >
> > This is good enough for NVIDIA and AMD GPUs.
> > Need to double check for Intel GPU.
>
> > > > > > > > Now the bigger question is: why need to Tx and then Rx
> > > > > > > > something to the compute device? Isn't it offloading
> > > > > > > > something? If so, why not add those offloads in the
> > > > > > > > respective subsystem, to improve the subsystem (ethdev,
> > > > > > > > cryptodev etc.) feature set to adapt new features, or
> > > > > > > > introduce a new subsystem (like ML, inline baseband
> > > > > > > > processing) so that it will be an opportunity to implement
> > > > > > > > the same in HW or a compute device. For example, if we take
> > > > > > > > this path, ML offloading will be application code like
> > > > > > > > testpmd, which deals with "specific" device commands (aka
> > > > > > > > glorified rawdev) to deal with specific computing device
> > > > > > > > offload "COMMANDS"
> > > > > > > > (the commands will be specific to the offload device; the
> > > > > > > > same code won't run on another compute device).
> > > > > > >
> > > > > > > Having specific feature APIs is convenient for compatibility
> > > > > > > between devices, yes, for the set of defined features.
> > > > > > > Our approach is to start with a flexible API that the
> > > > > > > application can use to implement any processing, because with
> > > > > > > GPU programming there is no restriction on what can be achieved.
> > > > > > > This approach does not contradict what you propose; it does
> > > > > > > not prevent extending existing classes.
> > > > > >
> > > > > > It does prevent extending the existing classes, as no one is
> > > > > > going to extend them if there is the path of not doing so.
> > > > >
> > > > > I disagree. A specific API is more convenient for some tasks, so
> > > > > there is an incentive to define or extend specific device class APIs.
> > > > > But it should not forbid doing custom processing.
> > > >
> > > > This is the same as the raw device is in DPDK, where the device
> > > > personality is not defined.
> > > >
> > > > Even if we define another API and the personality is not defined,
> > > > it comes out similar to the raw device, similar to rawdev enqueue
> > > > and dequeue.
> > > >
> > > > To summarize,
> > > >
> > > > 1) My _personal_ preference is to have specific subsystems to
> > > > improve the DPDK instead of the raw device kind of path.
> > >
> > > Something like rte_memdev to focus on device (GPU) memory management?
> > >
> > > The new DPDK auxiliary bus may make life easier to solve the complex
> > > heterogeneous computing library. ;-)
> >
> > To get a concrete idea about what's the best and most comprehensive
> > approach, we should start with something that's flexible and simple
> > enough.
> >
> > A dedicated library is a good starting point: easy to implement and
> > embed in DPDK applications, isolated from other components, and users
> > can play with it, learning from the code.
> > As a second step we can think about embedding the functionality in some
> > other way within DPDK (e.g. splitting memory management and
> > communication features).
>
> > > > 2) If the device personality is not defined, use rawdev
> > > > 3) All computing devices do not use a "communication flag" and
> > > > "communication list"
> > > > kind of structure. If we are targeting a generic computing device,
> > > > then that is not a portable scheme.
> > > > For GPU abstraction, if "communication flag" and "communication list"
> > > > is the right kind of mechanism,
> > > > then we can have a separate library for GPU communication, specific
> > > > to GPU <-> DPDK communication needs and explicit for GPU.
> > > >
> > > > I think generic DPDK applications like testpmd should not be polluted
> > > > with device-specific functions; like calling device-specific
> > > > messages from the application, which makes the application run on
> > > > only one device. I don't have a strong opinion (except
> > > > standardizing "communication flag" and "communication list" as a
> > > > generic computing device communication mechanism) if others think
> > > > it is OK to do it that way in DPDK.
>
> > I'd like to introduce (with a dedicated option) the memory API in
> > testpmd to provide an example of how to TX/RX packets using device
> > memory.
>
> Not sure, without embedding a sideband communication mechanism, how it
> can notify the GPU and back to the CPU. If you could share the example
> API sequence, that would help us understand the level of coupling with
> testpmd.

There is no need for a communication mechanism here.
Assuming there is no workload to process network packets (to not complicate
things), the steps are:
1) Create a DPDK mempool with device external memory using the hcdev
   (or gpudev) library
2) Use that mempool to tx/rx/fwd packets

As an example, you can look at my l2fwd-nv application here:
https://github.com/NVIDIA/l2fwd-nv

> >
> > I agree to not embed communication flag/list features.
>
> > > > > > If an application can run only on a specific device, it is
> > > > > > similar to a raw device, where the device definition is not
> > > > > > defined (i.e. job metadata is not defined
> > > > > > and it is specific to the device).
> > > > > > > > > > > > > > Just my _personal_ preference is to have specific > > > > > > > > subsystems to improve the DPDK instead of raw device kind > > > > > > > > of path. If we decide another path as a community it is > > > > > > > > _fine_ too(as a _project manager_ point of view it will be > > > > > > > > an easy path to dump SDK stuff to DPDK without introducing > > > > > > > > the pain of the subsystem nor improving the DPDK). > > > > > > > > > > > > > > Adding a new class API is also improving DPDK. > > > > > > > > > > > > But the class is similar as raw dev class. The reason I say, > > > > > > Job submission and response is can be abstracted as > queue/dequeue APIs. > > > > > > Taks/Job metadata is specific to compute devices (and it can > > > > > > not be generalized). > > > > > > If we generalize it makes sense to have a new class that does > > > > > > "specific function". > > > > > > > > > > Computing device programming is already generalized with > > > > > languages like > > > OpenCL. > > > > > We should not try to reinvent the same. > > > > > We are just trying to properly integrate the concept in DPDK and > > > > > allow building on top of it. > > > > Agree. > > > > > > > > > > See above. > > > > > > > > > > > > > >
> -----Original Message----- > From: Elena Agostini <eagostini@nvidia.com> > Sent: Tuesday, September 7, 2021 00:11 > To: Jerin Jacob <jerinjacobk@gmail.com> > Cc: Wang, Haiyue <haiyue.wang@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Jerin > Jacob <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen Hemminger <stephen@networkplumber.org>; > David Marchand <david.marchand@redhat.com>; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Honnappa > Nagarahalli <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; techboard@dpdk.org > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > > > > > > > > I'd like to introduce (with a dedicated option) the memory API in > > > testpmd to provide an example of how to TX/RX packets using device > > memory. > > > > Not sure without embedding sideband communication mechanism how it > > can notify to GPU and back to CPU. If you could share the example API > > sequence that helps to us understand the level of coupling with testpmd. > > > > There is no need of communication mechanism here. > Assuming there is not workload to process network packets (to not complicate > things), the steps are: > 1) Create a DPDK mempool with device external memory using the hcdev (or gpudev) library > 2) Use that mempool to tx/rx/fwd packets > > As an example, you look at my l2fwd-nv application here: https://github.com/NVIDIA/l2fwd-nv > To enhance the 'rte_extmem_register' / 'rte_pktmbuf_pool_create_extbuf' ? 
if (l2fwd_mem_type == MEM_HOST_PINNED) {
	ext_mem.buf_ptr = rte_malloc("extmem", ext_mem.buf_len, 0);
	CUDA_CHECK(cudaHostRegister(ext_mem.buf_ptr, ext_mem.buf_len,
				    cudaHostRegisterMapped));
	void *pDevice;
	CUDA_CHECK(cudaHostGetDevicePointer(&pDevice, ext_mem.buf_ptr, 0));
	if (pDevice != ext_mem.buf_ptr)
		rte_exit(EXIT_FAILURE,
			 "GPU pointer does not match CPU pointer\n");
} else {
	ext_mem.buf_iova = RTE_BAD_IOVA;
	CUDA_CHECK(cudaMalloc(&ext_mem.buf_ptr, ext_mem.buf_len));
	if (ext_mem.buf_ptr == NULL)
		rte_exit(EXIT_FAILURE, "Could not allocate GPU memory\n");

	unsigned int flag = 1;
	CUresult status = cuPointerSetAttribute(&flag,
			CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
			(CUdeviceptr)ext_mem.buf_ptr);
	if (CUDA_SUCCESS != status) {
		rte_exit(EXIT_FAILURE,
			 "Could not set SYNC MEMOP attribute for GPU memory at %llx\n",
			 (CUdeviceptr)ext_mem.buf_ptr);
	}
	ret = rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len,
				  NULL, ext_mem.buf_iova, GPU_PAGE_SIZE);
	if (ret)
		rte_exit(EXIT_FAILURE, "Could not register GPU memory\n");
}
ret = rte_dev_dma_map(rte_eth_devices[l2fwd_port_id].device,
		      ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len);
if (ret)
	rte_exit(EXIT_FAILURE, "Could not DMA map EXT memory\n");
mpool_payload = rte_pktmbuf_pool_create_extbuf("payload_mpool",
					       l2fwd_nb_mbufs,
					       0, 0, ext_mem.elt_size,
					       rte_socket_id(), &ext_mem, 1);
if (mpool_payload == NULL)
	rte_exit(EXIT_FAILURE, "Could not create EXT memory mempool\n");
> -----Original Message----- > From: Wang, Haiyue <haiyue.wang@intel.com> > Sent: Monday, September 6, 2021 7:15 PM > To: Elena Agostini <eagostini@nvidia.com>; Jerin Jacob > <jerinjacobk@gmail.com> > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob > <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen Hemminger > <stephen@networkplumber.org>; David Marchand > <david.marchand@redhat.com>; Andrew Rybchenko > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > techboard@dpdk.org > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > library > > > > -----Original Message----- > > From: Elena Agostini <eagostini@nvidia.com> > > Sent: Tuesday, September 7, 2021 00:11 > > To: Jerin Jacob <jerinjacobk@gmail.com> > > Cc: Wang, Haiyue <haiyue.wang@intel.com>; NBU-Contact-Thomas > Monjalon > > <thomas@monjalon.net>; Jerin Jacob <jerinj@marvell.com>; dpdk-dev > > <dev@dpdk.org>; Stephen Hemminger <stephen@networkplumber.org>; > David > > Marchand <david.marchand@redhat.com>; Andrew Rybchenko > > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh > > <ferruh.yigit@intel.com>; techboard@dpdk.org > > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > > library > > > > > > > > > > > > > > > > I'd like to introduce (with a dedicated option) the memory API in > > > > testpmd to provide an example of how to TX/RX packets using device > > > memory. > > > > > > Not sure without embedding sideband communication mechanism how > it > > > can notify to GPU and back to CPU. If you could share the example > > > API sequence that helps to us understand the level of coupling with > testpmd. > > > > > > > There is no need of communication mechanism here. 
> > Assuming there is no workload to process network packets (to not
> > complicate things), the steps are:
> > 1) Create a DPDK mempool with device external memory using the hcdev
> > (or gpudev) library
> > 2) Use that mempool to tx/rx/fwd packets
> >
> > As an example, you can look at my l2fwd-nv application here:
> > https://github.com/NVIDIA/l2fwd-nv
>
> To enhance the 'rte_extmem_register' / 'rte_pktmbuf_pool_create_extbuf'?

The purpose of these two functions is different.
Here DPDK allows the user to use any kind of memory to rx/tx packets.
It's not about allocating memory.

Maybe I'm missing the point here: what's the main objection to having a GPU
library?

> if (l2fwd_mem_type == MEM_HOST_PINNED) {
> 	ext_mem.buf_ptr = rte_malloc("extmem", ext_mem.buf_len, 0);
> 	CUDA_CHECK(cudaHostRegister(ext_mem.buf_ptr, ext_mem.buf_len,
> 				    cudaHostRegisterMapped));
> 	void *pDevice;
> 	CUDA_CHECK(cudaHostGetDevicePointer(&pDevice, ext_mem.buf_ptr, 0));
> 	if (pDevice != ext_mem.buf_ptr)
> 		rte_exit(EXIT_FAILURE, "GPU pointer does not match CPU pointer\n");
> } else {
> 	ext_mem.buf_iova = RTE_BAD_IOVA;
> 	CUDA_CHECK(cudaMalloc(&ext_mem.buf_ptr, ext_mem.buf_len));
> 	if (ext_mem.buf_ptr == NULL)
> 		rte_exit(EXIT_FAILURE, "Could not allocate GPU memory\n");
>
> 	unsigned int flag = 1;
> 	CUresult status = cuPointerSetAttribute(&flag,
> 			CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
> 			(CUdeviceptr)ext_mem.buf_ptr);
> 	if (CUDA_SUCCESS != status) {
> 		rte_exit(EXIT_FAILURE, "Could not set SYNC MEMOP attribute for GPU memory at %llx\n",
> 			 (CUdeviceptr)ext_mem.buf_ptr);
> 	}
> 	ret = rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len,
> 				  NULL, ext_mem.buf_iova, GPU_PAGE_SIZE);
> 	if (ret)
> 		rte_exit(EXIT_FAILURE, "Could not register GPU memory\n");
> }
> ret = rte_dev_dma_map(rte_eth_devices[l2fwd_port_id].device,
> 		      ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len);
> if (ret)
> 	rte_exit(EXIT_FAILURE, "Could not DMA map EXT memory\n");
> mpool_payload =
rte_pktmbuf_pool_create_extbuf("payload_mpool", > l2fwd_nb_mbufs, > 0, 0, ext_mem.elt_size, > rte_socket_id(), > &ext_mem, 1); > if (mpool_payload == NULL) > rte_exit(EXIT_FAILURE, "Could not create EXT memory > mempool\n"); > >
> -----Original Message----- > From: Elena Agostini <eagostini@nvidia.com> > Sent: Tuesday, September 7, 2021 01:23 > To: Wang, Haiyue <haiyue.wang@intel.com>; Jerin Jacob <jerinjacobk@gmail.com> > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob <jerinj@marvell.com>; dpdk-dev > <dev@dpdk.org>; Stephen Hemminger <stephen@networkplumber.org>; David Marchand > <david.marchand@redhat.com>; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; techboard@dpdk.org > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library > > > > > -----Original Message----- > > From: Wang, Haiyue <haiyue.wang@intel.com> > > Sent: Monday, September 6, 2021 7:15 PM > > To: Elena Agostini <eagostini@nvidia.com>; Jerin Jacob > > <jerinjacobk@gmail.com> > > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob > > <jerinj@marvell.com>; dpdk-dev <dev@dpdk.org>; Stephen Hemminger > > <stephen@networkplumber.org>; David Marchand > > <david.marchand@redhat.com>; Andrew Rybchenko > > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; > > techboard@dpdk.org > > Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > > library > > > > > > > -----Original Message----- > > > From: Elena Agostini <eagostini@nvidia.com> > > > Sent: Tuesday, September 7, 2021 00:11 > > > To: Jerin Jacob <jerinjacobk@gmail.com> > > > Cc: Wang, Haiyue <haiyue.wang@intel.com>; NBU-Contact-Thomas > > Monjalon > > > <thomas@monjalon.net>; Jerin Jacob <jerinj@marvell.com>; dpdk-dev > > > <dev@dpdk.org>; Stephen Hemminger <stephen@networkplumber.org>; > > David > > > Marchand <david.marchand@redhat.com>; Andrew Rybchenko > > > <andrew.rybchenko@oktetlabs.ru>; Honnappa Nagarahalli > > > <honnappa.nagarahalli@arm.com>; Yigit, Ferruh > > > <ferruh.yigit@intel.com>; techboard@dpdk.org > > > Subject: RE: 
[dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing > > > library > > > > > > > > > > > > > > > > > > > > > > > I'd like to introduce (with a dedicated option) the memory API in > > > > > testpmd to provide an example of how to TX/RX packets using device > > > > memory. > > > > > > > > Not sure without embedding sideband communication mechanism how > > it > > > > can notify to GPU and back to CPU. If you could share the example > > > > API sequence that helps to us understand the level of coupling with > > testpmd. > > > > > > > > > > There is no need of communication mechanism here. > > > Assuming there is not workload to process network packets (to not > > > complicate things), the steps are: > > > 1) Create a DPDK mempool with device external memory using the hcdev > > > (or gpudev) library > > > 2) Use that mempool to tx/rx/fwd packets > > > > > > As an example, you look at my l2fwd-nv application here: > > > https://github.com/NVIDIA/l2fwd-nv > > > > > > > To enhance the 'rte_extmem_register' / 'rte_pktmbuf_pool_create_extbuf' > > ? > > > > The purpose of these two functions is different. > Here DPDK allows the user to use any kind of memory to rx/tx packets. > It's not about allocating memory. > > Maybe I'm missing the point here: what's the main objection in having a GPU library? Exactly. ;-) Maybe a real device code is worth for people to get the whole picture. 
> > > if (l2fwd_mem_type == MEM_HOST_PINNED) { > > ext_mem.buf_ptr = rte_malloc("extmem", ext_mem.buf_len, 0); > > CUDA_CHECK(cudaHostRegister(ext_mem.buf_ptr, > > ext_mem.buf_len, cudaHostRegisterMapped)); > > void *pDevice; > > CUDA_CHECK(cudaHostGetDevicePointer(&pDevice, > > ext_mem.buf_ptr, 0)); > > if (pDevice != ext_mem.buf_ptr) > > rte_exit(EXIT_FAILURE, "GPU pointer does not match CPU > > pointer\n"); > > } else { > > ext_mem.buf_iova = RTE_BAD_IOVA; > > CUDA_CHECK(cudaMalloc(&ext_mem.buf_ptr, > > ext_mem.buf_len)); > > if (ext_mem.buf_ptr == NULL) > > rte_exit(EXIT_FAILURE, "Could not allocate GPU memory\n"); > > > > unsigned int flag = 1; > > CUresult status = cuPointerSetAttribute(&flag, > > CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, (CUdeviceptr)ext_mem.buf_ptr); > > if (CUDA_SUCCESS != status) { > > rte_exit(EXIT_FAILURE, "Could not set SYNC MEMOP attribute > > for GPU memory at %llx\n", (CUdeviceptr)ext_mem.buf_ptr); > > } > > ret = rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, > > NULL, ext_mem.buf_iova, GPU_PAGE_SIZE); > > if (ret) > > rte_exit(EXIT_FAILURE, "Could not register GPU memory\n"); > > } > > ret = rte_dev_dma_map(rte_eth_devices[l2fwd_port_id].device, > > ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len); > > if (ret) > > rte_exit(EXIT_FAILURE, "Could not DMA map EXT memory\n"); > > mpool_payload = rte_pktmbuf_pool_create_extbuf("payload_mpool", > > l2fwd_nb_mbufs, > > 0, 0, > ext_mem.elt_size, > > > rte_socket_id(), > > &ext_mem, 1); > > if (mpool_payload == NULL) > > rte_exit(EXIT_FAILURE, "Could not create EXT memory > > mempool\n"); > > > >