[v2,06/15] app/testpmd: add common fwd wrapper function

Message ID 20210811140418.393264-6-xuemingl@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series: [v2,01/15] ethdev: introduce shared Rx queue

Checks

ci/checkpatch: success (coding style OK)

Commit Message

Xueming Li Aug. 11, 2021, 2:04 p.m. UTC
  From: Xiaoyu Min <jackmin@nvidia.com>

Added an inline common wrapper function for all fwd engines,
which performs the following common steps:

1. get_start_cycles
2. rte_eth_rx_burst(..., nb_pkt_per_burst)
3. if rxq_share is enabled, call forward_shared_rxq(); otherwise forward directly
4. get_end_cycles

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
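
For illustration, a forwarding engine built on top of this wrapper keeps only its per-burst processing in a callback and delegates the Rx call, burst statistics and cycle accounting to do_burst_fwd(). A minimal sketch loosely modeled on the rxonly engine (the callback signature is inferred from the wrapper body; the reduced callback and its name are illustrative, not the exact code from the rest of this series):

	/* Per-burst work only; Rx, stats and cycle accounting live in do_burst_fwd(). */
	static void
	rx_only_process(struct fwd_stream *fs, uint16_t nb_rx,
			struct rte_mbuf **pkts_burst)
	{
		fs->rx_packets += nb_rx;
		rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
	}

	static void
	pkt_burst_receive(struct fwd_stream *fs)
	{
		do_burst_fwd(fs, rx_only_process);
	}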
  

Comments

Jerin Jacob Aug. 17, 2021, 9:37 a.m. UTC | #1
On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> From: Xiaoyu Min <jackmin@nvidia.com>
>
> Added an inline common wrapper function for all fwd engines
> which do the following in common:
>
> 1. get_start_cycles
> 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly
> 4. get_end_cycle
>
> Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> ---
>  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> index 13141dfed9..b685ac48d6 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
>  void remove_tx_dynf_callback(portid_t portid);
>  int update_jumbo_frame_offload(portid_t portid);
>
> +static inline void
> +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd)
> +{
> +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +       uint16_t nb_rx;
> +       uint64_t start_tsc = 0;
> +
> +       get_start_cycles(&start_tsc);
> +
> +       /*
> +        * Receive a burst of packets and forward them.
> +        */
> +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> +                       pkts_burst, nb_pkt_per_burst);
> +       inc_rx_burst_stats(fs, nb_rx);
> +       if (unlikely(nb_rx == 0))
> +               return;
> +       if (unlikely(rxq_share > 0))

See below. It reads global memory.

> +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> +       else
> +               (*fwd)(fs, nb_rx, pkts_burst);

A new function pointer in the fast path.

IMO, we should not create a performance regression for the existing
forward engines.
Can we have a new forward engine just for shared memory testing?

> +       get_end_cycles(fs, start_tsc);
> +}
> +
>  /*
>   * Work-around of a compilation error with ICC on invocations of the
>   * rte_be_to_cpu_16() function.
> --
> 2.25.1
>
  
Xueming Li Aug. 18, 2021, 11:27 a.m. UTC | #2
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 5:37 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > From: Xiaoyu Min <jackmin@nvidia.com>
> >
> > Added an inline common wrapper function for all fwd engines which do
> > the following in common:
> >
> > 1. get_start_cycles
> > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > get_end_cycle
> >
> > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > ---
> >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > 13141dfed9..b685ac48d6 100644
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > void remove_tx_dynf_callback(portid_t portid);  int
> > update_jumbo_frame_offload(portid_t portid);
> >
> > +static inline void
> > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > +       uint16_t nb_rx;
> > +       uint64_t start_tsc = 0;
> > +
> > +       get_start_cycles(&start_tsc);
> > +
> > +       /*
> > +        * Receive a burst of packets and forward them.
> > +        */
> > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > +                       pkts_burst, nb_pkt_per_burst);
> > +       inc_rx_burst_stats(fs, nb_rx);
> > +       if (unlikely(nb_rx == 0))
> > +               return;
> > +       if (unlikely(rxq_share > 0))
> 
> See below. It reads a global memory.
> 
> > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > +       else
> > +               (*fwd)(fs, nb_rx, pkts_burst);
> 
> New function pointer in fastpath.
> 
> IMO, We should not create performance regression for the existing forward engine.
> Can we have a new forward engine just for shared memory testing?

Yes, I'm fully aware of the performance concern; the global could be checked near record_core_cycles to minimize the impact.
Based on test data, the impact is almost invisible in legacy mode.

From a test perspective, it is better to have all forward engines able to verify shared rxq; the test team wants to run the
regression with minimal changes. I hope to have a solution that utilizes all forwarding engines seamlessly.

> 
> > +       get_end_cycles(fs, start_tsc); }
> > +
> >  /*
> >   * Work-around of a compilation error with ICC on invocations of the
> >   * rte_be_to_cpu_16() function.
> > --
> > 2.25.1
> >
  
Jerin Jacob Aug. 18, 2021, 11:47 a.m. UTC | #3
On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 5:37 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> >
> > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > From: Xiaoyu Min <jackmin@nvidia.com>
> > >
> > > Added an inline common wrapper function for all fwd engines which do
> > > the following in common:
> > >
> > > 1. get_start_cycles
> > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > get_end_cycle
> > >
> > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > ---
> > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > >  1 file changed, 24 insertions(+)
> > >
> > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > 13141dfed9..b685ac48d6 100644
> > > --- a/app/test-pmd/testpmd.h
> > > +++ b/app/test-pmd/testpmd.h
> > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > void remove_tx_dynf_callback(portid_t portid);  int
> > > update_jumbo_frame_offload(portid_t portid);
> > >
> > > +static inline void
> > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > +       uint16_t nb_rx;
> > > +       uint64_t start_tsc = 0;
> > > +
> > > +       get_start_cycles(&start_tsc);
> > > +
> > > +       /*
> > > +        * Receive a burst of packets and forward them.
> > > +        */
> > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > +                       pkts_burst, nb_pkt_per_burst);
> > > +       inc_rx_burst_stats(fs, nb_rx);
> > > +       if (unlikely(nb_rx == 0))
> > > +               return;
> > > +       if (unlikely(rxq_share > 0))
> >
> > See below. It reads a global memory.
> >
> > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > +       else
> > > +               (*fwd)(fs, nb_rx, pkts_burst);
> >
> > New function pointer in fastpath.
> >
> > IMO, We should not create performance regression for the existing forward engine.
> > Can we have a new forward engine just for shared memory testing?
>
> Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> Based on test data, the impacts almost invisible in legacy mode.

Are you saying there is zero % regression? If not, could you share the data?

>
> From test perspective, better to have all forward engine to verify shared rxq, test team want to run the
> regression with less impacts. Hope to have a solution to utilize all forwarding engines seamlessly.

Yes, it's a good goal. Everyone uses testpmd forwarding performance as a synthetic
benchmark.
I think we are aligned on not having any regression for the generic
forward engines.

>
> >
> > > +       get_end_cycles(fs, start_tsc); }
> > > +
> > >  /*
> > >   * Work-around of a compilation error with ICC on invocations of the
> > >   * rte_be_to_cpu_16() function.
> > > --
> > > 2.25.1
> > >
  
Xueming Li Aug. 18, 2021, 2:08 p.m. UTC | #4
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, August 18, 2021 7:48 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > Li <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > >
> > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > >
> > > > Added an inline common wrapper function for all fwd engines which
> > > > do the following in common:
> > > >
> > > > 1. get_start_cycles
> > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > get_end_cycle
> > > >
> > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > ---
> > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > 13141dfed9..b685ac48d6 100644
> > > > --- a/app/test-pmd/testpmd.h
> > > > +++ b/app/test-pmd/testpmd.h
> > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > > void remove_tx_dynf_callback(portid_t portid);  int
> > > > update_jumbo_frame_offload(portid_t portid);
> > > >
> > > > +static inline void
> > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > +       uint16_t nb_rx;
> > > > +       uint64_t start_tsc = 0;
> > > > +
> > > > +       get_start_cycles(&start_tsc);
> > > > +
> > > > +       /*
> > > > +        * Receive a burst of packets and forward them.
> > > > +        */
> > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > +       if (unlikely(nb_rx == 0))
> > > > +               return;
> > > > +       if (unlikely(rxq_share > 0))
> > >
> > > See below. It reads a global memory.
> > >
> > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > +       else
> > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > >
> > > New function pointer in fastpath.
> > >
> > > IMO, We should not create performance regression for the existing forward engine.
> > > Can we have a new forward engine just for shared memory testing?
> >
> > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > Based on test data, the impacts almost invisible in legacy mode.
> 
> Are you saying there is zero % regression? If not, could you share the data?

Almost zero. Here is a quick single-core result for rxonly:
	32.2 Mpps, 58.9 cycles/packet
After reverting the patch from rxonly.c:
	32.1 Mpps, 59.9 cycles/packet
That result didn't make sense, and I realized I had used batch mbuf free; applying it to the reverted version:
	32.2 Mpps, 58.9 cycles/packet
There were small digit jumps between testpmd restarts; I picked the best run.
The results are almost the same, so the per-packet cost of the wrapper seems small enough.
BTW, I'm testing with the default burst size and queue depth.
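As a rough cross-check of the two figures (assuming the core spends essentially all of its time inside the measured forwarding loop): 58.9 cycles/packet * 32.2 Mpps is about 1.9e9 cycles/s, i.e. one fully busy core at roughly 1.9 GHz, so the throughput and cycles/packet numbers are consistent with each other.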

> 
> >
> > From test perspective, better to have all forward engine to verify
> > shared rxq, test team want to run the regression with less impacts. Hope to have a solution to utilize all forwarding engines
> seamlessly.
> 
> Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> I think, we are aligned to not have any regression for the generic forward engine.
> 
> >
> > >
> > > > +       get_end_cycles(fs, start_tsc); }
> > > > +
> > > >  /*
> > > >   * Work-around of a compilation error with ICC on invocations of the
> > > >   * rte_be_to_cpu_16() function.
> > > > --
> > > > 2.25.1
> > > >
  
Jerin Jacob Aug. 26, 2021, 11:28 a.m. UTC | #5
On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Wednesday, August 18, 2021 7:48 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> >
> > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > > Li <xiaoyun.li@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > > wrapper function
> > > >
> > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > >
> > > > > Added an inline common wrapper function for all fwd engines which
> > > > > do the following in common:
> > > > >
> > > > > 1. get_start_cycles
> > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > get_end_cycle
> > > > >
> > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > ---
> > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > >  1 file changed, 24 insertions(+)
> > > > >
> > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > > 13141dfed9..b685ac48d6 100644
> > > > > --- a/app/test-pmd/testpmd.h
> > > > > +++ b/app/test-pmd/testpmd.h
> > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > > > void remove_tx_dynf_callback(portid_t portid);  int
> > > > > update_jumbo_frame_offload(portid_t portid);
> > > > >
> > > > > +static inline void
> > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > +       uint16_t nb_rx;
> > > > > +       uint64_t start_tsc = 0;
> > > > > +
> > > > > +       get_start_cycles(&start_tsc);
> > > > > +
> > > > > +       /*
> > > > > +        * Receive a burst of packets and forward them.
> > > > > +        */
> > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > +       if (unlikely(nb_rx == 0))
> > > > > +               return;
> > > > > +       if (unlikely(rxq_share > 0))
> > > >
> > > > See below. It reads a global memory.
> > > >
> > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > +       else
> > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > >
> > > > New function pointer in fastpath.
> > > >
> > > > IMO, We should not create performance regression for the existing forward engine.
> > > > Can we have a new forward engine just for shared memory testing?
> > >
> > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > Based on test data, the impacts almost invisible in legacy mode.
> >
> > Are you saying there is zero % regression? If not, could you share the data?
>
> Almost zero, here is a quick single core result of rxonly:
>         32.2Mpps, 58.9cycles/packet
> Revert the patch to rxonly.c:
>         32.1Mpps 59.9cycles/packet
> The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
>         32.2Mpps, 58.9cycles/packet
> There were small digit jumps between testpmd restart, I picked the best one.
> The result is almost same, seems the cost of each packet is small enough.
> BTW, I'm testing with default burst size and queue depth.

I tested this on octeontx2 with iofwd on a single core at 100Gbps:
Without this patch - 73.5 Mpps
With this patch - 72.8 Mpps
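(That is (73.5 - 72.8) / 73.5, i.e. roughly a 1% throughput drop from the wrapper for iofwd on this platform.)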

We are taking the shared queue as a runtime option without a separate fwd engine,
and to have zero performance impact and no compile-time flag,
I think the only way is a function template.
Here is an example change to outline the function template principle.

static inline void
__pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flags)
{

Introduce the new checks under
if (flags & SHARED_QUEUE)


}

Have two versions of io_fwd_engine.packet_fwd per engine.

- first version
static void
pkt_burst_io_forward(struct fwd_stream *fs)
{
        __pkt_burst_io_forward(fs, 0);
}

- second version
static void
pkt_burst_io_forward_shared_queue(struct fwd_stream *fs)
{
        __pkt_burst_io_forward(fs, SHARED_QUEUE);
}


Update io_fwd_engine.packet_fwd in the slow path to the respective version
based on the offload.

If the shared offload is not selected, pkt_burst_io_forward() will be selected, and
__pkt_burst_io_forward() will be a compile-time version of
!SHARED_QUEUE, i.e. the same as the existing code.
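
Fleshing the outline out a little, here is a minimal, self-contained sketch of how the template could look for the io engine; the SHARED_QUEUE flag value and the body of the shared-rxq branch are assumptions for illustration, and Tx retry handling is omitted. Because flags is a compile-time constant inside each thin wrapper, the compiler removes the shared-rxq branch entirely from the existing pkt_burst_io_forward() path:

	/* Hypothetical flag value; the real definition may differ. */
	#define SHARED_QUEUE 0x1

	static inline void
	__pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flags)
	{
		struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
		uint16_t nb_rx;
		uint16_t nb_tx;
		uint64_t start_tsc = 0;

		get_start_cycles(&start_tsc);
		nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
				pkts_burst, nb_pkt_per_burst);
		inc_rx_burst_stats(fs, nb_rx);
		if (unlikely(nb_rx == 0))
			return;

		if (flags & SHARED_QUEUE) {
			/*
			 * Shared-rxq handling goes here, e.g. regrouping the
			 * mbufs by their real source port before forwarding.
			 * This is dead code when flags == 0, so the legacy
			 * path pays nothing for it.
			 */
		}

		nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
				pkts_burst, nb_rx);
		inc_tx_burst_stats(fs, nb_tx);
		fs->rx_packets += nb_rx;
		fs->tx_packets += nb_tx;
		if (unlikely(nb_tx < nb_rx)) {
			fs->fwd_dropped += nb_rx - nb_tx;
			do {
				rte_pktmbuf_free(pkts_burst[nb_tx]);
			} while (++nb_tx < nb_rx);
		}
		get_end_cycles(fs, start_tsc);
	}

	/* Existing engine: flags == 0, shared-rxq branch compiled out. */
	static void
	pkt_burst_io_forward(struct fwd_stream *fs)
	{
		__pkt_burst_io_forward(fs, 0);
	}

	/* Shared-rxq variant, selected in the slow path when rxq_share is set. */
	static void
	pkt_burst_io_forward_shared_queue(struct fwd_stream *fs)
	{
		__pkt_burst_io_forward(fs, SHARED_QUEUE);
	}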

>
> >
> > >
> > > From test perspective, better to have all forward engine to verify
> > > shared rxq, test team want to run the regression with less impacts. Hope to have a solution to utilize all forwarding engines
> > seamlessly.
> >
> > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > I think, we are aligned to not have any regression for the generic forward engine.
> >
> > >
> > > >
> > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > +
> > > > >  /*
> > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > >   * rte_be_to_cpu_16() function.
> > > > > --
> > > > > 2.25.1
> > > > >
  
Xueming Li Aug. 29, 2021, 7:07 a.m. UTC | #6
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 26, 2021 7:28 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > Li <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > >
> > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > > fwd wrapper function
> > > > >
> > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > >
> > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > which do the following in common:
> > > > > >
> > > > > > 1. get_start_cycles
> > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > get_end_cycle
> > > > > >
> > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > ---
> > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > >  1 file changed, 24 insertions(+)
> > > > > >
> > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > index
> > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > >
> > > > > > +static inline void
> > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > +       uint16_t nb_rx;
> > > > > > +       uint64_t start_tsc = 0;
> > > > > > +
> > > > > > +       get_start_cycles(&start_tsc);
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Receive a burst of packets and forward them.
> > > > > > +        */
> > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > +               return;
> > > > > > +       if (unlikely(rxq_share > 0))
> > > > >
> > > > > See below. It reads a global memory.
> > > > >
> > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > +       else
> > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > >
> > > > > New function pointer in fastpath.
> > > > >
> > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > Can we have a new forward engine just for shared memory testing?
> > > >
> > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > Based on test data, the impacts almost invisible in legacy mode.
> > >
> > > Are you saying there is zero % regression? If not, could you share the data?
> >
> > Almost zero, here is a quick single core result of rxonly:
> >         32.2Mpps, 58.9cycles/packet
> > Revert the patch to rxonly.c:
> >         32.1Mpps 59.9cycles/packet
> > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> >         32.2Mpps, 58.9cycles/packet
> > There were small digit jumps between testpmd restart, I picked the best one.
> > The result is almost same, seems the cost of each packet is small enough.
> > BTW, I'm testing with default burst size and queue depth.
> 
> I tested this on octeontx2 with iofwd with single core with 100Gbps Without this patch - 73.5mpps With this patch - 72.8 mpps
> 
> We are taking the shared queue runtime option without a separate fwd engine.
> and to have zero performance impact and no compile time flag Then I think, only way to have a function template .
> Example change to outline function template principle.
> 
> static inline
> __pkt_burst_io_forward(struct fwd_stream *fs, const u64 flag) {
> 
> Introduce new checks under
> if (flags & SHARED_QUEUE)
> 
> 
> }
> 
> Have two versions of io_fwd_engine.packet_fwd per engine.
> 
> - first version
> static pkt_burst_io_forward(struct fwd_stream *fs) {
>         return __pkt_burst_io_forward(fs, 0); }
> 
> - Second version
> static pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
>         return __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> 
> 
> Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> 
> If shared offoad is not selected, pkt_burst_io_forward() will be selected and
> __pkt_burst_io_forward() will be a compile time version of !SHARED_QUEUE aka same as existing coe.

Thanks for testing and for the suggestion. So the only difference in the code above is that the access to rxq_share is changed
to a constant function parameter, right? Have you tested the performance of this approach? If not, I could verify it.

> 
> >
> > >
> > > >
> > > > From test perspective, better to have all forward engine to verify
> > > > shared rxq, test team want to run the regression with less
> > > > impacts. Hope to have a solution to utilize all forwarding engines
> > > seamlessly.
> > >
> > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > I think, we are aligned to not have any regression for the generic forward engine.
> > >
> > > >
> > > > >
> > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > +
> > > > > >  /*
> > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > >   * rte_be_to_cpu_16() function.
> > > > > > --
> > > > > > 2.25.1
> > > > > >
  
Xueming Li Sept. 1, 2021, 2:44 p.m. UTC | #7
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Sunday, August 29, 2021 3:08 PM
> To: Jerin Jacob <jerinjacobk@gmail.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 26, 2021 7:28 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li
> > <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > wrapper function
> >
> > On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > fwd wrapper function
> > > >
> > > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add
> > > > > > common fwd wrapper function
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > >
> > > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > > which do the following in common:
> > > > > > >
> > > > > > > 1. get_start_cycles
> > > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > > get_end_cycle
> > > > > > >
> > > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > ---
> > > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > > >  1 file changed, 24 insertions(+)
> > > > > > >
> > > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > > index
> > > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > > >
> > > > > > > +static inline void
> > > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > > +       uint16_t nb_rx;
> > > > > > > +       uint64_t start_tsc = 0;
> > > > > > > +
> > > > > > > +       get_start_cycles(&start_tsc);
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * Receive a burst of packets and forward them.
> > > > > > > +        */
> > > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > > +               return;
> > > > > > > +       if (unlikely(rxq_share > 0))
> > > > > >
> > > > > > See below. It reads a global memory.
> > > > > >
> > > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > > +       else
> > > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > > >
> > > > > > New function pointer in fastpath.
> > > > > >
> > > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > > Can we have a new forward engine just for shared memory testing?
> > > > >
> > > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > > Based on test data, the impacts almost invisible in legacy mode.
> > > >
> > > > Are you saying there is zero % regression? If not, could you share the data?
> > >
> > > Almost zero, here is a quick single core result of rxonly:
> > >         32.2Mpps, 58.9cycles/packet
> > > Revert the patch to rxonly.c:
> > >         32.1Mpps 59.9cycles/packet
> > > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> > >         32.2Mpps, 58.9cycles/packet
> > > There were small digit jumps between testpmd restart, I picked the best one.
> > > The result is almost same, seems the cost of each packet is small enough.
> > > BTW, I'm testing with default burst size and queue depth.
> >
> > I tested this on octeontx2 with iofwd with single core with 100Gbps
> > Without this patch - 73.5mpps With this patch - 72.8 mpps
> >
> > We are taking the shared queue runtime option without a separate fwd engine.
> > and to have zero performance impact and no compile time flag Then I think, only way to have a function template .
> > Example change to outline function template principle.
> >
> > static inline
> > __pkt_burst_io_forward(struct fwd_stream *fs, const u64 flag) {
> >
> > Introduce new checks under
> > if (flags & SHARED_QUEUE)
> >
> >
> > }
> >
> > Have two versions of io_fwd_engine.packet_fwd per engine.
> >
> > - first version
> > static pkt_burst_io_forward(struct fwd_stream *fs) {
> >         return __pkt_burst_io_forward(fs, 0); }
> >
> > - Second version
> > static pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
> >         return __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> >
> >
> > Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> >
> > If shared offoad is not selected, pkt_burst_io_forward() will be
> > selected and
> > __pkt_burst_io_forward() will be a compile time version of !SHARED_QUEUE aka same as existing coe.
> 
> Thanks for testing and suggestion. So the only difference here in above code is access to rxq_shared changed to function parameter,
> right? Have you tested this performance? If not, I could verify.

The performance result looks better after removing this wrapper and hiding the global variable access as you suggested, thanks!
I also tried adding an rxq_share bit field to struct fwd_stream; it gives the same result as the static function selection and requires fewer changes.
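
For reference, a minimal sketch of that alternative (where the bit field sits in struct fwd_stream and how it gets populated at stream configuration time are assumptions for illustration): the check then reads per-stream data the fast path already touches instead of a separate global:

	/* In struct fwd_stream (app/test-pmd/testpmd.h): */
	uint8_t rxq_share:1; /* cached copy of the global rxq_share, set when streams are configured */

	/* In the wrapper, the branch becomes a per-stream check: */
	if (unlikely(fs->rxq_share))
		forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
	else
		(*fwd)(fs, nb_rx, pkts_burst);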

> 
> >
> > >
> > > >
> > > > >
> > > > > From test perspective, better to have all forward engine to
> > > > > verify shared rxq, test team want to run the regression with
> > > > > less impacts. Hope to have a solution to utilize all forwarding
> > > > > engines
> > > > seamlessly.
> > > >
> > > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > > I think, we are aligned to not have any regression for the generic forward engine.
> > > >
> > > > >
> > > > > >
> > > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > > >   * rte_be_to_cpu_16() function.
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >
  
Xueming Li Sept. 28, 2021, 5:54 a.m. UTC | #8
On Wed, 2021-09-01 at 14:44 +0000, Xueming(Steven) Li wrote:
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> > Sent: Sunday, August 29, 2021 3:08 PM
> > To: Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:28 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li
> > > <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > > 
> > > On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > > fwd wrapper function
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add
> > > > > > > common fwd wrapper function
> > > > > > > 
> > > > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > > 
> > > > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > > > which do the following in common:
> > > > > > > > 
> > > > > > > > 1. get_start_cycles
> > > > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > > > get_end_cycle
> > > > > > > > 
> > > > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > > ---
> > > > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > > > >  1 file changed, 24 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > > > index
> > > > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > > > > 
> > > > > > > > +static inline void
> > > > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > > > +       uint16_t nb_rx;
> > > > > > > > +       uint64_t start_tsc = 0;
> > > > > > > > +
> > > > > > > > +       get_start_cycles(&start_tsc);
> > > > > > > > +
> > > > > > > > +       /*
> > > > > > > > +        * Receive a burst of packets and forward them.
> > > > > > > > +        */
> > > > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > > > +               return;
> > > > > > > > +       if (unlikely(rxq_share > 0))
> > > > > > > 
> > > > > > > See below. It reads a global memory.
> > > > > > > 
> > > > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > > > +       else
> > > > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > > > > 
> > > > > > > New function pointer in fastpath.
> > > > > > > 
> > > > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > > > Can we have a new forward engine just for shared memory testing?
> > > > > > 
> > > > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > > > Based on test data, the impacts almost invisible in legacy mode.
> > > > > 
> > > > > Are you saying there is zero % regression? If not, could you share the data?
> > > > 
> > > > Almost zero, here is a quick single core result of rxonly:
> > > >         32.2Mpps, 58.9cycles/packet
> > > > Revert the patch to rxonly.c:
> > > >         32.1Mpps 59.9cycles/packet
> > > > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> > > >         32.2Mpps, 58.9cycles/packet
> > > > There were small digit jumps between testpmd restart, I picked the best one.
> > > > The result is almost same, seems the cost of each packet is small enough.
> > > > BTW, I'm testing with default burst size and queue depth.
> > > 
> > > I tested this on octeontx2 with iofwd with single core with 100Gbps
> > > Without this patch - 73.5mpps With this patch - 72.8 mpps
> > > 
> > > We are taking the shared queue runtime option without a separate fwd engine.
> > > and to have zero performance impact and no compile time flag Then I think, only way to have a function template .
> > > Example change to outline function template principle.
> > > 
> > > static inline
> > > __pkt_burst_io_forward(struct fwd_stream *fs, const u64 flag) {
> > > 
> > > Introduce new checks under
> > > if (flags & SHARED_QUEUE)
> > > 
> > > 
> > > }
> > > 
> > > Have two versions of io_fwd_engine.packet_fwd per engine.
> > > 
> > > - first version
> > > static pkt_burst_io_forward(struct fwd_stream *fs) {
> > >         return __pkt_burst_io_forward(fs, 0); }
> > > 
> > > - Second version
> > > static pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
> > >         return __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> > > 
> > > 
> > > Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> > > 
> > > If shared offoad is not selected, pkt_burst_io_forward() will be
> > > selected and
> > > __pkt_burst_io_forward() will be a compile time version of !SHARED_QUEUE aka same as existing coe.
> > 
> > Thanks for testing and suggestion. So the only difference here in above code is access to rxq_shared changed to function parameter,
> > right? Have you tested this performance? If not, I could verify.
> 
> Performance result looks better by removing this wrapper and hide global variable access like you suggested, thanks!
> Tried to add rxq_share bit field  in struct fwd_stream, same result as the static function selection, looks less changes.

The changes are reflected in v3; I also consolidated the per-forwarding-engine
patches into one, please check.

> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > From test perspective, better to have all forward engine to
> > > > > > verify shared rxq, test team want to run the regression with
> > > > > > less impacts. Hope to have a solution to utilize all forwarding
> > > > > > engines
> > > > > seamlessly.
> > > > > 
> > > > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > > > I think, we are aligned to not have any regression for the generic forward engine.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > > > +
> > > > > > > >  /*
> > > > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > > > >   * rte_be_to_cpu_16() function.
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > >
  

Patch

diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 13141dfed9..b685ac48d6 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1022,6 +1022,30 @@  void add_tx_dynf_callback(portid_t portid);
 void remove_tx_dynf_callback(portid_t portid);
 int update_jumbo_frame_offload(portid_t portid);
 
+static inline void
+do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+
+	/*
+	 * Receive a burst of packets and forward them.
+	 */
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
+			pkts_burst, nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	if (unlikely(rxq_share > 0))
+		forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
+	else
+		(*fwd)(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
 /*
  * Work-around of a compilation error with ICC on invocations of the
  * rte_be_to_cpu_16() function.