app/testpmd: fix txonly mode timestamp intitialization

Message ID 1595863595-11344-1-git-send-email-viacheslavo@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series app/testpmd: fix txonly mode timestamp intitialization |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/travis-robot success Travis build: passed
ci/Intel-compilation success Compilation OK
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Slava Ovsiienko July 27, 2020, 3:26 p.m. UTC
  The testpmd application forwards data in multiple threads.
In the txonly mode the Tx timestamps must be initialized
on per thread basis to provide phase shift for the packet
burst being sent. This per thread initialization was performed
on zero value of the variable in thread local storage and
happened only once after testpmd forwarding start. Executing
"start" and "stop" commands did not cause thread local variables
zeroing and wrong timestamp values were used.

Fixes: 4940344dab1d ("app/testpmd: add Tx scheduling command")

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 app/test-pmd/txonly.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
  

Comments

Phil Yang July 28, 2020, 4:23 p.m. UTC | #1
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Viacheslav Ovsiienko
> Sent: Monday, July 27, 2020 11:27 PM
> To: dev@dpdk.org
> Cc: matan@mellanox.com; rasland@mellanox.com; thomas@monjalon.net;
> ferruh.yigit@intel.com
> Subject: [dpdk-dev] [PATCH] app/testpmd: fix txonly mode timestamp
> intitialization
> 
> The testpmd application forwards data in multiple threads.
> In the txonly mode the Tx timestamps must be initialized
> on per thread basis to provide phase shift for the packet
> burst being sent. This per thread initialization was performed
> on zero value of the variable in thread local storage and
> happened only once after testpmd forwarding start. Executing
> "start" and "stop" commands did not cause thread local variables
> zeroing and wrong timestamp values were used.

I think it is too heavy to use rte_wmb() to guarantee the visibility of 'timestamp_init_req' updating for subsequent read operations.
We can use C11 atomics with explicit memory ordering instead of rte_wmb() to achieve the same goal.

> 
> Fixes: 4940344dab1d ("app/testpmd: add Tx scheduling command")
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  app/test-pmd/txonly.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
> index 97f4a45..415431d 100644
> --- a/app/test-pmd/txonly.c
> +++ b/app/test-pmd/txonly.c
> @@ -55,9 +55,13 @@
>  static struct rte_udp_hdr pkt_udp_hdr; /**< UDP header of tx packets. */
>  RTE_DEFINE_PER_LCORE(uint64_t, timestamp_qskew);
>  					/**< Timestamp offset per queue */
> +RTE_DEFINE_PER_LCORE(uint32_t, timestamp_idone); /**< Timestamp init
> done. */
> +
>  static uint64_t timestamp_mask; /**< Timestamp dynamic flag mask */
>  static int32_t timestamp_off; /**< Timestamp dynamic field offset */
>  static bool timestamp_enable; /**< Timestamp enable */
> +static volatile uint32_t timestamp_init_req;

If we use C11 atomic builtins for 'timestamp_init_req' accessing, the volatile key word becomes unnecessary.
Because they will generate same instructions.

> +				 /**< Timestamp initialization request. */
>  static uint64_t timestamp_initial[RTE_MAX_ETHPORTS];
> 
>  static void
> @@ -229,7 +233,8 @@
>  			rte_be64_t ts;
>  		} timestamp_mark;
> 
> -		if (unlikely(!skew)) {
> +		if (unlikely(timestamp_init_req !=

if (unlikely(__atomic_load_n(&timestamp_init_req, __ATOMIC_RELAXED) !=

> +			RTE_PER_LCORE(timestamp_idone))) {
>  			struct rte_eth_dev *dev = &rte_eth_devices[fs-
> >tx_port];
>  			unsigned int txqs_n = dev->data->nb_tx_queues;
>  			uint64_t phase = tx_pkt_times_inter * fs->tx_queue
> /
> @@ -241,6 +246,7 @@
>  			skew = timestamp_initial[fs->tx_port] +
>  			       tx_pkt_times_inter + phase;
>  			RTE_PER_LCORE(timestamp_qskew) = skew;
> +			RTE_PER_LCORE(timestamp_idone) =
> timestamp_init_req;


RTE_PER_LCORE(timestamp_idone) = __atomic_load_n(&timestamp_init_req, __ATOMIC_RELAXED);


>  		}
>  		timestamp_mark.pkt_idx = rte_cpu_to_be_16(idx);
>  		timestamp_mark.queue_idx = rte_cpu_to_be_16(fs-
> >tx_queue);
> @@ -426,6 +432,9 @@
>  			   timestamp_mask &&
>  			   timestamp_off >= 0 &&
>  			   !rte_eth_read_clock(pi, &timestamp_initial[pi]);
> +	if (timestamp_enable)
> +		timestamp_init_req++;


__atomic_add_fetch(&timestamp_init_req, 1, __ATOMIC_ACQ_REL);


> +	rte_wmb();

We can remove it now.


Thanks,
Phil
  
Slava Ovsiienko July 29, 2020, 8:08 a.m. UTC | #2
Hi, Phil

Very nice comment, I found it is useful, I'm on moving mlx5 PMD to use C11 atomic primitives, thank you.
But, don't you think it is an overkill for testpmd case and would make code wordy and less readable ?
How does testpmd start the forwarding? Let's have a glance at launch_packet_forwarding():

- initialize forwarding with port_fwd_begin() - It is tx_only_begin for txonly mode, master core context
- launch the forwarding cores rte_eal_remote_launch(), the forwarding will happen in forward core context

Let's reconsider:

 - tx_only_begin() is called once in master core context, forwarding is not launched yet, there is NO any 
 concurrent access to variables, we can set ones in any way, NO atomic, volatile, etc. is needed at all.
 We should do setup and commit all our memory writes - rte_wmb() seems to be the very native way
to do it once for all preceding writes. Performance can be neglected here - it is not a datapath.

- pkt_burst_prepare() is being executed in the forwarding core context, and perform only READ access
 to the global variables, those are stable while forwarding lasts, and can be considered as constants.
 No atomic, barriers, etc. are needed either. I agree - the volatile is redundant ("over-reassurance"), let's
 remove it. But atomic access - not needed, would make code complicated.

What do you think? Do you insist on atomic-barriered access implementing?

With best regards, 
Slava

> -----Original Message-----
> From: Phil Yang <Phil.Yang@arm.com>
> Sent: Tuesday, July 28, 2020 19:24
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Matan Azrad <matan@mellanox.com>; Raslan Darawsheh
> <rasland@mellanox.com>; Thomas Monjalon <thomas@monjalon.net>;
> ferruh.yigit@intel.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; dev@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH] app/testpmd: fix txonly mode timestamp
> intitialization
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Viacheslav Ovsiienko
> > Sent: Monday, July 27, 2020 11:27 PM
> > To: dev@dpdk.org
> > Cc: matan@mellanox.com; rasland@mellanox.com;
> thomas@monjalon.net;
> > ferruh.yigit@intel.com
> > Subject: [dpdk-dev] [PATCH] app/testpmd: fix txonly mode timestamp
> > intitialization
> >
> > The testpmd application forwards data in multiple threads.
> > In the txonly mode the Tx timestamps must be initialized on per thread
> > basis to provide phase shift for the packet burst being sent. This per
> > thread initialization was performed on zero value of the variable in
> > thread local storage and happened only once after testpmd forwarding
> > start. Executing "start" and "stop" commands did not cause thread
> > local variables zeroing and wrong timestamp values were used.
> 
> I think it is too heavy to use rte_wmb() to guarantee the visibility of
> 'timestamp_init_req' updating for subsequent read operations.
> We can use C11 atomics with explicit memory ordering instead of rte_wmb()
> to achieve the same goal.
> 
> >
> > Fixes: 4940344dab1d ("app/testpmd: add Tx scheduling command")
> >
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  app/test-pmd/txonly.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c index
> > 97f4a45..415431d 100644
> > --- a/app/test-pmd/txonly.c
> > +++ b/app/test-pmd/txonly.c
> > @@ -55,9 +55,13 @@
> >  static struct rte_udp_hdr pkt_udp_hdr; /**< UDP header of tx packets.
> > */  RTE_DEFINE_PER_LCORE(uint64_t, timestamp_qskew);
> >  					/**< Timestamp offset per queue */
> > +RTE_DEFINE_PER_LCORE(uint32_t, timestamp_idone); /**< Timestamp
> init
> > done. */
> > +
> >  static uint64_t timestamp_mask; /**< Timestamp dynamic flag mask */
> > static int32_t timestamp_off; /**< Timestamp dynamic field offset */
> > static bool timestamp_enable; /**< Timestamp enable */
> > +static volatile uint32_t timestamp_init_req;
> 
> If we use C11 atomic builtins for 'timestamp_init_req' accessing, the volatile
> key word becomes unnecessary.
> Because they will generate same instructions.
> 
> > +				 /**< Timestamp initialization request. */
> >  static uint64_t timestamp_initial[RTE_MAX_ETHPORTS];
> >
> >  static void
> > @@ -229,7 +233,8 @@
> >  			rte_be64_t ts;
> >  		} timestamp_mark;
> >
> > -		if (unlikely(!skew)) {
> > +		if (unlikely(timestamp_init_req !=
> 
> if (unlikely(__atomic_load_n(&timestamp_init_req, __ATOMIC_RELAXED) !=
> 
> > +			RTE_PER_LCORE(timestamp_idone))) {
> >  			struct rte_eth_dev *dev = &rte_eth_devices[fs-
> > >tx_port];
> >  			unsigned int txqs_n = dev->data->nb_tx_queues;
> >  			uint64_t phase = tx_pkt_times_inter * fs->tx_queue /
> @@ -241,6
> > +246,7 @@
> >  			skew = timestamp_initial[fs->tx_port] +
> >  			       tx_pkt_times_inter + phase;
> >  			RTE_PER_LCORE(timestamp_qskew) = skew;
> > +			RTE_PER_LCORE(timestamp_idone) =
> > timestamp_init_req;
> 
> 
> RTE_PER_LCORE(timestamp_idone) =
> __atomic_load_n(&timestamp_init_req, __ATOMIC_RELAXED);
> 
> 
> >  		}
> >  		timestamp_mark.pkt_idx = rte_cpu_to_be_16(idx);
> >  		timestamp_mark.queue_idx = rte_cpu_to_be_16(fs-
> > >tx_queue);
> > @@ -426,6 +432,9 @@
> >  			   timestamp_mask &&
> >  			   timestamp_off >= 0 &&
> >  			   !rte_eth_read_clock(pi, &timestamp_initial[pi]);
> > +	if (timestamp_enable)
> > +		timestamp_init_req++;
> 
> 
> __atomic_add_fetch(&timestamp_init_req, 1, __ATOMIC_ACQ_REL);
> 
> 
> > +	rte_wmb();
> 
> We can remove it now.
> 
> 
> Thanks,
> Phil
  
Phil Yang July 29, 2020, 9:37 a.m. UTC | #3
Hi Slava,

Thanks for your analysis. Some comments inline.

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@mellanox.com>
> Sent: Wednesday, July 29, 2020 4:08 PM
> To: Phil Yang <Phil.Yang@arm.com>
> Cc: Matan Azrad <matan@mellanox.com>; Raslan Darawsheh
> <rasland@mellanox.com>; thomas@monjalon.net; ferruh.yigit@intel.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>; dev@dpdk.org; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH] app/testpmd: fix txonly mode timestamp
> intitialization
> 
> Hi, Phil
> 
> Very nice comment, I found it is useful, I'm on moving mlx5 PMD to use C11
> atomic primitives, thank you.
> But, don't you think it is an overkill for testpmd case and would make code
> wordy and less readable ?
> How does testpmd start the forwarding? Let's have a glance at
> launch_packet_forwarding():
> 
> - initialize forwarding with port_fwd_begin() - It is tx_only_begin for txonly
> mode, master core context
> - launch the forwarding cores rte_eal_remote_launch(), the forwarding will
> happen in forward core context
> 
> Let's reconsider:
> 
>  - tx_only_begin() is called once in master core context, forwarding is not
> launched yet, there is NO any
>  concurrent access to variables, we can set ones in any way, NO atomic,
> volatile, etc. is needed at all.
>  We should do setup and commit all our memory writes - rte_wmb() seems

I thought this barrier in this patch was only for synchronizing the 'timestamp_init_req' between master core and forwarding cores.
But in fact, it is more than that.  I am clear now.
I think it might be helpful to add some comments to this barrier. 


> to be the very native way
> to do it once for all preceding writes. Performance can be neglected here - it
> is not a datapath.

Agreed. It is not in the datapath and it won't hurt performance.

> 
> - pkt_burst_prepare() is being executed in the forwarding core context, and
> perform only READ access
>  to the global variables, those are stable while forwarding lasts, and can be
> considered as constants.
>  No atomic, barriers, etc. are needed either. I agree - the volatile is
> redundant ("over-reassurance"), let's
>  remove it. But atomic access - not needed, would make code complicated.

If we remove the volatile qualifier, then these atomic builtins becomes unnecessary.
Because all these 32-bits READs are atomic.

Thanks,
Phil

> 
> What do you think? Do you insist on atomic-barriered access implementing?
> 
> With best regards,
> Slava
> 
> > -----Original Message-----
> > From: Phil Yang <Phil.Yang@arm.com>
> > Sent: Tuesday, July 28, 2020 19:24
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Matan Azrad <matan@mellanox.com>; Raslan Darawsheh
> > <rasland@mellanox.com>; Thomas Monjalon <thomas@monjalon.net>;
> > ferruh.yigit@intel.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; dev@dpdk.org;
> nd
> > <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH] app/testpmd: fix txonly mode timestamp
> > intitialization
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Viacheslav Ovsiienko
> > > Sent: Monday, July 27, 2020 11:27 PM
> > > To: dev@dpdk.org
> > > Cc: matan@mellanox.com; rasland@mellanox.com;
> > thomas@monjalon.net;
> > > ferruh.yigit@intel.com
> > > Subject: [dpdk-dev] [PATCH] app/testpmd: fix txonly mode timestamp
> > > intitialization
> > >
> > > The testpmd application forwards data in multiple threads.
> > > In the txonly mode the Tx timestamps must be initialized on per thread
> > > basis to provide phase shift for the packet burst being sent. This per
> > > thread initialization was performed on zero value of the variable in
> > > thread local storage and happened only once after testpmd forwarding
> > > start. Executing "start" and "stop" commands did not cause thread
> > > local variables zeroing and wrong timestamp values were used.
> >
> > I think it is too heavy to use rte_wmb() to guarantee the visibility of
> > 'timestamp_init_req' updating for subsequent read operations.
> > We can use C11 atomics with explicit memory ordering instead of
> rte_wmb()
> > to achieve the same goal.
> >
> > >
> > > Fixes: 4940344dab1d ("app/testpmd: add Tx scheduling command")
> > >
> > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > ---
> > >  app/test-pmd/txonly.c | 11 ++++++++++-
> > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c index
> > > 97f4a45..415431d 100644
> > > --- a/app/test-pmd/txonly.c
> > > +++ b/app/test-pmd/txonly.c
> > > @@ -55,9 +55,13 @@
> > >  static struct rte_udp_hdr pkt_udp_hdr; /**< UDP header of tx packets.
> > > */  RTE_DEFINE_PER_LCORE(uint64_t, timestamp_qskew);
> > >  					/**< Timestamp offset per queue */
> > > +RTE_DEFINE_PER_LCORE(uint32_t, timestamp_idone); /**< Timestamp
> > init
> > > done. */
> > > +
> > >  static uint64_t timestamp_mask; /**< Timestamp dynamic flag mask */
> > > static int32_t timestamp_off; /**< Timestamp dynamic field offset */
> > > static bool timestamp_enable; /**< Timestamp enable */
> > > +static volatile uint32_t timestamp_init_req;
> >
> > If we use C11 atomic builtins for 'timestamp_init_req' accessing, the volatile
> > key word becomes unnecessary.
> > Because they will generate same instructions.
> >
> > > +				 /**< Timestamp initialization request. */
> > >  static uint64_t timestamp_initial[RTE_MAX_ETHPORTS];
> > >
> > >  static void
> > > @@ -229,7 +233,8 @@
> > >  			rte_be64_t ts;
> > >  		} timestamp_mark;
> > >
> > > -		if (unlikely(!skew)) {
> > > +		if (unlikely(timestamp_init_req !=
> >
> > if (unlikely(__atomic_load_n(&timestamp_init_req,
> __ATOMIC_RELAXED) !=
> >
> > > +			RTE_PER_LCORE(timestamp_idone))) {
> > >  			struct rte_eth_dev *dev = &rte_eth_devices[fs-
> > > >tx_port];
> > >  			unsigned int txqs_n = dev->data->nb_tx_queues;
> > >  			uint64_t phase = tx_pkt_times_inter * fs->tx_queue
> /
> > @@ -241,6
> > > +246,7 @@
> > >  			skew = timestamp_initial[fs->tx_port] +
> > >  			       tx_pkt_times_inter + phase;
> > >  			RTE_PER_LCORE(timestamp_qskew) = skew;
> > > +			RTE_PER_LCORE(timestamp_idone) =
> > > timestamp_init_req;
> >
> >
> > RTE_PER_LCORE(timestamp_idone) =
> > __atomic_load_n(&timestamp_init_req, __ATOMIC_RELAXED);
> >
> >
> > >  		}
> > >  		timestamp_mark.pkt_idx = rte_cpu_to_be_16(idx);
> > >  		timestamp_mark.queue_idx = rte_cpu_to_be_16(fs-
> > > >tx_queue);
> > > @@ -426,6 +432,9 @@
> > >  			   timestamp_mask &&
> > >  			   timestamp_off >= 0 &&
> > >  			   !rte_eth_read_clock(pi, &timestamp_initial[pi]);
> > > +	if (timestamp_enable)
> > > +		timestamp_init_req++;
> >
> >
> > __atomic_add_fetch(&timestamp_init_req, 1, __ATOMIC_ACQ_REL);
> >
> >
> > > +	rte_wmb();
> >
> > We can remove it now.
> >
> >
> > Thanks,
> > Phil
  

Patch

diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 97f4a45..415431d 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -55,9 +55,13 @@ 
 static struct rte_udp_hdr pkt_udp_hdr; /**< UDP header of tx packets. */
 RTE_DEFINE_PER_LCORE(uint64_t, timestamp_qskew);
 					/**< Timestamp offset per queue */
+RTE_DEFINE_PER_LCORE(uint32_t, timestamp_idone); /**< Timestamp init done. */
+
 static uint64_t timestamp_mask; /**< Timestamp dynamic flag mask */
 static int32_t timestamp_off; /**< Timestamp dynamic field offset */
 static bool timestamp_enable; /**< Timestamp enable */
+static volatile uint32_t timestamp_init_req;
+				 /**< Timestamp initialization request. */
 static uint64_t timestamp_initial[RTE_MAX_ETHPORTS];
 
 static void
@@ -229,7 +233,8 @@ 
 			rte_be64_t ts;
 		} timestamp_mark;
 
-		if (unlikely(!skew)) {
+		if (unlikely(timestamp_init_req !=
+			RTE_PER_LCORE(timestamp_idone))) {
 			struct rte_eth_dev *dev = &rte_eth_devices[fs->tx_port];
 			unsigned int txqs_n = dev->data->nb_tx_queues;
 			uint64_t phase = tx_pkt_times_inter * fs->tx_queue /
@@ -241,6 +246,7 @@ 
 			skew = timestamp_initial[fs->tx_port] +
 			       tx_pkt_times_inter + phase;
 			RTE_PER_LCORE(timestamp_qskew) = skew;
+			RTE_PER_LCORE(timestamp_idone) = timestamp_init_req;
 		}
 		timestamp_mark.pkt_idx = rte_cpu_to_be_16(idx);
 		timestamp_mark.queue_idx = rte_cpu_to_be_16(fs->tx_queue);
@@ -426,6 +432,9 @@ 
 			   timestamp_mask &&
 			   timestamp_off >= 0 &&
 			   !rte_eth_read_clock(pi, &timestamp_initial[pi]);
+	if (timestamp_enable)
+		timestamp_init_req++;
+	rte_wmb();
 }
 
 struct fwd_engine tx_only_engine = {