[v4,2/3] test/distributor: replace sync builtins with atomic builtins

Message ID 1554692551-28275-3-git-send-email-phil.yang@arm.com (mailing list archive)
State Accepted, archived
Delegated to: Thomas Monjalon
Headers
Series [v4,1/3] packet_ordering: add statistics for each worker thread |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Phil Yang April 8, 2019, 3:02 a.m. UTC
  '__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.

Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest

*** distributor_perf_autotest without this patch ***
  

Comments

Hunt, David April 10, 2019, 2:05 p.m. UTC | #1
Hi Phil,

On 8/4/2019 4:02 AM, Phil Yang wrote:
> '__sync' built-in functions are deprecated, should use the '__atomic'
> built-in instead. the sync built-in functions are full barriers, while
> atomic built-in functions offer less restrictive one-way barriers,
> which help performance.
>
> Here is the example test result on TX2:
> sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
> -n 4 --socket-mem=1024,1024 -- -i
> RTE>>distributor_perf_autotest
>
> *** distributor_perf_autotest without this patch ***
> ==== Cache line switch test ===
> Time for 33554432 iterations = 1519202730 ticks
> Ticks per iteration = 45
>
> *** distributor_perf_autotest with this patch ***
> ==== Cache line switch test ===
> Time for 33554432 iterations = 1251715496 ticks
> Ticks per iteration = 37
>
> Less ticks needed for the cache line switch test. It got 17% of
> performance improvement.


I'm seeing about an 8% performance degradation on my platform for the 
cache line switch test with the patch, however the single mode and burst 
mode tests area showing no difference, which are the more important 
tests. What kind of differences are you seeing in the single/burst mode 
tests?

Rgds,
Dave.


---snip---
  
Phil Yang April 11, 2019, 11:31 a.m. UTC | #2
> -----Original Message-----
> From: Hunt, David <david.hunt@intel.com>
> Sent: Wednesday, April 10, 2019 10:06 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: reshma.pattan@intel.com; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v4 2/3] test/distributor: replace sync builtins with atomic
> builtins
> 
> Hi Phil,
> 
> On 8/4/2019 4:02 AM, Phil Yang wrote:
> > '__sync' built-in functions are deprecated, should use the '__atomic'
> > built-in instead. the sync built-in functions are full barriers, while
> > atomic built-in functions offer less restrictive one-way barriers,
> > which help performance.
> >
> > Here is the example test result on TX2:
> > sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \ -n 4
> > --socket-mem=1024,1024 -- -i
> > RTE>>distributor_perf_autotest
> >
> > *** distributor_perf_autotest without this patch *** ==== Cache line
> > switch test === Time for 33554432 iterations = 1519202730 ticks Ticks
> > per iteration = 45
> >
> > *** distributor_perf_autotest with this patch *** ==== Cache line
> > switch test === Time for 33554432 iterations = 1251715496 ticks Ticks
> > per iteration = 37
> >
> > Less ticks needed for the cache line switch test. It got 17% of
> > performance improvement.
> 
> 
Hi, Dave

Thanks for your input.

> I'm seeing about an 8% performance degradation on my platform for the

I'd tested this patch on our x86 server (E5-2640 v3 @ 2.60GHz) several rounds. However, I didn't found performance degradation. Please check the test result below.
$ sudo  ./x86_64-native-linuxapp-gcc/app/test -l 8-15 -n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest

####  without this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12379399910 ticks
Ticks per iteration = 368

=== Performance test of distributor (single mode) ===
Time per burst:  5815
Time per packet: 90

=== Performance test of distributor (burst mode) ===
Time per burst:  3487
Time per packet: 54

####  with this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12388791845 ticks
Ticks per iteration = 369

=== Performance test of distributor (single mode) ===
Time per burst:  5796
Time per packet: 90

=== Performance test of distributor (burst mode) ===
Time per burst:  3477
Time per packet: 54

From my test, there was a little bit of performance improvement (You can also think of it as a measurement bias) on x86. 

> cache line switch test with the patch, however the single mode and burst
> mode tests area showing no difference, which are the more important tests.
> What kind of differences are you seeing in the single/burst mode tests?

Actually, I found no difference in the single mode and burst mode on aarch64 neither. I think it means this test case is not the hotspot for those two mode's performance. 

Just like the __sync_xxx builtins, the __atomic_xxx builtins are atomic operations, which elide the memory barrier. So I think it should benefit all platform.

Thanks,
Phil
> 
> Rgds,
> Dave.
> 
> 
> ---snip---
> 
>
  

Patch

==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45

*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37

Less ticks needed for the cache line switch test. It got 17% of
performance improvement.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 app/test/test_distributor.c      | 7 ++++---
 app/test/test_distributor_perf.c | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..0364637 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,7 +62,7 @@  handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	int i;
 
 	for (i = 0; i < 8; i++)
@@ -270,7 +270,7 @@  handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +343,8 @@  handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
-	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..89b28f0 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,7 +111,7 @@  handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
 	for (i = 0; i < 8; i++)