rte_ring: fix racy dequeue/enqueue in ppc64
diff mbox series

Message ID 20180712024414.4756-1-t.yoshimura8869@gmail.com
State New
Delegated to: Thomas Monjalon
Headers show
Series
  • rte_ring: fix racy dequeue/enqueue in ppc64
Related show

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Takeshi Yoshimura July 12, 2018, 2:44 a.m. UTC
SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
It uses a single consumer and multiple producers for a rte_ring.
The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().

The reordered loads happened on r->prod.tail in
__rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
dependency, but the code does not satisfy it. Note that they are
not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
cmpset invokes a read barrier.

The paired stores on these loads are in ENQUEUE_PTRS() and
update_tail(). Simplified code around the reorder is the following.

Consumer             Producer
load idx[ring]
                     store idx[ring]
                     store r->prod.tail
load r->prod.tail

In this case, the consumer loads old idx[ring] and confirms the load
is valid with the new r->prod.tail.

I added a read barrier in the case where __IS_SC is passed to
__rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
to avoid similar problems with a single producer.

Cc: stable@dpdk.org

Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
---
 lib/librte_ring/rte_ring_generic.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Jerin Jacob July 12, 2018, 5:08 p.m. UTC | #1
-----Original Message-----
> Date: Thu, 12 Jul 2018 11:44:14 +0900
> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> To: dev@dpdk.org
> Cc: Takeshi Yoshimura <t.yoshimura8869@gmail.com>, stable@dpdk.org, Takeshi
>  Yoshimura <tyos@jp.ibm.com>
> Subject: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
> X-Mailer: git-send-email 2.15.1
> 
> External Email
> 
> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> It uses a single consumer and multiple producers for a rte_ring.
> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().

Adding rte_smp_rmb() cause performance regression on non x86 platforms.
Having said that, load-load barrier can be expressed very  well with C11 memory
model. I guess ppc64 supports C11 memory model. If so,
Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
original issue?

> 
> The reordered loads happened on r->prod.tail in
> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> dependency, but the code does not satisfy it. Note that they are
> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
> cmpset invokes a read barrier.
> 
> The paired stores on these loads are in ENQUEUE_PTRS() and
> update_tail(). Simplified code around the reorder is the following.
> 
> Consumer             Producer
> load idx[ring]
>                      store idx[ring]
>                      store r->prod.tail
> load r->prod.tail
> 
> In this case, the consumer loads old idx[ring] and confirms the load
> is valid with the new r->prod.tail.
> 
> I added a read barrier in the case where __IS_SC is passed to
> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
> to avoid similar problems with a single producer.
> 
> Cc: stable@dpdk.org
> 
> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> ---
>  lib/librte_ring/rte_ring_generic.h | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> index ea7dbe5b9..477326180 100644
> --- a/lib/librte_ring/rte_ring_generic.h
> +++ b/lib/librte_ring/rte_ring_generic.h
> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>                         return 0;
> 
>                 *new_head = *old_head + n;
> -               if (is_sp)
> +               if (is_sp) {
> +                       rte_smp_rmb();
>                         r->prod.head = *new_head, success = 1;
> -               else
> +               } else
>                         success = rte_atomic32_cmpset(&r->prod.head,
>                                         *old_head, *new_head);
>         } while (unlikely(success == 0));
> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
>                         return 0;
> 
>                 *new_head = *old_head + n;
> -               if (is_sc)
> +               if (is_sc) {
> +                       rte_smp_rmb();
>                         r->cons.head = *new_head, success = 1;
> -               else
> +               } else
>                         success = rte_atomic32_cmpset(&r->cons.head, *old_head,
>                                         *new_head);
>         } while (unlikely(success == 0));
> --
> 2.17.1
>
Takeshi Yoshimura July 17, 2018, 2:54 a.m. UTC | #2
> Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> Having said that, load-load barrier can be expressed very  well with C11 memory
> model. I guess ppc64 supports C11 memory model. If so,
> Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> original issue?

Yes, the performance regression happens on non-x86 with single
producer/consumer.
The average latency of an enqueue was increased from 21 nsec to 24 nsec in my
simple experiment. But, I think it is worth it.


I also tested C11 rte_ring, however, it caused the same race condition in ppc64.
I tried to fix the C11 problem as well, but I also found the C11
rte_ring had other potential
incorrect choices of memory orders, which caused another race
condition in ppc64.

For example,
__ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but
I am not sure why the load-acquire is used for the compare exchange.
Also in update_tail, the pause can be called before the data copy because
of ht->tail load without atomic_load_n.

The memory order is simply difficult, so it might take a bit longer
time to check
if the code is correct. I think I can fix the C11 rte_ring as another patch.

2018-07-13 2:08 GMT+09:00 Jerin Jacob <jerin.jacob@caviumnetworks.com>:
> -----Original Message-----
>> Date: Thu, 12 Jul 2018 11:44:14 +0900
>> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
>> To: dev@dpdk.org
>> Cc: Takeshi Yoshimura <t.yoshimura8869@gmail.com>, stable@dpdk.org, Takeshi
>>  Yoshimura <tyos@jp.ibm.com>
>> Subject: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
>> X-Mailer: git-send-email 2.15.1
>>
>> External Email
>>
>> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
>> It uses a single consumer and multiple producers for a rte_ring.
>> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
>
> Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> Having said that, load-load barrier can be expressed very  well with C11 memory
> model. I guess ppc64 supports C11 memory model. If so,
> Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> original issue?
>
>>
>> The reordered loads happened on r->prod.tail in
>> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
>> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
>> dependency, but the code does not satisfy it. Note that they are
>> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
>> cmpset invokes a read barrier.
>>
>> The paired stores on these loads are in ENQUEUE_PTRS() and
>> update_tail(). Simplified code around the reorder is the following.
>>
>> Consumer             Producer
>> load idx[ring]
>>                      store idx[ring]
>>                      store r->prod.tail
>> load r->prod.tail
>>
>> In this case, the consumer loads old idx[ring] and confirms the load
>> is valid with the new r->prod.tail.
>>
>> I added a read barrier in the case where __IS_SC is passed to
>> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
>> to avoid similar problems with a single producer.
>>
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
>> ---
>>  lib/librte_ring/rte_ring_generic.h | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
>> index ea7dbe5b9..477326180 100644
>> --- a/lib/librte_ring/rte_ring_generic.h
>> +++ b/lib/librte_ring/rte_ring_generic.h
>> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>>                         return 0;
>>
>>                 *new_head = *old_head + n;
>> -               if (is_sp)
>> +               if (is_sp) {
>> +                       rte_smp_rmb();
>>                         r->prod.head = *new_head, success = 1;
>> -               else
>> +               } else
>>                         success = rte_atomic32_cmpset(&r->prod.head,
>>                                         *old_head, *new_head);
>>         } while (unlikely(success == 0));
>> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
>>                         return 0;
>>
>>                 *new_head = *old_head + n;
>> -               if (is_sc)
>> +               if (is_sc) {
>> +                       rte_smp_rmb();
>>                         r->cons.head = *new_head, success = 1;
>> -               else
>> +               } else
>>                         success = rte_atomic32_cmpset(&r->cons.head, *old_head,
>>                                         *new_head);
>>         } while (unlikely(success == 0));
>> --
>> 2.17.1
>>
Jerin Jacob July 17, 2018, 3:34 a.m. UTC | #3
-----Original Message-----
> Date: Tue, 17 Jul 2018 11:54:18 +0900
> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org, stable@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
> 

Cc: olivier.matz@6wind.com
Cc: chaozhu@linux.vnet.ibm.com
Cc: konstantin.ananyev@intel.com

> 
> > Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> > Having said that, load-load barrier can be expressed very  well with C11 memory
> > model. I guess ppc64 supports C11 memory model. If so,
> > Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > original issue?
> 
> Yes, the performance regression happens on non-x86 with single
> producer/consumer.
> The average latency of an enqueue was increased from 21 nsec to 24 nsec in my
> simple experiment. But, I think it is worth it.

That varies to machine to machine. What is the burst size etc.

> 
> 
> I also tested C11 rte_ring, however, it caused the same race condition in ppc64.
> I tried to fix the C11 problem as well, but I also found the C11
> rte_ring had other potential
> incorrect choices of memory orders, which caused another race
> condition in ppc64.

Does it happens on all ppc64 machines? Or on a specific machine?
Is following tests are passing on your system without the patch?
test/test/test_ring_perf.c
test/test/test_ring.c

> 
> For example,
> __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but
> I am not sure why the load-acquire is used for the compare exchange.

It correct as per C11 acquire and release semantics.

> Also in update_tail, the pause can be called before the data copy because
> of ht->tail load without atomic_load_n.
> 
> The memory order is simply difficult, so it might take a bit longer
> time to check
> if the code is correct. I think I can fix the C11 rte_ring as another patch.
> 
> >>
> >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> >> It uses a single consumer and multiple producers for a rte_ring.
> >> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
> >
> > Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> > Having said that, load-load barrier can be expressed very  well with C11 memory
> > model. I guess ppc64 supports C11 memory model. If so,
> > Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > original issue?
> >
> >>
> >> The reordered loads happened on r->prod.tail in

There is rte_smp_rmb() just before reading r->prod.tail in 
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_rte_ring_move_cons_head(). Would that not suffice the requirement?

Can you check adding compiler barrier and see is compiler is reordering
the stuff?

DPDK's ring implementation is based freebsd's ring implementation, I
don't see need for such barrier

https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h

If it is something specific to ppc64 or a specific ppc64 machine, we 
could add a compile option as it is arch specific to avoid performance
impact on other architectures.

> >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> >> dependency, but the code does not satisfy it. Note that they are
> >> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
> >> cmpset invokes a read barrier.
> >>
> >> The paired stores on these loads are in ENQUEUE_PTRS() and
> >> update_tail(). Simplified code around the reorder is the following.
> >>
> >> Consumer             Producer
> >> load idx[ring]
> >>                      store idx[ring]
> >>                      store r->prod.tail
> >> load r->prod.tail
> >>
> >> In this case, the consumer loads old idx[ring] and confirms the load
> >> is valid with the new r->prod.tail.
> >>
> >> I added a read barrier in the case where __IS_SC is passed to
> >> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
> >> to avoid similar problems with a single producer.
> >>
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> >> ---
> >>  lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> >>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> >> index ea7dbe5b9..477326180 100644
> >> --- a/lib/librte_ring/rte_ring_generic.h
> >> +++ b/lib/librte_ring/rte_ring_generic.h
> >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >>                         return 0;
> >>
> >>                 *new_head = *old_head + n;
> >> -               if (is_sp)
> >> +               if (is_sp) {
> >> +                       rte_smp_rmb();
> >>                         r->prod.head = *new_head, success = 1;
> >> -               else
> >> +               } else
> >>                         success = rte_atomic32_cmpset(&r->prod.head,
> >>                                         *old_head, *new_head);
> >>         } while (unlikely(success == 0));
> >> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> >>                         return 0;
> >>
> >>                 *new_head = *old_head + n;
> >> -               if (is_sc)
> >> +               if (is_sc) {
> >> +                       rte_smp_rmb();
> >>                         r->cons.head = *new_head, success = 1;
> >> -               else
> >> +               } else
> >>                         success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> >>                                         *new_head);
> >>         } while (unlikely(success == 0));
> >> --
> >> 2.17.1
> >>

Patch
diff mbox series

diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..477326180 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -90,9 +90,10 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 			return 0;
 
 		*new_head = *old_head + n;
-		if (is_sp)
+		if (is_sp) {
+			rte_smp_rmb();
 			r->prod.head = *new_head, success = 1;
-		else
+		} else
 			success = rte_atomic32_cmpset(&r->prod.head,
 					*old_head, *new_head);
 	} while (unlikely(success == 0));
@@ -158,9 +159,10 @@  __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 			return 0;
 
 		*new_head = *old_head + n;
-		if (is_sc)
+		if (is_sc) {
+			rte_smp_rmb();
 			r->cons.head = *new_head, success = 1;
-		else
+		} else
 			success = rte_atomic32_cmpset(&r->cons.head, *old_head,
 					*new_head);
 	} while (unlikely(success == 0));