[v4,1/4] doc: add generic atomic deprecation section
Add a guide and examples for deprecating the generic rte_atomic_xx APIs
in favor of C11 atomic built-ins.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
doc/guides/prog_guide/writing_efficient_code.rst | 139 ++++++++++++++++++++++-
1 file changed, 138 insertions(+), 1 deletion(-)
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
Locks and Atomic Operations
---------------------------
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.
@@ -176,6 +182,137 @@ It can often be replaced by other solutions like per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+Replacing the rte_atomic operations with __atomic built-ins may therefore
+improve performance on aarch64 machines
+(`more details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_).
+
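+For example, a simple statistics increment that needs no ordering could be
+migrated as shown below. This is a minimal sketch; the counter and the
+function are hypothetical, not taken from the DPDK sources:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    static uint32_t nb_pkts; /* hypothetical packet counter */
+
+    static inline void
+    count_packet(void)
+    {
+        /* The generic API call rte_atomic32_add() expands to the
+         * __sync_fetch_and_add() built-in, i.e. a full barrier.
+         * The __atomic built-in makes the ordering explicit, and
+         * RELAXED requests atomicity only.
+         */
+        __atomic_fetch_add(&nb_pkts, 1, __ATOMIC_RELAXED);
+    }
+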
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone; the ordering of the memory operations
+does not matter. An example is the packet statistics in the
+`vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+The application just updates the number of transmitted packets; no subsequent
+logic depends on these counters, so the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1,
+                    __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret,
+                    __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow memory operations to be reordered in one direction while
+requiring them to be ordered in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move into the
+critical section, but the memory operations inside the critical section cannot
+move above the lock. In this case, the full memory barrier in the CAS
+operation can be replaced with ACQUIRE. On the other hand, the memory
+operations after the `unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move into the critical section, but the
+memory operations inside the critical section cannot move below the unlock. So
+the full barrier in the STORE operation can be replaced with RELEASE.
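+
+Put together, a spinlock along these lines could look as shown below. This is
+a simplified sketch of the pattern rather than the actual DPDK implementation;
+``spinlock_t`` is a hypothetical one-word lock type:
+
+.. code-block:: c
+
+    typedef struct {
+        int locked; /* non-zero while the lock is held */
+    } spinlock_t;
+
+    static inline void
+    spinlock_lock(spinlock_t *sl)
+    {
+        int exp = 0;
+
+        /* ACQUIRE on success: memory operations in the critical
+         * section cannot move above the lock acquisition.
+         */
+        while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+            /* Wait with relaxed loads while the lock is taken. */
+            while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
+                ;
+            exp = 0; /* a failed CAS overwrites the expected value */
+        }
+    }
+
+    static inline void
+    spinlock_unlock(spinlock_t *sl)
+    {
+        /* RELEASE: memory operations in the critical section cannot
+         * move below the lock release.
+         */
+        __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
+    }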
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload, i.e. the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to the guard
+variable is observable only after the store to the payload is observable.
+Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
+            ...
+            int32_t *ret_val)
+    {
+        ...
+        ...
+
+        /* Insert new entry if there is room in the primary
+         * bucket.
+         */
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+            /* Check if slot is available */
+            if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
+                prim_bkt->sig_current[i] = sig;
+                /* Store to signature and key should not
+                 * leak after the store to key_idx. i.e.
+                 * key_idx is the guard variable for signature
+                 * and key.
+                 */
+                __atomic_store_n(&prim_bkt->key_idx[i],
+                        new_idx,
+                        __ATOMIC_RELEASE);
+                break;
+            }
+        }
+
+        ...
+    }
+
+Correspondingly, on the reader side, the guard variable should be read with
+ACQUIRE memory order, while the payload can be read with RELAXED memory order.
+This ensures that, if the store to the guard variable is observable, the store
+to the payload is also observable.
+Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    search_one_bucket_lf(const struct rte_hash *h, const void *key,
+            uint16_t sig, void **data, const struct rte_hash_bucket *bkt)
+    {
+        ...
+
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+            ...
+            if (bkt->sig_current[i] == sig) {
+                key_idx = __atomic_load_n(&bkt->key_idx[i],
+                        __ATOMIC_ACQUIRE);
+                if (key_idx != EMPTY_SLOT) {
+                    k = (struct rte_hash_key *) ((char *)keys +
+                            key_idx * h->key_entry_size);
+
+                    if (rte_hash_cmp_eq(key, k->key, h) == 0) {
+                        if (data != NULL) {
+                            *data = __atomic_load_n(&k->pdata,
+                                    __ATOMIC_ACQUIRE);
+                        }
+                        /*
+                         * Return index where key is stored,
+                         * subtracting the first dummy index
+                         */
+                        return key_idx - 1;
+                    }
+                }
+            }
+        }
+        ...
+    }
+
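+The same pattern in a minimal, self-contained form is sketched below; the
+``payload`` and ``guard`` variables are hypothetical, not taken from the DPDK
+sources:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    static uint32_t payload; /* data communicated to the reader */
+    static uint32_t guard;   /* guard variable protecting the payload */
+
+    static void
+    writer(uint32_t value)
+    {
+        __atomic_store_n(&payload, value, __ATOMIC_RELAXED);
+        /* RELEASE: the store to guard is observable only after the
+         * store to payload is observable.
+         */
+        __atomic_store_n(&guard, 1, __ATOMIC_RELEASE);
+    }
+
+    static uint32_t
+    reader(void)
+    {
+        /* ACQUIRE: once guard == 1 is observed, the store to payload
+         * is guaranteed to be observable as well.
+         */
+        while (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) == 0)
+            ;
+        return __atomic_load_n(&payload, __ATOMIC_RELAXED);
+    }
+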
Coding Considerations
---------------------