[v5,6/6] vhost: optimize memcpy routines when cc memcpy is used

Message ID 20240724075357.546248-7-mattias.ronnblom@ericsson.com (mailing list archive)
State New
Delegated to: Thomas Monjalon
Series Optionally have rte_memcpy delegate to compiler memcpy

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/github-robot: build fail github build: failed
ci/intel-Functional success Functional PASS
ci/iol-mellanox-Performance pending Performance Testing pending
ci/iol-marvell-Functional success Functional Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS

Commit Message

Mattias Rönnblom July 24, 2024, 7:53 a.m. UTC
In builds where use_cc_memcpy is set to true, the vhost user PMD
suffers a large performance drop on Intel P-cores for small packets,
at least when built with GCC and (to a much lesser extent) clang.

This patch addresses that issue by using a custom virtio
memcpy()-based packet copying routine.

Performance results from a Raptor Lake @ 3.2 GHz:

GCC 12.3.0
64-byte packets
Core  Mode              Mpps
E     RTE memcpy        9.5
E     cc memcpy         9.7
E     cc memcpy+pktcpy  9.0

P     RTE memcpy        16.4
P     cc memcpy         13.5
P     cc memcpy+pktcpy  16.2

GCC 12.3.0
1500-byte packets
Core  Mode              Mpps
P     RTE memcpy        5.8
P     cc memcpy         5.9
P     cc memcpy+pktcpy  5.9

clang 15.0.7
64-byte packets
Core  Mode              Mpps
P     RTE memcpy        13.3
P     cc memcpy         12.9
P     cc memcpy+pktcpy  13.9

"RTE memcpy" is use_cc_memcpy=false, "cc memcpy" is use_cc_memcpy=true
and "pktcpy" is when this patch is applied.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/vhost/virtio_net.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)
  

Comments

Morten Brørup July 29, 2024, 11 a.m. UTC | #1
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 24 July 2024 09.54

Which packet mix was used for your tests? Synthetic IMIX, or some live data?

> +/* The code generated by GCC (and to a lesser extent, clang) with just
> + * a straight memcpy() to copy packets is less than optimal on Intel
> + * P-cores, for small packets. Thus the need of this specialized
> + * memcpy() in builds where use_cc_memcpy is set to true.
> + */
> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
> +static __rte_always_inline void
> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> +{
> +	void *dst = __builtin_assume_aligned(in_dst, 16);
> +	const void *src = __builtin_assume_aligned(in_src, 16);
> +
> +	if (len <= 256) {
> +		size_t left;
> +
> +		for (left = len; left >= 32; left -= 32) {
> +			memcpy(dst, src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +			src = RTE_PTR_ADD(src, 32);
> +		}
> +
> +		memcpy(dst, src, left);
> +	} else

Although the packets within a burst often have similar size, I'm not sure you can rely on the dynamic branch predictor here.

Looking at the ethdev packet size counters at an ISP (at the core of their Layer 3 network), 71 % are 256 bytes or larger [1].

For static branch prediction, I would consider > 256 more likely and swap the two branches, i.e. compare (len > 256) instead of (len <= 256).

But again: I don't know how the dynamic branch predictor behaves here. Perhaps my suggested change makes no difference.
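
For illustration only, a minimal sketch of that reordering, assuming it
were applied to the pktcpy() above otherwise unchanged (the
pktcpy_swapped name is made up for this sketch):

#include <stddef.h>
#include <string.h>

#include <rte_common.h> /* __rte_always_inline, RTE_PTR_ADD */

static __rte_always_inline void
pktcpy_swapped(void *restrict in_dst, const void *restrict in_src, size_t len)
{
	void *dst = __builtin_assume_aligned(in_dst, 16);
	const void *src = __builtin_assume_aligned(in_src, 16);

	if (len > 256) {
		/* Large copies: a single plain memcpy() call. */
		memcpy(dst, src, len);
	} else {
		size_t left;

		/* Small copies: unrolled 32-byte chunks, then the tail. */
		for (left = len; left >= 32; left -= 32) {
			memcpy(dst, src, 32);
			dst = RTE_PTR_ADD(dst, 32);
			src = RTE_PTR_ADD(src, 32);
		}

		memcpy(dst, src, left);
	}
}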

> +		memcpy(dst, src, len);
> +}

With or without suggested change,
Acked-by: Morten Brørup <mb@smartsharesystems.com>


[1]: Details (incl. one VLAN tag)
tx_size_64_packets            1,1 %
tx_size_65_to_127_packets    25,7 %
tx_size_128_to_255_packets    2,6 %
tx_size_256_to_511_packets    1,4 %
tx_size_512_to_1023_packets   1,7 %
tx_size_1024_to_1522_packets 67,6 %
  
Mattias Rönnblom July 29, 2024, 7:27 p.m. UTC | #2
On 2024-07-29 13:00, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 24 July 2024 09.54
> 
> Which packet mix was used for your tests? Synthetic IMIX, or some live data?
> 

I used the same test as was being done when the performance regression 
was demonstrated (i.e., 2x testpmd with fixed packet size).

>> +/* The code generated by GCC (and to a lesser extent, clang) with just
>> + * a straight memcpy() to copy packets is less than optimal on Intel
>> + * P-cores, for small packets. Thus the need of this specialized
>> + * memcpy() in builds where use_cc_memcpy is set to true.
>> + */
>> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
>> +static __rte_always_inline void
>> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
>> +{
>> +	void *dst = __builtin_assume_aligned(in_dst, 16);
>> +	const void *src = __builtin_assume_aligned(in_src, 16);
>> +
>> +	if (len <= 256) {
>> +		size_t left;
>> +
>> +		for (left = len; left >= 32; left -= 32) {
>> +			memcpy(dst, src, 32);
>> +			dst = RTE_PTR_ADD(dst, 32);
>> +			src = RTE_PTR_ADD(src, 32);
>> +		}
>> +
>> +		memcpy(dst, src, left);
>> +	} else
> 
> Although the packets within a burst often have similar size, I'm not sure you can rely on the dynamic branch predictor here.
> 

I agree that the pktcpy() routine will likely often suffer a
size-related branch mispredict with a real packet size mix. A benchmark
with a real packet mix would be much better than the tests I've run.

This needs to be compared, of course, with the overhead imposed by 
conditionals included in other pktcpy() implementations.

> Looking at the ethdev packet size counters at an ISP (at the core of their Layer 3 network), 71 % are 256 bytes or larger [1].
> 
> For static branch prediction, I would consider > 256 more likely and swap the two branches, i.e. compare (len > 256) instead of (len <= 256).
> 

OK, I'll add likely() instead, to make it more explicit.
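
For concreteness, a sketch of what that could look like, using the
likely() macro from <rte_branch_prediction.h>. Which way the hint
should point is exactly the question discussed above, so the direction
chosen here is an assumption, not the final revision:

#include <stddef.h>
#include <string.h>

#include <rte_branch_prediction.h> /* likely() */
#include <rte_common.h>            /* __rte_always_inline, RTE_PTR_ADD */

static __rte_always_inline void
pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
{
	void *dst = __builtin_assume_aligned(in_dst, 16);
	const void *src = __builtin_assume_aligned(in_src, 16);

	/*
	 * Sketch: hint that the small-packet path is the common case.
	 * Flipping the hint (unlikely(len <= 256)) would instead favour
	 * the ISP packet mix cited earlier in the thread.
	 */
	if (likely(len <= 256)) {
		size_t left;

		for (left = len; left >= 32; left -= 32) {
			memcpy(dst, src, 32);
			dst = RTE_PTR_ADD(dst, 32);
			src = RTE_PTR_ADD(src, 32);
		}

		memcpy(dst, src, left);
	} else {
		memcpy(dst, src, len);
	}
}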

> But again: I don't know how the dynamic branch predictor behaves here. Perhaps my suggested change makes no difference.
> 

I think it will, but the effect will be tiny. From what I understand,
even when the branch predictor guesses correctly, there is still a
slight benefit if the branch is not taken.

>> +		memcpy(dst, src, len);
>> +}
> 
> With or without suggested change,
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> 
> [1]: Details (incl. one VLAN tag)
> tx_size_64_packets            1,1 %
> tx_size_65_to_127_packets    25,7 %
> tx_size_128_to_255_packets    2,6 %
> tx_size_256_to_511_packets    1,4 %
> tx_size_512_to_1023_packets   1,7 %
> tx_size_1024_to_1522_packets 67,6 %
>
  
Morten Brørup July 29, 2024, 7:56 p.m. UTC | #3
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 29 July 2024 21.27
> 
> On 2024-07-29 13:00, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Wednesday, 24 July 2024 09.54
> >
> > Which packet mix was used for your tests? Synthetic IMIX, or some live
> data?
> >
> 
> I used the same test as was being done when the performance regression
> was demonstrated (i.e., 2x testpmd with fixed packet size).
> 
> >> +/* The code generated by GCC (and to a lesser extent, clang) with
> just
> >> + * a straight memcpy() to copy packets is less than optimal on Intel
> >> + * P-cores, for small packets. Thus the need of this specialized
> >> + * memcpy() in builds where use_cc_memcpy is set to true.
> >> + */
> >> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
> >> +static __rte_always_inline void
> >> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t
> len)
> >> +{
> >> +	void *dst = __builtin_assume_aligned(in_dst, 16);
> >> +	const void *src = __builtin_assume_aligned(in_src, 16);
> >> +
> >> +	if (len <= 256) {
> >> +		size_t left;
> >> +
> >> +		for (left = len; left >= 32; left -= 32) {
> >> +			memcpy(dst, src, 32);
> >> +			dst = RTE_PTR_ADD(dst, 32);
> >> +			src = RTE_PTR_ADD(src, 32);
> >> +		}
> >> +
> >> +		memcpy(dst, src, left);
> >> +	} else
> >
> > Although the packets within a burst often have similar size, I'm not
> sure you can rely on the dynamic branch predictor here.
> >
> 
> I agree that the pktcpy() routine will likely often suffer a
> size-related branch mispredict with real packet size mix. A benchmark
> with a real packet mix would be much better than the tests I've run.
> 
> This needs to be compared, of course, with the overhead imposed by
> conditionals included in other pktcpy() implementations.

If testing with fixed packet size, only one of the branches will be taken - always!
And thus the branch predictor will always predict it correctly - in the test.

So, if this code performs better than simple memcpy(), I can conclude that you are testing with packet size <= 256.

> 
> > Looking at the ethdev packet size counters at an ISP (at the core of
> their Layer 3 network), 71 % are 256 bytes or larger [1].
> >
> > For static branch prediction, I would consider > 256 more likely and
> swap the two branches, i.e. compare (len > 256) instead of (len <= 256).
> >
> 
> OK, I'll add likely() instead, to make it more explicit.
> 
> > But again: I don't know how the dynamic branch predictor behaves here.
> Perhaps my suggested change makes no difference.
> >
> 
> I think it will, but the effect will be tiny. From what I understand,
> even when the branch predictor guesses correctly, there is still a
> slight benefit if the branch is not taken.
> 
> >> +		memcpy(dst, src, len);
> >> +}
> >
> > With or without suggested change,
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >
> >
> > [1]: Details (incl. one VLAN tag)
> > tx_size_64_packets            1,1 %
> > tx_size_65_to_127_packets    25,7 %
> > tx_size_128_to_255_packets    2,6 %
> > tx_size_256_to_511_packets    1,4 %
> > tx_size_512_to_1023_packets   1,7 %
> > tx_size_1024_to_1522_packets 67,6 %
> >
  

Patch

diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 370402d849..63571587a8 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -231,6 +231,39 @@  vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
 	return nr_copies;
 }
 
+/* The code generated by GCC (and to a lesser extent, clang) with just
+ * a straight memcpy() to copy packets is less than optimal on Intel
+ * P-cores, for small packets. Thus the need of this specialized
+ * memcpy() in builds where use_cc_memcpy is set to true.
+ */
+#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
+static __rte_always_inline void
+pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
+{
+	void *dst = __builtin_assume_aligned(in_dst, 16);
+	const void *src = __builtin_assume_aligned(in_src, 16);
+
+	if (len <= 256) {
+		size_t left;
+
+		for (left = len; left >= 32; left -= 32) {
+			memcpy(dst, src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+			src = RTE_PTR_ADD(src, 32);
+		}
+
+		memcpy(dst, src, left);
+	} else
+		memcpy(dst, src, len);
+}
+#else
+static __rte_always_inline void
+pktcpy(void *dst, const void *src, size_t len)
+{
+	rte_memcpy(dst, src, len);
+}
+#endif
+
 static inline void
 do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	__rte_shared_locks_required(&vq->iotlb_lock)
@@ -240,7 +273,7 @@  do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	int i;
 
 	for (i = 0; i < count; i++) {
-		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
+		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
 		vhost_log_cache_write_iova(dev, vq, elem[i].log_addr,
 					   elem[i].len);
 		PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
@@ -257,7 +290,7 @@  do_data_copy_dequeue(struct vhost_virtqueue *vq)
 	int i;
 
 	for (i = 0; i < count; i++)
-		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
+		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
 
 	vq->batch_copy_nb_elems = 0;
 }