[v4] net: fix checksum with unaligned buffer

Message ID 20220623123900.38283-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v4] net: fix checksum with unaligned buffer

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-abi-testing warning Testing issues

Commit Message

Morten Brørup June 23, 2022, 12:39 p.m. UTC
  With this patch, the checksum can be calculated on an unaligned buffer.
I.e. the buf parameter is no longer required to be 16-bit aligned.

The checksum is still calculated using a 16-bit aligned pointer, so the
compiler can auto-vectorize the function's inner loop.

When the buffer is unaligned, the first byte of the buffer is handled
separately. Furthermore, the calculated checksum of the buffer is byte
shifted before being added to the initial checksum, to compensate for the
checksum having been calculated on the buffer shifted by one byte.

v4:
* Add copyright notice.
* Include stdbool.h (Emil Berg).
* Use RTE_PTR_ADD (Emil Berg).
* Fix one more typo in commit message. Is 'unligned' even a word?
v3:
* Remove braces from single statement block.
* Fix typo in commit message.
v2:
* Do not assume that the buffer is part of an aligned packet buffer.

Bugzilla ID: 1035
Cc: stable@dpdk.org

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)
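
The byte-swap compensation can be sanity-checked with a small standalone
program (a sketch for illustration, not part of the patch; the helper names
are made up). It replicates the patch's unaligned path on a buffer offset by
one byte and compares the folded result against a plain aligned computation:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* fold a 32-bit partial sum into a 16-bit one's complement sum */
static uint16_t fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* reference: sum 16-bit words, with the usual odd-length handling */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t v;
		memcpy(&v, p + i, sizeof(v));
		sum += v;
	}
	if (len % 2) {
		uint16_t left = 0;
		*(unsigned char *)&left = p[len - 1];
		sum += left;
	}
	return sum;
}

int main(void)
{
	uint8_t storage[16] = {0};
	const uint8_t payload[] = {0x12, 0x34, 0x56, 0x78, 0x9a};
	uint8_t *buf = storage + 1;	/* model an unaligned buf (offset by one byte) */
	uint16_t first = 0;
	uint32_t bsum;

	memcpy(buf, payload, sizeof(payload));

	/* the patch's unaligned path: handle the first byte separately... */
	((unsigned char *)&first)[1] = buf[0];
	bsum = sum16(buf + 1, sizeof(payload) - 1, first);

	/* ...then byte-swap the sum to undo the one-byte shift */
	bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;

	assert(fold(bsum) == fold(sum16(payload, sizeof(payload), 0)));
	printf("0x%04x\n", fold(bsum));	/* 0xad02 on little-endian */
	return 0;
}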
  

Comments

Morten Brørup June 23, 2022, 12:51 p.m. UTC | #1
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 23 June 2022 14.39
> 
> With this patch, the checksum can be calculated on an unaligned buffer.
> I.e. the buf parameter is no longer required to be 16 bit aligned.
> 
> The checksum is still calculated using a 16 bit aligned pointer, so the
> compiler can auto-vectorize the function's inner loop.
> 
> When the buffer is unaligned, the first byte of the buffer is handled
> separately. Furthermore, the calculated checksum of the buffer is byte
> shifted before being added to the initial checksum, to compensate for
> the
> checksum having been calculated on the buffer shifted by one byte.
> 
> v4:
> * Add copyright notice.
> * Include stdbool.h (Emil Berg).
> * Use RTE_PTR_ADD (Emil Berg).
> * Fix one more typo in commit message. Is 'unligned' even a word?
> v3:
> * Remove braces from single statement block.
> * Fix typo in commit message.
> v2:
> * Do not assume that the buffer is part of an aligned packet buffer.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..738d643da0 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -3,6 +3,7 @@
>   *      The Regents of the University of California.
>   * Copyright(c) 2010-2014 Intel Corporation.
>   * Copyright(c) 2014 6WIND S.A.
> + * Copyright(c) 2022 SmartShare Systems.
>   * All rights reserved.
>   */
> 
> @@ -15,6 +16,7 @@
>   * IP-related defines
>   */
> 
> +#include <stdbool.h>
>  #include <stdint.h>
> 
>  #ifdef RTE_EXEC_ENV_WINDOWS
> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
>  	/* extend strict-aliasing rules */
>  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const u16_p *u16_buf;
> +	const u16_p *end;
> +	uint32_t bsum = 0;
> +	const bool unaligned = (uintptr_t)buf & 1;
> +
> +	/* if buffer is unaligned, keeping it byte order independent */
> +	if (unlikely(unaligned)) {
> +		uint16_t first = 0;
> +		if (unlikely(len == 0))
> +			return 0;
> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> +		bsum += first;
> +		buf = RTE_PTR_ADD(buf, 1);
> +		len--;
> +	}
> 
> +	/* aligned access for compiler auto-vectorization */
> +	u16_buf = (const u16_p *)buf;
> +	end = u16_buf + len / sizeof(*u16_buf);
>  	for (; u16_buf != end; ++u16_buf)
> -		sum += *u16_buf;
> +		bsum += *u16_buf;
> 
>  	/* if length is odd, keeping it byte order independent */
>  	if (unlikely(len % 2)) {
>  		uint16_t left = 0;
>  		*(unsigned char *)&left = *(const unsigned char *)end;
> -		sum += left;
> +		bsum += left;
>  	}
> 
> -	return sum;
> +	/* if buffer is unaligned, swap the checksum bytes */
> +	if (unlikely(unaligned))
> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> +
> +	return sum + bsum;
>  }
> 
>  /**
> --
> 2.17.1

@Emil, thank you for thoroughly reviewing the previous versions.

If your test succeeds and you are satisfied with the patch, remember to reply with a "Tested-by" tag for patchwork.
  
Emil Berg June 27, 2022, 7:56 a.m. UTC | #2
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 23 juni 2022 14:51
> To: Emil Berg <emil.berg@ericsson.com>; bruce.richardson@intel.com;
> dev@dpdk.org
> Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> hofors@lysator.liu.se; olivier.matz@6wind.com
> Subject: RE: [PATCH v4] net: fix checksum with unaligned buffer
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, 23 June 2022 14.39
> >
> > With this patch, the checksum can be calculated on an unaligned buffer.
> > I.e. the buf parameter is no longer required to be 16 bit aligned.
> >
> > The checksum is still calculated using a 16 bit aligned pointer, so
> > the compiler can auto-vectorize the function's inner loop.
> >
> > When the buffer is unaligned, the first byte of the buffer is handled
> > separately. Furthermore, the calculated checksum of the buffer is byte
> > shifted before being added to the initial checksum, to compensate for
> > the checksum having been calculated on the buffer shifted by one byte.
> >
> > v4:
> > * Add copyright notice.
> > * Include stdbool.h (Emil Berg).
> > * Use RTE_PTR_ADD (Emil Berg).
> > * Fix one more typo in commit message. Is 'unligned' even a word?
> > v3:
> > * Remove braces from single statement block.
> > * Fix typo in commit message.
> > v2:
> > * Do not assume that the buffer is part of an aligned packet buffer.
> >
> > Bugzilla ID: 1035
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Tested-by: Emil Berg <emil.berg@ericsson.com>
> > ---
> >  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> >  1 file changed, 27 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > b502481670..738d643da0 100644
> > --- a/lib/net/rte_ip.h
> > +++ b/lib/net/rte_ip.h
> > @@ -3,6 +3,7 @@
> >   *      The Regents of the University of California.
> >   * Copyright(c) 2010-2014 Intel Corporation.
> >   * Copyright(c) 2014 6WIND S.A.
> > + * Copyright(c) 2022 SmartShare Systems.
> >   * All rights reserved.
> >   */
> >
> > @@ -15,6 +16,7 @@
> >   * IP-related defines
> >   */
> >
> > +#include <stdbool.h>
> >  #include <stdint.h>
> >
> >  #ifdef RTE_EXEC_ENV_WINDOWS
> > @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >  {
> >  	/* extend strict-aliasing rules */
> >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > -	const u16_p *u16_buf = (const u16_p *)buf;
> > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > +	const u16_p *u16_buf;
> > +	const u16_p *end;
> > +	uint32_t bsum = 0;
> > +	const bool unaligned = (uintptr_t)buf & 1;
> > +
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely(unaligned)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > +		bsum += first;
> > +		buf = RTE_PTR_ADD(buf, 1);
> > +		len--;
> > +	}
> >
> > +	/* aligned access for compiler auto-vectorization */
> > +	u16_buf = (const u16_p *)buf;
> > +	end = u16_buf + len / sizeof(*u16_buf);
> >  	for (; u16_buf != end; ++u16_buf)
> > -		sum += *u16_buf;
> > +		bsum += *u16_buf;
> >
> >  	/* if length is odd, keeping it byte order independent */
> >  	if (unlikely(len % 2)) {
> >  		uint16_t left = 0;
> >  		*(unsigned char *)&left = *(const unsigned char *)end;
> > -		sum += left;
> > +		bsum += left;
> >  	}
> >
> > -	return sum;
> > +	/* if buffer is unaligned, swap the checksum bytes */
> > +	if (unlikely(unaligned))
> > +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > +
> > +	return sum + bsum;
> >  }
> >
> >  /**
> > --
> > 2.17.1
> 
> @Emil, thank you for thoroughly reviewing the previous versions.
> 
> If your test succeeds and you are satisfied with the patch, remember to reply
> with a "Tested-by" tag for patchwork.

The test succeeded and I'm satisfied with the patch. I added 'Tested-by: Emil Berg <emil.berg@ericsson.com>' to the patch above, hopefully as you intended.
  
Morten Brørup June 27, 2022, 10:54 a.m. UTC | #3
> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Monday, 27 June 2022 09.57
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 23 juni 2022 14:51
> >
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Thursday, 23 June 2022 14.39
> > >
> > > With this patch, the checksum can be calculated on an unaligned
> buffer.
> > > I.e. the buf parameter is no longer required to be 16 bit aligned.
> > >
> > > The checksum is still calculated using a 16 bit aligned pointer, so
> > > the compiler can auto-vectorize the function's inner loop.
> > >
> > > When the buffer is unaligned, the first byte of the buffer is
> handled
> > > separately. Furthermore, the calculated checksum of the buffer is
> byte
> > > shifted before being added to the initial checksum, to compensate
> for
> > > the checksum having been calculated on the buffer shifted by one
> byte.
> > >
> > > v4:
> > > * Add copyright notice.
> > > * Include stdbool.h (Emil Berg).
> > > * Use RTE_PTR_ADD (Emil Berg).
> > > * Fix one more typo in commit message. Is 'unligned' even a word?
> > > v3:
> > > * Remove braces from single statement block.
> > > * Fix typo in commit message.
> > > v2:
> > > * Do not assume that the buffer is part of an aligned packet
> buffer.
> > >
> > > Bugzilla ID: 1035
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > Tested-by: Emil Berg <emil.berg@ericsson.com>
> > > ---
> > >  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > >  1 file changed, 27 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > b502481670..738d643da0 100644
> > > --- a/lib/net/rte_ip.h
> > > +++ b/lib/net/rte_ip.h
> > > @@ -3,6 +3,7 @@
> > >   *      The Regents of the University of California.
> > >   * Copyright(c) 2010-2014 Intel Corporation.
> > >   * Copyright(c) 2014 6WIND S.A.
> > > + * Copyright(c) 2022 SmartShare Systems.
> > >   * All rights reserved.
> > >   */
> > >
> > > @@ -15,6 +16,7 @@
> > >   * IP-related defines
> > >   */
> > >
> > > +#include <stdbool.h>
> > >  #include <stdint.h>
> > >
> > >  #ifdef RTE_EXEC_ENV_WINDOWS
> > > @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > >  {
> > >  	/* extend strict-aliasing rules */
> > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > +	const u16_p *u16_buf;
> > > +	const u16_p *end;
> > > +	uint32_t bsum = 0;
> > > +	const bool unaligned = (uintptr_t)buf & 1;
> > > +
> > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > +	if (unlikely(unaligned)) {
> > > +		uint16_t first = 0;
> > > +		if (unlikely(len == 0))
> > > +			return 0;
> > > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > +		bsum += first;
> > > +		buf = RTE_PTR_ADD(buf, 1);
> > > +		len--;
> > > +	}
> > >
> > > +	/* aligned access for compiler auto-vectorization */
> > > +	u16_buf = (const u16_p *)buf;
> > > +	end = u16_buf + len / sizeof(*u16_buf);
> > >  	for (; u16_buf != end; ++u16_buf)
> > > -		sum += *u16_buf;
> > > +		bsum += *u16_buf;
> > >
> > >  	/* if length is odd, keeping it byte order independent */
> > >  	if (unlikely(len % 2)) {
> > >  		uint16_t left = 0;
> > >  		*(unsigned char *)&left = *(const unsigned char *)end;
> > > -		sum += left;
> > > +		bsum += left;
> > >  	}
> > >
> > > -	return sum;
> > > +	/* if buffer is unaligned, swap the checksum bytes */
> > > +	if (unlikely(unaligned))
> > > +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > > +
> > > +	return sum + bsum;
> > >  }
> > >
> > >  /**
> > > --
> > > 2.17.1
> >
> > @Emil, thank you for thoroughly reviewing the previous versions.
> >
> > If your test succeeds and you are satisfied with the patch, remember
> to reply
> > with a "Tested-by" tag for patchwork.
> 
> The test succeeded and I'm satisfied with the patch. I added 'Tested-
> by: Emil Berg <emil.berg@ericsson.com>' to the patch above, hopefully
> as you intended.

Thank you for testing! You don't need to put the tag inside the patch. Next time, just put the tag in your reply, like here, and Patchwork will catch it.

Tested-by: Emil Berg <emil.berg@ericsson.com>

The same goes for other tags, like Acked-by, Reviewed-by, etc.

-Morten
  
Mattias Rönnblom June 27, 2022, 12:28 p.m. UTC | #4
On 2022-06-23 14:51, Morten Brørup wrote:
>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>> Sent: Thursday, 23 June 2022 14.39
>>
>> With this patch, the checksum can be calculated on an unaligned buffer.
>> I.e. the buf parameter is no longer required to be 16 bit aligned.
>>
>> The checksum is still calculated using a 16 bit aligned pointer, so the
>> compiler can auto-vectorize the function's inner loop.
>>
>> When the buffer is unaligned, the first byte of the buffer is handled
>> separately. Furthermore, the calculated checksum of the buffer is byte
>> shifted before being added to the initial checksum, to compensate for
>> the
>> checksum having been calculated on the buffer shifted by one byte.
>>
>> v4:
>> * Add copyright notice.
>> * Include stdbool.h (Emil Berg).
>> * Use RTE_PTR_ADD (Emil Berg).
>> * Fix one more typo in commit message. Is 'unligned' even a word?
>> v3:
>> * Remove braces from single statement block.
>> * Fix typo in commit message.
>> v2:
>> * Do not assume that the buffer is part of an aligned packet buffer.
>>
>> Bugzilla ID: 1035
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>>   1 file changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
>> index b502481670..738d643da0 100644
>> --- a/lib/net/rte_ip.h
>> +++ b/lib/net/rte_ip.h
>> @@ -3,6 +3,7 @@
>>    *      The Regents of the University of California.
>>    * Copyright(c) 2010-2014 Intel Corporation.
>>    * Copyright(c) 2014 6WIND S.A.
>> + * Copyright(c) 2022 SmartShare Systems.
>>    * All rights reserved.
>>    */
>>
>> @@ -15,6 +16,7 @@
>>    * IP-related defines
>>    */
>>
>> +#include <stdbool.h>
>>   #include <stdint.h>
>>
>>   #ifdef RTE_EXEC_ENV_WINDOWS
>> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>>   {
>>   	/* extend strict-aliasing rules */
>>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>> -	const u16_p *u16_buf = (const u16_p *)buf;
>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>> +	const u16_p *u16_buf;
>> +	const u16_p *end;
>> +	uint32_t bsum = 0;
>> +	const bool unaligned = (uintptr_t)buf & 1;
>> +
>> +	/* if buffer is unaligned, keeping it byte order independent */
>> +	if (unlikely(unaligned)) {
>> +		uint16_t first = 0;
>> +		if (unlikely(len == 0))
>> +			return 0;
>> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
>> +		bsum += first;
>> +		buf = RTE_PTR_ADD(buf, 1);
>> +		len--;
>> +	}
>>
>> +	/* aligned access for compiler auto-vectorization */

The compiler will be able to auto-vectorize even unaligned accesses, 
just with different instructions. From what I can tell, there's no 
performance impact, at least not on the x86_64 systems I tried on.

I think you should remove the first special case conditional and use 
memcpy() instead of the cumbersome __may_alias__ construct to retrieve 
the data.
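
For illustration, the suggested memcpy() idiom for one 16-bit load is
sketched below (illustrative only; a full conversion appears later in this
thread). A fixed-size memcpy() is lowered by compilers to a plain load,
without alignment or aliasing problems:

	uint16_t v;

	memcpy(&v, buf, sizeof(v));	/* alignment- and aliasing-safe load */
	sum += v;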

>> +	u16_buf = (const u16_p *)buf;
>> +	end = u16_buf + len / sizeof(*u16_buf);
>>   	for (; u16_buf != end; ++u16_buf)
>> -		sum += *u16_buf;
>> +		bsum += *u16_buf;
>>
>>   	/* if length is odd, keeping it byte order independent */
>>   	if (unlikely(len % 2)) {
>>   		uint16_t left = 0;
>>   		*(unsigned char *)&left = *(const unsigned char *)end;
>> -		sum += left;
>> +		bsum += left;
>>   	}
>>
>> -	return sum;
>> +	/* if buffer is unaligned, swap the checksum bytes */
>> +	if (unlikely(unaligned))
>> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
>> +
>> +	return sum + bsum;
>>   }
>>
>>   /**
>> --
>> 2.17.1
> 
> @Emil, thank you for thoroughly reviewing the previous versions.
> 
> If your test succeeds and you are satisfied with the patch, remember to reply with a "Tested-by" tag for patchwork.
>
  
Emil Berg June 27, 2022, 12:46 p.m. UTC | #5
> -----Original Message-----
> From: Mattias Rönnblom <hofors@lysator.liu.se>
> Sent: den 27 juni 2022 14:28
> To: Morten Brørup <mb@smartsharesystems.com>; Emil Berg
> <emil.berg@ericsson.com>; bruce.richardson@intel.com; dev@dpdk.org
> Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> olivier.matz@6wind.com
> Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> 
> On 2022-06-23 14:51, Morten Brørup wrote:
> >> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> >> Sent: Thursday, 23 June 2022 14.39
> >>
> >> With this patch, the checksum can be calculated on an unaligned buffer.
> >> I.e. the buf parameter is no longer required to be 16 bit aligned.
> >>
> >> The checksum is still calculated using a 16 bit aligned pointer, so
> >> the compiler can auto-vectorize the function's inner loop.
> >>
> >> When the buffer is unaligned, the first byte of the buffer is handled
> >> separately. Furthermore, the calculated checksum of the buffer is
> >> byte shifted before being added to the initial checksum, to
> >> compensate for the checksum having been calculated on the buffer
> >> shifted by one byte.
> >>
> >> v4:
> >> * Add copyright notice.
> >> * Include stdbool.h (Emil Berg).
> >> * Use RTE_PTR_ADD (Emil Berg).
> >> * Fix one more typo in commit message. Is 'unligned' even a word?
> >> v3:
> >> * Remove braces from single statement block.
> >> * Fix typo in commit message.
> >> v2:
> >> * Do not assume that the buffer is part of an aligned packet buffer.
> >>
> >> Bugzilla ID: 1035
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >> ---
> >>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> >>   1 file changed, 27 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> >> b502481670..738d643da0 100644
> >> --- a/lib/net/rte_ip.h
> >> +++ b/lib/net/rte_ip.h
> >> @@ -3,6 +3,7 @@
> >>    *      The Regents of the University of California.
> >>    * Copyright(c) 2010-2014 Intel Corporation.
> >>    * Copyright(c) 2014 6WIND S.A.
> >> + * Copyright(c) 2022 SmartShare Systems.
> >>    * All rights reserved.
> >>    */
> >>
> >> @@ -15,6 +16,7 @@
> >>    * IP-related defines
> >>    */
> >>
> >> +#include <stdbool.h>
> >>   #include <stdint.h>
> >>
> >>   #ifdef RTE_EXEC_ENV_WINDOWS
> >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >>   {
> >>   	/* extend strict-aliasing rules */
> >>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> >> -	const u16_p *u16_buf = (const u16_p *)buf;
> >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> >> +	const u16_p *u16_buf;
> >> +	const u16_p *end;
> >> +	uint32_t bsum = 0;
> >> +	const bool unaligned = (uintptr_t)buf & 1;
> >> +
> >> +	/* if buffer is unaligned, keeping it byte order independent */
> >> +	if (unlikely(unaligned)) {
> >> +		uint16_t first = 0;
> >> +		if (unlikely(len == 0))
> >> +			return 0;
> >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> >> +		bsum += first;
> >> +		buf = RTE_PTR_ADD(buf, 1);
> >> +		len--;
> >> +	}
> >>
> >> +	/* aligned access for compiler auto-vectorization */
> 
> The compiler will be able to auto vectorize even unaligned accesses, just with
> different instructions. From what I can tell, there's no performance impact, at
> least not on the x86_64 systems I tried on.
> 
> I think you should remove the first special case conditional and use
> memcpy() instead of the cumbersome __may_alias__ construct to retrieve
> the data.
> 

Here:
https://www.agner.org/optimize/instruction_tables.pdf
it lists the latency of vmovdqa (aligned) as 6 cycles and the latency for vmovdqu (unaligned) as 7 cycles, so I guess there can be some difference,
although in practice I'm not sure what difference it makes. I've not seen any difference in runtime between the two versions.

> >> +	u16_buf = (const u16_p *)buf;
> >> +	end = u16_buf + len / sizeof(*u16_buf);
> >>   	for (; u16_buf != end; ++u16_buf)
> >> -		sum += *u16_buf;
> >> +		bsum += *u16_buf;
> >>
> >>   	/* if length is odd, keeping it byte order independent */
> >>   	if (unlikely(len % 2)) {
> >>   		uint16_t left = 0;
> >>   		*(unsigned char *)&left = *(const unsigned char *)end;
> >> -		sum += left;
> >> +		bsum += left;
> >>   	}
> >>
> >> -	return sum;
> >> +	/* if buffer is unaligned, swap the checksum bytes */
> >> +	if (unlikely(unaligned))
> >> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> >> +
> >> +	return sum + bsum;
> >>   }
> >>
> >>   /**
> >> --
> >> 2.17.1
> >
> > @Emil, thank you for thoroughly reviewing the previous versions.
> >
> > If your test succeeds and you are satisfied with the patch, remember to
> reply with a "Tested-by" tag for patchwork.
> >
  
Emil Berg June 27, 2022, 12:50 p.m. UTC | #6
> -----Original Message-----
> From: Emil Berg
> Sent: den 27 juni 2022 14:46
> To: Mattias Rönnblom <hofors@lysator.liu.se>; Morten Brørup
> <mb@smartsharesystems.com>; bruce.richardson@intel.com;
> dev@dpdk.org
> Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> olivier.matz@6wind.com
> Subject: RE: [PATCH v4] net: fix checksum with unaligned buffer
> 
> 
> 
> > -----Original Message-----
> > From: Mattias Rönnblom <hofors@lysator.liu.se>
> > Sent: den 27 juni 2022 14:28
> > To: Morten Brørup <mb@smartsharesystems.com>; Emil Berg
> > <emil.berg@ericsson.com>; bruce.richardson@intel.com; dev@dpdk.org
> > Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> > olivier.matz@6wind.com
> > Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> >
> > On 2022-06-23 14:51, Morten Brørup wrote:
> > >> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > >> Sent: Thursday, 23 June 2022 14.39
> > >>
> > >> With this patch, the checksum can be calculated on an unaligned buffer.
> > >> I.e. the buf parameter is no longer required to be 16 bit aligned.
> > >>
> > >> The checksum is still calculated using a 16 bit aligned pointer, so
> > >> the compiler can auto-vectorize the function's inner loop.
> > >>
> > >> When the buffer is unaligned, the first byte of the buffer is
> > >> handled separately. Furthermore, the calculated checksum of the
> > >> buffer is byte shifted before being added to the initial checksum,
> > >> to compensate for the checksum having been calculated on the buffer
> > >> shifted by one byte.
> > >>
> > >> v4:
> > >> * Add copyright notice.
> > >> * Include stdbool.h (Emil Berg).
> > >> * Use RTE_PTR_ADD (Emil Berg).
> > >> * Fix one more typo in commit message. Is 'unligned' even a word?
> > >> v3:
> > >> * Remove braces from single statement block.
> > >> * Fix typo in commit message.
> > >> v2:
> > >> * Do not assume that the buffer is part of an aligned packet buffer.
> > >>
> > >> Bugzilla ID: 1035
> > >> Cc: stable@dpdk.org
> > >>
> > >> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >> ---
> > >>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > >>   1 file changed, 27 insertions(+), 5 deletions(-)
> > >>
> > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > >> b502481670..738d643da0 100644
> > >> --- a/lib/net/rte_ip.h
> > >> +++ b/lib/net/rte_ip.h
> > >> @@ -3,6 +3,7 @@
> > >>    *      The Regents of the University of California.
> > >>    * Copyright(c) 2010-2014 Intel Corporation.
> > >>    * Copyright(c) 2014 6WIND S.A.
> > >> + * Copyright(c) 2022 SmartShare Systems.
> > >>    * All rights reserved.
> > >>    */
> > >>
> > >> @@ -15,6 +16,7 @@
> > >>    * IP-related defines
> > >>    */
> > >>
> > >> +#include <stdbool.h>
> > >>   #include <stdint.h>
> > >>
> > >>   #ifdef RTE_EXEC_ENV_WINDOWS
> > >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > >>   {
> > >>   	/* extend strict-aliasing rules */
> > >>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > >> -	const u16_p *u16_buf = (const u16_p *)buf;
> > >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > >> +	const u16_p *u16_buf;
> > >> +	const u16_p *end;
> > >> +	uint32_t bsum = 0;
> > >> +	const bool unaligned = (uintptr_t)buf & 1;
> > >> +
> > >> +	/* if buffer is unaligned, keeping it byte order independent */
> > >> +	if (unlikely(unaligned)) {
> > >> +		uint16_t first = 0;
> > >> +		if (unlikely(len == 0))
> > >> +			return 0;
> > >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > >> +		bsum += first;
> > >> +		buf = RTE_PTR_ADD(buf, 1);
> > >> +		len--;
> > >> +	}
> > >>
> > >> +	/* aligned access for compiler auto-vectorization */
> >
> > The compiler will be able to auto vectorize even unaligned accesses,
> > just with different instructions. From what I can tell, there's no
> > performance impact, at least not on the x86_64 systems I tried on.
> >
> > I think you should remove the first special case conditional and use
> > memcpy() instead of the cumbersome __may_alias__ construct to retrieve
> > the data.
> >
> 
> Here:
> https://www.agner.org/optimize/instruction_tables.pdf
> it lists the latency of vmovdqa (aligned) as 6 cycles and the latency for
> vmovdqu (unaligned) as 7 cycles. So I guess there can be some difference.
> Although in practice I'm not sure what difference it makes. I've not seen any
> difference in runtime between the two versions.
> 

Correction to my comment:
Those stats are for some older CPU. For some newer CPUs, such as Tiger Lake, the stats seem to be the same regardless of alignment.

> > >> +	u16_buf = (const u16_p *)buf;
> > >> +	end = u16_buf + len / sizeof(*u16_buf);
> > >>   	for (; u16_buf != end; ++u16_buf)
> > >> -		sum += *u16_buf;
> > >> +		bsum += *u16_buf;
> > >>
> > >>   	/* if length is odd, keeping it byte order independent */
> > >>   	if (unlikely(len % 2)) {
> > >>   		uint16_t left = 0;
> > >>   		*(unsigned char *)&left = *(const unsigned char *)end;
> > >> -		sum += left;
> > >> +		bsum += left;
> > >>   	}
> > >>
> > >> -	return sum;
> > >> +	/* if buffer is unaligned, swap the checksum bytes */
> > >> +	if (unlikely(unaligned))
> > >> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > >> +
> > >> +	return sum + bsum;
> > >>   }
> > >>
> > >>   /**
> > >> --
> > >> 2.17.1
> > >
> > > @Emil, thank you for thoroughly reviewing the previous versions.
> > >
> > > If your test succeeds and you are satisfied with the patch, remember
> > > to
> > reply with a "Tested-by" tag for patchwork.
> > >
  
Morten Brørup June 27, 2022, 1:22 p.m. UTC | #7
> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Monday, 27 June 2022 14.51
> 
> > From: Emil Berg
> > Sent: den 27 juni 2022 14:46
> >
> > > From: Mattias Rönnblom <hofors@lysator.liu.se>
> > > Sent: den 27 juni 2022 14:28
> > >
> > > On 2022-06-23 14:51, Morten Brørup wrote:
> > > >> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > >> Sent: Thursday, 23 June 2022 14.39
> > > >>
> > > >> With this patch, the checksum can be calculated on an unaligned
> buffer.
> > > >> I.e. the buf parameter is no longer required to be 16 bit
> aligned.
> > > >>
> > > >> The checksum is still calculated using a 16 bit aligned pointer,
> so
> > > >> the compiler can auto-vectorize the function's inner loop.
> > > >>
> > > >> When the buffer is unaligned, the first byte of the buffer is
> > > >> handled separately. Furthermore, the calculated checksum of the
> > > >> buffer is byte shifted before being added to the initial
> checksum,
> > > >> to compensate for the checksum having been calculated on the
> buffer
> > > >> shifted by one byte.
> > > >>
> > > >> v4:
> > > >> * Add copyright notice.
> > > >> * Include stdbool.h (Emil Berg).
> > > >> * Use RTE_PTR_ADD (Emil Berg).
> > > >> * Fix one more typo in commit message. Is 'unligned' even a
> word?
> > > >> v3:
> > > >> * Remove braces from single statement block.
> > > >> * Fix typo in commit message.
> > > >> v2:
> > > >> * Do not assume that the buffer is part of an aligned packet
> buffer.
> > > >>
> > > >> Bugzilla ID: 1035
> > > >> Cc: stable@dpdk.org
> > > >>
> > > >> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > >> ---
> > > >>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > > >>   1 file changed, 27 insertions(+), 5 deletions(-)
> > > >>
> > > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > >> b502481670..738d643da0 100644
> > > >> --- a/lib/net/rte_ip.h
> > > >> +++ b/lib/net/rte_ip.h
> > > >> @@ -3,6 +3,7 @@
> > > >>    *      The Regents of the University of California.
> > > >>    * Copyright(c) 2010-2014 Intel Corporation.
> > > >>    * Copyright(c) 2014 6WIND S.A.
> > > >> + * Copyright(c) 2022 SmartShare Systems.
> > > >>    * All rights reserved.
> > > >>    */
> > > >>
> > > >> @@ -15,6 +16,7 @@
> > > >>    * IP-related defines
> > > >>    */
> > > >>
> > > >> +#include <stdbool.h>
> > > >>   #include <stdint.h>
> > > >>
> > > >>   #ifdef RTE_EXEC_ENV_WINDOWS
> > > >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > > >>   {
> > > >>   	/* extend strict-aliasing rules */
> > > >>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > >> -	const u16_p *u16_buf = (const u16_p *)buf;
> > > >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > >> +	const u16_p *u16_buf;
> > > >> +	const u16_p *end;
> > > >> +	uint32_t bsum = 0;
> > > >> +	const bool unaligned = (uintptr_t)buf & 1;
> > > >> +
> > > >> +	/* if buffer is unaligned, keeping it byte order independent */
> > > >> +	if (unlikely(unaligned)) {
> > > >> +		uint16_t first = 0;
> > > >> +		if (unlikely(len == 0))
> > > >> +			return 0;
> > > >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > >> +		bsum += first;
> > > >> +		buf = RTE_PTR_ADD(buf, 1);
> > > >> +		len--;
> > > >> +	}
> > > >>
> > > >> +	/* aligned access for compiler auto-vectorization */
> > >
> > > The compiler will be able to auto vectorize even unaligned
> accesses,
> > > just with different instructions. From what I can tell, there's no
> > > performance impact, at least not on the x86_64 systems I tried on.
> > >
> > > I think you should remove the first special case conditional and
> use
> > > memcpy() instead of the cumbersome __may_alias__ construct to
> retrieve
> > > the data.
> > >
> >
> > Here:
> > https://www.agner.org/optimize/instruction_tables.pdf
> > it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
> for
> > vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> difference.
> > Although in practice I'm not sure what difference it makes. I've not
> seen any
> > difference in runtime between the two versions.
> >
> 
> Correction to my comment:
> Those stats are for some older CPU. For some newer CPUs such as Tiger
> Lake the stats seem to be the same regardless of aligned or unaligned.
> 

I agree that the memcpy method is more elegant and easy to read.

However, we would need to performance test the modified checksum function with a large number of CPUs to prove that we don't introduce a performance regression on any CPU architecture still supported by DPDK. And Emil already found a CPU where it costs 1 extra cycle per 16 bytes, which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP packet.

So I opted for a solution with zero changes to the inner loop, so no performance retesting is required (for the previously supported use cases, where the buffer is aligned).

I have previously submitted a couple of patches to fix some minor bugs in the mempool cache functions [1] and [2], while also refactoring the functions for readability. But even after I incorporated various feedback, nobody wanted to proceed with the patches, probably out of fear of performance regressions. I didn't want to risk the same with this patch.

[1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D8712B@smartserver.smartshare.dk/
[2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/
  
Mattias Rönnblom June 27, 2022, 5:22 p.m. UTC | #8
On 2022-06-27 15:22, Morten Brørup wrote:
>> From: Emil Berg [mailto:emil.berg@ericsson.com]
>> Sent: Monday, 27 June 2022 14.51
>>
>>> From: Emil Berg
>>> Sent: den 27 juni 2022 14:46
>>>
>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
>>>> Sent: den 27 juni 2022 14:28
>>>>
>>>> On 2022-06-23 14:51, Morten Brørup wrote:
>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>>>>>> Sent: Thursday, 23 June 2022 14.39
>>>>>>
>>>>>> With this patch, the checksum can be calculated on an unaligned
>> buffer.
>>>>>> I.e. the buf parameter is no longer required to be 16 bit
>> aligned.
>>>>>>
>>>>>> The checksum is still calculated using a 16 bit aligned pointer,
>> so
>>>>>> the compiler can auto-vectorize the function's inner loop.
>>>>>>
>>>>>> When the buffer is unaligned, the first byte of the buffer is
>>>>>> handled separately. Furthermore, the calculated checksum of the
>>>>>> buffer is byte shifted before being added to the initial
>> checksum,
>>>>>> to compensate for the checksum having been calculated on the
>> buffer
>>>>>> shifted by one byte.
>>>>>>
>>>>>> v4:
>>>>>> * Add copyright notice.
>>>>>> * Include stdbool.h (Emil Berg).
>>>>>> * Use RTE_PTR_ADD (Emil Berg).
>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
>> word?
>>>>>> v3:
>>>>>> * Remove braces from single statement block.
>>>>>> * Fix typo in commit message.
>>>>>> v2:
>>>>>> * Do not assume that the buffer is part of an aligned packet
>> buffer.
>>>>>>
>>>>>> Bugzilla ID: 1035
>>>>>> Cc: stable@dpdk.org
>>>>>>
>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>>>>> ---
>>>>>>    lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>>>>>>    1 file changed, 27 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
>>>>>> b502481670..738d643da0 100644
>>>>>> --- a/lib/net/rte_ip.h
>>>>>> +++ b/lib/net/rte_ip.h
>>>>>> @@ -3,6 +3,7 @@
>>>>>>     *      The Regents of the University of California.
>>>>>>     * Copyright(c) 2010-2014 Intel Corporation.
>>>>>>     * Copyright(c) 2014 6WIND S.A.
>>>>>> + * Copyright(c) 2022 SmartShare Systems.
>>>>>>     * All rights reserved.
>>>>>>     */
>>>>>>
>>>>>> @@ -15,6 +16,7 @@
>>>>>>     * IP-related defines
>>>>>>     */
>>>>>>
>>>>>> +#include <stdbool.h>
>>>>>>    #include <stdint.h>
>>>>>>
>>>>>>    #ifdef RTE_EXEC_ENV_WINDOWS
>>>>>> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>>>>>>    {
>>>>>>    	/* extend strict-aliasing rules */
>>>>>>    	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>>>>>> -	const u16_p *u16_buf = (const u16_p *)buf;
>>>>>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>>>>>> +	const u16_p *u16_buf;
>>>>>> +	const u16_p *end;
>>>>>> +	uint32_t bsum = 0;
>>>>>> +	const bool unaligned = (uintptr_t)buf & 1;
>>>>>> +
>>>>>> +	/* if buffer is unaligned, keeping it byte order independent */
>>>>>> +	if (unlikely(unaligned)) {
>>>>>> +		uint16_t first = 0;
>>>>>> +		if (unlikely(len == 0))
>>>>>> +			return 0;
>>>>>> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
>>>>>> +		bsum += first;
>>>>>> +		buf = RTE_PTR_ADD(buf, 1);
>>>>>> +		len--;
>>>>>> +	}
>>>>>>
>>>>>> +	/* aligned access for compiler auto-vectorization */
>>>>
>>>> The compiler will be able to auto vectorize even unaligned
>> accesses,
>>>> just with different instructions. From what I can tell, there's no
>>>> performance impact, at least not on the x86_64 systems I tried on.
>>>>
>>>> I think you should remove the first special case conditional and
>> use
>>>> memcpy() instead of the cumbersome __may_alias__ construct to
>> retrieve
>>>> the data.
>>>>
>>>
>>> Here:
>>> https://www.agner.org/optimize/instruction_tables.pdf
>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
>> for
>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
>> difference.
>>> Although in practice I'm not sure what difference it makes. I've not
>> seen any
>>> difference in runtime between the two versions.
>>>
>>
>> Correction to my comment:
>> Those stats are for some older CPU. For some newer CPUs such as Tiger
>> Lake the stats seem to be the same regardless of aligned or unaligned.
>>
> 
> I agree that the memcpy method is more elegant and easy to read.
> 
> However, we would need to performance test the modified checksum function with a large number of CPUs to prove that we don't introduce a performance regression on any CPU architecture still supported by DPDK. And Emil already found a CPU where it costs 1 extra cycle per 16 bytes, which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP packet.
> 

I think you've misunderstood what latency means in such tables. It's a 
data dependency thing, not a measure of throughput. The throughput is 
*much* higher. My guess would be two such instructions per clock.

For your 1460 byte example, my Zen3 AMD performs identically with 
the current DPDK implementation, your patch, and a memcpy()-ified 
version of the current implementation. They all need ~130 clock 
cycles/packet, with warm caches. IPC is 3 instructions per cycle, but 
obviously not all instructions are SIMD.

The main issue with checksumming on the CPU is, in my experience, not 
that you don't have enough compute, but that you trash the caches.

> So I opted for a solution with zero changes to the inner loop, so no performance retesting is required (for the previously supported use cases, where the buffer is aligned).
> 

You will see performance degradation with this solution as well, under 
certain conditions. For unaligned 100 bytes of data, the current DPDK 
implementation and the memcpy()-fied version need ~21 cc/packet. Your 
patch needs 54 cc/packet.

But the old version didn't support unaligned accesses? In many compiler 
flag/machine combinations it did.

> I have previously submitted a couple of patches to fix some minor bugs in the mempool cache functions [1] and [2], while also refactoring the functions for readability. But after having incorporated various feedback, nobody wants to proceed with the patches, probably due to fear of performance regressions. I didn't want to risk the same with this patch.
> 
> [1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D8712B@smartserver.smartshare.dk/
> [2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/
>
  
Morten Brørup June 27, 2022, 8:21 p.m. UTC | #9
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 27 June 2022 19.23
> 
> On 2022-06-27 15:22, Morten Brørup wrote:
> >> From: Emil Berg [mailto:emil.berg@ericsson.com]
> >> Sent: Monday, 27 June 2022 14.51
> >>
> >>> From: Emil Berg
> >>> Sent: den 27 juni 2022 14:46
> >>>
> >>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
> >>>> Sent: den 27 juni 2022 14:28
> >>>>
> >>>> On 2022-06-23 14:51, Morten Brørup wrote:
> >>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> >>>>>> Sent: Thursday, 23 June 2022 14.39
> >>>>>>
> >>>>>> With this patch, the checksum can be calculated on an unaligned
> >> buffer.
> >>>>>> I.e. the buf parameter is no longer required to be 16 bit
> >> aligned.
> >>>>>>
> >>>>>> The checksum is still calculated using a 16 bit aligned pointer,
> >> so
> >>>>>> the compiler can auto-vectorize the function's inner loop.
> >>>>>>
> >>>>>> When the buffer is unaligned, the first byte of the buffer is
> >>>>>> handled separately. Furthermore, the calculated checksum of the
> >>>>>> buffer is byte shifted before being added to the initial
> >> checksum,
> >>>>>> to compensate for the checksum having been calculated on the
> >> buffer
> >>>>>> shifted by one byte.
> >>>>>>
> >>>>>> v4:
> >>>>>> * Add copyright notice.
> >>>>>> * Include stdbool.h (Emil Berg).
> >>>>>> * Use RTE_PTR_ADD (Emil Berg).
> >>>>>> * Fix one more typo in commit message. Is 'unligned' even a
> >> word?
> >>>>>> v3:
> >>>>>> * Remove braces from single statement block.
> >>>>>> * Fix typo in commit message.
> >>>>>> v2:
> >>>>>> * Do not assume that the buffer is part of an aligned packet
> >> buffer.
> >>>>>>
> >>>>>> Bugzilla ID: 1035
> >>>>>> Cc: stable@dpdk.org
> >>>>>>
> >>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

[...]

> >>>>
> >>>> The compiler will be able to auto vectorize even unaligned
> >> accesses,
> >>>> just with different instructions. From what I can tell, there's no
> >>>> performance impact, at least not on the x86_64 systems I tried on.
> >>>>
> >>>> I think you should remove the first special case conditional and
> >> use
> >>>> memcpy() instead of the cumbersome __may_alias__ construct to
> >> retrieve
> >>>> the data.
> >>>>
> >>>
> >>> Here:
> >>> https://www.agner.org/optimize/instruction_tables.pdf
> >>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
> latency
> >> for
> >>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> >> difference.
> >>> Although in practice I'm not sure what difference it makes. I've
> not
> >> seen any
> >>> difference in runtime between the two versions.
> >>>
> >>
> >> Correction to my comment:
> >> Those stats are for some older CPU. For some newer CPUs such as
> Tiger
> >> Lake the stats seem to be the same regardless of aligned or
> unaligned.
> >>
> >
> > I agree that the memcpy method is more elegant and easy to read.
> >
> > However, we would need to performance test the modified checksum
> function with a large number of CPUs to prove that we don't introduce a
> performance regression on any CPU architecture still supported by DPDK.
> And Emil already found a CPU where it costs 1 extra cycle per 16 bytes,
> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
> packet.
> >
> 
> I think you've misunderstood what latency means in such tables. It's a
> data dependency thing, not a measure of throughput. The throughput is
> *much* higher. My guess would be two such instruction per clock.
> 
> For your 1460 bytes example, my Zen3 AMD needs performs identical with
> both the current DPDK implementation, your patch, and a memcpy()-ified
> version of the current implementation. They all need ~130 clock
> cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
> obvious not all instructions are SIMD.

You're right, I wasn't thinking deeper about it before extrapolating.

Great to see some real numbers! I wish someone would do the same testing on an old ARM CPU, so we could also see the other end of the scale.

> The main issue with checksumming on the CPU is, in my experience, not
> that you don't have enough compute, but that you trash the caches.

Agree. I have noticed that x86 has "non-temporal" instruction variants to load/store data without trashing the cache entirely.

A variant of the checksum function using such instructions might be handy.

Variants of the memcpy function using such instructions might also be handy for some purposes, e.g. copying the contents of packets, where the original and/or the copy will not be accessed shortly thereafter.
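
As a sketch of the idea (x86-only and illustrative, with assumed alignment
and size constraints; not an existing DPDK API), a copy that bypasses the
cache on the store side could look like:

#include <immintrin.h>
#include <stddef.h>

/* hypothetical helper: copy with non-temporal (cache-bypassing) stores;
 * assumes dst is 16-byte aligned and len is a multiple of 16 */
static void copy_nt(void *dst, const void *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i += 16) {
		__m128i v = _mm_loadu_si128(
			(const __m128i *)((const char *)src + i));
		_mm_stream_si128((__m128i *)((char *)dst + i), v);
	}
	_mm_sfence();	/* order the NT stores before subsequent use */
}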

> > So I opted for a solution with zero changes to the inner loop, so no
> performance retesting is required (for the previously supported use
> cases, where the buffer is aligned).
> >
> 
> You will see performance degradation with this solution as well, under
> certain conditions. For unaligned 100 bytes of data, the current DPDK
> implementation and the memcpy()-fied version needs ~21 cc/packet. Your
> patch needs 54 cc/packet.

Yes, it's a tradeoff. I exclusively aimed at maintaining performance for the case with aligned buffers (under all circumstances, with all CPUs etc.), and ignored how it affects the performance for the case with unaligned buffers.

Unlike this patch, the memcpy() variant has no additional branches for the unaligned case, so its performance should be generally unaffected by the buffer being aligned or not. However, I don't have sufficient in-depth CPU knowledge to say if this also applies to RISCV and older ARM CPUs still supported by DPDK.

I don't have strong feelings for this patch, so you can provide a memcpy() based alternative patch, and we will let the community decide which one they prefer.

> But the old version didn't support unaligned accesses? In many compiler
> flag/machine combinations it did.

It crashed in some cases. That was the point of the bug report [1], and the reason to provide a patch.

[1] https://bugs.dpdk.org/show_bug.cgi?id=1035
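
For context, the crash class in that report can be sketched as follows (a
hypothetical reproducer with assumed details, not taken from the bug
report). The old inner loop let the compiler assume 2-byte alignment, so it
could emit aligned SIMD loads (e.g. movdqa on x86) that fault on an odd
address:

	uint8_t data[1501] = {0};

	/* buf at an odd offset: undefined behaviour with the old code,
	 * and an observed crash when the loop was auto-vectorized */
	uint16_t cksum = rte_raw_cksum(data + 1, 1500);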
  
Mattias Rönnblom June 28, 2022, 6:28 a.m. UTC | #10
On 2022-06-27 22:21, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 27 June 2022 19.23
>>
>> On 2022-06-27 15:22, Morten Brørup wrote:
>>>> From: Emil Berg [mailto:emil.berg@ericsson.com]
>>>> Sent: Monday, 27 June 2022 14.51
>>>>
>>>>> From: Emil Berg
>>>>> Sent: den 27 juni 2022 14:46
>>>>>
>>>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
>>>>>> Sent: den 27 juni 2022 14:28
>>>>>>
>>>>>> On 2022-06-23 14:51, Morten Brørup wrote:
>>>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>>>>>>>> Sent: Thursday, 23 June 2022 14.39
>>>>>>>>
>>>>>>>> With this patch, the checksum can be calculated on an unaligned
>>>> buffer.
>>>>>>>> I.e. the buf parameter is no longer required to be 16 bit
>>>> aligned.
>>>>>>>>
>>>>>>>> The checksum is still calculated using a 16 bit aligned pointer,
>>>> so
>>>>>>>> the compiler can auto-vectorize the function's inner loop.
>>>>>>>>
>>>>>>>> When the buffer is unaligned, the first byte of the buffer is
>>>>>>>> handled separately. Furthermore, the calculated checksum of the
>>>>>>>> buffer is byte shifted before being added to the initial
>>>> checksum,
>>>>>>>> to compensate for the checksum having been calculated on the
>>>> buffer
>>>>>>>> shifted by one byte.
>>>>>>>>
>>>>>>>> v4:
>>>>>>>> * Add copyright notice.
>>>>>>>> * Include stdbool.h (Emil Berg).
>>>>>>>> * Use RTE_PTR_ADD (Emil Berg).
>>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
>>>> word?
>>>>>>>> v3:
>>>>>>>> * Remove braces from single statement block.
>>>>>>>> * Fix typo in commit message.
>>>>>>>> v2:
>>>>>>>> * Do not assume that the buffer is part of an aligned packet
>>>> buffer.
>>>>>>>>
>>>>>>>> Bugzilla ID: 1035
>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>
>>>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> [...]
> 
>>>>>>
>>>>>> The compiler will be able to auto vectorize even unaligned
>>>> accesses,
>>>>>> just with different instructions. From what I can tell, there's no
>>>>>> performance impact, at least not on the x86_64 systems I tried on.
>>>>>>
>>>>>> I think you should remove the first special case conditional and
>>>> use
>>>>>> memcpy() instead of the cumbersome __may_alias__ construct to
>>>> retrieve
>>>>>> the data.
>>>>>>
>>>>>
>>>>> Here:
>>>>> https://www.agner.org/optimize/instruction_tables.pdf
>>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
>> latency
>>>> for
>>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
>>>> difference.
>>>>> Although in practice I'm not sure what difference it makes. I've
>> not
>>>> seen any
>>>>> difference in runtime between the two versions.
>>>>>
>>>>
>>>> Correction to my comment:
>>>> Those stats are for some older CPU. For some newer CPUs such as
>> Tiger
>>>> Lake the stats seem to be the same regardless of aligned or
>> unaligned.
>>>>
>>>
>>> I agree that the memcpy method is more elegant and easy to read.
>>>
>>> However, we would need to performance test the modified checksum
>> function with a large number of CPUs to prove that we don't introduce a
>> performance regression on any CPU architecture still supported by DPDK.
>> And Emil already found a CPU where it costs 1 extra cycle per 16 bytes,
>> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
>> packet.
>>>
>>
>> I think you've misunderstood what latency means in such tables. It's a
>> data dependency thing, not a measure of throughput. The throughput is
>> *much* higher. My guess would be two such instruction per clock.
>>
>> For your 1460 bytes example, my Zen3 AMD needs performs identical with
>> both the current DPDK implementation, your patch, and a memcpy()-ified
>> version of the current implementation. They all need ~130 clock
>> cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
>> obvious not all instructions are SIMD.
> 
> You're right, I wasn't thinking deeper about it before extrapolating.
> 
> Great to see some real numbers! I wish someone would do the same testing on an old ARM CPU, so we could also see the other end of the scale.
> 

I've run it on an ARM A72. For the aligned 1460 byte case I got: 
current DPDK ~572 cc, your patch ~578 cc, memcpy-fied ~573 cc. They 
performed about the same for all alignments and sizes I tested. 
This platform (or it could be the GCC version) doesn't suffer from the 
unaligned performance degradation your patch showed on my AMD machine.

>> The main issue with checksumming on the CPU is, in my experience, not
>> that you don't have enough compute, but that you trash the caches.
> 
> Agree. I have noticed that x86 has "non-temporal" instruction variants to load/store data without trashing the cache entirely.
> 
> A variant of the checksum function using such instructions might be handy.
> 

Yes, although you may need to prefetch the payload for good performance.

> Variants of the memcpy function using such instructions might also be handy for some purposes, e.g. copying the contents of packets, where the original and/or copy will not accessed shortly thereafter.
> 

Indeed, and I think it's been discussed on the list. There's some work to 
get it right, since the alignment requirements and the fact that a 
different memory model is used for those SIMD instructions cause trouble 
for a generic implementation. (For x86_64.)

>>> So I opted for a solution with zero changes to the inner loop, so no
>> performance retesting is required (for the previously supported use
>> cases, where the buffer is aligned).
>>>
>>
>> You will see performance degradation with this solution as well, under
>> certain conditions. For unaligned 100 bytes of data, the current DPDK
>> implementation and the memcpy()-fied version needs ~21 cc/packet. Your
>> patch needs 54 cc/packet.
> 
> Yes, it's a tradeoff. I exclusively aimed at maintaining performance for the case with aligned buffers (under all circumstances, with all CPUs etc.), and ignored how it affects the performance for the case with unaligned buffers.
> 
> Unlike this patch, the memcpy() variant has no additional branches for the unaligned case, so its performance should be generally unaffected by the buffer being aligned or not. However, I don't have sufficient in-depth CPU knowledge to say if this also applies to RISCV and older ARM CPUs still supported by DPDK.
> 

I don't think avoiding non-catastrophic regressions on RISC-V trumps 
improving performance on mainstream CPUs and avoiding code quality 
regressions.

> I don't have strong feelings for this patch, so you can provide a memcpy() based alternative patch, and we will let the community decide which one they prefer.
> 

This is what my conversion looks like:

static inline uint32_t
__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const void *end;

	/* sum as 16-bit words; memcpy() keeps the loads alignment- and
	 * aliasing-safe, and compilers lower them to plain loads */
	for (end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t));
	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
		uint16_t v;

		memcpy(&v, buf, sizeof(uint16_t));
		sum += v;
	}

	/* if length is odd, keeping it byte order independent */
	if (unlikely(len % 2)) {
		uint8_t last;
		uint16_t left = 0;

		memcpy(&last, end, 1);
		*(unsigned char *)&left = last;
		sum += left;
	}

	return sum;
}

Maybe the maintainer can offer an opinion before I provide a real patch, 
to save me some work. In that case I would really like to contribute some 
performance autotests. I mean, you could be right and the performance 
could be totally off on some platform.
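
A rough sketch (an assumed harness, not a real DPDK autotest) of the kind
of per-packet cycle measurement referred to above:

#include <rte_cycles.h>

static volatile uint32_t sink;	/* keep the work from being optimized away */

static double
cc_per_call(const uint8_t *base, size_t offset, size_t len, unsigned int n)
{
	uint64_t start = rte_rdtsc();
	uint32_t acc = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		acc += __rte_raw_cksum(base + offset, len, 0);
	sink = acc;

	return (double)(rte_rdtsc() - start) / n;
}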

>> But the old version didn't support unaligned accesses? In many compiler
>> flag/machine combinations it did.
> 
> It crashed in some cases. That was the point of the bug report [1], and the reason to provide a patch.
> 
> [1] https://bugs.dpdk.org/show_bug.cgi?id=1035
>
  
Morten Brørup June 30, 2022, 4:28 p.m. UTC | #11
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 28 June 2022 08.28
> 
> On 2022-06-27 22:21, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 27 June 2022 19.23
> >>
> >> On 2022-06-27 15:22, Morten Brørup wrote:
> >>>> From: Emil Berg [mailto:emil.berg@ericsson.com]
> >>>> Sent: Monday, 27 June 2022 14.51
> >>>>
> >>>>> From: Emil Berg
> >>>>> Sent: den 27 juni 2022 14:46
> >>>>>
> >>>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
> >>>>>> Sent: den 27 juni 2022 14:28
> >>>>>>
> >>>>>> On 2022-06-23 14:51, Morten Brørup wrote:
> >>>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> >>>>>>>> Sent: Thursday, 23 June 2022 14.39
> >>>>>>>>
> >>>>>>>> With this patch, the checksum can be calculated on an
> unaligned
> >>>> buffer.
> >>>>>>>> I.e. the buf parameter is no longer required to be 16 bit
> >>>> aligned.
> >>>>>>>>
> >>>>>>>> The checksum is still calculated using a 16 bit aligned
> pointer,
> >>>> so
> >>>>>>>> the compiler can auto-vectorize the function's inner loop.
> >>>>>>>>
> >>>>>>>> When the buffer is unaligned, the first byte of the buffer is
> >>>>>>>> handled separately. Furthermore, the calculated checksum of
> the
> >>>>>>>> buffer is byte shifted before being added to the initial
> >>>> checksum,
> >>>>>>>> to compensate for the checksum having been calculated on the
> >>>> buffer
> >>>>>>>> shifted by one byte.
> >>>>>>>>
> >>>>>>>> v4:
> >>>>>>>> * Add copyright notice.
> >>>>>>>> * Include stdbool.h (Emil Berg).
> >>>>>>>> * Use RTE_PTR_ADD (Emil Berg).
> >>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
> >>>> word?
> >>>>>>>> v3:
> >>>>>>>> * Remove braces from single statement block.
> >>>>>>>> * Fix typo in commit message.
> >>>>>>>> v2:
> >>>>>>>> * Do not assume that the buffer is part of an aligned packet
> >>>> buffer.
> >>>>>>>>
> >>>>>>>> Bugzilla ID: 1035
> >>>>>>>> Cc: stable@dpdk.org
> >>>>>>>>
> >>>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > [...]
> >
> >>>>>>
> >>>>>> The compiler will be able to auto vectorize even unaligned
> >>>> accesses,
> >>>>>> just with different instructions. From what I can tell, there's
> no
> >>>>>> performance impact, at least not on the x86_64 systems I tried
> on.
> >>>>>>
> >>>>>> I think you should remove the first special case conditional and
> >>>> use
> >>>>>> memcpy() instead of the cumbersome __may_alias__ construct to
> >>>> retrieve
> >>>>>> the data.
> >>>>>>
> >>>>>
> >>>>> Here:
> >>>>> https://www.agner.org/optimize/instruction_tables.pdf
> >>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
> >> latency
> >>>> for
> >>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> >>>> difference.
> >>>>> Although in practice I'm not sure what difference it makes. I've
> >> not
> >>>> seen any
> >>>>> difference in runtime between the two versions.
> >>>>>
> >>>>
> >>>> Correction to my comment:
> >>>> Those stats are for some older CPU. For some newer CPUs such as
> >> Tiger
> >>>> Lake the stats seem to be the same regardless of aligned or
> >> unaligned.
> >>>>
> >>>
> >>> I agree that the memcpy method is more elegant and easy to read.
> >>>
> >>> However, we would need to performance test the modified checksum
> >> function with a large number of CPUs to prove that we don't
> introduce a
> >> performance regression on any CPU architecture still supported by
> DPDK.
> >> And Emil already found a CPU where it costs 1 extra cycle per 16
> bytes,
> >> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
> >> packet.
> >>>
> >>
> >> I think you've misunderstood what latency means in such tables. It's
> a
> >> data dependency thing, not a measure of throughput. The throughput
> is
> >> *much* higher. My guess would be two such instruction per clock.
> >>
> >> For your 1460 bytes example, my Zen3 AMD performs identically
> with
> >> both the current DPDK implementation, your patch, and a memcpy()-
> ified
> >> version of the current implementation. They all need ~130 clock
> >> cycles/packet, with warm caches. IPC is 3 instructions per cycle,
> but
> >> obviously not all instructions are SIMD.
> >
> > You're right, I wasn't thinking deeper about it before extrapolating.
> >
> > Great to see some real numbers! I wish someone would do the same
> testing on an old ARM CPU, so we could also see the other end of the
> scale.
> >
> 
> I've run it on an ARM A72. For the aligned 1460 bytes case I got:
> Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They
> performed about the same for all unaligned/aligned cases and sizes I tested.
> This platform (or it could be the GCC version) doesn't suffer from the
> unaligned performance degradation your patch showed on my AMD machine.
> 
> >> The main issue with checksumming on the CPU is, in my experience,
> not
> >> that you don't have enough compute, but that you trash the caches.
> >
> > Agree. I have noticed that x86 has "non-temporal" instruction
> variants to load/store data without trashing the cache entirely.
> >
> > A variant of the checksum function using such instructions might be
> handy.
> >
> 
> Yes, although you may need to prefetch the payload for good
> performance.
> 
> > Variants of the memcpy function using such instructions might also be
> handy for some purposes, e.g. copying the contents of packets, where
> the original and/or copy will not be accessed shortly thereafter.
> >
> 
> Indeed and I think it's been discussed on the list. There's some work
> to
> get it right, since the alignment requirement and the fact that a different
> memory model is used for those SIMD instructions cause trouble for a
> generic implementation. (For x86_64.)

I just posted an RFC [1] for such memcpy() and memset() functions,
so let's see how it pans out.

[1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87195@smartserver.smartshare.dk/T/#u
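
As an illustration of the non-temporal idea (a sketch only, assuming x86_64 with SSE2; the function name and interface below are made up, not taken from the RFC):

#include <emmintrin.h>
#include <stddef.h>

/* Copy with non-temporal stores so the destination does not evict
 * useful cache lines. Assumes dst is 16-byte aligned and len is a
 * multiple of 16; a real implementation must handle the tails. */
static void
copy_nt(void *dst, const void *src, size_t len)
{
	__m128i *d = (__m128i *)dst;
	const __m128i *s = (const __m128i *)src;
	size_t i;

	for (i = 0; i < len / 16; i++) {
		__m128i v = _mm_loadu_si128(s + i); /* source may be unaligned */
		_mm_stream_si128(d + i, v);         /* bypasses the cache */
	}
	_mm_sfence(); /* order the NT stores before subsequent stores */
}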

> 
> >>> So I opted for a solution with zero changes to the inner loop, so
> no
> >> performance retesting is required (for the previously supported use
> >> cases, where the buffer is aligned).
> >>>
> >>
> >> You will see performance degradation with this solution as well,
> under
> >> certain conditions. For unaligned 100 bytes of data, the current
> DPDK
> >> implementation and the memcpy()-fied version needs ~21 cc/packet.
> Your
> >> patch needs 54 cc/packet.
> >
> > Yes, it's a tradeoff. I exclusively aimed at maintaining performance
> for the case with aligned buffers (under all circumstances, with all
> CPUs etc.), and ignored how it affects the performance for the case
> with unaligned buffers.
> >
> > Unlike this patch, the memcpy() variant has no additional branches
> for the unaligned case, so its performance should be generally
> unaffected by the buffer being aligned or not. However, I don't have
> sufficient in-depth CPU knowledge to say if this also applies to RISCV
> and older ARM CPUs still supported by DPDK.
> >
> 
> I don't think avoiding RISCV non-catastrophic regressions trumps
> improving performance on mainstream CPUs and avoiding code quality
> regressions.
+1

> 
> > I don't have strong feelings for this patch, so you can provide a
> memcpy() based alternative patch, and we will let the community decide
> which one they prefer.
> >
> 
> This is what my conversion looks like:
> 
> static inline uint32_t
> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> {
> 	const void *end;
> 
> 	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) *
> sizeof(uint16_t));
> 	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> 		uint16_t v;
> 
> 		memcpy(&v, buf, sizeof(uint16_t));
> 		sum += v;
> 	}
> 
> 	/* if length is odd, keeping it byte order independent */
> 	if (unlikely(len % 2)) {
> 		uint8_t last;
> 		uint16_t left = 0;
> 
> 		memcpy(&last, end, 1);
> 		*(unsigned char *)&left = last;
> 		sum += left;
> 	}
> 
> 	return sum;
> }
> 
> Maybe the maintainer can have an opinion, before I provide a real
> patch,
> to save me some work. I would really like to contribute some
> performance
> autotests in that case. I mean, you could be right and the performance
> could be totally off on some platform.

Looking at the meson files, A72 seems to be the weakest type of CPU actively supported by DPDK.
So your testing shows better performance for your solution than mine under all conditions.

If you post a patch, I'll close this one as superseded.

> 
> >> But the old version didn't support unaligned accesses? In many
> compiler
> >> flag/machine combinations it did.
> >
> > It crashed in some cases. That was the point of the bug report [1],
> and the reason to provide a patch.
> >
> > [1] https://bugs.dpdk.org/show_bug.cgi?id=1035
> >
  
Stephen Hemminger June 30, 2022, 5:41 p.m. UTC | #12
On Thu, 23 Jun 2022 14:39:00 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..738d643da0 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -3,6 +3,7 @@
>   *      The Regents of the University of California.
>   * Copyright(c) 2010-2014 Intel Corporation.
>   * Copyright(c) 2014 6WIND S.A.
> + * Copyright(c) 2022 SmartShare Systems.
>   * All rights reserved.
>   */

NAK
Doing a small incremental fix should not change copyright.
  
Stephen Hemminger June 30, 2022, 5:45 p.m. UTC | #13
On Thu, 23 Jun 2022 14:39:00 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> +	/* if buffer is unaligned, keeping it byte order independent */
> +	if (unlikely(unaligned)) {
> +		uint16_t first = 0;
> +		if (unlikely(len == 0))
> +			return 0;

Why is length == 0 unique to unaligned case?

> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;

Use a proper union instead of casting to avoid aliasing warnings.
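
A union-based variant might look like this (a sketch along the lines of Stephen's suggestion; the helper name is made up):

#include <stdint.h>

/* Same byte placement as ((unsigned char *)&first)[1], but through a
 * union, whose member access does not provoke aliasing warnings. */
static inline uint32_t
cksum_add_first_byte(uint32_t sum, const void *buf)
{
	union {
		uint16_t u16;
		uint8_t u8[2];
	} first = { .u16 = 0 };

	first.u8[1] = *(const uint8_t *)buf;
	return sum + first.u16;
}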

> +		bsum += first;
> +		buf = RTE_PTR_ADD(buf, 1);
> +		len--;
> +	}

Many CPUs (such as x86) won't care about alignment and therefore the extra
code to handle this is not worth doing.

Perhaps DPDK needs a macro (like Linux kernel) for efficient unaligned access.

In Linux kernel it is CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
  
Emil Berg July 1, 2022, 4:11 a.m. UTC | #14
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: den 30 juni 2022 19:46
> To: Morten Brørup <mb@smartsharesystems.com>
> Cc: Emil Berg <emil.berg@ericsson.com>; bruce.richardson@intel.com;
> dev@dpdk.org; stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com
> Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> 
> On Thu, 23 Jun 2022 14:39:00 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely(unaligned)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> 
> Why is length == 0 unique to unaligned case?
> 
> > +		((unsigned char *)&first)[1] = *(const unsigned
> char *)buf;
> 
> Use a proper union instead of casting to avoid aliasing warnings.
> 
> > +		bsum += first;
> > +		buf = RTE_PTR_ADD(buf, 1);
> > +		len--;
> > +	}
> 
> Many CPUs (such as x86) won't care about alignment and therefore the
> extra code to handle this is not worth doing.
> 

x86 does care about alignment. An example is the vmovdqa instruction, where 'a' stands for 'aligned'. The description in the link below says: "When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. "

https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64
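
For illustration, the distinction is visible directly in the SSE2 intrinsics (a sketch; the function names are illustrative):

#include <emmintrin.h>

/* _mm_load_si128 compiles to (v)movdqa and faults (#GP) if the pointer
 * is not 16-byte aligned; _mm_loadu_si128 compiles to (v)movdqu and
 * accepts any alignment. */
__m128i
load_aligned(const void *p)
{
	return _mm_load_si128((const __m128i *)p);
}

__m128i
load_unaligned(const void *p)
{
	return _mm_loadu_si128((const __m128i *)p);
}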

> Perhaps DPDK needs a macro (like Linux kernel) for efficient unaligned
> access.
> 
> In Linux kernel it is CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
  
Morten Brørup July 1, 2022, 4:50 p.m. UTC | #15
> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Friday, 1 July 2022 06.11
> 
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: den 30 juni 2022 19:46
> >
> > On Thu, 23 Jun 2022 14:39:00 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > +	if (unlikely(unaligned)) {
> > > +		uint16_t first = 0;
> > > +		if (unlikely(len == 0))
> > > +			return 0;
> >
> > Why is length == 0 unique to unaligned case?

Because the aligned case handles it gracefully. The unaligned case subtracts one from 'len', which (being unsigned) would become a very large number, causing 'end' to become way off.
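
A minimal illustration of that wrap-around (an illustrative example, not code from the patch):

#include <stddef.h>
#include <stdio.h>

int
main(void)
{
	size_t len = 0;

	len--; /* size_t is unsigned: wraps around to SIZE_MAX */
	printf("%zu\n", len); /* 18446744073709551615 on a 64-bit system */
	/* end = RTE_PTR_ADD(buf, (len / 2) * 2) would then point far
	 * outside the buffer, hence the early return for len == 0. */
	return 0;
}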

> >
> > > +		((unsigned char *)&first)[1] = *(const unsigned
> > char *)buf;
> >
> > Use a proper union instead of casting to avoid aliasing warnings.

I copied what is done by 'left', so it resembles the existing code in the function, making it easier to review.

It is part of the endian neutral checksum handling. Tricky stuff! :-)
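
A sketch of what makes the 'left' handling byte order independent (illustrative, not from the patch):

#include <stdint.h>
#include <string.h>

/* The odd trailing byte is padded into a 16-bit word as if the buffer
 * had one more zero byte: writing the first byte in memory (rather
 * than the numerically low byte) makes the word equal to what a 16-bit
 * load of "byte, 0x00" would return, on little and big endian alike. */
static inline uint16_t
pad_trailing_byte(const void *p)
{
	uint16_t left = 0;

	memcpy(&left, p, 1); /* same effect as *(unsigned char *)&left = ... */
	return left;
}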

> >
> > > +		bsum += first;
> > > +		buf = RTE_PTR_ADD(buf, 1);
> > > +		len--;
> > > +	}
> >
> > Many CPUs (such as x86) won't care about alignment and therefore the
> > extra code to handle this is not worth doing.
> >
> 
> x86 does care about alignment. An example is the vmovdqa instruction,
> where 'a' stands for 'aligned'. The description in the link below says:
> "When the source or destination operand is a memory operand, the
> operand must be aligned on a 16-byte boundary or a general-protection
> exception (#GP) will be generated. "
> 
> https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64
> 

Also, this misconception is exactly the bug [1] this patch fixes.

[1] https://bugs.dpdk.org/show_bug.cgi?id=1035

> > Perhaps DPDK needs a macro (like Linux kernel) for efficient
> unaligned
> > access.
> >
> > In Linux kernel it is CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

I recently stumbled across RTE_ARCH_STRICT_ALIGN in /lib/eal/include/rte_common.h.

But I guess it is something else.

Anyway, this function has ugly alignment problems (also before the patch), and has gone through a couple of iterations to silence warnings from the compiler. These warnings should have been addressed instead of silenced. Mattias has suggested a far better solution [2] than mine, which also correctly addresses the compiler alignment warnings, so we will probably end up with his solution instead.

[2] http://inbox.dpdk.org/dev/AM8PR07MB7666AD7BF7B780CC5062C14598BD9@AM8PR07MB7666.eurprd07.prod.outlook.com/T/#m1a76490541fce4a85b12d9390f2f4fac5a9f4660
  
Stephen Hemminger July 1, 2022, 5:04 p.m. UTC | #16
On Fri, 1 Jul 2022 18:50:34 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> But I guess it is something else.
> 
> Anyway, this function has ugly alignment problems (also before the patch), and has gone through a couple of iterations to silence warnings from the compiler. These warnings should have been addressed instead of silenced. Mattias has suggested a far better solution [2] than mine, which also correctly addresses the compiler alignment warnings, so we will probably end up with his solution instead.
> 
> [2] http://inbox.dpdk.org/dev/AM8PR07MB7666AD7BF7B780CC5062C14598BD9@AM8PR07MB7666.eurprd07.prod.outlook.com/T/#m1a76490541fce4a85b12d9390f2f4fac5a9f4660
> 


Maybe some mix of the memcpy for the unaligned and odd-length cases and a faster (unrolled?) version for the aligned, exact-multiple case?
Or just take code from FreeBSD?
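
A hybrid along those lines might look like this (a hypothetical sketch with made-up names; the thread ends up preferring the plain memcpy() loop instead):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hybrid: an unrolled pass consumes four 16-bit words per
 * iteration; memcpy()-based loads keep every path alignment-safe. */
static inline uint32_t
raw_cksum_hybrid(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	uint16_t v;

	while (len >= 8) {
		memcpy(&v, p, 2); sum += v;
		memcpy(&v, p + 2, 2); sum += v;
		memcpy(&v, p + 4, 2); sum += v;
		memcpy(&v, p + 6, 2); sum += v;
		p += 8;
		len -= 8;
	}
	while (len >= 2) {
		memcpy(&v, p, 2);
		sum += v;
		p += 2;
		len -= 2;
	}
	if (len) { /* odd trailing byte, byte order independent */
		uint16_t left = 0;

		memcpy(&left, p, 1);
		sum += left;
	}
	return sum;
}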
  
Morten Brørup July 1, 2022, 8:46 p.m. UTC | #17
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 1 July 2022 19.05
> 
> On Fri, 1 Jul 2022 18:50:34 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > But I guess it is something else.
> >
> > Anyway, this function has ugly alignment problems (also before the
> patch), and has gone through a couple of iterations to silence warnings
> from the compiler. These warnings should have been addressed instead of
> silenced. Mattias has suggested a far better solution [2] than mine,
> which also correctly addresses the compiler alignment warnings, so we
> will probably end up with his solution instead.
> >
> > [2]
> http://inbox.dpdk.org/dev/AM8PR07MB7666AD7BF7B780CC5062C14598BD9@AM8PR0
> 7MB7666.eurprd07.prod.outlook.com/T/#m1a76490541fce4a85b12d9390f2f4fac5
> a9f4660
> >
> 
> 
> Maybe some mix of the memcpy for the unaligned and odd-length cases and
> a faster (unrolled?) version for the aligned, exact-multiple case?
> Or just take code from FreeBSD?

I just took a look at the BSD code, and it starts with the same "if ptr is unaligned" check as my patch, and then does some manual loop unrolling, which we expect the compiler to do. Mattias has demonstrated that his solution has better performance, not only on modern x86 CPUs but also on an A72 CPU, so I prefer his solution. And the difference between using "vmovdqa" and "vmovdqu" instructions here seems to be insignificant.
  
Stanislaw Kardach July 7, 2022, 3:21 p.m. UTC | #18
On Thu, Jun 30, 2022 at 6:32 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> [...]
> > I don't think avoiding RISCV non-catastrophic regressions trumps
> > improving performance on mainstream CPUs and avoiding code quality
> > regressions.
> +1
+1. In general the RISC-V spec leaves unaligned load/store handling to
the implementation (it might fault, it might not). The U74 core that I
have at hand allows unaligned reads/writes, though it's not a platform
for performance evaluation (time measurement causes a trap to
firmware), so I won't say anything on that.


--
Best Regards,
Stanisław Kardach
  

Patch

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..738d643da0 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -3,6 +3,7 @@ 
  *      The Regents of the University of California.
  * Copyright(c) 2010-2014 Intel Corporation.
  * Copyright(c) 2014 6WIND S.A.
+ * Copyright(c) 2022 SmartShare Systems.
  * All rights reserved.
  */
 
@@ -15,6 +16,7 @@ 
  * IP-related defines
  */
 
+#include <stdbool.h>
 #include <stdint.h>
 
 #ifdef RTE_EXEC_ENV_WINDOWS
@@ -162,20 +164,40 @@  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
 	/* extend strict-aliasing rules */
 	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const u16_p *u16_buf;
+	const u16_p *end;
+	uint32_t bsum = 0;
+	const bool unaligned = (uintptr_t)buf & 1;
+
+	/* if buffer is unaligned, keeping it byte order independent */
+	if (unlikely(unaligned)) {
+		uint16_t first = 0;
+		if (unlikely(len == 0))
+			return 0;
+		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
+		bsum += first;
+		buf = RTE_PTR_ADD(buf, 1);
+		len--;
+	}
 
+	/* aligned access for compiler auto-vectorization */
+	u16_buf = (const u16_p *)buf;
+	end = u16_buf + len / sizeof(*u16_buf);
 	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+		bsum += *u16_buf;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
 		uint16_t left = 0;
 		*(unsigned char *)&left = *(const unsigned char *)end;
-		sum += left;
+		bsum += left;
 	}
 
-	return sum;
+	/* if buffer is unaligned, swap the checksum bytes */
+	if (unlikely(unaligned))
+		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
+
+	return sum + bsum;
 }
 
 /**