Message ID | 20220225163804.506142-1-lucp.at.work@gmail.com (mailing list archive)
---|---
State | Accepted
Delegated to: | David Marchand
Series | [v7] eal: fix rte_memcpy strict aliasing/alignment bugs
Context | Check | Description |
---|---|---|
ci/intel-Testing | success | Testing PASS |
ci/Intel-compilation | success | Compilation OK |
ci/iol-abi-testing | success | Testing PASS |
ci/iol-x86_64-unit-testing | success | Testing PASS |
ci/iol-aarch64-compile-testing | success | Testing PASS |
ci/iol-x86_64-compile-testing | success | Testing PASS |
ci/iol-aarch64-unit-testing | success | Testing PASS |
ci/github-robot: build | success | github build: passed |
ci/iol-intel-Performance | success | Performance Testing PASS |
ci/iol-broadcom-Performance | success | Performance Testing PASS |
ci/iol-mellanox-Performance | success | Performance Testing PASS |
ci/iol-intel-Functional | success | Functional Testing PASS |
ci/iol-broadcom-Functional | success | Functional Testing PASS |
ci/checkpatch | warning | coding style issues |
> Calls to rte_memcpy for 1 < n < 16 could result in unaligned
> loads/stores, which is undefined behaviour according to the C
> standard, and strict aliasing violations.
>
> The code was changed to use a packed structure that allows aliasing
> (using the __may_alias__ attribute) to perform the load/store
> operations. This results in code that has the same performance as the
> original code and that is also C standards-compliant.
>
> Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
> Cc: Xiaoyun Li <xiaoyun.li@intel.com>
> Cc: stable@dpdk.org
>
> Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
> ---
> v7:
> * Fix coding style issue by adding new __rte_may_alias macro rather
>   than directly use __attribute__
>
> v6:
> * Refocus to fix strict aliasing problems discovered following
>   discussions in this thread.
> * Modified the code to use __may_alias__ and packed structure. This fixes
>   both the undefined behaviour of unaligned access (which is not the main
>   concern), and also fixes the strict aliasing violations (which can cause
>   major bugs, as demonstrated in a previous message in this thread).
> * Renamed new function from rte_mov15_or_less_unaligned to
>   rte_mov15_or_less.
> * Modified code that copies <= 15 bytes to call rte_mov15_or_less.
>
> v5:
> * Replaced assembly with pure C code that uses a packed structure to make
>   unaligned loads conform to C standard.
>
> v4:
> * Added volatile to asm statements, which is required under gcc.
>
> v3:
> * Removed for loop and went back to using assembly.
>
> v2:
> * Replaced assembly with a regular for loop that copies bytes one by
>   one.
>
> v1:
> * Fix undefined behaviour of unaligned stores/loads by using assembly
>   to perform stores/loads.

LGTM, thanks for highlighting and fixing the problem.
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

As a side note, we probably need to check other similar places in DPDK code.
On Thu, Mar 10, 2022 at 3:55 PM Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:
> > Calls to rte_memcpy for 1 < n < 16 could result in unaligned
> > loads/stores, which is undefined behaviour according to the C
> > standard, and strict aliasing violations.
> >
> > The code was changed to use a packed structure that allows aliasing
> > (using the __may_alias__ attribute) to perform the load/store
> > operations. This results in code that has the same performance as the
> > original code and that is also C standards-compliant.
> >
> > Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Tested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Thanks, applied.

> As a side note, we probably need to check other similar places in DPDK code.

What would be the best way to detect those problematic places?
I tried UBsan, and it did report some of the issues fixed with this patch.
On Thu, Apr 7, 2022 at 5:24 PM David Marchand <david.marchand@redhat.com> wrote:
>
> On Thu, Mar 10, 2022 at 3:55 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> > > Calls to rte_memcpy for 1 < n < 16 could result in unaligned
> > > loads/stores, which is undefined behaviour according to the C
> > > standard, and strict aliasing violations.
> > >
> > > The code was changed to use a packed structure that allows aliasing
> > > (using the __may_alias__ attribute) to perform the load/store
> > > operations. This results in code that has the same performance as the
> > > original code and that is also C standards-compliant.
> > >
> > > Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")

Actually, looking again at the history, it fixes:
Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")

I'll change before pushing.
On Thu, Apr 7, 2022 at 5:32 PM David Marchand <david.marchand@redhat.com> wrote:
>
> On Thu, Apr 7, 2022 at 5:24 PM David Marchand <david.marchand@redhat.com> wrote:
> >
> > On Thu, Mar 10, 2022 at 3:55 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > > > Calls to rte_memcpy for 1 < n < 16 could result in unaligned
> > > > loads/stores, which is undefined behaviour according to the C
> > > > standard, and strict aliasing violations.
> > > >
> > > > The code was changed to use a packed structure that allows aliasing
> > > > (using the __may_alias__ attribute) to perform the load/store
> > > > operations. This results in code that has the same performance as the
> > > > original code and that is also C standards-compliant.
> > > >
> > > > Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
>
> Actually, looking again at the history, it fixes:
> Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")

Nop, that's probably even older, could you double check?
I'll hold on pushing this fix.
Hi David,

On Thu, Apr 7, 2022 at 11:24 AM David Marchand <david.marchand@redhat.com> wrote:
> > As a side note, we probably need to check other similar places in DPDK code.
>
> What would be the best way to detect those problematic places?

As far as I'm aware, there is no silver bullet to detect all strict
aliasing violations. A good summary of the different options is
given here:

https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8#catching-strict-aliasing-violations

However, none of these options is 100% reliable and they might still miss
some strict aliasing violations. The bottom line is that anywhere there's
a cast to something other than void* or char*, it could be a strict
aliasing violation. So, doing an exhaustive search throughout the code
base for casts seems like the only (tedious, time-consuming) solution.
Hi David,

> > Actually, looking again at the history, it fixes:
> > Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")
>
> Nop, that's probably even older, could you double check?
> I'll hold on pushing this fix.

It seems you still haven't received a response. I'll take a stab at this.

I think it fixes:

Fixes: 76746eb13f08 ("eal/x86: fix strict aliasing rules")

However, the ordering of the ifs (from 1 to 8, rather than 8 to 1)
seems to date back to the first commit (af75078fece3). So, I'm not
sure how you want to handle this.

Thanks.
Hello Luc,

On Fri, May 13, 2022 at 9:16 PM Luc Pelletier <lucp.at.work@gmail.com> wrote:
> > > Actually, looking again at the history, it fixes:
> > > Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")
> >
> > Nop, that's probably even older, could you double check?
> > I'll hold on pushing this fix.
>
> It seems you still haven't received a response. I'll take a stab at this.
>
> I think it fixes:
>
> Fixes: 76746eb13f08 ("eal/x86: fix strict aliasing rules")
>
> However, the ordering of the ifs (from 1 to 8, rather than 8 to 1)
> seems to date back to the first commit (af75078fece3). So, I'm not
> sure how you want to handle this.

My understanding was that the issue was there since "day 1".
I'll go with the latter then.
Backporting will be for all LTS branches in any case.

Thanks!
On Thu, Apr 7, 2022 at 5:24 PM David Marchand <david.marchand@redhat.com> wrote:
> On Thu, Mar 10, 2022 at 3:55 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> > > Calls to rte_memcpy for 1 < n < 16 could result in unaligned
> > > loads/stores, which is undefined behaviour according to the C
> > > standard, and strict aliasing violations.
> > >
> > > The code was changed to use a packed structure that allows aliasing
> > > (using the __may_alias__ attribute) to perform the load/store
> > > operations. This results in code that has the same performance as the
> > > original code and that is also C standards-compliant.
> > >
> > > Fixes: af75078fece3 ("first public release")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > Tested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Applied, thanks Luc.
diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 4a399cc7c8..2f1ec69f3d 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -85,6 +85,11 @@ typedef uint16_t unaligned_uint16_t;
  */
 #define __rte_packed __attribute__((__packed__))
 
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#define __rte_may_alias __attribute__((__may_alias__))
+
 /******* Macro to mark functions and fields scheduled for removal *****/
 #define __rte_deprecated __attribute__((__deprecated__))
 #define __rte_deprecated_msg(msg) __attribute__((__deprecated__(msg)))
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 1b6c6e585f..18aa4e43a7 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -45,6 +45,52 @@ extern "C" {
 static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n);
 
+/**
+ * Copy bytes from one location to another,
+ * locations should not overlap.
+ * Use with n <= 15.
+ */
+static __rte_always_inline void *
+rte_mov15_or_less(void *dst, const void *src, size_t n)
+{
+	/**
+	 * Use the following structs to avoid violating C standard
+	 * alignment requirements and to avoid strict aliasing bugs
+	 */
+	struct rte_uint64_alias {
+		uint64_t val;
+	} __rte_packed __rte_may_alias;
+	struct rte_uint32_alias {
+		uint32_t val;
+	} __rte_packed __rte_may_alias;
+	struct rte_uint16_alias {
+		uint16_t val;
+	} __rte_packed __rte_may_alias;
+
+	void *ret = dst;
+	if (n & 8) {
+		((struct rte_uint64_alias *)dst)->val =
+			((const struct rte_uint64_alias *)src)->val;
+		src = (const uint64_t *)src + 1;
+		dst = (uint64_t *)dst + 1;
+	}
+	if (n & 4) {
+		((struct rte_uint32_alias *)dst)->val =
+			((const struct rte_uint32_alias *)src)->val;
+		src = (const uint32_t *)src + 1;
+		dst = (uint32_t *)dst + 1;
+	}
+	if (n & 2) {
+		((struct rte_uint16_alias *)dst)->val =
+			((const struct rte_uint16_alias *)src)->val;
+		src = (const uint16_t *)src + 1;
+		dst = (uint16_t *)dst + 1;
+	}
+	if (n & 1)
+		*(uint8_t *)dst = *(const uint8_t *)src;
+	return ret;
+}
+
 #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
 
 #define ALIGNMENT_MASK 0x3F
@@ -171,8 +217,6 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -181,24 +225,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
+		return rte_mov15_or_less(dst, src, n);
 	}
 
 	/**
@@ -379,8 +406,6 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -389,25 +414,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less(dst, src, n);
 	}
 
 	/**
@@ -672,8 +679,6 @@ static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t srcofs;
@@ -682,25 +687,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less(dst, src, n);
 	}
 
 	/**
@@ -818,27 +805,9 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
 {
 	void *ret = dst;
 
-	/* Copy size <= 16 bytes */
+	/* Copy size < 16 bytes */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
+		return rte_mov15_or_less(dst, src, n);
 	}
 
 	/* Copy 16 <= size <= 32 bytes */
Calls to rte_memcpy for 1 < n < 16 could result in unaligned
loads/stores, which is undefined behaviour according to the C
standard, and strict aliasing violations.

The code was changed to use a packed structure that allows aliasing
(using the __may_alias__ attribute) to perform the load/store
operations. This results in code that has the same performance as the
original code and that is also C standards-compliant.

Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
Cc: Xiaoyun Li <xiaoyun.li@intel.com>
Cc: stable@dpdk.org

Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
---
v7:
* Fix coding style issue by adding new __rte_may_alias macro rather
  than directly use __attribute__

v6:
* Refocus to fix strict aliasing problems discovered following
  discussions in this thread.
* Modified the code to use __may_alias__ and packed structure. This fixes
  both the undefined behaviour of unaligned access (which is not the main
  concern), and also fixes the strict aliasing violations (which can cause
  major bugs, as demonstrated in a previous message in this thread).
* Renamed new function from rte_mov15_or_less_unaligned to
  rte_mov15_or_less.
* Modified code that copies <= 15 bytes to call rte_mov15_or_less.

v5:
* Replaced assembly with pure C code that uses a packed structure to make
  unaligned loads conform to C standard.

v4:
* Added volatile to asm statements, which is required under gcc.

v3:
* Removed for loop and went back to using assembly.

v2:
* Replaced assembly with a regular for loop that copies bytes one by
  one.

v1:
* Fix undefined behaviour of unaligned stores/loads by using assembly
  to perform stores/loads.

 lib/eal/include/rte_common.h     |   5 ++
 lib/eal/x86/include/rte_memcpy.h | 133 ++++++++++++-------------------
 2 files changed, 56 insertions(+), 82 deletions(-)