[dpdk-dev] mbuf: replace c memcpy code semantics with optimized rte_memcpy

Message ID 1464101442-10501-1-git-send-email-jerin.jacob@caviumnetworks.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon

Commit Message

Jerin Jacob May 24, 2016, 2:50 p.m. UTC
  Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_mempool/rte_mempool.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
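
 Editorial note (not part of the original commit message): the change boils
 down to replacing the per-pointer copy loop in __mempool_put_bulk() with a
 single bulk copy of the n object pointers into the per-lcore cache, as the
 diff quoted below shows:

	/* before: copy the n object pointers into the cache one at a time */
	for (index = 0; index < n; ++index, obj_table++)
		cache_objs[index] = *obj_table;

	/* after: one (potentially arch-optimized) bulk copy of n pointers */
	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);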
  

Comments

Olivier Matz May 24, 2016, 2:59 p.m. UTC | #1
Hi Jerin,


On 05/24/2016 04:50 PM, Jerin Jacob wrote:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  lib/librte_mempool/rte_mempool.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> index ed2c110..ebe399a 100644
> --- a/lib/librte_mempool/rte_mempool.h
> +++ b/lib/librte_mempool/rte_mempool.h
> @@ -74,6 +74,7 @@
>  #include <rte_memory.h>
>  #include <rte_branch_prediction.h>
>  #include <rte_ring.h>
> +#include <rte_memcpy.h>
>  
>  #ifdef __cplusplus
>  extern "C" {
> @@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  		    unsigned n, __rte_unused int is_mp)
>  {
>  	struct rte_mempool_cache *cache;
> -	uint32_t index;
>  	void **cache_objs;
>  	unsigned lcore_id = rte_lcore_id();
>  	uint32_t cache_size = mp->cache_size;
> @@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  	 */
>  
>  	/* Add elements back into the cache */
> -	for (index = 0; index < n; ++index, obj_table++)
> -		cache_objs[index] = *obj_table;
> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>  
>  	cache->len += n;
>  
> 

The commit title should be "mempool" instead of "mbuf".
Are you seeing some performance improvement by using rte_memcpy()?

Regards
Olivier
  
Jerin Jacob May 24, 2016, 3:17 p.m. UTC | #2
On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
> Hi Jerin,
> 
> 
> On 05/24/2016 04:50 PM, Jerin Jacob wrote:
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> >  lib/librte_mempool/rte_mempool.h | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> > index ed2c110..ebe399a 100644
> > --- a/lib/librte_mempool/rte_mempool.h
> > +++ b/lib/librte_mempool/rte_mempool.h
> > @@ -74,6 +74,7 @@
> >  #include <rte_memory.h>
> >  #include <rte_branch_prediction.h>
> >  #include <rte_ring.h>
> > +#include <rte_memcpy.h>
> >  
> >  #ifdef __cplusplus
> >  extern "C" {
> > @@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  		    unsigned n, __rte_unused int is_mp)
> >  {
> >  	struct rte_mempool_cache *cache;
> > -	uint32_t index;
> >  	void **cache_objs;
> >  	unsigned lcore_id = rte_lcore_id();
> >  	uint32_t cache_size = mp->cache_size;
> > @@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  	 */
> >  
> >  	/* Add elements back into the cache */
> > -	for (index = 0; index < n; ++index, obj_table++)
> > -		cache_objs[index] = *obj_table;
> > +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >  
> >  	cache->len += n;
> >  
> > 
> 
> The commit title should be "mempool" instead of "mbuf".

I will fix it.

> Are you seeing some performance improvement by using rte_memcpy()?

Yes, in some cases. In the default case, the copy loop was replaced with
memcpy by the compiler itself (gcc 5.3). But when I tried the external
mempool manager patch, performance dropped by almost 800 Kpps. Debugging
further, it turns out that an unrelated change in the external mempool
manager was knocking out the memcpy. An explicit rte_memcpy brought back
500 Kpps. The remaining 300 Kpps drop is still unknown (in my test setup,
packets are in the local cache, so it must be something to do with a
__mempool_put_bulk text alignment change or similar).

Has anyone else observed a performance drop with the external pool manager?

Jerin

> 
> Regards
> Olivier
  
Hunt, David May 27, 2016, 10:24 a.m. UTC | #3
On 5/24/2016 4:17 PM, Jerin Jacob wrote:
> On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
>
>> Are you seeing some performance improvement by using rte_memcpy()?
> Yes, in some cases. In the default case, the copy loop was replaced with
> memcpy by the compiler itself (gcc 5.3). But when I tried the external
> mempool manager patch, performance dropped by almost 800 Kpps. Debugging
> further, it turns out that an unrelated change in the external mempool
> manager was knocking out the memcpy. An explicit rte_memcpy brought back
> 500 Kpps. The remaining 300 Kpps drop is still unknown (in my test setup,
> packets are in the local cache, so it must be something to do with a
> __mempool_put_bulk text alignment change or similar).
>
> Has anyone else observed a performance drop with the external pool manager?
>
> Jerin

Jerin,
     I'm seeing a 300 Kpps drop in throughput when I apply this on top of
the external mempool manager patch. If you're seeing an increase when you
apply this patch first, then a drop when applying the mempool manager,
the two patches must be conflicting in some way. We probably need to
investigate further.
Regards,
Dave.
  
Jerin Jacob May 27, 2016, 11:42 a.m. UTC | #4
On Fri, May 27, 2016 at 11:24:57AM +0100, Hunt, David wrote:
> 
> 
> On 5/24/2016 4:17 PM, Jerin Jacob wrote:
> > On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
> > 
> > > Are you seeing some performance improvement by using rte_memcpy()?
> > Yes, in some cases. In the default case, the copy loop was replaced with
> > memcpy by the compiler itself (gcc 5.3). But when I tried the external
> > mempool manager patch, performance dropped by almost 800 Kpps. Debugging
> > further, it turns out that an unrelated change in the external mempool
> > manager was knocking out the memcpy. An explicit rte_memcpy brought back
> > 500 Kpps. The remaining 300 Kpps drop is still unknown (in my test setup,
> > packets are in the local cache, so it must be something to do with a
> > __mempool_put_bulk text alignment change or similar).
> > 
> > Has anyone else observed a performance drop with the external pool manager?
> > 
> > Jerin
> 
> Jerin,
>     I'm seeing a 300 Kpps drop in throughput when I apply this on top of
> the external mempool manager patch. If you're seeing an increase when you
> apply this patch first, then a drop when applying the mempool manager,
> the two patches must be conflicting in some way. We probably need to
> investigate further.

In general, my concern is that this patch will most probably also get
dropped on the floor due to conflicts between different architectures,
and some architectures/platforms will need to maintain it out of tree.

Unlike other projects, DPDK modules are hand-optimized; because of that,
some changes depend on register allocation, compiler version, text
alignment, etc.

IMHO, we should have a means to abstract these _logical_ changes under
conditional compilation flags, so that any arch/platform can choose
whatever suits that arch/platform best.

We may NOT need frequent patches to select a specific configuration, but
logical patches under compilation flags could be accepted, and each
arch/platform could choose its specific configuration when we make the
final release candidate for the release.

Any thoughts?

Jerin
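
(Editorial sketch, not part of Jerin's mail: the flag name
RTE_MEMPOOL_USE_RTE_MEMCPY below is hypothetical and does not exist in
DPDK; it only illustrates what such a per-platform compile-time switch
could look like.)

#include <rte_memcpy.h>

/* RTE_MEMPOOL_USE_RTE_MEMCPY is a hypothetical config flag, used here
 * only for illustration; it is not an existing DPDK option. */
static inline void
mempool_cache_copy(void **cache_objs, void * const *obj_table, unsigned int n)
{
#ifdef RTE_MEMPOOL_USE_RTE_MEMCPY
	/* platforms that opt in use the arch-optimized bulk copy */
	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
#else
	/* default: plain loop, which the compiler may or may not
	 * turn into a memcpy on its own */
	unsigned int i;

	for (i = 0; i < n; i++)
		cache_objs[i] = obj_table[i];
#endif
}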
  
Hunt, David May 27, 2016, 1:45 p.m. UTC | #5
On 5/27/2016 11:24 AM, Hunt, David wrote:
>
>
> On 5/24/2016 4:17 PM, Jerin Jacob wrote:
>> On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
>>
>>> Are you seeing some performance improvement by using rte_memcpy()?
>> Yes, in some cases. In the default case, the copy loop was replaced with
>> memcpy by the compiler itself (gcc 5.3). But when I tried the external
>> mempool manager patch, performance dropped by almost 800 Kpps. Debugging
>> further, it turns out that an unrelated change in the external mempool
>> manager was knocking out the memcpy. An explicit rte_memcpy brought back
>> 500 Kpps. The remaining 300 Kpps drop is still unknown (in my test setup,
>> packets are in the local cache, so it must be something to do with a
>> __mempool_put_bulk text alignment change or similar).
>>
>> Has anyone else observed a performance drop with the external pool manager?
>>
>> Jerin
>
> Jerin,
>     I'm seeing a 300 Kpps drop in throughput when I apply this on top of
> the external mempool manager patch. If you're seeing an increase when you
> apply this patch first, then a drop when applying the mempool manager,
> the two patches must be conflicting in some way. We probably need to
> investigate further.
> Regards,
> Dave.
>

On further investigation, I now have a setup with no performance
degradation. My previous tests were accessing the NICs on a different
NUMA node. Once I started testpmd with the correct coremask, the
difference between pre- and post-rte_memcpy patch is negligible
(maybe a 0.1% drop).

Regards,
Dave.
  
Thomas Monjalon May 27, 2016, 3:05 p.m. UTC | #6
2016-05-27 17:12, Jerin Jacob:
> IMHO, we should have a means to abstract these _logical_ changes under
> conditional compilation flags, so that any arch/platform can choose
> whatever suits that arch/platform best.
> 
> We may NOT need frequent patches to select a specific configuration, but
> logical patches under compilation flags could be accepted, and each
> arch/platform could choose its specific configuration when we make the
> final release candidate for the release.
> 
> Any thoughts?

Yes, having some #ifdefs for arch configuration may be reasonable.
But other methods must be preferred first:
1/ try implementing the function in arch-specific files
2/ and check at runtime if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_X))
3/ or check #ifdef RTE_MACHINE_CPUFLAG_X
4/ or check #ifdef RTE_ARCH_Y
5/ or check a specific #ifdef RTE_FEATURE_NAME to choose in the config files

Option 2 is a nice-to-have which implies the other options.

Maybe that doc/guides/contributing/design.rst needs to be updated.
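
(Editorial sketch of how options 2-4 above combine in practice; this is
not code from this thread, and the AVX2 branch merely stands in for an
arch-tuned path.)

#include <string.h>
#include <rte_cpuflags.h>
#include <rte_memcpy.h>

static inline void
copy_obj_ptrs(void **dst, void * const *src, unsigned int n)
{
#if defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_AVX2)
	/* options 3/4: compile-time selection of the arch-specific path */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
		/* option 2: runtime confirmation; rte_memcpy() is already
		 * arch-optimized on this target */
		rte_memcpy(dst, src, sizeof(void *) * n);
		return;
	}
#endif
	/* portable fallback */
	memcpy(dst, src, sizeof(void *) * n);
}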
  
Olivier Matz May 30, 2016, 8:44 a.m. UTC | #7
On 05/27/2016 05:05 PM, Thomas Monjalon wrote:
> 2016-05-27 17:12, Jerin Jacob:
>> IMHO, we should have a means to abstract these _logical_ changes under
>> conditional compilation flags, so that any arch/platform can choose
>> whatever suits that arch/platform best.
>>
>> We may NOT need frequent patches to select a specific configuration, but
>> logical patches under compilation flags could be accepted, and each
>> arch/platform could choose its specific configuration when we make the
>> final release candidate for the release.
>>
>> Any thoughts?
> 
> Yes, having some #ifdefs for arch configuration may be reasonable.
> But other methods must be preferred first:
> 1/ try implementing the function in arch-specific files

I agree with Thomas. This option should be preferred, and I think we
should avoid, as much as possible, having:

#if ARCH1
  do stuff optimized for arch1
#elif ARCH2
  do the same stuff optimized for arch2
#else
  ...
#endif


In this particular case, rte_memcpy() seems to be the appropriate
function, because it should already be arch-optimized.


> 2/ and check at runtime if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_X))
> 3/ or check #ifdef RTE_MACHINE_CPUFLAG_X
> 4/ or check #ifdef RTE_ARCH_Y
> 5/ or check a specific #ifdef RTE_FEATURE_NAME to choose in the config files
> 
> Option 2 is a nice-to-have which implies the other options.
> 
> Maybe that doc/guides/contributing/design.rst needs to be updated.


Regards,
Olivier
  
Hunt, David June 24, 2016, 3:56 p.m. UTC | #8
Hi Jerin,

I just ran a couple of tests on this patch on the latest master head on
a couple of machines: an older quad-socket E5-4650 and a quad-socket
E5-2699 v3.

E5-4650:
I'm seeing a gain of 2% for the un-cached tests and a gain of 9% on the
cached tests.

E5-2699 v3:
I'm seeing a loss of 0.1% for the un-cached tests and a gain of 11% on
the cached tests.

This is purely the autotest comparison; I don't have traffic generator
results. But based on the above, I don't think there are any performance
issues with the patch.

Regards,
Dave.




On 24/5/2016 4:17 PM, Jerin Jacob wrote:
> On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
>> Hi Jerin,
>>
>>
>> On 05/24/2016 04:50 PM, Jerin Jacob wrote:
>>> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>> ---
>>>   lib/librte_mempool/rte_mempool.h | 5 ++---
>>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
>>> index ed2c110..ebe399a 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -74,6 +74,7 @@
>>>   #include <rte_memory.h>
>>>   #include <rte_branch_prediction.h>
>>>   #include <rte_ring.h>
>>> +#include <rte_memcpy.h>
>>>   
>>>   #ifdef __cplusplus
>>>   extern "C" {
>>> @@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>>>   		    unsigned n, __rte_unused int is_mp)
>>>   {
>>>   	struct rte_mempool_cache *cache;
>>> -	uint32_t index;
>>>   	void **cache_objs;
>>>   	unsigned lcore_id = rte_lcore_id();
>>>   	uint32_t cache_size = mp->cache_size;
>>> @@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>>>   	 */
>>>   
>>>   	/* Add elements back into the cache */
>>> -	for (index = 0; index < n; ++index, obj_table++)
>>> -		cache_objs[index] = *obj_table;
>>> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>>>   
>>>   	cache->len += n;
>>>   
>>>
>> The commit title should be "mempool" instead of "mbuf".
> I will fix it.
>
>> Are you seeing some performance improvement by using rte_memcpy()?
> Yes, in some cases. In the default case, the copy loop was replaced with
> memcpy by the compiler itself (gcc 5.3). But when I tried the external
> mempool manager patch, performance dropped by almost 800 Kpps. Debugging
> further, it turns out that an unrelated change in the external mempool
> manager was knocking out the memcpy. An explicit rte_memcpy brought back
> 500 Kpps. The remaining 300 Kpps drop is still unknown (in my test setup,
> packets are in the local cache, so it must be something to do with a
> __mempool_put_bulk text alignment change or similar).
>
> Has anyone else observed a performance drop with the external pool manager?
>
> Jerin
>
>> Regards
>> Olivier
  
Olivier Matz June 24, 2016, 4:02 p.m. UTC | #9
Hi Dave,

On 06/24/2016 05:56 PM, Hunt, David wrote:
> Hi Jerin,
> 
> I just ran a couple of tests on this patch on the latest master head on
> a couple of machines: an older quad-socket E5-4650 and a quad-socket
> E5-2699 v3.
> 
> E5-4650:
> I'm seeing a gain of 2% for the un-cached tests and a gain of 9% on the
> cached tests.
> 
> E5-2699 v3:
> I'm seeing a loss of 0.1% for the un-cached tests and a gain of 11% on
> the cached tests.
> 
> This is purely the autotest comparison; I don't have traffic generator
> results. But based on the above, I don't think there are any performance
> issues with the patch.
> 

Thanks for doing the test on your side. I think it's probably enough
to integrate Jerin's patch.

About using rte_memcpy() in mempool_get(), I don't think I'll have
the time to do a more exhaustive test before 16.07, so I'll come
back with it later.

I'm sending an ack on the v2 thread.
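
(Editorial sketch, not Olivier's planned change: what a similar bulk copy
could look like on the get side. The existing get path fills obj_table
from the tail of the cache with a reverse loop, so a forward rte_memcpy()
hands the objects back in a different order and would need its own
validation, which is presumably why a more exhaustive test is wanted.)

#include <rte_mempool.h>
#include <rte_memcpy.h>

/* Sketch only: assumes n <= cache->len, as the real fast path checks
 * before copying from the cache. */
static inline void
cache_get_bulk_sketch(struct rte_mempool_cache *cache, void **obj_table,
		      unsigned int n)
{
	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) * n);
	cache->len -= n;
}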
  

Patch

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index ed2c110..ebe399a 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -74,6 +74,7 @@ 
 #include <rte_memory.h>
 #include <rte_branch_prediction.h>
 #include <rte_ring.h>
+#include <rte_memcpy.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -917,7 +918,6 @@  __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 		    unsigned n, __rte_unused int is_mp)
 {
 	struct rte_mempool_cache *cache;
-	uint32_t index;
 	void **cache_objs;
 	unsigned lcore_id = rte_lcore_id();
 	uint32_t cache_size = mp->cache_size;
@@ -946,8 +946,7 @@  __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 	 */
 
 	/* Add elements back into the cache */
-	for (index = 0; index < n; ++index, obj_table++)
-		cache_objs[index] = *obj_table;
+	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
 
 	cache->len += n;