[dpdk-dev,v2] mempool: replace c memcpy code semantics with optimized rte_memcpy

Message ID 1464250025-9191-1-git-send-email-jerin.jacob@caviumnetworks.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon

Commit Message

Jerin Jacob May 26, 2016, 8:07 a.m. UTC
  Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
v1..v2
Corrected the git commit message (s/mbuf/mempool/g)
---
 lib/librte_mempool/rte_mempool.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
  

Comments

Olivier Matz May 30, 2016, 8:45 a.m. UTC | #1
Hi Jerin,

On 05/26/2016 10:07 AM, Jerin Jacob wrote:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
> v1..v2
> Corrected the git commit message (s/mbuf/mempool/g)
> ---
>  lib/librte_mempool/rte_mempool.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> index 60339bd..24876a2 100644
> --- a/lib/librte_mempool/rte_mempool.h
> +++ b/lib/librte_mempool/rte_mempool.h
> @@ -73,6 +73,7 @@
>  #include <rte_memory.h>
>  #include <rte_branch_prediction.h>
>  #include <rte_ring.h>
> +#include <rte_memcpy.h>
>  
>  #ifdef __cplusplus
>  extern "C" {
> @@ -739,7 +740,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  		    unsigned n, int is_mp)
>  {
>  	struct rte_mempool_cache *cache;
> -	uint32_t index;
>  	void **cache_objs;
>  	unsigned lcore_id = rte_lcore_id();
>  	uint32_t cache_size = mp->cache_size;
> @@ -768,8 +768,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  	 */
>  
>  	/* Add elements back into the cache */
> -	for (index = 0; index < n; ++index, obj_table++)
> -		cache_objs[index] = *obj_table;
> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>  
>  	cache->len += n;
>  
> 

I also checked in the get_bulk() function, which looks like that:

	/* Now fill in the response ... */
	for (index = 0, len = cache->len - 1;
			index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];

I think we could replace it by something like:

	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);

The only difference is that it won't reverse the pointers in the
table, but I don't see any problem with that.
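
As a minimal sketch only (assuming the rest of __mempool_get_bulk() stays as it is
today, and using cache->len - n as the source index since len in the existing loop
starts at cache->len - 1), the fill would become:

	/* Now fill in the response, without reversing the object order */
	rte_memcpy(obj_table, &cache_objs[cache->len - n], sizeof(void *) * n);

	cache->len -= n;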

What do you think?


Regards,
Olivier
  
Jerin Jacob May 31, 2016, 12:58 p.m. UTC | #2
On Mon, May 30, 2016 at 10:45:11AM +0200, Olivier Matz wrote:
> Hi Jerin,
> 
> On 05/26/2016 10:07 AM, Jerin Jacob wrote:
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> > v1..v2
> > Corrected the git commit message (s/mbuf/mempool/g)
> > ---
> >  lib/librte_mempool/rte_mempool.h | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> > index 60339bd..24876a2 100644
> > --- a/lib/librte_mempool/rte_mempool.h
> > +++ b/lib/librte_mempool/rte_mempool.h
> > @@ -73,6 +73,7 @@
> >  #include <rte_memory.h>
> >  #include <rte_branch_prediction.h>
> >  #include <rte_ring.h>
> > +#include <rte_memcpy.h>
> >  
> >  #ifdef __cplusplus
> >  extern "C" {
> > @@ -739,7 +740,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  		    unsigned n, int is_mp)
> >  {
> >  	struct rte_mempool_cache *cache;
> > -	uint32_t index;
> >  	void **cache_objs;
> >  	unsigned lcore_id = rte_lcore_id();
> >  	uint32_t cache_size = mp->cache_size;
> > @@ -768,8 +768,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  	 */
> >  
> >  	/* Add elements back into the cache */
> > -	for (index = 0; index < n; ++index, obj_table++)
> > -		cache_objs[index] = *obj_table;
> > +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >  
> >  	cache->len += n;
> >  
> > 
> 
> I also checked in the get_bulk() function, which looks like that:
> 
> 	/* Now fill in the response ... */
> 	for (index = 0, len = cache->len - 1;
> 			index < n;
> 			++index, len--, obj_table++)
> 		*obj_table = cache_objs[len];
> 
> I think we could replace it by something like:
> 
> 	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);
> 
> The only difference is that it won't reverse the pointers in the
> table, but I don't see any problem with that.
> 
> What do you think?

Strictly speaking, it will _not_ be LIFO anymore. I'm not sure about the cache usage
implications for the specific use cases.

Jerin

> 
> 
> Regards,
> Olivier
>
  
Olivier Matz May 31, 2016, 9:05 p.m. UTC | #3
Hi Jerin,

>>>  	/* Add elements back into the cache */
>>> -	for (index = 0; index < n; ++index, obj_table++)
>>> -		cache_objs[index] = *obj_table;
>>> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>>>  
>>>  	cache->len += n;
>>>  
>>>
>>
>> I also checked in the get_bulk() function, which looks like that:
>>
>> 	/* Now fill in the response ... */
>> 	for (index = 0, len = cache->len - 1;
>> 			index < n;
>> 			++index, len--, obj_table++)
>> 		*obj_table = cache_objs[len];
>>
>> I think we could replace it by something like:
>>
>> 	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);
>>
>> The only difference is that it won't reverse the pointers in the
>> table, but I don't see any problem with that.
>>
>> What do you think?
> 
> Strictly speaking, it will _not_ be LIFO anymore. I'm not sure about the cache usage
> implications for the specific use cases.

Today, the object pointers are reversed only in the get(). It means
that this code:

	rte_mempool_get_bulk(mp, table, 4);
	for (i = 0; i < 4; i++)
		printf("obj = %p\n", t[i]);
	rte_mempool_put_bulk(mp, table, 4);


	printf("-----\n");
	rte_mempool_get_bulk(mp, table, 4);
	for (i = 0; i < 4; i++)
		printf("obj = %p\n", t[i]);
	rte_mempool_put_bulk(mp, table, 4);

prints:

	addr1
	addr2
	addr3
	addr4
	-----
	addr4
	addr3
	addr2
	addr1

Which is quite strange.

I don't think it would be an issue to replace the loop by a
rte_memcpy(), it may increase the copy speed and it will be
more coherent with the put().


Olivier
  
Jerin Jacob June 1, 2016, 7 a.m. UTC | #4
On Tue, May 31, 2016 at 11:05:30PM +0200, Olivier MATZ wrote:
> Hi Jerin,

Hi Olivier,

> 
> >>>  	/* Add elements back into the cache */
> >>> -	for (index = 0; index < n; ++index, obj_table++)
> >>> -		cache_objs[index] = *obj_table;
> >>> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >>>  
> >>>  	cache->len += n;
> >>>  
> >>>
> >>
> >> I also checked in the get_bulk() function, which looks like that:
> >>
> >> 	/* Now fill in the response ... */
> >> 	for (index = 0, len = cache->len - 1;
> >> 			index < n;
> >> 			++index, len--, obj_table++)
> >> 		*obj_table = cache_objs[len];
> >>
> >> I think we could replace it by something like:
> >>
> >> 	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);
> >>
> >> The only difference is that it won't reverse the pointers in the
> >> table, but I don't see any problem with that.
> >>
> >> What do you think?
> > 
> > Strictly speaking, it will _not_ be LIFO anymore. I'm not sure about the cache usage
> > implications for the specific use cases.
> 
> Today, the object pointers are reversed only in the get(). It means
> that this code:
> 
> 	rte_mempool_get_bulk(mp, table, 4);
> 	for (i = 0; i < 4; i++)
> 		printf("obj = %p\n", table[i]);
> 	rte_mempool_put_bulk(mp, table, 4);
> 
> 
> 	printf("-----\n");
> 	rte_mempool_get_bulk(mp, table, 4);
> 	for (i = 0; i < 4; i++)
> 		printf("obj = %p\n", table[i]);
> 	rte_mempool_put_bulk(mp, table, 4);
> 
> prints:
> 
> 	addr1
> 	addr2
> 	addr3
> 	addr4
> 	-----
> 	addr4
> 	addr3
> 	addr2
> 	addr1
> 
> Which is quite strange.

IMO, it is the expected LIFO behavior, right?

What is not expected is the following, which would be the case after the change. Or am I
missing something here?

addr1
addr2
addr3
addr4
-----
addr1
addr2
addr3
addr4

> 
> I don't think it would be an issue to replace the loop by a
> rte_memcpy(), it may increase the copy speed and it will be
> more coherent with the put().
> 
> 
> Olivier
  
Olivier Matz June 2, 2016, 7:36 a.m. UTC | #5
Hi Jerin,

On 06/01/2016 09:00 AM, Jerin Jacob wrote:
> On Tue, May 31, 2016 at 11:05:30PM +0200, Olivier MATZ wrote:
>> Today, the object pointers are reversed only in the get(). It means
>> that this code:
>>
>> 	rte_mempool_get_bulk(mp, table, 4);
>> 	for (i = 0; i < 4; i++)
>> 		printf("obj = %p\n", table[i]);
>> 	rte_mempool_put_bulk(mp, table, 4);
>>
>>
>> 	printf("-----\n");
>> 	rte_mempool_get_bulk(mp, table, 4);
>> 	for (i = 0; i < 4; i++)
>> 		printf("obj = %p\n", table[i]);
>> 	rte_mempool_put_bulk(mp, table, 4);
>>
>> prints:
>>
>> 	addr1
>> 	addr2
>> 	addr3
>> 	addr4
>> 	-----
>> 	addr4
>> 	addr3
>> 	addr2
>> 	addr1
>>
>> Which is quite strange.
> 
> IMO, it is the expected LIFO behavior, right?
> 
> What is not expected is the following, which would be the case after the change. Or am I
> missing something here?
> 
> addr1
> addr2
> addr3
> addr4
> -----
> addr1
> addr2
> addr3
> addr4
> 
>>
>> I don't think it would be an issue to replace the loop by a
>> rte_memcpy(), it may increase the copy speed and it will be
>> more coherent with the put().
>>

I think the LIFO behavior should occur on a per-bulk basis. I mean,
it should behave like in the examples below:

  // pool cache is in state X
  obj1 = mempool_get(mp)
  obj2 = mempool_get(mp)
  mempool_put(mp, obj2)
  mempool_put(mp, obj1)
  // pool cache is back in state X

  // pool cache is in state X
  bulk1 = mempool_get_bulk(mp, 16)
  bulk2 = mempool_get_bulk(mp, 16)
  mempool_put_bulk(mp, bulk2, 16)
  mempool_put_bulk(mp, bulk1, 16)
  // pool cache is back in state X
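
As a rough illustration only (assuming a pool mp created elsewhere and ignoring
return codes), the bulk round-trip above could be written as:

  void *bulk1[16], *bulk2[16];

  rte_mempool_get_bulk(mp, bulk1, 16);
  rte_mempool_get_bulk(mp, bulk2, 16);
  rte_mempool_put_bulk(mp, bulk2, 16);
  rte_mempool_put_bulk(mp, bulk1, 16);
  /* with a non-reversing get(), the cache now holds the same pointers,
   * in the same positions, as before the first get_bulk() */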

Note that today it's not the case for bulks, since object addresses
are reversed only in get(), we are not back in the original state.
I don't really see the advantage of this.

Removing the reversing may accelerate the cache in case of bulk get,
I think.

Regards,
Olivier
  
Jerin Jacob June 2, 2016, 9:39 a.m. UTC | #6
On Thu, Jun 02, 2016 at 09:36:34AM +0200, Olivier MATZ wrote:
> Hi Jerin,
> 
> On 06/01/2016 09:00 AM, Jerin Jacob wrote:
> > On Tue, May 31, 2016 at 11:05:30PM +0200, Olivier MATZ wrote:
> >> Today, the object pointers are reversed only in the get(). It means
> >> that this code:
> >>
> >> 	rte_mempool_get_bulk(mp, table, 4);
> >> 	for (i = 0; i < 4; i++)
> >> 		printf("obj = %p\n", table[i]);
> >> 	rte_mempool_put_bulk(mp, table, 4);
> >>
> >>
> >> 	printf("-----\n");
> >> 	rte_mempool_get_bulk(mp, table, 4);
> >> 	for (i = 0; i < 4; i++)
> >> 		printf("obj = %p\n", table[i]);
> >> 	rte_mempool_put_bulk(mp, table, 4);
> >>
> >> prints:
> >>
> >> 	addr1
> >> 	addr2
> >> 	addr3
> >> 	addr4
> >> 	-----
> >> 	addr4
> >> 	addr3
> >> 	addr2
> >> 	addr1
> >>
> >> Which is quite strange.
> > 
> > IMO, it is the expected LIFO behavior, right?
> > 
> > What is not expected is the following, which would be the case after the change. Or am I
> > missing something here?
> > 
> > addr1
> > addr2
> > addr3
> > addr4
> > -----
> > addr1
> > addr2
> > addr3
> > addr4
> > 
> >>
> >> I don't think it would be an issue to replace the loop by a
> >> rte_memcpy(), it may increase the copy speed and it will be
> >> more coherent with the put().
> >>
> 
> I think the LIFO behavior should occur on a per-bulk basis. I mean,
> it should behave like in the examples below:
> 
>   // pool cache is in state X
>   obj1 = mempool_get(mp)
>   obj2 = mempool_get(mp)
>   mempool_put(mp, obj2)
>   mempool_put(mp, obj1)
>   // pool cache is back in state X
> 
>   // pool cache is in state X
>   bulk1 = mempool_get_bulk(mp, 16)
>   bulk2 = mempool_get_bulk(mp, 16)
>   mempool_put_bulk(mp, bulk2, 16)
>   mempool_put_bulk(mp, bulk1, 16)
>   // pool cache is back in state X
> 

Per-entry LIFO behavior makes more sense in the _bulk_ case, as the most recently
enqueued buffer comes out first on the next "get", which makes it more likely that the buffer is still in the last-level cache.

> Note that today it's not the case for bulks, since object addresses
> are reversed only in get(), we are not back in the original state.
> I don't really see the advantage of this.
> 
> Removing the reversing may accelerate the cache in case of bulk get,
> I think.

I tried in my setup, it was dropping the performance. Have you got
improvement in any setup?

Jerin

> 
> Regards,
> Olivier
  
Olivier Matz June 2, 2016, 9:16 p.m. UTC | #7
Hi Jerin,

On 06/02/2016 11:39 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 09:36:34AM +0200, Olivier MATZ wrote:
>> I think the LIFO behavior should occur on a per-bulk basis. I mean,
>> it should behave like in the examples below:
>>
>>   // pool cache is in state X
>>   obj1 = mempool_get(mp)
>>   obj2 = mempool_get(mp)
>>   mempool_put(mp, obj2)
>>   mempool_put(mp, obj1)
>>   // pool cache is back in state X
>>
>>   // pool cache is in state X
>>   bulk1 = mempool_get_bulk(mp, 16)
>>   bulk2 = mempool_get_bulk(mp, 16)
>>   mempool_put_bulk(mp, bulk2, 16)
>>   mempool_put_bulk(mp, bulk1, 16)
>>   // pool cache is back in state X
>>
> 
> Per-entry LIFO behavior makes more sense in the _bulk_ case, as the most recently
> enqueued buffer comes out first on the next "get", which makes it more likely that the buffer is still in the last-level cache.

Yes, from a memory cache perspective, I think you are right.

In practice, I'm not sure it's so important because with many hw drivers,
a packet buffer returns to the pool only after a round of the tx ring.
So I'd say it won't make a big difference here.

>> Note that today it's not the case for bulks, since object addresses
>> are reversed only in get(), we are not back in the original state.
>> I don't really see the advantage of this.
>>
>> Removing the reversing may accelerate the cache in case of bulk get,
>> I think.
> 
> I tried in my setup, it was dropping the performance. Have you got
> improvement in any setup?

I know that the mempool_perf autotest is not representative
of a real use case, but it gives a trend. I did a quick test with
- the legacy code,
- the rte_memcpy in put()
- the rte_memcpy in both put() and get() (no reverse)
It seems that removing the reversing brings a ~50% enhancement
with bulks of 32 (on a Westmere):

legacy
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=839922483
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=849792204
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=1617022156
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1675087052
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=3202914713
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=3268725963

rte_memcpy in put() (your patch proposal)
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=762157465
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=744593817
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=1500276326
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1461347942
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=2974076107
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2928122264

rte_memcpy in put() and get()
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=974834892
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1129329459
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=2147798220
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2232457625
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32  rate_persec=4510816664
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=4582421298

This is probably more a measure of the pure CPU cost of the mempool
function, without considering the memory cache aspect. So, of course,
a real use-case test should be done to confirm or not that it increases
the performance. I'll manage to do a test and let you know the result.

By the way, not all drivers are allocating or freeing the mbufs by
bulk, so this modification would only affect these ones. What driver
are you using for your test?


Regards,
Olivier
  
Jerin Jacob June 3, 2016, 7:02 a.m. UTC | #8
On Thu, Jun 02, 2016 at 11:16:16PM +0200, Olivier MATZ wrote:
Hi Olivier,

> This is probably more a measure of the pure CPU cost of the mempool
> function, without considering the memory cache aspect. So, of course,
> a real use-case test should be done to confirm or not that it increases
> the performance. I'll manage to do a test and let you know the result.

OK

IMO, rte_memcpy in put() makes sense (this patch), as there is no behavior change.
However, if rte_memcpy in get(), with its behavioral change, makes sense on some platforms,
then we can enable it on a conditional basis (I am OK with that).
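
Just to sketch the idea (the build flag name below is hypothetical, not an existing
DPDK option), the fill in get() could look something like:

#ifdef RTE_MEMPOOL_GET_USE_MEMCPY	/* hypothetical flag, for illustration only */
	/* flat copy: potentially faster, but the object order is not
	 * reversed, so per-entry LIFO reuse is lost */
	rte_memcpy(obj_table, &cache_objs[cache->len - n], sizeof(void *) * n);
#else
	/* current behavior: hand out the most recently freed objects first */
	for (index = 0, len = cache->len - 1; index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];
#endif
	cache->len -= n;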

> 
> By the way, not all drivers are allocating or freeing the mbufs by
> bulk, so this modification would only affect these ones. What driver
> are you using for your test?

I have tested with the ThunderX nicvf PMD (which uses the bulk mode).
I recently sent out the driver on the ML for review.

Jerin

> 
> 
> Regards,
> Olivier
> 
>
  
Olivier Matz June 17, 2016, 10:40 a.m. UTC | #9
Hi Jerin,

On 06/03/2016 09:02 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 11:16:16PM +0200, Olivier MATZ wrote:
> Hi Olivier,
> 
>> This is probably more a measure of the pure CPU cost of the mempool
>> function, without considering the memory cache aspect. So, of course,
>> a real use-case test should be done to confirm or not that it increases
>> the performance. I'll manage to do a test and let you know the result.
> 
> OK
> 
> IMO, rte_memcpy in put() makes sense (this patch), as there is no behavior change.
> However, if rte_memcpy in get(), with its behavioral change, makes sense on some platforms,
> then we can enable it on a conditional basis (I am OK with that).
> 
>>
>> By the way, not all drivers are allocating or freeing the mbufs by
>> bulk, so this modification would only affect these ones. What driver
>> are you using for your test?
> 
> I have tested with the ThunderX nicvf PMD (which uses the bulk mode).
> I recently sent out the driver on the ML for review.

Just to let you know, I have not forgotten this. I still need to
find some time to do a performance test.

Regards,
Olivier
  
Olivier Matz June 24, 2016, 4:04 p.m. UTC | #10
On 06/17/2016 12:40 PM, Olivier Matz wrote:
> Hi Jerin,
> 
> On 06/03/2016 09:02 AM, Jerin Jacob wrote:
>> On Thu, Jun 02, 2016 at 11:16:16PM +0200, Olivier MATZ wrote:
>> Hi Olivier,
>>
>>> This is probably more a measure of the pure CPU cost of the mempool
>>> function, without considering the memory cache aspect. So, of course,
>>> a real use-case test should be done to confirm or not that it increases
>>> the performance. I'll manage to do a test and let you know the result.
>>
>> OK
>>
>> IMO, rte_memcpy in put() makes sense (this patch), as there is no behavior change.
>> However, if rte_memcpy in get(), with its behavioral change, makes sense on some platforms,
>> then we can enable it on a conditional basis (I am OK with that).
>>
>>>
>>> By the way, not all drivers are allocating or freeing the mbufs by
>>> bulk, so this modification would only affect these ones. What driver
>>> are you using for your test?
>>
>> I have tested with the ThunderX nicvf PMD (which uses the bulk mode).
>> I recently sent out the driver on the ML for review.
> 
> Just to let you know, I have not forgotten this. I still need to
> find some time to do a performance test.


Quoting from the other thread [1] too to save this in patchwork:
[1] http://dpdk.org/ml/archives/dev/2016-June/042701.html


> On 06/24/2016 05:56 PM, Hunt, David wrote:
>> Hi Jerin,
>>
>> I just ran a couple of tests on this patch on the latest master head on
>> a couple of machines. An older quad socket E5-4650 and a quad socket
>> E5-2699 v3
>>
>> E5-4650:
>> I'm seeing a gain of 2% for un-cached tests and a gain of 9% on the
>> cached tests.
>>
>> E5-2699 v3:
>> I'm seeing a loss of 0.1% for un-cached tests and a gain of 11% on the
>> cached tests.
>>
>> This is purely the autotest comparison, I don't have traffic generator
>> results. But based on the above, I don't think there are any performance
>> issues with the patch.
>>
> 
> Thanks for doing the test on your side. I think it's probably enough
> to integrate Jerin's patch.
> 
> About using a rte_memcpy() in the mempool_get(), I don't think I'll have
> the time to do a more exhaustive test before the 16.07, so I'll come
> back with it later.
> 
> I'm sending an ack on the v2 thread.


Acked-by: Olivier Matz <olivier.matz@6wind.com>
  
Thomas Monjalon June 30, 2016, 9:41 a.m. UTC | #11
2016-05-26 13:37, Jerin Jacob:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

Please Jerin (or anyone else), could you rebase this patch?
Thanks
  
Jerin Jacob June 30, 2016, 11:38 a.m. UTC | #12
On Thu, Jun 30, 2016 at 11:41:59AM +0200, Thomas Monjalon wrote:
> 2016-05-26 13:37, Jerin Jacob:
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> 
> Please Jerin (or anyone else), could you rebase this patch?

OK. I will send the rebased version

> Thanks
  

Patch

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 60339bd..24876a2 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -73,6 +73,7 @@ 
 #include <rte_memory.h>
 #include <rte_branch_prediction.h>
 #include <rte_ring.h>
+#include <rte_memcpy.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -739,7 +740,6 @@  __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 		    unsigned n, int is_mp)
 {
 	struct rte_mempool_cache *cache;
-	uint32_t index;
 	void **cache_objs;
 	unsigned lcore_id = rte_lcore_id();
 	uint32_t cache_size = mp->cache_size;
@@ -768,8 +768,7 @@  __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 	 */
 
 	/* Add elements back into the cache */
-	for (index = 0; index < n; ++index, obj_table++)
-		cache_objs[index] = *obj_table;
+	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
 
 	cache->len += n;