net/af_packet: fix ignoring full ring on tx

Message ID 1629466761-127333-1-git-send-email-tudor.cornea@gmail.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series net/af_packet: fix ignoring full ring on tx |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/github-robot: build success github build: passed
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Tudor Cornea Aug. 20, 2021, 1:39 p.m. UTC
  The poll call can return POLLERR which is ignored, or it can return
POLLOUT, even if there are no free frames in the mmap-ed area.

We can account for both of these cases by re-checking if the next
frame is empty before writing into it.

We also now eliminate the timestamp status from the frame status.

Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 47 +++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)
  

Comments

Ferruh Yigit Sept. 1, 2021, 4:34 p.m. UTC | #1
On 8/20/2021 2:39 PM, Tudor Cornea wrote:
> The poll call can return POLLERR which is ignored, or it can return
> POLLOUT, even if there are no free frames in the mmap-ed area.
> 
> We can account for both of these cases by re-checking if the next
> frame is empty before writing into it.
> 
> We also now eliminate the timestamp status from the frame status.
> 

Hi Tudor,

Would you mind separate timestamp status fix to its own patch? I think better to
fix 'ignoring full Tx ring' first, to not make it dependent to timestamp patch.

> Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
> ---
>  drivers/net/af_packet/rte_eth_af_packet.c | 47 +++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
> index b73b211..3845df5 100644
> --- a/drivers/net/af_packet/rte_eth_af_packet.c
> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>  	return num_rx;
>  }
>  
> +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
> +{
> +	return *tp_status &
> +		~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);

I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the kernel
versions the bug exists, this flag is not set, but can you please confirm?

> +}
> +
>  /*
>   * Callback to handle sending packets through a real NIC.
>   */
> @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>  			}
>  		}
>  
> +		/*
> +		 * We must eliminate the timestamp status from the packet
> +		 * status. This should only matter if timestamping is enabled
> +		 * on the socket, but there is a BUG in the kernel which is
> +		 * fixed in newer releases.
> +
> +		 * For interfaces of type 'veth', the sent skb is forwarded
> +		 * to the peer and back into the network stack which timestamps
> +		 * it on the RX path if timestamping is enabled globally
> +		 * (which happens if any socket enables timestamping).
> +
> +		 * When the skb is destructed, tpacket_destruct_skb() is called
> +		 * and it calls __packet_set_timestamp() which doesn't check
> +		 * the flags on the socket and returns the timestamp if it is
> +		 * set in the skb (and for veth it is, as mentioned above).
> +		 */
> +

Can you give some more details on this bug, any link etc..

And does it only seen with veth, if so I wonder if we can ignore it, not sure
how common to use af_packet PMD over veth interface, do you have this usecase?

And if only specific kernel versions impacted from this bug, what do you think
about adding kernel version check to reduce the scope of the fix in af_packet PMD.

>  		/* point at the next incoming frame */
> -		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> -		    (poll(&pfd, 1, -1) < 0))
> +		if ((tx_ring_status_remove_ts(&ppd->tp_status)
> +			!= TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
> +			break;
> +
> +		/*
> +		 * Poll can return POLLERR if the interface is down or POLLOUT,
> +		 * even if there are no extra buffers available.
> +		 * This happens, because packet_poll() calls datagram_poll()
> +		 * which checks the space left in the socket buffer and in the
> +		 * case of packet_mmap the default socket buffer length
> +		 * doesn't match the requested size for the tx_ring so there
> +		 * is always space left in socket buffer, which doesn't seem
> +		 * to be correlated to the requested size for the tx_ring
> +		 * in packet_mmap.
> +		 */
> +		if (tx_ring_status_remove_ts(&ppd->tp_status)
> +			!= TP_STATUS_AVAILABLE)
>  			break;

In this case should we break or poll again?

If 'POLLERR' is received when interface is down, it makes sense to break. Do you
know if is there any other case that 'POLLERR' is returned?

And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received, can we
poll again to wait Tx ring available to send more packets?

>  
>  		/* copy the tx frame data */
> @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>  		rte_pktmbuf_free(mbuf);
>  	}
>  
> +	/*
> +	 * We might have to ignore a few more errnos here since the packets
> +	 * remain in the mmap-ed queue and will be sent later, presumably.
> +	 */
> +

Can you please describe above comment more? What errors ignored?
And won't all packets sent will below sendto() call?

>  	/* kick-off transmits */
>  	if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
>  			errno != ENOBUFS && errno != EAGAIN) {
>
  
Tudor Cornea Sept. 6, 2021, 10:23 a.m. UTC | #2
Hi Ferruh,

Would you mind separate timestamp status fix to its own patch? I think
> better to
> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
> patch.


Agreed. There are two issues solved by this patch. We will break it in two
different patches.

I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
> kernel
> versions the bug exists, this flag is not set, but can you please confirm?


And does it only seen with veth, if so I wonder if we can ignore it, not
> sure
> how common to use af_packet PMD over veth interface, do you have this
> usecase?


We've seen the timestamping issue only when running af_packet over
veth interfaces. We have a particular use-case internally, in which we need
to run inside a Kubernetes cluster.
We've found the following resources [1] , [2] related to this behavior in
the kernel.

We believe that issue #2 (the ring getting full), can theoretically occur
on any type of NIC.
We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3
interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the
interface on / off
ifconfig $iface_name down; sleep 10; ifconfig $iface_name up

We will attempt to give more context on the issue below, about what we
think happens:
- we have a 2048 queue shared between the kernel and the dpdk socket,
there's an index the queue in both the kernel and the dpdk driver
- the dpdk driver writes a packet or a burst, advances its idx and tells
the kernel to send the packets via a call to sendto() and the kernel sends
the packets and advances its idx
- once the interface is down the kernel can no longer send packets, but it
doesn't drop them, it just doesn't advance its idx
- for each packet there is header and in the header there is a status
integer which, among others, indicates the owner of the packet: the
userspace or the kernel - the userspace (dpdk driver) sets the status as
owned by the kernel when it adds another packet ; the kernel sets the
status back to owned by the userspace once it sends a packet
- the dpdk driver was ignoring this status bit and,  even after the queue
was full, it would continue to put packets in the queue - its idx would be
"after" that of the kernel
- once the interface is brought up, the kernel would send all the packets
in the queue (since they have the status of being owned by the kernel) on
the next call to sendto() and the idx would be back to where it was before
the interface was brought up (let's call it k1)
- the dpdk driver idx into the queue would point somewhere in the queue
(let's call it d1) and would continue to add packets at that point, but the
kernel wouldn't send any packet anymore since there is now a gap of packets
owned by the userspace between the kernel index (k1) and the dpdk driver
idx (d1)
- the dpdk idx would eventually reach k1 and packets would be transferred
at a normal rate until both the dpdk idx and the kernel idx would reach d1
again
- between d1 and k1 there are only packets with the status as owned by the
kernel - which where added by the dpdk driver while its index was between
d1 and k1 ; thus the kernel would burst all the packets till k1, while the
dpdk idx is at d1
- the cycle repeats

If a new traffic config comes (in our application) while this cycle is
happening, it could be that some of the packets of the old config are still
in queue (between d1 and k1) and will be bursted when the dpdk and kernel
idx reach d1 ; this would explain seeing packets from an old config, but
only in the first 2048 packets (which is the queue size)


[1] https://www.spinics.net/lists/kernel/msg3959391.html
[2] https://www.spinics.net/lists/netdev/msg739372.html

On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> On 8/20/2021 2:39 PM, Tudor Cornea wrote:
> > The poll call can return POLLERR which is ignored, or it can return
> > POLLOUT, even if there are no free frames in the mmap-ed area.
> >
> > We can account for both of these cases by re-checking if the next
> > frame is empty before writing into it.
> >
> > We also now eliminate the timestamp status from the frame status.
> >
>
> Hi Tudor,
>
> Would you mind separate timestamp status fix to its own patch? I think
> better to
> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
> patch.
>
> > Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
> > Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 47
> +++++++++++++++++++++++++++++--
> >  1 file changed, 45 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..3845df5 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> > @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >       return num_rx;
> >  }
> >
> > +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
> > +{
> > +     return *tp_status &
> > +             ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
>
> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
> kernel
> versions the bug exists, this flag is not set, but can you please confirm?
>
> > +}
> > +
> >  /*
> >   * Callback to handle sending packets through a real NIC.
> >   */
> > @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >                       }
> >               }
> >
> > +             /*
> > +              * We must eliminate the timestamp status from the packet
> > +              * status. This should only matter if timestamping is
> enabled
> > +              * on the socket, but there is a BUG in the kernel which is
> > +              * fixed in newer releases.
> > +
> > +              * For interfaces of type 'veth', the sent skb is forwarded
> > +              * to the peer and back into the network stack which
> timestamps
> > +              * it on the RX path if timestamping is enabled globally
> > +              * (which happens if any socket enables timestamping).
> > +
> > +              * When the skb is destructed, tpacket_destruct_skb() is
> called
> > +              * and it calls __packet_set_timestamp() which doesn't
> check
> > +              * the flags on the socket and returns the timestamp if it
> is
> > +              * set in the skb (and for veth it is, as mentioned above).
> > +              */
> > +
>
> Can you give some more details on this bug, any link etc..
>
> And does it only seen with veth, if so I wonder if we can ignore it, not
> sure
> how common to use af_packet PMD over veth interface, do you have this
> usecase?
>
> And if only specific kernel versions impacted from this bug, what do you
> think
> about adding kernel version check to reduce the scope of the fix in
> af_packet PMD.
>
> >               /* point at the next incoming frame */
> > -             if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > -                 (poll(&pfd, 1, -1) < 0))
> > +             if ((tx_ring_status_remove_ts(&ppd->tp_status)
> > +                     != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
> > +                     break;
> > +
> > +             /*
> > +              * Poll can return POLLERR if the interface is down or
> POLLOUT,
> > +              * even if there are no extra buffers available.
> > +              * This happens, because packet_poll() calls
> datagram_poll()
> > +              * which checks the space left in the socket buffer and in
> the
> > +              * case of packet_mmap the default socket buffer length
> > +              * doesn't match the requested size for the tx_ring so
> there
> > +              * is always space left in socket buffer, which doesn't
> seem
> > +              * to be correlated to the requested size for the tx_ring
> > +              * in packet_mmap.
> > +              */
> > +             if (tx_ring_status_remove_ts(&ppd->tp_status)
> > +                     != TP_STATUS_AVAILABLE)
> >                       break;
>
> In this case should we break or poll again?
>
> If 'POLLERR' is received when interface is down, it makes sense to break.
> Do you
> know if is there any other case that 'POLLERR' is returned?
>
> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received,
> can we
> poll again to wait Tx ring available to send more packets?
>
> >
> >               /* copy the tx frame data */
> > @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >               rte_pktmbuf_free(mbuf);
> >       }
> >
> > +     /*
> > +      * We might have to ignore a few more errnos here since the packets
> > +      * remain in the mmap-ed queue and will be sent later, presumably.
> > +      */
> > +
>
> Can you please describe above comment more? What errors ignored?
> And won't all packets sent will below sendto() call?
>
> >       /* kick-off transmits */
> >       if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
> >                       errno != ENOBUFS && errno != EAGAIN) {
> >
>
>
  
Ferruh Yigit Sept. 20, 2021, 5:11 p.m. UTC | #3
On 9/6/2021 11:23 AM, Tudor Cornea wrote:
> Hi Ferruh,
> 
> Would you mind separate timestamp status fix to its own patch? I think
>> better to
>> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
>> patch.
> 
> 
> Agreed. There are two issues solved by this patch. We will break it in two
> different patches.
> 
> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
>> kernel
>> versions the bug exists, this flag is not set, but can you please confirm?
> 
> 
> And does it only seen with veth, if so I wonder if we can ignore it, not
>> sure
>> how common to use af_packet PMD over veth interface, do you have this
>> usecase?
> 
> 
> We've seen the timestamping issue only when running af_packet over
> veth interfaces. We have a particular use-case internally, in which we need
> to run inside a Kubernetes cluster.
> We've found the following resources [1] , [2] related to this behavior in
> the kernel.
> 
> We believe that issue #2 (the ring getting full), can theoretically occur
> on any type of NIC.
> We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3
> interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the
> interface on / off
> ifconfig $iface_name down; sleep 10; ifconfig $iface_name up
> 
> We will attempt to give more context on the issue below, about what we
> think happens:
> - we have a 2048 queue shared between the kernel and the dpdk socket,
> there's an index the queue in both the kernel and the dpdk driver
> - the dpdk driver writes a packet or a burst, advances its idx and tells
> the kernel to send the packets via a call to sendto() and the kernel sends
> the packets and advances its idx
> - once the interface is down the kernel can no longer send packets, but it
> doesn't drop them, it just doesn't advance its idx
> - for each packet there is header and in the header there is a status
> integer which, among others, indicates the owner of the packet: the
> userspace or the kernel - the userspace (dpdk driver) sets the status as
> owned by the kernel when it adds another packet ; the kernel sets the
> status back to owned by the userspace once it sends a packet
> - the dpdk driver was ignoring this status bit and,  even after the queue
> was full, it would continue to put packets in the queue - its idx would be
> "after" that of the kernel
> - once the interface is brought up, the kernel would send all the packets
> in the queue (since they have the status of being owned by the kernel) on
> the next call to sendto() and the idx would be back to where it was before
> the interface was brought up (let's call it k1)
> - the dpdk driver idx into the queue would point somewhere in the queue
> (let's call it d1) and would continue to add packets at that point, but the
> kernel wouldn't send any packet anymore since there is now a gap of packets
> owned by the userspace between the kernel index (k1) and the dpdk driver
> idx (d1)
> - the dpdk idx would eventually reach k1 and packets would be transferred
> at a normal rate until both the dpdk idx and the kernel idx would reach d1
> again
> - between d1 and k1 there are only packets with the status as owned by the
> kernel - which where added by the dpdk driver while its index was between
> d1 and k1 ; thus the kernel would burst all the packets till k1, while the
> dpdk idx is at d1
> - the cycle repeats
> 
> If a new traffic config comes (in our application) while this cycle is
> happening, it could be that some of the packets of the old config are still
> in queue (between d1 and k1) and will be bursted when the dpdk and kernel
> idx reach d1 ; this would explain seeing packets from an old config, but
> only in the first 2048 packets (which is the queue size)
> 
> 

Hi Tudor,

If there is an usage on of veth, OK to fix the timestamps issue.

What you described above looks like a ring buffer with single producer and
single consumer, and producer overwrites the not consumed items.

I assume this happens because af_packet (consumer) can't send the packets
because of the timestamp defect. (Also producer (dpdk app) should have checks to
prevent overwrite, but that is a different issue.)

I will comment to the new versions of the patches.

Our of curiosity, are you using an modified af_packet implementation in kernel
for above described usage?

> [1] https://www.spinics.net/lists/kernel/msg3959391.html
> [2] https://www.spinics.net/lists/netdev/msg739372.html
> 
> On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> 
>> On 8/20/2021 2:39 PM, Tudor Cornea wrote:
>>> The poll call can return POLLERR which is ignored, or it can return
>>> POLLOUT, even if there are no free frames in the mmap-ed area.
>>>
>>> We can account for both of these cases by re-checking if the next
>>> frame is empty before writing into it.
>>>
>>> We also now eliminate the timestamp status from the frame status.
>>>
>>
>> Hi Tudor,
>>
>> Would you mind separate timestamp status fix to its own patch? I think
>> better to
>> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
>> patch.
>>
>>> Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
>>> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
>>> ---
>>>  drivers/net/af_packet/rte_eth_af_packet.c | 47
>> +++++++++++++++++++++++++++++--
>>>  1 file changed, 45 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index b73b211..3845df5 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>       return num_rx;
>>>  }
>>>
>>> +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
>>> +{
>>> +     return *tp_status &
>>> +             ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
>>
>> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
>> kernel
>> versions the bug exists, this flag is not set, but can you please confirm?
>>
>>> +}
>>> +
>>>  /*
>>>   * Callback to handle sending packets through a real NIC.
>>>   */
>>> @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>                       }
>>>               }
>>>
>>> +             /*
>>> +              * We must eliminate the timestamp status from the packet
>>> +              * status. This should only matter if timestamping is
>> enabled
>>> +              * on the socket, but there is a BUG in the kernel which is
>>> +              * fixed in newer releases.
>>> +
>>> +              * For interfaces of type 'veth', the sent skb is forwarded
>>> +              * to the peer and back into the network stack which
>> timestamps
>>> +              * it on the RX path if timestamping is enabled globally
>>> +              * (which happens if any socket enables timestamping).
>>> +
>>> +              * When the skb is destructed, tpacket_destruct_skb() is
>> called
>>> +              * and it calls __packet_set_timestamp() which doesn't
>> check
>>> +              * the flags on the socket and returns the timestamp if it
>> is
>>> +              * set in the skb (and for veth it is, as mentioned above).
>>> +              */
>>> +
>>
>> Can you give some more details on this bug, any link etc..
>>
>> And does it only seen with veth, if so I wonder if we can ignore it, not
>> sure
>> how common to use af_packet PMD over veth interface, do you have this
>> usecase?
>>
>> And if only specific kernel versions impacted from this bug, what do you
>> think
>> about adding kernel version check to reduce the scope of the fix in
>> af_packet PMD.
>>
>>>               /* point at the next incoming frame */
>>> -             if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
>>> -                 (poll(&pfd, 1, -1) < 0))
>>> +             if ((tx_ring_status_remove_ts(&ppd->tp_status)
>>> +                     != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
>>> +                     break;
>>> +
>>> +             /*
>>> +              * Poll can return POLLERR if the interface is down or
>> POLLOUT,
>>> +              * even if there are no extra buffers available.
>>> +              * This happens, because packet_poll() calls
>> datagram_poll()
>>> +              * which checks the space left in the socket buffer and in
>> the
>>> +              * case of packet_mmap the default socket buffer length
>>> +              * doesn't match the requested size for the tx_ring so
>> there
>>> +              * is always space left in socket buffer, which doesn't
>> seem
>>> +              * to be correlated to the requested size for the tx_ring
>>> +              * in packet_mmap.
>>> +              */
>>> +             if (tx_ring_status_remove_ts(&ppd->tp_status)
>>> +                     != TP_STATUS_AVAILABLE)
>>>                       break;
>>
>> In this case should we break or poll again?
>>
>> If 'POLLERR' is received when interface is down, it makes sense to break.
>> Do you
>> know if is there any other case that 'POLLERR' is returned?
>>
>> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received,
>> can we
>> poll again to wait Tx ring available to send more packets?
>>
>>>
>>>               /* copy the tx frame data */
>>> @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>               rte_pktmbuf_free(mbuf);
>>>       }
>>>
>>> +     /*
>>> +      * We might have to ignore a few more errnos here since the packets
>>> +      * remain in the mmap-ed queue and will be sent later, presumably.
>>> +      */
>>> +
>>
>> Can you please describe above comment more? What errors ignored?
>> And won't all packets sent will below sendto() call?
>>
>>>       /* kick-off transmits */
>>>       if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
>>>                       errno != ENOBUFS && errno != EAGAIN) {
>>>
>>
>>
  

Patch

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..3845df5 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -167,6 +167,12 @@  eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	return num_rx;
 }
 
+static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
+{
+	return *tp_status &
+		~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
+}
+
 /*
  * Callback to handle sending packets through a real NIC.
  */
@@ -211,9 +217,41 @@  eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			}
 		}
 
+		/*
+		 * We must eliminate the timestamp status from the packet
+		 * status. This should only matter if timestamping is enabled
+		 * on the socket, but there is a BUG in the kernel which is
+		 * fixed in newer releases.
+
+		 * For interfaces of type 'veth', the sent skb is forwarded
+		 * to the peer and back into the network stack which timestamps
+		 * it on the RX path if timestamping is enabled globally
+		 * (which happens if any socket enables timestamping).
+
+		 * When the skb is destructed, tpacket_destruct_skb() is called
+		 * and it calls __packet_set_timestamp() which doesn't check
+		 * the flags on the socket and returns the timestamp if it is
+		 * set in the skb (and for veth it is, as mentioned above).
+		 */
+
 		/* point at the next incoming frame */
-		if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
-		    (poll(&pfd, 1, -1) < 0))
+		if ((tx_ring_status_remove_ts(&ppd->tp_status)
+			!= TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
+			break;
+
+		/*
+		 * Poll can return POLLERR if the interface is down or POLLOUT,
+		 * even if there are no extra buffers available.
+		 * This happens, because packet_poll() calls datagram_poll()
+		 * which checks the space left in the socket buffer and in the
+		 * case of packet_mmap the default socket buffer length
+		 * doesn't match the requested size for the tx_ring so there
+		 * is always space left in socket buffer, which doesn't seem
+		 * to be correlated to the requested size for the tx_ring
+		 * in packet_mmap.
+		 */
+		if (tx_ring_status_remove_ts(&ppd->tp_status)
+			!= TP_STATUS_AVAILABLE)
 			break;
 
 		/* copy the tx frame data */
@@ -242,6 +280,11 @@  eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		rte_pktmbuf_free(mbuf);
 	}
 
+	/*
+	 * We might have to ignore a few more errnos here since the packets
+	 * remain in the mmap-ed queue and will be sent later, presumably.
+	 */
+
 	/* kick-off transmits */
 	if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
 			errno != ENOBUFS && errno != EAGAIN) {