diff mbox series

[v2] net/pcap: fix timeout of stopping device

Message ID 20220906080511.46088-1-yidingx.zhou@intel.com (mailing list archive)
State New
Delegated to: Ferruh Yigit
Headers show
Series [v2] net/pcap: fix timeout of stopping device | expand

Checks

Context Check Description
ci/intel-Testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/checkpatch success coding style OK

Commit Message

Zhou, YidingX Sept. 6, 2022, 8:05 a.m. UTC
The pcap file will be synchronized to the disk when stopping the device.
It takes a long time if the file is large that would cause the
'detach sync request' timeout when the device is closed under multi-process
scenario.

This commit fixes the issue by using alarm handler to release dumper.

Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
Cc: stable@dpdk.org

Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>

---
v2: use alarm handler to release dumper
---
 drivers/net/pcap/pcap_ethdev.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

Comments

Stephen Hemminger Sept. 6, 2022, 2:57 p.m. UTC | #1
On Tue,  6 Sep 2022 16:05:11 +0800
Yiding Zhou <yidingx.zhou@intel.com> wrote:

> The pcap file will be synchronized to the disk when stopping the device.
> It takes a long time if the file is large that would cause the
> 'detach sync request' timeout when the device is closed under multi-process
> scenario.
> 
> This commit fixes the issue by using alarm handler to release dumper.
> 
> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>


I think you need to redesign the handshake if this the case.
Forcing 30 second delay at the end of all uses of pcap is not acceptable.
Zhou, YidingX Sept. 6, 2022, 4:21 p.m. UTC | #2
On Tue,  6 Sep 2022 16:05:11 +0800
Yiding Zhou <yidingx.zhou@intel.com> wrote:

> The pcap file will be synchronized to the disk when stopping the device.
> It takes a long time if the file is large that would cause the 'detach 
> sync request' timeout when the device is closed under multi-process 
> scenario.
> 
> This commit fixes the issue by using alarm handler to release dumper.
> 
> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>


I think you need to redesign the handshake if this the case.
Forcing 30 second delay at the end of all uses of pcap is not acceptable.

Thanks for your comments.
According to my test, the time required to sync a 100G pcap file is about 20s, 
so I set a delay of 30s. I also tried to use AIO, but some issue in the multi-process scenario.
And I also consider io_uring,  it is only supported in linux-5.x, we need to consider compatibility,. 
Maybe better way is to do more work to redesign the handshake.
Zhou, YidingX Sept. 21, 2022, 7:14 a.m. UTC | #3
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, September 6, 2022 10:58 PM
> To: Zhou, YidingX <yidingx.zhou@intel.com>
> Cc: dev@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>; He, Xingguang <xingguang.he@intel.com>;
> stable@dpdk.org
> Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
> 
> On Tue,  6 Sep 2022 16:05:11 +0800
> Yiding Zhou <yidingx.zhou@intel.com> wrote:
> 
> > The pcap file will be synchronized to the disk when stopping the device.
> > It takes a long time if the file is large that would cause the 'detach
> > sync request' timeout when the device is closed under multi-process
> > scenario.
> >
> > This commit fixes the issue by using alarm handler to release dumper.
> >
> > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
> 
> 
> I think you need to redesign the handshake if this the case.
> Forcing 30 second delay at the end of all uses of pcap is not acceptable.

@Zhang, Qi Z Do we need to redesign the handshake to fix this?
Ferruh Yigit Oct. 3, 2022, 3 p.m. UTC | #4
On 9/21/2022 8:14 AM, Zhou, YidingX wrote:
> 
> 
>> -----Original Message-----
>> From: Stephen Hemminger <stephen@networkplumber.org>
>> Sent: Tuesday, September 6, 2022 10:58 PM
>> To: Zhou, YidingX <yidingx.zhou@intel.com>
>> Cc: dev@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>; Burakov, Anatoly
>> <anatoly.burakov@intel.com>; He, Xingguang <xingguang.he@intel.com>;
>> stable@dpdk.org
>> Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
>>
>> On Tue,  6 Sep 2022 16:05:11 +0800
>> Yiding Zhou <yidingx.zhou@intel.com> wrote:
>>
>>> The pcap file will be synchronized to the disk when stopping the device.
>>> It takes a long time if the file is large that would cause the 'detach
>>> sync request' timeout when the device is closed under multi-process
>>> scenario.
>>>
>>> This commit fixes the issue by using alarm handler to release dumper.
>>>
>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
>>
>>
>> I think you need to redesign the handshake if this the case.
>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
> 
> @Zhang, Qi Z Do we need to redesign the handshake to fix this?

Hi Yiding,

Your approach works, but Stephen's comment is to not add this delay by 
default, because this delay is not always required.

Can you please provide more details on multi-process communication and 
call trace, to help us think about a solution to address this issue in a 
more generic way (not just for pcap but for any case device close takes 
more than multi-process timeout)?
Zhou, YidingX Nov. 22, 2022, 9:25 a.m. UTC | #5
> -----Original Message-----
> From: Zhou, YidingX <yidingx.zhou@intel.com>
> Sent: Wednesday, September 21, 2022 3:15 PM
> To: Stephen Hemminger <stephen@networkplumber.org>; Zhang, Qi Z
> <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; He,
> Xingguang <xingguang.he@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v2] net/pcap: fix timeout of stopping device
> 
> 
> 
> > -----Original Message-----
> > From: Stephen Hemminger <mailto:stephen@networkplumber.org>
> > Sent: Tuesday, September 6, 2022 10:58 PM
> > To: Zhou, YidingX <mailto:yidingx.zhou@intel.com>
> > Cc: mailto:dev@dpdk.org; Zhang, Qi Z <mailto:qi.z.zhang@intel.com>; Burakov, Anatoly
> > <mailto:anatoly.burakov@intel.com>; He, Xingguang <mailto:xingguang.he@intel.com>;
> > mailto:stable@dpdk.org
> > Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
> >
> > On Tue,  6 Sep 2022 16:05:11 +0800
> > Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
> >
> > > The pcap file will be synchronized to the disk when stopping the device.
> > > It takes a long time if the file is large that would cause the
> > > 'detach sync request' timeout when the device is closed under
> > > multi-process scenario.
> > >
> > > This commit fixes the issue by using alarm handler to release dumper.
> > >
> > > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> > > Cc: mailto:stable@dpdk.org
> > >
> > > Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
> >
> >
> > I think you need to redesign the handshake if this the case.
> > Forcing 30 second delay at the end of all uses of pcap is not acceptable.
> 
> @Zhang, Qi Z Do we need to redesign the handshake to fix this?

Hi, Ferruh
Sorry for the late reply.
I did not receive your email on Oct 6, I got your comments from patchwork.

"Can you please provide more details on multi-process communication and 
call trace, to help us think about a solution to address this issue in a 
more generic way (not just for pcap but for any case device close takes 
more than multi-process timeout)?"

I try to explain this issue with a sequence diagram, hope it can be displayed correctly in the mail.

       thread                                 intr thread           intr thread             thread
    of secondary                       of secondary          of primary          of primary
             |                                              |                         |                          |
             |                                              |                         |                          |
rte_eal_hotplug_remove
rte_dev_remove
eal_dev_hotplug_request_to_primary
rte_mp_request_sync ------------------------------------------------------->|
                                                                                                                    |
                                                                                              handle_secondary_request
                                                                                         |<-----------------|
                                                                                         |
                                                                   __handle_secondary_request
                                                          eal_dev_hotplug_request_to_secondary
           |<------------------------------------- rte_mp_request_sync
           |
handle_primary_request--------->|
                                                           |
                            __handle_primary_request
                               local_dev_remove(this will take long time)
                                            rte_mp_reply -------------------------------->|                              
                                                                                         |
                                                                             local_dev_remove
          |<------------------------------------------------- rte_mp_reply

The marked 'local_dev_remove()' in the secondary process will perform a pcap file synchronization operation.
When the pcap file is too large, it will take a lot of time (according to my test 100G takes 20+ seconds).
This caused the processing of hot_plug message to time out.
Stephen Hemminger Nov. 22, 2022, 5:28 p.m. UTC | #6
On Tue, 22 Nov 2022 09:25:33 +0000
"Zhou, YidingX" <yidingx.zhou@intel.com> wrote:

> > -----Original Message-----
> > From: Zhou, YidingX <yidingx.zhou@intel.com>
> > Sent: Wednesday, September 21, 2022 3:15 PM
> > To: Stephen Hemminger <stephen@networkplumber.org>; Zhang, Qi Z
> > <qi.z.zhang@intel.com>
> > Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; He,
> > Xingguang <xingguang.he@intel.com>; stable@dpdk.org
> > Subject: RE: [PATCH v2] net/pcap: fix timeout of stopping device
> > 
> > 
> >   
> > > -----Original Message-----
> > > From: Stephen Hemminger <mailto:stephen@networkplumber.org>
> > > Sent: Tuesday, September 6, 2022 10:58 PM
> > > To: Zhou, YidingX <mailto:yidingx.zhou@intel.com>
> > > Cc: mailto:dev@dpdk.org; Zhang, Qi Z <mailto:qi.z.zhang@intel.com>; Burakov, Anatoly
> > > <mailto:anatoly.burakov@intel.com>; He, Xingguang <mailto:xingguang.he@intel.com>;
> > > mailto:stable@dpdk.org
> > > Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
> > >
> > > On Tue,  6 Sep 2022 16:05:11 +0800
> > > Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
> > >  
> > > > The pcap file will be synchronized to the disk when stopping the device.
> > > > It takes a long time if the file is large that would cause the
> > > > 'detach sync request' timeout when the device is closed under
> > > > multi-process scenario.
> > > >
> > > > This commit fixes the issue by using alarm handler to release dumper.
> > > >
> > > > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> > > > Cc: mailto:stable@dpdk.org
> > > >
> > > > Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>  
> > >
> > >
> > > I think you need to redesign the handshake if this the case.
> > > Forcing 30 second delay at the end of all uses of pcap is not acceptable.  
> > 
> > @Zhang, Qi Z Do we need to redesign the handshake to fix this?  
> 
> Hi, Ferruh
> Sorry for the late reply.
> I did not receive your email on Oct 6, I got your comments from patchwork.
> 
> "Can you please provide more details on multi-process communication and 
> call trace, to help us think about a solution to address this issue in a 
> more generic way (not just for pcap but for any case device close takes 
> more than multi-process timeout)?"
> 
> I try to explain this issue with a sequence diagram, hope it can be displayed correctly in the mail.
> 
>        thread                                 intr thread           intr thread             thread
>     of secondary                       of secondary          of primary          of primary
>              |                                              |                         |                          |
>              |                                              |                         |                          |
> rte_eal_hotplug_remove
> rte_dev_remove
> eal_dev_hotplug_request_to_primary
> rte_mp_request_sync ------------------------------------------------------->|
>                                                                                                                     |
>                                                                                               handle_secondary_request
>                                                                                          |<-----------------|
>                                                                                          |
>                                                                    __handle_secondary_request
>                                                           eal_dev_hotplug_request_to_secondary
>            |<------------------------------------- rte_mp_request_sync
>            |
> handle_primary_request--------->|
>                                                            |
>                             __handle_primary_request
>                                local_dev_remove(this will take long time)
>                                             rte_mp_reply -------------------------------->|                              
>                                                                                          |
>                                                                              local_dev_remove
>           |<------------------------------------------------- rte_mp_reply
> 
> The marked 'local_dev_remove()' in the secondary process will perform a pcap file synchronization operation.
> When the pcap file is too large, it will take a lot of time (according to my test 100G takes 20+ seconds).
> This caused the processing of hot_plug message to time out.


Part of the problem maybe a hidden file sync in some library.
Normally, closing a file should be fast even with lots of outstanding data.
The actual write done by OS will continue from file cache.

I wonder if doing some kind of fadvise call might help see POSIX_FADV_SEQUENTIAL or POSIX_FADV_DONTNEED
Ferruh Yigit Nov. 29, 2022, 2:11 p.m. UTC | #7
On 11/22/2022 9:25 AM, Zhou, YidingX wrote:
> 
> 
>> -----Original Message-----
>> From: Zhou, YidingX <yidingx.zhou@intel.com>
>> Sent: Wednesday, September 21, 2022 3:15 PM
>> To: Stephen Hemminger <stephen@networkplumber.org>; Zhang, Qi Z
>> <qi.z.zhang@intel.com>
>> Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; He,
>> Xingguang <xingguang.he@intel.com>; stable@dpdk.org
>> Subject: RE: [PATCH v2] net/pcap: fix timeout of stopping device
>>
>>
>>
>>> -----Original Message-----
>>> From: Stephen Hemminger <mailto:stephen@networkplumber.org>
>>> Sent: Tuesday, September 6, 2022 10:58 PM
>>> To: Zhou, YidingX <mailto:yidingx.zhou@intel.com>
>>> Cc: mailto:dev@dpdk.org; Zhang, Qi Z <mailto:qi.z.zhang@intel.com>; Burakov, Anatoly
>>> <mailto:anatoly.burakov@intel.com>; He, Xingguang <mailto:xingguang.he@intel.com>;
>>> mailto:stable@dpdk.org
>>> Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
>>>
>>> On Tue,  6 Sep 2022 16:05:11 +0800
>>> Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
>>>
>>>> The pcap file will be synchronized to the disk when stopping the device.
>>>> It takes a long time if the file is large that would cause the
>>>> 'detach sync request' timeout when the device is closed under
>>>> multi-process scenario.
>>>>
>>>> This commit fixes the issue by using alarm handler to release dumper.
>>>>
>>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
>>>> Cc: mailto:stable@dpdk.org
>>>>
>>>> Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
>>>
>>>
>>> I think you need to redesign the handshake if this the case.
>>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
>>
>> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
> 
> Hi, Ferruh
> Sorry for the late reply.
> I did not receive your email on Oct 6, I got your comments from patchwork.
> 
> "Can you please provide more details on multi-process communication and 
> call trace, to help us think about a solution to address this issue in a 
> more generic way (not just for pcap but for any case device close takes 
> more than multi-process timeout)?"
> 
> I try to explain this issue with a sequence diagram, hope it can be displayed correctly in the mail.
> 
>        thread                                 intr thread           intr thread             thread
>     of secondary                       of secondary          of primary          of primary
>              |                                              |                         |                          |
>              |                                              |                         |                          |
> rte_eal_hotplug_remove
> rte_dev_remove
> eal_dev_hotplug_request_to_primary
> rte_mp_request_sync ------------------------------------------------------->|
>                                                                                                                     |
>                                                                                               handle_secondary_request
>                                                                                          |<-----------------|
>                                                                                          |
>                                                                    __handle_secondary_request
>                                                           eal_dev_hotplug_request_to_secondary
>            |<------------------------------------- rte_mp_request_sync
>            |
> handle_primary_request--------->|
>                                                            |
>                             __handle_primary_request
>                                local_dev_remove(this will take long time)
>                                             rte_mp_reply -------------------------------->|                              
>                                                                                          |
>                                                                              local_dev_remove
>           |<------------------------------------------------- rte_mp_reply
> 
> The marked 'local_dev_remove()' in the secondary process will perform a pcap file synchronization operation.
> When the pcap file is too large, it will take a lot of time (according to my test 100G takes 20+ seconds).
> This caused the processing of hot_plug message to time out.

Hi Yiding,

Thanks for the information,

Right now all MP operations timeout is hardcoded in the code and it is 5
seconds.
Do you think does it work to have an API to set custom timeout,
something like `rte_mp_timeout_set()`, and call this from pdump?

This gives a generic solution for similar cases, not just for pcap.
But my concern is if this is too much multi-process related internal
detail to update, @Anatoly may comment on this.
diff mbox series

Patch

diff --git a/drivers/net/pcap/pcap_ethdev.c b/drivers/net/pcap/pcap_ethdev.c
index ec29fd6bc5..5c643a0277 100644
--- a/drivers/net/pcap/pcap_ethdev.c
+++ b/drivers/net/pcap/pcap_ethdev.c
@@ -17,6 +17,7 @@ 
 #include <rte_mbuf_dyn.h>
 #include <rte_bus_vdev.h>
 #include <rte_os_shim.h>
+#include <rte_alarm.h>
 
 #include "pcap_osdep.h"
 
@@ -664,6 +665,25 @@  eth_dev_start(struct rte_eth_dev *dev)
 	return 0;
 }
 
+static void eth_pcap_dumper_release(void *arg)
+{
+	pcap_dump_close((pcap_dumper_t *)arg);
+}
+
+static void
+eth_pcap_dumper_close(pcap_dumper_t *dumper)
+{
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		/*
+		 * Delay 30 seconds before releasing dumper to wait for file sync
+		 * to complete to avoid blocking alarm thread in PRIMARY process
+		 */
+		rte_eal_alarm_set(30000000, eth_pcap_dumper_release, dumper);
+	} else {
+		rte_eal_alarm_set(1, eth_pcap_dumper_release, dumper);
+	}
+}
+
 /*
  * This function gets called when the current port gets stopped.
  * Is the only place for us to close all the tx streams dumpers.
@@ -689,7 +709,7 @@  eth_dev_stop(struct rte_eth_dev *dev)
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
 		if (pp->tx_dumper[i] != NULL) {
-			pcap_dump_close(pp->tx_dumper[i]);
+			eth_pcap_dumper_close(pp->tx_dumper[i]);
 			pp->tx_dumper[i] = NULL;
 		}