[v2] net/pcap: fix timeout of stopping device
Checks
Commit Message
The pcap file will be synchronized to the disk when stopping the device.
It takes a long time if the file is large that would cause the
'detach sync request' timeout when the device is closed under multi-process
scenario.
This commit fixes the issue by using alarm handler to release dumper.
Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
Cc: stable@dpdk.org
Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
---
v2: use alarm handler to release dumper
---
drivers/net/pcap/pcap_ethdev.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
Comments
On Tue, 6 Sep 2022 16:05:11 +0800
Yiding Zhou <yidingx.zhou@intel.com> wrote:
> The pcap file will be synchronized to the disk when stopping the device.
> It takes a long time if the file is large that would cause the
> 'detach sync request' timeout when the device is closed under multi-process
> scenario.
>
> This commit fixes the issue by using alarm handler to release dumper.
>
> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> Cc: stable@dpdk.org
>
> Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
I think you need to redesign the handshake if this the case.
Forcing 30 second delay at the end of all uses of pcap is not acceptable.
On Tue, 6 Sep 2022 16:05:11 +0800
Yiding Zhou <yidingx.zhou@intel.com> wrote:
> The pcap file will be synchronized to the disk when stopping the device.
> It takes a long time if the file is large that would cause the 'detach
> sync request' timeout when the device is closed under multi-process
> scenario.
>
> This commit fixes the issue by using alarm handler to release dumper.
>
> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> Cc: stable@dpdk.org
>
> Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
I think you need to redesign the handshake if this the case.
Forcing 30 second delay at the end of all uses of pcap is not acceptable.
Thanks for your comments.
According to my test, the time required to sync a 100G pcap file is about 20s,
so I set a delay of 30s. I also tried to use AIO, but some issue in the multi-process scenario.
And I also consider io_uring, it is only supported in linux-5.x, we need to consider compatibility,.
Maybe better way is to do more work to redesign the handshake.
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, September 6, 2022 10:58 PM
> To: Zhou, YidingX <yidingx.zhou@intel.com>
> Cc: dev@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>; He, Xingguang <xingguang.he@intel.com>;
> stable@dpdk.org
> Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
>
> On Tue, 6 Sep 2022 16:05:11 +0800
> Yiding Zhou <yidingx.zhou@intel.com> wrote:
>
> > The pcap file will be synchronized to the disk when stopping the device.
> > It takes a long time if the file is large that would cause the 'detach
> > sync request' timeout when the device is closed under multi-process
> > scenario.
> >
> > This commit fixes the issue by using alarm handler to release dumper.
> >
> > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
>
>
> I think you need to redesign the handshake if this the case.
> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
@Zhang, Qi Z Do we need to redesign the handshake to fix this?
On 9/21/2022 8:14 AM, Zhou, YidingX wrote:
>
>
>> -----Original Message-----
>> From: Stephen Hemminger <stephen@networkplumber.org>
>> Sent: Tuesday, September 6, 2022 10:58 PM
>> To: Zhou, YidingX <yidingx.zhou@intel.com>
>> Cc: dev@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>; Burakov, Anatoly
>> <anatoly.burakov@intel.com>; He, Xingguang <xingguang.he@intel.com>;
>> stable@dpdk.org
>> Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
>>
>> On Tue, 6 Sep 2022 16:05:11 +0800
>> Yiding Zhou <yidingx.zhou@intel.com> wrote:
>>
>>> The pcap file will be synchronized to the disk when stopping the device.
>>> It takes a long time if the file is large that would cause the 'detach
>>> sync request' timeout when the device is closed under multi-process
>>> scenario.
>>>
>>> This commit fixes the issue by using alarm handler to release dumper.
>>>
>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Yiding Zhou <yidingx.zhou@intel.com>
>>
>>
>> I think you need to redesign the handshake if this the case.
>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
>
> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
Hi Yiding,
Your approach works, but Stephen's comment is to not add this delay by
default, because this delay is not always required.
Can you please provide more details on multi-process communication and
call trace, to help us think about a solution to address this issue in a
more generic way (not just for pcap but for any case device close takes
more than multi-process timeout)?
> -----Original Message-----
> From: Zhou, YidingX <yidingx.zhou@intel.com>
> Sent: Wednesday, September 21, 2022 3:15 PM
> To: Stephen Hemminger <stephen@networkplumber.org>; Zhang, Qi Z
> <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; He,
> Xingguang <xingguang.he@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v2] net/pcap: fix timeout of stopping device
>
>
>
> > -----Original Message-----
> > From: Stephen Hemminger <mailto:stephen@networkplumber.org>
> > Sent: Tuesday, September 6, 2022 10:58 PM
> > To: Zhou, YidingX <mailto:yidingx.zhou@intel.com>
> > Cc: mailto:dev@dpdk.org; Zhang, Qi Z <mailto:qi.z.zhang@intel.com>; Burakov, Anatoly
> > <mailto:anatoly.burakov@intel.com>; He, Xingguang <mailto:xingguang.he@intel.com>;
> > mailto:stable@dpdk.org
> > Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
> >
> > On Tue, 6 Sep 2022 16:05:11 +0800
> > Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
> >
> > > The pcap file will be synchronized to the disk when stopping the device.
> > > It takes a long time if the file is large that would cause the
> > > 'detach sync request' timeout when the device is closed under
> > > multi-process scenario.
> > >
> > > This commit fixes the issue by using alarm handler to release dumper.
> > >
> > > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> > > Cc: mailto:stable@dpdk.org
> > >
> > > Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
> >
> >
> > I think you need to redesign the handshake if this the case.
> > Forcing 30 second delay at the end of all uses of pcap is not acceptable.
>
> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
Hi, Ferruh
Sorry for the late reply.
I did not receive your email on Oct 6, I got your comments from patchwork.
"Can you please provide more details on multi-process communication and
call trace, to help us think about a solution to address this issue in a
more generic way (not just for pcap but for any case device close takes
more than multi-process timeout)?"
I try to explain this issue with a sequence diagram, hope it can be displayed correctly in the mail.
thread intr thread intr thread thread
of secondary of secondary of primary of primary
| | | |
| | | |
rte_eal_hotplug_remove
rte_dev_remove
eal_dev_hotplug_request_to_primary
rte_mp_request_sync ------------------------------------------------------->|
|
handle_secondary_request
|<-----------------|
|
__handle_secondary_request
eal_dev_hotplug_request_to_secondary
|<------------------------------------- rte_mp_request_sync
|
handle_primary_request--------->|
|
__handle_primary_request
local_dev_remove(this will take long time)
rte_mp_reply -------------------------------->|
|
local_dev_remove
|<------------------------------------------------- rte_mp_reply
The marked 'local_dev_remove()' in the secondary process will perform a pcap file synchronization operation.
When the pcap file is too large, it will take a lot of time (according to my test 100G takes 20+ seconds).
This caused the processing of hot_plug message to time out.
On Tue, 22 Nov 2022 09:25:33 +0000
"Zhou, YidingX" <yidingx.zhou@intel.com> wrote:
> > -----Original Message-----
> > From: Zhou, YidingX <yidingx.zhou@intel.com>
> > Sent: Wednesday, September 21, 2022 3:15 PM
> > To: Stephen Hemminger <stephen@networkplumber.org>; Zhang, Qi Z
> > <qi.z.zhang@intel.com>
> > Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; He,
> > Xingguang <xingguang.he@intel.com>; stable@dpdk.org
> > Subject: RE: [PATCH v2] net/pcap: fix timeout of stopping device
> >
> >
> >
> > > -----Original Message-----
> > > From: Stephen Hemminger <mailto:stephen@networkplumber.org>
> > > Sent: Tuesday, September 6, 2022 10:58 PM
> > > To: Zhou, YidingX <mailto:yidingx.zhou@intel.com>
> > > Cc: mailto:dev@dpdk.org; Zhang, Qi Z <mailto:qi.z.zhang@intel.com>; Burakov, Anatoly
> > > <mailto:anatoly.burakov@intel.com>; He, Xingguang <mailto:xingguang.he@intel.com>;
> > > mailto:stable@dpdk.org
> > > Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
> > >
> > > On Tue, 6 Sep 2022 16:05:11 +0800
> > > Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
> > >
> > > > The pcap file will be synchronized to the disk when stopping the device.
> > > > It takes a long time if the file is large that would cause the
> > > > 'detach sync request' timeout when the device is closed under
> > > > multi-process scenario.
> > > >
> > > > This commit fixes the issue by using alarm handler to release dumper.
> > > >
> > > > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> > > > Cc: mailto:stable@dpdk.org
> > > >
> > > > Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
> > >
> > >
> > > I think you need to redesign the handshake if this the case.
> > > Forcing 30 second delay at the end of all uses of pcap is not acceptable.
> >
> > @Zhang, Qi Z Do we need to redesign the handshake to fix this?
>
> Hi, Ferruh
> Sorry for the late reply.
> I did not receive your email on Oct 6, I got your comments from patchwork.
>
> "Can you please provide more details on multi-process communication and
> call trace, to help us think about a solution to address this issue in a
> more generic way (not just for pcap but for any case device close takes
> more than multi-process timeout)?"
>
> I try to explain this issue with a sequence diagram, hope it can be displayed correctly in the mail.
>
> thread intr thread intr thread thread
> of secondary of secondary of primary of primary
> | | | |
> | | | |
> rte_eal_hotplug_remove
> rte_dev_remove
> eal_dev_hotplug_request_to_primary
> rte_mp_request_sync ------------------------------------------------------->|
> |
> handle_secondary_request
> |<-----------------|
> |
> __handle_secondary_request
> eal_dev_hotplug_request_to_secondary
> |<------------------------------------- rte_mp_request_sync
> |
> handle_primary_request--------->|
> |
> __handle_primary_request
> local_dev_remove(this will take long time)
> rte_mp_reply -------------------------------->|
> |
> local_dev_remove
> |<------------------------------------------------- rte_mp_reply
>
> The marked 'local_dev_remove()' in the secondary process will perform a pcap file synchronization operation.
> When the pcap file is too large, it will take a lot of time (according to my test 100G takes 20+ seconds).
> This caused the processing of hot_plug message to time out.
Part of the problem maybe a hidden file sync in some library.
Normally, closing a file should be fast even with lots of outstanding data.
The actual write done by OS will continue from file cache.
I wonder if doing some kind of fadvise call might help see POSIX_FADV_SEQUENTIAL or POSIX_FADV_DONTNEED
On 11/22/2022 9:25 AM, Zhou, YidingX wrote:
>
>
>> -----Original Message-----
>> From: Zhou, YidingX <yidingx.zhou@intel.com>
>> Sent: Wednesday, September 21, 2022 3:15 PM
>> To: Stephen Hemminger <stephen@networkplumber.org>; Zhang, Qi Z
>> <qi.z.zhang@intel.com>
>> Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; He,
>> Xingguang <xingguang.he@intel.com>; stable@dpdk.org
>> Subject: RE: [PATCH v2] net/pcap: fix timeout of stopping device
>>
>>
>>
>>> -----Original Message-----
>>> From: Stephen Hemminger <mailto:stephen@networkplumber.org>
>>> Sent: Tuesday, September 6, 2022 10:58 PM
>>> To: Zhou, YidingX <mailto:yidingx.zhou@intel.com>
>>> Cc: mailto:dev@dpdk.org; Zhang, Qi Z <mailto:qi.z.zhang@intel.com>; Burakov, Anatoly
>>> <mailto:anatoly.burakov@intel.com>; He, Xingguang <mailto:xingguang.he@intel.com>;
>>> mailto:stable@dpdk.org
>>> Subject: Re: [PATCH v2] net/pcap: fix timeout of stopping device
>>>
>>> On Tue, 6 Sep 2022 16:05:11 +0800
>>> Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
>>>
>>>> The pcap file will be synchronized to the disk when stopping the device.
>>>> It takes a long time if the file is large that would cause the
>>>> 'detach sync request' timeout when the device is closed under
>>>> multi-process scenario.
>>>>
>>>> This commit fixes the issue by using alarm handler to release dumper.
>>>>
>>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
>>>> Cc: mailto:stable@dpdk.org
>>>>
>>>> Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
>>>
>>>
>>> I think you need to redesign the handshake if this the case.
>>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
>>
>> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
>
> Hi, Ferruh
> Sorry for the late reply.
> I did not receive your email on Oct 6, I got your comments from patchwork.
>
> "Can you please provide more details on multi-process communication and
> call trace, to help us think about a solution to address this issue in a
> more generic way (not just for pcap but for any case device close takes
> more than multi-process timeout)?"
>
> I try to explain this issue with a sequence diagram, hope it can be displayed correctly in the mail.
>
> thread intr thread intr thread thread
> of secondary of secondary of primary of primary
> | | | |
> | | | |
> rte_eal_hotplug_remove
> rte_dev_remove
> eal_dev_hotplug_request_to_primary
> rte_mp_request_sync ------------------------------------------------------->|
> |
> handle_secondary_request
> |<-----------------|
> |
> __handle_secondary_request
> eal_dev_hotplug_request_to_secondary
> |<------------------------------------- rte_mp_request_sync
> |
> handle_primary_request--------->|
> |
> __handle_primary_request
> local_dev_remove(this will take long time)
> rte_mp_reply -------------------------------->|
> |
> local_dev_remove
> |<------------------------------------------------- rte_mp_reply
>
> The marked 'local_dev_remove()' in the secondary process will perform a pcap file synchronization operation.
> When the pcap file is too large, it will take a lot of time (according to my test 100G takes 20+ seconds).
> This caused the processing of hot_plug message to time out.
Hi Yiding,
Thanks for the information,
Right now all MP operations timeout is hardcoded in the code and it is 5
seconds.
Do you think does it work to have an API to set custom timeout,
something like `rte_mp_timeout_set()`, and call this from pdump?
This gives a generic solution for similar cases, not just for pcap.
But my concern is if this is too much multi-process related internal
detail to update, @Anatoly may comment on this.
> >>> On Tue, 6 Sep 2022 16:05:11 +0800
> >>> Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
> >>>
> >>>> The pcap file will be synchronized to the disk when stopping the device.
> >>>> It takes a long time if the file is large that would cause the
> >>>> 'detach sync request' timeout when the device is closed under
> >>>> multi-process scenario.
> >>>>
> >>>> This commit fixes the issue by using alarm handler to release dumper.
> >>>>
> >>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> >>>> Cc: mailto:stable@dpdk.org
> >>>>
> >>>> Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
> >>>
> >>>
> >>> I think you need to redesign the handshake if this the case.
> >>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
> >>
> >> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
> >
> > Hi, Ferruh
> > Sorry for the late reply.
> > I did not receive your email on Oct 6, I got your comments from patchwork.
> >
> > "Can you please provide more details on multi-process communication
> > and call trace, to help us think about a solution to address this
> > issue in a more generic way (not just for pcap but for any case device
> > close takes more than multi-process timeout)?"
> >
> > I try to explain this issue with a sequence diagram, hope it can be displayed
> correctly in the mail.
> >
> > thread intr thread intr thread thread
> > of secondary of secondary of primary of primary
> > | | | |
> > | | | |
> > rte_eal_hotplug_remove
> > rte_dev_remove
> > eal_dev_hotplug_request_to_primary
> > rte_mp_request_sync ------------------------------------------------------->|
> > |
> >
> handle_secondary_request
> > |<-----------------|
> > |
> > __handle_secondary_request
> > eal_dev_hotplug_request_to_secondary
> > |<------------------------------------- rte_mp_request_sync
> > |
> > handle_primary_request--------->|
> > |
> > __handle_primary_request
> > local_dev_remove(this will take long time)
> > rte_mp_reply -------------------------------->|
> > |
> > local_dev_remove
> > |<-------------------------------------------------
> > rte_mp_reply
> >
> > The marked 'local_dev_remove()' in the secondary process will perform a
> pcap file synchronization operation.
> > When the pcap file is too large, it will take a lot of time (according to my test
> 100G takes 20+ seconds).
> > This caused the processing of hot_plug message to time out.
>
> Hi Yiding,
>
> Thanks for the information,
>
> Right now all MP operations timeout is hardcoded in the code and it is 5
> seconds.
> Do you think does it work to have an API to set custom timeout, something like
> `rte_mp_timeout_set()`, and call this from pdump?
>
> This gives a generic solution for similar cases, not just for pcap.
> But my concern is if this is too much multi-process related internal detail to
> update, @Anatoly may comment on this.
Hi, Ferruh
For pdump case only, I think the timeout is affected by pcap's size and other system components, such as the type of FS, system memory size.
It may be difficult to predict the specific time value for setting.
> > > > On Tue, 6 Sep 2022 16:05:11 +0800 Yiding Zhou
> > > > <mailto:yidingx.zhou@intel.com> wrote:
> > > >
> > > > > The pcap file will be synchronized to the disk when stopping the device.
> > > > > It takes a long time if the file is large that would cause the
> > > > > 'detach sync request' timeout when the device is closed under
> > > > > multi-process scenario.
> > > > >
> > > > > This commit fixes the issue by using alarm handler to release dumper.
> > > > >
> > > > > Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process
> > > > > private")
> > > > > Cc: mailto:stable@dpdk.org
> > > > >
> > > > > Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
> > > >
> > > >
> > > > I think you need to redesign the handshake if this the case.
> > > > Forcing 30 second delay at the end of all uses of pcap is not acceptable.
> > >
> > > @Zhang, Qi Z Do we need to redesign the handshake to fix this?
> >
> > Hi, Ferruh
> > Sorry for the late reply.
> > I did not receive your email on Oct 6, I got your comments from patchwork.
> >
> > "Can you please provide more details on multi-process communication
> > and call trace, to help us think about a solution to address this
> > issue in a more generic way (not just for pcap but for any case device
> > close takes more than multi-process timeout)?"
> >
> > I try to explain this issue with a sequence diagram, hope it can be displayed
> correctly in the mail.
> >
> > thread intr thread intr thread thread
> > of secondary of secondary of primary of primary
> > | | | |
> > | | | |
> > rte_eal_hotplug_remove
> > rte_dev_remove
> > eal_dev_hotplug_request_to_primary
> > rte_mp_request_sync ------------------------------------------------------->|
> > |
> >
> handle_secondary_request
> > |<-----------------|
> > |
> > __handle_secondary_request
> > eal_dev_hotplug_request_to_secondary
> > |<------------------------------------- rte_mp_request_sync
> > |
> > handle_primary_request--------->|
> > |
> > __handle_primary_request
> > local_dev_remove(this will take long time)
> > rte_mp_reply -------------------------------->|
> > |
> > local_dev_remove
> > |<-------------------------------------------------
> > rte_mp_reply
> >
> > The marked 'local_dev_remove()' in the secondary process will perform a
> pcap file synchronization operation.
> > When the pcap file is too large, it will take a lot of time (according to my test
> 100G takes 20+ seconds).
> > This caused the processing of hot_plug message to time out.
>
>
> Part of the problem maybe a hidden file sync in some library.
> Normally, closing a file should be fast even with lots of outstanding data.
> The actual write done by OS will continue from file cache.
>
> I wonder if doing some kind of fadvise call might help see
> POSIX_FADV_SEQUENTIAL or POSIX_FADV_DONTNEED
Thanks for your advice, I will try it and give you feedback.
On 12/2/2022 10:13 AM, Zhou, YidingX wrote:
>
>
>>>>> On Tue, 6 Sep 2022 16:05:11 +0800
>>>>> Yiding Zhou <mailto:yidingx.zhou@intel.com> wrote:
>>>>>
>>>>>> The pcap file will be synchronized to the disk when stopping the device.
>>>>>> It takes a long time if the file is large that would cause the
>>>>>> 'detach sync request' timeout when the device is closed under
>>>>>> multi-process scenario.
>>>>>>
>>>>>> This commit fixes the issue by using alarm handler to release dumper.
>>>>>>
>>>>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
>>>>>> Cc: mailto:stable@dpdk.org
>>>>>>
>>>>>> Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
>>>>>
>>>>>
>>>>> I think you need to redesign the handshake if this the case.
>>>>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
>>>>
>>>> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
>>>
>>> Hi, Ferruh
>>> Sorry for the late reply.
>>> I did not receive your email on Oct 6, I got your comments from patchwork.
>>>
>>> "Can you please provide more details on multi-process communication
>>> and call trace, to help us think about a solution to address this
>>> issue in a more generic way (not just for pcap but for any case device
>>> close takes more than multi-process timeout)?"
>>>
>>> I try to explain this issue with a sequence diagram, hope it can be displayed
>> correctly in the mail.
>>>
>>> thread intr thread intr thread thread
>>> of secondary of secondary of primary of primary
>>> | | | |
>>> | | | |
>>> rte_eal_hotplug_remove
>>> rte_dev_remove
>>> eal_dev_hotplug_request_to_primary
>>> rte_mp_request_sync ------------------------------------------------------->|
>>> |
>>>
>> handle_secondary_request
>>> |<-----------------|
>>> |
>>> __handle_secondary_request
>>> eal_dev_hotplug_request_to_secondary
>>> |<------------------------------------- rte_mp_request_sync
>>> |
>>> handle_primary_request--------->|
>>> |
>>> __handle_primary_request
>>> local_dev_remove(this will take long time)
>>> rte_mp_reply -------------------------------->|
>>> |
>>> local_dev_remove
>>> |<-------------------------------------------------
>>> rte_mp_reply
>>>
>>> The marked 'local_dev_remove()' in the secondary process will perform a
>> pcap file synchronization operation.
>>> When the pcap file is too large, it will take a lot of time (according to my test
>> 100G takes 20+ seconds).
>>> This caused the processing of hot_plug message to time out.
>>
>> Hi Yiding,
>>
>> Thanks for the information,
>>
>> Right now all MP operations timeout is hardcoded in the code and it is 5
>> seconds.
>> Do you think does it work to have an API to set custom timeout, something like
>> `rte_mp_timeout_set()`, and call this from pdump?
>>
>> This gives a generic solution for similar cases, not just for pcap.
>> But my concern is if this is too much multi-process related internal detail to
>> update, @Anatoly may comment on this.
>
> Hi, Ferruh
> For pdump case only, I think the timeout is affected by pcap's size and other system components, such as the type of FS, system memory size.
> It may be difficult to predict the specific time value for setting.
It doesn't have to be specific.
Point here is to have a multi process API to set timeout, instead of put
a hardcoded timeout in pcap PMD.
> >>>>> On Tue, 6 Sep 2022 16:05:11 +0800 Yiding Zhou
> >>>>> <mailto:yidingx.zhou@intel.com> wrote:
> >>>>>
> >>>>>> The pcap file will be synchronized to the disk when stopping the device.
> >>>>>> It takes a long time if the file is large that would cause the
> >>>>>> 'detach sync request' timeout when the device is closed under
> >>>>>> multi-process scenario.
> >>>>>>
> >>>>>> This commit fixes the issue by using alarm handler to release dumper.
> >>>>>>
> >>>>>> Fixes: 0ecfb6c04d54 ("net/pcap: move handler to process private")
> >>>>>> Cc: mailto:stable@dpdk.org
> >>>>>>
> >>>>>> Signed-off-by: Yiding Zhou <mailto:yidingx.zhou@intel.com>
> >>>>>
> >>>>>
> >>>>> I think you need to redesign the handshake if this the case.
> >>>>> Forcing 30 second delay at the end of all uses of pcap is not acceptable.
> >>>>
> >>>> @Zhang, Qi Z Do we need to redesign the handshake to fix this?
> >>>
> >>> Hi, Ferruh
> >>> Sorry for the late reply.
> >>> I did not receive your email on Oct 6, I got your comments from patchwork.
> >>>
> >>> "Can you please provide more details on multi-process communication
> >>> and call trace, to help us think about a solution to address this
> >>> issue in a more generic way (not just for pcap but for any case
> >>> device close takes more than multi-process timeout)?"
> >>>
> >>> I try to explain this issue with a sequence diagram, hope it can be
> >>> displayed
> >> correctly in the mail.
> >>>
> >>> thread intr thread intr thread thread
> >>> of secondary of secondary of primary of
> primary
> >>> | | | |
> >>> | | | |
> >>> rte_eal_hotplug_remove
> >>> rte_dev_remove
> >>> eal_dev_hotplug_request_to_primary
> >>> rte_mp_request_sync ------------------------------------------------------->|
> >>>
> >>> |
> >>>
> >> handle_secondary_request
> >>> |<-----------------|
> >>> |
> >>> __handle_secondary_request
> >>>
> eal_dev_hotplug_request_to_secondary
> >>> |<------------------------------------- rte_mp_request_sync
> >>> |
> >>> handle_primary_request--------->|
> >>> |
> >>> __handle_primary_request
> >>> local_dev_remove(this will take long time)
> >>> rte_mp_reply -------------------------------->|
> >>> |
> >>> local_dev_remove
> >>> |<-------------------------------------------------
> >>> rte_mp_reply
> >>>
> >>> The marked 'local_dev_remove()' in the secondary process will
> >>> perform a
> >> pcap file synchronization operation.
> >>> When the pcap file is too large, it will take a lot of time
> >>> (according to my test
> >> 100G takes 20+ seconds).
> >>> This caused the processing of hot_plug message to time out.
> >>
> >> Hi Yiding,
> >>
> >> Thanks for the information,
> >>
> >> Right now all MP operations timeout is hardcoded in the code and it
> >> is 5 seconds.
> >> Do you think does it work to have an API to set custom timeout,
> >> something like `rte_mp_timeout_set()`, and call this from pdump?
> >>
> >> This gives a generic solution for similar cases, not just for pcap.
> >> But my concern is if this is too much multi-process related internal
> >> detail to update, @Anatoly may comment on this.
> >
> > Hi, Ferruh
> > For pdump case only, I think the timeout is affected by pcap's size and other
> system components, such as the type of FS, system memory size.
> > It may be difficult to predict the specific time value for setting.
>
> It doesn't have to be specific.
>
> Point here is to have a multi process API to set timeout, instead of put a
> hardcoded timeout in pcap PMD.
OK, I understood.
@@ -17,6 +17,7 @@
#include <rte_mbuf_dyn.h>
#include <rte_bus_vdev.h>
#include <rte_os_shim.h>
+#include <rte_alarm.h>
#include "pcap_osdep.h"
@@ -664,6 +665,25 @@ eth_dev_start(struct rte_eth_dev *dev)
return 0;
}
+static void eth_pcap_dumper_release(void *arg)
+{
+ pcap_dump_close((pcap_dumper_t *)arg);
+}
+
+static void
+eth_pcap_dumper_close(pcap_dumper_t *dumper)
+{
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ /*
+ * Delay 30 seconds before releasing dumper to wait for file sync
+ * to complete to avoid blocking alarm thread in PRIMARY process
+ */
+ rte_eal_alarm_set(30000000, eth_pcap_dumper_release, dumper);
+ } else {
+ rte_eal_alarm_set(1, eth_pcap_dumper_release, dumper);
+ }
+}
+
/*
* This function gets called when the current port gets stopped.
* Is the only place for us to close all the tx streams dumpers.
@@ -689,7 +709,7 @@ eth_dev_stop(struct rte_eth_dev *dev)
for (i = 0; i < dev->data->nb_tx_queues; i++) {
if (pp->tx_dumper[i] != NULL) {
- pcap_dump_close(pp->tx_dumper[i]);
+ eth_pcap_dumper_close(pp->tx_dumper[i]);
pp->tx_dumper[i] = NULL;
}