[v2] kni: Fix request overwritten

Message ID: 20210924105409.21711-1-eladv6@gmail.com (mailing list archive)
State: Superseded, archived
Delegated to: Ferruh Yigit
Series: [v2] kni: Fix request overwritten

Checks

Context                         Check    Description
ci/checkpatch                   warning  coding style issues
ci/Intel-compilation            success  Compilation OK
ci/intel-Testing                success  Testing PASS
ci/github-robot: build          success  github build: passed
ci/iol-broadcom-Functional      success  Functional Testing PASS
ci/iol-broadcom-Performance     success  Performance Testing PASS
ci/iol-intel-Performance        success  Performance Testing PASS
ci/iol-intel-Functional         success  Functional Testing PASS
ci/iol-aarch64-compile-testing  success  Testing PASS
ci/iol-x86_64-compile-testing   success  Testing PASS
ci/iol-x86_64-unit-testing      success  Testing PASS
ci/iol-mellanox-Performance     success  Performance Testing PASS

Commit Message

Elad Nachman Sept. 24, 2021, 10:54 a.m. UTC
  Fix lack of multiple KNI requests handling support by introducing a
request in progress flag which will fail additional requests with
EAGAIN return code if the original request has not been processed
by user-space.

Bugzilla ID: 809
 
Signed-off-by: Elad Nachman <eladv6@gmail.com>
---
 kernel/linux/kni/kni_net.c | 9 +++++++++
 lib/kni/rte_kni.c          | 2 ++
 lib/kni/rte_kni_common.h   | 1 +
 3 files changed, 12 insertions(+)
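
As a rough illustration of the mechanism described in the commit message, the following user-space C model mimics a "request in progress" check. It is only a sketch under assumptions: `struct model_request`, `sync_slot`, `model_process_request()` and `model_handle_request()` are hypothetical names, only the `req_in_progress` field name comes from the patch, and the real check runs in the kernel under `kni->sync_lock`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for struct rte_kni_request. Only req_in_progress
 * is named in the patch; the other fields are illustrative. */
struct model_request {
	unsigned int req_id;
	int result;
	int req_in_progress;
};

/* Stand-in for the single shared buffer that kni->sync_kva points to. */
struct model_request sync_slot;

/* Kernel side of the model: refuse a new request while the previous one
 * has not been handled yet, otherwise store it and mark it in progress. */
int model_process_request(const struct model_request *req)
{
	assert(req != NULL);

	if (sync_slot.req_in_progress)
		return -EAGAIN;

	sync_slot = *req;
	sync_slot.req_in_progress = 1;
	return 0;
}

/* User-space side of the model: handle the pending request, if any,
 * and release the slot by clearing the flag. */
void model_handle_request(void)
{
	if (!sync_slot.req_in_progress)
		return;

	sync_slot.result = 0;
	sync_slot.req_in_progress = 0;
}
```

In this model, a second `model_process_request()` call made before `model_handle_request()` runs returns `-EAGAIN` and leaves the first request untouched; once the handler has cleared the flag, the same request is accepted.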
  

Comments

Ferruh Yigit Oct. 4, 2021, 1:01 p.m. UTC | #1
On 9/24/2021 11:54 AM, Elad Nachman wrote:
> Fix lack of multiple KNI requests handling support by introducing a
> request in progress flag which will fail additional requests with
> EAGAIN return code if the original request has not been processed
> by user-space.
> 
> Bugzilla ID: 809

Hi Eric,

Can you please test whether this patch solves the issue you reported?

>  
> Signed-off-by: Elad Nachman <eladv6@gmail.com>
> ---
>  kernel/linux/kni/kni_net.c | 9 +++++++++
>  lib/kni/rte_kni.c          | 2 ++
>  lib/kni/rte_kni_common.h   | 1 +
>  3 files changed, 12 insertions(+)
> 

<...>

> @@ -123,7 +124,15 @@ kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
>  
>  	mutex_lock(&kni->sync_lock);
>  
> +	/* Check that existing request has been processed: */
> +	cur_req = (struct rte_kni_request *)kni->sync_kva;
> +	if (cur_req->req_in_progress) {
> +		ret = -EAGAIN;

Overall, the logic in KNI looks good to me; this helps to serialize the
requests, even the async ones.

But can you please clarify how the kernel side behaves with an '-EAGAIN'
return value? Will Linux call the ndo again, or will it just fail?

If it just fails, should we handle the retry on '-EAGAIN' within the kni module?
  
Elad Nachman Oct. 4, 2021, 1:09 p.m. UTC | #2
Hi,

EAGAIN is propagated back to the kernel and to the caller.

We cannot retry from the kni kernel module since we hold the rtnl lock.

FYI,

Elad
  
Ferruh Yigit Oct. 4, 2021, 2:03 p.m. UTC | #3
On 10/4/2021 2:09 PM, Elad Nachman wrote:
> Hi,
> 
> EAGAIN is propagated back to the kernel and to the caller.
> 

So will the user get an error, or will it be handled by the kernel and retried?

> We cannot retry from the kni kernel module since we hold the rtnl lock.
> 

Why not? We already wait until a command times out. For example,
'kni_net_open()' can retry if 'kni_net_process_request()' returns '-EAGAIN',
and we can limit the number of retries for safety.

  
Eric Christian Oct. 4, 2021, 2:14 p.m. UTC | #4
Adding Sahithi.

I believe the -EAGAIN approach puts the responsibility on the
application/caller.  Take changing the MAC address as an example: most
application code just does this kind of check:

        ret = ioctl(sockfd, SIOCSIFHWADDR, &ifr);

        if (ret < 0) {
                PMD_LOG_ERRNO(ERR, "ioctl(SIOCSIFHWADDR) failed");
                return -EINVAL;
        }

So the existing application code will treat the -EAGAIN as a failure and
not retry.  Unless it is expected that the IOCTL can return -EAGAIN and the
application decides to keep retrying?
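
If an application did choose to retry, a bounded loop along these lines is one option. This is a minimal sketch, not code from DPDK or from the patch: `request_fn`, `retry_on_eagain()` and `fake_busy_request()` are hypothetical names, and the callback is assumed to report failure the way ioctl(2) does, by returning -1 with errno set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, or -1 with errno
 * set, in the same way ioctl(2) reports failure. */
typedef int (*request_fn)(void *ctx);

/* Call fn up to max_tries times, sleeping delay_ms between attempts,
 * for as long as it fails with EAGAIN. Any other error is returned
 * immediately. */
int retry_on_eagain(request_fn fn, void *ctx, int max_tries, long delay_ms)
{
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	int ret = -1;

	assert(fn != NULL && max_tries > 0);

	for (int i = 0; i < max_tries; i++) {
		ret = fn(ctx);
		if (ret == 0 || errno != EAGAIN)
			return ret;
		if (i + 1 < max_tries)
			nanosleep(&delay, NULL);
	}
	return ret; /* still failing with EAGAIN after max_tries attempts */
}

/* Stand-in for the real ioctl call: fails with EAGAIN until the counter
 * that ctx points to reaches zero, then succeeds. */
int fake_busy_request(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

A real caller would wrap `ioctl(sockfd, SIOCSIFHWADDR, &ifr)` in a small callback and pass that instead of `fake_busy_request`. The cap on attempts keeps a permanently busy interface from stalling the application forever.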

We can try this, but we have temporarily patched out the async changes in
our code as it was blocking QA due to
https://bugs.dpdk.org/show_bug.cgi?id=816

Eric
  
Elad Nachman Oct. 4, 2021, 2:25 p.m. UTC | #5
1. Userspace will get an error
2. Waiting with rtnl locked causes a deadlock; waiting with rtnl unlocked
for interface down command causes a crash because of a race condition in
the device delete/unregister list in the kernel.

FYI,

Elad.
  
Ferruh Yigit Oct. 4, 2021, 2:51 p.m. UTC | #6
On 10/4/2021 3:25 PM, Elad Nachman wrote:

Can you please try not to top post? It makes it impossible to follow this
discussion later from the mail archives.

> 1. Userspace will get an error

So there is nothing special about returning '-EAGAIN'; the user will only
observe an error.
Wasn't the initial intention to use '-EAGAIN' to retry the request?

> 2. Waiting with rtnl locked causes a deadlock; waiting with rtnl unlocked
> for interface down command causes a crash because of a race condition in
> the device delete/unregister list in the kernel.
> 

Why does waiting with the rtnl lock held cause a deadlock? As said below, we
are already doing it; why is it different with retry logic?

I agree to not wait with rtnl unlocked.

  
Ferruh Yigit Oct. 4, 2021, 2:56 p.m. UTC | #7
On 10/4/2021 3:14 PM, Eric Christian wrote:
> Adding Sahithi.
> 
> I believe adding the -EAGAIN method puts the responsibility on the
> application/caller.  If we take the change MAC address as an example.  Most
> application code just does this kind of check:
> 
>         ret = ioctl(sockfd, SIOCSIFHWADDR, &ifr);
> 
>         if (ret < 0) {
>                 PMD_LOG_ERRNO(ERR, "ioctl(SIOCSIFHWADDR) failed");
>                 return -EINVAL;
>         }
> 

I am not sure '-EAGAIN' should be handled by the userspace code. I assumed that
the kernel netdev layer would try again if the ndo returns '-EAGAIN', but that
seems not to be the case, so perhaps we can retry in the KNI kernel module. That
way the issue can be handled within the KNI module, transparently to the user
application.

  
Elad Nachman Oct. 4, 2021, 2:58 p.m. UTC | #8
On Mon, Oct 4, 2021, 17:51, Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> On 10/4/2021 3:25 PM, Elad Nachman wrote:
>
> Can you please try to not top post, it will make impossible to follow this
> discussion later from the mail archives.
>
> > 1. Userspace will get an error
>
> So there is nothing special with returning '-EAGAIN', user will only
> observe an
> error.
> Wasn't initial intention to use '-EAGAIN' to try request again?
>
To signal user-space to retry the operation.

>
> > 2. Waiting with rtnl locked causes a deadlock; waiting with rtnl unlocked
> > for interface down command causes a crash because of a race condition in
> > the device delete/unregister list in the kernel.
> >
>
> Why waiting with rtnl lock causes a deadlock? As said below we are already
> doing it, why it is different with retry logic?
>
Because it can be interface down request.


Elad.
  
Ferruh Yigit Oct. 4, 2021, 3:48 p.m. UTC | #9
On 10/4/2021 3:58 PM, Elad Nachman wrote:
> On Mon, Oct 4, 2021, 17:51, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> 
>> On 10/4/2021 3:25 PM, Elad Nachman wrote:
>>
>> Can you please try to not top post, it will make impossible to follow this
>> discussion later from the mail archives.
>>
>>> 1. Userspace will get an error
>>
>> So there is nothing special with returning '-EAGAIN', user will only
>> observe an
>> error.
>> Wasn't initial intention to use '-EAGAIN' to try request again?
>>
> To signal user-space to retry the operation.
>

Not sure if it will reach the end user. If the user is calling "ifconfig <iface>
down", it will just fail, right? The tool won't recognize the error type.

Unless this is common practice among Linux network drivers, doing it in KNI
won't help much. I am for handling this on the kernel side if we can.

>>
>>> 2. Waiting with rtnl locked causes a deadlock; waiting with rtnl unlocked
>>> for interface down command causes a crash because of a race condition in
>>> the device delete/unregister list in the kernel.
>>>
>>
>> Why waiting with rtnl lock causes a deadlock? As said below we are already
>> doing it, why it is different with retry logic?
>>
> Because it can be interface down request.
> 

(sure you like short answers)

Please help me see why "interface down" is special. Isn't the point of your
patch to wait for the request to execute in userspace even if it is an async
request?

And yet again, the number of retries can be limited.


  
Elad Nachman Oct. 4, 2021, 4:18 p.m. UTC | #10
On Mon, Oct 4, 2021 at 7:05 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> On 10/4/2021 3:58 PM, Elad Nachman wrote:
> > On Mon, Oct 4, 2021, 17:51, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> >
> >> On 10/4/2021 3:25 PM, Elad Nachman wrote:
> >>
> >> Can you please try to not top post, it will make impossible to follow
> this
> >> discussion later from the mail archives.
> >>
> >>> 1. Userspace will get an error
> >>
> >> So there is nothing special with returning '-EAGAIN', user will only
> >> observe an
> >> error.
> >> Wasn't initial intention to use '-EAGAIN' to try request again?
> >>
> > To signal user-space to retry the operation.
> >
>
> Not sure if it will reach to the end user. If user is calling "ifconfig
> <iface>
> down", it will just fail right, it won't recognize the error type.
>
> Unless this is common usage by the Linux network drivers, having this
> usage in
> KNI won't help much. I am for handling this in the kernel side if we can.
>
>
If the user calls ifconfig <iface> down, it will not happen. It requires a
multi-core race condition that only Eric can recreate.


> >>
> >>> 2. Waiting with rtnl locked causes a deadlock; waiting with rtnl
> unlocked
> >>> for interface down command causes a crash because of a race condition
> in
> >>> the device delete/unregister list in the kernel.
> >>>
> >>
> >> Why waiting with rtnl lock causes a deadlock? As said below we are
> already
> >> doing it, why it is different with retry logic?
> >>
> > Because it can be interface down request.
> >
>
> (sure you like short answers)
>
> Please help me to see why "interface down" is special. Isn't it point of
> your
> patch to wait the request execution in the userspace even it is an async
> request?
>
> And yet again, number of retry can be limited.
>
>
No, it is not. Please look again:
https://patches.dpdk.org/project/dpdk/patch/20210924105409.21711-1-eladv6@gmail.com/



  
Eric Christian Oct. 4, 2021, 4:59 p.m. UTC | #11
I am not sure that only we can recreate the KNI request overwrite.  We may
be the only ones with a current use case that exposes the vulnerability.
It is possible for any KNI operation to encounter this issue with the new
async mechanism.  As long as kni_net_process_request() is called from a
different thread than rte_kni_handle_request(), this has the potential to
occur with the use of async requests.

All you need is one async KNI request followed closely by a second KNI
request before the rte_kni_handle_request() has had a chance to process the
first request.
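
The sequence just described can be pictured with a deliberately simplified model. This is a sketch and not the KNI code: `struct model_slot`, `model_post_async()` and `model_handle()` are hypothetical names, and a single `pending` flag stands in for whatever queueing the real module uses. The only property taken from the discussion is that an async post writes into one shared request buffer and returns without waiting.

```c
#include <assert.h>
#include <stddef.h>

enum model_req_id { MODEL_REQ_NONE, MODEL_REQ_IF_DOWN, MODEL_REQ_SET_MAC };

struct model_slot {
	enum model_req_id req_id; /* request currently stored in the slot */
	int pending;              /* 1 if a request is waiting to be handled */
};

/* Async post without any "in progress" check: copies the request into
 * the single shared slot and returns without waiting for a response. */
void model_post_async(struct model_slot *slot, enum model_req_id id)
{
	assert(slot != NULL);
	slot->req_id = id;
	slot->pending = 1;
}

/* Handler side: returns the request it actually finds in the slot, or
 * MODEL_REQ_NONE if nothing is pending. */
enum model_req_id model_handle(struct model_slot *slot)
{
	assert(slot != NULL);
	if (!slot->pending)
		return MODEL_REQ_NONE;
	slot->pending = 0;
	return slot->req_id;
}
```

Posting `MODEL_REQ_IF_DOWN` and then `MODEL_REQ_SET_MAC` before the handler runs leaves only the second request visible to `model_handle()`; the first one is silently lost, which is the overwrite under discussion.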

The kernel dev driver simply returns the error value back to the caller if
it is less than zero.


Eric
  
Elad Nachman Oct. 4, 2021, 6:27 p.m. UTC | #12
On Mon, Oct 4, 2021, 20:00, Eric Christian <erclists@gmail.com> wrote:

> I am not sure that only we can recreate the KNI request overwrite.  We may
> be the only ones with a current use case that exposes the vulnerability.
>  It is possible for any KNI operation to encounter this issue with the new
> async mechanism.  As long as the call to kni_net_process_request() is a
> separate thread from rte_kni_handle_request() this has the potential to
> occur with the use of async requests.
>
> All you need is one async KNI request followed closely by a second KNI
> request before the rte_kni_handle_request() has had a chance to process the
> first request.
>
> The kernel dev driver simply returns the error value back to the caller if
> it is less than zero.
>
>
> Eric
>

To try to reproduce the issue, I did the following:

Created a QEMU VM with four cores.
Ran the KNI sample application.

Then:

1. Ran "ifconfig vEth0_0 down" followed immediately by "ifconfig vEth0_0 up".

2. Did the same from a sample userspace C application that calls ioctl
twice in a row.


Neither reproduced the problem.

IMHO the only way to reproduce it is either to run the ioctl from two
parallel threads, or perhaps to give real-time FIFO priority to the
application that calls ioctl twice in a row.

Elad
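
For reference, a minimal sketch of the second reproduction attempt described
above: one process takes the interface down and immediately up again with
two back-to-back SIOCSIFFLAGS ioctls. This is not the test program that was
actually used; the helper names (with_up, set_link, bounce_link) are
hypothetical, and changing the flags requires CAP_NET_ADMIN.

```c
#define _DEFAULT_SOURCE
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return flags with IFF_UP set (up != 0) or cleared (up == 0). */
short with_up(short flags, int up)
{
	return up ? (short)(flags | IFF_UP) : (short)(flags & ~IFF_UP);
}

/* Set or clear IFF_UP on ifname. Returns 0 on success, -1 with errno set. */
int set_link(int fd, const char *ifname, int up)
{
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0)
		return -1;
	ifr.ifr_flags = with_up(ifr.ifr_flags, up);
	return ioctl(fd, SIOCSIFFLAGS, &ifr) < 0 ? -1 : 0;
}

/* Take the interface down and bring it up again without any delay. */
int bounce_link(const char *ifname)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int ret;

	if (fd < 0)
		return -1;
	ret = set_link(fd, ifname, 0);
	if (ret == 0)
		ret = set_link(fd, ifname, 1);
	close(fd);
	return ret;
}
```

Calling bounce_link("vEth0_0") from a single thread corresponds to attempt 2.
Running the same call from two threads at once would be the parallel variant
suggested above.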


  
Ferruh Yigit Oct. 8, 2021, 8:23 p.m. UTC | #13
On 10/4/2021 7:27 PM, Elad Nachman wrote:
> On Mon, Oct 4, 2021, 20:00, Eric Christian <erclists@gmail.com> wrote:
> 
>> I am not sure that only we can recreate the KNI request overwrite.  We may
>> be the only ones with a current use case that exposes the vulnerability.
>>   It is possible for any KNI operation to encounter this issue with the new
>> async mechanism.  As long as the call to kni_net_process_request() is a
>> separate thread from rte_kni_handle_request() this has the potential to
>> occur with the use of async requests.
>>
>> All you need is one async KNI request followed closely by a second KNI
>> request before the rte_kni_handle_request() has had a chance to process the
>> first request.
>>
>> The kernel dev driver simply returns the error value back to the caller if
>> it is less than zero.
>>
>>
>> Eric
>>
> 
> What I did in order to attempt to recreate this issue was:
> 
> Create a qemu VM with four cores.
> Run the KNI sample application.
> 
> Then:
> 
> 1. Run ifconfig vEth0_0 down followed immediately by ifconfig vEth0_0 up
> 
> 2. Same as above but in a sample userspace C application which calls ioctl
> twice in a row to achieve the same.
> 
> 
> Both did not recreate the problem.
> 
> The only way to recreate it IMHO is to either:
>   run the ioctl from two parallel threads or perhaps assign RT fifo priority
> to the application calling ioctl in a row.
> 

In the KNI sample app, 'rte_kni_handle_request()' is called in an endless loop in the
datapath, which may make the issue hard to reproduce.

But what Eric described is a valid problem and can happen. Reducing the frequency of the
'rte_kni_handle_request()' calls should make it easier to reproduce.
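
To illustrate the overwrite and the guard added by the patch, here is a
single-threaded model in plain C. It is only a sketch: struct model_req,
model_post() and model_handle() are hypothetical stand-ins for
struct rte_kni_request, kni_net_process_request() and
rte_kni_handle_request(), and it ignores the FIFO, locking and memory
ordering of the real code.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct rte_kni_request. */
struct model_req {
	uint32_t req_id;
	uint32_t req_in_progress;
};

/* Producer side: refuse to overwrite a request that is still pending. */
int model_post(struct model_req *slot, uint32_t req_id)
{
	struct model_req req = { .req_id = req_id, .req_in_progress = 1 };

	if (slot->req_in_progress)
		return -EAGAIN;
	memcpy(slot, &req, sizeof(req));
	return 0;
}

/* Consumer side: process the pending request, then release the slot. */
uint32_t model_handle(struct model_req *slot)
{
	uint32_t handled = slot->req_id;

	slot->req_in_progress = 0;
	return handled;
}
```

If model_post() is called twice before model_handle() runs, which is what a
less frequent handler call makes more likely, the second call gets -EAGAIN
instead of silently replacing the first request.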

  
Ferruh Yigit Oct. 8, 2021, 9:11 p.m. UTC | #14
On 10/4/2021 5:18 PM, Elad Nachman wrote:
> On Mon, Oct 4, 2021 at 7:05 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> 
>> On 10/4/2021 3:58 PM, Elad Nachman wrote:
>>> On Mon, Oct 4, 2021, 17:51, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>>>
>>>> On 10/4/2021 3:25 PM, Elad Nachman wrote:
>>>>
>>>> Can you please try to not top post, it will make impossible to follow
>> this
>>>> discussion later from the mail archives.
>>>>
>>>>> 1. Userspace will get an error
>>>>
>>>> So there is nothing special with returning '-EAGAIN', user will only
>>>> observe an
>>>> error.
>>>> Wasn't initial intention to use '-EAGAIN' to try request again?
>>>>
>>> To signal user-space to retry the operation.
>>>
>>
>> Not sure if it will reach to the end user. If user is calling "ifconfig
>> <iface>
>> down", it will just fail right, it won't recognize the error type.
>>
>> Unless this is common usage by the Linux network drivers, having this
>> usage in
>> KNI won't help much. I am for handling this in the kernel side if we can.
>>
>>
> If user calls ifconfig <iface> down it will not happen. It requires some
> multi-core race condition only Eric can recreate.
> > 
>>>>
>>>>> 2. Waiting with rtnl locked causes a deadlock; waiting with rtnl
>> unlocked
>>>>> for interface down command causes a crash because of a race condition
>> in
>>>>> the device delete/unregister list in the kernel.
>>>>>
>>>>
>>>> Why waiting with rthnl lock causes a deadlock? As said below we are
>> already
>>>> doing it, why it is different with retry logic?
>>>>
>>> Because it can be interface down request.
>>>
>>
>> (sure you like short answers)
>>
>> Please help me to see why "interface down" is special. Isn't it point of
>> your
>> patch to wait the request execution in the userspace even it is an async
>> request?
>>
>> And yet again, number of retry can be limited.
>>
>>
> No, it is not. Please look again:
> https://patches.dpdk.org/project/dpdk/patch/20210924105409.21711-1-eladv6@gmail.com/
> 

It is still not clear to me why EAGAIN cannot be handled within the KNI module.


Another problem is that a kernel relying on userspace to keep processing
requests is error prone; we need some escape mechanism, such as a
timeout.

Anyway, I will send a patch to disable bifurcated device support by default.
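
As a rough illustration of the bounded retry discussed in this thread, the
sketch below repeats a request while it returns -EAGAIN, up to a fixed
number of attempts. It is plain user-space C, not KNI code:
submit_with_retry(), request_fn and busy_n_times() are hypothetical names,
and a kernel version would also need to sleep between attempts and to
settle the rtnl locking question raised above.

```c
#include <errno.h>

/* Hypothetical request callback: returns 0, -EAGAIN or another -errno. */
typedef int (*request_fn)(void *ctx);

/* Call fn until it stops returning -EAGAIN or max_tries attempts are used. */
int submit_with_retry(request_fn fn, void *ctx, int max_tries)
{
	int ret = -EAGAIN;
	int i;

	for (i = 0; i < max_tries && ret == -EAGAIN; i++)
		ret = fn(ctx);
	return ret;
}

/* Demo callback: reports -EAGAIN until the counter in ctx reaches zero. */
int busy_n_times(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EAGAIN;
	}
	return 0;
}
```

The attempt limit acts as the escape mechanism: if userspace never clears
the pending request, the caller still gets -EAGAIN back after max_tries
attempts instead of waiting forever.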

  

Patch

diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
index 611719b5ee..927bf9537c 100644
--- a/kernel/linux/kni/kni_net.c
+++ b/kernel/linux/kni/kni_net.c
@@ -110,6 +110,7 @@  kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 	void *resp_va;
 	uint32_t num;
 	int ret_val;
+	struct rte_kni_request *cur_req;
 
 	ASSERT_RTNL();
 
@@ -123,7 +124,15 @@  kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 
 	mutex_lock(&kni->sync_lock);
 
+	/* Check that existing request has been processed: */
+	cur_req = (struct rte_kni_request *)kni->sync_kva;
+	if (cur_req->req_in_progress) {
+		ret = -EAGAIN;
+		goto fail;
+	}
+
 	/* Construct data */
+	req->req_in_progress = 1;
 	memcpy(kni->sync_kva, req, sizeof(struct rte_kni_request));
 	num = kni_fifo_put(kni->req_q, &kni->sync_va, 1);
 	if (num < 1) {
diff --git a/lib/kni/rte_kni.c b/lib/kni/rte_kni.c
index d3e236005e..0599e0356a 100644
--- a/lib/kni/rte_kni.c
+++ b/lib/kni/rte_kni.c
@@ -307,6 +307,7 @@  rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
 	kni->sync_addr = kni->m_sync_addr->addr;
 	dev_info.sync_va = kni->m_sync_addr->addr;
 	dev_info.sync_phys = kni->m_sync_addr->iova;
+	memset(kni->sync_addr, 0, sizeof(struct rte_kni_request));
 
 	kni->pktmbuf_pool = pktmbuf_pool;
 	kni->group_id = conf->group_id;
@@ -596,6 +597,7 @@  rte_kni_handle_request(struct rte_kni *kni)
 		ret = kni_fifo_put(kni->resp_q, (void **)&req, 1);
 	else
 		ret = 1;
+	req->req_in_progress = 0;
 	if (ret != 1) {
 		RTE_LOG(ERR, KNI, "Fail to put the muf back to resp_q\n");
 		return -1; /* It is an error of can't putting the mbuf back */
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index b547ea5501..1973e467f9 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -40,6 +40,7 @@  enum rte_kni_req_id {
  */
 struct rte_kni_request {
 	uint32_t req_id;             /**< Request id */
+	uint32_t req_in_progress;    /**< Request in progress flag */
 	RTE_STD_C11
 	union {
 		uint32_t new_mtu;    /**< New MTU */