mbox series

[RFC,0/2] pmdinfogen: rewrite in Python

Message ID 20200622004503.29036-1-dmitry.kozliuk@gmail.com (mailing list archive)
Headers
Series pmdinfogen: rewrite in Python |

Message

Dmitry Kozlyuk June 22, 2020, 12:45 a.m. UTC
  This is a PoC rewrite of pmdinfogen in Python with missing bits
described below and in commits. Community input is desired.

Pros:

1. Simpler build process without host apps.
2. Less build requirements (host libelf).
3. Easier debugging and maintenance with a high-level language.
4. Easier porting on Windows (only add new object format).

Cons:

1. No standard ELF or COFF module for Python
    (amount of Python code without libelf on par with C code using it).
2. struct rte_pci_id must be synchronized with header file
    (it's a few lines that never change).

There are no built-in or widely used Python libraries for ELF or COFF.
Some ELF-parsing libraries exist on PyPI, but they're not very handy for
the task and their installation would complicate build requirements.
Thus, elf.py implements its own parsing.

COFF support is underway, it's just not included in this RFC.
Amount of code is similar to elf.py.

Build is only tested on Linux x64_64.

If the community deems this RFC worth finishing, there are a few opens:

1. Support for >65K section headers seems present in current pmdinfogen.
    However, the data it reads is not used after. Is it really needed?

2. How much error-handling is required? This is a build-time tool,
    and Python gives nice stacktraces. However, segfaults are possible
    in Python version due to pointer handling. IMO, error checking
    must be just sufficient to prevent silent segfaults.

3. On Unix, pmdinfogen is called for each object file extracted with ar
    from an .a library by a shell script. On Windows, other tools
    have to be used, shell script will not work. On the other hand, COFF
    library format is quite simple. Would it be appropriate for
    pmdinfogen to handle it to avoid intermediate script?

---
Dmitry Kozlyuk (2):
  pmdinfogen: prototype in Python
  build: use Python pmdinfogen

 buildtools/elf.py        | 194 +++++++++++++++++++++++++++++++++++++++
 buildtools/meson.build   |   3 +-
 buildtools/pmdinfogen.py | 144 +++++++++++++++++++++++++++++
 drivers/meson.build      |   2 +-
 4 files changed, 340 insertions(+), 3 deletions(-)
 create mode 100644 buildtools/elf.py
 create mode 100755 buildtools/pmdinfogen.py
  

Comments

Neil Horman June 22, 2020, 12:41 p.m. UTC | #1
On Mon, Jun 22, 2020 at 03:45:01AM +0300, Dmitry Kozlyuk wrote:
> This is a PoC rewrite of pmdinfogen in Python with missing bits
> described below and in commits. Community input is desired.
> 
> Pros:
> 
> 1. Simpler build process without host apps.
> 2. Less build requirements (host libelf).
> 3. Easier debugging and maintenance with a high-level language.
> 4. Easier porting on Windows (only add new object format).
> 
Front matter: It seems reasonable to me to implement this in python, just for
ease of maintenence.

> Cons:
> 
> 1. No standard ELF or COFF module for Python
>     (amount of Python code without libelf on par with C code using it).
Thats not really true, pyelftools is pretty widely used, and packaged for
most(all) major distributions.  Requiring this kicks the can for (2) above down
the road a bit, but I would prefer to see that used rather than a custom
implementation, just to reduce the maint. burden

> 2. struct rte_pci_id must be synchronized with header file
>     (it's a few lines that never change).
> 
I think you can just use ctypes and the native python C interface to just import
the rte_pci_id structure from the native dpdk header to avoid this, no?

> There are no built-in or widely used Python libraries for ELF or COFF.
> Some ELF-parsing libraries exist on PyPI, but they're not very handy for
> the task and their installation would complicate build requirements.
> Thus, elf.py implements its own parsing.
> 
> COFF support is underway, it's just not included in this RFC.
> Amount of code is similar to elf.py.
> 
> Build is only tested on Linux x64_64.
> 
> If the community deems this RFC worth finishing, there are a few opens:
> 
> 1. Support for >65K section headers seems present in current pmdinfogen.
>     However, the data it reads is not used after. Is it really needed?
> 
Its used.
The support you're referring to is primarily the ability to extract the true
number of section headers from the ELF file (in the event that its more than
64k).  Without that, on very large binaries, you may not be able to find the
symbol table in the ELF file.

> 2. How much error-handling is required? This is a build-time tool,
>     and Python gives nice stacktraces. However, segfaults are possible
>     in Python version due to pointer handling. IMO, error checking
>     must be just sufficient to prevent silent segfaults.
> 
I'm not really sure what the question is here.  You need to handle errors in the
parsing process, we can't just have the tool crashing during the build.  How
thats handled is an implementation detail I would think.

> 3. On Unix, pmdinfogen is called for each object file extracted with ar
>     from an .a library by a shell script. On Windows, other tools
>     have to be used, shell script will not work. On the other hand, COFF
>     library format is quite simple. Would it be appropriate for
>     pmdinfogen to handle it to avoid intermediate script?
> 
I suppose you could, but extracting objects from archives seems outside the
scope of what pmdinfogen normally does.  What is the problem with using a shell
script on windows?  I thought WSL had support for executing bash syntax shell
scripts there?  Why not key off an environment variable to make the relevant
scripts do the right thing on your environment?

> ---
> Dmitry Kozlyuk (2):
>   pmdinfogen: prototype in Python
>   build: use Python pmdinfogen
> 
>  buildtools/elf.py        | 194 +++++++++++++++++++++++++++++++++++++++
>  buildtools/meson.build   |   3 +-
>  buildtools/pmdinfogen.py | 144 +++++++++++++++++++++++++++++
>  drivers/meson.build      |   2 +-
>  4 files changed, 340 insertions(+), 3 deletions(-)
>  create mode 100644 buildtools/elf.py
>  create mode 100755 buildtools/pmdinfogen.py
> 
> -- 
> 2.25.4
> 
>
  
Dmitry Kozlyuk June 22, 2020, 7:39 p.m. UTC | #2
On Mon, 22 Jun 2020 08:41:17 -0400, Neil Horman wrote:
> On Mon, Jun 22, 2020 at 03:45:01AM +0300, Dmitry Kozlyuk wrote:
[snip]
> > 1. No standard ELF or COFF module for Python
> >     (amount of Python code without libelf on par with C code using it).  
> Thats not really true, pyelftools is pretty widely used, and packaged for
> most(all) major distributions.  Requiring this kicks the can for (2) above down
> the road a bit, but I would prefer to see that used rather than a custom
> implementation, just to reduce the maint. burden

I must have underestimated pyelftools spread. I'll look into using it. The
less new code the better, I agree here.  Windows users can get it via PyPI.


> > 2. struct rte_pci_id must be synchronized with header file
> >     (it's a few lines that never change).
> >   
> I think you can just use ctypes and the native python C interface to just import
> the rte_pci_id structure from the native dpdk header to avoid this, no?

Not sure I follow. RFC script uses ctypes to describe struct rte_pci_id
in Python. If you're suggesting to create a Python module in C just to include
a header with a single small structure, I'd say it's not worth the trouble.


> > 1. Support for >65K section headers seems present in current pmdinfogen.
> >     However, the data it reads is not used after. Is it really needed?
> >   
> Its used.
> The support you're referring to is primarily the ability to extract the true
> number of section headers from the ELF file (in the event that its more than
> 64k).  Without that, on very large binaries, you may not be able to find the
> symbol table in the ELF file.

I was talking about these fields of struct elf_info:

	/* if Nth symbol table entry has .st_shndx = SHN_XINDEX,
	 * take shndx from symtab_shndx_start[N] instead
	 */
	Elf32_Word   *symtab_shndx_start;
	Elf32_Word   *symtab_shndx_stop;

They are not used after being filled in parse_elf() and their endianness
fixed in the end despite the comment.


> > 2. How much error-handling is required? This is a build-time tool,
> >     and Python gives nice stacktraces. However, segfaults are possible
> >     in Python version due to pointer handling. IMO, error checking
> >     must be just sufficient to prevent silent segfaults.
> >   
> I'm not really sure what the question is here.  You need to handle errors in the
> parsing process, we can't just have the tool crashing during the build.  How
> thats handled is an implementation detail I would think.

Consider a snippet from pmdinfogen.c:

	/* Check if file offset is correct */
	if (hdr->e_shoff > info->size) {
		fprintf(stderr, "section header offset=%lu in file '%s' "
		      "is bigger than filesize=%lu\n",
		      (unsigned long)hdr->e_shoff,
		      filename, info->size);
		return 0;
	}

It is required in C, because otherwise pmdinfogen would crash without
diagnostic. With Python, most errors like this result in a crash with a trace
to the error line and a message from ctypes (or ELF parsing library). I'm
arguing for leaving only the checks that prevent *silent* crashes (this can
happen with ctypes) which are hard to diagnose. Motivation is to keep the
code simple.


> > 3. On Unix, pmdinfogen is called for each object file extracted with ar
> >     from an .a library by a shell script. On Windows, other tools
> >     have to be used, shell script will not work. On the other hand, COFF
> >     library format is quite simple. Would it be appropriate for
> >     pmdinfogen to handle it to avoid intermediate script?
> >   
> I suppose you could, but extracting objects from archives seems outside the
> scope of what pmdinfogen normally does.  What is the problem with using a shell
> script on windows?  I thought WSL had support for executing bash syntax shell
> scripts there?  Why not key off an environment variable to make the relevant
> scripts do the right thing on your environment?

WSL2 is a Linux VM integrated in Windows, you can only cross-compile from
there. Native builds can use Python or PowerShell, which is as capable as Unix
shells, but incompatible with them. To call lib.exe (MSVC analog of ar) is
probably simpler then parsing COFF, yes. So meson could select one of the
following for a Windows target (objects are COFF):

Host     Toolchain  Archive  Script      Extractor
-------  ---------  -------  ---------   ---------
Linux    MinGW      .a       sh          ar
Windows  MinGW      .a       PowerShell  ar
Windows  Clang      .lib     PowerShell  lib
  
Neil Horman June 23, 2020, 11:28 a.m. UTC | #3
On Mon, Jun 22, 2020 at 10:39:46PM +0300, Dmitry Kozlyuk wrote:
> On Mon, 22 Jun 2020 08:41:17 -0400, Neil Horman wrote:
> > On Mon, Jun 22, 2020 at 03:45:01AM +0300, Dmitry Kozlyuk wrote:
> [snip]
> > > 1. No standard ELF or COFF module for Python
> > >     (amount of Python code without libelf on par with C code using it).  
> > Thats not really true, pyelftools is pretty widely used, and packaged for
> > most(all) major distributions.  Requiring this kicks the can for (2) above down
> > the road a bit, but I would prefer to see that used rather than a custom
> > implementation, just to reduce the maint. burden
> 
> I must have underestimated pyelftools spread. I'll look into using it. The
> less new code the better, I agree here.  Windows users can get it via PyPI.
> 
Ack.

> 
> > > 2. struct rte_pci_id must be synchronized with header file
> > >     (it's a few lines that never change).
> > >   
> > I think you can just use ctypes and the native python C interface to just import
> > the rte_pci_id structure from the native dpdk header to avoid this, no?
> 
> Not sure I follow. RFC script uses ctypes to describe struct rte_pci_id
> in Python. If you're suggesting to create a Python module in C just to include
> a header with a single small structure, I'd say it's not worth the trouble.
> 
Yeah, apologies, i misread what you were saying here, and didn't bother to check
the code.  what you're doing makes sense.

> 
> > > 1. Support for >65K section headers seems present in current pmdinfogen.
> > >     However, the data it reads is not used after. Is it really needed?
> > >   
> > Its used.
> > The support you're referring to is primarily the ability to extract the true
> > number of section headers from the ELF file (in the event that its more than
> > 64k).  Without that, on very large binaries, you may not be able to find the
> > symbol table in the ELF file.
> 
> I was talking about these fields of struct elf_info:
> 
> 	/* if Nth symbol table entry has .st_shndx = SHN_XINDEX,
> 	 * take shndx from symtab_shndx_start[N] instead
> 	 */
> 	Elf32_Word   *symtab_shndx_start;
> 	Elf32_Word   *symtab_shndx_stop;
> 
> They are not used after being filled in parse_elf() and their endianness
> fixed in the end despite the comment.
> 
Its been a while since I wrote this, so I need to go back and look closely, but
as you say, these values aren't used after parse_elf completes, but they are(I
think) used to correct the endianess of the section header index table, so that
get_sym_value works properly.  You would (again I think), only note a problem if
you were parsing an ELF file for an architecture with an endianess opposite that
of what you are building on (i.e. an x86 system cross compiling for a powerpc
target).

> 
> > > 2. How much error-handling is required? This is a build-time tool,
> > >     and Python gives nice stacktraces. However, segfaults are possible
> > >     in Python version due to pointer handling. IMO, error checking
> > >     must be just sufficient to prevent silent segfaults.
> > >   
> > I'm not really sure what the question is here.  You need to handle errors in the
> > parsing process, we can't just have the tool crashing during the build.  How
> > thats handled is an implementation detail I would think.
> 
> Consider a snippet from pmdinfogen.c:
> 
> 	/* Check if file offset is correct */
> 	if (hdr->e_shoff > info->size) {
> 		fprintf(stderr, "section header offset=%lu in file '%s' "
> 		      "is bigger than filesize=%lu\n",
> 		      (unsigned long)hdr->e_shoff,
> 		      filename, info->size);
> 		return 0;
> 	}
> 
> It is required in C, because otherwise pmdinfogen would crash without
> diagnostic. With Python, most errors like this result in a crash with a trace
> to the error line and a message from ctypes (or ELF parsing library). I'm
> arguing for leaving only the checks that prevent *silent* crashes (this can
> happen with ctypes) which are hard to diagnose. Motivation is to keep the
> code simple.
> 
Hmm, I'd defer to others opinions on that.  As a build tool it may be ok to just
crash with a backtrace, but I'm personally not a fan of that.  Id rather see a
diagnostic error and have the tool exits with an appropriate exit code, but lets
see what others say.

> 
> > > 3. On Unix, pmdinfogen is called for each object file extracted with ar
> > >     from an .a library by a shell script. On Windows, other tools
> > >     have to be used, shell script will not work. On the other hand, COFF
> > >     library format is quite simple. Would it be appropriate for
> > >     pmdinfogen to handle it to avoid intermediate script?
> > >   
> > I suppose you could, but extracting objects from archives seems outside the
> > scope of what pmdinfogen normally does.  What is the problem with using a shell
> > script on windows?  I thought WSL had support for executing bash syntax shell
> > scripts there?  Why not key off an environment variable to make the relevant
> > scripts do the right thing on your environment?
> 
> WSL2 is a Linux VM integrated in Windows, you can only cross-compile from
> there. Native builds can use Python or PowerShell, which is as capable as Unix
> shells, but incompatible with them. To call lib.exe (MSVC analog of ar) is
> probably simpler then parsing COFF, yes. So meson could select one of the
> following for a Windows target (objects are COFF):
> 
> Host     Toolchain  Archive  Script      Extractor
> -------  ---------  -------  ---------   ---------
> Linux    MinGW      .a       sh          ar
> Windows  MinGW      .a       PowerShell  ar
> Windows  Clang      .lib     PowerShell  lib
I think if I read you right, what you're suggesting here is that meson setup a
build time marco $ARCHIVETOOL, and set it to ar or lib.exe depending on the
build environment, and just have the script in question use $ARCHIVER?  If so,
I'd be good with that.  Do we have to worry about the use of divergent command
line options for each tool as well?

Neil

> 
> -- 
> Dmitry Kozlyuk
>
  
Bruce Richardson June 23, 2020, 11:59 a.m. UTC | #4
On Tue, Jun 23, 2020 at 07:28:06AM -0400, Neil Horman wrote:
> On Mon, Jun 22, 2020 at 10:39:46PM +0300, Dmitry Kozlyuk wrote:
> > On Mon, 22 Jun 2020 08:41:17 -0400, Neil Horman wrote:
> > > On Mon, Jun 22, 2020 at 03:45:01AM +0300, Dmitry Kozlyuk wrote:
> > [snip]

> > 
> > > > 2. How much error-handling is required? This is a build-time tool,
> > > >     and Python gives nice stacktraces. However, segfaults are possible
> > > >     in Python version due to pointer handling. IMO, error checking
> > > >     must be just sufficient to prevent silent segfaults.
> > > >   
> > > I'm not really sure what the question is here.  You need to handle errors in the
> > > parsing process, we can't just have the tool crashing during the build.  How
> > > thats handled is an implementation detail I would think.
> > 
> > Consider a snippet from pmdinfogen.c:
> > 
> > 	/* Check if file offset is correct */
> > 	if (hdr->e_shoff > info->size) {
> > 		fprintf(stderr, "section header offset=%lu in file '%s' "
> > 		      "is bigger than filesize=%lu\n",
> > 		      (unsigned long)hdr->e_shoff,
> > 		      filename, info->size);
> > 		return 0;
> > 	}
> > 
> > It is required in C, because otherwise pmdinfogen would crash without
> > diagnostic. With Python, most errors like this result in a crash with a trace
> > to the error line and a message from ctypes (or ELF parsing library). I'm
> > arguing for leaving only the checks that prevent *silent* crashes (this can
> > happen with ctypes) which are hard to diagnose. Motivation is to keep the
> > code simple.
> > 
> Hmm, I'd defer to others opinions on that.  As a build tool it may be ok to just
> crash with a backtrace, but I'm personally not a fan of that.  Id rather see a
> diagnostic error and have the tool exits with an appropriate exit code, but lets
> see what others say.
>
I'd take a wait-and-see approach here. Obviously individualized errors
messages are better, but for a case like the above where we have a
malformed elf header, I would think that this is such a rare event that
just letting python exception handling manage it should be ok.
 
> > 
> > > > 3. On Unix, pmdinfogen is called for each object file extracted with ar
> > > >     from an .a library by a shell script. On Windows, other tools
> > > >     have to be used, shell script will not work. On the other hand, COFF
> > > >     library format is quite simple. Would it be appropriate for
> > > >     pmdinfogen to handle it to avoid intermediate script?
> > > >   
> > > I suppose you could, but extracting objects from archives seems outside the
> > > scope of what pmdinfogen normally does.  What is the problem with using a shell
> > > script on windows?  I thought WSL had support for executing bash syntax shell
> > > scripts there?  Why not key off an environment variable to make the relevant
> > > scripts do the right thing on your environment?
> > 
> > WSL2 is a Linux VM integrated in Windows, you can only cross-compile from
> > there. Native builds can use Python or PowerShell, which is as capable as Unix
> > shells, but incompatible with them. To call lib.exe (MSVC analog of ar) is
> > probably simpler then parsing COFF, yes. So meson could select one of the
> > following for a Windows target (objects are COFF):
> > 
> > Host     Toolchain  Archive  Script      Extractor
> > -------  ---------  -------  ---------   ---------
> > Linux    MinGW      .a       sh          ar
> > Windows  MinGW      .a       PowerShell  ar
> > Windows  Clang      .lib     PowerShell  lib
> I think if I read you right, what you're suggesting here is that meson setup a
> build time marco $ARCHIVETOOL, and set it to ar or lib.exe depending on the
> build environment, and just have the script in question use $ARCHIVER?  If so,
> I'd be good with that.  Do we have to worry about the use of divergent command
> line options for each tool as well?
> 
It might be better just to do that detection in the python script itself,
rather than in meson, since the script may have to do other behavioural
changes based on the platform.

Regards,
/Bruce
  
Dmitry Kozlyuk July 2, 2020, 12:07 a.m. UTC | #5
On Tue, 23 Jun 2020 07:28:06 -0400 Neil Horman wrote:
> On Mon, Jun 22, 2020 at 10:39:46PM +0300, Dmitry Kozlyuk wrote:
> > On Mon, 22 Jun 2020 08:41:17 -0400, Neil Horman wrote:  
> > > On Mon, Jun 22, 2020 at 03:45:01AM +0300, Dmitry Kozlyuk wrote:  
[snip]  
> > I was talking about these fields of struct elf_info:
> > 
> > 	/* if Nth symbol table entry has .st_shndx = SHN_XINDEX,
> > 	 * take shndx from symtab_shndx_start[N] instead
> > 	 */
> > 	Elf32_Word   *symtab_shndx_start;
> > 	Elf32_Word   *symtab_shndx_stop;
> > 
> > They are not used after being filled in parse_elf() and their endianness
> > fixed in the end despite the comment.
> >   
> Its been a while since I wrote this, so I need to go back and look closely, but
> as you say, these values aren't used after parse_elf completes, but they are(I
> think) used to correct the endianess of the section header index table, so that
> get_sym_value works properly.  You would (again I think), only note a problem if
> you were parsing an ELF file for an architecture with an endianess opposite that
> of what you are building on (i.e. an x86 system cross compiling for a powerpc
> target).

Thanks for the explanation. Anyway, it's yet another point to prefer
pyelftools to parsing ELF manually.


[snip]
> > Native builds can use Python or PowerShell, which is as capable as Unix
> > shells, but incompatible with them. To call lib.exe (MSVC analog of ar) is
> > probably simpler then parsing COFF, yes. So meson could select one of the
> > following for a Windows target (objects are COFF):
> > 
> > Host     Toolchain  Archive  Script      Extractor
> > -------  ---------  -------  ---------   ---------
> > Linux    MinGW      .a       sh          ar
> > Windows  MinGW      .a       PowerShell  ar
> > Windows  Clang      .lib     PowerShell  lib  
> I think if I read you right, what you're suggesting here is that meson setup a
> build time marco $ARCHIVETOOL, and set it to ar or lib.exe depending on the
> build environment, and just have the script in question use $ARCHIVER?  If so,
> I'd be good with that.  Do we have to worry about the use of divergent command
> line options for each tool as well?

As I said, PowerShell cannot consume the same shell script. My suggestion was
to have two scripts, one existing for shell and ar (used by Linux host), and
one new PowerShell script for both lib.exe and ar (used by Windows host).
Meson then would select which script to call. Another option is to have one
wrapper Python script for all hosts: a few extra lines to launch archiver,
but less files and no need to know shell/PowerShell when changing this part.
(I believe Bruce means the same.)  But it will still be a separate script.