[v5,2/2] doc: add guide for debug and troubleshoot

Message ID 20190121104144.67365-3-vipin.varghese@intel.com
State Superseded, archived
Delegated to: Thomas Monjalon
Headers show
Series
  • doc/howto: add debug and troubleshoot guide
Related show

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Vipin Varghese Jan. 21, 2019, 10:41 a.m.
Add a user guide on debugging and troubleshooting common issues and
bottlenecks found in the sample application model.

Signed-off-by: Vipin Varghese <vipin.varghese@intel.com>
Acked-by: Marko Kovacevic <marko.kovacevic@intel.com>
---
 doc/guides/howto/debug_troubleshoot_guide.rst | 375 ++++++++++++++++++
 doc/guides/howto/index.rst                    |   1 +
 2 files changed, 376 insertions(+)
 create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst

Comments

Thomas Monjalon Jan. 28, 2019, 1:30 a.m. | #1
Hi,

I feel this doc will be updated to provide a complete debug checklist,
and will become useful to many users hopefully.

One general comment about documentation,
It is better to wrap lines logically, for example, always start sentences
at the beginning of a new line. It will make further update patches
simpler to review.

Few more nits below,

21/01/2019 11:41, Vipin Varghese:
> +.. _debug_troubleshoot_via_pmd:

No need of such anchor.

> +
> +Debug & Troubleshoot guide via PMD
> +==================================

Why "via PMD"? Do we use PMD for troubleshooting?
Or is it dedicated to troubleshoot the PMD behaviour?

> +
> +DPDK applications can be designed to run as single thread simple stage to
> +multiple threads with complex pipeline stages. These application can use poll

applications

> +mode devices which helps in offloading CPU cycles. A few models are

help

A colon would be nice at the end of the line before the list.

> +
> +  *  single primary
> +  *  multiple primary
> +  *  single primary single secondary
> +  *  single primary multiple secondary
> +
> +In all the above cases, it is a tedious task to isolate, debug and understand
> +odd behaviour which occurs randomly or periodically. The goal of guide is to
> +share and explore a few commonly seen patterns and behaviour. Then, isolate
> +and identify the root cause via step by step debug at various processing
> +stages.

I don't understand how this introduction is related to "via PMD" in the title.

> +
> +Application Overview
> +--------------------
> +
> +Let us take up an example application as reference for explaining issues and
> +patterns commonly seen. The sample application in discussion makes use of
> +single primary model with various pipeline stages. The application uses PMD
> +and libraries such as service cores, mempool, pkt mbuf, event, crypto, QoS
> +and eth.

"pkt mbuf" can be called simply mbuf, but event, crypto and eth
should be eventdev, cryptodev and ethdev.

> +
> +The overview of an application modeled using PMD is shown in
> +:numref:`dtg_sample_app_model`.
> +
> +.. _dtg_sample_app_model:
> +
> +.. figure:: img/dtg_sample_app_model.*
> +
> +   Overview of pipeline stage of an application
> +
> +Bottleneck Analysis
> +-------------------
> +
> +To debug the bottleneck and performance issues the desired application

missing comma after "issues"?

> +is made to run in an environment matching as below

colon missing

> +
> +#. Linux 64-bit|32-bit
> +#. DPDK PMD and libraries are used

Isn't it always the case with DPDK?

> +#. Libraries and PMD are either static or shared. But not both

Strange assumption. Why would it be both?

> +#. Machine flag optimizations of gcc or compiler are made constant

What do you mean?

> +
> +Is there mismatch in packet rate (received < send)?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +RX Port and associated core :numref:`dtg_rx_rate`.
> +
> +.. _dtg_rx_rate:
> +
> +.. figure:: img/dtg_rx_rate.*
> +
> +   RX send rate compared against Received rate

RX send ?

> +
> +#. Are generic configuration correct?

Are -> Is

> +    -  What is port Speed, Duplex? rte_eth_link_get()
> +    -  Are packets of higher sizes are dropped? rte_eth_get_mtu()

are dropped -> dropped

> +    -  Are only specific MAC received? rte_eth_promiscuous_get()
> +
> +#. Are there NIC specific drops?
> +    -  Check rte_eth_rx_queue_info_get() for nb_desc and scattered_rx
> +    -  Is RSS enabled? rte_eth_dev_rss_hash_conf_get()
> +    -  Are packets spread on all queues? rte_eth_dev_stats()
> +    -  If stats for RX and drops updated on same queue? check receive thread
> +    -  If packet does not reach PMD? check if offload for port and queue
> +       matches to traffic pattern send.
> +
> +#. If problem still persists, this might be at RX lcore thread
> +    -  Check if RX thread, distributor or event rx adapter? these may be
> +       processing less than required
> +    -  Is the application is build using processing pipeline with RX stage? If

is build -> built

> +       there are multiple port-pair tied to a single RX core, try to debug by
> +       using rte_prefetch_non_temporal(). This will intimate the mbuf in cache
> +       is temporary.

I stop nit-picking review here.
Marko, John, please could you check english grammar?
Thanks
Vipin Varghese Jan. 28, 2019, 2:51 p.m. | #2
Hi Thomas,

snipped
> 
> I feel this doc will be updated to provide a complete debug checklist,
An attempt is made to capture commonly seen field issues. That said, I am aware that I will not be able to identify every item for the debug checklist. As time, experience and sharing from the community increase, I am certain this will grow.


snipped
> One general comment about documentation, It is better to wrap lines
> logically, for example, always start sentences at the beginning of a new line. It
> will make further update patches simpler to review.
> 
> Few more nits below,
> 
> 21/01/2019 11:41, Vipin Varghese:
> > +.. _debug_troubleshoot_via_pmd:
I need to cross-check with John or Marko on this, as the PDF generator tool checks the anchor and figure names.

> 
> No need of such anchor.
Please give me time to cross-check.
> 
> > +
> > +Debug & Troubleshoot guide via PMD
> > +==================================
> 
> Why "via PMD"? Do we use PMD for troubleshooting?
I believe yes; we do collect information with the enhanced procinfo tool.

> Or is it dedicated to troubleshoot the PMD behaviour?
I am not clear about this statement. Hence the query: 'Is this dedicated to troubleshooting application, PMD and library use cases?'

> 
> > +
> > +DPDK applications can be designed to run as single thread simple
> > +stage to multiple threads with complex pipeline stages. These
> > +application can use poll
> 
> applications
Ok

> 
> > +mode devices which helps in offloading CPU cycles. A few models are
> 
> help
Ok

> 
> A colon would be nice at the end of the line before the list.
> 
> > +
> > +  *  single primary
> > +  *  multiple primary
> > +  *  single primary single secondary
> > +  *  single primary multiple secondary
> > +
> > +In all the above cases, it is a tedious task to isolate, debug and
> > +understand odd behaviour which occurs randomly or periodically. The
> > +goal of guide is to share and explore a few commonly seen patterns
> > +and behaviour. Then, isolate and identify the root cause via step by
> > +step debug at various processing stages.
> 
> I don't understand how this introduction is related to "via PMD" in the title.
I believe the information is shared in: ```The goal of the guide is to share and explore a few commonly seen patterns and behaviour. Then, isolate and identify the root cause via step by step debug at various processing stages.```

There are multiple ways to design an application to solve the same problem. These depend on the user, platform, scaling factor and target. The various combinations make use of PMDs and libraries. Misconfiguration, or not taking care of the platform, will cause throttling and even drops.

Example: an application designed to run on a single NUMA node is now deployed to run on a multi NUMA model.

snipped
> 
> "pkt mbuf" can be called simply mbuf, but event, crypto and eth should be
> eventdev, cryptodev and ethdev.
Ok. I can make this change.

> 
snipped
> > +To debug the bottleneck and performance issues the desired
> > +application
> 
> missing comma after "issues"?
Ok

> 
> > +is made to run in an environment matching as below
> 
> colon missing
Ok

> 
> > +
> > +#. Linux 64-bit|32-bit
> > +#. DPDK PMD and libraries are used
> 
> Isn't it always the case with DPDK?
> 
> > +#. Libraries and PMD are either static or shared. But not both
> 
> Strange assumption. Why would it be both?
If applications are built only with DPDK libraries, then yes, the assumption is correct. But when applications are built using DPDK as one of the software layers (for example a DPDK network stack, DPDK Suricata, DPDK Hyperscan), as per my understanding this is not true.

> 
> > +#. Machine flag optimizations of gcc or compiler are made constant
> 
> What do you mean?
I can reword as ```DPDK and the application libraries are built with the same flags.```

> 
snipped
> > +
> > +   RX send rate compared against Received rate
> 
> RX send ?
Thanks, I will correct this.

> 
> > +
> > +#. Are generic configuration correct?
> 
> Are -> Is
> 
> > +    -  What is port Speed, Duplex? rte_eth_link_get()
> > +    -  Are packets of higher sizes are dropped? rte_eth_get_mtu()
> 
> are dropped -> dropped
Ok 

snipped
> > +    -  Is the application is build using processing pipeline with RX
> > +stage? If
> 
> is build -> built
Ok

> 
> > +       there are multiple port-pair tied to a single RX core, try to debug by
> > +       using rte_prefetch_non_temporal(). This will intimate the mbuf in cache
> > +       is temporary.
> 
> I stop nit-picking review here.
Thanks, as any form of correction is always good.

> Marko, John, please could you check english grammar?
> Thanks
> 
>
Thomas Monjalon Jan. 28, 2019, 3:59 p.m. | #3
28/01/2019 15:51, Varghese, Vipin:
> Hi Thomas,
> 
> snipped
> > 
> > I feel this doc will be updated to provide a complete debug checklist,
> Attempt is made to capture commonly seen filed issue. Saying so, I am clear that I will not be able to identify all debug check list. As time, experience and sharing increases (from the community), I am certain sure this will grow 

Yes this is what I mean.
We just need to give a good start by explaining well the intent and context.

> > > +Debug & Troubleshoot guide via PMD
> > > +==================================
> > 
> > Why "via PMD"? Do we use PMD for troubleshooting?
> I believe yes, we do collect information with enhanced procinfo tool.
> 
> > Or is it dedicated to troubleshoot the PMD behaviour?
> I am not clear with this statement. Hence is the query 'Is this dedicated to troubleshooting Application. PMD and Library uses cases?'

Sorry I don't understand.
I think you can just remove "via PMD" in the title.

[...]
> > > +  *  single primary
> > > +  *  multiple primary
> > > +  *  single primary single secondary
> > > +  *  single primary multiple secondary
> > > +
> > > +In all the above cases, it is a tedious task to isolate, debug and
> > > +understand odd behaviour which occurs randomly or periodically. The
> > > +goal of guide is to share and explore a few commonly seen patterns
> > > +and behaviour. Then, isolate and identify the root cause via step by
> > > +step debug at various processing stages.
> > 
> > I don't understand how this introduction is related to "via PMD" in the title.
> I believe the information is shared ```The goal of guide is to share and explore a few commonly seen patterns and behaviour. Then, isolate and identify the root cause via step by step debug at various processing stages.'```
> 
> There would multiple ways to design application for solving a same problem. These are depended on user, platform, scaling factor and target. These various combinations make use PMD and libraries. Misconfiguration and not taking care of platform will cause throttling and even drops.
> 
> Example: application designed to run on single is now been deployed to run on multi NUMA model.

Yes, so you are explaining there can be a lot of different scenarios.

[...]
> > > +#. Linux 64-bit|32-bit
> > > +#. DPDK PMD and libraries are used
> > 
> > Isn't it always the case with DPDK?
> > 
> > > +#. Libraries and PMD are either static or shared. But not both
> > 
> > Strange assumption. Why would it be both?
> If applications are only build with DPDK libraries, then yes the assumption is correct. But when applications are build using DPDK as one of software layer (example DPDK network stack, DPDK suricata, DPDK hyperscan)  as per my understanding this is not true.

Sorry I don't understand.
The DPDK libraries are either shared or static, but never mixed.
Anyway why is it significant here?

> > > +#. Machine flag optimizations of gcc or compiler are made constant
> > 
> > What do you mean?
> I can reword as ```DPDK and the application libraries are built with same flags. ```

Why is it significant?
Vipin Varghese Feb. 8, 2019, 9:21 a.m. | #4
Hi Thomas, John and Marko,

I am working on sharing the next version with grammar checks done with Grammarly. Will update ASAP.

Thanks
Vipin Varghese

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Monday, January 28, 2019 9:30 PM
> To: Varghese, Vipin <vipin.varghese@intel.com>
> Cc: Mcnamara, John <john.mcnamara@intel.com>; Kovacevic, Marko
> <marko.kovacevic@intel.com>; dev@dpdk.org; shreyansh.jain@nxp.com;
> Patel, Amol <amol.patel@intel.com>; Padubidri, Sanjay A
> <sanjay.padubidri@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v5 2/2] doc: add guide for debug and
> troubleshoot
> 
> 28/01/2019 15:51, Varghese, Vipin:
> > Hi Thomas,
> >
> > snipped
> > >
> > > I feel this doc will be updated to provide a complete debug
> > > checklist,
> > Attempt is made to capture commonly seen filed issue. Saying so, I am
> > clear that I will not be able to identify all debug check list. As
> > time, experience and sharing increases (from the community), I am
> > certain sure this will grow
> 
> Yes this is what I mean.
> We just need to give a good start by explaining well the intent and context.
> 
> > > > +Debug & Troubleshoot guide via PMD
> > > > +==================================
> > >
> > > Why "via PMD"? Do we use PMD for troubleshooting?
> > I believe yes, we do collect information with enhanced procinfo tool.
> >
> > > Or is it dedicated to troubleshoot the PMD behaviour?
> > I am not clear with this statement. Hence is the query 'Is this dedicated to
> troubleshooting Application. PMD and Library uses cases?'
> 
> Sorry I don't understand.
> I think you can just remove "via PMD" in the title.
> 
> [...]
> > > > +  *  single primary
> > > > +  *  multiple primary
> > > > +  *  single primary single secondary
> > > > +  *  single primary multiple secondary
> > > > +
> > > > +In all the above cases, it is a tedious task to isolate, debug
> > > > +and understand odd behaviour which occurs randomly or
> > > > +periodically. The goal of guide is to share and explore a few
> > > > +commonly seen patterns and behaviour. Then, isolate and identify
> > > > +the root cause via step by step debug at various processing stages.
> > >
> > > I don't understand how this introduction is related to "via PMD" in the
> title.
> > I believe the information is shared ```The goal of guide is to share
> > and explore a few commonly seen patterns and behaviour. Then, isolate
> > and identify the root cause via step by step debug at various
> > processing stages.'```
> >
> > There would multiple ways to design application for solving a same problem.
> These are depended on user, platform, scaling factor and target. These various
> combinations make use PMD and libraries. Misconfiguration and not taking
> care of platform will cause throttling and even drops.
> >
> > Example: application designed to run on single is now been deployed to run
> on multi NUMA model.
> 
> Yes, so you are explaining there can be a lot of different scenarios.
> 
> [...]
> > > > +#. Linux 64-bit|32-bit
> > > > +#. DPDK PMD and libraries are used
> > >
> > > Isn't it always the case with DPDK?
> > >
> > > > +#. Libraries and PMD are either static or shared. But not both
> > >
> > > Strange assumption. Why would it be both?
> > If applications are only build with DPDK libraries, then yes the assumption is
> correct. But when applications are build using DPDK as one of software layer
> (example DPDK network stack, DPDK suricata, DPDK hyperscan)  as per my
> understanding this is not true.
> 
> Sorry I don't understand.
> The DPDK libraries are either shared or static, but never mixed.
> Anyway why is it significant here?
> 
> > > > +#. Machine flag optimizations of gcc or compiler are made
> > > > +constant
> > >
> > > What do you mean?
> > I can reword as ```DPDK and the application libraries are built with
> > same flags. ```
> 
> Why is it significant?
> 
>

Patch

diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
new file mode 100644
index 000000000..868dc6e58
--- /dev/null
+++ b/doc/guides/howto/debug_troubleshoot_guide.rst
@@ -0,0 +1,375 @@ 
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+.. _debug_troubleshoot_via_pmd:
+
+Debug & Troubleshoot guide
+==========================
+
+DPDK applications can be designed to run as a single-threaded simple stage or
+as multiple threads with complex pipeline stages. These applications can use
+poll mode devices, which help in offloading CPU cycles. A few models are:
+
+  *  single primary
+  *  multiple primary
+  *  single primary single secondary
+  *  single primary multiple secondary
+
+In all the above cases, it is a tedious task to isolate, debug and understand
+odd behaviour which occurs randomly or periodically. The goal of this guide
+is to share and explore a few commonly seen patterns and behaviours. Then,
+isolate and identify the root cause via step by step debugging at various
+processing stages.
+
+Application Overview
+--------------------
+
+Let us take up an example application as a reference to explain issues and
+patterns commonly seen. The sample application under discussion makes use of
+the single primary model with various pipeline stages. The application uses
+PMDs and libraries such as service cores, mempool, mbuf, eventdev, cryptodev,
+QoS and ethdev.
+
+The overview of an application modeled using PMD is shown in
+:numref:`dtg_sample_app_model`.
+
+.. _dtg_sample_app_model:
+
+.. figure:: img/dtg_sample_app_model.*
+
+   Overview of pipeline stage of an application
+
+Bottleneck Analysis
+-------------------
+
+To debug the bottleneck and performance issues, the desired application
+is made to run in an environment matching the below:
+
+#. Linux 64-bit|32-bit
+#. DPDK PMDs and libraries are used
+#. Libraries and PMDs are either static or shared, but not both
+#. DPDK and the application libraries are built with the same compiler flags
+
+Is there a mismatch in packet rate (received < sent)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX Port and associated core :numref:`dtg_rx_rate`.
+
+.. _dtg_rx_rate:
+
+.. figure:: img/dtg_rx_rate.*
+
+   Send rate compared against received rate at RX
+
+#. Is the generic configuration correct?
+    -  What is the port speed and duplex? rte_eth_link_get()
+    -  Are packets of larger sizes dropped? rte_eth_dev_get_mtu()
+    -  Are only specific MACs received? rte_eth_promiscuous_get()
+
+#. Are there NIC-specific drops?
+    -  Check rte_eth_rx_queue_info_get() for nb_desc and scattered_rx
+    -  Is RSS enabled? rte_eth_dev_rss_hash_conf_get()
+    -  Are packets spread across all queues? rte_eth_stats_get()
+    -  If stats for RX and drops are updated on the same queue, check the
+       receive thread
+    -  If packets do not reach the PMD, check whether the offloads for the
+       port and queue match the traffic pattern sent.
+
+#. If the problem still persists, it might be at the RX lcore thread
+    -  Check whether the RX thread, distributor or event RX adapter is
+       processing less than required
+    -  Is the application built using a processing pipeline with an RX stage?
+       If there are multiple port-pairs tied to a single RX core, try to
+       debug using rte_prefetch_non_temporal(), which hints that the mbuf in
+       cache is temporary.
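The first-pass arithmetic for the rate-mismatch check above can be sketched in plain C. This is a hedged illustration, not DPDK code: the struct below is a local stand-in whose field names mirror a few counters an application would read via rte_eth_stats_get().

```c
#include <stdint.h>

/* Simplified stand-in for the RX counters an application would read
 * via rte_eth_stats_get(); this struct is only a sketch. */
struct rx_counters {
    uint64_t ipackets;  /* successfully received packets */
    uint64_t imissed;   /* dropped by the NIC because RX queues were full */
    uint64_t ierrors;   /* erroneous received packets */
};

/* Packets the generator sent that the port never accounted for;
 * a positive result points at drops before the NIC counters,
 * e.g. cabling, a switch, or pause-frame throttling. */
uint64_t unaccounted_rx(uint64_t sent, const struct rx_counters *c)
{
    uint64_t seen = c->ipackets + c->imissed + c->ierrors;
    return sent > seen ? sent - seen : 0;
}
```

If the result is zero but imissed dominates, the RX thread is too slow; if rx_nombuf (not shown) grows instead, suspect an undersized mempool.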
+
+Are there packet drops (receive|transmit)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX-TX Port and associated cores :numref:`dtg_rx_tx_drop`.
+
+.. _dtg_rx_tx_drop:
+
+.. figure:: img/dtg_rx_tx_drop.*
+
+   RX-TX drops
+
+#. At RX
+    -  Get the RX queue count: nb_rx_queues using rte_eth_dev_info_get()
+    -  Are there misses or errors? rte_eth_stats_get() for imissed, ierrors,
+       q_errors, rx_nombuf, and rte_mbuf_refcnt_read() for reference counts
+
+#. At TX
+    -  Are you doing bulk TX? Check the application for TX descriptor
+       overhead.
+    -  Are there TX errors? rte_eth_stats_get() for oerrors and q_errors
+    -  Are specific scenarios not releasing mbufs? Check
+       rte_mbuf_refcnt_read() on those packets.
+    -  Is the packet multi-segmented? Check whether the port and queue
+       offload is set.
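The TX-side check above can be sketched the same way. A hedged stand-in, not DPDK code: 'requested' is what the application handed to the TX burst API, while opackets and oerrors mirror counters read via rte_eth_stats_get().

```c
#include <stdint.h>

/* Local stand-in for TX-side counters (sketch only). */
struct tx_counters {
    uint64_t requested;  /* mbufs handed to the TX burst API */
    uint64_t opackets;   /* actually transmitted */
    uint64_t oerrors;    /* failed transmissions reported by the device */
};

/* Packets neither transmitted nor counted as errors were rejected at
 * enqueue time, usually because the TX descriptor ring was full. */
uint64_t tx_ring_rejects(const struct tx_counters *c)
{
    uint64_t done = c->opackets + c->oerrors;
    return c->requested > done ? c->requested - done : 0;
}
```

A persistently positive result suggests the burst size or descriptor count needs revisiting rather than the NIC itself.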
+
+Are there object drops at the producer point for the ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Producer point for ring :numref:`dtg_producer_ring`.
+
+.. _dtg_producer_ring:
+
+.. figure:: img/dtg_producer_ring.*
+
+   Producer point for Rings
+
+#. Performance for the producer
+    -  Fetch the ring type with rte_ring_dump() for flags (RING_F_SP_ENQ)
+    -  If '(burst enqueue - actual enqueue) > 0', check rte_ring_count() or
+       rte_ring_free_count()
+    -  If burst or single enqueue always returns 0 and rte_ring_full() is
+       true, then the next stage is not pulling the content at the desired
+       rate.
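The producer-side reasoning above can be sketched as two small helpers. This is a simplified illustration: 'count' and 'capacity' stand in for what rte_ring_count() and the configured ring size would return; the functions themselves are not DPDK API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Burst-enqueue shortfall: requested minus actually enqueued. */
uint64_t enqueue_shortfall(uint64_t requested, uint64_t enqueued)
{
    return requested > enqueued ? requested - enqueued : 0;
}

/* A persistent shortfall while the ring sits full means the consumer
 * stage, not the producer, is the bottleneck. */
bool consumer_is_bottleneck(uint64_t shortfall, uint64_t count,
                            uint64_t capacity)
{
    return shortfall > 0 && count == capacity;
}
```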
+
+Are there object drops at the consumer point for the ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consumer point for ring :numref:`dtg_consumer_ring`.
+
+.. _dtg_consumer_ring:
+
+.. figure:: img/dtg_consumer_ring.*
+
+   Consumer point for Rings
+
+#. Performance for the consumer
+    -  Fetch the ring type with rte_ring_dump() for flags (RING_F_SC_DEQ)
+    -  If '(burst dequeue - actual dequeue) > 0', check rte_ring_free_count()
+    -  If burst or single dequeue always returns 0, check whether the ring is
+       empty via rte_ring_empty()
+
+Are packets or objects not processed at the desired rate?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Memory objects close to NUMA :numref:`dtg_mempool`.
+
+.. _dtg_mempool:
+
+.. figure:: img/dtg_mempool.*
+
+   Memory objects have to be close to the device per NUMA node
+
+#. Is the performance low?
+    -  Are packets received from multiple NICs? rte_eth_dev_count_all()
+    -  Are NIC interfaces on different sockets? Use rte_eth_dev_socket_id()
+    -  Is the mempool created on the right socket? rte_mempool_create() or
+       rte_pktmbuf_pool_create()
+    -  Are drops on a specific socket? If yes, check whether there are
+       sufficient objects with rte_mempool_get_count() or
+       rte_mempool_avail_count()
+    -  Is 'rte_mempool_get_count() or rte_mempool_avail_count()' zero? The
+       application requires more objects, hence reconfigure the number of
+       elements in rte_mempool_create().
+    -  Is there a single RX thread for multiple NICs? Try having multiple
+       lcores read from fixed interfaces, or we might be hitting the cache
+       limit, so increase cache_size in rte_pktmbuf_pool_create().
+
+#. Is performance low for some scenarios?
+    -  Check whether there are sufficient objects in the mempool with
+       rte_mempool_avail_count()
+    -  Is failure seen for some packets? We might be getting packets with
+       'size > mbuf data size'.
+    -  Is the NIC offload or the application handling multi-segment mbufs?
+       Check whether the special packets are contiguous with
+       rte_pktmbuf_is_contiguous().
+    -  If separate user threads are used to access mempool objects, use
+       rte_mempool_cache_create() for non-DPDK threads.
+    -  Is the error reproducible with 1GB hugepages? If not, try debugging
+       the issue with a lookup table or by locking objects with
+       rte_mem_lock_page().
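The mempool-depletion check above reduces to simple arithmetic. A hedged sketch, not DPDK code: 'avail' stands in for rte_mempool_avail_count() and 'total' for the pool size passed to rte_pktmbuf_pool_create(); the watermark parameter is an assumption for illustration.

```c
#include <stdbool.h>

/* True when fewer than low_pct percent of the objects remain free,
 * i.e. the pool is close to starving the RX path. */
bool mempool_nearly_depleted(unsigned long avail, unsigned long total,
                             unsigned long low_pct)
{
    return avail * 100 < total * low_pct;
}
```

Sampling this periodically (rather than once) distinguishes a transient burst from a leak where in-use objects never return to the pool.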
+
+.. note::
+  A stall in the release of mbufs can occur because
+
+  *  the processing pipeline is too heavy
+  *  there are too many stages
+  *  TX is not transferred at the desired rate
+  *  multi-segment is not offloaded at the TX device
+  *  of application misuse, such as
+      -  not freeing packets
+      -  invalid rte_pktmbuf_refcnt_set
+      -  invalid rte_pktmbuf_prefree_seg
+
+Is there a difference in performance for crypto?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Crypto device and PMD :numref:`dtg_crypto`.
+
+.. _dtg_crypto:
+
+.. figure:: img/dtg_crypto.*
+
+   CRYPTO and interaction with PMD device
+
+#. Is the generic configuration correct?
+    -  Get the total crypto devices with rte_cryptodev_count()
+    -  Cross-check that the software or hardware flags are configured
+       properly with rte_cryptodev_info_get() for feature_flags
+
+#. Is enqueue request > actual enqueue (drops)?
+    -  Is the queue pair set up for the right NUMA node? Check the socket_id
+       in rte_cryptodev_queue_pair_setup().
+    -  Is the session_pool created with the same socket_id as the queue pair?
+       If not, create it on the same NUMA node.
+    -  Is the enqueue thread on the same socket_id as the objects? If not,
+       try to put it on the same NUMA node.
+    -  Are there errors and drops? Check err_count using
+       rte_cryptodev_stats()
+    -  Do multiple threads enqueue or dequeue from the same queue pair? Try
+       debugging with separate threads.
+
+#. Is enqueue rate > dequeue rate?
+    -  Is the dequeue lcore thread on the same socket_id?
+    -  If software crypto is in use, check whether the crypto library is
+       built with the right (SIMD) flags, or whether the queue pair uses the
+       CPU ISA feature_flags AVX|SSE|NEON using rte_cryptodev_info_get()
+    -  Is hardware-assisted crypto showing performance variance? Check
+       whether the hardware is on the same NUMA socket as the queue pair and
+       session pool.
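The NUMA-alignment rule that recurs through the crypto checks above can be expressed as one predicate. A hedged sketch: in a real application the socket ids would come from rte_cryptodev_socket_id(), the socket_id used at queue-pair and session-pool creation, and rte_socket_id() for the running lcore; here they are plain parameters.

```c
#include <stdbool.h>

/* All four locations must sit on the same NUMA socket for the best
 * enqueue/dequeue performance; any mismatch crosses the interconnect. */
bool crypto_numa_aligned(int dev_socket, int qp_socket,
                         int pool_socket, int lcore_socket)
{
    return dev_socket == qp_socket &&
           qp_socket == pool_socket &&
           pool_socket == lcore_socket;
}
```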
+
+Worker functions not giving performance?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Custom worker function :numref:`dtg_distributor_worker`.
+
+.. _dtg_distributor_worker:
+
+.. figure:: img/dtg_distributor_worker.*
+
+   Custom worker function performance drops
+
+#. Performance
+    -  Are thread context switches frequent? Identify the lcore with
+       rte_lcore_id() and the lcore index mapping with rte_lcore_index().
+       Performance is best when the mapping of thread to core is 1:1.
+    -  What are the lcore roles (type or state)? Fetch roles such as RTE,
+       OFF and SERVICE using rte_eal_lcore_role().
+    -  Does the application have multiple functions running on the same
+       service core? Registered functions may be exceeding their desired
+       time slots while running on the same service core.
+    -  Is the function running on an RTE core? Check whether there are
+       conflicting functions running on the same CPU core with
+       rte_thread_get_affinity().
+
+#. Debug
+    -  What is the mode of operation? The master core, lcores, service cores
+       and NUMA count can be fetched with rte_eal_get_configuration().
+    -  Does it occur only in a special scenario? Analyze the run logic with
+       rte_dump_stack(), rte_dump_registers() and rte_memdump() for more
+       insights.
+    -  Is 'perf' showing data-processing or memory stalls in functions?
+       Check the instructions generated for those functions using objdump.
+
+Service functions are not frequent enough?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Service functions on service cores :numref:`dtg_service`.
+
+.. _dtg_service:
+
+.. figure:: img/dtg_service.*
+
+   Functions running on service cores
+
+#. Performance
+    -  Get service core count using rte_service_lcore_count() and compare with
+       result of rte_eal_get_configuration()
+    -  Check that the registered service is available using
+       rte_service_get_by_name(), rte_service_get_count() and
+       rte_service_get_name()
+    -  Is the given service running in parallel on multiple lcores?
+       rte_service_probe_capability() and rte_service_map_lcore_get()
+    -  Is the service running? rte_service_runstate_get()
+
+#. Debug
+    -  Find how many services are running on specific service lcore by
+       rte_service_lcore_count_services()
+    -  Generic debug via rte_service_dump()
+
+Is there a bottleneck in eventdev?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Is the generic configuration correct?
+    -  Get the eventdev device count with rte_event_dev_count()
+    -  Are they created on the correct socket_id? rte_event_dev_socket_id()
+    -  Check HW or SW capabilities with rte_event_dev_info_get() for
+       event_qos, queue_all_types, burst_mode, multiple_queue_port,
+       max_event_queue|dequeue_depth
+    -  Is a packet stuck in a queue? Check for stages (event queues) where
+       packets are looped back to the same or previous stages.
+
+#. Performance drops in enqueue (event count > actual enqueue)?
+    -  Dump the eventdev information with rte_event_dev_dump()
+    -  Check the stats for eventdev queues and ports
+    -  Check the inflight and current queue elements for enqueue|dequeue
+
+How to debug QoS via TM?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+TM on TX interface :numref:`dtg_qos_tx`.
+
+.. _dtg_qos_tx:
+
+.. figure:: img/dtg_qos_tx.*
+
+   Traffic Manager just before TX
+
+#. Is the configuration right?
+    -  Get the current capabilities of the DPDK port, such as max nodes,
+       levels, shaper private, shaper shared, sched_n_children and
+       stats_mask, using rte_tm_capabilities_get()
+    -  Check whether the current leaf nodes are configured identically by
+       fetching leaf_nodes_identical using rte_tm_capabilities_get()
+    -  Get the leaf nodes for a DPDK port with
+       rte_tm_get_number_of_leaf_nodes()
+    -  Check level capabilities with rte_tm_level_capabilities_get() for
+        -  n_nodes max, nonleaf_max, leaf_max
+        -  identical, non_identical
+        -  shaper_private_supported
+        -  stats_mask
+        -  cman WRED packet|byte supported
+        -  cman head drop supported
+    -  Check node capabilities with rte_tm_node_capabilities_get() for
+        -  shaper_private_supported
+        -  stats_mask
+        -  cman WRED packet|byte supported
+        -  cman head drop supported
+    -  Debug via stats with rte_tm_node_stats_update() and
+       rte_tm_node_stats_read()
+
+Is the packet not of the right format?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Packet capture before and after processing :numref:`dtg_pdump`.
+
+.. _dtg_pdump:
+
+.. figure:: img/dtg_pdump.*
+
+   Capture points of Traffic at RX-TX
+
+#. Where to capture packets?
+    -  Enable pdump in the primary process to allow a secondary process to
+       access the queue-pairs for ports. Packets are then copied in the
+       RX|TX callbacks by the secondary process using ring buffers.
+    -  To capture packets in the middle of a pipeline stage, user-specific
+       hooks or callbacks are to be used to copy the packets. These packets
+       can be shared with the secondary process via user-defined custom
+       rings.
+
+Issue still persists?
+~~~~~~~~~~~~~~~~~~~~~
+
+#. Is there custom or vendor-specific offload metadata?
+    -  If from the PMD, check for metadata errors and drops.
+    -  If from the application, check for metadata errors and drops.
+#. Is multi-process used for configuration and data processing?
+    -  Check whether enabling or disabling features from the secondary
+       process is supported.
+#. Are there drops in certain scenarios for packets or objects?
+    -  Check the user private data in objects by dumping the details for
+       debugging.
+
+How to develop custom code to debug?
+------------------------------------
+
+-  For a single process, the debug functionality is to be added in the same
+   process
+-  For multiple processes, the debug functionality can be added to a
+   secondary multi-process application
+
+.. note::
+
+  The primary's debug functions can be invoked via
+    #. Timer call-back
+    #. Service function under a service core
+    #. USR1 or USR2 signal handler
diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index a642a2be1..9527fa84d 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -18,3 +18,4 @@  HowTo Guides
     virtio_user_as_exceptional_path
     packet_capture_framework
     telemetry
+    debug_troubleshoot_guide