[v8,4/5] app/testpmd: report lcore usage

Message ID 20230202134329.539625-5-rjarry@redhat.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Series: lcore telemetry improvements

Checks

Context         Check     Description
ci/checkpatch   success   coding style OK

Commit Message

Robin Jarry Feb. 2, 2023, 1:43 p.m. UTC
  Reuse the --record-core-cycles option to account for busy cycles. One
turn of packet_fwd_t is considered "busy" if there was at least one
received or transmitted packet.

Add a new busy_cycles field in struct fwd_stream. Update get_end_cycles
to accept an additional argument for the number of processed packets.
Update fwd_stream.busy_cycles when the number of packets is greater than
zero.

When --record-core-cycles is specified, register a callback with
rte_lcore_register_usage_cb(). In the callback, use the new lcore_id
field in struct fwd_lcore to identify the correct index in fwd_lcores
and return the sum of busy/total cycles of all fwd_streams.

This makes the cycles counters available in rte_lcore_dump() and the
lcore telemetry API:

 testpmd> dump_lcores
 lcore 3, socket 0, role RTE, cpuset 3
 lcore 4, socket 0, role RTE, cpuset 4, busy cycles 1228584096/9239923140
 lcore 5, socket 0, role RTE, cpuset 5, busy cycles 1255661768/9218141538

 --> /eal/lcore/info,4
 {
   "/eal/lcore/info": {
     "lcore_id": 4,
     "socket": 0,
     "role": "RTE",
     "cpuset": [
       4
     ],
     "busy_cycles": 10623340318,
     "total_cycles": 55331167354
   }
 }

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
Reviewed-by: Kevin Laatz <kevin.laatz@intel.com>
---

Notes:
    v7 -> v8: no change

 app/test-pmd/5tswap.c         |  5 +++--
 app/test-pmd/csumonly.c       |  6 +++---
 app/test-pmd/flowgen.c        |  2 +-
 app/test-pmd/icmpecho.c       |  6 +++---
 app/test-pmd/iofwd.c          |  5 +++--
 app/test-pmd/macfwd.c         |  5 +++--
 app/test-pmd/macswap.c        |  5 +++--
 app/test-pmd/noisy_vnf.c      |  4 ++++
 app/test-pmd/rxonly.c         |  5 +++--
 app/test-pmd/shared_rxq_fwd.c |  5 +++--
 app/test-pmd/testpmd.c        | 40 ++++++++++++++++++++++++++++++++++-
 app/test-pmd/testpmd.h        | 14 ++++++++----
 app/test-pmd/txonly.c         |  7 +++---
 13 files changed, 82 insertions(+), 27 deletions(-)
  

Comments

Chengwen Feng Feb. 6, 2023, 3:31 a.m. UTC | #1
I suggest adding a "busy ratio" field to avoid hand computation and improve readability.
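
As an illustration, such a ratio could be derived from the two counters
already returned by the usage callback (hypothetical helper, not part of this
series):

static double
lcore_busy_ratio(const struct rte_lcore_usage *usage)
{
	/* avoid dividing by zero before any cycles have been recorded */
	if (usage->total_cycles == 0)
		return 0.0;
	return (100.0 * (double)usage->busy_cycles) / (double)usage->total_cycles;
}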

With the above addition, Acked-by: Chengwen Feng <fengchengwen@huawei.com>

On 2023/2/2 21:43, Robin Jarry wrote:
> Reuse the --record-core-cycles option to account for busy cycles. One
> turn of packet_fwd_t is considered "busy" if there was at least one
> received or transmitted packet.
> 
> Add a new busy_cycles field in struct fwd_stream. Update get_end_cycles
> to accept an additional argument for the number of processed packets.
> Update fwd_stream.busy_cycles when the number of packets is greater than
> zero.
> 
> When --record-core-cycles is specified, register a callback with
> rte_lcore_register_usage_cb(). In the callback, use the new lcore_id
> field in struct fwd_lcore to identify the correct index in fwd_lcores
> and return the sum of busy/total cycles of all fwd_streams.
> 
> This makes the cycles counters available in rte_lcore_dump() and the
> lcore telemetry API:
> 
>  testpmd> dump_lcores
>  lcore 3, socket 0, role RTE, cpuset 3
>  lcore 4, socket 0, role RTE, cpuset 4, busy cycles 1228584096/9239923140
>  lcore 5, socket 0, role RTE, cpuset 5, busy cycles 1255661768/9218141538
> 
>  --> /eal/lcore/info,4
>  {
>    "/eal/lcore/info": {
>      "lcore_id": 4,
>      "socket": 0,
>      "role": "RTE",
>      "cpuset": [
>        4
>      ],
>      "busy_cycles": 10623340318,
>      "total_cycles": 55331167354
>    }
>  }
> 
> Signed-off-by: Robin Jarry <rjarry@redhat.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Reviewed-by: Kevin Laatz <kevin.laatz@intel.com>
> ---
> 

...
  
David Marchand Feb. 6, 2023, 8:58 a.m. UTC | #2
Hi Robin,

On Thu, Feb 2, 2023 at 2:44 PM Robin Jarry <rjarry@redhat.com> wrote:
>
> Reuse the --record-core-cycles option to account for busy cycles. One
> turn of packet_fwd_t is considered "busy" if there was at least one
> received or transmitted packet.
>
> Add a new busy_cycles field in struct fwd_stream. Update get_end_cycles
> to accept an additional argument for the number of processed packets.
> Update fwd_stream.busy_cycles when the number of packets is greater than
> zero.
>
> When --record-core-cycles is specified, register a callback with
> rte_lcore_register_usage_cb(). In the callback, use the new lcore_id
> field in struct fwd_lcore to identify the correct index in fwd_lcores
> and return the sum of busy/total cycles of all fwd_streams.
>
> This makes the cycles counters available in rte_lcore_dump() and the
> lcore telemetry API:
>
>  testpmd> dump_lcores
>  lcore 3, socket 0, role RTE, cpuset 3
>  lcore 4, socket 0, role RTE, cpuset 4, busy cycles 1228584096/9239923140
>  lcore 5, socket 0, role RTE, cpuset 5, busy cycles 1255661768/9218141538

I have been playing a bit with this series with two lcores, each one
polling a net/null port.
At first it looked good, but then I started to have one idle lcore, by
asking net/null not to receive anything.

$ build-clang/app/dpdk-testpmd -c 7 --no-huge -m 40 -a 0:0.0 --vdev
net_null1,no-rx=1 --vdev net_null2 -- --no-mlockall
--total-num-mbufs=2048 -ia --record-core-cycles --nb-cores=2

One thing that struck me is that an idle lcore was always showing less
"total_cycles" than a busy one.
The more time testpmd was running, the bigger the divergence between
lcores would be.

Re-reading the API, its semantics are unclear to me (which is the reason for
my comments on patch 2).
Let's first sort out my patch 2 comments, and we can revisit this patch 4
implementation afterwards (I think we are not accounting for some mainloop
cycles with the current implementation).


For now, I have some comments on the existing data structures, see below.

[snip]

> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index e366f81a0f46..105f75ad5f35 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -2053,7 +2053,7 @@ fwd_stats_display(void)
>                                 fs->rx_bad_outer_ip_csum;
>
>                 if (record_core_cycles)
> -                       fwd_cycles += fs->core_cycles;
> +                       fwd_cycles += fs->busy_cycles;
>         }
>         for (i = 0; i < cur_fwd_config.nb_fwd_ports; i++) {
>                 pt_id = fwd_ports_ids[i];
> @@ -2184,6 +2184,7 @@ fwd_stats_reset(void)
>
>                 memset(&fs->rx_burst_stats, 0, sizeof(fs->rx_burst_stats));
>                 memset(&fs->tx_burst_stats, 0, sizeof(fs->tx_burst_stats));
> +               fs->busy_cycles = 0;
>                 fs->core_cycles = 0;
>         }
>  }
> @@ -2260,6 +2261,7 @@ run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
>         tics_datum = rte_rdtsc();
>         tics_per_1sec = rte_get_timer_hz();
>  #endif
> +       fc->lcore_id = rte_lcore_id();

A fwd_lcore object is bound to a single lcore, so this lcore_id is unneeded.


>         fsm = &fwd_streams[fc->stream_idx];
>         nb_fs = fc->stream_nb;
>         do {
> @@ -2288,6 +2290,38 @@ run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
>         } while (! fc->stopped);
>  }
>
> +static int
> +lcore_usage_callback(unsigned int lcore_id, struct rte_lcore_usage *usage)
> +{
> +       struct fwd_stream **fsm;
> +       struct fwd_lcore *fc;
> +       streamid_t nb_fs;
> +       streamid_t sm_id;
> +       int c;
> +
> +       for (c = 0; c < nb_lcores; c++) {
> +               fc = fwd_lcores[c];
> +               if (fc->lcore_id != lcore_id)
> +                       continue;

You can find which fwd_lcore is mapped to a lcore using existing structures.
This requires updating some helper, something like:

diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 7d24d25970..e5297ee7fb 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -785,25 +785,31 @@ is_proc_primary(void)
        return rte_eal_process_type() == RTE_PROC_PRIMARY;
 }

-static inline unsigned int
-lcore_num(void)
+void
+parse_fwd_portlist(const char *port);
+
+static inline struct fwd_lcore *
+lcore_to_fwd_lcore(uint16_t lcore_id)
 {
        unsigned int i;

-       for (i = 0; i < RTE_MAX_LCORE; ++i)
-               if (fwd_lcores_cpuids[i] == rte_lcore_id())
-                       return i;
+       for (i = 0; i < cur_fwd_config.nb_fwd_lcores; ++i) {
+               if (fwd_lcores_cpuids[i] == lcore_id)
+                       return fwd_lcores[i];
+       }

-       rte_panic("lcore_id of current thread not found in fwd_lcores_cpuids\n");
+       return NULL;
 }

-void
-parse_fwd_portlist(const char *port);
-
 static inline struct fwd_lcore *
 current_fwd_lcore(void)
 {
-       return fwd_lcores[lcore_num()];
+       struct fwd_lcore *fc = lcore_to_fwd_lcore(rte_lcore_id());
+
+       if (fc == NULL)
+               rte_panic("lcore_id of current thread not found in fwd_lcores_cpuids\n");
+
+       return fc;
 }

 /* Mbuf Pools */


And then by using this new helper, lcore_usage_callback becomes simpler:

+static int
+lcore_usage_callback(unsigned int lcore_id, struct rte_lcore_usage *usage)
+{
+    struct fwd_stream **fsm;
+    struct fwd_lcore *fc;
+    streamid_t nb_fs;
+    streamid_t sm_id;
+
+    fc = lcore_to_fwd_lcore(lcore_id);
+    if (fc == NULL)
+        return -1;
+
+    fsm = &fwd_streams[fc->stream_idx];
+    nb_fs = fc->stream_nb;
+    usage->busy_cycles = 0;
+    usage->total_cycles = 0;
+
+    for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+        if (fsm[sm_id]->disabled)
+            continue;
+
+        usage->busy_cycles += fsm[sm_id]->busy_cycles;
+        usage->total_cycles += fsm[sm_id]->core_cycles;
+    }
+
+    return 0;
+}
+
  
Robin Jarry Feb. 6, 2023, 9:08 a.m. UTC | #3
David Marchand, Feb 06, 2023 at 09:58:
> I have been playing a bit with this series with two lcores, each one
> polling a net/null port.
> At first it looked good, but then I started to have one idle lcore, by
> asking net/null not to receive anything.
>
> $ build-clang/app/dpdk-testpmd -c 7 --no-huge -m 40 -a 0:0.0 --vdev
> net_null1,no-rx=1 --vdev net_null2 -- --no-mlockall
> --total-num-mbufs=2048 -ia --record-core-cycles --nb-cores=2
>
> One thing that struck me is that an idle lcore was always showing less
> "total_cycles" than a busy one.
> The more time testpmd was running, the bigger the divergence between
> lcores would be.
>
> Re-reading the API, it is unclear to me (which is the reason for my
> comments on patch 2).
> Let's first sort out my patch 2 comments and we may revisit this patch
> 4 implementation afterwards (as I think we are not accounting some
> mainloop cycles with current implementation).

Indeed, we are not accounting for all cycles, only the cycles spent in
the packet_fwd_t functions. This was already the case before my series;
I only added the busy cycles accounting.

However, I agree that this should be updated to take all cycles into
account (as much as is possible with the current code base). Maybe this
could be done as a separate patch, or do you want to include it in this
series?
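
For illustration, one possible direction (a rough sketch only; the
fc->total_cycles field is hypothetical and not part of this series) would be
to time each iteration of the forwarding loop in run_pkt_fwd_on_lcore, so
that idle turns are counted too:

	do {
		uint64_t loop_start_tsc = rte_rdtsc();

		for (sm_id = 0; sm_id < nb_fs; sm_id++) {
			if (!fsm[sm_id]->disabled)
				(*pkt_fwd)(fsm[sm_id]);
		}
		/* account for the whole loop, including idle iterations */
		if (record_core_cycles)
			fc->total_cycles += rte_rdtsc() - loop_start_tsc;
	} while (!fc->stopped);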

> A fwd_lcore object is bound to a single lcore, so this lcore_id is 
> unneeded.
[snip]
> You can find which fwd_lcore is mapped to a lcore using existing 
> structures. This requires updating some helper, something like:

I had missed that. Indeed, no need for a new field. I'll address that in 
v9.
  
David Marchand Feb. 6, 2023, 3:06 p.m. UTC | #4
On Mon, Feb 6, 2023 at 10:08 AM Robin Jarry <rjarry@redhat.com> wrote:
>
> David Marchand, Feb 06, 2023 at 09:58:
> > I have been playing a bit with this series with two lcores, each one
> > polling a net/null port.
> > At first it looked good, but then I started to have one idle lcore, by
> > asking net/null not to receive anything.
> >
> > $ build-clang/app/dpdk-testpmd -c 7 --no-huge -m 40 -a 0:0.0 --vdev
> > net_null1,no-rx=1 --vdev net_null2 -- --no-mlockall
> > --total-num-mbufs=2048 -ia --record-core-cycles --nb-cores=2
> >
> > One thing that struck me is that an idle lcore was always showing less
> > "total_cycles" than a busy one.
> > The more time testpmd was running, the bigger the divergence between
> > lcores would be.
> >
> > Re-reading the API, it is unclear to me (which is the reason for my
> > comments on patch 2).
> > Let's first sort out my patch 2 comments and we may revisit this patch
> > 4 implementation afterwards (as I think we are not accounting some
> > mainloop cycles with current implementation).
>
> Indeed, we are not accounting for all cycles. Only the cycles spent in
> the packet_fwd_t functions. This was already the case before my series
> I only added the busy cycles accounting.

"busy" cycles is what was already present in testpmd under the
core_cycles report existing feature: get_end_cycles was only called
with nb_rx + nb_tx > 0.
The only change with this patch is its internal name, there is no
addition on this topic.


But this patch adds "total_cycles" for testpmd...

>
>
> However, I agree that this should be updated to take all cycles into
> account (as much as it is possible with the current code base). Maybe
> this could be done as a separate patch or do you want to include it in
> this series?

... and its implementation seems non-compliant with the lcore_usage API, as
discussed in patch 2.

As for how many cycles are counted as busy (that is, should we also count
cycles spent in the mainloop), I think it would be better, but that would be
a change to the existing core_cycles report feature.

I'd really like to hear from testpmd maintainers.
  

Patch

diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
index f041a5e1d530..03225075716c 100644
--- a/app/test-pmd/5tswap.c
+++ b/app/test-pmd/5tswap.c
@@ -116,7 +116,7 @@  pkt_burst_5tuple_swap(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 
 	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
@@ -182,7 +182,8 @@  pkt_burst_5tuple_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 1c2459851522..03e141221a56 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -868,7 +868,7 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 
 	fs->rx_packets += nb_rx;
 	rx_bad_ip_csum = 0;
@@ -1200,8 +1200,8 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(tx_pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index fd6abc0f4124..7b2f0ffdf0f5 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -196,7 +196,7 @@  pkt_burst_flow_gen(struct fwd_stream *fs)
 
 	RTE_PER_LCORE(_next_flow) = next_flow;
 
-	get_end_cycles(fs, start_tsc);
+	get_end_cycles(fs, start_tsc, nb_tx);
 }
 
 static int
diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 066f2a3ab79b..2fc9f96dc95f 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -303,7 +303,7 @@  reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 
 	fs->rx_packets += nb_rx;
 	nb_replies = 0;
@@ -508,8 +508,8 @@  reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 			} while (++nb_tx < nb_replies);
 		}
 	}
-
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/iofwd.c b/app/test-pmd/iofwd.c
index 8fafdec548ad..e5a2dbe20c69 100644
--- a/app/test-pmd/iofwd.c
+++ b/app/test-pmd/iofwd.c
@@ -59,7 +59,7 @@  pkt_burst_io_forward(struct fwd_stream *fs)
 			pkts_burst, nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 	fs->rx_packets += nb_rx;
 
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
@@ -84,7 +84,8 @@  pkt_burst_io_forward(struct fwd_stream *fs)
 		} while (++nb_tx < nb_rx);
 	}
 
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index beb220fbb462..9db623999970 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -65,7 +65,7 @@  pkt_burst_mac_forward(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 
 	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
@@ -115,7 +115,8 @@  pkt_burst_mac_forward(struct fwd_stream *fs)
 		} while (++nb_tx < nb_rx);
 	}
 
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index 4f8deb338296..4db134ac1d91 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -66,7 +66,7 @@  pkt_burst_mac_swap(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 
 	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
@@ -93,7 +93,8 @@  pkt_burst_mac_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/noisy_vnf.c b/app/test-pmd/noisy_vnf.c
index c65ec6f06a5c..290bdcda45f0 100644
--- a/app/test-pmd/noisy_vnf.c
+++ b/app/test-pmd/noisy_vnf.c
@@ -152,6 +152,9 @@  pkt_burst_noisy_vnf(struct fwd_stream *fs)
 	uint64_t delta_ms;
 	bool needs_flush = false;
 	uint64_t now;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
 
 	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
 			pkts_burst, nb_pkt_per_burst);
@@ -219,6 +222,7 @@  pkt_burst_noisy_vnf(struct fwd_stream *fs)
 		fs->fwd_dropped += drop_pkts(tmp_pkts, nb_deqd, sent);
 		ncf->prev_time = rte_get_timer_cycles();
 	}
+	get_end_cycles(fs, start_tsc, nb_rx + nb_tx);
 }
 
 #define NOISY_STRSIZE 256
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index d528d4f34e60..519202339e16 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -58,13 +58,14 @@  pkt_burst_receive(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 
 	fs->rx_packets += nb_rx;
 	for (i = 0; i < nb_rx; i++)
 		rte_pktmbuf_free(pkts_burst[i]);
 
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
index 2e9047804b5b..395b73bfe52e 100644
--- a/app/test-pmd/shared_rxq_fwd.c
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -102,9 +102,10 @@  shared_rxq_fwd(struct fwd_stream *fs)
 				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
-		return;
+		goto end;
 	forward_shared_rxq(fs, nb_rx, pkts_burst);
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_rx);
 }
 
 static void
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index e366f81a0f46..105f75ad5f35 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2053,7 +2053,7 @@  fwd_stats_display(void)
 				fs->rx_bad_outer_ip_csum;
 
 		if (record_core_cycles)
-			fwd_cycles += fs->core_cycles;
+			fwd_cycles += fs->busy_cycles;
 	}
 	for (i = 0; i < cur_fwd_config.nb_fwd_ports; i++) {
 		pt_id = fwd_ports_ids[i];
@@ -2184,6 +2184,7 @@  fwd_stats_reset(void)
 
 		memset(&fs->rx_burst_stats, 0, sizeof(fs->rx_burst_stats));
 		memset(&fs->tx_burst_stats, 0, sizeof(fs->tx_burst_stats));
+		fs->busy_cycles = 0;
 		fs->core_cycles = 0;
 	}
 }
@@ -2260,6 +2261,7 @@  run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
 	tics_datum = rte_rdtsc();
 	tics_per_1sec = rte_get_timer_hz();
 #endif
+	fc->lcore_id = rte_lcore_id();
 	fsm = &fwd_streams[fc->stream_idx];
 	nb_fs = fc->stream_nb;
 	do {
@@ -2288,6 +2290,38 @@  run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
 	} while (! fc->stopped);
 }
 
+static int
+lcore_usage_callback(unsigned int lcore_id, struct rte_lcore_usage *usage)
+{
+	struct fwd_stream **fsm;
+	struct fwd_lcore *fc;
+	streamid_t nb_fs;
+	streamid_t sm_id;
+	int c;
+
+	for (c = 0; c < nb_lcores; c++) {
+		fc = fwd_lcores[c];
+		if (fc->lcore_id != lcore_id)
+			continue;
+
+		fsm = &fwd_streams[fc->stream_idx];
+		nb_fs = fc->stream_nb;
+		usage->busy_cycles = 0;
+		usage->total_cycles = 0;
+
+		for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+			if (!fsm[sm_id]->disabled) {
+				usage->busy_cycles += fsm[sm_id]->busy_cycles;
+				usage->total_cycles += fsm[sm_id]->core_cycles;
+			}
+		}
+
+		return 0;
+	}
+
+	return -1;
+}
+
 static int
 start_pkt_forward_on_core(void *fwd_arg)
 {
@@ -4527,6 +4561,10 @@  main(int argc, char** argv)
 		rte_stats_bitrate_reg(bitrate_data);
 	}
 #endif
+
+	if (record_core_cycles)
+		rte_lcore_register_usage_cb(lcore_usage_callback);
+
 #ifdef RTE_LIB_CMDLINE
 	if (init_cmdline() != 0)
 		rte_exit(EXIT_FAILURE,
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 7d24d25970d2..5dbf5d1c465c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -174,7 +174,8 @@  struct fwd_stream {
 #ifdef RTE_LIB_GRO
 	unsigned int gro_times;	/**< GRO operation times */
 #endif
-	uint64_t     core_cycles; /**< used for RX and TX processing */
+	uint64_t busy_cycles; /**< used with --record-core-cycles */
+	uint64_t core_cycles; /**< used with --record-core-cycles */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
 	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
@@ -360,6 +361,7 @@  struct fwd_lcore {
 	streamid_t stream_nb;    /**< number of streams in "fwd_streams" */
 	lcoreid_t  cpuid_idx;    /**< index of logical core in CPU id table */
 	volatile char stopped;   /**< stop forwarding when set */
+	unsigned int lcore_id;   /**< return value of rte_lcore_id() */
 };
 
 /*
@@ -836,10 +838,14 @@  get_start_cycles(uint64_t *start_tsc)
 }
 
 static inline void
-get_end_cycles(struct fwd_stream *fs, uint64_t start_tsc)
+get_end_cycles(struct fwd_stream *fs, uint64_t start_tsc, uint64_t nb_packets)
 {
-	if (record_core_cycles)
-		fs->core_cycles += rte_rdtsc() - start_tsc;
+	if (record_core_cycles) {
+		uint64_t cycles = rte_rdtsc() - start_tsc;
+		fs->core_cycles += cycles;
+		if (nb_packets > 0)
+			fs->busy_cycles += cycles;
+	}
 }
 
 static inline void
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 021624952daa..ad37626ff63c 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -331,7 +331,7 @@  pkt_burst_transmit(struct fwd_stream *fs)
 	struct rte_mbuf *pkt;
 	struct rte_mempool *mbp;
 	struct rte_ether_hdr eth_hdr;
-	uint16_t nb_tx;
+	uint16_t nb_tx = 0;
 	uint16_t nb_pkt;
 	uint16_t vlan_tci, vlan_tci_outer;
 	uint32_t retry;
@@ -392,7 +392,7 @@  pkt_burst_transmit(struct fwd_stream *fs)
 	}
 
 	if (nb_pkt == 0)
-		return;
+		goto end;
 
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_pkt);
 
@@ -426,7 +426,8 @@  pkt_burst_transmit(struct fwd_stream *fs)
 		} while (++nb_tx < nb_pkt);
 	}
 
-	get_end_cycles(fs, start_tsc);
+end:
+	get_end_cycles(fs, start_tsc, nb_tx);
 }
 
 static int