[RFC,v2,2/2] eal: introduce random generator function with upper bound
Commit Message
Add a function rte_rand_max() which generates a uniformly distributed
pseudo-random number less than a user-specified upper bound.
The commonly used pattern rte_rand() % SOME_VALUE, in addition to
being slow, also creates biased results if SOME_VALUE is not a power
of 2. This bias is very small for small SOME_VALUE, but increases
linearly with larger SOME_VALUE.
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
lib/librte_eal/common/include/rte_random.h | 16 ++++++++++++++++
lib/librte_eal/common/rte_random.c | 20 ++++++++++++++++++++
lib/librte_eal/rte_eal_version.map | 1 +
3 files changed, 37 insertions(+)
Comments
> On Apr 19, 2019, at 5:21 PM, Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
> Add a function rte_rand_max() which generates a uniformly distributed
> pseudo-random number less than a user-specified upper bound.
>
> The commonly used pattern rte_rand() % SOME_VALUE, in addition to
> being slow, also creates biased results if SOME_VALUE is not a power
> of 2. This bias is very small for small SOME_VALUE, but increases
> linearly with larger SOME_VALUE.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
> lib/librte_eal/common/include/rte_random.h | 16 ++++++++++++++++
> lib/librte_eal/common/rte_random.c | 20 ++++++++++++++++++++
> lib/librte_eal/rte_eal_version.map | 1 +
> 3 files changed, 37 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/rte_random.h b/lib/librte_eal/common/include/rte_random.h
> index 66dfe8ae7..053912168 100644
> --- a/lib/librte_eal/common/include/rte_random.h
> +++ b/lib/librte_eal/common/include/rte_random.h
> @@ -47,6 +47,22 @@ rte_srand(uint64_t seedval);
> uint64_t
> rte_rand(void);
>
> +/**
> + * Generates a pseudo-random number less than upper_bound.
> + *
> + * This function returns a uniformly distributed (unbiased) random
> + * number lower than a user-specified maximum value.
> + *
> + * If called from lcore threads, this function is thread-safe.
> + *
> + * @param upper_bound
> + * The upper bound of the generated number.
> + * @return
> + * A pseudo-random value between 0 and (upper_bound-1).
> + */
> +uint64_t
> +rte_rand_max(uint64_t upper_bound);
> +
> #ifdef __cplusplus
> }
> #endif
> diff --git a/lib/librte_eal/common/rte_random.c b/lib/librte_eal/common/rte_random.c
> index 288e7b8c5..bf9240470 100644
> --- a/lib/librte_eal/common/rte_random.c
> +++ b/lib/librte_eal/common/rte_random.c
> @@ -131,6 +131,26 @@ rte_rand(void)
> return __rte_rand_lfsr258(state);
> }
>
> +uint64_t __rte_experimental
> +rte_rand_max(uint64_t upper_bound)
> +{
> + uint8_t zeros;
> + uint64_t mask = ~((uint64_t)0);
> + uint64_t res;
> +
> + if (upper_bound < 2)
> + return 0;
> +
> + zeros = __builtin_clzll(upper_bound);
> + mask >>= zeros;
> +
> + do {
> + res = rte_rand() & mask;
> + } while (unlikely(res >= upper_bound));
My concern here is the number of times this loop could be executed. The upper bound could be just above a power of 2 while still being a large number, meaning the number of values between upper_bound and the next power of 2 could be huge. Am I reading this loop correctly? If my thought process is correct, it could take a long time to get a value less than upper_bound, correct?
If everything aligns in a bad way, it could take a large number of loop iterations and cause all kinds of problems. What could be done here, or is this a non-issue?
> +
> + return res;
> +}
> +
> RTE_INIT(rte_rand_init)
> {
> rte_srand(rte_get_timer_cycles());
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index 0d60668fa..8f5b8c6a6 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -367,6 +367,7 @@ EXPERIMENTAL {
> rte_mp_sendmsg;
> rte_option_register;
> rte_rand;
> + rte_rand_max;
> rte_realloc_socket;
> rte_service_lcore_attr_get;
> rte_service_lcore_attr_reset_all;
> --
> 2.17.1
>
Regards,
Keith
On 2019-04-20 23:08, Wiles, Keith wrote:
>> +uint64_t __rte_experimental
>> +rte_rand_max(uint64_t upper_bound)
>> +{
>> + uint8_t zeros;
>> + uint64_t mask = ~((uint64_t)0);
>> + uint64_t res;
>> +
>> + if (upper_bound < 2)
>> + return 0;
>> +
>> + zeros = __builtin_clzll(upper_bound);
>> + mask >>= zeros;
>> +
>> + do {
>> + res = rte_rand() & mask;
>> + } while (unlikely(res >= upper_bound));
>
> My concern here is the number of times this loop could be executed. The upper bound could be just above a power of 2 while still being a large number, meaning the number of values between upper_bound and the next power of 2 could be huge. Am I reading this loop correctly? If my thought process is correct, it could take a long time to get a value less than upper_bound, correct?
>
If the number of values over the limit is huge, so is the value range
below it.
> If everything aligns in a bad way, it could take a large number of loop iterations and cause all kinds of problems. What could be done here, or is this a non-issue?
Formally, the execution has no upper bound, but in practice I think this
is a non-issue.
The cycle cost of an iteration of the loop (including the rte_rand()
call) is ~10 cc. With the worst-case upper_bound, the probability of
having to redo the rte_rand() call is ~0.5 on every iteration. This is
with a pseudo-random number generator featuring a uniform distribution.
Let's assume we care about determinism, and we have a system which
suffers a catastrophic failure if the rte_rand_max() latency is more
than 10k clock cycles. The system performs 10M rte_rand_max() calls per
second.
The probability of this system running for a year without any
rte_rand_max()-related crashes is:
0.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999970568598526
So crashes due to excessive rte_rand_max() latency may occur, but this
risk is dwarfed by that of the system being hit by an asteroid.
(Btw, I doubt even the people in our sales force have seen that many
nines before.)
From a performance point of view, the high-loop-count cases are rare
enough not to pose a serious threat. For example, being forced to redo
rte_rand() more than five times is only a ~3% risk.
> On Apr 21, 2019, at 2:05 PM, Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
> On 2019-04-20 23:08, Wiles, Keith wrote:
>
>>> +uint64_t __rte_experimental
>>> +rte_rand_max(uint64_t upper_bound)
>>> +{
>>> + uint8_t zeros;
>>> + uint64_t mask = ~((uint64_t)0);
>>> + uint64_t res;
>>> +
>>> + if (upper_bound < 2)
>>> + return 0;
>>> +
>>> + zeros = __builtin_clzll(upper_bound);
>>> + mask >>= zeros;
>>> +
>>> + do {
>>> + res = rte_rand() & mask;
>>> + } while (unlikely(res >= upper_bound));
>> My concern here is the number of times this loop could be executed. The upper bound could be just above a power of 2 while still being a large number, meaning the number of values between upper_bound and the next power of 2 could be huge. Am I reading this loop correctly? If my thought process is correct, it could take a long time to get a value less than upper_bound, correct?
>
> If the number of values over the limit is huge, so is the value range below it.
>
>> If everything aligns in a bad way, it could take a large number of loop iterations and cause all kinds of problems. What could be done here, or is this a non-issue?
>
> Formally, the execution has no upper bound, but in practice I think this is a non-issue.
>
> The cycle cost of an iteration of the loop (including the rte_rand() call) is ~10 cc. With the worst-case upper_bound, the probability of having to redo the rte_rand() call is ~0.5 on every iteration. This is with a pseudo-random number generator featuring a uniform distribution.
>
> Let's assume we care about determinism, and we have a system which suffers a catastrophic failure if the rte_rand_max() latency is more than 10k clock cycles. The system performs 10M rte_rand_max() calls per second.
>
> The probability of this system running for a year without any rte_rand_max()-related crashes is:
> 0.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999970568598526
>
> So crashes due to excessive rte_rand_max() latency may occur, but this risk is dwarfed by that of the system being hit by an asteroid.
>
> (Btw, I doubt even the people in our sales force have seen that many nines before.)
>
> From a performance point of view, the high-loop-count cases are rare enough not to pose a serious threat. For example, being forced to redo rte_rand() more than five times is only a ~3% risk.
Even a few loops can have an effect on performance when we are talking about microseconds, and it leads to indeterminate results. The numbers you reported here are interesting, but I would be happier if you added a limit to the loop. If you state the likelihood of doing 5 loops is only 3%, then adding a loop limit would be reasonable, right?
On 2019-04-22 06:33, Wiles, Keith wrote:
>> From a performance point of view, the high-loop-count cases are rare enough not to pose a serious threat. For example, being forced to redo rte_rand() more than five times is only a ~3% risk.
>
> Even a few loops can have an effect on performance when we are talking about microseconds, and it leads to indeterminate results. The numbers you reported here are interesting, but I would be happier if you added a limit to the loop. If you state the likelihood of doing 5 loops is only 3%, then adding a loop limit would be reasonable, right?
>
Probability already effectively limits the loop. The risk of being
stuck for >1 us is p ≈ 6e-73. The variation in execution time will in
most cases be less than an LLC miss.
A loop variable would not improve performance; it would pollute the
code and hurt uniformity.
Here's what rte_rand_max() performance looks like on my Skylake.
Average rte_rand_max() latency with worst-case upper_bound:
rte_rand_max() w/o loop limit: 47 cc
rte_rand_max() w/ max 8 retries: 49 cc
rte_rand_max() w/ max 4 retries: 47 cc
rte_rand_max() w/ max 2 retries: 40 cc
So you need to be very aggressive in limiting the loop count for that
loop variable to pay off. Otherwise, you will just be at a loss, doing
all that bookkeeping which very rarely turns out to be useful.
> On Apr 22, 2019, at 2:07 AM, Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
> On 2019-04-22 06:33, Wiles, Keith wrote:
>
>>> From a performance point of view, the high-loop-count cases are rare enough not to pose a serious threat. For example, being forced to redo rte_rand() more than five times is only a ~3% risk.
>> Even a few loops can have an effect on performance when we are talking about microseconds, and it leads to indeterminate results. The numbers you reported here are interesting, but I would be happier if you added a limit to the loop. If you state the likelihood of doing 5 loops is only 3%, then adding a loop limit would be reasonable, right?
>
> Probability already effectively limits the loop. The risk of being stuck for >1 us is p ≈ 6e-73. The variation in execution time will in most cases be less than an LLC miss.
>
> A loop variable would not improve performance; it would pollute the code and hurt uniformity.
>
> Here's what rte_rand_max() performance looks like on my Skylake.
>
> Average rte_rand_max() latency with worst-case upper_bound:
> rte_rand_max() w/o loop limit: 47 cc
> rte_rand_max() w/ max 8 retries: 49 cc
> rte_rand_max() w/ max 4 retries: 47 cc
> rte_rand_max() w/ max 2 retries: 40 cc
>
> So you need to be very aggressive in limiting the loop count for that loop variable to pay off. Otherwise, you will just be at a loss, doing all that bookkeeping which very rarely turns out to be useful.
Adding a loop counter is not a performance problem (as you stated, only a few cycles are added), and it does not hurt the readability of the code. As for hurting uniformity, I do not see that being a problem: most loops have a visible way to terminate, instead of relying on a random number generator to produce a valid value within a few iterations. It is only reasonable, IMO, to limit the loop.
But I will let it go and let you add this code to DPDK if someone else wants to ACK it.
Regards,
Keith
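For reference, the bounded-retry variant discussed in this thread might be sketched as follows (hypothetical code, not the submitted patch; rand64() is a stand-in xorshift generator for rte_rand(), and the fallback after the retry limit reintroduces a small modulo bias):

```c
#include <stdint.h>

#define MAX_RETRIES 8

static uint64_t rand_state = 0x123456789abcdef1ULL;

/* xorshift64: a stand-in for rte_rand(), for illustration only */
static uint64_t rand64(void)
{
	rand_state ^= rand_state << 13;
	rand_state ^= rand_state >> 7;
	rand_state ^= rand_state << 17;
	return rand_state;
}

/* Rejection sampling as in the patch, but with a bounded number of
 * retries; after MAX_RETRIES rejections, fall back to the (slightly
 * biased) modulo reduction so latency is strictly bounded. */
static uint64_t
rand_max_bounded(uint64_t upper_bound)
{
	uint64_t mask = ~(uint64_t)0;
	uint64_t res;
	int tries;

	if (upper_bound < 2)
		return 0;

	mask >>= __builtin_clzll(upper_bound);

	for (tries = 0; tries < MAX_RETRIES; tries++) {
		res = rand64() & mask;
		if (res < upper_bound)
			return res;
	}
	return rand64() % upper_bound;
}
```

As the benchmark numbers above suggest, such a retry cap changes average latency very little, while the rare fallback path gives up perfect uniformity, which is exactly the trade-off debated in this thread.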