[RFC,v2,2/2] eal: introduce random generator function with upper bound

Message ID 20190419212138.17422-3-mattias.ronnblom@ericsson.com
State New
Delegated to: Thomas Monjalon
Headers show
Series
  • Pseudo-number generation improvements
Related show

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Mattias Rönnblom April 19, 2019, 9:21 p.m.
Add a function rte_rand_max() which generates an uniformly distributed
pseudo-random number less than a user-specified upper bound.

The commonly used pattern rte_rand() % SOME_VALUE, in addition to
being slow, also creates biased results if SOME_VALUE is not a power
of 2. This bias is very small for small SOME_VALUE, but increases
linearly with larger SOME_VALUE.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/librte_eal/common/include/rte_random.h | 16 ++++++++++++++++
 lib/librte_eal/common/rte_random.c         | 20 ++++++++++++++++++++
 lib/librte_eal/rte_eal_version.map         |  1 +
 3 files changed, 37 insertions(+)

Comments

Wiles, Keith April 20, 2019, 9:08 p.m. | #1
> On Apr 19, 2019, at 5:21 PM, Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
> Add a function rte_rand_max() which generates an uniformly distributed
> pseudo-random number less than a user-specified upper bound.
> 
> The commonly used pattern rte_rand() % SOME_VALUE, in addition to
> being slow, also creates biased results if SOME_VALUE is not a power
> of 2. This bias is very small for small SOME_VALUE, but increases
> linearly with larger SOME_VALUE.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
> lib/librte_eal/common/include/rte_random.h | 16 ++++++++++++++++
> lib/librte_eal/common/rte_random.c         | 20 ++++++++++++++++++++
> lib/librte_eal/rte_eal_version.map         |  1 +
> 3 files changed, 37 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/rte_random.h b/lib/librte_eal/common/include/rte_random.h
> index 66dfe8ae7..053912168 100644
> --- a/lib/librte_eal/common/include/rte_random.h
> +++ b/lib/librte_eal/common/include/rte_random.h
> @@ -47,6 +47,22 @@ rte_srand(uint64_t seedval);
> uint64_t
> rte_rand(void);
> 
> +/**
> + * Generates a pseudo-random number less than upper_bound.
> + *
> + * This function returns an uniformly distributed (unbiased) random
> + * number lower than a user-specified maximum value.
> + *
> + * If called from lcore threads, this function is thread-safe.
> + *
> + * @param upper_bound
> + *   The upper bound of the generated number.
> + * @return
> + *   A pseudo-random value between 0 and (upper_bound-1).
> + */
> +uint64_t
> +rte_rand_max(uint64_t upper_bound);
> +
> #ifdef __cplusplus
> }
> #endif
> diff --git a/lib/librte_eal/common/rte_random.c b/lib/librte_eal/common/rte_random.c
> index 288e7b8c5..bf9240470 100644
> --- a/lib/librte_eal/common/rte_random.c
> +++ b/lib/librte_eal/common/rte_random.c
> @@ -131,6 +131,26 @@ rte_rand(void)
> 	return __rte_rand_lfsr258(state);
> }
> 
> +uint64_t __rte_experimental
> +rte_rand_max(uint64_t upper_bound)
> +{
> +	uint8_t zeros;
> +	uint64_t mask = ~((uint64_t)0);
> +	uint64_t res;
> +
> +	if (upper_bound < 2)
> +		return 0;
> +
> +	zeros = __builtin_clzll(upper_bound);
> +	mask >>= zeros;
> +
> +	do {
> +		res = rte_rand() & mask;
> +	} while (unlikely(res >= upper_bound));

My concern here is the numbers of times this loop could be executed as the upper bound could be just over a power of 2 and it is a large number meaning the number of values above upper max and the power of 2 could be huge. Am I looking this loop correctly. If my thought process is correct it could take a long time to get a value less tha n upper max, correct?

If every thing aligns in a bad way it could be a large number of loops and cause all kinds of problems. What could be done here or is this a non-issue?
> +
> +	return res;
> +}
> +
> RTE_INIT(rte_rand_init)
> {
> 	rte_srand(rte_get_timer_cycles());
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index 0d60668fa..8f5b8c6a6 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -367,6 +367,7 @@ EXPERIMENTAL {
> 	rte_mp_sendmsg;
> 	rte_option_register;
> 	rte_rand;
> +	rte_rand_max;
> 	rte_realloc_socket;
> 	rte_service_lcore_attr_get;
> 	rte_service_lcore_attr_reset_all;
> -- 
> 2.17.1
> 

Regards,
Keith
Mattias Rönnblom April 21, 2019, 7:05 p.m. | #2
On 2019-04-20 23:08, Wiles, Keith wrote:

>> +uint64_t __rte_experimental
>> +rte_rand_max(uint64_t upper_bound)
>> +{
>> +	uint8_t zeros;
>> +	uint64_t mask = ~((uint64_t)0);
>> +	uint64_t res;
>> +
>> +	if (upper_bound < 2)
>> +		return 0;
>> +
>> +	zeros = __builtin_clzll(upper_bound);
>> +	mask >>= zeros;
>> +
>> +	do {
>> +		res = rte_rand() & mask;
>> +	} while (unlikely(res >= upper_bound));
> 
> My concern here is the numbers of times this loop could be executed as the upper bound could be just over a power of 2 and it is a large number meaning the number of values above upper max and the power of 2 could be huge. Am I looking this loop correctly. If my thought process is correct it could take a long time to get a value less tha n upper max, correct?
> 

If the number of values over the limit is huge, so is the value range 
below it.

> If every thing aligns in a bad way it could be a large number of loops and cause all kinds of problems. What could be done here or is this a non-issue?

Formally, the execution has no upper bound, but in practice I think this 
is a non-issue.

The cycle cost of an iteration of the loop (including the rte_rand() 
call) is ~10 cc. With the worst-case upper_limit, the probability of 
having to redo the rte_rand() call is ~0.5, for every iteration. This is 
with a pseudo-random number generator featuring an uniform distribution.

Let's assume we care about determinism, and we have a system which 
suffers a catastrophic failure if the rte_rand_max() latency is more 
than 10k clock cycles. The system performs 10M rte_rand_max() calls per 
second.

The probability of this system running for a year without any 
rte_rand_max()-related crashes is:
0.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999970568598526

So crashes due to excessive rte_rand_max() latency may occur, but this 
risk is dwarfed by that of the system being hit by an asteroid.

(Btw, I doubt even the people in our sales force have seen that many 
nines before.)

 From a performance point of view, the high-loop-count cases are rare 
enough not to pose a serious threat. For example, being forced to redo 
rte_rand() more than five times is only a ~3% risk.
Wiles, Keith April 22, 2019, 4:33 a.m. | #3
Sent from my iPhone

> On Apr 21, 2019, at 2:05 PM, Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
> On 2019-04-20 23:08, Wiles, Keith wrote:
> 
>>> +uint64_t __rte_experimental
>>> +rte_rand_max(uint64_t upper_bound)
>>> +{
>>> +    uint8_t zeros;
>>> +    uint64_t mask = ~((uint64_t)0);
>>> +    uint64_t res;
>>> +
>>> +    if (upper_bound < 2)
>>> +        return 0;
>>> +
>>> +    zeros = __builtin_clzll(upper_bound);
>>> +    mask >>= zeros;
>>> +
>>> +    do {
>>> +        res = rte_rand() & mask;
>>> +    } while (unlikely(res >= upper_bound));
>> My concern here is the numbers of times this loop could be executed as the upper bound could be just over a power of 2 and it is a large number meaning the number of values above upper max and the power of 2 could be huge. Am I looking this loop correctly. If my thought process is correct it could take a long time to get a value less tha n upper max, correct?
> 
> If the number of values over the limit is huge, so is the value range below it.
> 
>> If every thing aligns in a bad way it could be a large number of loops and cause all kinds of problems. What could be done here or is this a non-issue?
> 
> Formally, the execution has no upper bound, but in practice I think this is a non-issue.
> 
> The cycle cost of an iteration of the loop (including the rte_rand() call) is ~10 cc. With the worst-case upper_limit, the probability of having to redo the rte_rand() call is ~0.5, for every iteration. This is with a pseudo-random number generator featuring an uniform distribution.
> 
> Let's assume we care about determinism, and we have a system which suffers a catastrophic failure if the rte_rand_max() latency is more than 10k clock cycles. The system performs 10M rte_rand_max() calls per second.
> 
> The probability of this system running for a year without any rte_rand_max()-related crashes is:
> 0.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999970568598526
> 
> So crashes due to excessive rte_rand_max() latency may occur, but this risk is dwarfed by that of the system being hit by an asteroid.
> 
> (Btw, I doubt even the people in our sales force have seen that many nines before.)
> 
> From a performance point of view, the high-loop-count cases are rare enough not to pose a serious threat. For example, being forced to redo rte_rand() more than five times is only a ~3% risk.

Even a few loops can have an effect on performance when we are talking about micro-seconds plus it leads to indeterminate results. The numbers you reported here are interesting, but I would be happier if you added a limit to the loop. If you state the likely hood of doing 5 loops is only 3% then adding a loop limit would be reasonable, right?
Mattias Rönnblom April 22, 2019, 7:07 a.m. | #4
On 2019-04-22 06:33, Wiles, Keith wrote:

>>  From a performance point of view, the high-loop-count cases are rare enough not to pose a serious threat. For example, being forced to redo rte_rand() more than five times is only a ~3% risk.
> 
> Even a few loops can have an effect on performance when we are talking about micro-seconds plus it leads to indeterminate results. The numbers you reported here are interesting, but I would be happier if you added a limit to the loop. If you state the likely hood of doing 5 loops is only 3% then adding a loop limit would be reasonable, right?
> 

Probability is already effectively putting a limit to the loop. The risk 
of being stuck for >1us is p=~6e-73. The variations in execution times 
will in most cases be less than a LLC miss.

A loop variable will not have any effect on performance, pollute the 
code, and hurt uniformity.

Here's what rte_rand_max() performance looks like on my Skylake.

Average rte_rand_max() latency with worst-case upper_bound:
rte_rand_max() w/o loop limit: 47 cc
rte_rand_max() w/ max 8 retries: 49 cc
rte_rand_max() w/ max 4 retries: 47 cc
rte_rand_max() w/ max 2 retries: 40 cc

So you need to be very aggressive in limiting the loop count for that 
loop variable to pay off. Otherwise, you will just be at a loss, doing 
all that bookkeeping which very rarely turns out to be useful.
Wiles, Keith April 22, 2019, 1:19 p.m. | #5
> On Apr 22, 2019, at 2:07 AM, Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
> On 2019-04-22 06:33, Wiles, Keith wrote:
> 
>>> From a performance point of view, the high-loop-count cases are rare enough not to pose a serious threat. For example, being forced to redo rte_rand() more than five times is only a ~3% risk.
>> Even a few loops can have an effect on performance when we are talking about micro-seconds plus it leads to indeterminate results. The numbers you reported here are interesting, but I would be happier if you added a limit to the loop. If you state the likely hood of doing 5 loops is only 3% then adding a loop limit would be reasonable, right?
> 
> Probability is already effectively putting a limit to the loop. The risk of being stuck for >1us is p=~6e-73. The variations in execution times will in most cases be less than a LLC miss.
> 
> A loop variable will not have any effect on performance, pollute the code, and hurt uniformity.
> 
> Here's what rte_rand_max() performance looks like on my Skylake.
> 
> Average rte_rand_max() latency with worst-case upper_bound:
> rte_rand_max() w/o loop limit: 47 cc
> rte_rand_max() w/ max 8 retries: 49 cc
> rte_rand_max() w/ max 4 retries: 47 cc
> rte_rand_max() w/ max 2 retries: 40 cc
> 
> So you need to be very aggressive in limiting the loop count for that loop variable to pay off. Otherwise, you will just be at a loss, doing all that bookkeeping which very rarely turns out to be useful.

The adding of a loop counter is not a performance problem (as you stated only a few cycles are added) and it does not hurt the readability of the code. As for hurting uniformity I do not see that being a problem, as most loops have a visible way to terminate the loop instead of relying on a random number generator to give a valid value in a few loops. It is only reasonable IMO to limit the loop.

But I will let it go and let you add this code to DPDK if someone else wants to ACK it.

Regards,
Keith

Patch

diff --git a/lib/librte_eal/common/include/rte_random.h b/lib/librte_eal/common/include/rte_random.h
index 66dfe8ae7..053912168 100644
--- a/lib/librte_eal/common/include/rte_random.h
+++ b/lib/librte_eal/common/include/rte_random.h
@@ -47,6 +47,22 @@  rte_srand(uint64_t seedval);
 uint64_t
 rte_rand(void);
 
+/**
+ * Generates a pseudo-random number less than upper_bound.
+ *
+ * This function returns an uniformly distributed (unbiased) random
+ * number lower than a user-specified maximum value.
+ *
+ * If called from lcore threads, this function is thread-safe.
+ *
+ * @param upper_bound
+ *   The upper bound of the generated number.
+ * @return
+ *   A pseudo-random value between 0 and (upper_bound-1).
+ */
+uint64_t
+rte_rand_max(uint64_t upper_bound);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/rte_random.c b/lib/librte_eal/common/rte_random.c
index 288e7b8c5..bf9240470 100644
--- a/lib/librte_eal/common/rte_random.c
+++ b/lib/librte_eal/common/rte_random.c
@@ -131,6 +131,26 @@  rte_rand(void)
 	return __rte_rand_lfsr258(state);
 }
 
+uint64_t __rte_experimental
+rte_rand_max(uint64_t upper_bound)
+{
+	uint8_t zeros;
+	uint64_t mask = ~((uint64_t)0);
+	uint64_t res;
+
+	if (upper_bound < 2)
+		return 0;
+
+	zeros = __builtin_clzll(upper_bound);
+	mask >>= zeros;
+
+	do {
+		res = rte_rand() & mask;
+	} while (unlikely(res >= upper_bound));
+
+	return res;
+}
+
 RTE_INIT(rte_rand_init)
 {
 	rte_srand(rte_get_timer_cycles());
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index 0d60668fa..8f5b8c6a6 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -367,6 +367,7 @@  EXPERIMENTAL {
 	rte_mp_sendmsg;
 	rte_option_register;
 	rte_rand;
+	rte_rand_max;
 	rte_realloc_socket;
 	rte_service_lcore_attr_get;
 	rte_service_lcore_attr_reset_all;