January 28, 2019

What you should know about numpy and pseudo random number generators (PRNG).

The Mersenne-Twister is a pseudo random number generator (PRNG) used by the numpy library. This article explains all you need to know before using it.

As a data scientist, I need to generate random numbers for my algorithms. It puzzled me that deterministic processes in my computer could produce random-looking sequences, so I took a deeper look. Here is what I learned:

Your computer does not generate truly random numbers but uses algorithms such as the Mersenne Twister to produce pseudo-random sequences.

It is important to know when and how the seed of your pseudo-random generator is set; otherwise you might get bad surprises, especially when developing multiprocess applications.

Standard random generators are highly predictable and should not be used for cryptographic or gambling purposes.

With `numpy`, the quickest way to obtain random numbers between 0 and 1 is the following:
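The code snippet itself is missing from the page; it presumably looked like the following minimal sketch (your printed values will differ, since the generator is seeded from the clock):

```python
import numpy as np

# Draw two uniform floats in [0, 1) from numpy's global generator.
print('A first random number:', np.random.rand())
print('A second random number:', np.random.rand())
```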

```
A first random number: 0.2332029758567754
A second random number: 0.7277750980801885
```

Or (almost) equivalently:
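This snippet is missing too; given the rest of the article, it presumably used a dedicated `RandomState` object instead of the global generator, something like:

```python
import numpy as np

# A dedicated generator object with its own internal state,
# seeded from the clock/OS entropy when no seed is given.
random_state = np.random.RandomState()
print('A first random number:', random_state.rand())
print('A second random number:', random_state.rand())
```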

```
A first random number: 0.8492693008307766
A second random number: 0.9858307170084044
```

In both cases, we are using what is called a pseudo random number generator, or PRNG. Indeed, whenever we call a Python function such as `np.random.rand()`, the output can only be deterministic and cannot be truly random. Hence, `numpy` has to come up with a trick to generate sequences of numbers that look random and behave as if they came from a purely random source; this is what PRNGs do.

The most widespread PRNG is the Mersenne Twister (MT) generator, designed in 1997 by M. Matsumoto and T. Nishimura. Schematically, it works as follows:

Step 1: We initialize our random generator with `random_state = np.random.RandomState()`, which generates an internal state of 624 integers.

Step 2: Each time we call `random_state.rand()`, two operations are applied: first, a twist on the internal state, producing a new internal state; then, a tempering function on the newly obtained state, which generates the float between 0 and 1 returned by `random_state.rand()`. These two operations are performed each time we ask for a new random number.

Schematic representation of the Mersenne Twister
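The two steps above can be sketched in pure Python. This is a toy transliteration of the reference MT19937 algorithm, not numpy's actual code (numpy implements it in C, and combines two 32-bit draws into one 53-bit double, while this toy uses a single draw):

```python
class ToyMT19937:
    def __init__(self, seed):
        # Step 1: expand one integer seed into a 624-word internal state
        # (standard MT19937 initialization recurrence).
        self.key = [seed & 0xFFFFFFFF]
        for i in range(1, 624):
            prev = self.key[-1]
            self.key.append((1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF)
        self.pos = 624  # force a twist on the first draw

    def twist(self):
        # Mix the whole internal state into a fresh one.
        for i in range(624):
            y = (self.key[i] & 0x80000000) | (self.key[(i + 1) % 624] & 0x7FFFFFFF)
            self.key[i] = self.key[(i + 397) % 624] ^ (y >> 1)
            if y & 1:
                self.key[i] ^= 0x9908B0DF

    @staticmethod
    def temper(y):
        # Scramble one 32-bit state word into the value actually returned.
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y

    def rand(self):
        # Step 2: twist when the state is exhausted, then temper one word
        # and map it to a float in [0, 1).
        if self.pos == 624:
            self.twist()
            self.pos = 0
        y = self.temper(self.key[self.pos])
        self.pos += 1
        return y / 2**32
```

The same seed always yields the same sequence, which already illustrates why seeding matters.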

The `twister` and the `temper` functions are totally deterministic. Hence, the value of the first call to `random_state.rand()` depends only on the initial internal state, which therefore should be generated… randomly. In practice, `RandomState()` derives the initial internal state from a `seed` value taken from the internal clock of your computer. You can also specify the `seed` value yourself with `RandomState(seed=your_favorite_seed_value)`, which is useful to make your code deterministic.
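For instance, two generators built with the same seed produce exactly the same sequence:

```python
import numpy as np

# Fixing the seed makes runs reproducible.
print(np.random.RandomState(seed=42).rand())
print(np.random.RandomState(seed=42).rand())  # prints the same value again
```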

The twister and tempering functions are made of elementary bitwise and matrix operations. You can find their definitions in the original Mersenne Twister paper, or more simply in their C implementation in the numpy GitHub repository:

`rk_state` is defined as a C structure whose only relevant fields for our purpose are `rk_state.key`, a list of 624 integers representing the internal state, and a position `rk_state.pos`. The twister function is only applied once every 624 calls, when `state->pos` reaches 624: the internal state stored in `state->key` contains enough information to generate 624 different random numbers. The `tempering` function consists of a few bitwise operations applied to the extracted word.
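Transliterated into Python, the tempering step is four shift/xor/mask operations on a 32-bit word (constants from the MT19937 reference implementation). As discussed further down, each step can be undone, which is why the generator is predictable; a sketch of both directions:

```python
def temper(y):
    # MT19937 tempering: scramble one 32-bit state word into the output.
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y

def untemper(y):
    # Each xor-shift step is invertible: iterate the defining equation
    # until enough bits have propagated (a standard fixed-point trick).
    def undo_right(v, shift):
        out = v
        for _ in range(32 // shift + 1):
            out = v ^ (out >> shift)
        return out

    def undo_left(v, shift, mask):
        out = v
        for _ in range(32 // shift + 1):
            out = v ^ ((out << shift) & mask)
        return out

    # Undo the four tempering steps in reverse order.
    y = undo_right(y, 18)
    y = undo_left(y, 15, 0xEFC60000)
    y = undo_left(y, 7, 0x9D2C5680)
    y = undo_right(y, 11)
    return y
```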

The Mersenne Twister is so widely used for two simple reasons:

It is fast to evaluate: a few nanoseconds per draw on a recent laptop, according to `timeit`.

It produces sequences of numbers that look pretty random. To evaluate the randomness of a sequence, one can run batteries of statistical tests, which the Mersenne Twister PRNG passes.
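A full test battery (established suites such as diehard or TestU01 exist for that) is beyond the scope of this post, but a crude sanity check on uniform output is easy to sketch:

```python
import numpy as np

# Uniform [0, 1) samples should have mean ~1/2 and variance ~1/12.
samples = np.random.rand(1_000_000)
print('mean:', samples.mean())      # expect ~0.5
print('variance:', samples.var())   # expect ~0.0833
```

Passing such a check is necessary but far from sufficient: real test suites also probe correlations, runs, spectral properties, and more.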

The MT PRNG should not be used as such for cryptographic purposes! Indeed, the `temper` function is easily inverted: from each output of the `rk_random` function we can infer one word of `state->key`, and after 624 iterations of `rk_random` we know the complete `state` of the generator and can predict the remainder of the pseudo-random sequence. In conclusion, do not use `numpy.random` in any cryptography or gambling development; look instead for specialized modules such as `secrets` or `Crypto.Random`.

Be careful when you use `numpy.random` in multiprocess applications, as it can lead to misleading behavior, as shown in the following:
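The code for this example is missing from the page; here is a plausible reconstruction (the function names `bad_practice` and `good_practice` come from the discussion below; exact numbers will differ, and the identical outputs rely on fork-based process start, the default on Linux):

```python
import numpy as np
from multiprocessing import Pool

def bad_practice(_):
    # Forked workers inherit the parent's global random state,
    # so they can all draw exactly the same numbers.
    return np.random.randint(0, 10, size=10)

def good_practice(seed):
    # Give each task its own explicitly seeded RandomState.
    random_state = np.random.RandomState(seed)
    return random_state.randint(0, 10, size=10)

if __name__ == '__main__':
    with Pool(5) as pool:
        print('Bad practice: ')
        print(pool.map(bad_practice, range(5)))
        print('Good practice: ')
        print(pool.map(good_practice, range(5)))
```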

Output:

```
'Bad practice: '
[array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9])]
'Good practice: '
[array([8, 9, 4, 5, 1, 0, 8, 1, 5, 4]),
array([5, 1, 3, 3, 3, 0, 0, 1, 0, 8]),
array([1, 9, 9, 9, 2, 9, 4, 3, 2, 1]),
array([4, 3, 6, 2, 6, 1, 2, 9, 5, 2]),
array([6, 3, 5, 9, 7, 1, 7, 4, 8, 5])]
```

You can see that the different calls to `bad_practice` in our multiprocess template always generate the same output. Indeed, the `RandomState` is not re-initialized for each process: all of them share the same initial seed and hence the same initial internal state. In the good-practice version, the random state is initialized differently for each process.

In one of the squares, I have randomly drawn in red 100,000 points (x, y) with a uniform distribution in [0,1], while the other square contains 100,000 points coming from another stochastic process. Can you guess which one is the uniformly distributed one?

Actually, the uniformly random one is the left one! It is quite misleading, as uniformly random processes may create more clusters and gaps than we expect. The image on the right was generated by first putting red dots on a regular grid and then shifting each dot by a small random step (the output is therefore stochastic, but it is not a uniform distribution). In fact, the human brain is quite bad at generating and/or detecting random processes, but that would be the topic of another post!
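The figure itself is not reproduced here, but the two point clouds it compares can be regenerated in a few lines (the grid size and jitter amplitude are my guesses, not stated in the article):

```python
import numpy as np

n = 100_000

# Left square: i.i.d. uniform points in [0, 1]^2 — clusters and gaps appear naturally.
uniform_points = np.random.rand(n, 2)

# Right square: a regular grid, each dot shifted by a small random step.
side = int(np.sqrt(n))  # ~316 x 316 grid
xs, ys = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
jittered = grid + np.random.uniform(-0.5 / side, 0.5 / side, size=grid.shape)
```

Plotting both arrays as scatter plots reproduces the comparison: the jittered grid looks "more uniform" to the eye, even though it is not.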

Thanks to Pierre Marcenac, Dan Ringwald, and Nicolas Jean.
