Data Science Essentials: Detailed Explanation of Generating Random Data in Python

Randomness shows up throughout everyday programming, and generating random numbers is one of the most common cases. How random is random? The question matters most when information security is involved. Whenever you generate random data, strings, or numbers in Python, it's good to have at least a rough idea of how that data is generated.

This article walks through the different options for generating random data in Python, then compares each option in terms of security, versatility, usefulness, and speed.

This is not a math or cryptography tutorial; it uses only as much math as is needed.

How random is randomness

Most random data generated in Python is not truly random in the scientific sense. Rather, it is pseudorandom: generated with a pseudorandom number generator (PRNG), which is essentially any algorithm for generating seemingly random but still reproducible data. "True" random numbers can be generated by a true random number generator (TRNG).

You may have seen calls such as random.seed(999) or random.seed(1234) in Python. Such a call seeds the underlying random number generator used by Python's random module. Seeding is what makes subsequent calls to generate random numbers deterministic: input A always produces output B.

The terms "random" and "deterministic" may seem unable to coexist. To make this clearer, here is an extremely stripped-down version of random() that creates a "random" number by iterating x = (x * 3) % 19. x starts out as the seed value and then morphs into a deterministic sequence of numbers based on that seed.

class NotSoRandom(object):
    def seed(self, a=3):
        """Seed the random number generator."""
        self.seedval = a

    def random(self):
        """Return the next pseudorandom number."""
        self.seedval = (self.seedval * 3) % 19
        return self.seedval

_inst = NotSoRandom()
seed = _inst.seed
random = _inst.random

for i in range(10):
    seed(123)
    print([random() for _ in range(10)])
    
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
[8, 5, 15, 7, 2, 6, 18, 16, 10, 11]
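Because the toy generator above can only produce values from 1 to 18 (for a seed that is not a multiple of 19), its output has to repeat sooner or later. The sketch below counts how many steps that takes. Note that cycle_length is a hypothetical helper introduced here for illustration; it is not part of the original code.

```python
def cycle_length(seed, multiplier=3, modulus=19):
    """Count the steps until the toy generator repeats its first output.

    Hypothetical helper: it re-implements x = (x * 3) % 19 from the
    NotSoRandom class so that the snippet runs on its own.
    """
    first = (seed * multiplier) % modulus
    x = first
    steps = 0
    while True:
        x = (x * multiplier) % modulus
        steps += 1
        if x == first:
            return steps


period = cycle_length(123)
print(period)  # 18: the sequence starts over after 18 numbers
```

A period this short is what makes the toy useless in practice; real PRNGs use the same idea of deterministic state updates, but with a vastly larger state.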

Cryptographic security

If you haven't had enough of the "RNG" acronyms, here is one more: CSPRNG, or cryptographically secure PRNG. CSPRNGs are suitable for generating sensitive data such as passwords, authenticators, and tokens. Given a random string, there is realistically no way for an attacker to determine which string came before or after it in a sequence of random strings.

Another term you may see is entropy, which refers to the amount of randomness introduced or desired. For example, the secrets module, covered later, defines DEFAULT_ENTROPY = 32, the number of bytes to return by default.

A key point about CSPRNGs is that they are still pseudorandom. They are engineered to be internally deterministic, but they add some other variable or have some property that makes them "random enough" to prevent anyone from working backward to the function that enforces determinism.

PRNG and CSPRNG tools in Python:
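As a small illustration of the default entropy mentioned above, the sketch below assumes the module in question is secrets. DEFAULT_ENTROPY is a module-level constant there rather than a documented public API, so treat reading it directly as an assumption that may not hold in every Python version.

```python
import secrets

# With no argument, token_bytes() falls back to the module's default size.
default_token = secrets.token_bytes()

# DEFAULT_ENTROPY is an implementation detail of the secrets module
# (assumption: it stays readable as a plain module attribute).
print(secrets.DEFAULT_ENTROPY, len(default_token))
```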

  • PRNG options include the random module from the Python standard library and its array-based NumPy counterpart, numpy.random.
  • Python's os, secrets, and uuid modules contain functions for generating cryptographically secure objects.

PRNG

random module

The random module is the most widely known tool for generating random data in Python, using the Mersenne Twister PRNG algorithm as its core generator.

Build some random data without seeding. The random.random() function returns a random floating point number in the interval [0.0, 1.0).

import random
random.random()
0.1250920165739744
random.random()
0.7327868824782764

With random.seed(), the results can be made reproducible: the chain of calls that follows random.seed() will always produce the same trail of data.

The random number sequence becomes deterministic, or completely determined by the seed value.

random.seed(444)
random.random()
0.3088946587429545
random.random()
0.01323751590501987

random.seed(444)
random.random()
0.3088946587429545
random.random()
0.01323751590501987
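Seeding the module-level functions changes global state shared by all code that uses random. One possible alternative, sketched here, is to create separate random.Random instances, each carrying its own seed and state. The variable names are illustrative only.

```python
import random

# Two independent generators seeded with the same value
rng_a = random.Random(444)
rng_b = random.Random(444)

seq_a = [rng_a.random() for _ in range(3)]
seq_b = [rng_b.random() for _ in range(3)]

# Same seed, same sequence, with no effect on the global random state
print(seq_a == seq_b)  # True
```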

Use random.randint() to generate a random integer between two endpoints. The result is drawn from the closed interval [x, y], so it may equal either endpoint.

random.randint(0, 10)
2
random.randint(500, 50000)
9991

Use random.randrange() to exclude the right endpoint: the result is always in the half-open interval [x, y) and is therefore always less than y.

random.randrange(1, 10)
9
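The endpoint behaviour of the two functions can be checked empirically. The sketch below draws many values from each with a privately seeded generator (the seed and sample size are arbitrary choices) and looks at the extremes.

```python
import random

rng = random.Random(0)  # private, seeded generator so the check is repeatable

closed = [rng.randint(0, 10) for _ in range(10_000)]       # [0, 10]
half_open = [rng.randrange(0, 10) for _ in range(10_000)]  # [0, 10)

# randint() reaches the right endpoint; randrange() stops one short of it.
print(min(closed), max(closed))
print(min(half_open), max(half_open))
```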

Use random.uniform() to generate random floating point numbers in the specified [x, y] interval from a continuous uniform distribution.

random.uniform(20, 30)
27.42639687016509
random.uniform(30, 40)
36.33865802745107

Use random.choice() to pick a single random element from a non-empty sequence (such as a list or tuple), and random.choices() to pick k elements with replacement.

items = ['A', 'B', 'C', 'D', 'E']
random.choice(items)
'B'

random.choices(items, k=2)
['A', 'C']
random.choices(items, k=3)
['C', 'D', 'E']

Use random.sample() to simulate sampling without replacement.

random.sample(items, 4)
['A', 'D', 'B', 'E']

Use random.shuffle() to shuffle a sequence in place, randomizing the order of its elements.

random.shuffle(items)
items
['E', 'B', 'A', 'C', 'D']
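Since random.shuffle() modifies its argument, the original order is lost. If the original list needs to stay intact, one option is to sample every element without replacement, as in this minimal sketch.

```python
import random

items = ['A', 'B', 'C', 'D', 'E']

# Sampling all elements without replacement returns a new, shuffled list
shuffled = random.sample(items, k=len(items))

print(items)     # original order is unchanged
print(shuffled)  # same elements in random order
```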

The following example generates random strings of a given length, of the kind commonly used for verification codes.

from random import Random

# Generate a random string for use as an email verification code
def RandomsStr(random_length):
    Str = ''
    chars = 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789'  # allowed characters
    length = len(chars) - 1
    random = Random()
    for i in range(random_length):
        Str += chars[random.randint(0, length)]
    return Str

RandomsStr(10)
'LhK3vFepch'

RandomsStr(16)
'iGy1g0FO54Cjx3WP'
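Verification codes are arguably security-sensitive, and the Mersenne Twister behind Random is not cryptographically secure. A possible variant, sketched here with the standard secrets module, draws each character from the operating system's CSPRNG instead. The names secure_random_str and ALPHABET are hypothetical and not part of the original example.

```python
import secrets
import string

# Same character set as above: upper- and lowercase letters plus digits
ALPHABET = string.ascii_letters + string.digits


def secure_random_str(length):
    """Build a random string using the OS-backed CSPRNG via secrets.choice()."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))


code = secure_random_str(10)
print(code)
```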

numpy.random: PRNG for arrays

Most functions in random return a scalar value (a single int, float, or other object). To generate a sequence, you can use a list comprehension.

[random.random() for _ in range(5)]
[0.7401011155476498,
 0.9892634439644596,
 0.36991622177966765, 
 0.14950913503744223, 
 0.4868906039708182]

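numpy.random uses its own PRNG, which is separate from the one in the standard random module, so seeding one does not affect the other.

import numpy as np

# Return samples from the standard normal distribution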
np.random.randn(5)
array([-0.59656657, -0.6271152 , -1.51244475, -1.02445644, -0.36722254])

np.random.randn(3, 4)
array([[ 0.34054183,  1.59173609, -0.5257795 , -0.86912511],
       [-0.86855499, -0.64487065,  1.47682128,  1.8238103 ],
       [ 0.05477224,  0.35452769,  0.14088743,  0.55049185]])

# Randomly assign values according to given probabilities
np.random.choice([0, 1], p=[0.6, 0.4], size=(5, 4))
array([[0, 1, 0, 1],
       [0, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 0, 0]])

# Create a series of random boolean values
np.random.randint(0, 2, size=25, dtype=np.uint8).view(bool)
array([ True, False,  True,  True, False,  True, False, False, False,
       False, False,  True,  True, False, False, False,  True, False,
        True, False,  True,  True,  True, False,  True])
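The calls above use NumPy's legacy module-level functions. Newer NumPy versions (1.17 and later) also provide a Generator object through np.random.default_rng(). The sketch below shows roughly equivalent calls under that assumption; the seed value is arbitrary.

```python
import numpy as np

# Two generators with the same seed produce the same stream
rng_a = np.random.default_rng(444)
rng_b = np.random.default_rng(444)

a = rng_a.standard_normal(5)  # counterpart of np.random.randn(5)
b = rng_b.standard_normal(5)
print(np.array_equal(a, b))   # True

# Random booleans: integers in [0, 2) converted to bool
flips = rng_a.integers(0, 2, size=25).astype(bool)
print(flips.dtype, flips.shape)
```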

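Generating correlated data

Suppose you want to simulate two correlated time series. One way to do this is with NumPy's multivariate_normal() function, which takes a covariance matrix into account. To draw from a single normally distributed random variable, you specify its mean and variance (or standard deviation); to draw from a multivariate normal distribution, you specify the means and the covariance matrix. Because correlation is usually more intuitive than covariance, the helper below builds the covariance matrix from a correlation matrix and a vector of standard deviations.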

def corr2cov(p, s):
    """Covariance matrix from a correlation matrix and standard deviations."""
    d = np.diag(s)
    return d @ p @ d

corr = np.array([[1., -0.40], [-0.40, 1.]])
stdev = np.array([6., 1.])
mean = np.array([2., 0.5])
cov = corr2cov(corr, stdev)
data = np.random.multivariate_normal(mean=mean, cov=cov, size=50)
data[:10]

[[-0.33377432  0.22889428]
 [-1.5311996   0.31678635]
 [-6.02684472  0.90562824]
 [ 5.2696086   0.86518295]
 [ 6.43832395  0.36507745]
 [-8.49347011  0.68663565]
 [-5.05968126  0.55214914]
 [ 2.02314646  1.32325775]
 [ 0.98705556 -0.63118682]
 [ 2.90724439 -1.26188307]]
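With only 50 rows the sample statistics are noisy, so it is hard to see whether the data really has the requested correlation. The sketch below repeats the same construction with a larger sample and checks the result. The sample size, the seed, and the use of default_rng() are choices made here for illustration.

```python
import numpy as np


def corr2cov(p, s):
    """Covariance matrix from a correlation matrix and standard deviations."""
    d = np.diag(s)
    return d @ p @ d


corr = np.array([[1., -0.40], [-0.40, 1.]])
stdev = np.array([6., 1.])
mean = np.array([2., 0.5])
cov = corr2cov(corr, stdev)
print(cov)  # variances 36 and 1 on the diagonal, covariance -2.4 off it

rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mean=mean, cov=cov, size=10_000)

# Sample correlation and standard deviations should be close to the inputs
sample_corr = np.corrcoef(sample, rowvar=False)[0, 1]
sample_std = sample.std(axis=0)
print(round(sample_corr, 2), sample_std.round(2))
```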

Comparison of the random module and NumPy

random module                                 NumPy counterpart            Description
random()                                      rand()                       random float in [0.0, 1.0)
randint(a, b)                                 random_integers()            random integers in [a, b]
randrange(a, b[, step])                       randint()                    random integers in [a, b)
uniform(a, b)                                 uniform()                    random floats in [a, b]
choice(seq)                                   choice()                     random element from seq
choices(seq, k=1)                             choice()                     random k elements from seq, with replacement
sample(population, k)                         choice() with replace=False  random k elements from seq, without replacement
shuffle(x)                                    shuffle()                    shuffle the sequence x in place
normalvariate(mu, sigma) or gauss(mu, sigma)  normal()                     normally distributed sample with mean mu and standard deviation sigma

CSPRNG

os.urandom(): as random as it gets

Python's os.urandom() function is used by both secrets and uuid. Without going into too much detail, it generates operating-system-dependent random bytes that can safely be called cryptographically secure, although they are still technically pseudorandom.

The only parameter is the number of bytes to return.

import os

os.urandom(3)
b'\xa2\xe8\x02'

x = os.urandom(6)
x
b'\xce\x11\xe7"!\x84'

type(x), len(x)
(bytes, 6)

However, raw bytes in this form are often inconvenient in application code, where a text representation is usually needed.
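One common way to make such bytes easier to store or transmit is to convert them to a hexadecimal string or an integer. The following is a minimal sketch using only built-in bytes and int methods.

```python
import os

raw = os.urandom(6)

as_hex = raw.hex()                             # 2 hex digits per byte
as_int = int.from_bytes(raw, byteorder='big')  # one large integer
round_trip = bytes.fromhex(as_hex)             # back to the original bytes

print(as_hex, len(as_hex))
print(as_int)
print(round_trip == raw)  # True
```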

secrets: the best way to keep secrets

Introduced in Python 3.6 by PEP 506, the secrets module aims to be the de facto Python module for generating cryptographically secure random bytes and strings.

secrets is basically a wrapper around os.urandom(). It exports only a handful of functions for generating random numbers, bytes, and strings.

import secrets

n = 16

# Generate secure tokens
secrets.token_bytes(n)
b'A\x8cz\xe1o\xf9!;\x8b\xf2\x80pJ\x8b\xd4\xd3'
secrets.token_hex(n)
'9cb190491e01230ec4239cae643f286f'  
secrets.token_urlsafe(n)
'MJoi7CknFu3YN41m88SEgQ'
# Secure version of `random.choice()`
secrets.choice('rain')
'a'
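The three token functions take the same byte count n but return results of different lengths, because each encodes the bytes differently. The sketch below makes that relationship explicit and also shows secrets.randbelow(), which returns a secure random integer in [0, n).

```python
import secrets

n = 16

print(len(secrets.token_bytes(n)))    # 16: raw bytes
print(len(secrets.token_hex(n)))      # 32: two hex digits per byte
print(len(secrets.token_urlsafe(n)))  # 22: Base64 text without padding

# Secure random integer in [0, 6), shifted to simulate a die roll
roll = secrets.randbelow(6) + 1
print(roll)
```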

UUID

The final option for generating random tokens is the uuid4() function in Python's uuid module. A UUID is a Universally Unique Identifier, a 128-bit sequence (written as 32 hexadecimal digits) designed to "guarantee uniqueness across space and time". uuid4() is one of the most useful functions in this module, and it also uses os.urandom().

import uuid

uuid.uuid4()
UUID('3e3ef28d-3ff0-4933-9bba-e5ee91ce0e7b')
uuid.uuid4()
UUID('2e115fcb-5761-4fa1-8287-19f4ee2877ac')

You may have seen some other variants: uuid1(), uuid3(), and uuid5(). The key difference is that, unlike uuid4(), those three functions all derive their result from some form of input (the host ID and current time for uuid1(), a namespace and a name for uuid3() and uuid5()), so they are not random to the extent that a version 4 UUID is.

In addition to the security modules (such as secrets), Python's random module has a rarely used class called SystemRandom, which uses os.urandom(). (In turn, SystemRandom is used by secrets. It is all a bit of a web that traces back to os.urandom().)
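The difference between the random and the input-based variants can be seen by calling them twice. In this sketch the name 'example.com' is just a placeholder input.

```python
import uuid

# Version 4: random, so every call gives a new value
random_id = uuid.uuid4()
print(random_id.version, len(random_id.hex))  # 4 32

# Version 5: derived from a namespace and a name, so it is repeatable
name_id_1 = uuid.uuid5(uuid.NAMESPACE_DNS, 'example.com')
name_id_2 = uuid.uuid5(uuid.NAMESPACE_DNS, 'example.com')
print(name_id_1 == name_id_2)  # True
```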
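One practical consequence of SystemRandom drawing from the operating system is that it cannot be seeded or have its state saved. The sketch below illustrates this; the seed value 444 is arbitrary.

```python
import random

sysrand = random.SystemRandom()

# seed() is accepted but ignored, so the output is not reproducible
sysrand.seed(444)
first = sysrand.random()
sysrand.seed(444)
second = sysrand.random()
print(first == second)  # False

# There is no internal state to save or restore
try:
    sysrand.getstate()
    state_available = True
except NotImplementedError:
    state_available = False
print(state_available)  # False
```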

So why not make this version the default? Why not be "always secure" rather than defaulting to deterministic random functions that are not cryptographically secure?

  1. Because sometimes you want data to be deterministic and reproducible so that others can follow along.
  2. Because of speed: CSPRNGs, at least in Python, tend to be much slower than PRNGs.

The script timed.py below tests this by using timeit.repeat() to compare the PRNG and CSPRNG versions of randint().

import random
import timeit

# The CSPRNG version uses `SystemRandom()`, which in turn uses `os.urandom()`.
_sysrand = random.SystemRandom()

def prng() -> None:
    random.randint(0, 95)

def csprng() -> None:
    _sysrand.randint(0, 95)

setup = 'import random; from __main__ import prng, csprng'

if __name__ == '__main__':
    print('Best of 3 trials with 1,000,000 loops per trial:')

    for f in ('prng()', 'csprng()'):
        best = min(timeit.repeat(f, setup=setup, repeat=3))
        print('\t{:8s} {:0.2f} seconds total time.'.format(f, best))

Best of 3 trials with 1,000,000 loops per trial:
	prng()   0.93 seconds total time.
	csprng() 1.70 seconds total time.

Comparison of the options

Package/Module  Cryptographically secure  Description
random          No                        Quick and easy random data using the Mersenne Twister
numpy.random    No                        Like random, but for (possibly multidimensional) arrays
os              Yes                       Contains urandom(), the basis for the other functions presented here
secrets         Yes                       Designed as Python's de facto module for generating secure random numbers, bytes, and strings
uuid            Yes, uuid4()              Home to several functions for building 128-bit identifiers

Origin blog.csdn.net/qq_20288327/article/details/124164980