3 @brief Module provides various random sampling algorithms.
5 @authors Andrei Novikov (pyclustering@yandex.ru)
7 @copyright BSD-3-Clause
17 @brief Performs data sampling using Reservoir Algorithm R.
18 @details Algorithm complexity O(n). Implementation is based on paper @cite article::utils::sampling::1. Average
19 number of uniform random variates: \f$N - n\f$.
21 @param[in] data (list): Input data for sampling.
22 @param[in] n (uint): Size of sample that should be extracted from 'data'.
24 @return (list) Sample with size 'n' from 'data'.
26 Generate random samples with 5 elements and with 3 elements using Reservoir Algorithm R:
28 data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
29 sample = reservoir_r(data, 5) # generate sample with 5 elements for 'data'.
32 sample = reservoir_r(data, 3) # generate sample with 3 elements for 'data'.
36 Output example for the code above:
44 raise ValueError(
"Incorrect sampling value 'n' (%d) that should be bigger then data size (%d).")
49 for i
in range(n, len(data)):
50 m = random.randrange(0, i + 1)
52 reservoir[m] = data[i]
59 @brief Performs data sampling using Reservoir Algorithm X.
60 @details Algorithm complexity O(n). Implementation is based on paper @cite article::utils::sampling::1. Average
61 number of uniform random variates:
62 \f[\approx 2n\ln \left (\frac{N}{n} \right)\f]
64 @param[in] data (list): Input data for sampling.
65 @param[in] n (uint): Size of sample that should be extracted from 'data'.
67 @return (list) Sample with size 'n' from 'data'.
69 Generate random sample with 5 elements using Reservoir Algorithm X:
71 data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
72 sample = reservoir_x(data, 10) # generate sample with 5 elements for 'data'.
76 Output example for the code above:
78 [0, 20, 2, 16, 13, 15, 19, 18, 10, 9]
82 def calculate_skip_value(t, size, skip):
83 return pow(t + 1 - size, skip + 1) / pow(t + 1, skip + 1)
85 def generate_skip_value(t, size):
86 threshold, skip = random.random(), 0
87 while calculate_skip_value(t, size, skip) > threshold:
92 raise ValueError(
"Incorrect sampling value 'n' (%d) that should be bigger then data size (%d).")
99 i += generate_skip_value(i, n)
102 m = random.randrange(0, n)
103 reservoir[m] = data[i]