A simple first check of whether the numbers are uniform is to draw a histogram of the data and see whether it is reasonably flat. To do this, we can use the `plt.hist` function, after importing `matplotlib.pyplot` as `plt`.
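import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2023)  # fix the seed so the figure is reproducible
u = np.random.uniform(0, 1, 5000)  # 5000 draws from Uniform(0, 1)
plt.hist(u)
plt.show()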
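
The visual impression of flatness can also be backed by numbers. The following is a minimal sketch, assuming 10 equal-width bins on [0, 1] (an illustrative choice, not something fixed by the text): it counts the values in each bin and compares the counts with the N/10 values expected under uniformity.

```python
import numpy as np

# Same seed and sample size as in the histogram example.
np.random.seed(2023)
u = np.random.uniform(0, 1, 5000)

# Count how many values fall into each of 10 equal-width bins on [0, 1].
n_bins = 10  # assumption: an illustrative number of bins
counts, edges = np.histogram(u, bins=n_bins, range=(0, 1))

# Under uniformity, each bin is expected to hold N / n_bins values.
expected = len(u) / n_bins
rel_dev = np.abs(counts - expected) / expected  # relative deviation per bin

print("counts:", counts)
print("expected per bin:", expected)
print("largest relative deviation:", rel_dev.max())
```

Small relative deviations are consistent with a flat histogram. How small is small enough is exactly the question a formal test answers.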
We could do a hypothesis test that follows this form:
- $H_0$: $u_i$ is uniform between zero and one, i = 1,2,…
- $H_a$: $u_i$ is not uniform between zero and one, i = 1,2,…

If the p-value of the test is smaller than a significance level α of our choice, we reject the null hypothesis and say that, with a confidence level of (1-α) * 100%, there is enough statistical evidence against $H_0$. Otherwise, we fail to reject $H_0$.
# Kolmogorov-Smirnov Test
There are several ways to carry out such a test, but we will consider only one here: the Kolmogorov-Smirnov test. It is based on the empirical cumulative distribution function (ecdf). The ecdf $\hat{F}$ is the cumulative distribution function computed from a sequence of N numbers as:
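$$ \hat{F}(t) = \frac{\text{number of values in the sequence} \leq t}{N} $$

u = [0.1, 0.2, 0.4, 0.8, 0.9]
# ecdf heights: the i-th smallest value lifts the ecdf to i/N
y = [i/len(u) for i in range(1, len(u)+1)]
# data values go on the horizontal axis, cumulative proportions on the vertical axis
plt.step(u, y, where='post')
plt.show()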
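
The definition of $\hat{F}$ can also be evaluated directly at any point t. The sketch below uses a helper named `ecdf`, a hypothetical name chosen for illustration, which counts the values that are at most t and divides by N.

```python
import numpy as np

def ecdf(sample, t):
    """Fraction of the values in `sample` that are <= t (hypothetical helper)."""
    sample = np.asarray(sample)
    return np.count_nonzero(sample <= t) / sample.size

u = [0.1, 0.2, 0.4, 0.8, 0.9]

# Evaluate the ecdf at a few points between zero and one.
for t in [0.05, 0.1, 0.5, 0.85, 1.0]:
    print(t, ecdf(u, t))
```

For this sequence the ecdf is 0 below 0.1, jumps by 1/5 at each data value, and reaches 1 at 0.9.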
The idea behind this test is to quantify how similar the ecdf computed from a sequence of data is to the cdf of the uniform distribution, which is the straight line $F(t) = t$ for t between zero and one.
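
One common way to measure this similarity is the largest vertical gap between the ecdf and the straight line, which is the statistic the Kolmogorov-Smirnov test is built on. The sketch below is a hand-rolled illustration, not the library implementation. The function name `ks_distance_uniform` is hypothetical, and the code assumes all values lie between zero and one.

```python
import numpy as np

def ks_distance_uniform(sample):
    """Largest vertical gap between the ecdf of `sample` and the line F(t) = t.

    Hypothetical helper; assumes every value lies in [0, 1].
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # The ecdf is a step function, so the largest gap occurs at a data value:
    d_plus = np.max(i / n - x)         # ecdf above the line, right at each jump
    d_minus = np.max(x - (i - 1) / n)  # ecdf below the line, just before each jump
    return max(d_plus, d_minus)

u = [0.1, 0.2, 0.4, 0.8, 0.9]
d = ks_distance_uniform(u)
print(d)
```

For this short sequence the gap is about 0.2. For a long, truly uniform sequence it would be expected to be much closer to zero.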
The test formally embeds this idea of similarity between the ecdf and the cdf of the uniform distribution in a hypothesis test. The function `stats.kstest` from the `scipy` library implements this test in Python. For the two sequences u1 and u2 shown previously, the test can be carried out as follows:
from scipy import stats

stats.kstest(u1, 'uniform')
#D-statistic: 0.0092
#p-value: 0.787176
stats.kstest(u2, 'uniform')
#D-statistic: 0.4868
#p-value: 0.0
The p-value of the test for the sequence u1 is 0.787, so we cannot reject the $H_0$ that the sequence is uniformly distributed.
On the other hand, the p-value of the test for the sequence u2 is extremely small, so we reject $H_0$ and conclude that the sequence is not uniformly distributed.
This confirms our intuition.
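
The whole procedure can be sketched end to end. Since u1 and u2 are not defined in this section, the code below uses stand-ins that are purely assumptions: `u1_demo` is drawn from the uniform distribution, and `u2_demo` is made deliberately non-uniform by cubing uniform draws. The helper `ks_decision` and the level α = 0.05 are also illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

# Hypothetical stand-ins for u1 and u2 (not the sequences used in the text).
u1_demo = rng.uniform(0, 1, 5000)       # uniform on [0, 1)
u2_demo = rng.uniform(0, 1, 5000) ** 3  # skewed towards zero, so not uniform

alpha = 0.05  # assumed significance level

def ks_decision(sample, alpha):
    """Run the KS test against Uniform(0, 1); report D, p-value and decision."""
    res = stats.kstest(sample, 'uniform')
    return res.statistic, res.pvalue, bool(res.pvalue < alpha)

for name, sample in [("u1_demo", u1_demo), ("u2_demo", u2_demo)]:
    d_stat, p_value, reject = ks_decision(sample, alpha)
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"{name}: D = {d_stat:.4f}, p-value = {p_value:.4g} -> {verdict}")
```

The cubed sequence should produce a large D-statistic and a p-value close to zero, mirroring the behaviour reported for u2 above.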