There are many kinds of randomness in nature and in theory. Sometimes the randomness can be described by a distribution, or probability density function, with a standard functional form. The most famous is the Gaussian, or Normal, distribution; the uniform distribution is the simplest.
First, let's get our notebook session set up with a directive and some standard imports.
%matplotlib notebook
# Our standard imports:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Access to many standard distributions:
import scipy.stats as ss
# Uncomment for low-res display:
#plt.rcParams['figure.dpi'] = 80
There are two basic ways of visualizing distributions: as probability density functions (PDF) and as cumulative distribution functions (CDF). The CDF is the integral of the PDF, and the PDF is the derivative of the CDF; consequently the PDF looks noisier than the CDF when estimated from a sample of modest size.
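As a quick numerical illustration (a minimal sketch using the imports above): integrating the Normal PDF up to some point should reproduce the CDF value at that point.
x = np.linspace(-5, 1, 2001)
# Trapezoidal integration of the PDF from -5 (effectively -infinity) to 1:
print(np.trapz(ss.norm.pdf(x), x))
# Compare with the CDF evaluated at 1:
print(ss.norm.cdf(1))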
We will start with two standard distributions: normal (Gaussian), and uniform.
We will also start by plotting using the canned black-box hist() function. Then we will back up and break the operation into its component parts to be sure we understand what is going on.
The hist() function or Axes method has many options and capabilities--probably too many. We will use only a few of them. To begin, let's plot counts per bin. We will let the function select the bin boundaries.
nsamp = 1000
nbins = nsamp // 50
np.random.seed(1234)
yg = np.random.randn(nsamp)
yu = np.random.rand(nsamp)
fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(hspace=0.3)
ax = axs[0, 0]
ax.hist(yu, bins=nbins)
ax.set_title('Uniform per bin')
ax = axs[1, 0]
ax.hist(yu, bins=nbins, cumulative=True)
ax.set_title('Uniform cumulative')
ax = axs[0, 1]
ax.hist(yg, bins=nbins)
ax.set_title('Normal per bin')
ax = axs[1, 1]
ax.hist(yg, bins=nbins, cumulative=True)
ax.set_title('Normal cumulative');
plt.close(fig)
Next, we will use a normalization that provides an approximate PDF and CDF.
nsamp = 1000
nbins = nsamp // 50
# keyword arguments for all subplots
kwargs = dict(bins=nbins, density=True)
yg = np.random.randn(nsamp)
yu = np.random.rand(nsamp)
fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(hspace=0.3)
ax = axs[0, 0]
ax.hist(yu, **kwargs)
ax.set_title('Uniform PDF')
ax = axs[1, 0]
ax.hist(yu, cumulative=True, **kwargs)
ax.set_title('Uniform CDF')
ax = axs[0, 1]
ax.hist(yg, **kwargs)
ax.set_title('Normal PDF')
ax = axs[1, 1]
ax.hist(yg, cumulative=True, **kwargs)
ax.set_title('Normal CDF')
plt.close(fig)
As you see, the integral of the PDF over all possible values is 1, so the CDF goes from 0 to 1. The CDF really should be plotted as a line from point to point, starting at zero at the left edge of the first bin, not with steps; unfortunately Matplotlib's hist function lacks an option for this. Below, we will see how to get around this design error. But first, let's verify a basic principle in statistics.
Why is the Normal distribution so central to statistics? There's a theorem about that: the Central Limit Theorem!
If you take the average of more and more realizations of any distribution with finite variance, the distribution of those averages approaches a Gaussian distribution. Let's illustrate this first with the uniform distribution.
npts = 1000000
navgs = [1, 3, 5, 7]
bins = np.linspace(0, 1, num=51)
fig, ax = plt.subplots()
for navg in navgs:
    y = np.random.rand(npts, navg).mean(axis=-1)
    ax.hist(y, histtype='step', bins=bins, density=True, label=str(navg))
ax.legend(loc='upper right')
plt.close(fig)
What happens if we square the values of a uniform distribution?
npts = 100000
navgs = [1, 3, 5, 7, 20]
bins = np.linspace(0, 1, num=51)
fig, ax = plt.subplots()
for navg in navgs:
    y = (np.random.rand(npts, navg)**2).mean(axis=-1)
    ax.hist(y, histtype='step', bins=bins, density=True, label=str(navg))
ax.legend(loc='upper right')
plt.close(fig)
Notice how skewed the original distribution is. As a result, more averaging is required to approach the Gaussian than was the case when we started with the uniform distribution itself.
The hist() function in matplotlib uses np.histogram to do the calculation, so let's look at that underlying function. Notice that it shares some argument names with hist. Let's experiment with the bins argument and the density argument, using a uniform distribution.
yu = np.random.rand(1000)
for density in (False, True):
    h, edges = np.histogram(yu, bins=[0, 0.5, 1], density=density)
    print("density = %s: h is" % density, h, 'edges is', edges)
yu = np.random.rand(1000)
for density in (False, True):
    h, edges = np.histogram(yu, bins=[0, 0.25, 1], density=density)
    print("density = %s: h is" % density, h, 'edges is', edges)
Notice that the count (first output, density=False) depends on the bin boundaries, but the density (PDF) does not, apart from fluctuations inherent in working with random numbers.
To get a cumulative distribution we need to take the cumulative sum of the counts; for a CDF, we need a discrete integral of the density. We will use a few more bin boundaries, and keep them uneven, so we can see whether we are doing this correctly.
bins = [0, 0.2, 0.3, 0.4, 0.6, 0.9, 1]
h_counts, edges = np.histogram(yu, bins=bins, density=False)
h_density, edges = np.histogram(yu, bins=bins, density=True)
cumulative_counts = np.cumsum(h_counts)
# Integrate the PDF:
intervals = np.diff(bins)
int_density = np.cumsum(h_density * intervals)
print("Cumulative distribution:", cumulative_counts)
print("CDF:", int_density)
Calculated this way, the last CDF value will always be 1. The first value, corresponding to the left edge of the first bin, is 0, but it is not included in the output from the histogram function. Below, we will prepend it to that output for plotting purposes, so we will have a value for each bin edge.
fig, (axc, axd) = plt.subplots(ncols=2)
cc = np.hstack(([0], cumulative_counts))
axc.plot(bins, cc, marker='o')
axc.set_title('cumulative counts')
cdf = np.hstack(([0], int_density))
axd.plot(bins, cdf, marker='o')
axd.set_title('CDF')
plt.close(fig)
Of course you could also use the matplotlib bar function, or skip the explicit np.histogram calculation and let matplotlib do it for you via its hist method, but for the cumulative distribution I think a line plot makes more sense. Do you agree or disagree? Why?
The np.histogram function is fast and convenient, but to be sure you understand what it is doing, your assignment is to write a version of your own, using a simple strategy of looping through the bins. Here are some things you will need; a small test harness you can use to check your version follows these snippets.
# If you have an array of bin boundaries, you can loop through the pairs of boundaries like this:
bins = [1.1, 2.2, 3.3, 4.4, 5.5]
print('method 1')
for left, right in zip(bins[:-1], bins[1:]):
    print(left, right)
# Or like this:
print('\nmethod 2')
for i in range(len(bins)-1):
    left, right = bins[i:i+2]
    print(left, right)
# Or like this:
print('\nmethod 3')
for i in range(len(bins)-1):
    left = bins[i]
    right = bins[i+1]
    print(left, right)
# Once you have a left and right, you will need to use them to select the appropriate values:
xx = 10 * np.random.randn(500)
left = 4.4
right = 5.0
selection = xx[(xx >= left) & (xx < right)]
print(selection)
print("We found %d values in the [%.2f, %.2f) bin"
% (len(selection), left, right)) # or selection.size
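When you have written your function, you can check it against np.histogram. Here is a minimal test harness, assuming your function (called my_histogram here, a name chosen for illustration) mirrors the np.histogram signature and returns a (values, edges) pair.
def check(my_histogram):
    # Compare your function's output against np.histogram
    # for both normalizations.
    data = np.random.rand(1000)
    bins = [0, 0.2, 0.5, 1.0]
    for density in (False, True):
        mine, _ = my_histogram(data, bins, density=density)
        ref, _ = np.histogram(data, bins=bins, density=density)
        assert np.allclose(mine, ref), (density, mine, ref)
    print("All checks passed.")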
Numpy has basic random number generation functions; scipy provides access to much more information about various theoretical distributions. Here is a very brief example for the Normal distribution.
x = np.linspace(-4, 4, 500)
fig, (axd, axc) = plt.subplots(ncols=2)
axd.plot(x, ss.norm.pdf(x))
axd.set_title('Normal PDF')
axc.plot(x, ss.norm.cdf(x))
axc.set_title('Normal CDF')
plt.close(fig)
Use ss.norm? to see the various functions (actually, methods) that are available. Some of these will be very useful later in the course.
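For example, two of those methods are ppf, the inverse of the CDF (the quantile function), and rvs, which draws random samples. A minimal sketch:
# The 97.5th percentile of the standard Normal (about 1.96):
print(ss.norm.ppf(0.975))
# Five random draws from the standard Normal:
print(ss.norm.rvs(size=5))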
Suppose you have a data set that appears to be approximately Gaussian. How might you fit a Gaussian curve to it? Let's try a simple approach. We will start with a series that really is Gaussian. We know the true mean and true standard deviation for our experiment, but of course with a real data set we would know only the sample mean and sample standard deviation, so we use those sample statistics as the parameters for our estimated Gaussian fit.
np.random.seed(1)
ymean_true = 1.5
ystd_true = 2.5
y = ymean_true + ystd_true * np.random.randn(1000)
ymean = y.mean()
ystd = y.std()
print('sample mean and standard deviation are', ymean, ystd)
x = np.linspace(-10, 13, 1000)
y_pdf = ss.norm(loc=ymean, scale=ystd).pdf(x)
fig, ax = plt.subplots()
ax.hist(y, 20, density=True)
ax.plot(x, y_pdf, 'r', lw=2)
ax.set_xlim(-10, 13)
plt.close(fig)
We see that even with 1000 independent points taken from a very good Gaussian distribution, the randomness means we don't get a perfect fit. How would the fit change if we used more bins in the histogram? Try it and see! Make sure you understand, and can explain, how the fit of the histogram to the continuous pdf varies with the number of bins.
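Here is a minimal sketch of one way to run that experiment, reusing y, x, and y_pdf from the cell above; the bin counts chosen here are arbitrary.
fig, axs = plt.subplots(ncols=3, figsize=(9, 3))
for ax, nb in zip(axs, [10, 40, 160]):
    ax.hist(y, nb, density=True)
    ax.plot(x, y_pdf, 'r', lw=1)
    ax.set_title('%d bins' % nb)
plt.close(fig)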
Suppose we are in a location where the vector-mean wind is zero, but the East and North components of the wind are approximately Gaussian, are uncorrelated, and have the same standard deviation--let's say it is 5 m/s. The wind direction would be uniformly distributed around all points of the compass. What would the distribution of the speed look like? It certainly can't be Gaussian--it has to be positive. The speed is the magnitude of a 2-D vector whose components are independent, zero-mean Gaussians with equal standard deviation, so it should follow a Rayleigh distribution with that standard deviation as its scale parameter. Let's check.
np.random.seed(1)
npts = 10000
u = 5 * np.random.randn(npts)
v = 5 * np.random.randn(npts)
s = np.hypot(u, v)
x = np.linspace(0, 25, 1000)
y = ss.rayleigh(scale=5).pdf(x)
nbins = 20
fig, ax = plt.subplots()
ax.hist(s, nbins, density=True)
ax.plot(x, y, 'r', lw=2)
plt.close(fig)
Notice that although the Gaussian distribution has two free parameters, the Rayleigh distribution has only one: the scale. How can we estimate it? It is not the mean or the median:
print("mean and median of the speed are", s.mean(), np.median(s))
The scale parameter--which gives the location of the mode, or highest point on the PDF--is $\sqrt{\frac{2}{\pi}}$ times the mean.
print("estimated mode is", np.sqrt(2/np.pi) * s.mean())