Matplotlib, SciPy, NumPy, and pandas: Coming Together in Perfect Harmony


The Point of this Post: To Document an Example


In this update, we'll cover reading data into a pandas DataFrame, Seaborn, creating multi-plot figures with matplotlib.pyplot.subplots(), LaTeX labeling, and parameterizing Gamma distributions using SciPy.

I've been sitting on this example for a while now, so it already seems quaint relative to the things we've accomplished since (including bootstrap hypothesis testing, implementing multi-armed bandit algorithms, and Bayesian parameter estimation). But hey! Part of the fun of having a blog is documenting life's great journey to your three-person audience.

This particular example was part of a far larger sprint we completed at the beginning of this week, but I think it's helpful in illuminating key features, and more importantly, showing how all those features relate to one another.

If that sounds intriguing to you -- hey, wait, come back!


Okay, Dan, Enough with the Admittedly Witty Exposition: Start Already



Importing


To begin, we're going to need to invite all of our friends to our Python Party. In Python, this is trivial. We only need import them:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn

# See: Comment (1) 
sb_dark = seaborn.dark_palette("skyblue", 8, reverse=True)
seaborn.set(palette=sb_dark)


How painless is Seaborn? Prior to calling seaborn.set(palette = sb_dark), our histograms looked like this:

After calling seaborn.set(palette = sb_dark), our histograms look like this: 


Wonderful.

In the example above, seaborn.dark_palette() creates a list -- or "palette" -- of eight colors, ranging from sky blue to very dark grey (hence the name "dark_palette"). Colors themselves are (r, g, b, alpha) tuples, so if we printed out our color palette, we'd simply see 32 numbers and 8 sets of parentheses. 


Reading Data 


Now that all our libraries are accounted for, we'll need data. I hear they have that on The Internet now, but for this example, let's assume we already have some. Relative to our current directory, we can find this data in a folder we've named "data", and in a file we've named "data.txt". Let's further assume that we've investigated our data via the command line, and so we know it's tab-separated.
nville = pd.read_csv("data/data.txt", sep="\t") # See: Comment (1) 



DataFrame objects are amazing things. They hold the same information that our text file did, but they have methods, like this one: 

nville.head()
YearJanFebMarAprMayJunJulAugSepOctNovDec
018712.764.585.014.133.302.981.582.360.951.312.131.65
118722.322.113.145.913.095.176.101.654.501.582.252.38
218732.967.144.113.596.314.204.632.361.814.284.365.94
318745.229.235.3611.841.492.872.653.523.122.636.124.19
418756.153.068.144.221.735.638.121.603.791.255.464.30
This allows us to quickly see what data we're dealing with. In this particular case, it's the average monthly precipitation in Nashville going all the way back to 1871. (Alright, so there's no way you could have known that without me telling you).

Great... So when are you actually going to do something? 


Now -- I promise! 


We're going to plot twelve histograms, each one visualizing the distribution of rainfall for a particular month. Then, we're going to try to model each month's distribution as a Gamma random variable, and plot the associated Gamma PDF over the associated histogram. Why use Gamma? Well, this guy wrote a long academic paper on it, and people who use font that small generally know what they're doing. 


We're going to parameterize our Gamma distribution in two ways: Once using Method of Moments and once using Maximum Likelihood. For more on that particular topic, you can read this wiki written by -- shameless plug! -- my alma mater. 


Fantastic. Now that you know the problem at hand, we can dive in. That's my code below. To spare you from as much of my writing as possible, I've included a number of comments that say, "See: Comment." That way, you'll only need to consult my writing when you're profoundly confused. 


See: Despite my impossibly long-winded introduction, I really do care about your time.  

# We are going to plot 12 histograms, one for every month

months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
num_rows, num_columns = 4, 3


fig, our_axes = plt.subplots(nrows=4, ncols=3, figsize=(15, 15)) # See: Comment (1)
monthly_axes = [ax for ax_row in our_axes for ax in ax_row]      

for i, month in enumerate(months):
    current_ax = monthly_axes[i]        
    
    current_month_data = nville[month[:3]]  # Store relevant column in current_month_data
    current_ax.hist(current_month_data) # Ask Axes object to plot a histogram
    current_ax.set_title(month)         # Give our Axes a unique title
    current_ax.set_ylabel("Counts")
    current_ax.grid(False)              # Turn off gridlines -- they're ugly
    
    
    # Now, we're going to fit our Gamma Distribution
    # As mentioned, we're going to find the parameters in two ways: 
    # (1) "Manually", using Method of Moments
    # (2) "Automatically", using MLE by way of SciPy objects/methods
    
    # The Manual Way (primarily to demonstrate numpy)
    current_mean = np.mean(current_month_data)
    current_var = np.var(current_month_data)
    moment_alpha = (current_mean*current_mean)/(current_var)
    moment_beta =  current_mean/(current_var)
    moment_gamma = [moment_alpha, 0, moment_beta]
    gamma1 = stats.gamma(*moment_gamma) # See: Comment (2)
    
    # The Easy Way (MLE)
    mle_gamma = stats.gamma.fit(current_month_data) # See: Comment (3) 
    gamma2 = stats.gamma(*mle_gamma) 
    
    
    # Plotting the PDFs of our Gamma Distributions
    x_min = np.min(current_month_data)
    x_max = np.max(current_month_data)
    x = np.linspace(x_min, x_max) 
    
    current_ax_twin = current_ax.twinx() # See: Comment (4)
    current_ax_twin.grid(False)          
    
    # See: Comment (5)
    gamma1_lab = r'$\gamma(\alpha=%.2f, \beta=%.2f$' % (moment_alpha, moment_beta) + ')'
    gamma2_lab = r'$\gamma(\alpha=%.2f, \beta=%.2f$' % (mle_gamma[0], mle_gamma[2]) + ')'
    
    current_ax_twin.plot(x, gamma1.pdf(x), color=sb_dark[3], label=gamma1_lab)
    current_ax_twin.plot(x, gamma2.pdf(x), color=sb_dark[7], label=gamma2_lab)
    current_ax_twin.legend() 
    
fig.tight_layout() # See: Comment (6) 
plt.show()         # Fin!






potato_fig, potato_ax = plt.subplots(1, 1)
plt.figtext(.2, .7, "potato\n\n$potato$", axes=potato_ax, fontsize="xx-large")


And with that, I leave  you with the fruit of our statistical computing





Labels: ,