Bayesian Regression with PyMC: A Brief Tutorial

Warning: This is a love story between a man and his Python module

As I mentioned previously, one of the most powerful concepts I've *learned* at Zipfian has been Bayesian inference using PyMC. PyMC is currently my favorite library of any kind in any language. I dramatically italicized "learned" because I had been taught Bayesian inference in college; however, the lecture usually ended with something along the lines of, "But this gets really mathematically complicated and intensive, so screw it."

In fairness, they were always at least partially right. Bayesian inference can be very mathematically complicated and intensive; however, PyMC abstracts that away from you. So long as you have a coherent mental model of the problem at hand, you can apply Bayesian methods to anything (although I would note: It is absolutely worth learning the math that underlies all of this).

So as you can tell, I really, really like PyMC. 

(Indeed, this poignant romance was originally going to be the first ever blog post nominated for an Academy Award, but Her came out and mined much of the same emotional territory. I gracefully stepped aside and let Spike Jonze, Joaquin Phoenix and Scarlett Johansson have their fifteen minutes...

...The Academy Awards are pure politics anyway.)

What we'll cover

In this example, I'm going to demonstrate how you can use Bayesian inference with PyMC to estimate the parameters of a linear regression problem. I think this example is particularly well-suited for a tutorial because: (A) A lot of people learn linear regression in college, so hopefully you'll readily understand the model necessary to solve the problem; and (B) The numbers that emerge are readily meaningful and interpretable.

I'm first going to cover the Single Variable case, and then I'll share my model for the Multivariable case. 

Single Variable Bayesian Regression

For this example, we're going to try to model the mpg of a car as a function of the car's weight, i.e., mpg = b0 + b1 * weight + error. In the example below, the DataFrame I'm using is named "float_df," and the column names are self-explanatory.
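To make that generative story concrete, here's a quick numpy sketch of what the model claims: each car's mpg is a line in weight plus Gaussian noise. The coefficients here are made up purely for illustration, not fit to any real data:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical "true" parameters, chosen only for illustration
b0_true, b1_true, noise_sd = 40.0, -0.007, 3.0

# simulate 300 car weights, then generate mpg from the linear model
weight = rng.uniform(1500, 5000, size=300)
mpg = b0_true + b1_true * weight + rng.normal(0, noise_sd, size=300)
```

Bayesian inference runs this story in reverse: given observed weight and mpg, infer plausible values of b0, b1, and the noise.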

As usual, we begin by importing our friends: For this exercise, we'll need pymc, numpy, pandas, and matplotlib.pyplot, which I refer to below as pymc, np, pd and plt, respectively. 

Next, we'll need to set up the model. I explain what I'm doing in the included comments, but I will explain the more important parts below. 

# NOTE: the linear regression model we're trying to solve for is 
# given by:
# y = b0 + b1(x) + error
# where b0 is the intercept term, b1 is the slope, and error is 
# the error

# model the intercept/slope terms of our model as normal random
# variables; note that PyMC's Normal takes a precision (tau = 1/variance),
# so tau = 0.0003 corresponds to a comically large variance of ~3,333
b0 = pymc.Normal("b0", 0, 0.0003)
b1 = pymc.Normal("b1", 0, 0.0003)

# model our error term as a uniform random variable
err = pymc.Uniform("err", 0, 500)

# "model" the observed x values as a normal random variable
# in reality, because x is observed, it doesn't actually matter
# how we choose to model x -- PyMC isn't going to change x's values
x_weight = pymc.Normal("weight", 0, 1, value=np.array(float_df["weight"]), observed=True)

# this is the heart of our model: given our b0, b1 and our x observations,
# we want to predict y; the @deterministic decorator tells PyMC that
# pred is a deterministic function of its parents
@pymc.deterministic
def pred(b0=b0, b1=b1, x=x_weight):
    return b0 + b1*x

# "model" the observed y values: again, I reiterate that PyMC treats y as 
# evidence -- as fixed; it's going to use this as evidence in updating our belief
# about the "unobserved" parameters (b0, b1, and err), which are the 
# things we're interested in inferring after all
y = pymc.Normal("y", pred, err, value=np.array(float_df["mpg"]), observed=True)

# put everything we've modeled into a PyMC model
model = pymc.Model([pred, b0, b1, y, err, x_weight])
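One gotcha worth dwelling on: PyMC's Normal is parameterized by precision (tau = 1/variance), not by variance or standard deviation, so the 0.0003 passed to b0 and b1 above actually encodes a huge, weakly informative prior. A quick sanity check in plain numpy (no PyMC required):

```python
import numpy as np

tau = 0.0003                  # the precision we passed to pymc.Normal
variance = 1.0 / tau          # ~3,333: a very diffuse prior
sd = 1.0 / np.sqrt(tau)       # standard deviation of ~57.7
```

In other words, before seeing any data, we're saying the intercept and slope could plausibly be anywhere within a few standard deviations (roughly ±170) of zero.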

The aforementioned core concepts:

Whew. Get all that? I didn't the first time around, so if you're still in need of some help, I would check out PyMC's official tutorial.

An exceedingly helpful way of visualizing our model and ensuring that we set everything up exactly as we intended is by using the "graph" module. I've included the code and the resultant model below:

import pymc.graph
graph = pymc.graph.graph(model)
# graph is a pydot object, so we can render it to an image to inspect it
graph.write_png("model.png")

We basically get that for free -- immensely cool.

Now that we've done the legwork of setting up our model, PyMC can work its magic:

# prepare for MCMC
mcmc = pymc.MCMC(model)

# sample from our posterior distribution 50,000 times, but
# throw the first 20,000 samples out to ensure that we're only
# sampling from our steady-state posterior distribution
mcmc.sample(50000, 20000)

By invoking pymc.MCMC() and passing in our model, we can sample values from the steady-state posterior distribution of our model thousands and thousands of times. The notion of "steady-state" is why we pass 20000 as the second argument to the sample method: We don't want to record samples before MCMC has converged on a solution, so we throw out the first 20,000. This creates a probability distribution for the various non-observed inputs of our model. To get a more in-depth idea of how MCMC works, I found this lecture amazingly helpful. UPDATE: One of my classmates, Jonny Lee, also wrote a great post specific to this topic here.
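If MCMC still feels like magic, here's a toy Metropolis sampler in plain numpy. This is a deliberately minimal sketch of the idea, not PyMC's actual implementation: it estimates the mean of some synthetic data and discards burn-in samples the same way mcmc.sample(50000, 20000) does:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=200)    # synthetic data with true mean 5

def log_post(mu):
    # flat prior on mu plus a Gaussian likelihood with known sd = 2
    return -0.5 * np.sum((data - mu) ** 2) / 4.0

n_samples, burn = 5000, 1000
trace = np.empty(n_samples)
mu = 0.0                                  # arbitrary starting point
for i in range(n_samples):
    proposal = mu + rng.normal(0, 0.5)    # random-walk proposal
    # accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(mu):
        mu = proposal
    trace[i] = mu

posterior = trace[burn:]                  # throw away pre-convergence samples
```

The surviving samples cluster around the data's mean, which is exactly what "sampling from the posterior" buys you: a whole distribution for the parameter, not just a point estimate.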

After sampling our posterior distribution, which again, was developed as a "combination" of our prior beliefs and our observed data, we can do things like this:

print np.mean(mcmc.trace('b1')[:])
plt.hist(mcmc.trace('b1')[:], bins=50)

This is the probability distribution of our slope term: It allows us to say incredible things like, "A one-unit increase in the weight of a car is associated with a decrease in its mpg, and we're virtually certain that this coefficient lies between -0.0085 and -0.007."
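That "-0.0085 to -0.007" claim is just a credible interval read off the trace. Here's a hedged sketch of how you'd compute one, with a synthetic array standing in for mcmc.trace('b1')[:] since the real trace isn't reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-in for mcmc.trace('b1')[:], centered near the slope we found
b1_trace = rng.normal(-0.0078, 0.0003, size=30000)

# a 95% credible interval is just the middle 95% of the sampled values
lo, hi = np.percentile(b1_trace, [2.5, 97.5])
```

Because every sampled value of b1 is negative, we can speak directly about the probability that the coefficient is negative, something frequentist confidence intervals famously don't let you say.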

The Multivariable Case

The multivariable case isn't very different from the above. There's simply the added concept of a PyMC "Container," which binds similar random variables together in a coherent way (important when you have multiple betas to solve for).

The code for setting up this case is below:

def linear_setup(df, ind_cols, dep_col):
    """
    Inputs: pandas DataFrame, list of strings for the independent variables,
    single string for the dependent variable
    Output: PyMC Model
    """
    # model our intercept and error term as above
    b0 = pymc.Normal("b0", 0, 0.0001)
    err = pymc.Uniform("err", 0, 500)
    # initialize a NumPy array to hold our betas 
    # and our observed x values
    b = np.empty(len(ind_cols), dtype=object)
    x = np.empty(len(ind_cols), dtype=object)
    # loop through b, and make our ith beta
    # a normal random variable, as in the single variable case
    for i in range(len(b)):
        b[i] = pymc.Normal("b" + str(i + 1), 0, 0.0001)
    # loop through x, and inform our model about the observed
    # x values that correspond to the ith position
    for i, col in enumerate(ind_cols):
        x[i] = pymc.Normal("x" + str(i + 1), 0, 1, value=np.array(df[col]), observed=True)
    # as above, but use .dot() for 2D array (i.e., matrix) multiplication
    @pymc.deterministic
    def y_pred(b0=b0, b=b, x=x):
        return b0 + b.dot(x)
    # finally, "model" our observed y values as above
    y = pymc.Normal("y", y_pred, err, value=np.array(df[dep_col]), observed=True)
    return pymc.Model([b0, pymc.Container(b), err, pymc.Container(x), y, y_pred])
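The b.dot(x) line is doing a lot of work, so here's the shape arithmetic in plain numpy, with ordinary floats standing in for the PyMC random variables: b holds k coefficients, x holds k rows of n observations each, and the dot product yields one prediction per observation.

```python
import numpy as np

k, n = 2, 5                           # 2 predictors, 5 observations
b0 = 40.0
b = np.array([-0.007, 0.26])          # stand-ins for the beta random variables
# stand-in for the observed columns: one row per predictor
x = np.arange(k * n, dtype=float).reshape(k, n)

y_pred = b0 + b.dot(x)                # shape (n,): one prediction per observation
```

This is also why the Containers matter: they let PyMC treat the array of betas as a single coherent parent of y_pred instead of k unrelated variables.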

And here's an example (for the example, our independent variables are "weight" and "acceleration," and our dependent variable is still "mpg").

test_model = linear_setup(float_df, ["weight", "acceleration"], "mpg")
mcmc = pymc.MCMC(test_model)
mcmc.sample(100000, 20000)

Now that we've sampled our posterior, we can again plot the marginal probability distributions of our beta terms:
multifig, multiax = plt.subplots(2, 1, figsize=(10, 10))
b_nought = mcmc.trace("b0")[:]
b_weight = mcmc.trace("b1")[:]
b_accelerate = mcmc.trace("b2")[:]

# histogram each marginal posterior on its own axis
multiax[0].hist(b_weight, bins=50)
multiax[0].set_title("Weight Coefficient Probability Distribution")
multiax[1].hist(b_accelerate, bins=50)
multiax[1].set_title("Acceleration Coefficient Probability Distribution")

Finally, to tie everything together and solidify our mental model, I include the means of the histograms I plotted above along with the mean of our intercept term, and also provide the results of using OLS. You'll find that the two are quite similar...


print "Intercept: " + str(np.mean(b_nought)) 
print "Weight Coefficient: " + str(np.mean(b_weight)) 
print "Acceleration Coefficient: " + str(np.mean(b_accelerate))

# prints
# Intercept: 41.0051165145
# Weight Coefficient: -0.00729068321556
# Acceleration Coefficient: 0.267134957533


              coef    std err          t      P>|t|      [95.0% Conf. Int.]
const      41.0953      1.868     21.999      0.000        37.423    44.768
x1         -0.0073      0.000    -25.966      0.000        -0.008    -0.007
x2          0.2617      0.086      3.026      0.003         0.092     0.432
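For reference, the OLS table above came from statsmodels; the same "single line" fit can be sketched with numpy's lstsq. Since the original float_df isn't reproduced here, the data below is synthetic and the true coefficients are invented for illustration, not taken from the table:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
weight = rng.uniform(1500, 5000, size=n)
accel = rng.uniform(8, 25, size=n)

# hypothetical true model, for illustration only
mpg = 41.0 - 0.0073 * weight + 0.26 * accel + rng.normal(0, 2.0, size=n)

X = np.column_stack([np.ones(n), weight, accel])   # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, mpg, rcond=None)     # [intercept, b_weight, b_accel]
```

With diffuse priors like ours, the posterior means from MCMC and the OLS point estimates should land in essentially the same place, which is exactly what the comparison above shows.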

And there you have it. A lot of work whose results could have been duplicated in a single line of code using the statsmodels package, but doesn't this make you feel a lot more accomplished?


*Moment of truth: I glossed over an important point here in the hopes that a little simplification would make things clearer. In the comments to my code, I say something along the lines of, "Practically speaking, it doesn't matter how we model our X variables, since they are observed and their values can't vary." This is technically true if we're solely interested in inferring our beta terms, but not so true if we're interested in the Bayesian formalism. The fact is: We are still choosing to model our X observations as normal random variables. If our data weren't observed, then we would still be able to simulate values for it based on our prior probabilities, and our prior probability for X would say that it's a normal random variable. I tend to be a pragmatist about these sorts of things, but I don't want to get yelled at on Twitter.