🔍 Overview

Bayesian statistics treats probability not as the frequency of an event occurring, but rather as a degree of belief in an event. This course takes a theoretical and practical approach to the topic. We are given real-life datasets and apply appropriate models to conduct inference. There is an emphasis on explanations and proper methodology to back our results.

*I took this course in Spring 2020; the contents may have changed since then.

🏢 Structure

  • Six assignments - 30% (of the final grade)
  • Project - 10%
  • Exams - 60%

Assignments

These consist of a combination of hand calculations and derivations, as well as simulations using WinBUGS / OpenBUGS (Bayesian inference Using Gibbs Sampling). They are heavily scenario-based (i.e., solving probabilities for realistically framed problems) and are turned in as reports.

  1. Probability Basics - Basic probability refresher and Bayes' Theorem introduction
  2. Probability Distributions - Various probability distributions, applications, and analysis
  3. Bayesian Approach - Applications of priors, posteriors, and estimators
  4. Sampling - Metropolis and Gibbs sampling, conjugate priors
  5. OpenBUGS - Creating estimators for data using priors and OpenBUGS
  6. Missing Data - More OpenBUGS practice with complex input data and credible sets

Project

Open-Ended Bayesian Analysis - There was one open-ended project where you picked your own dataset and performed analysis and estimation using techniques from the course. The deliverables were runnable source code (OpenBUGS or Python PyMC3) and a report summarizing the findings.

Exams

You had one week to complete each exam, which was open-notes and required a mix of handwritten solutions and OpenBUGS. The style was similar to the homeworks, but some problems required a bit more depth. Overall the content came from the same topics, so I won't cover the exams in a separate section.

📖 Assignment 1 - Probability Basics

Naive Bayes Classifier

The good ol' Bayes Theorem:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Say we have tabular data with features B_0, B_1, ..., B_n, and we want to predict A.

If these features are independent of one another, both marginally and conditionally on A (a naive assumption), we can use the property that independent events satisfy

P(AB) = P(A)P(B)

Given this assumption, we can factor the joint probabilities feature by feature to create an estimator:

P(A|B_0,...,B_n) = \frac{P(B_0|A)...P(B_n|A)P(A)}{P(B_0)...P(B_n)}

As the denominator P(B_0)...P(B_n) does not depend on A, we can further simplify this to:

P(A|B_0,...,B_n) \propto P(A)\prod_{i}{P(B_i|A)}

This result is only proportional to the true probability, so we compute it for each possible value of A and take the maximum as our prediction. We can recover the true prediction probabilities by normalizing the estimates.
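
Here is a minimal sketch of that computation in Python, with made-up priors and per-feature probabilities (not the assignment's spreadsheet values):

def naive_bayes(priors, cond_probs, observed):
    """priors: {class: P(class)}; cond_probs: {class: {feature: P(feature|class)}};
    observed: list of features present in the new example."""
    scores = {}
    for c, p_c in priors.items():
        score = p_c
        for feature in observed:
            score *= cond_probs[c][feature]   # multiply P(B_i | A) for each feature
        scores[c] = score                     # proportional to P(A | B_0,...,B_n)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalize to recover probabilities

# Hypothetical "go to the beach?" example
priors = {"beach": 0.6, "no_beach": 0.4}
cond_probs = {
    "beach":    {"sunny": 0.8, "weekend": 0.7},
    "no_beach": {"sunny": 0.3, "weekend": 0.5},
}
print(naive_bayes(priors, cond_probs, ["sunny", "weekend"]))
# beach score is proportional to 0.6*0.8*0.7 = 0.336, no_beach to 0.4*0.3*0.5 = 0.060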

Results

One of the tasks is to manually create a Bayes estimate for whether or not somebody goes to the beach. We are given a spreadsheet and use the percentage of times each feature appears within each class.

https://personal-site-kwang.s3.amazonaws.com/omscs/cs6420/Untitled.png

We also look at a problem involving the sensitivity, specificity, and positive predictive value of tests, and the impact of combining their results in series or in parallel. Like a circuit, for a combined test of two to be declared positive when run in series, both must test positive. In contrast, the combined result of a parallel test only needs one positive to output positive.
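
A small sketch of that arithmetic, assuming the two tests are independent and using made-up sensitivity/specificity values:

# Combining two independent diagnostic tests, like circuit elements.
# Serial: declared positive only if BOTH tests are positive.
# Parallel: declared positive if AT LEAST ONE test is positive.

def combine_serial(se1, sp1, se2, sp2):
    sensitivity = se1 * se2                    # both must detect the condition
    specificity = 1 - (1 - sp1) * (1 - sp2)    # negative unless both false-positive
    return sensitivity, specificity

def combine_parallel(se1, sp1, se2, sp2):
    sensitivity = 1 - (1 - se1) * (1 - se2)    # missed only if both tests miss
    specificity = sp1 * sp2                    # both must correctly say negative
    return sensitivity, specificity

# Hypothetical tests given as (sensitivity, specificity)
print(combine_serial(0.9, 0.8, 0.85, 0.95))    # (0.765, 0.99)
print(combine_parallel(0.9, 0.8, 0.85, 0.95))  # (0.985, 0.76)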

📖 Assignment 2 - Probability Distributions

There's not too much to add here; the probability distributions we know and love are reviewed. This includes the common discrete and continuous variables, probability and cumulative distribution functions, and joint, conditional, and marginal distributions.

📖 Assignment 3 - Bayesian Approach

We take a deeper look into Bayes' Theorem, and we'll rework some variables too:

\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{m(x)}

Here, the variable x is a typical observation within a dataset or experiment, and the parameter θ indexes the model. The likelihood, f(x|θ), is the probability that x is observed given this parameter θ. "Classical", or frequentist, statistics considers this parameter to be a fixed number or vector, whereas Bayesian statistics addresses uncertainty around θ by treating it as a random variable with a probability distribution, π(θ), called the prior.

A goal in Bayesian statistics is to start with this prior, π(θ), which may be estimated by a domain expert. Combining it with the observed samples gives us π(θ|x), the posterior, which summarizes the experiment: the prior updated with the observed data.

The marginal, m(x), is equivalent to \int_{\theta}{f(x|\theta)\pi(\theta)d\theta}.

Results

This assignment involved deriving the posterior given the likelihood and the prior. This is done through a conjugate prior: a prior whose posterior belongs to the same distribution family, so the update has a known closed form. I'll show an example similar to one from the assignment.

Here is a known conjugate pair:

Likelihood: X_i|\theta \sim Unif(0, \theta)

Prior: \pi(\theta) \sim Pareto(\theta_0, \alpha)

Posterior: \theta|X \sim Pareto(\max[\theta_0, X_0, ..., X_n], \alpha + n)

We want to estimate the parameter θ for the following data, which we know is drawn from a uniform distribution:

X = [0.4, 1.0, 1.5, 1.7, 2.0, 2.1, 3.1, 3.7, 4.3, 4.9]

θ, as shown in the likelihood above, represents the upper bound of the distribution. Clearly the value is at least 4.9, but the real bound is likely higher, since the maximum of only 10 samples will tend to fall below the true upper bound.

An expert provides the prior belief for this parameter, π(θ) = Pareto(4.0, 2.5), where the parameters are the minimum and the shape of the curve. We now update this belief with the data to produce the posterior θ|X ∼ Pareto(4.9, 12.5), calculated from the conjugate pair above.

https://personal-site-kwang.s3.amazonaws.com/omscs/cs6420/Untitled%201.png

The prior Pareto model was provided by an "expert". The posterior represents our belief about θ, which is likely below 6. (Source)

These Pareto distributions represent our estimate of θ, which has a minimum of 4.9. We can take the mean to produce our final estimate, which for the posterior is 5.326.
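
These numbers are easy to reproduce; here is a minimal sketch of the conjugate update and posterior mean in Python:

# Conjugate update for a Unif(0, theta) likelihood with a Pareto(theta_0, alpha) prior
X = [0.4, 1.0, 1.5, 1.7, 2.0, 2.1, 3.1, 3.7, 4.3, 4.9]

theta_0, alpha = 4.0, 2.5            # expert prior: Pareto(minimum, shape)
post_min = max([theta_0] + X)        # new minimum: max of prior minimum and the data
post_shape = alpha + len(X)          # new shape: alpha + n

post_mean = post_shape * post_min / (post_shape - 1)   # mean of a Pareto distribution
print(post_min, post_shape, post_mean)                 # 4.9 12.5 ~5.326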

If you want to produce a range for the estimate, one method is the highest posterior density (HPD) credible set. This range has (1 - α) credibility, meaning the posterior area over the range equals (1 - α).

https://personal-site-kwang.s3.amazonaws.com/omscs/cs6420/Untitled%202.png

The cut, in this case 0.05, is the largest constant such that the area of the posterior density above it is ≥ 1 - α.
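
A rough numeric sketch of that cut idea, using the Pareto(4.9, 12.5) posterior from above and a brute-force grid search (my own illustration, not the course's code):

import numpy as np
from scipy import stats

alpha = 0.05
post = stats.pareto(b=12.5, scale=4.9)          # posterior from the conjugate update

# Grid over plausible theta values and the posterior density on it
theta = np.linspace(4.9, 10, 200_000)
dens = post.pdf(theta)

# Lower the horizontal cut until the region {density >= cut} holds >= 1 - alpha mass
for cut in np.linspace(dens.max(), 0, 5000):
    region = theta[dens >= cut]
    if post.cdf(region.max()) - post.cdf(region.min()) >= 1 - alpha:
        break

print(cut, (region.min(), region.max()))
# Because the Pareto density is monotone decreasing, the HPD set here is one-sided:
# roughly [4.9, 4.9 * 0.05**(-1/12.5)], i.e. about [4.9, 6.23]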

📖 Assignment 4 - Sampling

We saw the use of conjugates to find the distribution of the posterior in the previous section. Recall that the marginal m(x) can be rewritten as \int_{\theta}{f(x|\theta)\pi(\theta)d\theta}. Conjugate pairs are usable when this integral can be solved in closed form. When we cannot solve it, we turn to Markov Chain Monte Carlo (MCMC) methods.

Gibbs sampling enables us to simulate samples from the posterior distribution without having its density explicitly, by looping through its full conditionals over many iterations. Given 3 random variables X_1, X_2, X_3, we can initialize them to arbitrary values x_1^0, x_2^0, x_3^0, and then on iteration i we sample x_1^i \sim f(X_1|X_2=x_2^{i-1}, X_3=x_3^{i-1}). We then hold x_1^i constant for the next sample:

x_2^i \sim f(X_2|X_1=x_1^{i}, X_3=x_3^{i-1})

and finally,

x_3^i \sim f(X_3|X_1=x_1^{i}, X_2=x_2^{i})

This repeats over many iterations until convergence.
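
As an illustration, here is a minimal hand-rolled Gibbs sampler for normal data with unknown mean μ and precision λ, using (semi-)conjugate Normal and Gamma priors; the data and prior parameters are made up:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50)      # fake data for illustration
n = len(x)

# Hypothetical priors: mu ~ N(mu0, 1/tau0), lam ~ Gamma(a0, b0) (rate parameterization)
mu0, tau0, a0, b0 = 0.0, 0.01, 0.1, 0.1

mu, lam = 0.0, 1.0                               # arbitrary starting values
samples = []
for i in range(10_000):
    # Sample mu from its full conditional, holding lam fixed
    prec = tau0 + n * lam
    mean = (tau0 * mu0 + lam * x.sum()) / prec
    mu = rng.normal(mean, 1 / np.sqrt(prec))

    # Sample lam from its full conditional, holding the newly drawn mu fixed
    lam = rng.gamma(a0 + n / 2, 1 / (b0 + 0.5 * ((x - mu) ** 2).sum()))

    samples.append((mu, lam))

burned = np.array(samples[1000:])                # discard burn-in
print(burned.mean(axis=0))                       # posterior means of (mu, lam)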

Results

We used PyMC3 with Gibbs sampling to estimate the posterior distribution.

Here, we can visualize the samples of μ and λ, each drawn from its conditional with the other variable fixed at its most recently sampled value.

https://personal-site-kwang.s3.amazonaws.com/omscs/cs6420/Untitled%203.png

📖 Assignment 5 - OpenBUGS

For this assignment, we apply the concepts from the previous sections, but have OpenBUGS do the heavy lifting to sample from the posterior!

We want to model the failure rate for 10 power plant pumps. The following data shows the operation time (in 1000s of hours) and the number of failures that occurred.

94.3, 5
15.7, 1
62.9, 5
126,  14
5.24, 3
31.4, 19
1.05, 1
1.05, 1
2.1,  4
10.5, 22

The number of failures for pump j is assumed to follow a Poisson distribution whose rate is the failure rate θ_j scaled by the operation time t_j; this is our likelihood:

X_j \sim Poisson(\theta_j t_j)

As discussed, in Bayesian analysis the parameter θ is itself a random variable. Here, each θ_j is given a Gamma prior whose rate parameter β is also unknown:

\theta_j \sim Gamma(1, \beta)

and a Gamma hyperprior is adopted for β:

\beta \sim Gamma(0.1, 1)

The posterior over each θ_j, which summarizes our estimate of the failure rate, is what OpenBUGS samples for us.

In OpenBUGS, we declare a model programmatically:

model {
    for (j in 1:n) {                  # Loop over every pump
        theta[j] ~ dgamma(1, beta)    # Gamma prior on the failure rate of pump j
        rate[j] <- theta[j] * t[j]    # Expected failures = failure rate * operation time
        X[j] ~ dpois(rate[j])         # Poisson likelihood for the observed failure count
    }
    beta ~ dgamma(0.1, 1)             # Gamma hyperprior on the shared rate parameter beta
}
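
For comparison, roughly the same model could be sketched in PyMC3 (mentioned above as a project option); this is a sketch rather than the assignment's required solution, and PyMC3's default sampler is NUTS rather than Gibbs:

import numpy as np
import pymc3 as pm

t = np.array([94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])  # operation time
X = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])                            # observed failures

with pm.Model():
    beta = pm.Gamma("beta", alpha=0.1, beta=1.0)                # hyperprior on beta
    theta = pm.Gamma("theta", alpha=1.0, beta=beta, shape=10)   # failure rate per pump
    pm.Poisson("X", mu=theta * t, observed=X)                   # likelihood
    trace = pm.sample(5000, tune=1000)

print(pm.summary(trace))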

After running this for 50,000 samples, we have an estimate of the parameters for each pump:

node       mean     sd       MC_error   val2.5pc  median   val97.5pc  start  sample
beta       1.34     0.486    0.002973   0.5896    1.271    2.466      1001   50000
theta[1]   0.06261  0.02552  1.111E-4   0.02334   0.05914  0.1217     1001   50000
theta[2]   0.118    0.08349  3.694E-4   0.01431   0.09888  0.3296     1001   50000
theta[3]   0.09366  0.03829  1.706E-4   0.03439   0.08842  0.1828     1001   50000
theta[4]   0.1178   0.03048  1.472E-4   0.06595   0.115    0.1848     1001   50000
theta[5]   0.6116   0.3097   0.001409   0.1632    0.5589   1.361      1001   50000
theta[6]   0.6104   0.1366   6.453E-4   0.3705    0.6001   0.9058     1001   50000
theta[7]   0.8686   0.6494   0.003059   0.101     0.7124   2.537      1001   50000
theta[8]   0.8692   0.6481   0.003354   0.09784   0.7117   2.513      1001   50000
theta[9]   1.478    0.6897   0.00351    0.4705    1.367    3.128      1001   50000
theta[10]  1.944    0.4133   0.002022   1.223     1.916    2.83       1001   50000

📖 Assignment 6 - Missing Data

We continue to use OpenBUGS for this assignment, and configure our model and sampling to handle missing data.
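
In BUGS, missing observations are entered as NA and sampled as extra unknowns. PyMC3 offers a similar convenience through masked arrays; here is a hypothetical sketch with a simple normal model (made-up data, not the assignment's):

import numpy as np
import pymc3 as pm

# Hypothetical observations with two missing values encoded as -999
y = np.ma.masked_values([4.2, 3.9, -999, 5.1, -999, 4.6], value=-999)

with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y)   # masked entries are imputed
    trace = pm.sample(2000, tune=1000)

print(pm.summary(trace))   # includes posterior draws for the missing y values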

💻 Project

There's not much more to show for the previous section or this one that isn't a repetitive figure.

If you would like to view more Bayesian analysis examples, I would suggest downloading OpenBUGS. It comes with a breadth of examples, most of which use real data and include their own short analyses.

⏩ Next Steps

Bayesian statistics helps build a basis for data modeling and interpretation. I am very curious about its application and future in the medical field; we were recommended to read a paper by the FDA: Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials. I will do this... soon 🙂