A previous post covered a parameter estimation method called maximum likelihood estimation (MLE). If you want to know more about it, visit the following blog post.
There is another parameter estimation method worth mentioning: maximum a posteriori estimation, abbreviated MAP. To see how it works, we’ll revisit Bayes’ theorem, written here for parameters $\theta$ and data $D$:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$
Each component of the equation is explained below:
- $P(\theta)$ is known as the prior. It specifies the prior knowledge about the parameters before seeing the data.
- $P(D)$ works as a normalization factor. It’s calculated by marginalizing over the parameters: $P(D) = \int P(D \mid \theta)\,P(\theta)\,d\theta$.
- $P(D \mid \theta)$ is the likelihood function, a function linking the data with your parameters. It measures how probable your data is under the current parameter values.
- $P(\theta \mid D)$ is the posterior distribution. Given your data, it measures how probable it is that the data was generated by a certain set of parameter values.
What’s the difference between MLE and MAP?
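- Estimating parameters through MLE means maximizing the likelihood $P(D \mid \theta)$, where $\theta$ are the parameters and $D$ is the data.
- Estimating parameters through MAP means finding the parameters that maximize the posterior $P(\theta \mid D)$, which in turn amounts to maximizing the product of the likelihood and the prior, $P(D \mid \theta)\,P(\theta)$, because the normalization factor $P(D)$ does not depend on the parameters.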
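Written out as optimization problems, the two estimators differ by a single term. The following is a sketch in which $\theta$ stands for the parameters and $D$ for the data (symbols chosen here for illustration). Taking logarithms is allowed because the logarithm is monotonic and does not move the maximum:

```latex
\begin{aligned}
\hat{\theta}_{\text{MLE}} &= \arg\max_{\theta} \, \log P(D \mid \theta) \\
\hat{\theta}_{\text{MAP}} &= \arg\max_{\theta} \, \log P(\theta \mid D) \\
&= \arg\max_{\theta} \, \left[ \log P(D \mid \theta) + \log P(\theta) - \log P(D) \right] \\
&= \arg\max_{\theta} \, \left[ \log P(D \mid \theta) + \log P(\theta) \right]
\end{aligned}
```

With a flat prior, $\log P(\theta)$ is constant and MAP reduces to MLE.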
When all these components are normally distributed, this is how the prior and the likelihood shape the posterior. The prior pulls the posterior towards its own position. This will make sense in the next section.
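For the all-normal case, the pull of the prior can be computed in closed form. The sketch below assumes a single unknown mean with a known noise standard deviation. The function name `normal_posterior` and all the numbers are made up for illustration and are not taken from the tutorial's data. The posterior mean is a precision-weighted average of the prior mean and the sample mean, so it always lands between the two.

```python
from statistics import fmean


def normal_posterior(prior_mean, prior_sd, data, noise_sd):
    """Posterior of a normal mean under a normal prior and known noise_sd."""
    n = len(data)
    prior_precision = 1.0 / prior_sd**2   # how strongly the prior pulls
    data_precision = n / noise_sd**2      # how strongly the data pulls
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * fmean(data)) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd


# Hypothetical numbers: prior centred at 0, observations centred at 4.
data = [3.6, 4.1, 4.4, 3.9]
post_mean, post_sd = normal_posterior(
    prior_mean=0.0, prior_sd=1.0, data=data, noise_sd=2.0)
mle = fmean(data)  # the MLE of the mean is the sample mean

# For a normal posterior the mode equals the mean, so post_mean is the MAP.
print(f"MLE: {mle:.2f}, MAP: {post_mean:.2f} (posterior sd {post_sd:.2f})")
# prints: MLE: 4.00, MAP: 2.00 (posterior sd 0.71)
```

Here the prior and the data carry the same precision, so the MAP estimate sits halfway between the prior mean and the sample mean. With more observations the data precision grows and the estimate moves towards the MLE.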
Bayesian linear regression
The case study is to estimate the parameters of a linear regression model, where two normally distributed random variables act as the offset and the slope of the line. The data used for this tutorial is again simulated and looks like this:
For the sake of simplicity, the variance of these variables will be kept constant. With TensorFlow and TensorFlow Probability the model is defined as follows:
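```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Define model and its constituents (as normal random variables).
# X is assumed to hold the simulated inputs, defined elsewhere in the tutorial.
b0_v = tf.Variable(0.0)  # offset
b1_v = tf.Variable(0.0)  # slope
model = tfd.JointDistributionSequential([
    tfd.Normal(name="b0", loc=b0_v, scale=1.0),
    tfd.Normal(name="b1", loc=b1_v, scale=1.0),
    # Lambda arguments arrive most recent first, hence (b1, b0).
    lambda b1, b0: tfd.Independent(
        tfd.Normal(loc=b0 + b1 * X, scale=5.0),
        reinterpreted_batch_ndims=1),
])
```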
In terms of code, what are the differences between MAP and MLE?
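```python
def loss_MAP(model, data):
    # Negative joint log-probability, minus the mean of the prior terms:
    # log_prob_parts returns one log-probability per distribution, and
    # [:-1] keeps those of b0 and b1.
    total_log_prob = (
        -tf.reduce_mean(model.log_prob([b0_v, b1_v, data]))
        - tf.reduce_mean(model.log_prob_parts([b0_v, b1_v, data])[:-1]))
    return total_log_prob


def loss_MLE(model, data):
    # Negative joint log-probability of the parameters and the data.
    total_log_prob = -tf.reduce_mean(model.log_prob([b0_v, b1_v, data]))
    return total_log_prob
```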
When we compute the posterior with logarithms, the product of the likelihood and the prior becomes the sum of their logarithms. The negative sign turns the maximization into a minimization problem, which is the form optimizers expect.
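To make the difference between the two objectives concrete without any library, here is a minimal plain-Python sketch. It is not the tutorial's TensorFlow code. Unlike the model above, whose prior locations are the trainable variables themselves, it assumes fixed zero-centred normal priors with standard deviation `PRIOR_SD` on both parameters, a known noise scale, and hypothetical simulated data (true offset 2, slope 3). All names and numbers are invented for illustration. The only difference between the two fits is the extra prior term in the gradient.

```python
import random

random.seed(0)
NOISE_SD, PRIOR_SD = 5.0, 1.0  # assumed scales, not fitted

# Hypothetical simulated data: y = 2 + 3x + normal noise.
xs = [i / 10 for i in range(-50, 51)]
ys = [2.0 + 3.0 * x + random.gauss(0.0, NOISE_SD) for x in xs]


def grad_neg_log_lik(b0, b1):
    """Gradient of the negative log-likelihood with respect to (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = (b0 + b1 * x - y) / NOISE_SD**2
        g0 += r
        g1 += r * x
    return g0, g1


def fit(use_prior, steps=2000, lr=0.01):
    """Plain gradient descent starting from b0 = b1 = 0."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0, g1 = grad_neg_log_lik(b0, b1)
        if use_prior:
            # The negative log of a N(0, PRIOR_SD) prior adds b / PRIOR_SD**2.
            g0 += b0 / PRIOR_SD**2
            g1 += b1 / PRIOR_SD**2
        b0, b1 = b0 - lr * g0, b1 - lr * g1
    return b0, b1


mle = fit(use_prior=False)   # likelihood only
map_ = fit(use_prior=True)   # likelihood plus prior
print("MLE (b0, b1):", mle)
print("MAP (b0, b1):", map_)
```

Under these assumptions the fixed prior pulls both MAP estimates towards zero relative to the MLE ones, which is the attraction described earlier.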
After several iterations, the results of the fit are shown below:
You can see how both approaches converge to the best fit to the data given the chosen model. The starting point of the model is the black line, where both parameters are zero. Since both results look similar, you may be wondering: what’s the difference between using MAP and MLE, then?
Let’s see the learning curves of those parameters.
From these plots we can draw the following conclusions:
- Both methods converge to the same values
- MLE converges faster than MAP
At each iteration, the MAP algorithm takes into account the prior to estimate the value of the parameters. This becomes a burden when the true value is far away from the starting value of the prior.
MAP is useful to express uncertainty about the parameters and when you want your prior knowledge to influence the estimation of the parameters. If that’s not particularly important for you, choose MLE instead.
These two parameter estimation techniques are now part of your toolset. Feel free to adapt the code of this example to your own problems. As usual, you can find the complete tutorial for this post on my GitHub.