Maximum likelihood estimation with Tensorflow

We are given a data set and we are told to model its distribution. But, how to proceed?

A reasonable first step would be to plot in a histogram the distribution of values. After looking at the histogram, you recognize its shape. It looks like a Normal, or… maybe it’s a Gamma distribution? Maximum likelihood is one of the methods to identify the distribution.

If you read my post on random variables, every distribution is defined by a set of parameters. We can adopt a Bayesian approach to get the most likely value for these parameters. We will make use of the Bayes’ theorem as a starting point:

\displaystyle p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}

For now, let’s focus only on p(D|\theta). This is the likelihood function, and measures how likely it is that data D has been generated from this distribution with parameters \theta. Evaluating different choices for \theta our objective consists of finding the values that maximize the likelihood function. Mathematically, this is formulated as:

\displaystyle \hat\theta_{ML} = argmax_{\theta}\,p(D|\theta)

The maximum likelihood estimate is therefore the value of \theta that maximizes p(D|\theta). Depending on the complexity of the true distribution, it is possible to find a closed-form solution to this problem. In this post, we will try to find a good estimate of \theta using Tensorflow 2.0 with Tensorflow Probability.

Parameter estimation of a probability distribution

In this toy example we’ll start by loading the required libraries as we did in our previous post. For demonstration purposes we generate some data coming from a Normal distribution. Besides data, we create the model itself. It represents the distribution from which the data has been generated plus two variables that work as parameters for the distribution with initial values.

# Data
x = 0.66+tf.random.normal([5000])
# Two variables
mu = tf.Variable([0.]) # Mean of the distribution
sigma = tf.Variable([3.]) # Standard deviation of the distribution

# The model is a probability distribution
model = tfd.Normal(loc=mu, scale=sigma)

To fit our data to a distribution, we have to define two functions:

  • A loss function, although for us, it has to be the likelihood function. Since optimization in Tensorflow follows a minimization procedure we use the negative of the log-likelihood function which is equivalent to maximizing the likelihood. The log operator deters from running into underflow issues when calculating the likelihood of hundreds of samples. This function will depend on the data and parameters that have been declared.
  • A function to record the gradients and later use them to update the parameters during the learning process. These gradients are calculated by differentiation of the loss function with respect to the parameters. You don’t need to worry about hardcoding this step. The tape object in Tensorflow takes care of everything. This process is called Automatic Differentiation, and it’s deeply covered here.
def loss(model, data):
    total_log_prob = -tf.reduce_mean(model.log_prob(data))
    return total_log_prob

def grad(model, inputs):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)

During the training process, the grad function returns the loss and the gradients. These are then used to update the parameters with the method apply_gradients of the optimizer. The choice for this example is the Adam optimizer. We show the loss and the parameter values are updated every 10 iterations. A custom function has been developed where you specify the distribution that your data follows and the list of parameters that define your distribution.

def mle_run(data, model, parameters, optimizer, steps=1000, verbose=False):
    update_list = []
    prob_values = []
    value_range = tf.linspace(tf.reduce_min(data),tf.reduce_max(data), 100)
    for i in range(steps):
        loss_value, grads = grad(model, data)
        optimizer.apply_gradients(zip(grads, parameters))
        if i % 10 == 0:
                *[p.numpy()[0] for p in parameters]))
            param_str = ", ".join([':')[0]+": "+str(p.numpy()) for p in parameters])
            iter_info = f"Step: {optimizer.iterations.numpy()}, initial loss: {loss_value.numpy()}, {param_str}"
            if verbose:
    return update_list, prob_values, value_range

Fitting a Normal distribution

In a few hundred of steps the model ends up fitting the data with decent accuracy:

We compare the final estimate of the parameters with the true sample mean and sample standard deviation of the data. The truth is that the maximum likelihood estimate for data normally distributed can be solved analytically and it should give the same result.

Fitting a Poisson distribution

Data coming from a Poisson distribution only contains discrete values, and it has only one parameter called rate. More on the Poisson distribution here.

To confirm how close it is the estimated rate from its true value, it is shown below the learning curves for this example as well.

Give this method a try with your own dataset or think what you should change to use a different probability distribution. The code of this tutorial has been uploaded to my Github.

2 responses to “Maximum likelihood estimation with Tensorflow”

  1. […] In the past, a method for parameter estimation was covered, named maximum likelihood estimation (MLE). If you want to know more about it, visit the following blog post. […]


  2. […] post of maximum likelihood estimation (MLE), was published in 2019 and still, there is a considerable […]


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: