Linear regression is by far, one of the most used statistical models in the industry. To the extent that many times it’s used because… why the fuck not? But in the end:
All models are wrong, but some are usefulGeorge E. P. Box
This statement remarks there are neither good or bad models. It’s what you assume about the data that some models work better than others. Here are some assumptions about the data when you are using linear regression:
- The error term has a population mean of zero and constant variance
- Regressors are uncorrelated with the error term
- Samples of the error term are uncorrelated
- There’s no collinearity among regressors
and a few others.
Introducing Generalized Linear Models
In the same train of thought, Generalized Linear Models (GLM) are models like linear regression but they will add more assumptions, such as:
- the target or output comes from the same probability distribution
- the parameters of this distribution are linear functions of your inputs
In practical terms, a linear regression model is defined as:
A GLM for the same dependece between inputs and outputs would be:
And now, it’s when it goes wild. Let’s say you are building a model to predict counts such as the number of visitors in a mall. It’s a discrete positive quantity, right? In this case we can assume that your data has been obtained from a Poisson distribution:
I know you are asking, why are you regressing the logarithm of λ against inputs? For certain distributions, their parameters can only take a specific range of values. The parameter of the Poisson distribution only has positive values. Applying the logarithm to the linear model increases the search space of the coefficients. In the end, when you replace λ as a function of the regressors, you get:
Which is strictly positive.
In the GLM domain, the function that relates the inputs with the center of the distribution of the output is called link function. Each type of distribution allows different link functions. The major machine learning libraries and frameworks have built-in function to build GLMs.
Building a GLM with TensorFlow Probability
To show what TensorFlow Probability can do to build GLMs we’ll use heights and weights observations from the !Kung people. The dataset can be downloaded from here. Descriptive plots will give us an idea of the relationship between inputs and outputs.
We will build a GLM that predicts height as a function of weight and age. In general, when people reach adult life, their height gets to its maximum value and may decrease a little for the elderly. In spite of colouring the measurements by gender, the shape of the distributions barely changes. We will do some feature engineering before the modeling step.
- log-transform the output so that its distribution has a Gaussian shape
- split data into two buckets in regards to age the measurement was taken
- remove outliers if possible
The function to fit a GLM to your data is the following:
model_coefficients, linear_response, is_converged, num_iter = tfp.glm.fit(model_matrix, response, model)
The function needs three arguments:
- model_matrix, are all the inputs of your model as a TensorFlow Tensor. In case you need an intercept, add a column of 1s.
- response, is the output of your model as a TensorFlow Tensor.
- model, is the model you will use to link inputs with output. These are all located in the tfp.glm module as well. For the normal distribution, you would set this argument with the function tfp.glm.Normal().
In case you are asking where you can specify the link function, each model has already one set by default. Later I will show you how to change that. The outputs of this function are self-explanatory. If you want to know about other optional arguments that you can specify, visit the documentation.
Once the model is fitted we compare the predictions with the true values:
The model is quite simple, and predictions are not bad at all. Adding polynomial features would increase model’s accuracy but GLM are not used with that purpose in mind. We are looking for interpretability of the coefficients.
TensorFlow Probability provides a set of distributions to model the output variable, but also we can create custom distribution to suit our needs. We will replace tfp.glm.Normal() by a T-Student distribution, where the tails of the distribution are thicker. This is how we do it with TensorFlow Probability:
custom_glm = tfp.glm.CustomExponentialFamily( lambda mean: tfp.distributions.StudentT(df, mean, scale), tf.identity)
This function has two arguments, in order:
- distribution_fn, is the distribution of the output where the center of the distribution is passed as input with a lambda expression
- linear_model_to_fn, argument that specifies the link function between inputs and output, in the example we are using the identity function, so there’s no transformation at all
Then, the custom distribution is fed as argument to the GLM model as shown below:
tfp.glm.fit(model_matrix=regressor_matrix, response=response, model=custom_glm)
If you have had a look at the documentation, you’ll have seen that tfp.glm.fit() has many optional parameters. Here we will have a look at a few of them.
- l2_regularizer, adds an L2 regularization component to your loss function
- l2_regularization_penalty_factor, provides finer control on the regularization of each parameter
- dispersion, controls the dispersion of the output (at observation level) when fitting model
Below these lines, you’ll see these parameters in action, along with the custom distribution explained previously:
def fit_model_t(scale=1.0): custom_glm = tfp.glm.CustomExponentialFamily(lambda mean: tfp.distributions.StudentT(len(response)+1, mean, scale), tf.identity) model_coefficients, linear_response, is_converged, num_iter = tfp.glm.fit( model_matrix=regressor_matrix, response=response, model=custom_glm, l2_regularizer=tf.constant(1.0), l2_regularization_penalty_factor=tf.constant([1., 1., 1., 1., 1e-10]), dispersion=tf.constant([0.1]) ) log_likelihood = custom_glm.log_prob(response, linear_response) return (model_coefficients, linear_response, is_converged, num_iter, log_likelihood)
And here we show the comparison of fitting a GLM to our data set for different levels of dispersion.
GLM are really flexible models that allow you to fit any type of distribution. In addition, the link function let’s you specify how the inputs and output are related. In case you’re wondering if linear regression is the best algorithm to use, GLMs can help you even if the output is not normally distributed. So give them a try!
You can find the code for this post on my GitHub.