Me: Hey, I just took a pic in the library! Look how old and dusty these books are!
You: Are these the oldest ones you found in there?
Me: No, I didn’t inspect all of them, just picked those five.
You: Look! In the background you can see plenty of them. How many books do you think there were in the library?
Me: Uh, I don’t know… Maybe between 3000 and 5000?
Ok, this is not a scene from a movie. My intention is to introduce random variables so that everybody understands the concept. It will also let me skip this explanation when I use them in subsequent blog posts.
To define what a random variable is, it helps to start with its opposite: a deterministic variable. The key difference between the two is the degree of uncertainty we have about the value associated with the variable.
- A deterministic variable is a variable whose value we are 100% sure about, because it relies on data or observations.
- A random variable, on the other hand, is characterized by the absence of (complete) data, but we can provide a rough estimate of its true value based on intuition or past experience.
In short, a deterministic variable is defined by a single value (like a number, text, date, etc.) whereas a random variable is defined by a probability distribution over a range of values. Broadly speaking, this distribution describes which values the variable can take, with a probability associated with each value.
Going back to our short dialogue, I said there were 5 books, and you can confirm this from the photo at the head of this blog post. That is a deterministic variable. Such a variable can also be represented with a probability distribution (one that puts all of its probability on a single value), but in practice this is rarely done.
A few lines later I also estimated how many books there were in the library, but obviously I didn’t count all of them, shelf by shelf. Maybe other factors led to that estimate, like the size of the library or the thickness of the books. In any case, the distribution of a random variable like this can take many forms. I will show you two:
This is a uniform distribution. It can be interpreted as “the library is as likely to have 3000 books as 5000 or any other quantity in between”.
This is a normal distribution, where we highlight that “the number of books is between 3000 and 5000, but we are more certain that this quantity is somewhere in the middle”.
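To make the two shapes concrete, here is a minimal sketch that uses only Python’s standard library rather than TensorFlow Probability. The mean of 4000 and standard deviation of 500 for the normal curve are assumptions chosen for illustration, and `uniform_pdf` is a hypothetical helper name.

```python
from statistics import NormalDist

LOW, HIGH = 3000, 5000  # range guessed in the dialogue


def uniform_pdf(x, low=LOW, high=HIGH):
    """Density of a continuous uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


# Assumed parameters: centered in the range, standard deviation of 500 books.
normal = NormalDist(mu=4000, sigma=500)

# Compare both densities at the edges and at the middle of the range.
for books in (3000, 4000, 5000):
    print(books, uniform_pdf(books), round(normal.pdf(books), 6))
```

The uniform density is the same at every point of the range, while the normal density is highest at 4000 and falls off toward the edges. Note that a normal distribution with these parameters places roughly 95% of its probability between 3000 and 5000, so it does not strictly confine the estimate to that range.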
Both examples obey the basic rules of probability, which always apply:
- Probabilities are bounded between zero and one: zero for values we are sure the variable can’t take, and one for a value we are sure the variable always takes.
- The probabilities of all the possible values of a random variable add up to 1, or 100% (both expressions are equivalent). Here, asking “what is the probability that the library has any number of books?” gives a value of 1.
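These two rules can be checked on a small example. The sketch below assumes a discrete version of the uniform guess, with one equally likely outcome per whole number of books from 3000 to 5000; exact fractions are used so that the sum is not affected by rounding. For a continuous distribution the analogous statement is that the area under the curve equals one.

```python
from fractions import Fraction

# One equally likely outcome for every whole number of books from 3000 to 5000.
counts = range(3000, 5001)
pmf = {n: Fraction(1, len(counts)) for n in counts}

# Rule 1: every probability lies between zero and one.
all_bounded = all(0 <= p <= 1 for p in pmf.values())

# Rule 2: the probabilities of all possible values add up to exactly one.
total = sum(pmf.values())

# A value outside the range is impossible, so its probability is zero.
p_impossible = pmf.get(10_000, Fraction(0))

print(all_bounded, total, p_impossible)
```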
Random variables in TensorFlow Probability
TensorFlow is one of the most popular frameworks for numerical computation and machine learning. What sets TensorFlow Probability apart is its support for probabilistic programming, meaning we can use random variables in our code in addition to deterministic ones. Here is how both kinds can be declared:
    import tensorflow as tf
    import tensorflow_probability as tfp

    tfd = tfp.distributions

    deterministic_var = tf.Variable(5)  # Books in the picture
    random_var = tfd.Normal(4000., 500.)  # Books in the library
It is as easy as that. We used a random variable with a normal distribution to represent the number of books in the library. Whenever you declare a variable like random_var, you pass the mean (first argument) and the standard deviation (second argument) as parameters.
Every random variable you define with TensorFlow Probability has several methods that allow you to perform different calculations on it. These are some examples:
    # Mean of the random variable distribution
    random_var.mean()

    # Mode of the random variable distribution
    random_var.mode()

    # Probability density at 3300 books
    random_var.prob(3300.)

    # Cumulative distribution function: probability of the library
    # having 3300 books or fewer
    random_var.cdf(3300.)

    # Number of books at a given percentile (here the 34th)
    random_var.quantile(0.34)

    # Generate 10 samples from the distribution of the number of books
    random_var.sample(10)
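To see what the density and the cumulative probability represent for a normal distribution, here is a hedged cross-check written with the standard library only. It evaluates the textbook formulas for the normal density and cumulative distribution function with the same mean and standard deviation as random_var; `normal_pdf` and `normal_cdf` are hypothetical helper names, not part of TensorFlow Probability.

```python
import math

MU, SIGMA = 4000.0, 500.0  # same mean and standard deviation as random_var


def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Normal probability density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=MU, sigma=SIGMA):
    """Probability that the variable is less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


density_3300 = normal_pdf(3300.0)
prob_at_most_3300 = normal_cdf(3300.0)
print(density_3300, prob_at_most_3300)
```

The density is a height of the curve, not a probability in itself, which is why it comes out as a very small number here. The cumulative value is a genuine probability: about 8% that the library has 3300 books or fewer under these assumed parameters.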
If you have read up to this line, you may think… “Shouldn’t the distribution of books in the library be discrete?” And yes, you are right! Strictly speaking, it should have been defined as a discrete distribution (for example, a categorical distribution with about two thousand categories, one per possible count), but for the sake of clarity and simplicity we went for the uniform and normal distributions. If you’ve run this example with TensorFlow 2.0 like me, all these calculations will be returned as tensors. You can add .numpy() to get the results as NumPy arrays.
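For readers curious about the discrete version, one possible approach (an assumption for illustration, not the method used in this post) is to discretize the normal curve: give each whole count the area under the curve within half a book on either side, then renormalize over the 3000 to 5000 range. The sketch uses only the standard library.

```python
from statistics import NormalDist

continuous = NormalDist(mu=4000, sigma=500)
counts = range(3000, 5001)

# Mass of each whole count n: area under the curve on [n - 0.5, n + 0.5].
raw = [continuous.cdf(n + 0.5) - continuous.cdf(n - 0.5) for n in counts]

# Renormalize so the masses inside the 3000-5000 range sum to one.
total = sum(raw)
pmf = {n: mass / total for n, mass in zip(counts, raw)}

most_likely = max(pmf, key=pmf.get)
print(len(pmf), most_likely)
```

The result is a table with one probability per possible count, which is exactly the kind of data a categorical distribution is built from.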
Random variables and distributions linked to them are some of the essential components for probabilistic modeling. We will explore applications with them in a future post.
We’ve barely scratched the surface of random variables. Depending on what we are trying to describe, some distributions will fit better than others. Here is a list with (mostly) all the different probability distributions.