Introduction to random variables for non-data scientists

Photo by Morgan Harris on Unsplash

Me: Hey, I just took a pic in the library! Look how old and dusty these books are!

You: Are these the oldest ones you found in there?

Me: No, I didn’t inspect all of them, just picked those five.

You: Look! In the background you can see plenty of them. How many books do you think there were in the library?

Me: Uh, I don’t know…Maybe, between 3000 and 5000?

Cut!

Ok, this is not a scene from a movie. My intention is to introduce random variables so that everybody understands the concept. It will also let me skip this explanation when I use them in subsequent blog posts.

To define what a random variable is, it’s better to start by defining its opposite, a deterministic variable. The key difference between the two terms is the degree of uncertainty we have about the value associated with the variable.

  • A deterministic variable is a variable whose value we are 100% sure about because it relies on data or observations.
  • On the other hand, a random variable is characterized by the absence of (complete) data, but we can provide a rough estimate of its true value based on intuition or past experience.

In short, a deterministic variable is defined by a single value (like a number, text, date, etc.) whereas a random variable is defined by a probability distribution over a range of values. Broadly speaking, this distribution describes the possible values the variable can take, with a probability associated with each value.

Going back to our short dialogue, I said there were 5 books, and you can confirm this by looking at the photo at the head of this blog post. That would be a deterministic variable. These variables can also be represented with a probability distribution, but in practice this is rarely done.

For a deterministic variable such as the number of books in the image, the probability is concentrated in a single value, 5. The probability everywhere else is 0.

I also said a few lines later how many books there were in the library, but obviously I didn’t count all of them, shelf by shelf. Maybe other factors led to that estimate, like the size of the library or the thickness of the books. In any case, the distribution of a random variable like this can take many forms. I will show you two:

Uniform distribution for the number of books in the library

This is a uniform distribution. It can be interpreted as “the library is as likely to have 3000 books as 5000 or any other quantity in between”.
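As a rough illustration in plain Python (no TensorFlow needed), the uniform case can be written down directly. This is only a sketch: the names `LOW`, `HIGH`, `N_VALUES` and `uniform_prob` are made up for this example, and it assumes that only whole numbers of books from 3000 to 5000 are possible.

```python
from fractions import Fraction

# Assumption for this sketch: only whole numbers of books from 3000 to 5000
# (both included) are possible, and each one is equally likely.
LOW, HIGH = 3000, 5000
N_VALUES = HIGH - LOW + 1  # 2001 possible values

def uniform_prob(k):
    """Probability that the library has exactly k books."""
    return Fraction(1, N_VALUES) if LOW <= k <= HIGH else Fraction(0)

print(uniform_prob(3000) == uniform_prob(5000))  # True
print(uniform_prob(4000))                        # 1/2001
print(uniform_prob(6000))                        # 0
```

Every value inside the range gets the same small probability, and anything outside the range gets zero.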

Normal distribution for the number of books in the library

This is the normal distribution, where we highlight that “the number of books is most likely between 3000 and 5000, and we are more certain that this quantity is somewhere in the middle”.
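To put a number on that statement, here is a small sketch using Python’s standard library instead of TensorFlow. The mean of 4000 and the standard deviation of 500 are assumed values chosen so that 3000 and 5000 sit two standard deviations away from the middle; the variable names are only illustrative.

```python
from statistics import NormalDist

# Assumed parameters for this sketch: mean 4000 books, standard deviation 500.
books = NormalDist(mu=4000, sigma=500)

# Probability that the number of books falls between 3000 and 5000
# (two standard deviations either side of the mean).
p_between = books.cdf(5000) - books.cdf(3000)
print(round(p_between, 3))  # 0.954

# The density is higher in the middle of the range than at its edges.
print(books.pdf(4000) > books.pdf(3000))  # True
```

With these parameters, roughly 95% of the probability lies between 3000 and 5000, and the rest is spread thinly outside that range.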

These examples illustrate that the rules of probability always apply:

  • Probabilities are bounded between zero and one: zero for values we are sure the variable can’t take, and one for a value we are sure is the only one the variable can take.
  • The probabilities of all possible values of a random variable sum to 1, or 100% (both expressions are equivalent). For instance, asking “what is the probability that the library has any number of books?” would lead to a value of 1.
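Both rules are easy to check in code for distributions over a countable set of values. The sketch below is only an illustration: `is_valid_distribution` is a hypothetical helper, and each distribution is represented as a plain dictionary mapping a value to its probability.

```python
import math

def is_valid_distribution(pmf):
    """Check the two rules above for a dict mapping value -> probability."""
    bounded = all(0.0 <= p <= 1.0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0)
    return bounded and sums_to_one

# Books in the picture: all the probability on a single value.
deterministic = {5: 1.0}

# Books in the library: every count from 3000 to 5000 equally likely.
counts = range(3000, 5001)
uniform = {k: 1 / len(counts) for k in counts}

print(is_valid_distribution(deterministic))  # True
print(is_valid_distribution(uniform))        # True
print(is_valid_distribution({5: 0.7}))       # False, probabilities sum to 0.7
```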

Random variables in TensorFlow Probability

TensorFlow is one of the most popular frameworks for numerical computation and machine learning. What makes TensorFlow Probability unique is its support for probabilistic programming, meaning we can use random variables in our code in addition to deterministic ones. Below we show how they can be declared:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

deterministic_var = tf.Variable(5)  # Books in the picture: a single known value
random_var = tfd.Normal(4000., 500.)  # Books in the library: mean 4000, standard deviation 500

It is as easy as that. We used a random variable with a normal distribution to specify the number of books in the library. Whenever you declare a variable like random_var, you should pass the mean (first argument) and standard deviation (second argument) as parameters.

Every random variable you define with TensorFlow Probability has several methods that allow you to perform different calculations on it. These are some examples:

# Mean of the random variable distribution
random_var.mean()

# Mode of the random variable distribution
random_var.mode()

# Probability density at 3300 books
random_var.prob(3300.)

# Cumulative distribution function: probability of the library
# having 3300 books or fewer
random_var.cdf(3300.)

# The number of books in the library given a percentile
random_var.quantile(0.34)

# Generate 10 samples from the distribution of the number
# of books
random_var.sample(10)
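If you don’t have TensorFlow installed, the same quantities can be cross-checked with Python’s standard library. This is a swapped-in alternative, not part of TensorFlow Probability: `statistics.NormalDist` plays the role of `tfd.Normal`, and the name `books` is only illustrative.

```python
from statistics import NormalDist

# Same parameters as random_var: mean 4000, standard deviation 500
books = NormalDist(mu=4000, sigma=500)

print(books.mean)            # mean of the distribution
print(books.mode)            # mode of the distribution
print(books.pdf(3300))       # probability density at 3300 books
print(books.cdf(3300))       # probability of 3300 books or fewer
print(books.inv_cdf(0.34))   # number of books at the 34th percentile
print(books.samples(10, seed=0))  # 10 random draws (seed fixed for repeatability)
```

Here `inv_cdf` corresponds to `quantile`, and `samples` corresponds to `sample`.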

If you have read this far, you may be thinking: “Shouldn’t the distribution of books in the library be discrete?” And yes, you are right! Strictly speaking, it should have been defined as a discrete distribution, such as a categorical distribution with roughly two thousand categories (one per possible number of books), but for the sake of clarity and simplicity we went for the uniform and normal distributions.

If you’ve run this example with TensorFlow 2.0 like me, all these calculations will be returned as tensors. You can add .numpy() to get the results as NumPy arrays.
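For the curious, a discrete version of the library distribution can be sketched in plain Python by slicing the normal curve into one bin per whole number of books. This is just one possible approach under stated assumptions: the parameters (mean 4000, standard deviation 500), the 3000 to 5000 range, and the names `weights` and `discrete_books` are all choices made for this illustration.

```python
from statistics import NormalDist

# Assumed parameters, matching the normal distribution used earlier.
books = NormalDist(mu=4000, sigma=500)
LOW, HIGH = 3000, 5000

# Weight of each whole number k: the area under the normal curve
# between k - 0.5 and k + 0.5.
weights = {k: books.cdf(k + 0.5) - books.cdf(k - 0.5) for k in range(LOW, HIGH + 1)}

# Rescale so the probabilities of the 2001 categories add up to 1.
total = sum(weights.values())
discrete_books = {k: w / total for k, w in weights.items()}

print(len(discrete_books))                          # 2001
print(max(discrete_books, key=discrete_books.get))  # 4000
```

Each whole number of books now has its own probability, and the most likely single value is still the one in the middle.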

Random variables and distributions linked to them are some of the essential components for probabilistic modeling. We will explore applications with them in a future post.

Further reading

We’ve barely scratched the surface of random variables. Depending on what we are trying to describe, some distributions will fit better than others. Here is a list with (mostly) all the different probability distributions.

On the TensorFlow website there’s a great variety of tutorials, as well as instructions to install the library on your computer. The same goes for TensorFlow Probability.
