13 Statistical Distributions

SLIDE DECKS

Some of the material presented in this chapter will be discussed in class. It is your responsibility to ensure you cover all the concepts presented both in class and in this textbook.

A statistical distribution, also known as a probability distribution, is a mathematical function that describes the likelihood of different outcomes or values occurring in a given data set or random experiment. In other words, it provides a way to model and represent the uncertainty or randomness inherent in many real-world phenomena.

Statistical distributions are used in various fields, including statistics, probability theory, and data analysis, to understand and analyze data, make predictions, and draw conclusions. Each type of distribution has its own set of characteristics and parameters that define its shape and behaviour.

Some common examples of statistical distributions include:

  1. Normal Distribution (Gaussian Distribution): It is characterized by a bell-shaped curve and is often used to describe naturally occurring phenomena like the heights of individuals in a population or errors in measurements.
  2. Uniform Distribution: In this distribution, all possible outcomes are equally likely. For example, the result of rolling a fair six-sided die follows a discrete uniform distribution.
  3. Binomial Distribution: It models the number of successful outcomes in a fixed number of independent Bernoulli trials, each with the same probability of success, such as the number of heads in a series of coin flips.
  4. Poisson Distribution: It describes the number of events that occur within a fixed interval of time or space when events happen at a constant average rate and independently of each other, such as the number of phone calls received at a call center in an hour.
  5. Exponential Distribution: It models the time between events in a Poisson process, such as the time between arrivals of customers at a store.
  6. Log-Normal Distribution: This distribution describes positive, right-skewed data whose logarithm follows a normal distribution.
  7. Gamma Distribution: It generalizes the exponential distribution and is often used to model the waiting time until a Poisson process reaches a certain number of events.
  8. Chi-Square Distribution: This arises in statistical hypothesis testing, where it serves as the reference distribution for the test statistics in chi-square tests.
  9. Cauchy Distribution: This distribution has heavy tails and does not have a well-defined mean or variance. It’s often used in physics and engineering.
  10. Beta Distribution: It is defined on the interval from 0 to 1 and is commonly used to model proportions and probabilities.

These are just a few examples; many other probability distributions are used in various applications, each with its own characteristics and use cases. Choosing an appropriate distribution for a given dataset or problem is crucial in data science and modelling.
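As a small illustration of how three of the distributions listed above behave, the following Python sketch draws random samples using only the standard library and compares each sample mean with its theoretical value. The parameter values (mean 10, standard deviation 2, rate 0.5), the sample size, and the function name `sample_means` are assumptions chosen for illustration; they do not come from this chapter or its R scripts.

```python
import random
import statistics


def sample_means(n=50_000, seed=1):
    """Draw n values from three distributions and return each sample mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    draws = {
        # Normal with mean 10 and standard deviation 2 (illustrative values)
        "normal": [rng.gauss(10, 2) for _ in range(n)],
        # Discrete uniform: a fair six-sided die
        "die": [rng.randint(1, 6) for _ in range(n)],
        # Exponential with rate 0.5, so the theoretical mean is 1 / 0.5 = 2
        "exponential": [rng.expovariate(0.5) for _ in range(n)],
    }
    return {name: statistics.fmean(values) for name, values in draws.items()}


theoretical = {"normal": 10.0, "die": 3.5, "exponential": 2.0}
for name, mean in sample_means().items():
    print(f"{name}: sample mean {mean:.3f}, theoretical mean {theoretical[name]}")
```

With a large sample, each sample mean should land close to the theoretical mean, although the exact values depend on the seed.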

Why Are Distributions Important?

Statistical distributions are important for several reasons:

  1. Data Description: Distributions provide a way to summarize and describe data. They help us understand the central tendency (mean, median), variability (standard deviation, variance), and shape of data. This information is crucial for data exploration and visualization.
  2. Modeling: Distributions serve as mathematical models for real-world phenomena. By choosing an appropriate distribution, statisticians and data scientists can approximate and simulate complex processes, making predictions and drawing conclusions from data.
  3. Inference: Statistical distributions play a fundamental role in statistical inference. They allow us to make probabilistic statements about population parameters based on sample data. For example, in hypothesis testing, we use known distributions to calculate p-values and confidence intervals and make decisions about hypotheses.
  4. Probability Calculations: Distributions help calculate probabilities associated with different events or outcomes. For instance, the binomial distribution can be used to calculate the probability of getting a certain number of successes in a series of independent trials.
  5. Risk Assessment: In fields like finance and insurance, distributions are used to model and assess risk. For example, the normal distribution is often used to model asset returns, and the tails of the distribution are used to estimate the risk of extreme events.
  6. Quality Control: In manufacturing and quality control processes, distributions are used to monitor and control the consistency and quality of products. Control charts and process capability analysis rely on statistical distributions.
  7. Machine Learning: In machine learning, understanding the underlying distributions of data is essential for model selection and parameter estimation. Many machine learning algorithms assume specific distributional properties of data.
  8. Random Sampling: Distributions help us understand the behaviour of random variables and random processes. They provide a framework for analyzing and predicting the outcomes of random events, such as the distribution of sample means in the central limit theorem.
  9. Data Generation and Simulation: Statistical distributions are used to generate synthetic data for testing algorithms and conducting simulations. This is valuable for research, experimentation, and training machine learning models.
  10. Decision Making: Distributions are used to make informed decisions under uncertainty. Whether it’s in business, healthcare, or public policy, understanding the probabilities associated with different outcomes is crucial for making sound decisions.
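The probability calculation described in item 4 can be sketched in a few lines of Python. The function name `binomial_pmf` and the coin-flip numbers are assumptions for illustration; the formula itself is the standard binomial probability, C(n, k) p^k (1 - p)^(n - k).

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 3 heads in 10 flips of a fair coin
print(binomial_pmf(3, 10, 0.5))  # 120 / 1024 = 0.1171875

# Probability of at most 3 heads: add the probabilities for 0, 1, 2 and 3 heads
print(sum(binomial_pmf(k, 10, 0.5) for k in range(4)))
```

Summing the function over every value of k from 0 to n gives 1, which is a quick check that the probabilities describe a complete distribution.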

Statistical distributions are a fundamental tool in data analysis and statistics. They help us describe data, make predictions, conduct statistical tests, and make informed decisions across various domains. Choosing the appropriate distribution and understanding its properties are essential skills for statisticians, data scientists, and researchers.
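The central limit theorem, mentioned under Random Sampling above, can be explored with a short simulation. The Python sketch below repeatedly draws samples from an exponential distribution, which is strongly skewed, and records each sample mean. The sample size of 30, the 5,000 repetitions, the rate of 1, and the function name `simulate_sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_sample_means(sample_size=30, repetitions=5_000, rate=1.0, seed=42):
    """Return the means of many samples drawn from an exponential distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(sample_size))
        for _ in range(repetitions)
    ]


means = simulate_sample_means()
# An exponential distribution with rate 1 has mean 1 and standard deviation 1,
# so the sample means should centre near 1 with spread near 1 / sqrt(30).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
print("1 / sqrt(30):        ", round(1 / 30**0.5, 3))
```

Even though individual exponential draws are far from Normal, a histogram of `means` would look roughly bell-shaped, which is what the central limit theorem leads us to expect.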

Exploring the Normal Distribution with R

Download the following R scripts and open them in RStudio.

Normal Distribution Plots.R will plot different Normal curves based on the mean and standard deviation you enter. It will also show you dashed and dotted lines indicating where 1 and 2 standard deviations fall to the left and right of the mean.

  • How does a larger standard deviation affect the shape of the Normal distribution?
  • How does changing the mean affect the location of the Normal distribution?
  • Are two Normal distributions with the same mean but different standard deviations more similar than two Normal distributions with different means but the same standard deviation?
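For readers who want to check the numbers behind the dashed and dotted lines, here is a minimal Python sketch that computes where 1 and 2 standard deviations fall and how much area lies between each pair of lines. It is not the contents of Normal Distribution Plots.R; the function name `sd_markers` and the example means and standard deviations are assumptions for illustration.

```python
from statistics import NormalDist


def sd_markers(mu, sigma):
    """Return, for k = 1 and k = 2, the x-positions k standard deviations
    either side of the mean and the area under the Normal curve between them."""
    dist = NormalDist(mu, sigma)
    result = {}
    for k in (1, 2):
        lower, upper = mu - k * sigma, mu + k * sigma
        result[k] = (lower, upper, dist.cdf(upper) - dist.cdf(lower))
    return result


# Three illustrative curves: try your own values of mu and sigma as well
for mu, sigma in [(0, 1), (0, 3), (5, 1)]:
    for k, (lower, upper, area) in sd_markers(mu, sigma).items():
        print(f"mu={mu}, sigma={sigma}: within {k} sd = [{lower}, {upper}], area = {area:.4f}")
```

Comparing the printed intervals and areas across the three curves is one way to start answering the questions above before looking at the plots.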

Normal Distribution Areas.R will plot a Normal curve with whatever mean and standard deviation you enter. It will also calculate the area under the curve between two points, a and b. This area represents the probability P(a < X < b) for a random variable X that follows a Normal distribution with mean mu and standard deviation sigma.

The script also standardizes the values you enter, converting each bound x to a z-score z = (x - mu) / sigma, so that you can see the same area and bounds represented on the Standard Normal Distribution.
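The idea that the same area appears on both curves can be checked numerically. The Python sketch below computes P(a < X < b) once on the original scale and once after converting the bounds to z-scores with z = (x - mu) / sigma. It is a sketch under assumed inputs, not the contents of Normal Distribution Areas.R; the function names and the values 85, 130, 100 and 15 are hypothetical.

```python
from statistics import NormalDist


def area_between(a, b, mu, sigma):
    """P(a < X < b) for X ~ Normal(mu, sigma), computed on the original scale."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)


def area_between_standardized(a, b, mu, sigma):
    """The same probability after converting a and b to z-scores."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    standard_normal = NormalDist(0, 1)
    return standard_normal.cdf(z_b) - standard_normal.cdf(z_a)


# Hypothetical inputs: mean 100, standard deviation 15, bounds 85 and 130
print(area_between(85, 130, 100, 15))
print(area_between_standardized(85, 130, 100, 15))  # z-scores are -1 and 2
```

The two printed values agree up to floating-point rounding, which is why a single Standard Normal table or function is enough for any Normal distribution.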

  • Suppose the heights of University of Guelph students are Normally distributed with a mean of 170 cm and a standard deviation of 25 cm. What is the probability that a randomly selected student is taller than 195 cm? Taller than 215 cm? What is the probability that a randomly selected student is shorter than 180 cm? Shorter than 140 cm?
  • Using the information provided in the first bullet, what is the probability that a randomly selected student has a height between 160 cm and 185 cm?
  • Using the information provided in the first bullet, and given that a randomly selected student is taller than 195 cm, what is the probability that their height is less than 205 cm?
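The last question is a conditional probability. For a < b, the general relationship is P(X < b | X > a) = P(a < X < b) / P(X > a). The Python sketch below applies it to the Standard Normal with hypothetical bounds, deliberately not the values from the exercise, so that you can adapt it to check your own answer. The function name `prob_less_given_greater` is an assumption for illustration.

```python
from statistics import NormalDist


def prob_less_given_greater(b, a, mu, sigma):
    """P(X < b | X > a) for X ~ Normal(mu, sigma), assuming a < b."""
    dist = NormalDist(mu, sigma)
    p_greater_than_a = 1 - dist.cdf(a)     # P(X > a)
    p_between = dist.cdf(b) - dist.cdf(a)  # P(a < X < b)
    return p_between / p_greater_than_a


# Hypothetical example on the Standard Normal: P(X < 1 | X > 0)
print(prob_less_given_greater(1, 0, 0, 1))
```

Conditioning on X > a restricts attention to the right-hand part of the curve, so the area between a and b is divided by the area to the right of a rather than by the whole area of 1.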

License

Community Engaged Data Science Copyright © 2023 by Daniel Gillis. All Rights Reserved.
