Variance is a key statistical measure that describes how spread out the values in a dataset are. It is closely related to the standard deviation: the standard deviation is the square root of the variance, so squaring the standard deviation gives the variance.
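As a quick check of that relationship, the sketch below uses Python's standard `statistics` module on a small set of made-up values (not taken from the health data) and confirms that squaring the standard deviation gives back the variance. The variable names are illustrative.

```python
import math
import statistics

# Illustrative values only; they are not part of the health dataset
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(values)  # population variance (divides by n)
std_dev = statistics.pstdev(values)      # population standard deviation

# Squaring the standard deviation recovers the variance (up to rounding)
print(math.isclose(std_dev ** 2, variance))  # True
```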
We will first walk through an example with 10 observations to demonstrate how we can calculate the variance.
| Duration | Average_Pulse | Max_Pulse | Calorie_Burnage | Hours_Work | Hours_Sleep |
|---|---|---|---|---|---|
| 30 | 80 | 120 | 240 | 10 | 7 |
| 30 | 85 | 120 | 250 | 10 | 7 |
| 45 | 90 | 130 | 260 | 8 | 7 |
| 45 | 95 | 130 | 270 | 8 | 7 |
| 45 | 100 | 140 | 280 | 0 | 7 |
| 60 | 105 | 140 | 290 | 7 | 8 |
| 60 | 110 | 145 | 300 | 7 | 8 |
| 60 | 115 | 145 | 310 | 8 | 8 |
| 75 | 120 | 150 | 320 | 0 | 8 |
| 75 | 125 | 150 | 330 | 8 | 8 |
To calculate the variance of Average_Pulse, first calculate the mean:
(80 + 85 + 90 + 95 + 100 + 105 + 110 + 115 + 120 + 125) / 10 = 102.5
The mean is 102.5.
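The same step can be sketched in plain Python. The list name `average_pulse` is an illustrative choice; the values are the Average_Pulse column from the table above.

```python
# Average_Pulse column from the 10-observation table
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Mean = sum of the values divided by the number of values
mean_pulse = sum(average_pulse) / len(average_pulse)

print(mean_pulse)  # 102.5
```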
Now, find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
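A minimal Python sketch of this step follows (variable names are illustrative). It also shows why the differences are squared in the next step: on their own, the positive and negative differences cancel out and sum to zero.

```python
# Average_Pulse column from the 10-observation table
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
mean_pulse = sum(average_pulse) / len(average_pulse)

# Difference from the mean for each value
deviations = [x - mean_pulse for x in average_pulse]

print(deviations)
# [-22.5, -17.5, -12.5, -7.5, -2.5, 2.5, 7.5, 12.5, 17.5, 22.5]

# The raw differences cancel each other out
print(sum(deviations))  # 0.0
```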
Next, square each difference:
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
(2.5)^2 = 6.25
(7.5)^2 = 56.25
(12.5)^2 = 156.25
(17.5)^2 = 306.25
(22.5)^2 = 506.25
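In Python, the squaring step could look like the sketch below (again with illustrative variable names). Squaring makes every term non-negative, so the differences no longer cancel.

```python
# Average_Pulse column from the 10-observation table
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
mean_pulse = sum(average_pulse) / len(average_pulse)

# Square each difference from the mean
squared_deviations = [(x - mean_pulse) ** 2 for x in average_pulse]

print(squared_deviations)
# [506.25, 306.25, 156.25, 56.25, 6.25, 6.25, 56.25, 156.25, 306.25, 506.25]
```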
Finally, sum the squared differences and divide by the number of observations. Dividing by n (here 10) gives the population variance:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
The variance is 206.25.
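Putting the steps together, a small helper can reproduce the hand calculation. The function name `population_variance` is hypothetical and written only for this illustration; the standard library's `statistics.pvariance` is used as a cross-check. For comparison, `statistics.variance` divides by n - 1 instead of n (the sample variance), so it returns a slightly larger number.

```python
import statistics

def population_variance(values):
    """Mean of the squared differences from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

# Average_Pulse column from the 10-observation table
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

manual = population_variance(average_pulse)
library = statistics.pvariance(average_pulse)  # same definition, divides by n
sample = statistics.variance(average_pulse)    # divides by n - 1 instead

print(manual)   # 206.25
print(library)  # 206.25
print(sample)   # larger than 206.25, because the divisor is 9 rather than 10
```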
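We can use the var() function from the NumPy library to calculate the variance in Python. The snippet assumes that health_data already holds the 10 observations above, for example loaded into a pandas DataFrame: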
import numpy as np
# Calculate the variance of the health data
var = np.var(health_data)
# Print the result
print(var)
The output shows the variance of the health data. Next, calculate the variance for each column of the full dataset (full_health_data, assumed to be already loaded):
import numpy as np
# Calculate the variance for each column in the full health data set
var_full = np.var(full_health_data)
# Print the result
print(var_full)
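For a version that runs without any external file, the sketch below rebuilds the 10-observation table as a plain NumPy array; this is an assumption made for illustration, since the snippets above do not show how the data is loaded. On a plain array, `np.var` needs `axis=0` to return one variance per column. Without it, the array is flattened and a single number comes back.

```python
import numpy as np

columns = ["Duration", "Average_Pulse", "Max_Pulse",
           "Calorie_Burnage", "Hours_Work", "Hours_Sleep"]

# The 10 observations from the table, one row per observation
data = np.array([
    [30,  80, 120, 240, 10, 7],
    [30,  85, 120, 250, 10, 7],
    [45,  90, 130, 260,  8, 7],
    [45,  95, 130, 270,  8, 7],
    [45, 100, 140, 280,  0, 7],
    [60, 105, 140, 290,  7, 8],
    [60, 110, 145, 300,  7, 8],
    [60, 115, 145, 310,  8, 8],
    [75, 120, 150, 320,  0, 8],
    [75, 125, 150, 330,  8, 8],
])

# axis=0 gives one variance per column; the default ddof=0 divides by n
column_variances = np.var(data, axis=0)

for name, value in zip(columns, column_variances):
    print(f"{name}: {value:.2f}")

# Without axis, all 60 values are pooled into a single variance
overall = np.var(data)
```

The Average_Pulse entry matches the hand-calculated 206.25. Note that `np.var` divides by n by default (`ddof=0`), whereas the pandas `DataFrame.var()` method defaults to `ddof=1`, so the two can give different results on the same data unless `ddof` is set explicitly.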
Variance is an essential measure of how spread out a dataset is. It tells us how far the data points typically lie from the mean, in squared units, which helps us judge how consistent the data are. Understanding both variance and standard deviation is valuable in any data science project, because together they describe the variability of the data.
