Data Science – Statistics: Variance

Variance is a key statistical measure that describes how spread out the values in a dataset are around their mean. It is closely related to the standard deviation: the standard deviation is the square root of the variance, and squaring the standard deviation gives the variance.
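As a quick check of that relationship, here is a minimal sketch using Python's standard statistics module. The list of values is made up purely for illustration (it is not part of the data set used in this tutorial), and it is treated as a complete population.

```python
import math
import statistics

# Hypothetical values, chosen only to illustrate the relationship
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # population standard deviation

# The standard deviation is the square root of the variance
print(math.isclose(std_dev, math.sqrt(variance)))  # True

# Squaring the standard deviation gives the variance back
print(math.isclose(std_dev ** 2, variance))        # True
```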

We will first walk through an example with 10 observations to demonstrate how we can calculate the variance.

Example Data Set

Duration	Average_Pulse	Max_Pulse	Calorie_Burnage	Hours_Work	Hours_Sleep
30	80	120	240	10	7
30	85	120	250	10	7
45	90	130	260	8	7
45	95	130	270	8	7
45	100	140	280	0	7
60	105	140	290	7	8
60	110	145	300	7	8
60	115	145	310	8	8
75	120	150	320	0	8
75	125	150	330	8	8

Steps to Calculate Variance

Step 1: Find the Mean

To calculate the variance of Average_Pulse, first calculate the mean:

(80 + 85 + 90 + 95 + 100 + 105 + 110 + 115 + 120 + 125) / 10 = 102.5

The mean is 102.5.
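The same step can be sketched in plain Python. The variable name average_pulse is chosen here for illustration; the values are the Average_Pulse column of the example data set.

```python
# Average_Pulse values from the example data set
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# The mean is the sum of the values divided by the number of values
mean = sum(average_pulse) / len(average_pulse)
print(mean)  # 102.5
```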

Step 2: Find the Difference from the Mean

Now, find the difference from the mean for each value:

80 - 102.5 = -22.5

85 - 102.5 = -17.5

90 - 102.5 = -12.5

95 - 102.5 = -7.5

100 - 102.5 = -2.5

105 - 102.5 = 2.5

110 - 102.5 = 7.5

115 - 102.5 = 12.5

120 - 102.5 = 17.5

125 - 102.5 = 22.5
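A short sketch of this step in plain Python (variable names are illustrative). It also shows that the differences add up to zero, since the positive and negative values cancel each other out, which is why they are squared in Step 3 before being combined.

```python
# Average_Pulse values from the example data set
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
mean = sum(average_pulse) / len(average_pulse)

# Difference from the mean for each value
differences = [value - mean for value in average_pulse]
print(differences)
# [-22.5, -17.5, -12.5, -7.5, -2.5, 2.5, 7.5, 12.5, 17.5, 22.5]

# Positive and negative differences cancel each other out
print(sum(differences))  # 0.0
```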

Step 3: Square the Differences

Next, square each difference:

(-22.5)^2 = 506.25

(-17.5)^2 = 306.25

(-12.5)^2 = 156.25

(-7.5)^2 = 56.25

(-2.5)^2 = 6.25

(2.5)^2 = 6.25

(7.5)^2 = 56.25

(12.5)^2 = 156.25

(17.5)^2 = 306.25

(22.5)^2 = 506.25
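In Python, the squaring step might look like the sketch below. It starts from the differences found in Step 2 and also prints their total, which is the numerator used in Step 4.

```python
# Differences from the mean, as calculated in Step 2
differences = [-22.5, -17.5, -12.5, -7.5, -2.5, 2.5, 7.5, 12.5, 17.5, 22.5]

# Squaring makes every difference non-negative
squared_differences = [d ** 2 for d in differences]
print(squared_differences)
# [506.25, 306.25, 156.25, 56.25, 6.25, 6.25, 56.25, 156.25, 306.25, 506.25]

# Sum of the squared differences
print(sum(squared_differences))  # 2062.5
```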

Step 4: Calculate the Variance

Finally, sum all the squared differences and divide by the number of observations (10) to get the variance. Dividing by the number of observations gives the population variance:

(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25

The variance is 206.25.
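The four steps can be combined into one small function. This is a minimal sketch, and population_variance is a name made up for this example. The result is compared with statistics.pvariance() from the Python standard library, which also divides by the number of observations. For contrast, statistics.variance() divides by one less than the number of observations (the sample variance), so it returns a slightly larger number.

```python
import statistics

def population_variance(values):
    """Mean of the squared differences from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

# Average_Pulse values from the example data set
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

print(population_variance(average_pulse))   # 206.25
print(statistics.pvariance(average_pulse))  # 206.25

# Sample variance: divides by n - 1 instead of n
print(round(statistics.variance(average_pulse), 2))  # 229.17
```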

Using Python to Calculate the Variance

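We can use the var() function from the NumPy library to calculate the variance in Python. By default, np.var() divides by the number of observations, like the manual calculation above:

import numpy as np

# health_data is assumed to be a pandas DataFrame holding the example data set above
# axis=0 calculates the variance of each column
var = np.var(health_data, axis=0)

# Print the result
print(var)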

The output shows the variance of each column in the example data set. Let’s now calculate the variance for the full data set:

import numpy as np

# full_health_data is assumed to be a pandas DataFrame holding the full health data set
# axis=0 calculates the variance of each column
var_full = np.var(full_health_data, axis=0)

# Print the result
print(var_full)
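The full data set is not reproduced in this tutorial, so the sketch below applies the same idea to the 10 observations of the example data set. It assumes the data is held in a plain 2D NumPy array; the names columns, data and column_variances are chosen for illustration. Passing axis=0 asks np.var() for one variance per column. Without it, np.var() on a 2D array returns a single number calculated over all the values.

```python
import numpy as np

columns = ["Duration", "Average_Pulse", "Max_Pulse",
           "Calorie_Burnage", "Hours_Work", "Hours_Sleep"]

# The 10 observations from the example data set, one row per observation
data = np.array([
    [30,  80, 120, 240, 10, 7],
    [30,  85, 120, 250, 10, 7],
    [45,  90, 130, 260,  8, 7],
    [45,  95, 130, 270,  8, 7],
    [45, 100, 140, 280,  0, 7],
    [60, 105, 140, 290,  7, 8],
    [60, 110, 145, 300,  7, 8],
    [60, 115, 145, 310,  8, 8],
    [75, 120, 150, 320,  0, 8],
    [75, 125, 150, 330,  8, 8],
])

# axis=0 calculates one (population) variance per column
column_variances = np.var(data, axis=0)

for name, value in zip(columns, column_variances):
    print(f"{name}: {value:.2f}")
# Duration: 236.25
# Average_Pulse: 206.25
# Max_Pulse: 116.00
# Calorie_Burnage: 825.00
# Hours_Work: 11.84
# Hours_Sleep: 0.25
```

The Average_Pulse result matches the 206.25 calculated by hand in Step 4.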

Conclusion

Variance is an essential statistical measure of the spread of data. It summarizes how far the data points lie from the mean, which helps us judge how consistent the values in a dataset are. Because variance is expressed in squared units, it is usually read together with the standard deviation, and understanding both gives a clearer picture of data variability in a data science project.
