Variance is a measure of how far a set of numbers is spread out from the mean.

**Range** is the difference between the largest and smallest data values in a set of values for a variable.

For example:

| Subject | Score out of 100 |
| --- | --- |
| Maths | 94 |
| Physics | 85 |
| Chemistry | 66 |
| French | 55 |
| Computers | 89 |
| English | 64 |

In the above example, the range can be computed as: 94 – 55 = 39

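Another example is the time taken to reach the office:

| Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Time (mins) | 45 | 43 | 39 | 39 | 33 | 55 | 35 | 33 | 40 | 198 |

Here the range is 198 – 33 = 165.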

The range is the largest possible difference between any two values in a set of data values for a variable. In the second example the range is very high, because of the single value of 198 minutes, whereas in the first example it is somewhat high.
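The range calculation can be sketched in a few lines of Python. The helper name `value_range` is only illustrative; the data is copied from the two examples above.

```python
# Data copied from the two examples above.
marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


range_marks = value_range(marks)
range_time = value_range(timemins)

print("range of marks =", range_marks)
print("range of timemins =", range_time)
```

Because the range depends only on the two most extreme values, a single unusual day (198 minutes) is enough to inflate it.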

**Variance** is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.

| Subject | Score out of 100 | Score – Mean (Difference) | Square of Difference |
| --- | --- | --- | --- |
| Maths | 94 | 18.5 | 342.25 |
| Physics | 85 | 9.5 | 90.25 |
| Chemistry | 66 | -9.5 | 90.25 |
| French | 55 | -20.5 | 420.25 |
| Computers | 89 | 13.5 | 182.25 |
| English | 64 | -11.5 | 132.25 |
| **Mean / Sum** | 75.5 | 0 | 1257.5 |

In the above example, the mean (average) mark is 75.5. The sum of the differences between each score and the mean is always zero, so it cannot be used to compare one data set with another. Instead, the variance is calculated as the sum of the squared differences divided by the number of values. Hence the variance is 1257.5 / 6 = 209.58.

This way, we can compare the variance of scores for two different students.
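As a rough sketch, the columns of the scores table can be reproduced step by step in plain Python. The variable names are illustrative and not part of the original example.

```python
marks = [94, 85, 66, 55, 89, 64]

# Mean of the scores.
mean_marks = sum(marks) / len(marks)

# "Score - Mean" column: these differences always add up to zero.
differences = [m - mean_marks for m in marks]

# "Square of Difference" column.
squares = [d ** 2 for d in differences]

# Variance over the full set: sum of squares divided by the number of values.
population_variance = sum(squares) / len(marks)

print("mean =", mean_marks)
print("sum of differences =", sum(differences))
print("sum of squares =", sum(squares))
print("variance =", round(population_variance, 2))
```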

| Day | Time to office (mins) | Time – Mean (Difference) | Square of Difference |
| --- | --- | --- | --- |
| 1 | 45 | -11 | 121 |
| 2 | 43 | -13 | 169 |
| 3 | 39 | -17 | 289 |
| 4 | 39 | -17 | 289 |
| 5 | 33 | -23 | 529 |
| 6 | 55 | -1 | 1 |
| 7 | 35 | -21 | 441 |
| 8 | 33 | -23 | 529 |
| 9 | 40 | -16 | 256 |
| 10 | 198 | 142 | 20164 |
| **Mean / Sum** | 56 | 0 | 22788 |

In the second example, dividing by the number of values gives a variance of 22788 / 10 = 2278.8. However, this data is a sample of 10 days and may not represent the complete set of data. For a sample, the variance is calculated as the sum of the squared differences divided by (number of values – 1).

Variance of sample data = 22788 / 9 = 2532.
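The two divisors can be compared directly with NumPy. This is a minimal sketch that relies on the `ddof` ("delta degrees of freedom") argument of `np.var`: the divisor is the number of values minus `ddof`, so `ddof=0` divides by 10 and `ddof=1` divides by 9.

```python
import numpy as np

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# Divide by the number of values (treat the data as the complete set).
population_variance = np.var(timemins, ddof=0)

# Divide by (number of values - 1) (treat the data as a sample).
sample_variance = np.var(timemins, ddof=1)

print("population variance =", round(float(population_variance), 2))
print("sample variance =", round(float(sample_variance), 2))
```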

**Standard Deviation** is the positive square root of the variance.

Hence, in the scores example, the standard deviation of the scores = sqrt(209.58) = 14.48 (approximate). In the second example, the standard deviation of the time to reach the office = sqrt(2532) = 50.32 (approximate). Note that the first figure is based on the variance of the full set and the second on the sample variance. This means that in most cases the time taken to reach the office lies between the mean +/- one standard deviation, i.e., 56 +/- 50.32 minutes. Here 198 is an extreme value. If we ignore 198, the mean becomes 40.22 and the standard deviation becomes 6.92. Hence in most cases the time taken to reach the office lies between 40.22 +/- 6.92 minutes, which is the case here.
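The effect of the extreme value can be checked with a short sketch. It assumes the sample formula (`ddof=1`) for both calculations; the helper `summarize` is a hypothetical name used only for this illustration.

```python
import numpy as np

timemins = np.array([45, 43, 39, 39, 33, 55, 35, 33, 40, 198])


def summarize(values):
    """Return mean, sample standard deviation, and the share of
    values that lie within one standard deviation of the mean."""
    mean = values.mean()
    sd = values.std(ddof=1)
    within = np.abs(values - mean) <= sd
    return mean, sd, within.mean()


# All ten days, including the 198-minute day.
mean_all, sd_all, share_all = summarize(timemins)

# The same data with the extreme value removed.
trimmed = timemins[timemins != 198]
mean_trim, sd_trim, share_trim = summarize(trimmed)

print("all days:    mean=%.2f sd=%.2f share within 1 sd=%.2f" % (mean_all, sd_all, share_all))
print("without 198: mean=%.2f sd=%.2f share within 1 sd=%.2f" % (mean_trim, sd_trim, share_trim))
```

With the outlier included, the interval is so wide that it says little about a typical day; without it, the interval is narrow and still covers most of the remaining observations.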

A **Z-Score** is the difference between a data value for a variable and the mean of the variable, divided by the standard deviation.

| Day | Time to office (mins) | Time – Mean (Difference) | Square of Difference | Z-Score |
| --- | --- | --- | --- | --- |
| 1 | 45 | -11 | 121 | -0.22 |
| 2 | 43 | -13 | 169 | -0.26 |
| 3 | 39 | -17 | 289 | -0.34 |
| 4 | 39 | -17 | 289 | -0.34 |
| 5 | 33 | -23 | 529 | -0.46 |
| 6 | 55 | -1 | 1 | -0.02 |
| 7 | 35 | -21 | 441 | -0.42 |
| 8 | 33 | -23 | 529 | -0.46 |
| 9 | 40 | -16 | 256 | -0.32 |
| 10 | 198 | 142 | 20164 | 2.82 |

Mean = 56, sum of differences = 0, standard deviation = 50.32.

Here, you can notice that the absolute value of the Z-Score is low for values close to the mean and high for extreme values.

Generally, if the Z-Score is either less than -3 or greater than +3, the value of the variable is considered extreme. In the above example, 198 minutes to the office (Z-Score 2.82) is close to being an extreme value.
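The Z-Score column can be reproduced with the sketch below. It assumes the sample standard deviation (`ddof=1`, about 50.32), which is what the table uses; the cut-off of 3 is the rule of thumb quoted above, not a fixed standard.

```python
import numpy as np

timemins = np.array([45, 43, 39, 39, 33, 55, 35, 33, 40, 198])

mean_time = timemins.mean()
sd_time = timemins.std(ddof=1)  # sample standard deviation, as in the table

# Z-Score: (value - mean) / standard deviation.
zscores = (timemins - mean_time) / sd_time

# Days whose Z-Score is below -3 or above +3.
extreme_days = [day for day, z in enumerate(zscores, start=1) if abs(z) > 3]

for day, (minutes, z) in enumerate(zip(timemins, zscores), start=1):
    print("day %2d: %3d mins, z = %5.2f" % (day, minutes, z))
print("days beyond +/-3:", extreme_days)
```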

Below is the Python code to calculate the variance, standard deviation and Z-Scores. Note that `np.var`, `np.std` and `st.zscore` divide by the entire count by default and not by (count – 1), so the results for the time data differ from the sample values worked out above.

```python
import numpy as np
from pandas import DataFrame as df
from scipy import stats as st

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94, 85, 66, 55, 89, 64]
marksdf = df({'marks': marks})

daynums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]
timeminsdf = df({'timemins': timemins})

# Scores: variance, standard deviation and Z-Scores
variance_marks = np.var(marks)
print("variance in marks= " + str(variance_marks))
sd_marks = np.std(marks)
print("standard deviation in marks= " + str(sd_marks))
zscore_marks = st.zscore(marksdf)
print("Zscore of marks= " + str(zscore_marks))

# Time to office: mean, variance, standard deviation and Z-Scores
mean_time = np.mean(timemins)
print("mean of timemins= " + str(mean_time))
variance_time = np.var(timemins)
print("variance in timemins= " + str(variance_time))
sd_time = np.std(timemins)
print("standard deviation in timemins= " + str(sd_time))
zscore_time = st.zscore(timeminsdf)
print("Zscore of timemins= " + str(zscore_time))
```

**Skewness** is a measure of the asymmetry of the distribution of a variable about its mean. A distribution can be symmetrical, left-skewed (negative skew), or right-skewed (positive skew).

As a rule of thumb, when the mean of a data set equals the median value, the distribution is **symmetrical**. When the mean is less than the median value, it is **negative skewed**. When the mean is greater than the median value, the distribution is said to be **positive skewed**.

In our scores example, Mean of Score = (94+85+66+55+89+64)/6= 453/6= 75.5

The scores can be ordered in increasing value: 55, 64, 66, 85, 89, 94. Since there are 6 scores, there are two middle values here, i.e., 66 and 85. Hence, to calculate the median we take the mean of these two middle values. Median = Mean(66, 85) = 75.5. Here we have a symmetrical data distribution.

In our time taken to reach office example, the mean time taken to reach the office is (45+43+39+39+33+55+35+33+40+198)/10 = 560/10 = 56 minutes. Median(33, 33, 35, 39, **39, 40**, 43, 45, 55, 198) = Mean(39, 40) = 39.5 minutes. Since the mean is greater than the median, the distribution is positive skewed.
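The mean versus median comparison used above can be sketched as a small helper. The function name `skew_direction` is hypothetical, and the comparison is only the rule of thumb from this section, not a formal skewness statistic.

```python
import math
import statistics

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def skew_direction(values):
    """Classify a data set by comparing its mean with its median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if math.isclose(mean, median):
        return "symmetrical"
    return "positive skewed" if mean > median else "negative skewed"


skew_marks = skew_direction(marks)
skew_time = skew_direction(timemins)

print("marks:", statistics.mean(marks), statistics.median(marks), skew_marks)
print("timemins:", statistics.mean(timemins), statistics.median(timemins), skew_time)
```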

In order to understand the distances correctly, we can draw a **box and whisker plot**. Here we have five values plotted as lines: the smallest value, the first quartile Q1, the median, the third quartile Q3, and the largest value.

Below is simple Python code to plot a box and whisker plot:

```python
import matplotlib.pyplot as plt

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94, 85, 66, 55, 89, 64]
daynums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# Box and whisker plot of the scores
plt.boxplot(marks)
plt.title("Box and Whisker Plot of Scores", fontsize=15)
plt.show()

# Box and whisker plot of the time to reach office
plt.boxplot(timemins)
plt.title("Box and Whisker Plot of Time to reach office ", fontsize=15)
plt.show()
```
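The five values behind a box and whisker plot can also be computed directly. This is a minimal sketch that assumes NumPy's default (linear interpolation) quartile method; other quartile conventions give slightly different Q1 and Q3. The helper name `five_number_summary` is illustrative.

```python
import numpy as np

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a list of numbers."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return min(values), float(q1), float(median), float(q3), max(values)


summary_marks = five_number_summary(marks)
summary_time = five_number_summary(timemins)

print("marks    (min, Q1, median, Q3, max):", summary_marks)
print("timemins (min, Q1, median, Q3, max):", summary_time)
```

With its default settings, matplotlib draws the whiskers only as far as the most extreme points within 1.5 times the interquartile range of the box and plots anything beyond that as individual points, so the 198-minute day should appear as a separate point rather than as the end of the upper whisker.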