Analyzing data in Python – Measures of Variation

Variance is a measure of how far a set of numbers is spread out from its mean.

Range is the difference between the largest and smallest data values in a
set of values for a variable

For example:

Subject      Score (out of 100)
Maths        94
Physics      85
Chemistry    66
French       55
Computers    89
English      64

In the above example, the range can be computed as: 94 – 55 = 39

Another example is the time taken to reach the office:

Day:           1     2     3     4     5     6     7     8     9     10
Time (mins):   45    43    39    39    33    55    35    33    40    198

Here the range is 198 - 33 = 165.

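The range is the largest possible difference between any two values in a set of data values for a variable. In the second example the range is very high (165 minutes), whereas in the first example it is more moderate (39 marks). Because the range depends only on the two extreme values, a single unusual value such as 198 can inflate it.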
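The two range calculations above can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name value_range is made up for illustration and is not part of any library.

```python
# Data from the two examples above.
marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


marks_range = value_range(marks)
time_range = value_range(timemins)

print("range of marks =", marks_range)
print("range of timemins =", time_range)
```

The printed values should match the hand calculations: 39 for the scores and 165 for the time to office.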

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.

Subject      Score (out of 100)   Score - Mean (Difference)   Square of Difference
Maths        94                   18.5                        342.25
Physics      85                   9.5                         90.25
Chemistry    66                   -9.5                        90.25
French       55                   -20.5                       420.25
Computers    89                   13.5                        182.25
English      64                   -11.5                       132.25
             Mean = 75.5          Sum = 0                     Sum = 1257.5

In the above example, the mean (average) mark is 75.5, and the differences between each score and the mean always sum to zero. Because this sum is always zero, it cannot be used to compare the spread of one data set with another. Instead, the variance is calculated as the sum of the squared differences divided by the number of values. Hence the variance is 1257.5 / 6 = 209.58.

This way, we can compare the variance of scores for two different students.
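As a rough check of the table for the scores, the same steps can be written out by hand in Python. The variable names here are my own choice for illustration.

```python
marks = [94, 85, 66, 55, 89, 64]

# Step 1: the mean of the scores.
mean_marks = sum(marks) / len(marks)

# Step 2: the difference of each score from the mean, and its square.
deviations = [m - mean_marks for m in marks]
squared = [d ** 2 for d in deviations]

# The differences cancel each other out, so their sum says nothing about spread.
sum_dev = sum(deviations)

# Step 3: variance = sum of squared differences / number of values.
variance_marks = sum(squared) / len(marks)

print("mean =", mean_marks)
print("sum of differences =", sum_dev)
print("sum of squared differences =", sum(squared))
print("variance =", round(variance_marks, 2))
```

Running the same steps on another student's scores would give a second variance that can be compared directly with this one.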

Day    Time to office (Mins)   Time - Mean (Difference)   Square of Difference
1      45                      -11                        121
2      43                      -13                        169
3      39                      -17                        289
4      39                      -17                        289
5      33                      -23                        529
6      55                      -1                         1
7      35                      -21                        441
8      33                      -23                        529
9      40                      -16                        256
10     198                     142                        20164
       Mean = 56               Sum = 0                    Sum = 22788

In the second example, the variance is 22788 / 10 = 2278.8. However, this data is a sample of 10 days and may not represent the complete set of data. For a sample, the variance is calculated as the sum of the squared differences divided by (number of values - 1).

Variance of sample data = 22788 / 9 = 2532.
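The difference between dividing by the number of values and dividing by (number of values - 1) can be seen with Python's built-in statistics module, which has a separate function for each case. This is a small sketch using the time-to-office data.

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# pvariance treats the data as the whole population: divide by n.
population_variance = statistics.pvariance(timemins)

# variance treats the data as a sample: divide by n - 1.
sample_variance = statistics.variance(timemins)

print("population variance =", round(population_variance, 1))
print("sample variance =", round(sample_variance, 1))
```

The two results should correspond to the 2278.8 and 2532 worked out above.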

Standard Deviation is the positive square root of the variance.

Hence, in the scores example, the standard deviation of the scores = sqrt(209.58) = 14.48 (approximately). In the second example, the standard deviation of the time to reach the office = sqrt(2532) = 50.32 (approximately). This means that in most cases the time taken to reach the office lies within the mean +/- one standard deviation, i.e., 56 +/- 50.32 minutes. Here 198 is an extreme value. If we ignore 198, the mean becomes 40.22 and the standard deviation becomes 6.92. Most of the times then lie within 40.22 +/- 6.92 minutes, which is the case here.
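To see the effect of the extreme value, the sample standard deviation can be computed with and without 198. This is a sketch using the statistics module; dropping the value with a simple filter is just my way of illustrating the point, not a general method for handling outliers.

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# Sample standard deviation of all ten days.
sd_all = statistics.stdev(timemins)

# The same calculation after leaving out the extreme value 198.
trimmed = [t for t in timemins if t != 198]
mean_trimmed = statistics.mean(trimmed)
sd_trimmed = statistics.stdev(trimmed)

# Count how many of the remaining days fall within mean +/- one standard deviation.
low, high = mean_trimmed - sd_trimmed, mean_trimmed + sd_trimmed
days_within = sum(1 for t in trimmed if low <= t <= high)

print("sd with 198 =", round(sd_all, 2))
print("mean without 198 =", round(mean_trimmed, 2))
print("sd without 198 =", round(sd_trimmed, 2))
print("days within one sd =", days_within, "of", len(trimmed))
```

With 198 removed, 6 of the 9 remaining days fall inside the 40.22 +/- 6.92 band.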

A Z-Score is the difference between a data value for a variable and the mean of the variable, divided by the standard deviation.

Day    Time to office (Mins)   Time - Mean (Difference)   Square of Difference          Z-Score
1      45                      -11                        121                           -0.22
2      43                      -13                        169                           -0.26
3      39                      -17                        289                           -0.34
4      39                      -17                        289                           -0.34
5      33                      -23                        529                           -0.46
6      55                      -1                         1                             -0.02
7      35                      -21                        441                           -0.42
8      33                      -23                        529                           -0.46
9      40                      -16                        256                           -0.32
10     198                     142                        20164                         2.82
       Mean = 56               Sum = 0                    Standard deviation = 50.32

Here, you can notice that the absolute value of the Z-Score is low for values close to the mean and high for extreme values.

Generally, if the Z-Score is either less than -3 or greater than +3, the value of the variable is considered extreme. In the above example, 198 minutes to the office (Z-Score 2.82) is close to being an extreme value.
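The Z-Score column in the table can be reproduced by hand as below. Note that this sketch uses the sample standard deviation (50.32), as the table does; the +/-3 cut-off is the rule of thumb mentioned above and the variable names are my own.

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

mean_time = statistics.mean(timemins)
sd_time = statistics.stdev(timemins)  # sample standard deviation (n - 1)

# Z-Score = (value - mean) / standard deviation, rounded as in the table.
zscores = [round((t - mean_time) / sd_time, 2) for t in timemins]

# Flag values whose Z-Score is below -3 or above +3.
extreme = [t for t, z in zip(timemins, zscores) if abs(z) > 3]

for day, (t, z) in enumerate(zip(timemins, zscores), start=1):
    print(day, t, z)
print("extreme values:", extreme)
```

With this data no value crosses the +/-3 threshold, although 198 comes close at 2.82.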

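Below is the Python code to calculate the variance, standard deviation and Z-Scores. It treats the data as the entire population and not as a sample, i.e., it divides by the number of values rather than by (number of values - 1). Its results for the time data therefore differ from the sample figures worked out above.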

import numpy as np
from pandas import DataFrame
from scipy import stats as st

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94, 85, 66, 55, 89, 64]
marksdf = DataFrame({'marks': marks})

daynums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]
timeminsdf = DataFrame({'timemins': timemins})

# np.var, np.std and st.zscore default to ddof=0 (population formulas: divide by N).
# Pass ddof=1 to each of them to get the sample versions (divide by N - 1).
variance_marks = np.var(marks)
print("variance in marks= " + str(variance_marks))

sd_marks = np.std(marks)
print("standard deviation in marks= " + str(sd_marks))

zscore_marks = st.zscore(marksdf)
print("Zscore of marks= " + str(zscore_marks))

mean_time = np.mean(timemins)
print("mean of timemins= " + str(mean_time))

variance_time = np.var(timemins)
print("variance in timemins= " + str(variance_time))

sd_time = np.std(timemins)
print("standard deviation in timemins= " + str(sd_time))

zscore_time = st.zscore(timeminsdf)
print("Zscore of timemins= " + str(zscore_time))

Skewness is a measure of the asymmetry of the distribution of a variable about its mean. Distribution can be either symmetrical, left-skewed (negative skew), or right-skewed (positive skew).

As a rule of thumb, when the mean of a data set equals the median, the distribution is symmetrical. When the mean is less than the median, it is negatively skewed. When the mean is greater than the median, the distribution is said to be positively skewed.

In our scores example, Mean of Score = (94+85+66+55+89+64)/6= 453/6= 75.5

The scores can be ordered in increasing value: 55, 64, 66, 85, 89, 94. Since there are 6 scores, there are two middle values here, i.e., 66 and 85. Hence, to calculate the median, we calculate the mean of these two middle values. Median = Mean(66, 85) = 75.5. Here we have a symmetrical data distribution.

In our time taken to reach office example, the mean of the time taken to reach the office is (45+43+39+39+33+55+35+33+40+198)/10 = 560/10 = 56 minutes. Median(33, 33, 35, 39, 39, 40, 43, 45, 55, 198) = Mean(39, 40) = 39.5 minutes. Since the mean is greater than the median, the distribution is positively skewed.
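The mean-versus-median comparison used above can be sketched in Python as follows. The function name skew_direction is hypothetical, and the comparison is only the rule of thumb described in this post, not a formal skewness statistic.

```python
import statistics

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def skew_direction(values):
    """Classify a data set by comparing its mean with its median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "symmetric"


print("marks:", statistics.mean(marks), statistics.median(marks), skew_direction(marks))
print("timemins:", statistics.mean(timemins), statistics.median(timemins), skew_direction(timemins))
```

For the scores the mean and median are both 75.5, while for the time data the mean of 56 is pulled well above the median of 39.5 by the single 198-minute day.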

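To see how the values are spread out, we can draw a box and whisker plot. It is based on five values: the smallest value, the first quartile (Q1), the median, the third quartile (Q3) and the largest value. Note that matplotlib's boxplot, by default, draws values lying more than 1.5 times the interquartile range beyond the quartiles as separate outlier points, so the whiskers do not necessarily end at the smallest and largest values.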
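The five values behind a box and whisker plot can also be computed directly. The sketch below assumes Python 3.8 or later (for statistics.quantiles) and uses method="inclusive", which interpolates linearly between data points; other quartile conventions give slightly different Q1 and Q3. The 1.5 x IQR fence for outliers is the usual box plot convention, and the function names are my own.

```python
import statistics

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


print("marks summary:", five_number_summary(marks))
print("timemins summary:", five_number_summary(timemins))
print("timemins outliers:", outliers(timemins))
```

Under these assumptions, 198 is the only value flagged as an outlier, which is why it would be expected to appear as a separate point rather than as the end of a whisker.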

Below is simple Python code to plot a box and whisker plot:

import matplotlib.pyplot as plt

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94, 85, 66, 55, 89, 64]

daynums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# First figure: box and whisker plot of the scores
plt.boxplot(marks)
plt.title("Box and Whisker Plot of Scores", fontsize=15)
plt.show()

# Second figure: box and whisker plot of the time to reach office
plt.boxplot(timemins)
plt.title("Box and Whisker Plot of Time to reach office", fontsize=15)
plt.show()

[Figure: Box and whisker plot of scores]

[Figure: Box and whisker plot of time to reach office]


Setting up Python Environment-Anaconda Installation

For a newbie like me, it is difficult to keep upgrading Python and associated packages while resolving package dependencies. This is where Anaconda comes to my rescue. Anaconda is a free Python distribution and package manager. It comes with a lot of pre-installed packages (primarily for data science).

It can be downloaded for Linux from Continuum’s site https://www.continuum.io/downloads#linux . The instructions for installation on Linux are available on the same site. I have downloaded and installed the 64-bit Python 3.6 version on my Linux Mint.

In order to update Anaconda and Python to the latest version, you need to run the command below in the terminal.

Screenshot from 2017-06-06 20-39-16

However, I continue to have older versions of Python. You can see in the screenshot below that Python 3.5.2, which I installed manually, and Python 2.7.12, which was pre-installed on Linux Mint, are still available.

Screenshot from 2017-06-06 20-47-40

conda update anaconda

Screenshot from 2017-06-06 20-50-45.png

On my Linux Mint, I have already updated to Anaconda version 4.4.0 (the latest available at the time of writing). This way, it is easy to keep upgrading Python and the required packages.

On my PyCharm, I can choose Python 3.6 (installed through Anaconda / conda update) as the project interpreter.

Screenshot from 2017-06-06 20-57-17.png
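With several Python versions installed side by side, it can be useful to confirm which interpreter a project is actually running. The snippet below is one possible check using only the standard library; the function name interpreter_info is my own, and the path it prints depends entirely on your installation.

```python
import sys


def interpreter_info():
    """Return the version string and path of the running Python interpreter."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return version, sys.executable


version, path = interpreter_info()
print("Python version:", version)
print("Interpreter path:", path)
```

If the Anaconda interpreter is selected, the path should point inside the Anaconda installation folder rather than at the system Python.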

Anaconda also comes with Anaconda Navigator – a GUI useful for launching applications, managing packages, learning Python, etc.

To add a shortcut to anaconda-navigator on the desktop, I created the following script (a desktop entry file in the /usr/share/applications folder):

[Desktop Entry]
Version=1.0
Type=Application
Name=Anaconda-Navigator
GenericName=Anaconda
Comment=Scientific PYthon Development EnviRonment - Python3
Exec=bash -c 'export PATH="/home/srinivas/anaconda3/bin:$PATH" && /home/srinivas/anaconda3/bin/anaconda-navigator'
Categories=Development;Science;IDE;Qt;Education;
Icon=/home/srinivas/anaconda3/Anaconda.png
Terminal=false
StartupNotify=true
Name[en_IN]=Anaconda

Screenshot from 2017-06-06 21-08-23

Spyder is an open source cross platform IDE for scientific programming in Python. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software.

Screenshot from 2017-06-06 21-10-01

To conclude, Anaconda is a Python distribution with a lot of useful features and learning opportunities in one place.

 

 

Setting up Python Environment-Installing Packages

In order to build useful applications, we need Python libraries or packages. The majority of such useful packages can be downloaded from PyPI, the Python Package Index https://pypi.python.org/pypi

The best way to install packages is by using a tool called pip. We can get pip from https://pip.pypa.io/en/latest/installing.html. However, on Linux Mint, pip is already installed along with Python 2.7.12. Similarly, when I installed Python 3.5.2, the pip3 tool was installed with it. To upgrade to the latest pip, run the command below in the terminal:

pip install -U pip

Now let us look at some of the useful packages for analyzing data.

Numpy (http://www.numpy.org/): Useful for processing arrays of numbers, strings, records, and objects.

pip install numpy

Pandas (http://pandas.pydata.org/): Python Data Analysis Library provides various data analysis tools for Python.

pip install pandas

Matplotlib (https://matplotlib.org/): Matplotlib is a Python 2D plotting library that produces publication quality graphs and figures.

pip install matplotlib

OpenPyXL (https://openpyxl.readthedocs.io/en/default/): Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

pip install openpyxl

Once these packages are installed, they can be imported and used in your application. E.g.,

import numpy
import pandas
import openpyxl
import matplotlib

from pandas import DataFrame
from pandas import *
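Before importing, it can help to check whether the packages are actually available to the interpreter you are using. The snippet below is a small sketch using the standard library's importlib; the helper name is_available is made up for illustration.

```python
from importlib import util


def is_available(module_name):
    """Return True if the named top-level module can be found by this interpreter."""
    return util.find_spec(module_name) is not None


for name in ["numpy", "pandas", "matplotlib", "openpyxl"]:
    status = "installed" if is_available(name) else "missing"
    print(name, "->", status)
```

Any package reported as missing can then be installed with pip as shown above.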

 

 

How I got bitten by Python programming

Many years ago, I used to be a Java programmer. In fact, I started my information technology career in the year 1999 as a software developer in a small company which focussed on application software development. During my engineering days, I learnt Fortran and C programming. When I completed my engineering, there was the Y2K problem (https://en.wikipedia.org/wiki/Year_2000_problem), which helped many job aspirants to jump into the IT industry irrespective of their educational background.

During the same time, Java was one of the bleeding edge technologies. There was a saying – ‘To get into an IT job, all you need to know is the spelling of Java’.

After a few years of programming (mainly in Java, Web development, SQL, Database design), like many others, I moved on to project management. With more focus on day to day operations, I gradually lost my hold on coding but not the zeal.

Several years later, in the current digital world, data analytics caught my attention. I am interested in learning data analytics and visualization. For the past few years, I have been using the Linux Mint Cinnamon OS more frequently on my personal laptop, as it is free and open source (FOSS). I was fascinated by the Cinnamon Desktop Environment. The Wikipedia page – https://en.wikipedia.org/wiki/Linux_Mint – states that most of Linux Mint is developed in the Python language – https://www.python.org/. I was aware that the majority of Unix/Linux development happens in C, so I was surprised to see Python. This was my first encounter with Python, and this is when I started building my understanding of it from the internet.

Why learn Python?

  • It is free and open source (FOSS)
  • Already available in several Linux distributions
  • Easy to learn for beginners (minimal coding is required)
  • One of the languages widely used for Data Analytics
  • Popular (http://www.tiobe.com/tiobe-index/) and good Community support
  • Availability of code libraries / packages
    • Many Web development frameworks – Django, Bottle, Flask etc
    • Scientific and numeric computing – Numpy, Matplotlib, Pandas etc
    • Rich GUI development – pyQt, wxPython

Python 2.x or 3.x?

Several books and websites debate whether to use Python 2 or Python 3. I noticed that Python 2.7 was installed by default on Linux Mint 18 (Sarah). When I started learning, I felt that going forward the focus would be on developing Python 3.x, as it is the present and future. Hence I started with the Python 3.5 interpreter. Fortunately, Python 3.5 is also pre-installed on the latest Linux Mint 18.1 (Serena).

My favourite books for learning Python / Data Analytics

There are several online books and tutorials available. One of my favourites is Tutorials Point – https://www.tutorialspoint.com/python/

I follow the Google plus Python community frequently – https://plus.google.com/u/0/communities/103393744324769547228

Also, Stackoverflow (http://stackoverflow.com/questions/tagged/python) comes to my rescue whenever I encounter some hurdles.

Screenshot from 2016-12-31 20-44-32.png

Disclaimer: The opinions and experiences listed on this site are my own. In some cases, my understanding could be incorrect, as I am a beginner to intermediate programmer. Please point out any correction that is required so that I can consider editing the blog.