Analyzing data in Python – Measures of Variation

Variance is a measure of how far a set of numbers is spread out from its mean.

Range is the difference between the largest and smallest data values in a
set of values for a variable

e.g.,

Subject      Score (out of 100)
Maths        94
Physics      85
Chemistry    66
French       55
Computers    89
English      64

In the above example, the range can be computed as: 94 – 55 = 39

Another example is the time taken to reach the office:

Day:            1     2     3     4     5     6     7     8     9     10
Time (mins):    45    43    39    39    33    55    35    33    40    198

Here the range is 198 – 33 = 165

The range is the largest possible difference between any two values in a set of data values for a variable. In the second example the range is very high (165 on values that are mostly between 33 and 55), whereas in the first example it is moderately high (39 on a scale of 100).
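The two range calculations above can be reproduced with a few lines of Python. This is a minimal sketch using only built-in functions; the helper name value_range is made up for this illustration.

```python
# Data copied from the two examples above.
marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def value_range(values):
    """Return the difference between the largest and smallest value.

    value_range is a hypothetical helper name used only in this sketch.
    """
    return max(values) - min(values)


range_marks = value_range(marks)
range_time = value_range(timemins)

print("range of marks =", range_marks)
print("range of timemins =", range_time)
```

Because the range looks only at the two most extreme values, the single 198-minute day is enough to make the second range very large.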

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.

Subject      Score (out of 100)    Score – Mean (Difference)    Square of Difference
Maths        94                    18.5                         342.25
Physics      85                    9.5                          90.25
Chemistry    66                    -9.5                         90.25
French       55                    -20.5                        420.25
Computers    89                    13.5                         182.25
English      64                    -11.5                        132.25
             75.5 (mean)           0 (sum)                      1257.5 (sum)

In the above example, the mean (average) mark is 75.5, and the sum of the differences between each score and the mean is always zero. Because a sum that is always zero cannot be used to compare this data set with another, the variance is calculated as the sum of the squared differences divided by the number of values. Hence the variance is 1257.5 / 6 = 209.58

This way, we can compare the variance of scores for two different students.
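The steps in the scores table can be followed one by one in code. The sketch below uses plain Python and cross-checks the result against statistics.pvariance from the standard library; the variable names are chosen for this illustration only.

```python
import statistics

marks = [94, 85, 66, 55, 89, 64]

# Step 1: mean of the scores.
mean_marks = sum(marks) / len(marks)

# Step 2: difference between each score and the mean (these sum to zero).
deviations = [m - mean_marks for m in marks]

# Step 3: square each difference so positives and negatives do not cancel.
squared = [d ** 2 for d in deviations]

# Step 4: divide the sum of squares by the number of values.
variance_marks = sum(squared) / len(marks)

print("mean =", mean_marks)
print("sum of differences =", sum(deviations))
print("sum of squares =", sum(squared))
print("variance =", round(variance_marks, 2))

# Cross-check with the standard library's population variance.
print("statistics.pvariance =", round(statistics.pvariance(marks), 2))
```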

Day    Time to office (Mins)    Time – Mean (Difference)    Square of Difference
1      45                       -11                         121
2      43                       -13                         169
3      39                       -17                         289
4      39                       -17                         289
5      33                       -23                         529
6      55                       -1                          1
7      35                       -21                         441
8      33                       -23                         529
9      40                       -16                         256
10     198                      142                         20164
       56 (mean)                0 (sum)                     22788 (sum)

In the second example, the variance is 22788 / 10 = 2278.8. However, this data is a sample of 10 days and may not represent the complete set of data. For sample data, the variance is calculated as the sum of the squared differences divided by (number of values – 1).

Variance of sample data = 22788 / 9 = 2532.
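The difference between dividing by n and by n – 1 can be checked with the standard library statistics module, which has a separate function for each case. This is a small sketch; the variable names are illustrative.

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# Population variance: sum of squared differences divided by n.
population_variance = statistics.pvariance(timemins)

# Sample variance: sum of squared differences divided by (n - 1).
sample_variance = statistics.variance(timemins)

print("population variance =", population_variance)
print("sample variance =", sample_variance)
```

With NumPy, the same choice is made through the ddof argument: np.var(timemins) divides by n, while np.var(timemins, ddof=1) divides by n – 1.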

Standard Deviation is the positive square root of the variance.

Hence, in the scores example, the standard deviation of scores = sqrt(209.58) = 14.48 (approximate). In the second example, the standard deviation of the time to reach the office = sqrt(2532) = 50.32 (approximate). What this means is that in most cases the time taken to reach the office lies within the mean +/- one standard deviation, i.e., 56 +/- 50.32 minutes. Here 198 is an extreme value that inflates both figures. If we ignore 198, the mean becomes 40.22 and the (sample) standard deviation becomes 6.92. So in most cases the time taken to reach the office lies within 40.22 +/- 6.92 minutes, which matches the data much better.
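The effect of the single 198-minute day on the mean and standard deviation can be reproduced as follows. This is a sketch using the standard library; it uses the sample standard deviation (dividing by n – 1), which is the convention behind the 50.32 and 6.92 figures.

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# Sample standard deviation of all ten days.
mean_all = statistics.mean(timemins)
sd_all = statistics.stdev(timemins)

# Same figures after dropping the extreme 198-minute day.
without_outlier = [t for t in timemins if t != 198]
mean_trimmed = statistics.mean(without_outlier)
sd_trimmed = statistics.stdev(without_outlier)

print("all days:     mean = %.2f, sd = %.2f" % (mean_all, sd_all))
print("without 198:  mean = %.2f, sd = %.2f" % (mean_trimmed, sd_trimmed))
```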

A Z-score is the difference between a data value for a variable and the mean of the variable, divided by the standard deviation.

Day    Time to office (Mins)    Time – Mean (Difference)    Square of Difference    Z-Score
1      45                       -11                         121                     -0.22
2      43                       -13                         169                     -0.26
3      39                       -17                         289                     -0.34
4      39                       -17                         289                     -0.34
5      33                       -23                         529                     -0.46
6      55                       -1                          1                       -0.02
7      35                       -21                         441                     -0.42
8      33                       -23                         529                     -0.46
9      40                       -16                         256                     -0.32
10     198                      142                         20164                   2.82
       56 (mean)                0 (sum)                     50.32 (standard deviation)

Here, you can notice that the absolute value of the Z-score is low for values close to the mean and high for extreme values.

Generally, if the Z-score is less than -3 or greater than +3, the value of the variable is considered extreme. In the above example, 198 minutes to the office (Z-score 2.82) is close to being an extreme value.
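The Z-scores in the table can be recomputed directly from the definition. The sketch below uses the sample standard deviation (50.32) so that it matches the table, and applies the +/-3 rule of thumb; the threshold variable and list names are illustrative.

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

mean_time = statistics.mean(timemins)
sd_time = statistics.stdev(timemins)  # sample sd, as used in the table

# Z-score = (value - mean) / standard deviation
zscores = [(t - mean_time) / sd_time for t in timemins]

for day, (t, z) in enumerate(zip(timemins, zscores), start=1):
    print("day %2d: %3d mins, z = %5.2f" % (day, t, z))

# Rule of thumb: |z| > 3 marks an extreme value.
threshold = 3
extreme = [t for t, z in zip(timemins, zscores) if abs(z) > threshold]
print("extreme values:", extreme)
```

With this convention the 198-minute day scores 2.82, so it falls just short of the threshold and the list of extreme values is empty. Note that scipy.stats.zscore divides by the population standard deviation by default (ddof=0), so it returns slightly larger magnitudes than the table.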

Below is the Python code to calculate the variance, standard deviation and Z-scores. Note that it treats the data as the entire population (dividing by the count, not by count – 1), so its results for the time data differ from the sample figures above.

import numpy as np
from pandas import DataFrame as df
from scipy import stats as st

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]
marksdf = df({'marks':marks})

daynums = ['1', '2', '3', '4', '5', '6','7','8','9','10']
timemins = [45,43,39,39,33,55,35,33,40,198]
timeminsdf = df({'timemins':timemins})

variance_marks = np.var(marks)
print("variance in marks= " +str(variance_marks))

sd_marks = np.std(marks)
print("standard deviation in marks= " +str(sd_marks))

zscore_marks = st.zscore(marksdf)
print("Zscore of marks= " +str(zscore_marks))

mean_time = np.mean(timemins)
print("mean of timemins= " +str(mean_time))

variance_time = np.var(timemins)
print("variance in timemins= " +str(variance_time))

sd_time = np.std(timemins)
print("standard deviation in timemins= " +str(sd_time))

zscore_time = st.zscore(timeminsdf)
print("Zscore of timemins= " +str(zscore_time))

Skewness is a measure of the asymmetry of the distribution of a variable about its mean. Distribution can be either symmetrical, left-skewed (negative skew), or right-skewed (positive skew).

As a rule of thumb, when the mean of a data set equals the median, the distribution is treated as symmetrical. When the mean is less than the median, it is negatively skewed. When the mean is greater than the median, it is positively skewed.

In our scores example, Mean of Score = (94+85+66+55+89+64)/6= 453/6= 75.5

The scores can be ordered in increasing value: 55, 64, 66, 85, 89, 94. Since there are 6 scores, there are two middle values here, i.e., 66 and 85. Hence, to calculate the median, we take the mean of these two middle values. Median = Mean(66, 85) = 75.5. Here we have a symmetrical data distribution.

In our time taken to reach office example, the mean is (45+43+39+39+33+55+35+33+40+198)/10 = 560/10 = 56 minutes. Median(33, 33, 35, 39, 39, 40, 43, 45, 55, 198) = Mean(39, 40) = 39.5 minutes. Since the mean is greater than the median, the distribution is positively skewed.
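The mean-versus-median comparison can be written as a small function. This is only a sketch of the rule of thumb described here, not a full skewness statistic; the function name skew_direction is made up for this example, and exact equality of mean and median is rare in real data.

```python
import statistics

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def skew_direction(values):
    """Classify skew by comparing mean and median (rule of thumb only).

    skew_direction is a hypothetical helper for this illustration.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "symmetrical"


print("marks:", skew_direction(marks))
print("timemins:", skew_direction(timemins))
```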

To see how the values are spread out, we can draw a box and whisker plot. It marks five values as lines: the smallest value, the first quartile Q1, the median, the third quartile Q3, and the largest value.

Below is a simple Python code to plot a box and whisker plot:
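The five values behind a box and whisker plot can also be computed as numbers. The sketch below uses statistics.quantiles from the standard library (Python 3.8 or later) with the "inclusive" method, which interpolates linearly between data points; other quartile conventions exist and give slightly different Q1 and Q3. The function name five_number_summary is chosen for this illustration.

```python
import statistics

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


print("marks:   ", five_number_summary(marks))
print("timemins:", five_number_summary(timemins))
```

One thing to keep in mind when reading the plots: by default, matplotlib's boxplot draws the whiskers only as far as 1.5 times the interquartile range beyond the box and shows anything further out as a separate point, so a value like 198 appears as an isolated marker rather than as the end of a whisker.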

import matplotlib.pyplot as plt

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]

daynums = ['1', '2', '3', '4', '5', '6','7','8','9','10']
timemins = [45,43,39,39,33,55,35,33,40,198]

plt.boxplot(marks)
plt.title("Box and Whisker Plot of Scores",fontsize = 15)
plt.show()

plt.boxplot(timemins)
plt.title("Box and Whisker Plot of Time to reach office ",fontsize = 15)
plt.show()

Box1

Box2

Analyzing data in Python – Scatter (xy) Plot

A Scatter plot (also known as XY plot) has points that show the relationship between two sets of variables.

e.g., a plot of a person's height vs. weight.

import matplotlib.pyplot as plt
import pandas

heights = []
weights = []

colnames = ['Height', 'Weight']
data = pandas.read_csv('ShortListOfHeightWeight.csv', names=colnames)

heights=data.Height.tolist()
weights=data.Weight.tolist()

plt.scatter(heights, weights)
plt.title('Scatter plot of height and corresponding weight', fontsize=15)
plt.xlabel('height', fontsize=15)
plt.ylabel('weight', fontsize=15)
plt.show()

Here we have data of 250 persons (height in inches and corresponding weight in lbs).

Scatter

 

Analyzing data in Python – Time Series Plot

A time series graph is a graph or plot that illustrates data points at successive intervals of time. It can be drawn using a Python Pandas’ Series.plot method.

e.g., Plot of the closing values of stock market S&P BSE sensex on the y axis vs time on the x axis (starting year 2000 to 2018).

Data is downloaded as a csv file from the site https://www.bseindia.com/indices/IndexArchiveData.aspx

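import pandas as pd
from matplotlib import pyplot as plt

# Series.from_csv was removed in pandas 1.0; read_csv is the replacement.
# The first column (dates) becomes the index and the first data column is plotted.
data = pd.read_csv('SENSEX.csv', header=0, index_col=0, parse_dates=True)
series = data.iloc[:, 0]
series.plot()
plt.ylabel('Sensex')
plt.show()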

TimeSeries1

Analyzing data in Python – Pareto Charts

As per Wikipedia, a Pareto chart, named after Vilfredo Pareto, is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.

Below is a simple example of Pareto Chart in Python

from matplotlib import pyplot as plot
import numpy as np
preference = {'Comedy':1500,'Science Fiction':670,'Action':950,'Drama':450,'Romance':50}

# sort preference in descending order
weights, labels = zip(*sorted(((pref,genre) for genre,pref in preference.items()), reverse=True))

# running (cumulative) total of the sorted weights
cumu_weights = np.cumsum(weights).tolist()

print(cumu_weights)

# x position of each bar
left = np.arange(len(weights))
fig, ax = plot.subplots(1, 1)
ax.bar(left, weights, 1)
ax.set_xticks(left)
ax.set_xticklabels(labels,fontsize=10, fontweight='bold', rotation=35, color='darkblue')
# cumulative total drawn as a line over the bars
ax.plot(left, cumu_weights)
plot.show()

Here we sort the preferences in descending order and draw a bar chart with the weightage of each preference on the y axis. We then take the cumulative totals of these weightages, in the same descending order, and plot them as a line graph.
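The sorting and cumulative-total steps can be checked on their own, without drawing anything. The sketch below uses itertools.accumulate from the standard library and also converts the running totals to percentages, which is a common way to label the line in a Pareto chart; the percentage step is an addition for illustration and is not part of the code above.

```python
from itertools import accumulate

preference = {'Comedy': 1500, 'Science Fiction': 670, 'Action': 950,
              'Drama': 450, 'Romance': 50}

# Sort genres by their count, largest first.
ranked = sorted(preference.items(), key=lambda item: item[1], reverse=True)
labels = [genre for genre, _ in ranked]
weights = [count for _, count in ranked]

# Running total of the sorted counts.
cumu_weights = list(accumulate(weights))

# Running total as a percentage of the grand total (illustrative extra step).
total = cumu_weights[-1]
cumu_percent = [round(100 * c / total, 1) for c in cumu_weights]

for label, weight, cumu, pct in zip(labels, weights, cumu_weights, cumu_percent):
    print("%-16s %5d %5d %6.1f%%" % (label, weight, cumu, pct))
```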

Pareto

 

Analyzing data in Python – Bar Charts

There are many good books on Data Analytics. Recently I borrowed a book from the office library titled “Even You Can Learn Statistics and Analytics: An Easy to Understand Guide to Statistics and Analytics”, authored by David M. Levine and David F. Stephan. I feel this is a good one for beginners in Data Analytics.

There are also other good books like – ‘Python for Data Analysis’, ‘Python: Data Analytics and Visualization‘ , ‘Python for Finance’ etc.

One important aspect of presenting data is in Graph format (visual format – also known as Data Visualization).

A bar chart is useful for presenting categorical data. It has rectangular bars whose length is proportional to the value of each category we want to present.

E.g. We want to represent the Marks of a Student in several subjects:

Subject      Score (out of 100)
Maths        94
Physics      85
Chemistry    66
French       55
Computers    89
English      64

Bar1

This is a vertical bar graph. This can be achieved by a small Python code:

import matplotlib.pyplot as plot
import numpy as np
subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]

m = np.arange(len(subjects))
plot.bar(m, marks)
plot.xlabel('Subject')
plot.ylabel('Marks')
plot.xticks(m, subjects)
plot.title('Marks obtained out of 100')
plot.show()

Matplotlib, a Python 2D plotting library, and NumPy, the fundamental package for scientific computing with Python, are two important Python packages.

Here the subjects and scores (marks) are held in Python lists. len(subjects) returns the length of subjects, in this case 6.

numpy.arange(6) generates the positions 0 to 5, one per subject, which place the bars in order on the graph. On the x-axis we have the subject names (xticks) and on the y-axis we have the marks.

plot.bar is plotting the vertical bar chart.

We can also define other parameters for the graph such as fontsize, weight, rotation:

plot.bar(m, marks,color='indigo')
plot.xlabel('Subject',fontsize=15, fontweight='bold', color='blue')
plot.ylabel('Marks',fontsize=15, fontweight='bold', color='blue')
plot.xticks(m, subjects, fontsize=10, fontweight='bold', rotation=35, color='blue')
plot.title('Marks obtained out of 100',fontsize=15, fontweight='bold', color='blue')

Bar2

The same can be represented as a horizontal bar graph. Instead of bar function, we use barh function.

import matplotlib.pyplot as plot
import numpy as np
subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]

m = np.arange(len(subjects))
plot.barh(m, marks,color='indigo')
plot.ylabel('Subject',fontsize=15, fontweight='bold', color='blue')
plot.xlabel('Marks',fontsize=15, fontweight='bold', color='blue')
plot.yticks(m, subjects, fontsize=10, fontweight='bold', rotation=35, color='blue')
plot.title('Marks obtained out of 100',fontsize=15, fontweight='bold', color='blue')
plot.show()

Bar3

In order to add the data values to the horizontal bar graph, we need to call plot.text once per bar, before plot.show():

for i, value in enumerate(marks):
 plot.text(value, i, str(value), color='indigo', fontweight='bold')

Bar4.png

I am using Spyder IDE (from Anaconda Navigator) in order to run this code. It has a handy feature, a variable explorer that shows the details of the variables used in code.

Screenshot from 2018-02-03 12-57-00.png

Setting up Python Environment-Anaconda Installation

For a newbie like me, it is difficult to keep upgrading Python and associated packages while resolving package dependencies. This is where Anaconda comes to my rescue. Anaconda is a free Python distribution and package manager. It comes with a lot of pre-installed packages (primarily for data science).

It can be downloaded for Linux from the Continuum’s site https://www.continuum.io/downloads#linux . The instructions for installation on Linux are available on the same site. I have downloaded and installed 64 bit Python 3.6 version on my Linux Mint.

In order to update Anaconda and Python to the latest version, you need to run the conda update command shown below in the terminal.

Screenshot from 2017-06-06 20-39-16

However, I continue to have older version of Python. You can see in below screenshot, Python 3.5.2 which I manually installed and Python 2.7.12 which was pre-installed on Linux Mint are still available.

Screenshot from 2017-06-06 20-47-40

conda update anaconda

Screenshot from 2017-06-06 20-50-45.png

On my Linux Mint, I have already updated to Anaconda version 4.4.0 (the latest available at the time of writing). This way, it is easy to keep upgrading Python and the required packages.

In PyCharm, I can choose Python 3.6 (installed through Anaconda / conda update) as the project interpreter.
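With several Python versions installed side by side, it helps to confirm which interpreter a script or IDE project is actually running. A minimal sketch using only the sys module follows; the function name interpreter_summary is made up for this example.

```python
import sys


def interpreter_summary():
    """Describe the running interpreter: its version and where it lives.

    interpreter_summary is a hypothetical helper for this illustration.
    """
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return "Python {} at {}".format(version, sys.executable)


print(interpreter_summary())
```

If the path printed points inside the anaconda3 folder, the Anaconda interpreter is in use; a path such as /usr/bin would indicate the system Python instead.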

Screenshot from 2017-06-06 20-57-17.png

Anaconda also comes with Anaconda Navigator, a GUI useful for launching applications, managing packages, learning Python, etc.

To add a shortcut for anaconda-navigator to the desktop, I created the following desktop entry file in the /usr/share/applications folder:

[Desktop Entry]
Version=1.0
Type=Application
Name=Anaconda-Navigator
GenericName=Anaconda
Comment=Scientific PYthon Development EnviRonment - Python3
Exec=bash -c 'export PATH="/home/srinivas/anaconda3/bin:$PATH" && /home/srinivas/anaconda3/bin/anaconda-navigator'
Categories=Development;Science;IDE;Qt;Education;
Icon=/home/srinivas/anaconda3/Anaconda.png
Terminal=false
StartupNotify=true
Name[en_IN]=Anaconda

Screenshot from 2017-06-06 21-08-23

Spyder is an open source cross platform IDE for scientific programming in Python. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software.

Screenshot from 2017-06-06 21-10-01

To conclude, Anaconda is a Python distribution with a lot of useful features and learning opportunities in one place.

 

 

Setting up Python Environment-Installing Packages

In order to build useful applications, we need Python libraries or packages. The majority of such useful packages can be downloaded from PyPI, the Python Package Index https://pypi.python.org/pypi

The best way to install packages is by using a tool called pip. We can get pip from https://pip.pypa.io/en/latest/installing.html. However, on Linux Mint, pip is already installed along with Python 2.7.12. Similarly, when I installed Python 3.5.2, the pip3 tool was installed with it. To upgrade to the latest pip, you need to run the below command in the terminal:

pip install -U pip

Now let us look at some of the useful packages for analyzing data.

NumPy (http://www.numpy.org/): Useful for processing arrays of numbers, strings, records, and objects.

pip install numpy

Pandas (http://pandas.pydata.org/): Python Data Analysis Library provides various data analysis tools for Python.

pip install pandas

Matplotlib (https://matplotlib.org/): Matplotlib is a Python 2D plotting library to produce publication quality graphs and figures.

pip install matplotlib

OpenPyXL (https://openpyxl.readthedocs.io/en/default/): Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

pip install openpyxl

Once these packages are installed, they can be imported and used in your application. E.g.,

import numpy
import pandas
import openpyxl
import matplotlib

from pandas import DataFrame
from pandas import *
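Before relying on these imports, it can be useful to check which of the packages are actually available to the current interpreter. The sketch below uses importlib.util.find_spec from the standard library, which looks a module up without importing it; the function name missing_packages and the required list are illustrative.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter.

    missing_packages is a hypothetical helper for this illustration.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["numpy", "pandas", "openpyxl", "matplotlib"]
missing = missing_packages(required)

if missing:
    print("Not installed:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```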