Probability distribution functions in Python

As defined in Wikipedia (https://en.wikipedia.org/wiki/Probability_distribution), a probability distribution is a mathematical function that, stated in simple terms, can be thought of as providing the probabilities of occurrence of different possible outcomes in an experiment.

There are several commonly used distributions:

Binomial Distribution
Poisson Distribution
Normal Distribution
Exponential Distribution

Binomial Distribution: 

The binomial distribution is a discrete probability distribution.
It describes the outcome of n independent trials in an experiment. Each trial has only two outcomes, either success or failure (e.g., tossing a coin).

Suppose a binomial experiment consists of n trials and results in x successes. If the probability of success on an individual trial is P, then the binomial probability is:

bdist(x, n, p) = nCx * p^x * (1 – p)^(n – x)

or

bdist(x, n, p) = (n! / (x! (n – x)!)) * p^x * (1 – p)^(n – x)
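This formula can be evaluated directly with Python's standard library; a minimal sketch using math.comb (available from Python 3.8), with bdist as the helper name used above:

```python
from math import comb

def bdist(x, n, p):
    """Binomial pmf: nCx * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(bdist(10, 60, 0.25))
```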

In an Engineering entrance test for chemistry, there are 60 multiple choice questions with 4 choices each with only 1 correct choice. I am very poor in chemistry and want to randomly mark an answer for each question and complete the test.

For each question (trial), the probability of success is 1/4 = 0.25 (4 choices each with only 1 correct choice).

Hence the probability of getting 10 correct answers is bdist(10,60,0.25)

bdist(10; 60, 0.25) = 60C10 * 0.25^10 * (1 – 0.25)^(60 – 10)

= (60! / (10! * 50!)) * 0.25^10 * (0.75)^50

= 0.04071929

Below are a few lines in Python:

import scipy.stats as ss
pmf = ss.distributions.binom.pmf(10,60,0.25) #10 correct answers, 60 attempted, each probability of correctness 1/4
print(pmf)

Suppose I want to know the probability of getting at least 10 answers correct. Then I need to calculate the cumulative binomial probability, i.e., bdist(x ≥ 10; 60, 0.25), which is:

the sum of bdist(10; 60, 0.25) + bdist(11; 60, 0.25) + bdist(12; 60, 0.25) + … + bdist(60; 60, 0.25)

This is calculated as 0.95483251

Below are a few lines in Python:

import scipy.stats as ss
spmf = 0.0
for x in range(10, 61):  # range's end is exclusive, so 61 includes x = 60
    spmf += ss.distributions.binom.pmf(x, 60, 0.25)  # at least 10 correct answers out of 60, each with probability of correctness 1/4
print(spmf)
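The same cumulative probability can be obtained without a loop, since scipy exposes a survival function (sf) that computes P(X > k) directly:

```python
import scipy.stats as ss

# P(X >= 10) = P(X > 9), which is the survival function evaluated at 9
sf = ss.binom.sf(9, 60, 0.25)
print(sf)
```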

Poisson Distribution: As per Investopedia, a Poisson distribution is a statistical distribution showing the likely number of times that an event will occur within a specified period of time. It is used for independent events which occur at a constant rate within a given interval of time.

Suppose we conduct an experiment several times and the average number (mean) of successes is represented by m. The Poisson probability for an actual number of successes x is:

pdist(x, m) = (e^–m) (m^x) / x!

I have noticed that, on average, I get 7 spam calls a week (banks offering credit cards, advertising, people offering investment services, etc.). This happens usually during weekends. Suppose I want to calculate the probability of getting 4 spam calls next week. I could use the Poisson distribution formula: pdist(4, 7) = (e^–7) (7^4) / 4!, which calculates to 0.09122619.

Also, if I want to calculate the probability of 4 or fewer spam calls, it becomes the cumulative probability of 0, 1, 2, 3, or 4 spam calls. This should be 0.17299160.

Below are a few lines in Python:

import scipy.stats as ss
pmf = ss.distributions.poisson.pmf(4,7) #7 spam calls average a week, calculating probability of 4 spam calls
print(pmf)

spmf = 0.0
for x in range(0, 5):
    spmf += ss.distributions.poisson.pmf(x, 7)  # 7 spam calls a week on average; cumulative probability of 4 or fewer spam calls
print(spmf)
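The loop above sums pmf values; the same cumulative probability is available in one call via the Poisson cumulative distribution function:

```python
import scipy.stats as ss

# P(X <= 4) with mean 7, computed directly instead of summing pmf values
cdf = ss.poisson.cdf(4, 7)
print(cdf)
```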

Normal Distribution:

Unlike the Binomial and Poisson distributions, this is a continuous distribution (it can take on any value within the range of the random variable, not just integers). Here the probability values are expressed in terms of an area under a curve (bell-shaped and symmetrical) that represents the continuous distribution.

f(x | μ, σ²) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²))

μ is the mean or expectation of the distribution (and also its median and mode), σ is the standard deviation, and σ² is the variance.

About 68% of the area under the curve falls within 1 standard deviation of the mean. 95% of the area under the curve falls within 2 standard deviations of the mean and 99.7% of the area under the curve falls within 3 standard deviations of the mean.
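These percentages can be verified numerically with scipy's normal CDF; a quick sketch using the standard normal (mean 0, standard deviation 1):

```python
import scipy.stats as ss

# area under the standard normal curve within k standard deviations of the mean
for k in (1, 2, 3):
    area = ss.norm.cdf(k) - ss.norm.cdf(-k)
    print(k, round(area, 4))
```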

Suppose I know that the average score in a chemistry test in a class is 68.5% and that the standard deviation is 10.8%. To find the probability of a score ≤ 80%, we can use the Normal distribution. This calculates to 0.85652013.

import scipy.stats as ss
cdf = ss.distributions.norm.cdf(80.0, 68.5, 10.8) 
print(cdf)

Please note that we are using the cumulative distribution function (cdf).

Suppose we need to know the probability of a score between 80 and 90%. We calculate the cdf for < 90% and the cdf for < 80%, and subtract them.

The cdf for < 90% is calculated to be 0.97674530.

probability of score between 80 and 90 % = 0.97674530 – 0.85652013 = 0.12022517

import scipy.stats as ss
cdf80 = ss.distributions.norm.cdf(80.0, 68.5, 10.8) 
cdf90 = ss.distributions.norm.cdf(90.0, 68.5, 10.8) 
print(cdf90-cdf80)

Exponential distribution: As defined in Wikipedia, the exponential distribution is the probability distribution that describes the time between events in a process in which events occur continuously and independently at a constant average rate. It is often used to model the time elapsed between events.

The probability density function (pdf) of an exponential distribution is

f(x; λ) = λ e^(−λx) for x ≥ 0, and 0 for x < 0

The cumulative distribution function is given by

F(x; λ) = 1 − e^(−λx) for x ≥ 0, and 0 for x < 0

where

  • λ is the rate of events per unit time (so the mean time between events is 1/λ)
  • x is a random variable (here, the waiting time)

While driving on Bangalore roads, suppose you usually drive over 5 potholes per hour, and you want to compute the probability that you will hit a pothole within the next half an hour. 5 potholes per hour means a rate of λ = 5 (i.e., we expect one pothole every 1/5 hour). We can then compute this as follows:
P(0 <= X <= 0.5) = expcdf(0.5; 5) = 1 – e^(–5 × 0.5) = 0.917915

import scipy.stats as ss

# scipy's expon is parameterized by scale = 1/lambda; a bare second argument
# would be interpreted as loc (a shift), not the rate, so pass scale explicitly
cdfexp = ss.distributions.expon.cdf(0.5, scale=1/5)
print(cdfexp)
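The scipy result can be cross-checked against the closed-form CDF 1 − e^(−λx); with a rate of λ = 5 potholes per hour and x = 0.5 hours:

```python
from math import exp

lam = 5   # rate: 5 potholes per hour
x = 0.5   # half an hour
p = 1 - exp(-lam * x)  # closed-form exponential CDF: F(x; lambda) = 1 - e^(-lambda*x)
print(p)
```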


Analyzing data in Python – Measures of Central Tendency

In Statistics, Central Tendency could be defined as the tendency for the values of a random variable to cluster around its mean, mode, or median.

Mean, Median and Mode are known as Measures of Central Tendency

Mean is the average, i.e., the sum of the data values for a variable divided by the number of data values. It signifies the balance point of the data values.

e.g.,

Subject      Score out of 100
Maths        94
Physics      85
Chemistry    66
French       55
Computers    89
English      64

In the above table, the mean can be computed as:

Mean of Score = (94+85+66+55+89+64)/6 = 453/6 = 75.5

Another e.g., time taken to reach office

Day:                  1       2      3      4      5      6      7     8      9     10

time(mins):     45     43   39    39    33    55    35    33   40    198

Mean of time = (45+43+39+39+33+55+35+33+40+198)/10 = 560/10 = 56 minutes

Please note that one extreme value can change the mean dramatically. E.g., on one particular day (day 10) some repair work was going on on the road and there was heavy rain; because of the slow traffic and a diversion, it took 198 minutes.

If we had considered the first 9 days and calculated the mean, it would have been

(45+43+39+39+33+55+35+33+40)/9 = 362/9 = 40.22 minutes. Hence one bad day of traffic increased the mean from 40.22 minutes to 56 minutes.

Median is the middle value in a set of data values for a variable when the data values have been ordered from lowest to highest value.

In the above scores example, the scores can be ordered in increasing value:

55, 64, 66, 85, 89, 94. Since there are 6 scores, there are two middle values here, i.e., 66 and 85. Hence to calculate the median we take the mean of these two middle values.

Median = Mean(66, 85) = 75.5. Coincidentally, the Mean and Median turned out to be the same 🙂

In the time example, the median can be calculated as below:

Median(33, 33, 35, 39, 39, 40, 43, 45, 55, 198) = Mean (39,40) = 39.5 minutes

Here, the impact of the extreme value is small.

Mode is the value in a set of data values for a variable that appears most frequently.

In the above scores example, the scores can be ordered in increasing value: 55, 64, 66, 85, 89, 94. Here the frequency of occurrence of each value is 1, because each score in the example is unique. Hence all values are modes.

In the time example, 33, 33, 35, 39, 39, 40, 43, 45, 55, 198, there are two modes, 33 and 39, as they occur twice (the highest frequency of occurrence).

import numpy as np
from pandas import DataFrame as df

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]
marksdf = df({'marks':marks})

print("Mean of marks = " + str(np.average(marks)))
print("Median of marks = " + str(np.median(marks)))
print("Mode of marks = " + str(marksdf.mode()))

daynums = ['1', '2', '3', '4', '5', '6','7','8','9','10']
timemins = [45,43,39,39,33,55,35,33,40,198]
timeminsdf = df({'timemins':timemins})

print("Mean of time to reach office = " + str(np.average(timemins)))
print("Median of time to reach office = " + str(np.median(timemins)))
print("Mode of time to reach office = " + str(timeminsdf.mode()))

As per Wikipedia, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data.

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The first quartile splits off the lowest 25% of data from the highest 75%. This is known as the lower quartile.

The second quartile (Q2) is the median of the data. This cuts data set in half.

The third quartile (Q3) is the middle value between the median and the highest value of the data set. This is known as upper quartile. This splits off the highest 25% of data from the lowest 75%.

e.g., time taken to reach office

Day:                  1       2      3      4      5      6      7     8      9     10

time(mins):     45     43   39    39    33    55    35    33   40    198

Order Dataset 33, 33, 35, 39, 39, 40, 43, 45, 55, 198

There are an even number of values in the data set. The second quartile (i.e., the median) is 39.5 since there are two midpoints (the 5th and 6th values).

The first quartile is the midpoint between the least value and the median (between 33 and 39). In this example, it is 35.

The third quartile is the midpoint between the median and the highest value (between 40 and 198). In this example, it is 45.

The below code snippet can be added to the code to print the Q1, Q2 and Q3 values:

print("First quartile time to reach office = " + str(np.percentile(timeminsdf, 25)))
print("Second quartile time to reach office = " + str(np.percentile(timeminsdf, 50)))
print("Third quartile time to reach office = " + str(np.percentile(timeminsdf, 75)))

The interquartile range (IQR), also called the midspread or middle 50%, is calculated as IQR = Q3 − Q1.

In the above example, the IQR = 45 – 35 =10

While determining outliers, we use lower fence and upper fence.

Lower Fence = Q1 – 1.5(IQR)

Upper Fence = Q3 + 1.5(IQR)

Lower Fence = 35 – 1.5*10 = 35 – 15 = 20

Upper Fence = 45 + 1.5*10 = 45 + 15 = 60

Hence, any time less than 20 minutes or greater than 60 minutes to reach office can be considered an outlier. In the above example, 198 is the only outlier.
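The fence logic above can be scripted directly; a sketch using numpy's percentile on the same time data (numpy's default quartiles, 36 and 44.5, differ slightly from the hand-calculated 35 and 45, but the outlier found is the same):

```python
import numpy as np

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]
q1, q3 = np.percentile(timemins, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# any value outside the fences is flagged as an outlier
outliers = [t for t in timemins if t < lower_fence or t > upper_fence]
print(outliers)
```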

We can also use describe method of DataFrame

print("Describe method of DataFrame " + str(timeminsdf.describe()))

In my example case, the below output is printed:

Describe method of DataFrame timemins
count 10.000000
mean 56.000000
std 50.318983
min 33.000000
25% 36.000000
50% 39.500000
75% 44.500000
max 198.000000

Please note that there may be different methods to calculate quartile values. Hence, the output may be slightly different from what I calculated manually.
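The difference comes from the interpolation convention used to locate the quartile positions. The standard library's statistics.quantiles makes this visible, since it supports two conventions side by side:

```python
import numpy as np
from statistics import quantiles

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]
# numpy's default linear interpolation (the convention describe() uses)
print(np.percentile(timemins, [25, 50, 75]))
# the same convention, from the standard library
print(quantiles(timemins, n=4, method="inclusive"))
# a different convention yields different Q1 and Q3 on the same data
print(quantiles(timemins, n=4, method="exclusive"))
```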

Analyzing data in Python – Time Series Plot

A time series graph is a graph or plot that illustrates data points at successive intervals of time. It can be drawn using a Python Pandas’ Series.plot method.

e.g., Plot of the closing values of stock market S&P BSE sensex on the y axis vs time on the x axis (starting year 2000 to 2018).

Data is downloaded as a csv file from the site https://www.bseindia.com/indices/IndexArchiveData.aspx

import pandas as pd
from matplotlib import pyplot as plt

# Series.from_csv was removed in pandas 1.0; read_csv + squeeze is the current equivalent
series = pd.read_csv('SENSEX.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
plt.ylabel('Sensex')
series.plot()
plt.show()

[Figure: time series plot of Sensex closing values]

Analyzing data in Python – Histogram

As per the dictionary definition, a histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

To construct a histogram, the first step is to “bin” the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.
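The binning step can be seen in isolation with numpy's histogram function, which returns the counts and the bin edges; a sketch on a small made-up sample of heights:

```python
import numpy as np

# made-up sample of heights (inches) to illustrate binning
values = [61, 64, 65, 65, 67, 68, 68, 69, 70, 74]
counts, edges = np.histogram(values, bins=4)
print(counts)  # number of values falling in each interval
print(edges)   # the interval boundaries
```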

An example is a histogram of heights of people in inches. In the below web page, there is Height and Weight data of 25000 people:

http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html

import matplotlib.pyplot as plt
import pandas
colnames = ['Height', 'Weight']
data = pandas.read_csv('HeightWeight.csv', names=colnames)

heights = data.Height.tolist()
weights = data.Weight.tolist()

# values go on the x axis; the y axis shows how often each bin occurs
plt.hist(heights, bins=10, alpha=0.5)
plt.title("Histogram of Heights of 25000 people")
plt.xlabel("Height (inches)")
plt.ylabel("Frequency")
plt.show()

plt.hist(weights, bins=10, alpha=0.5)
plt.title("Histogram of Weights of 25000 people")
plt.xlabel("Weight (lbs)")
plt.ylabel("Frequency")
plt.show()

The data is saved as csv file. It contains 2 columns – Height in inches and Weight in lbs.

The csv file content looks like below:

[Screenshot: first few rows of the CSV file]

The file is read using pandas read_csv function. This function returns a Dataframe object.

Each column can be converted into a list using the tolist() function. Hence we have two lists – one with Heights (of 25000 people) and another with Weights (of 25000 people).

Now, using the hist function of Matplotlib, we plot two separate histograms for Height and Weight, specifying bins as 10.

[Figure: histogram of Heights]

[Figure: histogram of Weights]

Analyzing data in Python – Pie Charts

As per Wikipedia, a pie chart (or a circle chart) is a circular statistical graphic which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area), is proportional to the quantity it represents.

Below is sample code in Python to draw a pie chart:

import matplotlib.pyplot as plot

genre = ['Comedy', 'Action', 'Drama', 'Romance', 'Science Fiction']
preference = [1000,800,750,550,670]

pref_total = sum(preference)

subjects = ['Maths', 'Physics', 'Chemistry', 'Computers', 'English']
marks = [94,85,66,89,64]
marks_total=sum(marks)

colors = ['violet', 'grey', 'green', 'yellow', 'orange']

plot.pie(preference, labels=genre, colors=colors, autopct='%1.1f%%')
plot.axis('equal')
plot.show()

plot.pie(marks, labels=subjects, colors=colors, autopct=lambda p: '{:.0f}'.format(p * marks_total / 100))
plot.axis('equal')
plot.show()

The output is two pie charts – one with percentages and second with absolute values.

[Figure: pie chart of genre preferences, with percentages]

[Figure: pie chart of marks, with absolute values]

Setting up Python Environment-Anaconda Installation

For a newbie like me, it is difficult to keep upgrading Python and associated packages while resolving package dependencies. This is where Anaconda comes to my rescue. Anaconda is a free Python distribution and package manager. It comes with a lot of pre-installed packages (primarily for data science).

It can be downloaded for Linux from Continuum’s site https://www.continuum.io/downloads#linux . The instructions for installation on Linux are available on the same site. I have downloaded and installed the 64-bit Python 3.6 version on my Linux Mint.

In order to update Anaconda and Python to the latest version, you need to run the below command on the Terminal:

conda update anaconda

[Screenshot: running conda update anaconda on the Terminal]

However, I continue to have older versions of Python. You can see in the below screenshot that Python 3.5.2, which I manually installed, and Python 2.7.12, which was pre-installed on Linux Mint, are still available.

[Screenshot: listing of installed Python versions]

On my Linux Mint, I have already updated to Anaconda version 4.4.0 (the latest available as of this date). This way, it is easy to keep upgrading Python and required packages.

On my PyCharm, I can choose Python 3.6 (installed through Anaconda / conda update) as the project interpreter.

[Screenshot: selecting the Anaconda Python 3.6 interpreter in PyCharm]

Anaconda also comes with Anaconda Navigator – a GUI useful for launching applications, managing packages, learning Python, etc.

To add a shortcut to anaconda-navigator to the desktop, I created the following script (a desktop entry file in the /usr/share/applications folder):

[Desktop Entry]
Version=1.0
Type=Application
Name=Anaconda-Navigator
GenericName=Anaconda
Comment=Scientific PYthon Development EnviRonment - Python3
Exec=bash -c 'export PATH="/home/srinivas/anaconda3/bin:$PATH" && /home/srinivas/anaconda3/bin/anaconda-navigator'
Categories=Development;Science;IDE;Qt;Education;
Icon=/home/srinivas/anaconda3/Anaconda.png
Terminal=false
StartupNotify=true
Name[en_IN]=Anaconda

[Screenshot: Anaconda Navigator]

Spyder is an open source cross platform IDE for scientific programming in Python. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software.

[Screenshot: Spyder IDE]

To conclude, Anaconda is a Python distribution with a lot of useful features and learning opportunities in one place.


Setting up Python Environment-Installing Packages

In order to build useful applications, we need Python libraries or packages. The majority of such useful packages can be downloaded from PyPI, the Python Package Index: https://pypi.python.org/pypi

The best way to install packages is by using a tool called pip. We can get pip from https://pip.pypa.io/en/latest/installing.html. However, on Linux Mint, pip is already installed along with Python 2.7.12. Similarly, when I installed Python 3.5.2, the pip3 tool was installed. To upgrade to the latest pip, you need to run the below command on the terminal:

pip install -U pip

Now let us look at some of the useful packages for analyzing data.

Numpy (http://www.numpy.org/): useful for processing numbers, strings, records, and objects.

pip install numpy

Pandas (http://pandas.pydata.org/): Python Data Analysis Library provides various data analysis tools for Python.

pip install pandas

Matplotlib (https://matplotlib.org/): Matplotlib is a Python 2D plotting library to produce publication quality graphs and figures.

pip install matplotlib

OpenPyXL (https://openpyxl.readthedocs.io/en/default/): Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

pip install openpyxl

Once these packages are installed, they can be imported and used in your application. E.g.,

import numpy
import pandas
import openpyxl
import matplotlib

from pandas import DataFrame
from pandas import *
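A quick way to confirm these packages are importable is to print each one's version; a small sketch (the exact versions printed depend on your installation):

```python
import importlib

# the four packages installed above; report each version, or note if one is missing
for name in ("numpy", "pandas", "matplotlib", "openpyxl"):
    try:
        mod = importlib.import_module(name)
        print(name, mod.__version__)
    except ImportError:
        print(name, "not installed")
```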