Probability distribution functions in Python

As defined in Wikipedia (https://en.wikipedia.org/wiki/Probability_distribution), a probability distribution is a mathematical function that, stated in simple terms, can be thought of as providing the probabilities of occurrence of the different possible outcomes of an experiment.

There are different distributions:

Binomial Distribution
Poisson Distribution
Normal Distribution
Exponential Distribution

Binomial Distribution: 

The binomial distribution is a discrete probability distribution.
It describes the outcomes of n independent trials in an experiment. Each trial has only two possible outcomes, success or failure (e.g., tossing a coin).

Suppose a binomial experiment consists of n trials and results in x successes. If the probability of success on an individual trial is p, then the binomial probability is:

bdist(x, n, p) = nCx * p^x * (1 – p)^(n – x)

or

bdist(x, n, p) = (n! / (x! * (n – x)!)) * p^x * (1 – p)^(n – x)
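
The formula can be checked with a direct translation into Python; `math.comb` gives nCx (the helper name `bdist` below just mirrors the notation above):

```python
from math import comb

def bdist(x, n, p):
    # nCx * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# probability of exactly 3 heads in 10 fair coin tosses
print(bdist(3, 10, 0.5))  # 0.1171875
```

Summing bdist over all x from 0 to n gives 1, as any probability distribution must.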

In an engineering entrance test for chemistry, there are 60 multiple choice questions with 4 choices each and only 1 correct choice. I am very poor in chemistry and want to randomly mark an answer for each question and complete the test.

For each question (trial), the probability of success is 1/4 = 0.25 (4 choices, only 1 of them correct).

Hence the probability of getting 10 correct answers is bdist(10,60,0.25)

bdist(10, 60, 0.25) = 60C10 * 0.25^10 * (1 – 0.25)^(60 – 10)

= (60! / (10! * 50!)) * 0.25^10 * 0.75^50

= 0.04071929

Below are a few lines of Python:

import scipy.stats as ss
pmf = ss.distributions.binom.pmf(10,60,0.25) #10 correct answers, 60 attempted, each probability of correctness 1/4
print(pmf)

Suppose I want to know the probability of getting at least 10 answers correct. Then I need to calculate the cumulative binomial probability, i.e., bdist(x >= 10, 60, 0.25), which is:

bdist(10, 60, 0.25) + bdist(11, 60, 0.25) + bdist(12, 60, 0.25) + … + bdist(60, 60, 0.25)

This is calculated as 0.95483251

Below are a few lines of Python:

import scipy.stats as ss
spmf = 0.0
for x in range(10, 61):  # x = 10 through 60 correct answers, inclusive; probability of correctness 1/4
    spmf += ss.distributions.binom.pmf(x, 60, 0.25)
print(spmf)
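
scipy can also compute this in one call: `binom.sf(k, n, p)` is the survival function P(X > k), so P(X >= 10) is the survival function evaluated at 9 (i.e., one minus the cdf at 9):

```python
import scipy.stats as ss

# P(X >= 10) = P(X > 9) = 1 - P(X <= 9)
spmf = ss.binom.sf(9, 60, 0.25)
print(spmf)  # same value as the loop above
```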

Poisson Distribution: As per Investopedia, a Poisson distribution is a statistical distribution showing the likely number of times that an event will occur within a specified period of time. It is used for independent events which occur at a constant rate within a given interval of time.

Suppose we conduct an experiment several times and the average number (mean) of successes is represented by m. Then the Poisson probability of observing exactly x successes is:

pdist(x, m) = (e^–m * m^x) / x!
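
As a sanity check, this formula translates directly into Python with the standard library (the helper name `pdist` again just mirrors the notation):

```python
from math import exp, factorial

def pdist(x, m):
    # (e^-m * m^x) / x!
    return exp(-m) * m**x / factorial(x)

# probability of exactly 4 events when the mean is 7
print(pdist(4, 7))  # ~0.09122619
```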

I have noticed that on average, I get 7 spam calls a week (banks offering credit cards, advertising, people offering investment services etc.). This happens usually during weekends. Suppose I want to calculate the probability of getting 4 spam calls next week. I can use the Poisson distribution formula: pdist(4, 7) = (e^–7 * 7^4) / 4!, which calculates to 0.09122619.

Also, if I want to calculate the probability of 4 or fewer spam calls, it becomes the cumulative probability of 0, 1, 2, 3 or 4 spam calls. This should be 0.17299160.

Below are a few lines of Python:

import scipy.stats as ss
pmf = ss.distributions.poisson.pmf(4,7) #7 spam calls average a week, calculating probability of 4 spam calls
print(pmf)

spmf =0.0
for x in range(0,5):
 spmf += ss.distributions.poisson.pmf(x,7) #7 spam calls average a week, calculating probability of 4 or fewer spam calls
print(spmf)
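
The loop can also be replaced with scipy's built-in cumulative distribution function:

```python
import scipy.stats as ss

# P(X <= 4) when the weekly average is 7
print(ss.poisson.cdf(4, 7))  # ~0.17299160
```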

Normal Distribution:

Unlike the Binomial and Poisson distributions, this is a continuous distribution (it can take on any value within the range of the random variable, not just integers). Here the probability values are expressed in terms of an area under a curve (bell-shaped and symmetrical) that represents the continuous distribution.

f(x | μ, σ²) = (1 / √(2πσ²)) * e^(–(x – μ)² / (2σ²))

where μ is the mean or expectation of the distribution (and also its median and mode), σ is the standard deviation, and σ² is the variance.

About 68% of the area under the curve falls within 1 standard deviation of the mean. 95% of the area under the curve falls within 2 standard deviations of the mean and 99.7% of the area under the curve falls within 3 standard deviations of the mean.
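
These percentages (often called the 68–95–99.7 rule) can be verified with scipy's cdf for a standard normal (mean 0, standard deviation 1):

```python
import scipy.stats as ss

# area within k standard deviations of the mean
for k in (1, 2, 3):
    area = ss.norm.cdf(k) - ss.norm.cdf(-k)
    print(k, round(area, 4))
```

This prints approximately 0.6827, 0.9545 and 0.9973 for k = 1, 2 and 3.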

Suppose I know that the average score in a chemistry test in a class is 68.5% and that the standard deviation is 10.8%. To find the probability of a score <= 80%, we can use the Normal distribution. This calculates to 0.85652013.

import scipy.stats as ss
cdf = ss.distributions.norm.cdf(80.0, 68.5, 10.8) 
print(cdf)

Please note that we are using the cumulative distribution function (cdf).

Suppose we need the probability of a score between 80 and 90%. We calculate the cdf for <90% and the cdf for <80%, and subtract.

The cdf for <90% calculates to 0.97674530.

probability of score between 80 and 90 % = 0.97674530 – 0.85652013 = 0.12022517

import scipy.stats as ss
cdf80 = ss.distributions.norm.cdf(80.0, 68.5, 10.8) 
cdf90 = ss.distributions.norm.cdf(90.0, 68.5, 10.8) 
print(cdf90-cdf80)
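
The inverse question (what score corresponds to a given cumulative probability?) can be answered with `norm.ppf`, the percent point function, which is the inverse of `cdf`:

```python
import scipy.stats as ss

# score at the 90th percentile, with mean 68.5 and standard deviation 10.8
score = ss.norm.ppf(0.90, 68.5, 10.8)
print(score)  # ~82.34
```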

Exponential distribution: As defined in Wikipedia, the exponential distribution is the probability distribution that describes the time between events in a process in which events occur continuously and independently at a constant average rate. It is often used to model the time elapsed between events.

The probability density function (pdf) of an exponential distribution is

f(x; λ) = λ * e^(–λx) for x >= 0, and 0 for x < 0

The cumulative distribution function is given by

F(x; λ) = 1 – e^(–λx) for x >= 0, and 0 for x < 0

where

  • λ is the rate, i.e., the average number of events per unit time (so 1/λ is the mean time between events)
  • x is the elapsed time

While driving on Bangalore roads, suppose you usually drive over 5 potholes per hour. In order to compute the probability that a pothole will arrive within the next half an hour: 5 potholes per hour means the rate is λ = 5 (one pothole every 1/5 = 0.2 hours on average). We can then compute this as follows:
P(0 <= X <= 0.5) = expcdf(0.5; λ=5) = 1 – e^(–5 × 0.5) = 0.917915

import scipy.stats as ss

# scipy's expon takes the scale parameter (mean time between events = 1/lambda), not the rate;
# passing 0.2 positionally would be treated as a location shift, giving a wrong answer
cdfexp = ss.distributions.expon.cdf(0.5, scale=0.2)
print(cdfexp)
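
The same value falls out of the closed-form cdf with the standard library alone. Note the rate here is λ = 5 potholes per hour (scipy's `expon` instead takes the scale, 1/λ = 0.2):

```python
from math import exp

lam = 5    # potholes per hour (rate)
x = 0.5    # half an hour

print(1 - exp(-lam * x))  # ~0.917915
```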


Analyzing data in Python – Measures of Variation

Variance is a measure of how far a set of numbers is spread out from its mean.

Range is the difference between the largest and smallest data values in a set of values for a variable.

e.g.,

Subject     Score (out of 100)
Maths       94
Physics     85
Chemistry   66
French      55
Computers   89
English     64

In the above example, the range can be computed as: 94 – 55 = 39
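
In Python the range is simply the maximum minus the minimum:

```python
marks = [94, 85, 66, 55, 89, 64]
print(max(marks) - min(marks))  # 39
```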

Another e.g., time taken to reach office

Day:                  1       2      3      4      5      6      7     8      9     10

time(mins):     45     43   39    39    33    55    35    33   40    198

Here the range is 198 -33 = 165

The range is the largest possible difference between any two values in a set of data values for a variable. In the second example, the range is very high, whereas in the first example it is only somewhat high.

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.

Subject     Score   Score – Mean   Square of Difference
Maths       94      18.5           342.25
Physics     85      9.5            90.25
Chemistry   66      -9.5           90.25
French      55      -20.5          420.25
Computers   89      13.5           182.25
English     64      -11.5          132.25
Mean        75.5    Sum: 0         Sum: 1257.5

In the above example, the mean/average of the marks is 75.5, and the sum of the differences between Score and Mean always results in zero. Since that sum cannot be used to compare with another data set, the variance is calculated as the sum of squares of the differences divided by the number of values. Hence the variance is 1257.5/6 = 209.58.
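
The arithmetic above can be reproduced in a few lines of plain Python (population variance, dividing by n):

```python
marks = [94, 85, 66, 55, 89, 64]
mean = sum(marks) / len(marks)
sq_diffs = [(m - mean) ** 2 for m in marks]

print(mean)                          # 75.5
print(sum(m - mean for m in marks))  # 0.0 (always)
print(sum(sq_diffs))                 # 1257.5
print(sum(sq_diffs) / len(marks))    # ~209.58
```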

This way, we can compare the variance of scores for two different students.

Day   Time to office (mins)   Time – Mean   Square of Difference
1     45                      -11           121
2     43                      -13           169
3     39                      -17           289
4     39                      -17           289
5     33                      -23           529
6     55                      -1            1
7     35                      -21           441
8     33                      -23           529
9     40                      -16           256
10    198                     142           20164
Mean  56                      Sum: 0        Sum: 22788

In the second example, the variance is 22788/10 = 2278.8. The above data is sample data for 10 days and may not represent the complete set of data. Hence the sample variance is calculated as the sum of squares of the differences divided by (number of values – 1):

Variance of sample data = 22788/9 = 2532.

Standard Deviation is the positive square root of the variance.

Hence, in the scores example, the standard deviation of scores = sqrt(209.58) = 14.48 (approximately). In the second example, the standard deviation of time to reach office = sqrt(2532) = 50.32 (approximately). What this means is that in most cases, the time taken to reach office lies between the mean +/- standard deviation, i.e., 56 +/- 50.32. Here 198 is an extreme value. If we ignore 198, the mean becomes 40.22 and the standard deviation becomes 6.92. Hence in most cases, the time taken to reach office lies between 40.22 +/- 6.92 minutes, which is indeed the case here.

Z-Score is the difference between a data value for a variable and the mean of the variable, divided by the standard deviation.

Day   Time to office (mins)   Time – Mean   Square of Difference   Z-Score
1     45                      -11           121                    -0.22
2     43                      -13           169                    -0.26
3     39                      -17           289                    -0.34
4     39                      -17           289                    -0.34
5     33                      -23           529                    -0.46
6     55                      -1            1                      -0.02
7     35                      -21           441                    -0.42
8     33                      -23           529                    -0.46
9     40                      -16           256                    -0.32
10    198                     142           20164                  2.82
Mean  56                      Sum: 0        SD: 50.32

Here, you can notice that the absolute value of the Z-Score is low for values close to the mean and high for extreme values.

Generally, if the Z-Score is either less than -3 or greater than +3, it means the value of the variable is extreme. In the above example, 198 minutes to office is close to being an extreme value.
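
A small sketch of flagging values by Z-Score. Note that it uses the population standard deviation (dividing by n), which is also what scipy's zscore does by default, so 198 scores about 2.97 here rather than the 2.82 in the table:

```python
import numpy as np

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]
z = (np.array(timemins) - np.mean(timemins)) / np.std(timemins)

# flag anything beyond 3 standard deviations
extreme = [t for t, zs in zip(timemins, z) if abs(zs) > 3]
print(extreme)  # [] -- 198's z-score is just under the threshold
```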

Below is the Python code to calculate Variance, Standard Deviation and Z-Scores. Here it calculates them treating the data as the entire population, not as a sample.

import numpy as np
from pandas import DataFrame as df
from scipy import stats as st

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]
marksdf = df({'marks':marks})

daynums = ['1', '2', '3', '4', '5', '6','7','8','9','10']
timemins = [45,43,39,39,33,55,35,33,40,198]
timeminsdf = df({'timemins':timemins})

variance_marks = np.var(marks)
print("variance in marks= " +str(variance_marks))

sd_marks = np.std(marks)
print("standard deviation in marks= " +str(sd_marks))

zscore_marks = st.zscore(marksdf)
print("Zscore of marks= " +str(zscore_marks))

mean_time = np.mean(timemins)
print("mean of timemins= " +str(mean_time))

variance_time = np.var(timemins)
print("variance in timemins= " +str(variance_time))

sd_time = np.std(timemins)
print("standard deviation in timemins= " +str(sd_time))

zscore_time = st.zscore(timeminsdf)
print("Zscore of timemins= " +str(zscore_time))

Skewness is a measure of the asymmetry of the distribution of a variable about its mean. Distribution can be either symmetrical, left-skewed (negative skew), or right-skewed (positive skew).

When the mean of a data set equals the median, the distribution is symmetrical. When the mean is less than the median, the distribution is negatively skewed. When the mean is greater than the median, the distribution is said to be positively skewed.

In our scores example, Mean of Score = (94+85+66+55+89+64)/6= 453/6= 75.5

The scores can be ordered in increasing value: 55, 64, 66, 85, 89, 94. Since there are 6 scores, there are two middle values here, i.e., 66 and 85. Hence to calculate the median we take the mean of these two middle values. Median = Mean(66, 85) = 75.5. Here we have a symmetrical data distribution.

In our time taken to reach office example, the mean of time taken to reach office is (45+43+39+39+33+55+35+33+40+198)/10 = 560/10 = 56 minutes. Median(33, 33, 35, 39, 39, 40, 43, 45, 55, 198) = Mean(39, 40) = 39.5 minutes. Since the mean is greater than the median, the distribution is positively skewed.
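
scipy has a skew function that quantifies this; positive output means right (positive) skew:

```python
from scipy.stats import skew

marks = [94, 85, 66, 55, 89, 64]
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

print(skew(marks))     # close to zero: roughly symmetrical
print(skew(timemins))  # large positive: right-skewed by the 198 outlier
```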

In order to understand the distribution correctly, we can draw a box and whisker plot. Here we have five values plotted as vertical lines (the smallest value, the first quartile Q1, the median, the third quartile Q3, and the largest value).

Below is a simple python code to plot a box and whisker plot:

import matplotlib.pyplot as plt

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]

daynums = ['1', '2', '3', '4', '5', '6','7','8','9','10']
timemins = [45,43,39,39,33,55,35,33,40,198]

plt.boxplot(marks)
plt.title("Box and Whisker Plot of Scores",fontsize = 15)
plt.show()

plt.boxplot(timemins)
plt.title("Box and Whisker Plot of Time to reach office ",fontsize = 15)
plt.show()

[Box and whisker plot of scores]

[Box and whisker plot of time to reach office]


Analyzing data in Python – Measures of Central Tendency

In statistics, central tendency can be defined as the tendency of the values of a random variable to cluster around its mean, median, or mode.

Mean, Median and Mode are known as Measures of Central Tendency

Mean is average i.e., sum of the data values for a variable, divided by the number of data values. It signifies balance point in data values.

e.g.,

Subject     Score (out of 100)
Maths       94
Physics     85
Chemistry   66
French      55
Computers   89
English     64

In the above table, the mean can be computed as:

Mean of Score = (94+85+66+55+89+64)/6= 453/6= 75.5

Another e.g., time taken to reach office

Day:                  1       2      3      4      5      6      7     8      9     10

time(mins):     45     43   39    39    33    55    35    33   40    198

Mean of time = (45+43+39+39+33+55+35+33+40+198)/10= 560/10= 56 minutes

Please note that one extreme value can change the mean dramatically. E.g., on one particular day (day 10), some repair work was going on on the road and there was heavy rain; because of slow traffic and a diversion, it took 198 minutes.

If we had considered the first 9 days and calculated the mean, it would have been

(45+43+39+39+33+55+35+33+40)/9 = 362/9 = 40.22 minutes. Hence one bad day of traffic increased the mean from 40.22 minutes to 56 minutes.

Median is the middle value in a set of data values for a variable when the
data values have been ordered from lowest to highest value.

In the above scores example, the scores can be ordered in increasing value:

55, 64, 66, 85, 89, 94. Since there are 6 scores, there are two middle values here, i.e., 66 and 85. Hence to calculate the median we take the mean of these two middle values.

Median = Mean(66, 85) = 75.5. Coincidentally, the Mean and Median turned out to be the same 🙂

In the time example, the median can be calculated as below:

Median(33, 33, 35, 39, 39, 40, 43, 45, 55, 198) = Mean (39,40) = 39.5 minutes

Here, the impact of the extreme value is little.

Mode is the value in a set of data values for a variable that appears most frequently.

In the above scores example, the scores ordered in increasing value are: 55, 64, 66, 85, 89, 94. Here the frequency of occurrence of each value is 1, because each score in the example is unique. Hence every value is a mode.

In the time example, 33, 33, 35, 39, 39, 40, 43, 45, 55, 198, there are two modes, 33 and 39, as they occur twice (the highest frequency of occurrence).

import numpy as np
from pandas import DataFrame as df

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]
marksdf = df({'marks':marks})

print("Mean of marks = " + str(np.average(marks)))
print("Median of marks = " + str(np.median(marks)))
print("Mode of marks = " + str(df.mode(marksdf)))

daynums = ['1', '2', '3', '4', '5', '6','7','8','9','10']
timemins = [45,43,39,39,33,55,35,33,40,198]
timeminsdf = df({'timemins':timemins})

print("Mean of time to reach office = " + str(np.average(timemins)))
print("Median of time to reach office = " + str(np.median(timemins)))
print("Mode of time to reach office = " + str(df.mode(timeminsdf)))

As per Wikipedia, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data.

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The first quartile splits off the lowest 25% of data from the highest 75%. This is known as the lower quartile.

The second quartile (Q2) is the median of the data. This cuts the data set in half.

The third quartile (Q3) is the middle value between the median and the highest value of the data set. This is known as the upper quartile. It splits off the highest 25% of data from the lowest 75%.

e.g., time taken to reach office

Day:                  1       2      3      4      5      6      7     8      9     10

time(mins):     45     43   39    39    33    55    35    33   40    198

Order Dataset 33, 33, 35, 39, 39, 40, 43, 45, 55, 198

There is an even number of values in the data set. The second quartile (i.e., the median) is 39.5, since there are two mid points (the 5th and 6th values).

The first quartile is the mid point between the least value and the median (between 33 and 39). In this example, it is 35.

The third quartile is the mid point between median and highest value (between 40 and 198). In this example, it is 45.

The below code snippet can be added to the code to print the Q1, Q2 and Q3 values:

print("First quartile time to reach office = " + str(np.percentile(timeminsdf, 25)))
print("Second quartile time to reach office = " + str(np.percentile(timeminsdf, 50)))
print("Third quartile time to reach office = " + str(np.percentile(timeminsdf, 75)))

The interquartile range (IQR), also called the midspread or middle 50%, is calculated as IQR = Q3 – Q1.

In the above example, the IQR = 45 – 35 = 10.

While determining outliers, we use lower fence and upper fence.

Lower Fence = Q1 – 1.5(IQR)

Upper Fence = Q3 + 1.5(IQR)

Lower Fence = 35 – 1.5*10 = 35 – 15 = 20

Upper Fence = 45 + 1.5*10 = 45 + 15 = 60

Hence, any time less than 20 minutes or greater than 60 minutes to reach office can be considered an outlier. In the above example, 198 is the only outlier.
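
The fence computation is a one-liner each; a sketch using the hand-computed quartiles from above (numpy's default percentile method gives slightly different quartiles, shifting the fences a little, but 198 is flagged either way):

```python
timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

q1, q3 = 35, 45               # quartiles as computed by hand above
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [t for t in timemins if t < lower_fence or t > upper_fence]
print(lower_fence, upper_fence, outliers)  # 20.0 60.0 [198]
```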

We can also use the describe method of DataFrame:

print("Describe method of DataFrame " + str(timeminsdf.describe()))

In my example case, the below output is printed:

Describe method of DataFrame timemins
count 10.000000
mean 56.000000
std 50.318983
min 33.000000
25% 36.000000
50% 39.500000
75% 44.500000
max 198.000000

Please note that there may be different methods to calculate quartile values. Hence, the output may be slightly different from what I calculated manually.
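
The standard library's statistics.quantiles illustrates the point: the same data gives different quartiles under the 'inclusive' convention (which matches pandas/numpy's default linear interpolation, hence the describe() output above) and the 'exclusive' one:

```python
import statistics

timemins = [45, 43, 39, 39, 33, 55, 35, 33, 40, 198]

# 'inclusive' matches describe() above: Q1 = 36.0, Q2 = 39.5, Q3 = 44.5
print(statistics.quantiles(timemins, n=4, method='inclusive'))
# 'exclusive' is another common convention and gives different cut points
print(statistics.quantiles(timemins, n=4, method='exclusive'))
```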


Analyzing data in Python – Scatter (xy) Plot

A scatter plot (also known as an XY plot) has points that show the relationship between two sets of variables.

e.g., a plot of a person's height vs. weight.

import matplotlib.pyplot as plt
import pandas

heights = []
weights = []

colnames = ['Height', 'Weight']
data = pandas.read_csv('ShortListOfHeightWeight.csv', names=colnames)

heights=data.Height.tolist()
weights=data.Weight.tolist()

plt.scatter(heights, weights)
plt.title('Scatter plot of height and corresponding weight', fontsize=15)
plt.xlabel('height', fontsize=15)
plt.ylabel('weight', fontsize=15)
plt.show()

Here we have data of 250 persons (height in inches and corresponding weight in lbs).

[Scatter plot of height vs weight]

 

Analyzing data in Python – Time Series Plot

A time series graph is a graph or plot that illustrates data points at successive intervals of time. It can be drawn using a Python Pandas’ Series.plot method.

e.g., Plot of the closing values of stock market S&P BSE sensex on the y axis vs time on the x axis (starting year 2000 to 2018).

Data is downloaded as a csv file from the site https://www.bseindia.com/indices/IndexArchiveData.aspx

from pandas import read_csv
from matplotlib import pyplot as plt
# Series.from_csv is deprecated in recent pandas; read_csv plus squeeze gives a Series
series = read_csv('SENSEX.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
plt.ylabel('Sensex')
series.plot()
plt.show()

[Time series plot of Sensex closing values, 2000–2018]

Analyzing data in Python – Histogram

As per the dictionary definition, a histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

To construct a histogram, the first step is to “bin” the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.

An example is Histogram of heights of people in inches. In the below web page, there is a Height and Weight data of 25000 people:

http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html

import matplotlib.pyplot as plt
import pandas
colnames = ['Height', 'Weight']
data = pandas.read_csv('HeightWeight.csv', names=colnames)

heights=data.Height.tolist()
weights=data.Weight.tolist()

plt.hist(heights, bins=10, alpha=0.5)
plt.title("Histogram of Heights of 25000 people")
plt.xlabel("Height (inches)")
plt.ylabel("Frequency")
plt.show()

plt.hist(weights, bins=10, alpha=0.5)
plt.title("Histogram of Weights of 25000 people")
plt.xlabel("Weight (lbs)")
plt.ylabel("Frequency")
plt.show()

The data is saved as csv file. It contains 2 columns – Height in inches and Weight in lbs.

The csv file content looks like below:

[Screenshot of the csv file contents]

The file is read using pandas' read_csv function. This function returns a DataFrame object.

Each column can be converted into a list using tolist() function. Hence we have two lists – one with Heights (of 25000 people) and another with Weights (of  25000 people).

Now, using the hist function of Matplotlib, we plot two separate histograms for Height and Weight, specifying 10 bins.

[Histogram of heights]

[Histogram of weights]

Analyzing data in Python – Pareto Charts

As per Wikipedia, a Pareto chart, named after Vilfredo Pareto, is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.

Below is a simple example of Pareto Chart in Python

from matplotlib import pyplot as plot
import numpy as np

preference = {'Comedy':1500,'Science Fiction':670,'Action':950,'Drama':450,'Romance':50}

# sort preference in descending order of weight
weights, labels = zip(*sorted(((pref,genre) for genre,pref in preference.items()), reverse=True))

# running (cumulative) total of the sorted weights
cumu_weights = np.cumsum(weights)

print(cumu_weights)

# lefthand edge of each bar
left = np.arange(len(weights))
fig, ax = plot.subplots(1, 1)
ax.bar(left, weights, 1)
ax.set_xticks(left)
ax.set_xticklabels(labels,fontsize=10, fontweight='bold', rotation=35, color='darkblue')
ax.plot(left, cumu_weights)
plot.show()

Here we sort the preferences in descending order and draw a bar chart with the preference weight on the y axis. We then take the running total of the weights in that same descending order and plot it as a line graph.

[Pareto chart of movie genre preferences]