Analyzing data in Python – Scatter (xy) Plot

A scatter plot (also known as an XY plot) uses points to show the relationship between two variables.

For example, a plot of people's height vs. weight.

import matplotlib.pyplot as plt
import pandas

# Read the two-column csv file; names supplies the column names
colnames = ['Height', 'Weight']
data = pandas.read_csv('ShortListOfHeightWeight.csv', names=colnames)

# Convert each column to a list
heights = data['Height'].tolist()
weights = data['Weight'].tolist()

plt.scatter(heights, weights)
plt.title('Scatter plot of height and corresponding weight', fontsize=15)
plt.xlabel('height', fontsize=15)
plt.ylabel('weight', fontsize=15)
plt.show()

Here we have data for 250 people (height in inches and the corresponding weight in lbs).




Analyzing data in Python – Time Series Plot

A time series graph is a plot that shows data points at successive intervals of time. It can be drawn using the pandas Series.plot method.

For example, a plot of the closing values of the S&P BSE Sensex stock market index on the y axis vs. time on the x axis (from 2000 to 2018).

Data is downloaded as a csv file from the site

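import pandas
from matplotlib import pyplot as plt
# Series.from_csv was removed in pandas 1.0; read_csv with the first data column selected does the same job
series = pandas.read_csv('SENSEX.csv', header=0, index_col=0, parse_dates=True).iloc[:, 0]
series.plot()
plt.show()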
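Since SENSEX.csv is not included here, the following is a minimal sketch of the same idea using a few rows held in a string. The column names Date and Close and all of the values are made up for illustration; they are not real index data.

```python
import io

import pandas

# Made-up sample rows (NOT real SENSEX values); the column names are assumed.
csv_text = """Date,Close
2018-01-01,100.0
2018-01-02,101.5
2018-01-03,99.8
2018-01-04,102.3
"""

# index_col=0 makes the first column the index; parse_dates=True parses it as dates.
frame = pandas.read_csv(io.StringIO(csv_text), header=0, index_col=0, parse_dates=True)

# Take the single data column as a Series indexed by date.
series = frame['Close']

print(series.index.min(), series.index.max())
print(series.max())
```

With Matplotlib installed, calling series.plot() on this Series would draw the values against the dates.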


Analyzing data in Python – Histogram

As per the dictionary definition, a histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

To construct a histogram, the first step is to “bin” the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.
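The binning step can be sketched in plain Python. This is only an illustration of the idea, not how Matplotlib implements it; the function name bin_counts and the sample heights are made up, and the sketch assumes the values are not all equal.

```python
def bin_counts(values, bins):
    """Split the range of values into `bins` equal-width intervals and count the values in each."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for v in values:
        # Index of the interval that v falls into; the maximum value goes in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up heights in inches, for illustration only.
sample_heights = [61, 63, 64, 66, 66, 67, 68, 68, 69, 70, 72, 73]
edges, counts = bin_counts(sample_heights, bins=4)
print(edges)
print(counts)
```

Each count becomes the height of one rectangle, and the edges mark where each rectangle starts and ends on the x axis.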

An example is a histogram of people's heights in inches. The web page below has height and weight data for 25000 people:

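import matplotlib.pyplot as plt
import pandas

colnames = ['Height', 'Weight']
data = pandas.read_csv('HeightWeight.csv', names=colnames)

# Convert each column of the DataFrame to a list
heights = data['Height'].tolist()
weights = data['Weight'].tolist()

# Histogram of heights
plt.hist(heights, bins=10, alpha=0.5)
plt.title("Histogram of Heights of 25000 people")
plt.xlabel("Height (inches)")
plt.ylabel("Frequency")
plt.show()

# Histogram of weights, drawn as a separate figure
plt.hist(weights, bins=10, alpha=0.5)
plt.title("Histogram of Weights of 25000 people")
plt.xlabel("Weight (lbs)")
plt.ylabel("Frequency")
plt.show()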

The data is saved as a csv file. It contains two columns: height in inches and weight in lbs.

The content of the csv file looks like this:

Screenshot from 2018-02-15 20-55-40

The file is read using the pandas read_csv function, which returns a DataFrame object.

Each column can be converted into a list using the tolist() function. Hence we have two lists: one with the heights (of 25000 people) and another with the weights (of 25000 people).

Now, using the hist function of Matplotlib, we plot two separate histograms for height and weight, specifying 10 bins.
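The read-and-convert step can be tried without the full data file. Below is a minimal sketch that reads three made-up rows from a string instead of HeightWeight.csv; the values are for illustration only.

```python
import io

import pandas

# Three made-up rows standing in for HeightWeight.csv (no header row).
csv_text = "65.8,113.0\n71.5,136.5\n69.4,153.0\n"

colnames = ['Height', 'Weight']
data = pandas.read_csv(io.StringIO(csv_text), names=colnames)

# Each DataFrame column is a Series; tolist() turns it into a plain Python list.
heights = data['Height'].tolist()
weights = data['Weight'].tolist()

print(type(data).__name__)
print(heights)
print(weights)
```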



Analyzing data in Python – Pareto Charts

As per Wikipedia, a Pareto chart, named after Vilfredo Pareto, is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.

Below is a simple example of a Pareto chart in Python:

from matplotlib import pyplot as plot
import numpy as np
preference = {'Comedy': 1500, 'Science Fiction': 670, 'Action': 950, 'Drama': 450, 'Romance': 50}

# sort preference in descending order
weights, labels = zip(*sorted(((pref, genre) for genre, pref in preference.items()), reverse=True))

# running (cumulative) total of the sorted weights
cumu_weights = np.cumsum(weights)

# x position of each bar
left = np.arange(len(weights))
fig, ax = plot.subplots(1, 1)
ax.bar(left, weights, 1)
# cumulative total drawn as a line over the bars
ax.plot(left, cumu_weights, color='red', marker='o')
ax.set_xticks(left)
ax.set_xticklabels(labels, fontsize=10, fontweight='bold', rotation=35, color='darkblue')
plot.show()

Here we sort the preferences in descending order and draw a bar chart with the weight of each preference on the y axis. We also take the cumulative totals of these weights, in the same descending order, and plot them as a line graph.
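The sorting and cumulative steps can also be checked on their own, without drawing anything. The sketch below uses itertools.accumulate for the running total. The cumulative percentage is an addition that is not in the code above; it is included only because Pareto charts often show the line on a percentage scale.

```python
from itertools import accumulate

preference = {'Comedy': 1500, 'Science Fiction': 670, 'Action': 950, 'Drama': 450, 'Romance': 50}

# Sort genres by preference, largest first.
ranked = sorted(preference.items(), key=lambda item: item[1], reverse=True)
labels = [genre for genre, _ in ranked]
weights = [pref for _, pref in ranked]

# Running total of the sorted weights, and the same as a percentage of the grand total.
cumu_weights = list(accumulate(weights))
total = cumu_weights[-1]
cumu_percent = [100 * c / total for c in cumu_weights]

for label, weight, cumu in zip(labels, weights, cumu_weights):
    print(label, weight, cumu)
```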



Analyzing data in Python – Pie Charts

As per Wikipedia, a pie chart (or a circle chart) is a circular statistical graphic which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area), is proportional to the quantity it represents.

Below is sample code in Python to draw pie charts:

import matplotlib.pyplot as plot

genre = ['Comedy', 'Action', 'Drama', 'Romance', 'Science Fiction']
preference = [1000,800,750,550,670]

subjects = ['Maths', 'Physics', 'Chemistry', 'Computers', 'English']
marks = [94,85,66,89,64]
marks_total = sum(marks)

colors = ['violet', 'grey', 'green', 'yellow', 'orange']

# First pie chart: each slice is labelled with its percentage
plot.pie(preference, labels=genre, colors=colors, autopct='%1.1f%%')
plot.show()

# Second pie chart: each percentage is converted back to the absolute mark
plot.pie(marks, labels=subjects, colors=colors, autopct=lambda p: '{:.0f}'.format(p * marks_total / 100))
plot.show()

The output is two pie charts: one labelled with percentages and the second with absolute values.
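The absolute values work because a function passed as autopct receives each slice's percentage of the total, so multiplying by the total and dividing by 100 gives back the original number. A small sketch of that arithmetic, with no plotting involved (absolute_label is a hypothetical helper name):

```python
marks = [94, 85, 66, 89, 64]
marks_total = sum(marks)


def absolute_label(percent, total=marks_total):
    """Turn a slice's percentage back into its absolute value, formatted without decimals."""
    return '{:.0f}'.format(percent * total / 100)


# The percentage each slice would get, followed by the label recovered from it.
percentages = [100 * m / marks_total for m in marks]
labels = [absolute_label(p) for p in percentages]
print(labels)
```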




Analyzing data in Python – Bar Charts

There are many good books on Data Analytics. Recently I borrowed a book from the office library titled “Even You Can Learn Statistics and Analytics: An Easy to Understand Guide to Statistics and Analytics”, authored by David M. Levine and David F. Stephan. I feel this is a good one for beginners in Data Analytics.

There are also other good books, such as ‘Python for Data Analysis’, ‘Python: Data Analytics and Visualization’ and ‘Python for Finance’.

One important aspect of presenting data is showing it in graph format (a visual format, also known as Data Visualization).

A bar chart is useful for presenting categorical data. It has rectangular bars whose length is proportional to the values we want to present.

E.g., we want to represent the marks of a student in several subjects:

Subject      Score out of 100
Maths        94
Physics      85
Chemistry    66
French       55
Computers    89
English      64


These marks can be drawn as a vertical bar graph with a small piece of Python code:

import matplotlib.pyplot as plot
import numpy as np
subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]

# x positions of the bars: 0, 1, ..., 5
m = np.arange(len(subjects))
plot.bar(m, marks)
plot.xticks(m, subjects)
plot.title('Marks obtained out of 100')
plot.show()

Matplotlib, a Python 2D plotting library, and NumPy, the fundamental package for scientific computing with Python, are two important packages in Python.

Here the subjects and scores (marks) are held in Python lists. len(subjects) returns the length of subjects, in this case 6.

numpy.arange is used to generate evenly spaced positions (0 to 5) so that the subjects appear on the graph in order. On the x-axis we have the subject names (xticks) and on the y-axis we have the marks. plot.bar(m, marks) plots the vertical bar chart.
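To see what numpy.arange actually returns here, the positions can be printed next to the subjects and marks they belong to. This is just a quick sketch for inspection; it draws nothing.

```python
import numpy as np

subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94, 85, 66, 55, 89, 64]

# One integer position per subject: 0, 1, ..., len(subjects) - 1.
m = np.arange(len(subjects))
print(m)

# Each bar is drawn at position m[i] with height marks[i] and tick label subjects[i].
for position, subject, mark in zip(m, subjects, marks):
    print(position, subject, mark)
```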

We can also define other parameters for the graph, such as color, fontsize, fontweight and rotation:

plot.bar(m, marks, color='indigo')
plot.xlabel('Subject',fontsize=15, fontweight='bold', color='blue')
plot.ylabel('Marks',fontsize=15, fontweight='bold', color='blue')
plot.xticks(m, subjects, fontsize=10, fontweight='bold', rotation=35, color='blue')
plot.title('Marks obtained out of 100',fontsize=15, fontweight='bold', color='blue')


The same data can be represented as a horizontal bar graph. Instead of the bar function, we use the barh function.

import matplotlib.pyplot as plot
import numpy as np
subjects = ['Maths', 'Physics', 'Chemistry', 'French', 'Computers', 'English']
marks = [94,85,66,55,89,64]

m = np.arange(len(subjects))
plot.barh(m, marks,color='indigo')
plot.ylabel('Subject',fontsize=15, fontweight='bold', color='blue')
plot.xlabel('Marks',fontsize=15, fontweight='bold', color='blue')
plot.yticks(m, subjects, fontsize=10, fontweight='bold', rotation=35, color='blue')
plot.title('Marks obtained out of 100',fontsize=15, fontweight='bold', color='blue')


In order to add the data values to the graph, we need to call plot.text once for each bar:

for i, value in enumerate(marks):
    plot.text(value, i, str(value), color='indigo', fontweight='bold')


I am using the Spyder IDE (from Anaconda Navigator) to run this code. It has a handy feature: a variable explorer that shows the details of the variables used in the code.


Setting up Python Environment-Anaconda Installation

For a newbie like me, it is difficult to keep upgrading Python and its associated packages while resolving package dependencies. This is where Anaconda comes to my rescue. Anaconda is a free Python distribution and package manager. It comes with a lot of pre-installed packages (primarily for data science).

It can be downloaded for Linux from Continuum’s site. The instructions for installation on Linux are available on the same site. I have downloaded and installed the 64-bit Python 3.6 version on my Linux Mint.

In order to update Anaconda and Python to the latest version, you need to run the command below in the terminal.

Screenshot from 2017-06-06 20-39-16

However, I still have older versions of Python. As you can see in the screenshot below, Python 3.5.2, which I installed manually, and Python 2.7.12, which came pre-installed on Linux Mint, are still available.

Screenshot from 2017-06-06 20-47-40

conda update anaconda


On my Linux Mint, I have already updated to Anaconda version 4.4.0 (the latest available at the time of writing). This way, it is easy to keep upgrading Python and the required packages.

In PyCharm, I can choose Python 3.6 (installed through Anaconda / conda update) as the project interpreter.
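With several Python versions installed side by side, it can help to confirm which interpreter a script is really running under. A minimal sketch using only the standard library (interpreter_info is a hypothetical helper name):

```python
import sys


def interpreter_info():
    """Return the path and version of the Python interpreter running this code."""
    version = '{}.{}.{}'.format(*sys.version_info[:3])
    return {'executable': sys.executable, 'version': version}


info = interpreter_info()
print(info['executable'])
print(info['version'])
```

If the project interpreter is set to the Anaconda one, the printed path should normally point inside the Anaconda installation folder.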


Anaconda also comes with Anaconda Navigator, a GUI that is useful for launching applications, managing packages, learning Python, etc.

To add a shortcut to anaconda-navigator on the desktop, I created the following script (a desktop entry file in the /usr/share/applications folder):

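[Desktop Entry]
# Type and Name are required keys in a desktop entry; the values below are assumed
Type=Application
Name=Anaconda Navigator
Comment=Scientific PYthon Development EnviRonment - Python3
Exec=bash -c 'export PATH="/home/srinivas/anaconda3/bin:$PATH" && /home/srinivas/anaconda3/bin/anaconda-navigator'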


Spyder is an open source cross platform IDE for scientific programming in Python. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software.


To conclude, Anaconda is a Python distribution that brings a lot of useful features and learning opportunities together in one place.