Analyzing data in Python – Time Series Plot

A time series graph is a plot that shows data points at successive intervals of time. It can be drawn using the pandas Series.plot method.

For example, the plot below shows the closing values of the S&P BSE Sensex stock market index on the y axis against time on the x axis (from 2000 to 2018).

The data is downloaded as a CSV file from https://www.bseindia.com/indices/IndexArchiveData.aspx

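import pandas as pd
from matplotlib import pyplot as plt

# Series.from_csv was removed in pandas 1.0; read_csv replaces it.
# As with from_csv, the first column is used as a date index.
data = pd.read_csv('SENSEX.csv', header=0, index_col=0, parse_dates=True)
# Take the first data column as the series to plot.
series = data.iloc[:, 0]
series.plot()
plt.ylabel('Sensex')
plt.show()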

TimeSeries1


Analyzing data in Python – Histogram

As per the dictionary definition, a histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

To construct a histogram, the first step is to “bin” the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.
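The binning step described above can be sketched in plain Python. This is only an illustration of the idea, not how Matplotlib implements it: the function name bin_counts and the sample heights are made up for this example, and the sketch assumes equal-width bins and values that are not all identical.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    # Assumes high > low, otherwise the bin width would be zero.
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The largest value goes into the last bin rather than past the end.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    # bins intervals are delimited by bins + 1 edges.
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

# Made-up heights in inches, for illustration only.
sample_heights = [60, 62, 63, 65, 66, 66, 67, 68, 70, 72]
edges, counts = bin_counts(sample_heights, bins=4)
print(edges)
print(counts)
```

Each count corresponds to the height of one rectangle in the histogram, and each pair of neighbouring edges gives the interval that the rectangle covers.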

An example is a histogram of people's heights in inches. The web page below provides height and weight data for 25000 people:

http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html

import matplotlib.pyplot as plt
import pandas
colnames = ['Height', 'Weight']
data = pandas.read_csv('HeightWeight.csv', names=colnames)

heights=data.Height.tolist()
weights=data.Weight.tolist()

plt.hist(heights, bins=10, alpha=0.5)
plt.title("Histogram of Heights of 25000 people")
plt.xlabel("Height")
plt.ylabel("Frequency of Heights")
plt.show()

plt.hist(weights, bins=10, alpha=0.5)
plt.title("Histogram of Weights of 25000 people")
plt.xlabel("Weights")
plt.ylabel("Frequency of Weights")
plt.show()

The data is saved as a CSV file. It contains 2 columns – Height in inches and Weight in lbs.

The CSV file content looks like this:

Screenshot from 2018-02-15 20-55-40

The file is read using the pandas read_csv function, which returns a DataFrame object.

Each column can be converted into a list using the tolist() method. This gives two lists – one with the heights of the 25000 people and another with their weights.

Using the hist function of Matplotlib, we then plot two separate histograms, one for Height and one for Weight, each with the number of bins set to 10.

Histogram1

Histogram2

Analyzing data in Python – Pie Charts

As per Wikipedia, a pie chart (or a circle chart) is a circular statistical graphic which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area), is proportional to the quantity it represents.

Below is sample Python code to draw pie charts:

import matplotlib.pyplot as plot

genre = ['Comedy', 'Action', 'Drama', 'Romance', 'Science Fiction']
preference = [1000,800,750,550,670]

pref_total = sum(preference)

subjects = ['Maths', 'Physics', 'Chemistry', 'Computers', 'English']
marks = [94,85,66,89,64]
marks_total=sum(marks)

colors = ['violet', 'grey', 'green', 'yellow', 'orange']

plot.pie(preference, labels=genre, colors=colors, autopct='%1.1f%%')
plot.axis('equal')
plot.show()

plot.pie(marks, labels=subjects, colors=colors, autopct=lambda p: '{:.0f}'.format(p * marks_total / 100))
plot.axis('equal')
plot.show()

The output is two pie charts – the first with percentages and the second with absolute values.
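The arithmetic behind the two charts can be checked without drawing anything. The sketch below reuses the marks list from the code above; the helper names percent_of_total and absolute_from_percent are made up for this illustration. It shows how each value maps to a percentage and a central angle, and how the lambda passed to autopct turns a percentage back into the original value.

```python
marks = [94, 85, 66, 89, 64]
marks_total = sum(marks)

def percent_of_total(value, total):
    """Share of the whole pie taken by one slice, in percent."""
    return 100.0 * value / total

def absolute_from_percent(percent, total):
    """Same conversion as the autopct lambda: percentage back to the value."""
    return '{:.0f}'.format(percent * total / 100)

angles = []
for mark in marks:
    pct = percent_of_total(mark, marks_total)
    # A full circle is 360 degrees, so each slice gets its share of it.
    angle = 360.0 * mark / marks_total
    angles.append(angle)
    print(mark, round(pct, 1), round(angle, 1),
          absolute_from_percent(pct, marks_total))
```

Because the percentage is rounded with '{:.0f}' after being scaled back, small floating point errors do not show up in the labels.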

Pie1

Pie2

 

Setting up Python Environment-Anaconda Installation

For a newbie like me, it is difficult to keep upgrading Python and its associated packages while resolving package dependencies. This is where Anaconda comes to my rescue. Anaconda is a free Python distribution and package manager. It comes with a lot of pre-installed packages (primarily for data science).

It can be downloaded for Linux from Continuum's site https://www.continuum.io/downloads#linux . The installation instructions for Linux are available on the same site. I downloaded and installed the 64-bit Python 3.6 version on my Linux Mint.

To update Anaconda and Python to the latest version, run the command below in the terminal.

Screenshot from 2017-06-06 20-39-16

However, the older versions of Python remain on my system. As the screenshot below shows, Python 3.5.2, which I installed manually, and Python 2.7.12, which came pre-installed on Linux Mint, are still available.

Screenshot from 2017-06-06 20-47-40

conda update anaconda

Screenshot from 2017-06-06 20-50-45.png

On my Linux Mint, I have already updated to Anaconda version 4.4.0 (the latest available at the time of writing). This way, it is easy to keep upgrading Python and the required packages.

In PyCharm, I can choose Python 3.6 (installed through Anaconda / conda update) as the project interpreter.

Screenshot from 2017-06-06 20-57-17.png
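With several Python versions installed side by side, it helps to confirm which interpreter a script is actually running under. A minimal sketch using only the standard library follows; the function name describe_interpreter is made up for this example.

```python
import sys

def describe_interpreter():
    """Return the running Python version and the path of its executable."""
    version = '.'.join(str(part) for part in sys.version_info[:3])
    return version, sys.executable

version, path = describe_interpreter()
print('Python', version, 'at', path)
```

When run from a project that uses the Anaconda interpreter, the printed path would be expected to point inside the anaconda3 folder.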

Anaconda also comes with Anaconda Navigator – a GUI for launching applications, managing packages, learning Python and so on.

To add a desktop shortcut for anaconda-navigator, I created the following desktop entry file in the /usr/share/applications folder:

[Desktop Entry]
Version=1.0
Type=Application
Name=Anaconda-Navigator
GenericName=Anaconda
Comment=Scientific PYthon Development EnviRonment - Python3
Exec=bash -c 'export PATH="/home/srinivas/anaconda3/bin:$PATH" && /home/srinivas/anaconda3/bin/anaconda-navigator'
Categories=Development;Science;IDE;Qt;Education;
Icon=/home/srinivas/anaconda3/Anaconda.png
Terminal=false
StartupNotify=true
Name[en_IN]=Anaconda

Screenshot from 2017-06-06 21-08-23

Spyder is an open source cross platform IDE for scientific programming in Python. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software.

Screenshot from 2017-06-06 21-10-01

To conclude, Anaconda is a Python distribution that brings a lot of useful features and learning opportunities together in one place.

 

 

Setting up Python Environment-Installing Packages

In order to build useful applications, we need Python libraries or packages. The majority of such packages can be downloaded from PyPI, the Python Package Index https://pypi.python.org/pypi

The best way to install packages is with a tool called pip. We can get pip from https://pip.pypa.io/en/latest/installing.html. However, on Linux Mint, pip is already installed along with Python 2.7.12. Similarly, when I installed Python 3.5.2, the pip3 tool was installed with it. To upgrade to the latest pip, run the command below in the terminal:

pip install -U pip

Now let us look at some of the useful packages for analyzing data.

NumPy (http://www.numpy.org/): Useful for processing numbers, strings, records, and objects.

pip install numpy

Pandas (http://pandas.pydata.org/): Python Data Analysis Library provides various data analysis tools for Python.

pip install pandas

Matplotlib (https://matplotlib.org/): Matplotlib is a Python 2D plotting library to produce publication quality graphs and figures.

pip install matplotlib

OpenPyXL (https://openpyxl.readthedocs.io/en/default/): Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

pip install openpyxl

Once these packages are installed, they can be imported and used in your application. E.g.,

import numpy
import pandas
import openpyxl
import matplotlib

from pandas import DataFrame
from pandas import *
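Before importing, it can be useful to check which of these packages are actually available in the current environment. The sketch below uses importlib.util.find_spec from the standard library; the helper name missing_packages is made up for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]

# The packages installed with pip above.
required = ['numpy', 'pandas', 'openpyxl', 'matplotlib']
print('Missing:', missing_packages(required))
```

An empty list means every package was found; any name that is printed still needs a pip install.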

 

 

Setting up Python Environment-SQLite

For applications involving data storage and usage, we need a Database. SQLite is a simple yet very useful SQL database engine. It can be downloaded from the website – https://www.sqlite.org.

There is a very nice description of when to consider using SQLite database and when to consider client server databases like MySQL and PostgreSQL here – https://www.sqlite.org/whentouse.html

I installed SQLite on my Linux Mint using the below command on terminal:

sudo apt-get update
sudo apt-get install sqlite

To create databases, tables, views and so on, you may use a client for SQLite called SQLiteStudio, available from https://sqlitestudio.pl/index.rvt

All you need to do is download, unpack and run the app.

As you can see from the screenshots below, SQLite offers many good features such as constraints, indexes, triggers and views.

Data can be inserted using the user interface.

Screenshot from 2017-05-20 20-28-43

Screenshot from 2017-05-20 20-31-17

Screenshot from 2017-05-20 20-33-46

Screenshot from 2017-05-20 20-40-49.png
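The same kind of work can also be done from Python, which ships with the sqlite3 module in its standard library. The sketch below is only an illustration: the person table, its columns and the two rows are made up, and an in-memory database is used so that no file is created.

```python
import sqlite3

# ':memory:' creates a temporary database that disappears on close.
conn = sqlite3.connect(':memory:')

# A table with a primary key and a NOT NULL constraint, plus an index.
conn.execute(
    'CREATE TABLE person ('
    'id INTEGER PRIMARY KEY, name TEXT NOT NULL, height REAL)'
)
conn.execute('CREATE INDEX idx_person_name ON person (name)')

# Parameterised inserts keep values separate from the SQL text.
conn.executemany(
    'INSERT INTO person (name, height) VALUES (?, ?)',
    [('Asha', 65.5), ('Ravi', 70.1)],
)
conn.commit()

rows = conn.execute(
    'SELECT name, height FROM person ORDER BY name'
).fetchall()
print(rows)
conn.close()
```

Replacing ':memory:' with a file name would store the data on disk, where a client such as SQLiteStudio could then open it.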

How I got bitten by Python programming

Many years ago, I used to be a Java programmer. In fact, I started my information technology career in 1999 as a software developer in a small company which focussed on application software development. During my engineering days, I learnt Fortran and C programming. When I completed my engineering, there was the Y2K problem (https://en.wikipedia.org/wiki/Year_2000_problem), which helped many job aspirants to jump into the IT industry irrespective of their educational background.

At the same time, Java was one of the bleeding edge technologies. There was a saying – ‘To get an IT job, all you need to know is the spelling of Java’.

After a few years of programming (mainly in Java, web development, SQL and database design), like many others, I moved on to project management. With more focus on day to day operations, I gradually lost my hold on coding, but not the zeal.

Several years later, in the current digital world, data analytics caught my attention. I am interested in learning data analytics and visualization. For the past few years, I have been using the Linux Mint Cinnamon OS more frequently on my personal laptop as it is free and open source (FOSS). I was fascinated by the Cinnamon desktop environment. The website https://en.wikipedia.org/wiki/Linux_Mint claims that most of Linux Mint is developed in the Python language – https://www.python.org/. I was aware that the majority of Unix/Linux development happens in C, so I was surprised when I saw Python. This was my first encounter with Python, and this is when I started gathering my understanding of it from the internet.

Why learn Python?

  • It is free and open source (FOSS)
  • Already available in several Linux distributions
  • Easy to learn for beginners (minimal coding is required)
  • One of the languages widely used for Data Analytics
  • Popular (http://www.tiobe.com/tiobe-index/), with good community support
  • Availability of code libraries / packages
    • Many Web development frameworks – Django, Bottle, Flask etc
    • Scientific and numeric computing – Numpy, Matplotlib, Pandas etc
    • Rich GUI development – pyQt, wxPython

Python 2.x or 3.x?

Several books and websites debate whether to use Python 2 or Python 3. I noticed that Python 2.7 was installed by default on Linux Mint 18 (Sarah). When I started learning, I felt that going forward the focus would be on developing Python 3.x, as it is the present and the future. Hence I started with the Python 3.5 interpreter. Fortunately, Python 3.5 is also pre-installed on the latest Linux Mint 18.1 (Serena).

My favourite books for learning Python / Data Analytics

There are several online books and tutorials available. One of my favourites is Tutorials Point – https://www.tutorialspoint.com/python/

I follow the Google plus Python community frequently – https://plus.google.com/u/0/communities/103393744324769547228

Also, Stackoverflow (http://stackoverflow.com/questions/tagged/python) comes to my rescue whenever I encounter some hurdles.

Screenshot from 2016-12-31 20-44-32.png

Disclaimer: The opinions and experiences listed on this site are my own. In some cases, my understanding could be incorrect as I am a beginner to intermediate programmer. Please point out any correction that is required so that I can consider editing the blog.