Analysis with Python [Pandas, Matplotlib] on Data.gov.in portal data

October 13, 2019 | ⌛ 5 Min

Python is increasingly being used as a scientific language. Matrix and vector manipulations are extremely important for scientific computations.

In data science, NumPy (Numerical Python), Pandas and Matplotlib have emerged as essential Python libraries for scientific computation, including machine learning, thanks to their intuitive syntax and high-performance matrix computation capabilities.

What is NumPy?

NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open-source Python module that provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential part of machine learning, NumPy, together with libraries such as Scikit-learn, Pandas, Matplotlib and TensorFlow, forms the core of the Python machine learning ecosystem.
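As a rough illustration of what "fast computation on arrays and matrices" looks like in practice, here is a minimal sketch. The matrix and vector values are made up for the example and have nothing to do with the NFHS data.

```python
import numpy as np

# Made-up 2x2 matrix and length-2 vector, for illustration only
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([10.0, 20.0])

# Matrix-vector product, computed without an explicit Python loop
product = matrix @ vector

# Element-wise arithmetic applies to every entry at once
scaled = matrix * 2

print(product)
print(scaled)
```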

What is Pandas?

Like NumPy, Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Whereas NumPy provides objects for multi-dimensional arrays, Pandas provides an in-memory 2D table object called a DataFrame. It is like a spreadsheet with column names and row labels.
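The spreadsheet comparison can be made concrete with a tiny DataFrame. This is only a sketch: the column names, row labels and values below are hypothetical and are not taken from the NFHS file.

```python
import pandas as pd

# Made-up values, for illustration only (not taken from the NFHS file)
df = pd.DataFrame(
    {"state": ["A", "B", "C"], "literacy": [70.5, 82.0, 64.3]},
    index=["r1", "r2", "r3"],
)

# Select a whole column by its name
literacy = df["literacy"]

# Select a single cell by row label and column name
cell = df.loc["r2", "literacy"]

print(df.shape)
print(cell)
```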

What is Matplotlib?

Matplotlib is a 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments. It can be used in Python scripts, the Python and IPython shells, Jupyter Notebook, web application servers and GUI toolkits.
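A minimal sketch of a Matplotlib figure saved to a file is shown here. The data points and the output file name `example_plot.png` are assumptions chosen for the example, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up points, for illustration only
x_values = [1, 2, 3, 4]
y_values = [1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x_values, y_values, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Example line plot")

# Write the figure to a PNG file (hypothetical file name)
fig.savefig("example_plot.png")
```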

I worked through a small project using Pandas and Matplotlib.

The source file (CSV) comes from Data.gov.in: the "All India level and State-wise Key Indicators of NFHS-3 and NFHS-4" data set.

Link to NFHS CSV file

Step 1: Reading data from CSV file using pandas

# Load the NFHS dataset from a CSV file and keep the columns needed for the analysis

import pandas as pd
import matplotlib.pyplot as plt

# Reading CSV file to Data Frame

newdf = pd.read_csv('NFHS-data.csv')

# First five columns of the file

selected_cols = list(newdf.columns)[0:5]

# Validation of data to numeric format
# Only the indicator columns at positions 3 and 4 are converted; this assumes
# the leading columns hold text labels (such as state and area) that must stay as strings

for k in selected_cols[3:5]:
    newdf[k] = pd.to_numeric(newdf[k], errors='coerce')

# Replacing NaN values with 0

newdf.fillna(0, inplace=True)

newdf = newdf[selected_cols]

# print col3 and col4 names

x = newdf.columns[3]
y = newdf.columns[4]

print("column no 3 name --->", x)
print("column no 4 name --->", y)

df1 = newdf.copy()

We have now cleaned the data and extracted the required columns from the larger dataset.
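To see what the numeric validation does to individual values, here is a small sketch on a made-up series. The strings "NA" and "*" are hypothetical stand-ins for non-numeric entries; the actual file may use different markers.

```python
import pandas as pd

# Made-up values: "NA" and "*" stand in for non-numeric entries
raw = pd.Series(["45.2", "NA", "61", "*"])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# fillna(0) then replaces those NaN entries with 0
cleaned = numeric.fillna(0)

print(cleaned.tolist())
```

Replacing missing values with 0 keeps the column numeric, but the zeros would pull down any average, which is one reason to filter out non-positive values before aggregating.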

Step 2: Renaming the columns, Filtering data

We can create a new data frame and select the required columns to work with.

# renaming columns with rename()

df1.rename(columns={x: "Femaleage6attendeschool", y: "Householdbelowage15years"}, inplace=True)

# Filter rows based on condition

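df_filtered = df1[df1['Femaleage6attendeschool'] > 0]

# Apply the second condition to the already filtered rows so the first filter is kept

df_filtered = df_filtered[df_filtered['Area'] != 'Total']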
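Row filters like these can also be combined into a single boolean mask with `&`. The sketch below uses a made-up frame that borrows only the column names `Area` and `Femaleage6attendeschool` from the steps above; the values are hypothetical.

```python
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    "Area": ["Total", "Urban", "Rural", "Urban"],
    "Femaleage6attendeschool": [68.0, 80.0, 0.0, 75.5],
})

# Each comparison yields a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds tighter than > and !=.
mask = (df["Femaleage6attendeschool"] > 0) & (df["Area"] != "Total")
filtered = df[mask]

print(filtered)
```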

Step 3: Generating the Bar Chart

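# plotting the mean of Female-age-6-years-and-above-attended-school for each state

# creating a new dataframe with the first two label columns and the school-attendance column

df_plot = df_filtered.iloc[:, [0, 1, 3]]

# mean value per state

state_means = df_plot.groupby("India/States/UTs")['Femaleage6attendeschool'].mean()

# drawing the bar chart

state_means.plot(kind='bar')
plt.show()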
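The group-then-plot step can be tried without the CSV file. The following sketch uses a made-up frame with the same column names as above; the state names, values and the output file name `state_means.png` are hypothetical, and the `Agg` backend is chosen only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows, for illustration only (not taken from the NFHS file)
df = pd.DataFrame({
    "India/States/UTs": ["State A", "State A", "State B", "State B"],
    "Area": ["Urban", "Rural", "Urban", "Rural"],
    "Femaleage6attendeschool": [80.0, 60.0, 90.0, 70.0],
})

# One mean value per state, averaged over its Urban and Rural rows
state_means = df.groupby("India/States/UTs")["Femaleage6attendeschool"].mean()

# Series.plot(kind="bar") draws one bar per index entry and returns the Axes
ax = state_means.plot(kind="bar")
ax.set_ylabel("Mean Femaleage6attendeschool")
plt.tight_layout()
plt.savefig("state_means.png")
```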

Step 4: Final Image Generated

Venkat Chiluvuri
