Pandas

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures.

This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will read the data from that CSV into a DataFrame (a table, basically), then let you do things like:

  • Calculate statistics and answer questions about the data, like
    • What's the average, median, max, or min of each column?
    • Does column A correlate with column B?
    • What does the distribution of data in column C look like?
  • Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
  • Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
  • Store the cleaned, transformed data back into a CSV, another file format, or a database
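
As a rough illustration of the workflow in the list above, here is a minimal sketch. The CSV contents and column names (name, height_cm, weight_kg) are made up for the example, and the data is read from an in-memory string rather than a file so the snippet runs on its own. Plotting is left out.

```python
import io
import pandas as pd

# Hypothetical CSV contents; in practice you would pass a file path to pd.read_csv.
csv_text = """name,height_cm,weight_kg
Ann,160,55
Bob,175,
Cid,182,80
Dee,168,62
"""

df = pd.read_csv(io.StringIO(csv_text))

# Statistics: mean of each numeric column (missing values are skipped by default).
column_means = df[["height_cm", "weight_kg"]].mean()

# Correlation between two columns.
corr = df["height_cm"].corr(df["weight_kg"])

# Cleaning: drop rows that contain missing values.
cleaned = df.dropna()

# Store the cleaned data as CSV text (pass a path instead to write a file).
cleaned_csv = cleaned.to_csv(index=False)
```

Here Bob's row has no weight, so dropna() removes it and cleaned keeps three rows.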

Import Pandas

import pandas as pd

Core components of pandas: Series and DataFrames

The two primary components of pandas are the Series and the DataFrame.

A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.

[Figure: Series and DataFrame]

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
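
A small sketch of that overlap, using made-up values: fillna and mean are called the same way on a Series and on a DataFrame, and only the shape of the result differs.

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values.
s = pd.Series([1.0, np.nan, 3.0])
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, np.nan]})

# fillna works on both objects.
s_filled = s.fillna(0)
df_filled = df.fillna(0)

# mean works on both: a Series gives one number,
# a DataFrame gives one value per column.
s_mean = s.mean()      # NaN is skipped, so this averages 1.0 and 3.0
df_means = df.mean()   # a Series indexed by column name
```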

Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

pandas.Series( data, index, dtype, copy)

  • data - the values; accepts forms such as an ndarray, a list, or a constant.
  • index - the index labels. Values must be hashable and the index must be the same length as data; duplicate labels are allowed. Defaults to np.arange(n) if no index is passed.
  • dtype - the data type. If None, the data type is inferred.
  • copy - whether to copy the input data. Default False.
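
The constructor parameters can be tried out with a short sketch like the one below. The array values and labels are arbitrary, chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# index supplies the labels; dtype forces the element type.
s = pd.Series(values, index=["x", "y", "z"], dtype="float64")

# Without an index, the labels default to 0..n-1.
default_index = pd.Series(values)

# copy=True gives the Series its own data, so later changes
# to `values` do not show up in it.
independent = pd.Series(values, copy=True)
values[0] = 99
```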

Series Explanation

colors = pd.Series(["Red", "Green", "Blue"])
colors
# 0      Red
# 1    Green
# 2     Blue
# dtype: object

cars = pd.Series(["Audi", "Ferrai", "BMW"])
cars
# 0      Audi
# 1    Ferrai
# 2       BMW
# dtype: object
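import numpy as np

# Series from an ndarray with an explicit index
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
# 100    a
# 101    b
# 102    c
# 103    d
# dtype: object

# Series from a dict: the keys become the index
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
# a    0.0
# b    1.0
# c    2.0
# dtype: float64

# Series from a scalar: the value is repeated for every index label
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
# 0    5
# 1    5
# 2    5
# 3    5
# dtype: int64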
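
# Series with string labels, used for the lookups below (values are illustrative)
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the first element (by position)
s.iloc[0]

# retrieve the first three elements
s.iloc[:3]

# retrieve the last three elements
s.iloc[-3:]

# retrieve a single element using an index label
s['a']

# retrieve multiple elements using a list of index labels
s[['a','c','d']]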

DataFrames

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

pandas.DataFrame( data, index, columns, dtype, copy)

  • data - the values; accepts forms such as an ndarray, Series, map, lists, dict, constants, or another DataFrame.
  • index - row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
  • columns - column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
  • dtype - data type of each column.
  • copy - whether to copy the input data. Default False.
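
To see the index, columns, and dtype parameters in action, a minimal sketch follows. The array and the labels r1, r2, a, b, c are illustrative only.

```python
import numpy as np
import pandas as pd

# A 2x3 array of illustrative values.
arr = np.array([[1, 2, 3], [4, 5, 6]])

# No index or columns passed: both default to 0..n-1.
plain = pd.DataFrame(arr)

# Explicit row labels, column labels, and dtype.
labeled = pd.DataFrame(
    arr,
    index=["r1", "r2"],
    columns=["a", "b", "c"],
    dtype="float64",
)
```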

DataFrames Explanation

[Figure: anatomy of a pandas DataFrame]

A pandas DataFrame can be created from various inputs, such as:

  • Lists
  • dict
  • Series
  • NumPy ndarrays
  • Another DataFrame

# DataFrame from a dict of Series
dataframes = pd.DataFrame({ "color": colors, "car": cars })
dataframes
#    color     car
# 0    Red    Audi
# 1  Green  Ferrai
# 2   Blue     BMW

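# DataFrame from a dict of lists, with custom row labels
# (columns appear in the order of the dict keys)
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
df
#         Name  Age
# rank1    Tom   28
# rank2   Jack   34
# rank3  Steve   29
# rank4  Ricky   42
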
# DataFrame from a list of dicts: keys missing from a dict become NaN
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
#    a   b     c
# 0  1   2   NaN
# 1  5  10  20.0

Column Operations

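# example frame with columns 'one' and 'two' (values are illustrative)
df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

# select column
df['one']

# add column
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df['four'] = df['one'] + df['three']

# delete column
del df['one']   # removes the column in place
df.pop('two')   # removes the column and returns it as a Series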

Row Operations

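# select row
df.loc['b'] # selection by label
df.iloc[2]  # selection by integer position

# slice rows (by position)
df[2:4]

# add rows: DataFrame.append was removed in pandas 2.0, so use pd.concat
# (df2 is another DataFrame with the same columns)
df = pd.concat([df, df2])

# drop rows with label 0 (returns a new DataFrame; df itself is unchanged)
df.drop(0)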
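
The row operations above can be run end to end with a sketch like this one. The frames df and df2 and their labels are invented for the example, and pd.concat is used to add rows.

```python
import pandas as pd

# Illustrative frames with string row labels.
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"one": [7], "two": [8]}, index=["d"])

row_b = df.loc["b"]      # selection by label
row_last = df.iloc[2]    # selection by integer position
middle = df[1:3]         # rows at positions 1 and 2

# Rows of df followed by rows of df2.
combined = pd.concat([df, df2])

# drop returns a new frame; `combined` still contains row "a".
without_a = combined.drop("a")
```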

CSVs

Load CSV data

# import data
car_sales = pd.read_csv("car-sales.csv")
car_sales
#      Make Colour  Odometer (KM)  Doors       Price
# 0  Toyota  White         150043      4   $4,000.00
# 1   Honda    Red          87899      4   $5,000.00
# 2  Toyota   Blue          32549      3   $7,000.00
# 3     BMW  Black          11179      5  $22,000.00
# 4  Nissan  White         213095      4   $3,500.00
# 5  Toyota  Green          99213      4   $4,500.00
# 6   Honda   Blue          45698      4   $7,500.00
# 7   Honda   Blue          54738      4   $7,000.00
# 8  Toyota  White          60000      4   $6,250.00
# 9  Nissan  White          31600      4   $9,700.00

Describe Data

# Attributes
car_sales.dtypes
# Make             object
# Colour           object
# Odometer (KM)     int64
# Doors             int64
# Price            object
# dtype: object

car_sales.columns
# Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

car_sales.index
# RangeIndex(start=0, stop=10, step=1)

car_sales["Doors"].mean()
# 4.0

car_sales["Doors"].sum()
# 40

len(car_sales)
# 10

car_sales.info()
#  #   Column         Non-Null Count  Dtype
# ---  ------         --------------  -----
# dtypes: int64(2), object(3)
# memory usage: 528.0+ bytes

# Functions
car_sales.describe() # summary statistics (count, mean, std, min, quartiles, max) for the numeric columns

Selecting and Viewing data

car_sales.head(5) # first 5 rows of the data
car_sales.tail(5) # last 5 rows of the data

animals = pd.Series(["cat", "dog", "bird", "panda", "snake"], index=[0, 3, 9, 8, 5])
# 0      cat
# 3      dog
# 9     bird
# 8    panda
# 5    snake

animals.loc[3] # loc selects by index label
# 'dog'
animals.iloc[3] # iloc selects by integer position
# 'panda'

# boolean filtering: keep only the rows where the condition is True
car_sales[car_sales["Make"] == "Toyota"]
car_sales[car_sales["Odometer (KM)"] > 100000]
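
The filtering expressions above work in two steps, which the following sketch spells out. It rebuilds the first five rows of the car_sales table inline (without the Price column) so it runs without the CSV file. Combining two conditions with & is an addition for illustration.

```python
import pandas as pd

# First five rows of the car_sales table, rebuilt inline.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"],
    "Colour": ["White", "Red", "Blue", "Black", "White"],
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095],
    "Doors": [4, 4, 3, 5, 4],
})

# Step 1: the comparison produces a boolean Series, one True/False per row.
is_toyota = car_sales["Make"] == "Toyota"

# Step 2: indexing with that mask keeps only the True rows.
toyotas = car_sales[is_toyota]

# Conditions combine with & (and) or | (or); wrap each comparison in parentheses.
high_km_toyotas = car_sales[is_toyota & (car_sales["Odometer (KM)"] > 100000)]
```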

External Resources