Pandas

Pandas is an open-source Python Library providing high-performance data manipulation and analysis tool using its powerful data structures.

This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

Calculate statistics and answer questions about the data, like
- What's the average, median, max, or min of each column?
- Does column A correlate with column B?
- What does the distribution of data in column C look like?
Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
Store the cleaned, transformed data back into a CSV, other file or database

Import Pandas

import pandas as pd

Core components of pandas: Series and DataFrames¶

The primary two components of pandas are the Series and DataFrame.

A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.

Series¶

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.

pandas.Series( data, index, dtype, copy)

data - data takes various forms like ndarray, list, constants index - Index values must be unique and hashable, same length as data. Default np.arrange(n) if no index is passed. dtype - dtype is for data type. If None, data type will be inferred copy - Copy data. Default False

Series Explanation

colors = pd.Series(["Red", "Green", "Blue"])

colors

0      Red
1    Green
2     Blue
dtype: object

cars = pd.Series(["Audi", "Ferrai", "BMW"])
cars

0      Audi
1    Ferrai
2       BMW
dtype: object

import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])

100  a
101  b
102  c
103  d
dtype: object

data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)

a 0.0
b 1.0
c 2.0
dtype: float64

s = pd.Series(5, index=[0, 1, 2, 3])
print s

0  5
1  5
2  5
3  5
dtype: int64

#retrieve the first element
s[0]

#retrieve the first three element
s[:3]

#retrieve the last three element
s[-3:]

#Retrieve a single element using index label value.
s['a']

#Retrieve multiple elements using a list of index label values.
s[['a','c','d']]

DataFrames¶

A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

pandas.DataFrame( data, index, columns, dtype, copy)

data - data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame. index- For the row labels, the Index to be used for the resulting frame is Optional Default np.arange(n) if no index is passed. columns - For column labels, the optional default syntax is - np.arange(n). This is only true if no index is passed. dtype - Data type of each column. copy - This command (or whatever it is) is used for copying of data, if the default is False.

DataFrames Explanation

A pandas DataFrame can be created using various inputs like −

Lists
dict
Series
Numpy ndarrays
Another DataFrame

dataframes = pd.DataFrame({ "color": colors, "car": cars })
dataframes

	color	car
0	Red	Audi
1	Green	Ferrai
2	Blue	BMW

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
df

         Age    Name
rank1    28      Tom
rank2    34     Jack
rank3    29    Steve
rank4    42    Ricky

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df

    a    b      c
0   1   2     NaN
1   5   10   20.0

Column Operations

# select column
df['one']

# delete column
del df['one']
df.pop('two')

# add column
df['three']=pd.Series([10,20,30],index=['a','b','c'])
df['four']=df['one']+df['three']

Row Operations

# select row
df.loc['b'] # Selection by Label
df.iloc[2]  # Selection by integer location

# Slice Rows
df[2:4]

# Addition of Rows
df = df.append(df2)

# Drop rows with label 0
df.drop(0)

CSVs¶

Load CSV data

# import data
car_sales = pd.read_csv("car-sales.csv")

car_sales

	Make	Colour	Odometer (KM)	Doors	Price
0	Toyota	White	150043	4	$4,000.00
1	Honda	Red	87899	4	$5,000.00
2	Toyota	Blue	32549	3	$7,000.00
3	BMW	Black	11179	5	$22,000.00
4	Nissan	White	213095	4	$3,500.00
5	Toyota	Green	99213	4	$4,500.00
6	Honda	Blue	45698	4	$7,500.00
7	Honda	Blue	54738	4	$7,000.00
8	Toyota	White	60000	4	$6,250.00
9	Nissan	White	31600	4	$9,700.00

Describe Data¶

# Attributes
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price            object
dtype: object

car_sales.columns
# Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

car_sales.index
# RangeIndex(start=0, stop=10, step=1)

car_sales["Doors"].mean()
# 4.0

car_sales["Doors"].sum()
# 40

len(car_sales)
# 10

car_sales.info()
#  #   Column         Non-Null Count  Dtype
# ---  ------         --------------  -----
# dtypes: int64(2), object(3)
# memory usage: 528.0+ bytes

# Functions
car_sales.describe() # give min max mean and other info of numerical data

Selecting and Viewing data¶

car_sales.head(5) # gives small snapshot of top 5 lines of the data
car_sales.tail(5) # gives small snapshot of bottom 5 lines of the data

animals = pd.Series(["cat", "dog", "bird", "panda", "snake"], [0, 3, 9, 8, 5])
# 0      cat
# 3      dog
# 9     bird
# 8    panda
# 5    snake

animals.loc[3] # loc refers to the value of index
# 'dog'
animals.iloc[3] # iloc refers to the position of index
# panda

car_sales[car_sales["Make"] == "Toyota"]
car_sales[car_sales["Odometer (KM)"] > 100000]