by Shubhi Asthana

Series and DataFrame in Python

A couple of months ago, I took the online course “Using Python for Research” offered by Harvard University on edX. While taking the course, I learned many concepts of Python, NumPy, Matplotlib, and PyPlot. I also had an opportunity to work on case studies during this course and was able to use my knowledge on actual datasets. For more information about this program, check out here.

I learned two important concepts in this course — Series and DataFrame. I want to introduce these to you through a short tutorial.

To start with the tutorial, lets get the latest source code of Python from the official website here.

Once you’ve installed Python is installed, you’ll use a graphical user interface called IDLE to work with Python.

Let’s import Pandas to our workspace. Pandas is a Python library that provides data structures and data analysis tools for different functions.

Series

A Series is a one-dimensional object that can hold any data type such as integers, floats and strings. Let’s take a list of items as an input argument and create a Series object for that list.

>>> import pandas as pd

>>> x = pd.Series([6,3,4,6])

>>> x

0 6

1 3

2 4

3 6

dtype: int64

The axis labels for the data as referred to as the index. The length of index must be the same as the length of data. Since we have not passed any index in the code above, the default index will be created with values [0, 1, … len(data) -1]

Lets go ahead and define indexes for the data.

>>> x = pd.Series([6,3,4,6], index=[‘a’, ‘b’, ‘c’, ‘d’])

>>> x

a 6

b 3

c 4

d 6

dtype: int64

The index in left most column now refers to data in the right column.

We can lookup the data by referring to its index:

>>> x[“c”]

Python gives us the relevant data for the index.

One example of a data type is the dictionary defined below. The index and values correlate to keys and values. We can use the index to get the values of data corresponding to the labels in the index.

>>> data = {‘abc’: 1, ‘def’: 2, ‘xyz’: 3}

>>> pd.Series(data)

abc 1

def 2

xyz 3

dtype: int64

Another interesting feature in Series is having data as a scalar value. In that case, the data value gets repeated for each of the indexes defined.

>>> x = pd.Series(3, index=[‘a’, ‘b’, ‘c’, ‘d’])

>>> x

a 3

b 3

c 3

d 3

dtype: int64

DataFrame

A DataFrame is a two dimensional object that can have columns with potential different types. Different kind of inputs include dictionaries, lists, series, and even another DataFrame.

It is the most commonly used pandas object.

Lets go ahead and create a DataFrame by passing a NumPy array with datetime as indexes and labeled columns:

>>> import numpy as np

>>> dates = pd.date_range(‘20170505’, periods = 8)

>>> dates

DatetimeIndex([‘2017–05–05’, ‘2017–05–06’, ‘2017–05–07’, ‘2017–05–08’,

‘2017–05–09’, ‘2017–05–10’, ‘2017–05–11’, ‘2017–05–12’],

dtype=’datetime64[ns]’, freq=’D’)

>>> df = pd.DataFrame(np.random.randn(8,3), index=dates, columns=list(‘ABC’))

>>> df

A B C

2017–05–05 -0.301877 1.508536 -2.065571

2017–05–06 0.613538 -0.052423 -1.206090

2017–05–07 0.772951 0.835798 0.345913

2017–05–08 1.339559 0.900384 -1.037658

2017–05–09 -0.695919 1.372793 0.539752

2017–05–10 0.275916 -0.420183 1.744796

2017–05–11 -0.206065 0.910706 -0.028646

2017–05–12 1.178219 0.783122 0.829979

A DataFrame with a datetime range of 8 days gets created as shown above. We can view the top and bottom rows of the frame using df.head and df.tail:

>>> df.head()

A B C

2017–05–05 -0.301877 1.508536 -2.065571

2017–05–06 0.613538 -0.052423 -1.206090

2017–05–07 0.772951 0.835798 0.345913

2017–05–08 1.339559 0.900384 -1.037658

2017–05–09 -0.695919 1.372793 0.539752

>>> df.tail()

A B C

2017–05–08 1.339559 0.900384 -1.037658

2017–05–09 -0.695919 1.372793 0.539752

2017–05–10 0.275916 -0.420183 1.744796

2017–05–11 -0.206065 0.910706 -0.028646

2017–05–12 1.178219 0.783122 0.829979

We can observe a quick statistic summary of our data too:

>>> df.describe()

A B C

count 8.000000 8.000000 8.000000

mean 0.372040 0.729842 -0.109691

std 0.731262 0.657931 1.244801

min -0.695919 -0.420183 -2.065571

25% -0.230018 0.574236 -1.079766

50% 0.444727 0.868091 0.158633

75% 0.874268 1.026228 0.612309

max 1.339559 1.508536 1.744796

We can also apply functions to the data like cumulative sum, view histograms, merging DataFrames, concatenating and reshaping DataFrames.

>>> df.apply(np.cumsum)

A B C

2017–05–05 -0.301877 1.508536 -2.065571

2017–05–06 0.311661 1.456113 -3.271661

2017–05–07 1.084612 2.291911 -2.925748

2017–05–08 2.424171 3.192296 -3.963406

2017–05–09 1.728252 4.565088 -3.423654

2017–05–10 2.004169 4.144905 -1.678858

2017–05–11 1.798104 5.055611 -1.707504

2017–05–12 2.976322 5.838734 -0.877526

You can read more details about these data structures here.