Pandas tips & tricks
Pandas is a comprehensive Python library for data analysis, but its syntax for performing certain specific data transformations can be difficult to discern from its documentation. I recently used Pandas in my analysis of Billboard Top 100 Music data, and I’ve uncovered several useful commands that I will most likely use again in the future, so I’m compiling a (growing) list here for both my own reference and the benefit of someone else struggling with these problems in the future.
I will be updating this as I come across/learn new useful methods.
(Growing) List of Useful Tips
Note: assume import pandas as pd
below:
Datetime Manipulation
Many datasets contain a datetime
field of some sort, following patterns such as mm/dd/yyyy
, dd/mm/yyyy
, mm-dd-yyyy
etc. It is often useful to separate this single column into the respective month, day, and year. For example, if we have a Pandas dataframe with the following WeekID
field that follows mm/dd/yyyy
: `2/10/1999’, we can use the following Pandas code:
# use to_datetime to separate one column into multiple
data['WeekID'] = pd.to_datetime(data['WeekID'], format='%m/%d/%Y')
# extract the year, month, and day into separate columns
data['year'] = data['WeekID'].dt.year
data['month'] = data['WeekID'].dt.month
data['day'] = data['WeekID'].dt.day
Similarly, /
can be replaced in the format variable with -
if that’s the syntax for the datetime pattern.
Row Selections
Rows can be selected based on logical comparisons:
# return all the rows where 'year' is 2018 or 2017
data = data.loc[(data['year'] == 2018) | (data['year'] == 2017)]
For datetime
, can also be achieved using masks:
# isolate dates to between Dec. 1, 2017 and May 31, 2018
data['datetime'] = pd.to_datetime(data['datetime'])
mask = (data['datetime'] > '2017-12-01') & (data['datetime'] <= '2018-05-31')
data = data.loc[mask]
Map
Append additional information to every entry in a column using map
:
# add month and year to id column
data['id'] = data.id.map(str) + "-" + data.Month.map(str) +"-" + data.Year.map(str)
Pivot Tables
Pivot tables summarize data of more extensive tables using sums, aggregations, averages, or other statistics. They are useful in business intelligence and data analysis.
Say we have a dataframe data
with columns position
(between values 1
and 100
) and id
; I want a new table that displays, for each id
entry, the counts for each position
value. Pivot tables can be defined and performed in Pandas as follows:
# use pivot table to extract counts of position per id
data['count'] = 1
result = data.pivot_table(
index=['id'], columns='position', values='count',
fill_value=0, aggfunc=np.sum
)
Merge on str.contains()
instead of 1-to-1 match
I came across this particular problem when trying to merge two datasets together via partial matching. For example, I have dataframes data1
and data2
. data1
has entries in id
column that could be a substring of data2
’s id
column entries. Merges based on partial matches can be implemented as follows:
# join based on partial matches of data1's id with data2's id
tmp = (data1.id
.apply(lambda idx: data2[data2.id.str.find(idx).ge(0)]['id'])
.bfill(axis=1)
.iloc[:, 0])
result = pd.concat([data1.id, tmp], axis=1, ignore_index=True).rename(columns={0: 'id1', 1: 'id2'})
Sorting
Often it is useful to keep one max/min row after sorting on a column. In Pandas, this can be done in a line:
data = data.sort_values('position').drop_duplicates(["id"],keep='last')
Furthermore, sorting on two or more columns by order is also possible in a line:
# sort on position first from largest to smallest, then on id from smallest to largest
data.sort_values(['position', 'id'], ascending=[False, True])