Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean:
(I wrote these notes for McGill’s course COMP 598. A HackMD version is also available.)
- Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences :microscope:
- The most widely used software for dealing with this kind of data is Microsoft Excel.
Why use Python? :snake:
- I’ll give three (four) reasons why you might want to try working with this kind of data in Python instead
- It’s free. :money_with_wings:
- Automation and complex manipulations are much easier :pie:
- Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.) :chart_with_upwards_trend:
- Clicking on buttons hurts :hospital: :mouse: :no_entry_sign:
- Loading a dataset in Pandas
- Inspecting it
- Basic manipulations
:::info
These libraries are super extensive. This is not exhaustive coverage of pandas; it's just a small taste, meant to get you accustomed to learning from the documentation to best suit your needs.
:::
Getting a Dataset
- I browsed Kaggle for some interesting COVID-related datasets (of course)
- We’ll be using the COVID by US county dataset.
:::danger
Disclaimer: This is a real dataset, but the following visualizations and statistics are purely for illustration purposes, not a real analysis of COVID.
:::
- Download the CSV file (Comma Separated Values). A CSV is basically a plain-text version of an Excel table.
- Place the file in a folder such as `my_project/data/` (we'll be working from a sibling folder inside `my_project/`, which is why the paths below start with `../data/`).
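To see what "a text version of an Excel table" means concretely, the first lines of such a file might look like this (toy values, mirroring the columns of the dataset we'll use):

```csv
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
```

The first line names the columns; each following line is one row, with values separated by commas.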
Loading the file in Python
Create a Python file in your working folder.
- If the source file is a CSV (see the `read_csv` documentation):

```python
import pandas as pd

df = pd.read_csv('../data/covid.csv')
```
- If the source file is an Excel file (see the `read_excel` documentation):

```python
import pandas as pd

df = pd.read_excel('../data/covid.xls')
```
- The core object that Pandas uses is called a DataFrame.
- We loaded our data into a DataFrame called `df`.
- This object stores the table data and supports most of the functionality we'll need.
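To get a feel for what a DataFrame is, you can also build a tiny one directly from a Python dict (toy values here, mirroring the columns of the COVID dataset; not part of the original notes):

```python
import pandas as pd

# A DataFrame can be constructed from a dict mapping column names to lists.
toy = pd.DataFrame({
    'date':   ['2020-01-21', '2020-01-22'],
    'county': ['Snohomish', 'Snohomish'],
    'state':  ['Washington', 'Washington'],
    'cases':  [1, 1],
    'deaths': [0, 0],
})

print(toy.shape)  # (number of rows, number of columns)
```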
- Let's take a look using the `head` method, which returns the first few rows of the table: `print(df.head())`
```
         date     county       state     fips  cases  deaths
0  2020-01-21  Snohomish  Washington  53061.0      1       0
1  2020-01-22  Snohomish  Washington  53061.0      1       0
2  2020-01-23  Snohomish  Washington  53061.0      1       0
3  2020-01-24       Cook    Illinois  17031.0      1       0
4  2020-01-24  Snohomish  Washington  53061.0      1       0
```
Each row records the number of cases and deaths for one US county on one day.
- To get a list of columns: `print(df.columns)`

```
Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')
```
- To get a single column:

```python
dates = df['date']
print(dates)
```

```
0    2020-01-21
1    2020-01-22
2    2020-01-23
3    2020-01-24
4    2020-01-24
5    2020-01-25
...
```
- To get multiple columns, index with a list of column names, e.g. `df[['date', 'cases']]`.
- To select a specific row, we use the `iloc` attribute, e.g. `print(df.iloc[2])`:
```
date      2020-01-23
county     Snohomish
state     Washington
fips           53061
cases              1
deaths             0
Name: 2, dtype: object
```
- You can then select columns within a row, e.g. `df.iloc[2]['cases']`.
- Row and column selection methods are very extensive and powerful. You should really read the docs for a full understanding.
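One distinction worth knowing early (a toy sketch, not from the original notes): `.loc` selects by index *label*, while `.iloc` selects by integer *position*. With the default numeric index they coincide, but with a custom index they differ:

```python
import pandas as pd

# Toy frame with string index labels (not the COVID data).
toy = pd.DataFrame({'cases': [5, 10, 20]}, index=['a', 'b', 'c'])

by_label = toy.loc['b', 'cases']      # label-based lookup
by_position = toy.iloc[1]['cases']    # position-based lookup

print(by_label, by_position)  # both 10 here, since 'b' is at position 1
```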
- Here is an example of getting the rows where deaths are more than double the cases.
- `.loc` is a powerful attribute which can take a condition on the column values to filter the rows.
```python
print(df.loc[df['deaths'] > df['cases'] * 2])
```
```
             date   county         state  fips  cases  deaths
26292  2020-04-02  Unknown    California   NaN      0       1
28473  2020-04-03  Unknown       Arizona   NaN      0       4
29475  2020-04-03  Unknown     Minnesota   NaN      0       3
29803  2020-04-03  Unknown      New York   NaN     37     608
30135  2020-04-03  Unknown  Pennsylvania   NaN      0       1
30475  2020-04-03  Unknown          Utah   NaN      0       3
```
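Conditions inside `.loc` can also be combined with `&` (and) and `|` (or); note that each condition needs its own parentheses. A small self-contained sketch with toy data (not the COVID dataset):

```python
import pandas as pd

toy = pd.DataFrame({'cases': [0, 50, 500], 'deaths': [1, 0, 20]})

# Keep rows that have at least one case AND at least one death.
both = toy.loc[(toy['cases'] > 0) & (toy['deaths'] > 0)]
print(both)
```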
- Maybe we want to flag certain counties with lots of cases.
```python
df['hot'] = df['cases'] > 100
print(df.head())
```
```
         date     county       state     fips  cases  deaths    hot
0  2020-01-21  Snohomish  Washington  53061.0      1       0  False
1  2020-01-22  Snohomish  Washington  53061.0      1       0  False
2  2020-01-23  Snohomish  Washington  53061.0      1       0  False
3  2020-01-24       Cook    Illinois  17031.0      1       0  False
4  2020-01-24  Snohomish  Washington  53061.0      1       0  False
```
- Let's say we want to publish our DataFrame with the new `hot` column.
- We simply write it to a new CSV file which can be loaded later.
```python
df.to_csv('new_data.csv')

# or to Excel format
df.to_excel('new_data.xls')
```
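One gotcha worth knowing (not covered in the original notes): by default `to_csv` also writes the DataFrame's index as an extra first column; pass `index=False` if you don't want it. A toy sketch writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

toy = pd.DataFrame({'cases': [1, 2]})

# index=False drops the extra index column from the output.
buf = io.StringIO()
toy.to_csv(buf, index=False)
print(buf.getvalue())
```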
- Maybe we want to know averages for the columns per state.
- The `groupby` method comes in handy here.

```python
# groupby object; as_index=False keeps 'state' as a regular column
state_grouped = df.groupby(['state'], as_index=False)

# new DataFrame with per-state means (numeric columns only)
means_df = state_grouped.mean(numeric_only=True)
```
```
        state         fips       cases     deaths       hot
0     Alabama  1067.804565   95.768957   3.544569  0.222011
1      Alaska  2148.606250   23.179455   0.465347  0.059406
2     Arizona  4014.058216  411.876394  18.223048  0.372677
3    Arkansas  5075.446098   37.349700   0.763608  0.077983
4  California  6058.734863  671.394359  25.821298  0.382446
5    Colorado  8062.313048  206.642279  10.232297  0.233342
```
:::info
The logic behind grouping is a bit involved, so I suggest reading the docs.
:::
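The gist of grouping is split-apply-combine: rows are split into groups by a key, a function is applied to each group, and the results are combined into a new frame. A minimal self-contained sketch (toy data, not the COVID dataset):

```python
import pandas as pd

toy = pd.DataFrame({'state': ['NY', 'NY', 'CA'],
                    'deaths': [2, 4, 1]})

# Split rows by state, apply mean() to each group, combine the results.
means = toy.groupby('state', as_index=False)['deaths'].mean()
print(means)
```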
- Which states have the most daily deaths on average?
```python
print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])
```
```
                   state      deaths
33              New York  274.444252
31            New Jersey  212.262184
6            Connecticut  171.599364
22         Massachusetts  152.209790
8   District of Columbia  140.775000
41           Puerto Rico   58.486486
```
- I used `ascending=False` for a decreasing sort by `deaths`.
- Then I print only the `state` and `deaths` columns.
Plotting with Seaborn :ocean:
- Seaborn is a wrapper on top of matplotlib that works nicely with DataFrames.
Deaths per Day
- Let’s compare deaths over time between three states.
- Notice the use of `isin()` to filter rows whose column values are within a list of accepted values.
```python
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.lineplot(x="date", y="deaths", hue="state",
                 data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])

# I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
    if ind % 30 == 0:  # every 30th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)

plt.show()
```
Mortality by State
- Let’s do one more.
- Let's group by state again and, instead of averaging, take totals (`sum`).
- We also add a new column called `mortality`, the ratio of total deaths to total cases.

```python
tot = df.groupby(['state'], as_index=False).sum(numeric_only=True)
tot['mortality'] = tot['deaths'] / tot['cases']
tot = tot.sort_values(by='mortality', ascending=False)

sns.barplot(x='state', y='mortality', data=tot)
plt.show()
```
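One caveat with the ratio (my addition, not in the original notes): if a state had zero total cases, the division would quietly produce `NaN` (for 0/0) or `inf` (for x/0) rather than raising an error, so it's worth checking for those values. A toy sketch:

```python
import math
import pandas as pd

# Toy frame: the second "state" has zero reported cases.
toy = pd.DataFrame({'cases': [10, 0], 'deaths': [1, 0]})

# 0/0 quietly becomes NaN instead of raising ZeroDivisionError.
toy['mortality'] = toy['deaths'] / toy['cases']
print(toy['mortality'])
```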