Chapter 6.2 - Pandas filtering - Computer Programming for the Geosciences

Pandas has powerful features that allow you to do many things with tabular data.

One very basic feature is the ability to filter data. Filtering data is the process of only looking at data that have specific attributes.

Pandas examples

Chapter 6.2.1 - SPC tornado dataset¶

The tornado data we will use in the following examples are curated and provided by the Storm Prediction Center. Each row from the “actual tornadoes” dataset represents one tornado report, and several attributes like time, location, and other information.

Download the dataset:

!wget -nc "https://raw.githubusercontent.com/ahaberlie/python_programming_geosciences/refs/heads/main/data/1950-2024_actual_tornadoes.csv"

File ‘1950-2024_actual_tornadoes.csv’ already there; not retrieving.

We can print out the column names using the columns method:

import pandas as pd

tor = pd.read_csv("1950-2024_actual_tornadoes.csv")

tor.columns

Index(['om', 'yr', 'mo', 'dy', 'date', 'time', 'tz', 'st', 'stf', 'stn', 'mag',
       'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len',
       'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc'],
      dtype='object')

You can read about the column information here: https://www.spc.noaa.gov/wcm/data/SPC_severe_database_description.pdf

We can also look at the types of data in each column using the info method. This can assure that you are treating columns as numbers if you are running statistics:

tor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71813 entries, 0 to 71812
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   om      71813 non-null  int64  
 1   yr      71813 non-null  int64  
 2   mo      71813 non-null  int64  
 3   dy      71813 non-null  int64  
 4   date    71813 non-null  object 
 5   time    71813 non-null  object 
 6   tz      71813 non-null  int64  
 7   st      71813 non-null  object 
 8   stf     71813 non-null  int64  
 9   stn     71813 non-null  int64  
 10  mag     71813 non-null  int64  
 11  inj     71813 non-null  int64  
 12  fat     71813 non-null  int64  
 13  loss    71813 non-null  float64
 14  closs   71813 non-null  float64
 15  slat    71813 non-null  float64
 16  slon    71813 non-null  float64
 17  elat    71813 non-null  float64
 18  elon    71813 non-null  float64
 19  len     71813 non-null  float64
 20  wid     71813 non-null  int64  
 21  ns      71813 non-null  int64  
 22  sn      71813 non-null  int64  
 23  sg      71813 non-null  int64  
 24  f1      71813 non-null  int64  
 25  f2      71813 non-null  int64  
 26  f3      71813 non-null  int64  
 27  f4      71813 non-null  int64  
 28  fc      71813 non-null  int64  
dtypes: float64(7), int64(19), object(3)
memory usage: 15.9+ MB

You can run summary statistics using the describe method.

This method is very useful for looking at quick statistics. However, it demonstrates an important point--you need to understand the importance of these statistics.

The average dy (day) that a tornado happens is 15.9.. why do you think this is?

tor.describe()

We should always do some initial checks on our dataset. For example, we expect the data to be from 1950 to 2022. How can we assure this is the case? There are many options, including:

unique
describe
min or max

tor['yr'].unique()

array([1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,
       1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024])

tor['yr'].describe()

count    71813.000000
mean      1993.226435
std         20.197898
min       1950.000000
25%       1977.000000
50%       1996.000000
75%       2010.000000
max       2024.000000
Name: yr, dtype: float64

print("max=", tor['yr'].max(), "min=", tor['yr'].min())

max= 2024 min= 1950

Indexing for the tornado database

We should just keep the default index. Each row is a tornado report, and there are many tornadoes that happen each year. So, unlike the examples in the last slideshow, yr (year) is not a good index column choice.

tor.yr

0        1950
1        1950
2        1950
3        1950
4        1950
         ... 
71808    2024
71809    2024
71810    2024
71811    2024
71812    2024
Name: yr, Length: 71813, dtype: int64

Chapter 6.2.2 - Using conditionals to subset pandas DataFrames¶

Please review Conditionals and If Statements

We can use conditionals to select what rows we want to examine.

Pandas does this by generating True or False arrays based on tests.

This code prints out the year associated with each tornado report

print(tor['yr'])

0        1950
1        1950
2        1950
3        1950
4        1950
         ... 
71808    2024
71809    2024
71810    2024
71811    2024
71812    2024
Name: yr, Length: 71813, dtype: int64

We can create a True or False array by running a test like tor['yr'] == 1950 which results in True for every row that the year is 1950, otherwise False.

print(tor['yr'] == 1950)

0         True
1         True
2         True
3         True
4         True
         ...  
71808    False
71809    False
71810    False
71811    False
71812    False
Name: yr, Length: 71813, dtype: bool

We combine this with our DataFrame in one of two ways.

One, set the condition and the resulting True or False to a variable and then put it inside square brackets:

only_in_1950 = tor['yr'] == 1950

tor[only_in_1950]

or you can put the test directly into the square brackets:

tor[tor.yr == 1950]

We can combine conditions to get very specific filters. The same rules of combining conditions earlier in the class apply here!

Here is how we get all reports on March 27th, 1950 by defining variables.

only_in_1950 = tor['yr'] == 1950
only_in_march = tor['mo'] == 3
only_on_the_27th = tor['dy'] == 27

tor[(only_in_1950) & (only_in_march) & (only_on_the_27th)]

here is how we do it by placing the conditions directly in the square brackets:

tor[(tor['yr'] == 1950) & (tor['mo'] == 3) & (tor['dy'] == 27)]

Advanced: You can also use the query method, but the syntax is a bit different:

tor.query('yr == 1950 & mo == 3 & dy == 27')

What happens if we change the last & to an |?

This would give us the rows where the yr was 1950 AND the month was 3.. or all rows where the date was 27

tor.query('(yr == 1950 & mo == 3) | dy == 27')

Chapter 6.2.3 - Practice answering questions using pandas¶

Read in the csv file once using tor, and then subset using a different variable name (tor_subset).

import pandas as pd

tor = pd.read_csv("1950-2024_actual_tornadoes.csv")

How many tornadoes are in the tornado dataset?

## Your code

How many variables/columns are associated with each tornado?

## Your code

What are the names of these variables/columns?

## Your code

Can you give summary statistics of the ‘mag’ column?

## Your code

Get all rows that have a “mag” of 5. How many rows did you get? Could you figure out what the most common state (e.g., IL, WI, OH, etc.) is for “mag” 5 tornadoes? (HINT try using the value_counts method)

## Your code

Get all rows that occurred in Hawaii OR Alaska. How many occurred in each state?

## Your code

Get all rows that occurred on April 27th 2011 and had a “mag” of 3 or greater. Can you give me the counts for each “mag” category 3 and above on this day? Based on how rare 3, 4, and 5 'mag’s are, do you think this was a mundane day or an historic day? Why or why not?

## Your code