Pandas has powerful features that allow you to do many things with tabular data.
One very basic feature is the ability to filter data. Filtering data is the process of only looking at data that have specific attributes.
Chapter 6.2.1 - SPC tornado dataset¶
The tornado data we will use in the following examples are curated and provided by the Storm Prediction Center. Each row from the “actual tornadoes” dataset represents one tornado report, and several attributes like time, location, and other information.
Download the dataset:
!wget -nc "https://raw.githubusercontent.com/ahaberlie/python_programming_geosciences/refs/heads/main/data/1950-2024_actual_tornadoes.csv"File ‘1950-2024_actual_tornadoes.csv’ already there; not retrieving.
We can print out the column names using the columns method:
import pandas as pd
tor = pd.read_csv("1950-2024_actual_tornadoes.csv")
tor.columnsIndex(['om', 'yr', 'mo', 'dy', 'date', 'time', 'tz', 'st', 'stf', 'stn', 'mag',
'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len',
'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc'],
dtype='object')You can read about the column information here: https://
We can also look at the types of data in each column using the info method. This can assure that you are treating columns as numbers if you are running statistics:
tor.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71813 entries, 0 to 71812
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 om 71813 non-null int64
1 yr 71813 non-null int64
2 mo 71813 non-null int64
3 dy 71813 non-null int64
4 date 71813 non-null object
5 time 71813 non-null object
6 tz 71813 non-null int64
7 st 71813 non-null object
8 stf 71813 non-null int64
9 stn 71813 non-null int64
10 mag 71813 non-null int64
11 inj 71813 non-null int64
12 fat 71813 non-null int64
13 loss 71813 non-null float64
14 closs 71813 non-null float64
15 slat 71813 non-null float64
16 slon 71813 non-null float64
17 elat 71813 non-null float64
18 elon 71813 non-null float64
19 len 71813 non-null float64
20 wid 71813 non-null int64
21 ns 71813 non-null int64
22 sn 71813 non-null int64
23 sg 71813 non-null int64
24 f1 71813 non-null int64
25 f2 71813 non-null int64
26 f3 71813 non-null int64
27 f4 71813 non-null int64
28 fc 71813 non-null int64
dtypes: float64(7), int64(19), object(3)
memory usage: 15.9+ MB
You can run summary statistics using the describe method.
This method is very useful for looking at quick statistics. However, it demonstrates an important point--you need to understand the importance of these statistics.
The average dy (day) that a tornado happens is 15.9.. why do you think this is?
tor.describe()We should always do some initial checks on our dataset. For example, we expect the data to be from 1950 to 2022. How can we assure this is the case? There are many options, including:
uniquedescribeminormax
tor['yr'].unique()array([1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024])tor['yr'].describe()count 71813.000000
mean 1993.226435
std 20.197898
min 1950.000000
25% 1977.000000
50% 1996.000000
75% 2010.000000
max 2024.000000
Name: yr, dtype: float64print("max=", tor['yr'].max(), "min=", tor['yr'].min())max= 2024 min= 1950
Indexing for the tornado database
We should just keep the default index. Each row is a tornado report, and there are many tornadoes that happen each year. So, unlike the examples in the last slideshow, yr (year) is not a good index column choice.
tor.yr0 1950
1 1950
2 1950
3 1950
4 1950
...
71808 2024
71809 2024
71810 2024
71811 2024
71812 2024
Name: yr, Length: 71813, dtype: int64Chapter 6.2.2 - Using conditionals to subset pandas DataFrames¶
Please review Conditionals and If Statements
We can use conditionals to select what rows we want to examine.
Pandas does this by generating True or False arrays based on tests.
This code prints out the year associated with each tornado report
print(tor['yr'])0 1950
1 1950
2 1950
3 1950
4 1950
...
71808 2024
71809 2024
71810 2024
71811 2024
71812 2024
Name: yr, Length: 71813, dtype: int64
We can create a True or False array by running a test like tor['yr'] == 1950 which results in True for every row that the year is 1950, otherwise False.
print(tor['yr'] == 1950)0 True
1 True
2 True
3 True
4 True
...
71808 False
71809 False
71810 False
71811 False
71812 False
Name: yr, Length: 71813, dtype: bool
We combine this with our DataFrame in one of two ways.
One, set the condition and the resulting True or False to a variable and then put it inside square brackets:
only_in_1950 = tor['yr'] == 1950
tor[only_in_1950]or you can put the test directly into the square brackets:
tor[tor.yr == 1950]We can combine conditions to get very specific filters. The same rules of combining conditions earlier in the class apply here!
Here is how we get all reports on March 27th, 1950 by defining variables.
only_in_1950 = tor['yr'] == 1950
only_in_march = tor['mo'] == 3
only_on_the_27th = tor['dy'] == 27
tor[(only_in_1950) & (only_in_march) & (only_on_the_27th)]here is how we do it by placing the conditions directly in the square brackets:
tor[(tor['yr'] == 1950) & (tor['mo'] == 3) & (tor['dy'] == 27)]Advanced: You can also use the query method, but the syntax is a bit different:
tor.query('yr == 1950 & mo == 3 & dy == 27')What happens if we change the last & to an |?
This would give us the rows where the yr was 1950 AND the month was 3.. or all rows where the date was 27
tor.query('(yr == 1950 & mo == 3) | dy == 27')Chapter 6.2.3 - Practice answering questions using pandas¶
Read in the csv file once using tor, and then subset using a different variable name (tor_subset).
import pandas as pd
tor = pd.read_csv("1950-2024_actual_tornadoes.csv")How many tornadoes are in the tornado dataset?
## Your codeHow many variables/columns are associated with each tornado?
## Your codeWhat are the names of these variables/columns?
## Your codeCan you give summary statistics of the ‘mag’ column?
## Your codeGet all rows that have a “mag” of 5. How many rows did you get? Could you figure out what the most common state (e.g., IL, WI, OH, etc.) is for “mag” 5 tornadoes? (HINT try using the value_counts method)
## Your codeGet all rows that occurred in Hawaii OR Alaska. How many occurred in each state?
## Your codeGet all rows that occurred on April 27th 2011 and had a “mag” of 3 or greater. Can you give me the counts for each “mag” category 3 and above on this day? Based on how rare 3, 4, and 5 'mag’s are, do you think this was a mundane day or an historic day? Why or why not?
## Your code