Homework 5 - Pandas - Computer Programming for the Geosciences

10 total points

Directions¶

Due: 12/07/2025 @ 11:59 p.m.
Change the name of your notebook to HW5_FirstLast.ipynb, where First is your first name and Last is your last name.
For each of the following prompts, write or modify Python code that fulfills the requirements.

Important information to remember¶

Periodically, you should click File -> Save Notebook
When you are done working for the day, click File -> Log Out
When you are finished, download the notebook and upload it to the submission link
I will reset the notebook and run each code block on its own. You will receive half credit if you rely on a variable from a previous or subsequent code block.

Extra requirements:¶

Answer the following questions using pandas. You will only get credit if you correctly use pandas. What does it mean to correctly use pandas? Each question can and should be answered using only pandas methods. Again, you will not get credit if you do not use pandas to answer the question. You can use print and other basic Python functions to display answers if you wish. Exception: you can use len to find how many rows are in a subset. Alternatively, you can use df.shape[0] (where df is the name of your subset variable).
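As a rough illustration of what "using only pandas" can look like, the sketch below filters a small, made-up DataFrame and counts the rows in the subset both ways mentioned above. The DataFrame toy and its columns are hypothetical and are not part of the homework data.

```python
import pandas as pd

# Hypothetical toy data; these rows and column names are made up for illustration
toy = pd.DataFrame({"year": [1999, 2001, 2001, 2005],
                    "state": ["WI", "IL", "WI", "IA"]})

# Boolean filtering is a pandas operation that returns a new subset
toy_wi = toy[toy["state"] == "WI"]

# Either of these gives the number of rows in the subset
n_len = len(toy_wi)
n_shape = toy_wi.shape[0]

print(n_len, n_shape)
```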

The final answer for each problem must be in the specified data type and set equal to the variable provided for each problem (question 1 is q1, question 2 is q2, etc.) or you will get an error in the grading system.
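pandas calculations typically return NumPy scalar types rather than built-in Python types. If you want to be sure that an answer is a plain int or float, one option is to convert it explicitly, as in this small sketch. The DataFrame toy is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the homework file
toy = pd.DataFrame({"count": [3, 5, 2]})

# Reductions such as .sum() and .mean() usually return NumPy scalars
total = toy["count"].sum()
average = toy["count"].mean()

# Explicit conversion to built-in Python types
total_int = int(total)
average_float = float(average)

print(type(total_int), type(average_float))
```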

Use the following csv file in data/homework: 1950-2024_spc_tornadoes.csv

Here is the metadata for this file: https://www.spc.noaa.gov/wcm/data/SPC_severe_database_description.pdf

Important: Run the following code to download the file

!wget -nc "https://raw.githubusercontent.com/ahaberlie/python_programming_geosciences/refs/heads/main/data/1950-2024_actual_tornadoes.csv"

You can read this file using the following code. We are adding a column named ‘hr’ which is the hour during which the event occurred in CST.

import pandas as pd

df = pd.read_csv("1950-2024_actual_tornadoes.csv")

## extract the hour out of the dataset and set it equal to
## a column named 'hr'
df['hr'] = [int(x.split(":")[0]) for x in df.time.values]

df
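To see what the 'hr' line above is doing, here is a minimal sketch on a hypothetical DataFrame. It assumes the time strings are colon-separated with the hour first (the example values are made up). It also shows an equivalent approach that uses the pandas .str accessor instead of a list comprehension.

```python
import pandas as pd

# Hypothetical time strings, assumed to be colon-separated with the hour first
toy = pd.DataFrame({"time": ["03:15:00", "17:42:00", "00:05:00"]})

# Same approach as above: split each string on ":" and keep the first piece
toy["hr_list"] = [int(x.split(":")[0]) for x in toy["time"].values]

# Equivalent approach using pandas string methods
toy["hr_str"] = toy["time"].str.split(":").str[0].astype(int)

print(toy)
```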

You do not have to re-read the csv file each time if you do not want to, but make sure you keep in mind that any changes to the variable df could cause your answers to be incorrect later.

You can optionally re-read the csv file in for each question to make sure this doesn’t happen.
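Another way to protect the original data, sketched below on a hypothetical DataFrame, is to store each filtered result in a new variable and call .copy() before adding or changing columns. The names toy, toy_tmp, and mag_plus_one are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the homework file
toy = pd.DataFrame({"yr": [1990, 1995, 2000, 2005],
                    "mag": [0, 1, 2, 3]})

# Filter into a new variable and copy it so later edits do not touch toy
toy_tmp = toy[toy["yr"] >= 1995].copy()

# This new column exists only in the copy
toy_tmp["mag_plus_one"] = toy_tmp["mag"] + 1

print(len(toy), list(toy.columns))
print(len(toy_tmp), list(toy_tmp.columns))
```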

You will only get credit if your code correctly uses pandas to get these answers

Questions¶

Question 1 (0.5 pts) - Free half point¶

Create a subset of tornadoes from 1991 to 2020 (both 1991 and 2020 should be included).

Set the filtered subset to a variable named df_sub, and use this for the rest of the questions (unless otherwise specified).

Make sure that you do not modify df_sub, or you might get incorrect answers in later questions. When you are filtering to create a subset, use a variable like df_tmp or df_jja, etc., instead.

To test your subset, answer the following question:

How many tornadoes occurred during the period starting in 1991 and ending in 2020 (including both the start and end years)?

q1 should be an int and should be equal to 36,733.

It should be set using pandas (do NOT do something like q1 = 36733). I will not repeat this warning, but it applies to every question.

The question will originally look like this



q1 = None

print(q1)

Below is an example of an answer I am expecting for this question. You need to use pandas to answer the question (in this case, using pandas filtering).

Do not change the variable q1 and make sure you set your answer to the variable q1.

df_sub = df[(df.yr >= 1991) & (df.yr <= 2020)]

q1 = len(df_sub)

print(q1)

Question 2 (0.5 pts)¶

How many tornadoes from 1991 - 2020 (use df_sub!) occurred during the climatological winter (December, January, and February)?

q2 should be an int or float.



q2 = None

print(q2)

Question 3 (0.5 pts)¶

How many tornadoes occurred during the climatological spring (March, April, and May)?

q3 should be an int or float.



q3 = None

print(q3)

Question 4 (0.5 pts)¶

How many tornadoes occurred during the climatological summer (June, July, and August)?

q4 should be an int or float.



q4 = None

print(q4)

Question 5 (0.5 pts)¶

How many tornadoes occurred during the climatological fall (September, October, and November)?

q5 should be an int or float.



q5 = None

print(q5)

Double check these answers make sense:

print("the total count of tornadoes is", q1)

print("the sum of seasonal counts of tornadoes is", q2 + q3 + q4 + q5)

Question 6 (0.5 pts)¶

Which month had the most tornadoes during the period between 1991 and 2020?

q6 should be an int or float.



q6 = None

print(q6)

Question 7 (0.5 pts)¶

In Wisconsin, which month had the most tornadoes during the period between 1991 and 2020?

q7 should be an int or float.



q7 = None

print(q7)

Question 8 (0.5 pts)¶

What hour of the day had the most tornadoes during the period between 1991 and 2020?

q8 should be an int or float



q8 = None

print(q8)

Question 9 (0.5 pts)¶

How many total tornadoes occurred between 12:00 a.m. and 11:59 a.m. CST during the period between 1991 and 2020?

q9 should be an int or float



q9 = None

print(q9)

Question 10.a (0.25 pts)¶

How many total tornadoes occurred between 12:00 p.m. and 11:59 p.m. CST during the period between 1991 and 2020?

q10 should be an int or float



q10 = None

print(q10)

Check to make sure these values make sense:

print("the total count of tornadoes is", q1)

print("the sum of morning + afternoon counts of tornadoes is", q9 + q10)

Question 10.b (0.25 pts)¶

Based on your analysis, what time of the day experiences the most tornadoes: morning or afternoon?

ANSWER HERE IN THE MARKDOWN:

Question 11 (0.5 pts)¶

How many tornadoes occurred each year, on average, from 1991 to 2020?

q11 should be a float



q11 = None

display(q11)

Question 12.a (0.25 pts)¶

On average, how often does a tornado with a particular magnitude occur each year for the entire record provided in this HW assignment (1991 - 2020)?

Only include magnitudes 0, 1, 2, 3, 4, and 5.

q12 should be a pandas DataFrame.

Your output should look close to this if you rounded (you do not need to round):

mag	count
0	721.87
1	356.40
2	98.27
3	27.93
4	6.03
5	0.53



q12 = None

display(q12)

Question 12.b (0.25 pts)¶

Based on your analysis, EF0 tornadoes are (more/less) rare than EF5 tornadoes (pick one).

ANSWER:

Question 13 (0.5 pts)¶

What state experienced the highest count of tornadoes during the entire record provided in this HW assignment (1991 - 2020)?

q13 should be a str



q13 = None

print(q13)

Question 14 (0.5 pts)¶

What state experienced the highest count of magnitude 3 or greater (EF3 or greater) tornadoes during the entire record provided in this HW assignment (1991 - 2020)?

q14 should be a str



q14 = None

print(q14)

Question 15 (0.5 pts)¶

What was the mean path length (len) of Wisconsin (WI) tornadoes during the entire record provided in this HW assignment (1991 - 2020)?

q15 should be a float



q15 = None

print(q15)

Question 16 (0.5 pts)¶

What was the mean path length (len) of Alabama (AL) tornadoes during the entire record provided in this HW assignment (1991 - 2020)?

q16 should be a float



q16 = None

print(q16)

Question 17 (0.5 pts)¶

Of the states that experienced at least 100 tornadoes, which had the largest mean path length (len) during the entire record provided in this HW assignment (1991 - 2020)?

HINTS:

To find if a column matches at least one value in a list of values, use df['column_name'].isin(list_of_values).

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html

For example:

year_list = [1995, 1996, 1997]

display(df_sub[df_sub['yr'].isin(year_list)])

Will show you all rows from 1995, 1996, and 1997. For this question, how would you get a list of state abbreviations that have experienced at least 100 tornadoes?
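As a hedged sketch of the general idea, the example below uses a hypothetical DataFrame and a much smaller threshold (2 instead of 100). It builds a list of group labels that appear often enough and then passes that list to isin. The data are made up, and this is only one possible approach.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({"st": ["AA", "AA", "AA", "BB", "CC", "CC"],
                    "len": [1.0, 2.0, 3.0, 10.0, 4.0, 6.0]})

# Count the rows for each label, then keep labels with at least 2 rows
counts = toy["st"].value_counts()
common = counts[counts >= 2].index.tolist()

# Keep only the rows whose label is in that list
toy_common = toy[toy["st"].isin(common)]

print(sorted(common))
print(len(toy_common))
```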

You need to eventually use groupby for this problem.

To sort the values with the largest value first, use the method .sort_values(ascending=False) on the pandas Series that holds just the mean length for each state. If you have a DataFrame with one or more columns, you need to use .sort_values(by='len', ascending=False)

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
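The difference between sorting a Series and sorting a DataFrame can be sketched as follows on a hypothetical DataFrame. The data are made up. Selecting the column with single brackets after groupby gives a Series, while double brackets give a DataFrame that needs the by argument.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({"st": ["AA", "AA", "AA", "BB", "CC", "CC"],
                    "len": [1.0, 2.0, 3.0, 10.0, 4.0, 6.0]})

# Single brackets: the result is a Series indexed by group label
mean_series = toy.groupby("st")["len"].mean()
sorted_series = mean_series.sort_values(ascending=False)

# Double brackets: the result is a DataFrame, so sort_values needs by=
mean_df = toy.groupby("st")[["len"]].mean()
sorted_df = mean_df.sort_values(by="len", ascending=False)

# The first index label belongs to the group with the largest mean
print(sorted_series.index[0])
print(sorted_df.index[0])
```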

q17 should be a str



q17 = None

print(q17)

Question 18 (0.5 pts)¶

What state had the most injuries (inj) during the entire record provided in this HW assignment (1991 - 2020)?

You need to use groupby for this problem.

q18 should be a str



q18 = None

print(q18)

Question 19 (0.5 pts)¶

What state had the most fatalities (fat) during the entire record provided in this HW assignment (1991 - 2020)?

q19 should be a str



q19 = None

print(q19)

Question 20.a (0.25 points)¶

What is the mean count of fatalities per tornado, organized by EF/F scale (mag), during the entire record provided in this HW assignment (1991 - 2020)?

In other words, how many deaths occur, on average, when a mag 0 tornado occurs? mag 1 tornado? etc.

HINT: You can get this result by first finding the sum of fatalities per mag group, and then dividing it by the count of tornadoes in each mag group. You can perform mathematical operations on two pandas DataFrames in a similar way to numpy ndarrays, as long as they have the same index and column labels.
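The division described in the hint can be sketched on a hypothetical DataFrame as follows. The column names cat and val are made up. Both groupby results share the same index and column labels, so pandas lines up the matching groups before dividing.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({"cat": [0, 0, 1, 1, 1],
                    "val": [0, 2, 3, 3, 6]})

# Sum and count per group, each returned as a DataFrame indexed by cat
sums = toy.groupby("cat")[["val"]].sum()
counts = toy.groupby("cat")[["val"]].count()

# Element-wise division, matched on the shared index and column labels
ratio = sums / counts

print(ratio)
```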

Only consider magnitude 0 to 5.

Your output should look like this and q20 should be a pandas DataFrame.

mag	fat
0	0.000554
1	0.009914
2	0.093623
3	0.768496
4	3.232044
5	28.125000


q20 = None

print(q20)

Question 20.b (0.25 points)¶

Based on this result, what magnitude of tornado is most dangerous when it occurs?

PUT YOUR ANSWER HERE IN THE MARKDOWN:

Grader¶

Do not change the code below this line¶

answers = [q1, q2, q3, q4, q5, q6, q7, q8, q9, q10,
           q11, q12, q13, q14, q15, q16, q17, q18, q19, q20]
ans_s = {f"Question {x}": y for x, y in zip(range(1, 21), answers)}

print("Your answers were:")

for key, value in ans_s.items():
    print(f"{key}:", value)

import numpy as np

correct = {
    "Question 1": 36733,
    "Question 2": 3123,
    "Question 3": 15901,
    "Question 4": 12366,
    "Question 5": 5343,
    "Question 6": 5,
    "Question 7": 6,
    "Question 8": 17,
    "Question 9": 5936,
    "Question 10": 30797,
    "Question 11": 1224.433,
    "Question 12": {0: 721.867, 1: 356.40, 2: 98.266, 3: 27.9333, 4: 6.03333, 5: 0.53333},
    "Question 13": "TX",
    "Question 14": "KS",
    "Question 15": 3.4973390557939914,
    "Question 16": 5.086183673469393,
    "Question 17": "AR",
    "Question 18": "AL",
    "Question 19": "AL",
    "Question 20": {0: 0.00055411, 1: 0.009914, 2: 0.093623, 3: 0.768496, 4: 3.232044, 5: 28.125000}
}

total_score = 0

for key, value in ans_s.items():
    
    if (key == "Question 20") | (key == "Question 12"):
        
        try:
            # Start with full credit; it is removed if any magnitude is wrong
            points_earned = 0.5

            for sub_key, sub_value in value.items():

                correct_answer = correct[key][sub_key]

                if not np.isclose(sub_value, correct_answer):
                    print(key, "Your answer was", sub_value, "the correct answer was", correct_answer)
                    points_earned = 0

            total_score += points_earned
        except KeyError as e:
            print(key, "You have a magnitude in your DataFrame that is not recognized:", "mag =", e, "\nCheck the directions again.")
        
    else:
        if key in ['Question 13', 'Question 14', 'Question 17', 'Question 18', 'Question 19']:
            if value == correct[key]:
                total_score += 0.5
            else:
                print(key, "Your answer was", value, "the correct answer was", correct[key])
        else:
            if np.isclose(value, correct[key]):
                total_score += 0.5
            else:
                print(key, "Your answer was", value, "the correct answer was", correct[key])
                
print(f"Your Total Score is: {total_score}/10")
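The grader above compares numeric answers with np.isclose, which allows a small difference between two values. The sketch below shows how the default tolerances (rtol=1e-05, atol=1e-08) behave for two example comparisons. The input values are chosen only for illustration and are not produced by the homework code.

```python
import numpy as np

# A tiny difference relative to a large value is within the default tolerance
close_small = np.isclose(1224.4333333, 1224.433)

# Rounding a small value to two decimals can fall outside the tolerance
close_rounded = np.isclose(0.53, 0.53333)

print(close_small, close_rounded)
```

One consequence is that unrounded answers are the safer choice when a value is small.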