%%bash
wget -nc -q https://nimbus.niu.edu/courses/EAE483/exam1.csv

Exam 1 - Predicting storm mode using decision trees

Due March 22nd at 11:59 p.m.

You must not make any edits to your notebook after the due date. If edits are made or the link is submitted after the due date, there is a 10% penalty each day it is late.
Rename the notebook in the following pattern: Exam_1_First_Last.ipynb. You must rename it on triton before you save and download. There is a 5% penalty if you do not rename it.

Unique exam rules:

You must use the same version of the notebook from when you upload to colab to when you submit the link on Blackboard.
I am going to apply a 50% penalty if there are not 3 separate unique dates (in central time) with evidence of substantial work in your notebook history.
Yes, this means you must start working on the Exam, at the latest, 3 days before the due date. No, changing the name does not count as an edit. I need to see legitimate, substantial work on 3 different days, including new/edited code, written answers, etc.

Examples of acceptable (✅) or unacceptable (❌) notebook history:

I see notebook edits only on 03/12/2026, 03/16/2026, and 03/20/2026. ✅
I see notebook edits only on 03/16/2026 and 03/17/2026. ❌
I see notebook edits only on 03/20/2026. ❌

NOTE: You can work on any 3 days before the due date; these are just example dates.

When you are ready to submit:

File -> Save Notebook (do this often)
Click Share (upper right)
Change general access to “Anyone with the link”
Share the notebook with ahaberlie1@gmail.com
Click “Copy Link” and then “Done”
Add a comment to this title cell by clicking the three dots in the upper right corner of this cell and clicking on “Add a Comment”. Type in “I have completed the presubmission steps on MM/DD/YYYY.” and click on “comment” to save the comment.
Submit the link on Blackboard under the Exam 1 submission link.

There is a 5% penalty for each email I have to send you to remind you to follow the steps above. Here are some tips if you are concerned about falsely being accused of missing steps:

Take a screenshot (with a clearly visible clock) of the changed settings on colab. You do not have to send this to me, just save it on your computer to show me in case you lose points.
Paste the copied URL into a browser in “Incognito Mode” (in other words, without your colab login information). If you are able to open the notebook, you at least have allowed “anyone” to view it.
Send me an email / check before the due date so that I can attempt to view and edit your notebook.

Exam 1 Description:

Your job is to use the storm mode dataset provided to you below to predict the label_name field.

Using columns other than the ones below (feature_list) for training purposes will result in a 50% penalty. You are allowed to create new columns if I can clearly see in your code that you calculated the new column based on one or more of the existing columns.

feature_list = ['mean_intensity', 'intensity_variance', 'max_intensity', '20dbz_area', '40dbz_area', '50dbz_area', 'minor_axis_length', 'major_axis_length', 'year'] # only columns you are allowed to use for creating the model

The following column is what you are trying to predict. Do not use this as a predictor feature.

'label_name' # name of storm morphology for that sample
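As a rough illustration of keeping the predictors and the target separate, the sketch below builds a tiny made-up table (not the exam data) and selects only the allowed columns. The names `demo`, `X`, `y`, and `extra_column` are hypothetical and used only for this example.

```python
import pandas as pd

feature_list = ['mean_intensity', 'intensity_variance', 'max_intensity',
                '20dbz_area', '40dbz_area', '50dbz_area',
                'minor_axis_length', 'major_axis_length', 'year']

# Tiny made-up table: the allowed columns, the target, and one extra column.
demo = pd.DataFrame({name: [1.0, 2.0, 3.0] for name in feature_list})
demo["label_name"] = ["class_a", "class_b", "class_a"]  # made-up labels
demo["extra_column"] = [0, 0, 0]  # hypothetical column that is NOT an allowed predictor

# Predictors: only the allowed columns. Target: label_name.
X = demo[feature_list]
y = demo["label_name"]

# Quick sanity checks that the target and the extra column did not leak into X.
print("label_name" in X.columns)    # False
print("extra_column" in X.columns)  # False
```

Selecting columns by an explicit list, rather than dropping the ones you do not want, makes it easy to see exactly which predictors a model was given.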

You must complete the exam on your own without a partner. If I see evidence of sharing code, all parties will receive a zero. Be prepared to explain your code in my office if I suspect extensive usage of ChatGPT or other generative AI tools.

Rubric (50 total points):

There are 6 parts to this exam. Each part should include code that comprehensively supports your findings and at least one “markdown” cell describing what you are seeing, why you are making decisions, or anything else you would like me to know. I would also like you to use obvious separators between each section (e.g., a large title header).

1. Description (10 pts): Show me the statistics of each column in the dataset. What are some patterns that you notice?

The student comprehensively described the relevant dataset features using high quality figures, tables, and other relevant data (10 pts)
The student adequately described the relevant dataset features using figures, tables, and other relevant data (7 pts)
The student adequately described the dataset using relevant data (5 pts)
The student provided a low-quality description of the dataset using relevant data (3 pts)
The student did not describe the dataset (0 pts)
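One possible starting point for column statistics is sketched below on a synthetic table. The values and the labels `class_a`, `class_b`, and `class_c` are made up for illustration and are not taken from exam1.csv; the variable names are hypothetical.

```python
import random
import pandas as pd

# Synthetic stand-in for the exam data; values and labels are made up.
random.seed(0)
n = 200
demo = pd.DataFrame({
    "mean_intensity": [random.gauss(30, 5) for _ in range(n)],
    "max_intensity": [random.gauss(55, 8) for _ in range(n)],
    "label_name": [random.choice(["class_a", "class_b", "class_c"]) for _ in range(n)],
})

# Count, mean, std, min, quartiles, and max for every numeric column.
summary = demo.describe()

# Number of samples per label, to check for class imbalance.
label_counts = demo["label_name"].value_counts()

# Per-label mean of each numeric column, to see which features differ between labels.
per_label_mean = demo.groupby("label_name").mean(numeric_only=True)

print(summary)
print(label_counts)
print(per_label_mean)
```

Tables like these are only a first step; figures (for example, histograms or box plots per label) usually make patterns easier to describe.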

2. Preprocessing (5 pts): I want you to create a new feature named `min_max_ratio` which is the ratio of `minor_axis_length` to `major_axis_length` for each sample. Use pandas functionality to perform this task.

The student correctly transformed the 'major_axis_length' and 'minor_axis_length' columns to set them up for creating a machine learning model (5 pts)
The student attempted to transform the 'major_axis_length' and 'minor_axis_length' columns to set them up for creating a machine learning model, but did not do it correctly (3 pts)
The student made no attempt to transform the 'major_axis_length' and 'minor_axis_length' columns (0 pts)
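A minimal sketch of a column-wise ratio in pandas, using made-up axis lengths rather than the exam data (the `demo` table is hypothetical):

```python
import pandas as pd

# Made-up axis lengths for illustration only.
demo = pd.DataFrame({
    "minor_axis_length": [10.0, 20.0, 15.0],
    "major_axis_length": [40.0, 20.0, 60.0],
})

# Dividing one column by another is vectorized: pandas pairs up the rows
# by index, so no explicit loop is needed.
demo["min_max_ratio"] = demo["minor_axis_length"] / demo["major_axis_length"]

print(demo)
```

On real data it is worth checking for zeros or missing values in the denominator column before dividing, since those produce `inf` or `NaN` in the result.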

3. Setting up machine learning subsets (5 pts): Create the required subsets for generating and assessing a machine learning model.

The student grouped the data into three subsets and gave them correct names based on best practices discussed in class and compared the subsets to the overall dataset (5 pts)
The student grouped the data into three subsets and gave them incorrect names or did not use the best practices discussed in class (3 pts)
The student made no attempt to group the data into three subsets (0 pts)
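One common way to get three subsets is to call scikit-learn's `train_test_split` twice, as sketched below on made-up data. The 60/20/20 proportions, the stratification by label, and the names `train`, `val`, and `test` are assumptions for this example; the best practices discussed in class (for example, splitting by `year`) may differ and take precedence.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with one feature and a balanced two-class label.
demo = pd.DataFrame({
    "feature_a": [float(i) for i in range(100)],
    "label_name": ["class_a", "class_b"] * 50,
})

# Step 1: hold out 20% of all rows as the test set.
train_val, test = train_test_split(
    demo, test_size=0.2, stratify=demo["label_name"], random_state=0
)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which is 60% / 20% of the full table.
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["label_name"], random_state=0
)

# Compare subset sizes and label proportions to the overall table.
for name, subset in [("all", demo), ("train", train), ("validation", val), ("test", test)]:
    proportions = subset["label_name"].value_counts(normalize=True).round(2).to_dict()
    print(name, len(subset), proportions)
```

If the label proportions in a subset differ noticeably from the full table, metrics computed on that subset may not reflect performance on the whole dataset.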

4. Training a Decision Tree classifier (10 pts): Train a `DecisionTreeClassifier` on your data after creating the required subsets.

The student used the correct subset for training and tried at least **3** different model configurations, assessed their generalizability using the correct subset, and selected the best model configuration based on those results (10 pts)
The student used the correct subset for training and tried at least **3** different model configurations, assessed their generalizability using the incorrect subset, and selected the best model configuration based on those results (7 pts)
The student used the incorrect subset for training and tried at least **2** different model configurations, assessed their generalizability using the incorrect subset, and selected the best model configuration based on those results (5 pts)
The student used the incorrect subset for training **1** model configuration, assessed its generalizability using the incorrect subset, and selected the best model configuration with no evidence (3 pts)
The student did not use the correct subsets for training and selected the best model configuration with no evidence (1 pt)
The student did not train a classifier (0 pts)
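The sketch below shows one way to compare several tree configurations on a validation subset, using synthetic data from `make_classification` rather than the exam data. The three hyperparameter settings, the use of accuracy as the selection metric, and the placeholder `zid` value are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

zid = 9999999  # placeholder seed for this sketch

# Synthetic three-class problem standing in for the storm mode data.
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# Three hypothetical configurations to compare.
configs = [
    {"max_depth": 3},
    {"max_depth": 6},
    {"max_depth": None, "min_samples_leaf": 10},
]

val_scores = []
for params in configs:
    clf = DecisionTreeClassifier(random_state=zid, **params)
    clf.fit(X_train, y_train)                                   # fit on training rows only
    val_scores.append(accuracy_score(y_val, clf.predict(X_val)))  # score on validation rows

# Pick the configuration with the highest validation accuracy.
best_idx = max(range(len(val_scores)), key=val_scores.__getitem__)
best_params = configs[best_idx]

for params, score in zip(configs, val_scores):
    print(params, round(score, 3))
print("best:", best_params)
```

The test subset is created here but deliberately left untouched, so that it can give an unbiased estimate after the configuration has been chosen.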

5. Assessing the Decision Tree classifier (10 pts): Demonstrate the ability to assess how well the model predicts the labels using approaches we discussed in class or used on assignments/labs.

The student used the correct subset for testing the generalizability of the classifier, calculated comprehensive performance metrics and variable importance, and summarized and interpreted the results (10 pts)
The student used the correct subset for testing the generalizability of the classifier, calculated some performance metrics, and summarized and interpreted the results (7 pts)
The student used the correct subset for testing the generalizability of the classifier, summarized and interpreted the results without any evidence (5 pts)
The student used the incorrect subset for testing the generalizability of the classifier, summarized and interpreted the results without any evidence (3 pts)
The student made no attempt to assess the classifier (0 pts)
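A minimal sketch of test-set assessment with scikit-learn is shown below on synthetic data. The particular metrics (confusion matrix, per-class precision/recall/F1, and impurity-based feature importance) are one reasonable selection, not necessarily the full set covered in class, and the feature names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the storm mode data.
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)

# Precision, recall, and F1 for each class, plus overall accuracy.
report = classification_report(y_test, y_pred)

# Impurity-based importance of each feature in the fitted tree.
importances = clf.feature_importances_

print(cm)
print(report)
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(name, round(value, 3))
```

Impurity-based importances describe how the fitted tree used each feature; they do not by themselves show that a feature is physically meaningful.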

6. Summary (10 pts): Summarize the model workflow from start to finish, with a focus on how someone might use the model and what pitfalls / caveats / issues they may experience if using the model on a similar dataset.

The student provided a detailed assessment of the model, including where it did or did not perform well, what extra data may be needed to improve predictions, and explained *why* the model was producing the results it did (10 pts)
The student provided a detailed assessment of the model, including where it did or did not perform well, what extra data may be needed to improve predictions (7 pts)
The student provided a detailed assessment of the model, including where it did or did not perform well (5 pts)
The student provided a basic assessment of the model (3 pts)
The student did not attempt to summarize their findings (0 pts)

Random state

Please enter your ZID in the code below and use it as the random_state parameter in your Decision Tree code (see performance metrics homework). If you do not change this value to your ZID (without the ‘z’ or ‘a’), there is a 10% penalty. Your Decision Tree printout (e.g., print(rf_clf)) should show random_state set to your ZID. This will make your results slightly different from those of other students, which is expected and done on purpose.

By changing this to your ZID, you are agreeing to complete this exam in your own words, without assistance from genAI. Extensive / obvious usage of genAI will result in a 50% reduction in points for that answer. Your notebook history will be reviewed for suspicious edits. Do not import numpy again after running this cell.

import numpy as np

zid = 9999999
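As a quick check that the seed is actually being passed to the model, the sketch below creates a classifier with the placeholder value and prints it. The variable name `clf` is hypothetical; the placeholder `9999999` stands in for your own ZID digits.

```python
from sklearn.tree import DecisionTreeClassifier

zid = 9999999  # placeholder; replace with your own ZID digits

# Pass the ZID as random_state so any randomness in tree building is repeatable.
clf = DecisionTreeClassifier(random_state=zid)

# The printed representation lists random_state with the value you passed.
print(clf)

# The value can also be read back directly from the model's parameters.
print(clf.get_params()["random_state"])
```

If the printed model does not show your ZID, the parameter was not set on that model.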

Here is the code that reads in the csv file (exam1.csv). Make sure the csv file is in the same folder as your notebook. If it is not, rerun the first cell in this notebook to download the file.

Make sure that you run the code above to set your “random seed” before continuing with the exam.

Do not change the code in the following cell block that reads in the csv file. Doing so will result in a 0.

It should look exactly like this when you are done:

import pandas as pd
np.random.seed(zid)

df = pd.read_csv("exam1.csv")

df['year'] = np.random.randint(1996, 2013, size=len(df))

df

import pandas as pd
np.random.seed(zid)

df = pd.read_csv("exam1.csv")

df['year'] = np.random.randint(1996, 2013, size=len(df))

df
