%%bash
mkdir storm_mode
cd storm_mode
wget -nc -q https://raw.githubusercontent.com/ahaberlie/unidata-workshop-2018/refs/heads/master/workshop/data/training/sample_train_data.csv
wget -nc -q https://raw.githubusercontent.com/ahaberlie/unidata-workshop-2018/refs/heads/master/workshop/data/training/sample_test_data.csv
L8 - Ensemble Methods with Decision Trees¶
Directions:
Please rename the file by clicking on “LX-First-Last.ipynb” where X is the lab number, and replace First and Last with your first and last name.
Click File -> Save to make sure your most recent edits are saved.
In the upper right hand corner of the screen, click on “Share”. Click on “Restricted” and change it to “Anyone with the link”. Make sure you also share it with
ahaberlie1@gmail.com. Copy the link and submit it on Blackboard. Make sure you follow these steps completely, or I will be unable to grade your work.
Overview¶
This lab will help you understand scikit-learn and its ensemble model capabilities. We will walk through some examples of how scikit-learn can help solve Geoscience problems. Periodically, I will 1) ask you to either repeat the demonstrated code in a slightly different way; or 2) ask you to combine two or more techniques to solve a problem.
You can use generative AI to help answer these problems. The answer should still be in your own words. Think of the generative AI descriptions as those from a book. You still have to cite your source and you cannot plagiarize directly from the source. For every question that you used generative AI for help, please reference the generative AI you used and what your prompt or prompts were.
However, it is crucial that you understand the code well enough to effectively use generative AI tools that are likely to be widely available and recommended for use at many organizations. Although they are improving at an incredible rate, they still produce bugs, especially with domain-specific and complex problems. Make sure that you verify the answers before putting them in your own words.
Model performance assessment¶
When assessing model performance, we typically separate the sample data into three subsets. Think of your own experiences preparing for an exam as motivation for the subsets:
Training data - this is what the machine learning approach uses to generate a model. Similar to lectures/homework/assignments/etc. in class.
Validation data - this is what is used to do quick “spot checks” on the model and help determine optimal model settings. This is similar to doing a practice quiz and learning what you need to focus on before the exam.
Testing data - this is what is used to test the performance of the model. This is similar to finally taking the exam.
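The temporal split used in this lab is built by hand below; as a point of comparison, a purely random three-way split can be made with scikit-learn's `train_test_split`. This sketch uses toy arrays with hypothetical sizes, not the lab data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (50 rows, 2 columns) and binary labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Carve off 20% for testing, then 25% of the remainder for validation,
# giving a 60/20/20 train/validation/test split.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

A random split like this would scramble the years, which is why we use a year-based split instead: it better mimics predicting on data from a time period the model has never seen.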
I have provided you with the training and testing datasets. We will also generate the validation dataset below:
import pandas as pd
import numpy as np
np.random.seed(4)
def custom_train_split(df, val_year_start=2012):
    '''Takes a pandas DataFrame with training data
    and sorts it into two subsets: 1) training data before
    `val_year_start`; and 2) validation data from
    `val_year_start` onward.

    Parameters:
    df: pandas DataFrame
        Original training data.
    val_year_start: int
        Year that defines the split between training and
        validation data. Default 2012. Must be between
        2006 and 2012.

    Returns:
    (df_train, df_val): tuple
        DataFrames split into training and validation data.
    '''
    df_train = df[df['datetime'].dt.year < val_year_start]
    df_val = df[df['datetime'].dt.year >= val_year_start]
    return df_train, df_val
def convert_to_numpy(df, remove_cols=None, binary_label='MCS',
                     label_col='label_name'):
    '''Converts a DataFrame to a machine-learning-friendly
    format by keeping only numeric columns and dropping the
    columns listed in `remove_cols`.

    Parameters:
    df: pandas DataFrame
        Original DataFrame with all columns.
    remove_cols: list
        Numeric columns to remove. Defaults to
        ['index', 'label', 'label1'].
    binary_label: str
        If not None, convert the label to 1 where `label_col`
        is equal to `binary_label` in `df`.
    label_col: str
        Column name in `df` that contains the original labels.

    Returns:
    (df_x, df_y, numeric_cols): tuple
        Feature array, label array, and the list of feature
        column names used.
    '''
    df_ = df.copy()
    if remove_cols is None:
        remove_cols = ['index', 'label', 'label1']
    numeric_cols = df_.select_dtypes(include="number").columns.tolist()
    numeric_cols = [c for c in numeric_cols if c not in remove_cols]
    if binary_label is not None:
        df_[binary_label] = df_[label_col] == binary_label
    df_x = df_[numeric_cols].to_numpy()
    df_y = df_[binary_label].to_numpy()
    return df_x, df_y, numeric_cols
df_train = pd.read_csv("storm_mode/sample_train_data.csv")
df_test = pd.read_csv("storm_mode/sample_test_data.csv")
df_train['datetime'] = pd.to_datetime(df_train['datetime'])
df_train = df_train.sort_values(by='datetime')
df_test['datetime'] = pd.to_datetime(df_test['datetime'])
df_train, df_val = custom_train_split(df_train)
Problem 1¶
A pandas DataFrame has a method called sample that will randomly sample the dataset (with or without replacement); you can assign the result to a variable. As we discussed with Random Forest, this is one of the foundational aspects of bootstrapping.
In a random forest with 50 estimators, each tree is trained on its own bootstrapped version of the training data. That means we generate 50 resampled datasets, calculate the feature means for each one, and then plot the distribution of those means to show how sampling variability creates diversity across trees. We will standardize the values to identify outliers.
The size of each bootstrapped sample is the same as the original dataset. The difference is some rows are repeated. You can see the repeated rows as a percentage of total rows reported below.
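That repeat percentage is no accident: when you draw n rows with replacement from n rows, the expected fraction of draws that repeat an earlier draw approaches 1 - 1/e (about 36.8%) as n grows. A minimal sketch on a toy DataFrame (hypothetical values, not the lab data):

```python
import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({"x": np.arange(n)})

# Bootstrap: sample n rows with replacement from an n-row frame.
boot = df.sample(n=n, replace=True, random_state=0)

# Fraction of sampled rows whose index already appeared earlier in the sample.
frac_repeats = boot.index.duplicated().sum() / n
print(f"empirical: {100 * frac_repeats:.1f}% repeats")

# Exact expectation for finite n: 1 - (1 - 1/n)^n, which tends to 1 - 1/e.
print(f"theory:    {100 * (1 - (1 - 1/n) ** n):.1f}%")
```

Equivalently, each tree sees only about 63% of the unique training rows, which is one source of the diversity that makes the ensemble work.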
In the textbox below labeled “Answer” for this problem:
Problem 1a¶
Follow these steps and answer where prompted:
Slowly increase the value of n_bootstraps from 1 to 50 and discuss what you notice about the z-scores for each variable. What does this tell you about the datasets that each Decision Tree within the Random Forest will see and use for training?
Answer:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 20, 5
from sklearn.preprocessing import StandardScaler
import pandas as pd
n_bootstraps = 1
_, _, num_cols = convert_to_numpy(df_train)
bootstrap_means = []
repeat_list = []
for i in range(n_bootstraps):
    train_sub = df_train.sample(n=len(df_train), replace=True, random_state=i)
    repeats = train_sub.index.duplicated()
    repeat_list.append(np.sum(repeats))
    bootstrap_means.append(train_sub[num_cols].mean())
print(f"Mean % of repeats per bootstrap = {100 * np.mean(repeat_list) / len(df_train):.2f}%")
bootstrap_means_df = pd.DataFrame(bootstrap_means)
scaler = StandardScaler().set_output(transform="pandas")
bootstrap_z = scaler.fit_transform(bootstrap_means_df)
ax = bootstrap_z.boxplot(rot=45)
ax.set_title("Distribution of Mean Values per Bootstrap")
Mean % of repeats per bootstrap = 36.57%

Random Forest also chooses a random subset of the available features to consider at each split decision. For classifiers, the number of features sklearn picks defaults to the square root of the total number of features.
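To see the mechanics without giving away the answer for Problem 1b, here is a sketch on a toy DataFrame with hypothetical column names: `sample` with `axis=1` draws columns instead of rows, and `math.ceil` rounds the square root up to an int.

```python
import math
import pandas as pd

# Toy frame with 9 numeric feature columns (hypothetical names f0..f8).
toy = pd.DataFrame({f"f{i}": range(3) for i in range(9)})

# Round the square root of the column count up to the nearest int.
sqrt_n = math.ceil(math.sqrt(len(toy.columns)))  # ceil(sqrt(9)) = 3

# axis=1 (or axis="columns") samples columns rather than rows.
subset = toy.sample(n=sqrt_n, axis=1, random_state=0)
print(list(subset.columns))
```

Rerunning without a fixed `random_state` returns a different column subset each time, which is the behavior Problem 1b asks you to observe.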
Using the sample documentation, modify the code below to provide a random subset of sqrt_n columns and set it equal to the variable df_random_features.
Problem 1b¶
Follow these steps and answer where prompted:
sqrt_n: equal to the square root of the length of num_cols, rounded up to the nearest int.
df_random_features: in one line, use the pandas sample function to get a random subset of features (i.e., columns) from df_train[num_cols].
Run the code a few times and answer below: What happened to the DataFrame? How is it different from the original df_train[num_cols]?
Answer for #3:
sqrt_n = 1
df_random_features = df_train[num_cols]
display(df_random_features)
Problem 2¶
A bootstrapped version of the dataset is provided to each Decision Tree in a Random Forest. Additional variance is introduced when features are randomly selected at each split decision. This second step can be pretty difficult to code manually, so we are going to utilize the built-in functionality of DecisionTreeClassifier to simulate how one Decision Tree in a Random Forest might be generated.
Problem 2a¶
Follow these steps and answer where prompted:
Set bootstrap equal to a bootstrapped sample of df_train with replacement.
Modify the convert_to_numpy call that sets train_x and train_y so it converts the bootstrapped DataFrame to numpy, not df_train.
Set n_sqrt based on num_cols and the code you created in 1b.
Create your DecisionTreeClassifier by adding the max_features parameter and setting it equal to n_sqrt.
Run the code multiple times. What do you notice about the plotted trees? Put your answer in the text box below:
Answer for #5
from sklearn.tree import DecisionTreeClassifier, plot_tree
plt.rcParams['figure.figsize'] = 15, 10
bootstrap = df_train
train_x, train_y, num_cols = convert_to_numpy(df_train)
n_sqrt = 1
CART = DecisionTreeClassifier(max_features=1, max_depth=2)
CART.fit(train_x, train_y)
plot_tree(CART, feature_names=num_cols, class_names=['Not MCS', 'MCS']);
Problem 4¶
Recall that Random Forest is not only a bootstrapping method, but also an aggregating method (i.e., “bagging”). We need to consider “votes” from many individual Decision Trees to arrive at the final prediction. We can simulate this by following the steps below:
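The aggregation step itself is just a majority vote over the trees' predictions. A minimal sketch with hard-coded, hypothetical votes (1 = MCS, 0 = Not MCS):

```python
import numpy as np

# Hypothetical votes from an ensemble of 10 trees.
votes = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# bincount tallies the votes per class: [no_count, yes_count].
counts = np.bincount(votes, minlength=2)
predicted = int(np.argmax(counts))        # class with the most votes
confidence = counts.max() / counts.sum()  # fraction of trees that agreed

print(f"predicted class {predicted} with {100 * confidence:.0f}% agreement")
```

Here 7 of 10 trees vote for class 1, so the ensemble predicts 1 with 70% agreement; that agreement fraction is what we will report as the ensemble's confidence below.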
Problem 4a¶
Create a variable named n_estimators and set it equal to 10 for now.
Create a list called random_trees that will store the trained decision trees.
Create a loop with n_estimators iterations.
For each iteration, train a decision tree like you did in Problem 2a. Do not plot the tree!!! You have been warned!!!
Append the tree to the random_trees list.
Congratulations, you trained a Random Forest from scratch (kind of...).
n_estimators = 1
random_trees = []
for i in range(n_estimators):
    bootstrap = df_train.sample(n=len(df_train), replace=True, random_state=i)
    train_x, train_y, num_cols = convert_to_numpy(bootstrap)
    n_sqrt = 1
    CART = DecisionTreeClassifier(max_features=1, max_depth=2, random_state=i)
    CART.fit(train_x, train_y)
    random_trees.append(CART)
print(random_trees)
[DecisionTreeClassifier(max_depth=2, max_features=1, random_state=0)]
Problem 4b¶
Next, we need to create and aggregate predictions. We should do this on the validation subset. For this subset, we do not bootstrap. Bootstrapping should only be done on the training data. Using validation data to adjust our model is our attempt to assess the generalizability of the model on the testing dataset.
We can demonstrate on one labeled validation sample.
Convert df_val to numpy by setting the val_x and val_y variables.
Choose one random sample from val_x and val_y by randomly selecting an index between 0 and len(val_y). We need this to extract the random sample from each numpy array.
Create a list named predictions to store the votes.
Loop through random_trees and make a prediction using val_x. Save the vote in predictions.
Find the most common label; that is your prediction.
The output should look like this (will be different based on your chosen idx):
The actual answer is MCS
----------------------
The random tree ensemble votes:
----------------------
tree #0 votes for Not MCS
tree #1 votes for Not MCS
tree #2 votes for MCS
***many rows removed for space just in this markdown. These should be included in your output below***
tree #48 votes for MCS
tree #49 votes for Not MCS
----------------------
The predicted label is MCS. 64.00% of trees agreed.
Run the predictions for different values of idx. Can you find any cases where the confidence of the model was low (e.g., max probability < 80%)? When the confidence was low, what did the votes look like in your ensemble? How would you explain to "Grandma Goody" what the probability means for the ensemble?
Pick an idx below and fill out the following. If you see something like P(Not MCS), that is equivalent to the confidence of the ensemble:
Actual label:
Predicted label:
P(Not MCS)
P(MCS)
Next, change n_estimators to 100 in Problem 4, rerun it, and then run the code below. Fill in the new values:
Actual label:
Predicted label:
P(Not MCS)
P(MCS)
Did the prediction change? Did the confidence change? What did you notice about the run with 10 predictors and the one with 100 predictors? Answer below:
Answer:
import numpy as np
val_x, val_y, num_cols = convert_to_numpy(df_train)
idx = 0
sample_x, sample_y = val_x[idx], val_y[idx]
label_convert = {0: "Not MCS", 1: "MCS"}
predictions = []
for tree in random_trees:
    vote = tree.predict([sample_x])[0]
    predictions.append(1)
predictions = np.array(predictions)
print(f"The actual answer for index {idx} is {label_convert[int(sample_y)]}")
print("----------------------")
print("The random tree ensemble votes for:")
print("----------------------")
for i, vote in enumerate(predictions):
    print(f"tree #{i} votes for {label_convert[int(vote)]}")
num_trees = len(random_trees)
# how many == 1?
yes_count = 1
# how many == 0?
no_count = 1
# what fraction == 1?
yes_frac = 1
# what fraction == 0?
no_frac = 1
combined_p = [no_frac, yes_frac]
max_label = np.argmax(combined_p)
p_agree = 100 * np.max(combined_p)
print("----------------------")
print(f"Predicted label: {label_convert[max_label]}")
print(f"{p_agree:.2f}% of trees agreed.")
print(f"P(Not MCS) = {100*no_frac:.2f}%")
print(f"P(MCS) = {100*yes_frac:.2f}%")
The actual answer for index 0 is Not MCS
----------------------
The random tree ensemble votes for:
----------------------
tree #0 votes for MCS
----------------------
Predicted label: Not MCS
100.00% of trees agreed.
P(Not MCS) = 100.00%
P(MCS) = 100.00%
Problem 5¶
Based on how we have evaluated models in the past, evaluate the performance of your model at 10 and 100 estimators in the code block below using the testing dataset:
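As a reminder of the mechanics (on synthetic stand-in data, not the answer for this lab's dataset), scoring a fitted classifier with sklearn's accuracy_score looks like this; in your answer, swap in your own ensemble and the testing dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the lab you would use df_test instead.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Compare performance as the number of estimators grows.
for n_estimators in (10, 100):
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    rf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, rf.predict(X_te))
    print(f"{n_estimators:>3} estimators: accuracy = {acc:.3f}")
```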