
L8 - Ensemble Methods with Decision Trees

%%bash
mkdir -p storm_mode
cd storm_mode

wget -nc -q https://raw.githubusercontent.com/ahaberlie/unidata-workshop-2018/refs/heads/master/workshop/data/training/sample_train_data.csv
wget -nc -q https://raw.githubusercontent.com/ahaberlie/unidata-workshop-2018/refs/heads/master/workshop/data/training/sample_test_data.csv


Directions:

  1. Please rename the file by clicking on “LX-First-Last.ipynb” where X is the lab number, and replace First and Last with your first and last name.

  2. Click File -> Save to make sure your most recent edits are saved.

  3. In the upper right hand corner of the screen, click on “Share”. Click on “Restricted” and change it to “Anyone with the link”. Make sure you also share it with ahaberlie1@gmail.com.

  4. Copy the link and submit it on Blackboard. Make sure you follow these steps completely, or I will be unable to grade your work.

Overview

This lab will help you understand scikit-learn and its ensemble model capabilities. We will walk through some examples of how scikit-learn can help solve Geoscience problems. Periodically, I will either 1) ask you to repeat the demonstrated code in a slightly different way, or 2) ask you to combine two or more techniques to solve a problem.

You can use generative AI to help answer these problems. The answer should still be in your own words. Think of the generative AI descriptions as those from a book. You still have to cite your source and you cannot plagiarize directly from the source. For every question that you used generative AI for help, please reference the generative AI you used and what your prompt or prompts were.

However, it is crucial that you understand the code well enough to effectively use generative AI tools that are likely to be widely available and recommended for use at many organizations. Although they are improving at an incredible rate, they still produce bugs, especially with domain-specific and complex problems. Make sure that you verify the answers before putting them in your own words.

Model performance assessment

When assessing model performance, we typically separate the sample data into three subsets. Think of your own experiences preparing for an exam as motivation for the subsets:

  1. Training data - this is what the machine learning approach uses to generate a model. Similar to lectures/homework/assignments/etc. in class.

  2. Validation data - this is what is used to do quick “spot checks” on the model and help determine optimal model settings. This is similar to doing a practice quiz and learning what you need to focus on before the exam.

  3. Testing data - this is what is used to test the performance of the model. This is similar to finally taking the exam.

I have provided you with the training and testing datasets. We will also generate the validation dataset below:

import pandas as pd
import numpy as np

np.random.seed(4)

def custom_train_split(df, val_year_start=2012):
    '''Takes a pandas DataFrame with training data
    and sorts it into two subsets: 1) training data from
    years before `val_year_start`; and 2) validation data
    from `val_year_start` onward.

    Parameters:
        df: pandas DataFrame
            Original training data.
        val_year_start: int
            Year that defines the split between training and
            validation data. Default 2012. Must be between
            2006 and 2012.

    Returns:
        (df_train, df_val): tuple
            DataFrames split into training and validation data
    '''
    df_train = df[df['datetime'].dt.year < val_year_start]
    df_val = df[df['datetime'].dt.year >= val_year_start]

    return df_train, df_val

def convert_to_numpy(df, remove_cols=None, binary_label='MCS',
                     label_col='label_name'):
    '''Converts DataFrame to machine learning friendly
    format by removing non-numeric columns and columns
    provided by `remove_cols`.

    Parameters:
        df: pandas DataFrame
            Original DataFrame with all columns
        remove_cols: list
            Other columns to remove that are numeric.
        binary_label: str
            If not None, convert label to 1 if `label_col`
            is equal to `binary_label` in `df`.
        label_col: str
            Column name in `df` that contains original labels.

    Returns:
        (df_x, df_y, numeric_cols): tuple
            Feature array, label array, and the list of
            feature column names used for training.
    '''
    df_ = df.copy()

    if remove_cols is None:
        remove_cols = ['index', 'label', 'label1']

    numeric_cols = df_.select_dtypes(include="number").columns.tolist()
    numeric_cols = [c for c in numeric_cols if c not in remove_cols]

    if binary_label is not None:
        df_[binary_label] = df_[label_col] == binary_label

    df_x = df_[numeric_cols].to_numpy()
    df_y = df_[binary_label].to_numpy()

    return df_x, df_y, numeric_cols

df_train = pd.read_csv("storm_mode/sample_train_data.csv")
df_test = pd.read_csv("storm_mode/sample_test_data.csv")

df_train['datetime'] = pd.to_datetime(df_train['datetime'])
df_train = df_train.sort_values(by='datetime')
df_test['datetime'] = pd.to_datetime(df_test['datetime'])

df_train, df_val = custom_train_split(df_train)

Problem 1

A pandas DataFrame has a method called sample that randomly samples rows from the dataset (with or without replacement). As we discussed with Random Forest, this resampling is one of the foundational aspects of bootstrapping.

In a random forest with 50 estimators, each tree is trained on its own bootstrapped version of the training data. To see what that implies, we will generate up to 50 resampled datasets, calculate the feature means for each one, and plot the distribution of those means to show how sampling variability creates diversity across trees. We will standardize the values to make outliers easier to spot.

The size of each bootstrapped sample is the same as the original dataset; the difference is that some rows are repeated. The percentage of repeated rows relative to the total is reported below.

In the textbox below labeled “Answer” for this problem:

Problem 1a

Follow these steps and answer where prompted:

Slowly increase the value of n_bootstraps from 1 to 50. Discuss what you notice about the z-scores for each variable. What does this tell you about the datasets that each Decision Tree within the Random Forest will see and use for training?

Answer:

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 20, 5

from sklearn.preprocessing import StandardScaler
import pandas as pd

n_bootstraps = 1

_, _, num_cols = convert_to_numpy(df_train)

bootstrap_means = []

repeat_list = []

for i in range(n_bootstraps):
    train_sub = df_train.sample(n=len(df_train), replace=True, random_state=i)
    repeats = train_sub.index.duplicated()
    repeat_list.append(np.sum(repeats))

    bootstrap_means.append(train_sub[num_cols].mean())

print(f"Mean % of repeats per bootstrap = {100 * np.mean(repeat_list) / len(df_train):.2f}%")
bootstrap_means_df = pd.DataFrame(bootstrap_means)

scaler = StandardScaler().set_output(transform="pandas")

bootstrap_z = scaler.fit_transform(bootstrap_means_df)

ax = bootstrap_z.boxplot(rot=45)

ax.set_title("Distribution of Mean Values per Bootstrap")
Mean % of repeats per bootstrap = 36.57%
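The repeat percentage reported above is not a coincidence. As a quick back-of-the-envelope check (a sketch, not part of the lab code, using a hypothetical dataset size), the expected fraction of duplicated draws in a bootstrap sample approaches 1/e, about 36.8%:

```python
import math

n = 1000  # hypothetical dataset size

# In a bootstrap of n draws with replacement, a given row is missed by
# one draw with probability (1 - 1/n), so it never appears with
# probability (1 - 1/n)**n, which approaches 1/e as n grows.
p_never = (1 - 1 / n) ** n

# The number of duplicated draws equals n minus the number of unique
# rows drawn, so the expected repeat fraction equals the expected
# fraction of original rows that were left out entirely.
print(f"Expected repeat fraction: {100 * p_never:.2f}%")
print(f"Limit (1/e):              {100 * math.exp(-1):.2f}%")
```

This lines up with the roughly 36-37% repeat rate observed in the bootstraps above.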

Random Forest also chooses a random subset of the available features to consider at each split decision. By default, scikit-learn's classifier uses the square root of the total number of features.
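As a sketch of the idea (using a hypothetical feature count, not the lab data), each split draws a fresh random subset of feature indices and only searches those for the best threshold:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

n_features = 16                       # hypothetical total feature count
k = math.ceil(math.sqrt(n_features))  # the "sqrt" rule gives 4 here

# Each split considers a different random subset of feature indices,
# which is a second source of diversity between trees.
for split in range(3):
    candidates = rng.choice(n_features, size=k, replace=False)
    print(f"split {split}: candidate features {sorted(candidates.tolist())}")
```

Because every split sees a different candidate set, even two trees trained on identical data can end up with different structures.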

Using the sample documentation, modify the code below to provide a random subset of sqrt_n columns and set it equal to the variable df_random_features.

Problem 1b

Follow these steps and answer where prompted:

  1. sqrt_n: equal to the square root of the length of num_cols rounded up to the nearest int.

  2. df_random_features: in one line, use the pandas sample function to get a random subset of features (i.e., columns) from df_train[num_cols].

  3. Run the code a few times and answer below: What happened to the DataFrame? How is it different from the original df_train[num_cols]?

Answer for #3:

sqrt_n = 1

df_random_features = df_train[num_cols]

display(df_random_features)

Problem 2

A bootstrapped version of the dataset is provided to each Decision Tree in a Random Forest. Additional variance is introduced when features are randomly selected at each split decision. This second step can be pretty difficult to code manually, so we are going to utilize the built-in functionality of DecisionTreeClassifier to simulate how one Decision Tree in a Random Forest might be generated.

Problem 2a

Follow these steps and answer where prompted:

  1. Set bootstrap equal to a bootstrapped sample of df_train with replacement.

  2. Modify the convert_to_numpy call that sets train_x and train_y so it converts the bootstrapped DataFrame to numpy, not df_train.

  3. Set n_sqrt based on num_cols and the code you created in 1b.

  4. Create your DecisionTreeClassifier by adding the max_features parameter and setting it equal to n_sqrt.

  5. Run the code multiple times. What do you notice about the plotted trees? Put your answer in the text box below:

Answer for #5

from sklearn.tree import DecisionTreeClassifier, plot_tree
plt.rcParams['figure.figsize'] = 15, 10


bootstrap = df_train

train_x, train_y, num_cols = convert_to_numpy(df_train)

n_sqrt = 1

CART = DecisionTreeClassifier(max_features=1, max_depth=2)

CART.fit(train_x, train_y)

plot_tree(CART, feature_names=num_cols, class_names=['Not MCS', 'MCS']);

Problem 4

Recall that Random Forest is not only a bootstrapping method, but also an aggregating method (i.e., “bagging”). We need to consider “votes” from many individual Decision Trees to arrive at the final prediction. We can simulate this by following the steps below:
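The aggregation ("voting") at the heart of bagging can be sketched with a hypothetical list of votes (not the lab data): tally the labels, take the most common one as the prediction, and treat the winning fraction as the ensemble's confidence.

```python
from collections import Counter

# Hypothetical votes from a 7-tree ensemble (1 = MCS, 0 = Not MCS)
votes = [1, 0, 1, 1, 0, 1, 0]

tally = Counter(votes)
label, count = tally.most_common(1)[0]   # most common label and its count
confidence = count / len(votes)          # fraction of trees that agreed

print(f"Predicted label: {label} ({100 * confidence:.1f}% of trees agreed)")
```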

Problem 4a:

  1. create a variable named n_estimators and set it equal to 10 for now.

  2. create a list that will store the trained decision trees called random_trees.

  3. create a loop with n_estimators iterations

  4. for each iteration, train a decision tree like you did in Problem 2. Do not plot the tree!!! You have been warned!!!

  5. append the tree to the random_trees list.

Congratulations, you trained a Random Forest from scratch (kind of...).

n_estimators = 1

random_trees = []

for i in range(n_estimators):

    bootstrap = df_train.sample(n=len(df_train), replace=True, random_state=i)

    train_x, train_y, num_cols = convert_to_numpy(bootstrap)

    n_sqrt = 1

    CART = DecisionTreeClassifier(max_features=1, max_depth=2, random_state=i)

    CART.fit(train_x, train_y)

    random_trees.append(CART)

print(random_trees)


[DecisionTreeClassifier(max_depth=2, max_features=1, random_state=0)]

Problem 4b

Next, we need to create and aggregate predictions. We should do this on the validation subset. For this subset, we do not bootstrap. Bootstrapping should only be done on the training data. Using validation data to adjust our model is our attempt to assess the generalizability of the model on the testing dataset.

We can demonstrate on one labeled validation sample.

  1. convert df_val to numpy, by setting the val_x and val_y variables.

  2. choose one random sample from val_x and val_y by randomly selecting an index between 0 and len(val_y). We need this to extract the random sample from each numpy array.

  3. create a list named predictions to store the votes.

  4. loop through random_trees and make a prediction using val_x. Save the vote in predictions.

  5. find the most common label and that is your prediction.

The output should look like this (will be different based on your chosen idx):

The actual answer is MCS
----------------------
The random tree ensemble votes:
----------------------
tree #0 votes for Not MCS
tree #1 votes for Not MCS
tree #2 votes for MCS
***many rows removed for space just in this markdown. These should be included in your output below***
tree #48 votes for MCS
tree #49 votes for Not MCS
----------------------
The predicted label is MCS. 64.00% of trees agreed.
  1. Run the predictions for different values of idx. Can you find any cases where the confidence of the model was low (e.g., max probability < 80%)? When the confidence was low, what did the votes look like in your ensemble? How would you explain to “Grandma Goody” what the probability means for the ensemble?

  2. Pick an idx below and fill out the following. If you see something like P(Not MCS), that is equivalent to the confidence of the ensemble:

  • Actual label:

  • Predicted label:

  • P(Not MCS)

  • P(MCS)

Next, change n_estimators to 100 in Problem 4, rerun it, and then run the code below. Fill in the new values:

  • Actual label:

  • Predicted label:

  • P(Not MCS)

  • P(MCS)

Did the prediction change? Did the confidence change? What did you notice about the run with 10 predictors and the one with 100 predictors? Answer below:

Answer:

import numpy as np

val_x, val_y, num_cols = convert_to_numpy(df_train)

idx = 0
sample_x, sample_y = val_x[idx], val_y[idx]

label_convert = {0: "Not MCS", 1: "MCS"}

predictions = []

for tree in random_trees:
    vote = tree.predict([sample_x])[0]
    predictions.append(vote)

predictions = np.array(predictions)

print(f"The actual answer for index {idx} is {label_convert[int(sample_y)]}")
print("----------------------")
print("The random tree ensemble votes for:")
print("----------------------")

for i, vote in enumerate(predictions):
    print(f"tree #{i} votes for {label_convert[int(vote)]}")

num_trees = len(random_trees)

# how many == 1?
yes_count = 1

# how many == 0?
no_count = 1

# what fraction == 1?
yes_frac = 1

# what fraction == 0?
no_frac = 1

combined_p = [no_frac, yes_frac]
max_label = np.argmax(combined_p)
p_agree = 100 * np.max(combined_p)

print("----------------------")
print(f"Predicted label: {label_convert[max_label]}")
print(f"{p_agree:.2f}% of trees agreed.")
print(f"P(Not MCS) = {100*no_frac:.2f}%")
print(f"P(MCS)     = {100*yes_frac:.2f}%")
The actual answer for index 0 is Not MCS
----------------------
The random tree ensemble votes for:
----------------------
tree #0 votes for MCS
----------------------
Predicted label: Not MCS
100.00% of trees agreed.
P(Not MCS) = 100.00%
P(MCS)     = 100.00%

Problem 5

Based on how we have evaluated models in the past, evaluate the performance of your model at 10 and 100 estimators in the code block below using the testing dataset: