1  XGBoost classification model for pairwise ranking

1.1 Objective

Objective: To test the performance of an XGBoost model trained to rank pairs of neighboring schedules.

1.2 Background

In this experiment we develop a Machine Learning model using XGBoost that can evaluate two neighboring schedules and rank them according to preference. This ranking model can be applied to quickly guide the search process towards a ‘good enough’ solution.

The choice of an ordinal model over a cardinal model is based on the consideration that it is significantly easier to determine whether alternative A is superior to alternative B than to quantify the exact difference between them (Ho et al. 2000). This makes intuitive sense when considering the scenario of holding two identical-looking packages and deciding which one is heavier, as opposed to estimating the precise weight difference between them.

1.3 Hypothesis

An XGBoost ranking model achieves superior computational efficiency compared to evaluating each element of a pair individually, leading to faster overall performance in ranking tasks.

1.4 Methodology

1.4.1 Tools and Materials

We use packages from scikit-learn to prepare the training data and evaluate the model, and the XGBClassifier interface from the XGBoost library.

import time
import math
import json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.base import clone
import xgboost as xgb
from xgboost.callback import TrainingCallback
import plotly.graph_objects as go
import pickle
import random

1.4.2 Experimental Design

To compare an XGBoost Machine Learning model with a simple evaluation of each individual element of the pair, we will use a pairwise ranking approach. The objective is to rank two neighboring schedules according to preference.

from functions import get_v_star
N = 12 # Number of patients
T = 18 # Number of intervals
d = 5 # Length of each interval
s = [0.0, 0.27, 0.28, 0.2, 0.15, 0.1] # Service times distribution
q = 0.20 # Probability of a scheduled patient not showing up
w = 0.8 # Weight for the waiting time in objective function
num_schedules = 20000 # Number of schedules to sample
v_star = get_v_star(T)

We will create a random set of pairs of neighboring schedules with \(N = 12\) patients and \(T = 18\) intervals of length \(d = 5\).

A neighbor of a schedule x is a schedule x’ in which single patients have been shifted one interval to the left. E.g.: ([2,1,1,2], [1,2,0,3]) are neighbors and ([2,1,1,2], [2,1,3,0]) are not, because [1,2,0,3] - [2,1,1,2] = [-1, 1, -1, 1], whereas [2,1,3,0] - [2,1,1,2] = [0, 0, 2, -2].

Service times will have a discrete distribution. The probability of a scheduled patient not showing up will be \(q = 0.2\).

The objective function will be the weighted average of the total waiting time of all patients and the overtime. The model will be trained to predict which of the two neighboring schedules has the lower objective value, and the prediction time will be recorded. The same schedules will then be evaluated by computing their objective values directly and ranking them.
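
The exact evaluation used throughout this experiment is calculate_objective from the functions module. Purely as an illustration of the idea, the sketch below accumulates waiting time and overtime with a deterministic Lindley-type recursion, collapsing the service-time distribution to its mean and the no-show probability to its expectation (and assuming s is a distribution over durations 0, 1, …, len(s)-1 in the same time units as d); the real implementation propagates the full distributions, so its values will differ.

# Illustrative sketch only -- not the calculate_objective used in this experiment.
def sketch_objective(schedule, s, d, q, w):
    mean_service = sum(i * p for i, p in enumerate(s))  # expected service duration
    effective = (1 - q) * mean_service                   # correct for expected no-shows
    backlog = 0.0        # work (in time units) still waiting at the start of the interval
    total_wait = 0.0
    for x_t in schedule:
        for _ in range(x_t):          # patients scheduled at the start of this interval
            total_wait += backlog     # each waits for all work ahead of it
            backlog += effective
        backlog = max(backlog - d, 0.0)   # the interval absorbs d time units of work
    overtime = backlog                    # work remaining after the last interval
    return w * total_wait + (1 - w) * overtime

# Hypothetical 6-interval schedule, for illustration only:
print(sketch_objective([2, 1, 1, 2, 0, 0], s, d, q, w))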

1.4.3 Variables

  • Independent Variables: A list of tuples with pairs of neighboring schedules.
  • Dependent Variables: A list with rankings for each tuple of pairwise schedules. E.g.: if the rank for ([2,1,1], [1,1,2]) equals 0, the schedule with index 0 ([2,1,1]) has the lower objective value.

1.4.4 Data Collection

The data set will be generated by simulation: random samples will be drawn from the population of all possible schedules, and for each sampled schedule a random neighboring schedule will be created.

1.4.5 Sample Size and Selection

Sample Size: The total population size equals \({{N + T -1}\choose{N}} \approx\) 52 million. For this experiment we use a relatively small sample of 20,000 pairs of schedules.

Sample Selection: The samples will be drawn uniformly at random from the lexicographic enumeration of all possible schedules, to accurately reflect the combinatorial nature of the problem and to ensure unbiased sampling from the entire space.
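
The sampling itself is done with random_combination_with_replacement from the functions module (used in Step 1 below); its implementation is not shown here. As an illustration of how such uniform sampling over the combinatorial space can work, the sketch below uses the stars-and-bars correspondence: every schedule of N patients over T intervals maps to a choice of N ‘star’ positions among N + T - 1 slots, so drawing those positions uniformly yields a uniformly random schedule.

import math
import random

# Population size: multisets of N patients over T intervals (stars and bars).
print(f"Total number of combinations: {math.comb(N + T - 1, N):,}")  # 51,895,935 for N=12, T=18

def sample_schedule(T, N):
    """Illustrative uniform sampler (the actual helper may be implemented differently)."""
    stars = sorted(random.sample(range(N + T - 1), N))
    schedule = [0] * T
    for j, pos in enumerate(stars):
        schedule[pos - j] += 1  # number of 'bars' before this star = its interval index
    return schedule

print(sample_schedule(T, N))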

1.4.6 Experimental Procedure

The experiment involves multiple steps, beginning with data preparation and concluding with model evaluation. The diagram below illustrates the sequence of steps.

graph TD
    A["From population"] -->|"Sample"| B["Random subset"]
    B --> |Create neighbors| C["Features: Schedule pairs"]
    C --> |Calculate objectives| D["Objective values"]
    D --> |Rank objectives| E["Labels: Rankings"]
    E --> |"Split dataset"| F["Training set"]
    E --> |"Split dataset"| G["Test set"]
    F --> |"Train"| H["Model"]
    H["Model"] --> |"Apply"| G["Test set"]
    G["Test set"] --> |"Evaluate"| I["Performance"]

Step 1: Randomly select a subset of schedules.

from functions import random_combination_with_replacement

start = time.time()
schedules = random_combination_with_replacement(T, N, num_schedules)
print(f"Sampled: {len(schedules)} schedules\n")
h = random.choices(range(len(schedules)), k=7)
print(f"Sampled schedules: {h}")
for i in h:
    print(f"Schedule: {schedules[i]}")
end = time.time()
data_prep_time = end - start

print(f"\nProcessing time: {data_prep_time} seconds\n")
Total number of combinations: 51,895,935
Sampled: 20000 schedules

Sampled schedules: [3143, 16624, 16871, 2155, 2873, 3097, 6994]
Schedule: [7, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Schedule: [6, 1, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Schedule: [5, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Schedule: [5, 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Schedule: [4, 2, 1, 2, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Schedule: [3, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
Schedule: [7, 0, 2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

Processing time: 0.2260730266571045 seconds

Step 2: Create pairs of neighboring schedules.

from functions import create_neighbors_list

start = time.time()
neighbors_list = [create_neighbors_list(schedule, v_star) for schedule in schedules] # This can be done in parallel to improve speed
end = time.time()
for i in h:
    original_schedule = neighbors_list[i][0]
    neighbor_schedule = neighbors_list[i][1]
    difference = [x - y for x, y in zip(neighbors_list[i][0], neighbors_list[i][1])]
    print(f"Neighbors\n{original_schedule}\n{neighbor_schedule}\n{difference}")
training_set_feat_time = end - start
print(f"\nProcessing time: {training_set_feat_time} seconds\n")
Neighbors
[7, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[7, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Neighbors
[6, 1, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[7, 1, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-1, 0, 1, 0, 0, 0, 0, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Neighbors
[5, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[5, 0, 3, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
[0, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1]
Neighbors
[5, 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 1, 4, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[1, 0, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1]
Neighbors
[4, 2, 1, 2, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-1, 1, -1, 1, 0, 0, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Neighbors
[3, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[3, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0, -1]
Neighbors
[7, 0, 2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[7, 1, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, -1, 0, 1, -1, 0, 0, 1, 0, -1, 1, 0, 0, 0, 0, 0, 0, 0]

Processing time: 1.674152135848999 seconds

Step 3: For each schedule in each pair calculate the objective. For each pair save the index of the schedule that has the lowest objective value.

from functions import calculate_objective

objectives_schedule_1 = [w * calculate_objective(neighbor[0], s, d, q)[0] + (1 - w) * calculate_objective(neighbor[0], s, d, q)[1] for neighbor in neighbors_list]
start = time.time()
objectives_schedule_2 = [w * calculate_objective(neighbor[1], s, d, q)[0] + (1 - w) * calculate_objective(neighbor[1], s, d, q)[1] for neighbor in neighbors_list]
end = time.time()
training_set_lab_time = end - start
objectives = [[obj, objectives_schedule_2[i]] for i, obj in enumerate(objectives_schedule_1)]
rankings = np.argmin(objectives, axis=1).tolist()
for i in range(5):
    print(f"Objectives: {objectives[i]}, Ranking: {rankings[i]}")

print(f"\nProcessing time: {training_set_lab_time} seconds\n")

# Saving neighbors_list and objectives to a pickle file
file_path = 'datasets/neighbors_and_objectives.pkl'
with open(file_path, 'wb') as f:
    pickle.dump({'neighbors_list': neighbors_list, 'objectives': objectives, 'rankings': rankings}, f)
    print(f"Data saved successfully to '{file_path}'")
Objectives: [40.7132903786003, 44.29093770976293], Ranking: 0
Objectives: [53.109972522414175, 43.64776101084806], Ranking: 1
Objectives: [53.4264119642738, 46.525074507348194], Ranking: 1
Objectives: [47.571610887503645, 47.571610887503645], Ranking: 0
Objectives: [49.47536904144756, 40.17117767200208], Ranking: 1

Processing time: 17.09611225128174 seconds

Data saved successfully to 'datasets/neighbors_and_objectives.pkl'

Step 4: Create training and test sets.

# Prepare the dataset
X = []
for neighbors in neighbors_list:
    X.append(neighbors[0] + neighbors[1])

X = np.array(X)
y = np.array(rankings)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 5: Train the XGBoost model.

flowchart TD
    A[Start] --> B[Initialize StratifiedKFold]
    B --> C[Initialize XGBClassifier]
    C --> D[Set results as empty list]
    D --> E[Loop through each split of cv split]
    E --> F[Get train and test indices]
    F --> G[Split X and y into X_train, X_test, y_train, y_test]
    G --> H[Clone the classifier]
    H --> I[Call fit_and_score function]
    I --> J[Fit the estimator]
    J --> K[Score on training set]
    J --> L[Score on test set]
    K --> M[Return estimator, train_score, test_score]
    L --> M
    M --> N[Append the results]
    N --> E
    E --> O[Loop ends]
    O --> P[Print results]
    P --> Q[End]

class CustomCallback(TrainingCallback):
    def __init__(self, period=10):
        self.period = period

    def after_iteration(self, model, epoch, evals_log):
        if (epoch + 1) % self.period == 0:
            print(f"Epoch {epoch}, Evaluation log: {evals_log['validation_0']['logloss'][epoch]}")
        return False
    
def fit_and_score(estimator, X_train, X_test, y_train, y_test):
    """Fit the estimator on the train set and score it on both sets"""
    estimator.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0)

    train_score = estimator.score(X_train, y_train)
    test_score = estimator.score(X_test, y_test)

    return estimator, train_score, test_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=94)

# Load the base model parameters from a JSON file;
# the classifier below uses early stopping via early_stopping_rounds.
with open("model_params.json", "r") as f:
    model_params = json.load(f)
    
# Initialize the EarlyStopping callback with validation dataset
early_stop = xgb.callback.EarlyStopping(
    rounds=10, metric_name='logloss', data_name='validation_0', save_best=True
)

clf = xgb.XGBClassifier(
    tree_method="hist",
    max_depth=model_params["max_depth"],
    min_child_weight=model_params["min_child_weight"],
    gamma=model_params["gamma"],
    subsample=model_params["subsample"],
    colsample_bytree=model_params["colsample_bytree"],
    learning_rate=model_params["learning_rate"],
    n_estimators=model_params["n_estimators"],
    early_stopping_rounds=9,
    #callbacks=[CustomCallback(period=50), early_stop],
    callbacks=[CustomCallback(period=50)],
)
print("Params: ")
for key, value in model_params.items():
    print(f" {key}: {value}")

start = time.time()
results = []

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    est, train_score, test_score = fit_and_score(
        clone(clf), X_train, X_test, y_train, y_test
    )
    results.append((est, train_score, test_score))
end = time.time()
training_time = end - start
print(f"\nTraining time: {training_time} seconds\n")
Params: 
 max_depth: 6
 min_child_weight: 1
 gamma: 0.1
 subsample: 0.8
 colsample_bytree: 0.8
 learning_rate: 0.1
 n_estimators: 100
Epoch 49, Evaluation log: 0.05455219040368683
Epoch 99, Evaluation log: 0.04216296073211015
Epoch 49, Evaluation log: 0.05368845318281092
Epoch 99, Evaluation log: 0.04094662195513956
Epoch 49, Evaluation log: 0.04904532701848075
Epoch 99, Evaluation log: 0.03687723997185094
Epoch 49, Evaluation log: 0.05073861599853263
Epoch 99, Evaluation log: 0.0393132749446886
Epoch 49, Evaluation log: 0.04947142685431754
Epoch 99, Evaluation log: 0.03850802336719062

Training time: 0.9882938861846924 seconds

Step 6: To evaluate the performance of the XGBoost ranking model, we use Stratified K-Fold Cross-Validation with 5 splits, ensuring each fold maintains the same class distribution as the original dataset. Using StratifiedKFold(n_splits=5, shuffle=True, random_state=94), the dataset is divided into five folds; in each iteration, the model is trained on four folds and evaluated on the remaining fold. The custom callback CustomCallback(period=50) prints the evaluation log every 50 boosting rounds.

The fit_and_score function will fit the model and score it on both the training and test sets, storing the results for each fold. This provides insight into the model’s performance across different subsets of the data, helps in understanding how well the model generalizes to unseen data and identifies potential overfitting or underfitting issues. The overall processing time for the cross-validation will also be recorded.

# Print results
for i, (est, train_score, test_score) in enumerate(results):
    print(f"Fold {i+1} - Train Score (Accuracy): {train_score:.4f}, Test Score (Accuracy): {test_score:.4f}")
Fold 1 - Train Score (Accuracy): 0.9925, Test Score (Accuracy): 0.9862
Fold 2 - Train Score (Accuracy): 0.9922, Test Score (Accuracy): 0.9872
Fold 3 - Train Score (Accuracy): 0.9920, Test Score (Accuracy): 0.9890
Fold 4 - Train Score (Accuracy): 0.9924, Test Score (Accuracy): 0.9878
Fold 5 - Train Score (Accuracy): 0.9922, Test Score (Accuracy): 0.9882

Training the model on the entire dataset provides a final model that has learned from all available data. Recording the training time helps in understanding the computational efficiency and scalability of the model with the given hyperparameters.

# Fit the model on the entire dataset
# Initialize the XGBClassifier without early stopping here

start = time.time()

clf = xgb.XGBClassifier(
    tree_method="hist",
    max_depth=model_params["max_depth"],
    min_child_weight=model_params["min_child_weight"],
    gamma=model_params["gamma"],
    subsample=model_params["subsample"],
    colsample_bytree=model_params["colsample_bytree"],
    learning_rate=model_params["learning_rate"],
    n_estimators=model_params["n_estimators"],
)

clf.fit(X, y)
end = time.time()
modeling_time = end - start

# Calculate and print the training accuracy
training_accuracy = clf.score(X, y)
print(f"Training accuracy: {training_accuracy * 100:.2f}%\n")

print(f"\nTraining time: {modeling_time} seconds\n")
Training accuracy: 99.23%


Training time: 0.1360487937927246 seconds

1.4.7 Validation

Generating test schedules and calculating their objectives and rankings allows us to create a new dataset for evaluating the model’s performance on unseen data.

num_test_schedules = 1000

test_schedules = random_combination_with_replacement(T, N, num_test_schedules)
test_neighbors = [create_neighbors_list(test_schedule, v_star) for test_schedule in test_schedules] # This can be done in parallel to improve speed

print(f"Sampled: {len(test_schedules)} schedules\n")

test_objectives_schedule_1 = [w * calculate_objective(test_neighbor[0], s, d, q)[0] + (1 - w) * calculate_objective(test_neighbor[0], s, d, q)[1] for test_neighbor in test_neighbors]
# Start time measurement for the evaluation
start = time.time()
test_objectives_schedule_2 = [w * calculate_objective(test_neighbor[1], s, d, q)[0] + (1 - w) * calculate_objective(test_neighbor[1], s, d, q)[1] for test_neighbor in test_neighbors]
test_rankings = [0 if test_obj < test_objectives_schedule_2[i] else 1 for i, test_obj in enumerate(test_objectives_schedule_1)]
end = time.time()
evaluation_time = end - start

# Combine the objectives for each pair for later processing
test_objectives = [[test_obj, test_objectives_schedule_2[i]] for i, test_obj in enumerate(test_objectives_schedule_1)]

print(f"\nEvaluation time: {evaluation_time} seconds\n")

for i in range(6):
    print(f"Neighbors: {test_neighbors[i]},\nObjectives: {test_objectives[i]}, Ranking: {test_rankings[i]}\n")
Total number of combinations: 51,895,935
Sampled: 1000 schedules


Evaluation time: 0.898392915725708 seconds

Neighbors: ([6, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0], [6, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
Objectives: [34.10370626270946, 36.3890775770396], Ranking: 0

Neighbors: ([6, 1, 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [6, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]),
Objectives: [42.168055975490745, 33.26579003483442], Ranking: 1

Neighbors: ([4, 4, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [4, 5, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
Objectives: [43.17359852761168, 47.900340916589684], Ranking: 0

Neighbors: ([5, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [5, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
Objectives: [55.90431273631407, 63.446546787123296], Ranking: 0

Neighbors: ([7, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [7, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]),
Objectives: [53.4264119642738, 45.89097380607223], Ranking: 1

Neighbors: ([4, 1, 5, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [5, 1, 4, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]),
Objectives: [34.31133011310308, 39.00838306816456], Ranking: 0

Making predictions on new data and comparing them to the actual rankings provides an evaluation of the model’s performance in practical applications. Recording the prediction time helps in understanding the model’s efficiency during inference.

input_X = test_neighbors
X_new = []
for test_neighbor in input_X:
    X_new.append(test_neighbor[0] + test_neighbor[1])
    
# Predict the target for new data
y_pred = clf.predict(X_new)

# Probability estimates
start = time.time()
y_pred_proba = clf.predict_proba(X_new)
end = time.time()
prediction_time = end - start
print(f"\nPrediction time: {prediction_time} seconds\n")

print(f"test_rankings = {np.array(test_rankings)[:6]}, \ny_pred = {y_pred[:6]}, \ny_pred_proba = \n{y_pred_proba[:6]}")

Prediction time: 0.004464864730834961 seconds

test_rankings = [0 1 0 0 1 0], 
y_pred = [0 1 0 0 1 0], 
y_pred_proba = 
[[0.97218204 0.02781794]
 [0.00402802 0.995972  ]
 [0.99254835 0.00745166]
 [0.99486387 0.00513611]
 [0.07889068 0.9211093 ]
 [0.99056184 0.00943818]]

Calculating the ambiguousness of the predicted probabilities helps in understanding the model’s confidence in its predictions. High ambiguousness indicates uncertain predictions, while low ambiguousness indicates confident predictions.

Ambiguousness is calculated using the formula for entropy:

\[ H(X) = - \sum_{i} p(x_i) \log_b p(x_i) \]

Where in our case:

  • \(H(X)\) is the ambiguousness of the random variable \(X\) - the set of probability scores for the predicted rankings,

  • \(p(x_i)\) is probability score \(x_i\),

  • \(\log_b\) is the logarithm with base \(b\) (here \(\log_2\) as we have two predicted values),

  • The sum is taken over all possible outcomes of \(X\).
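
The actual computation is performed by calculate_ambiguousness, imported from the functions module below. A minimal sketch of such a computation, directly following the entropy formula above (the real implementation may differ, for instance in how zero probabilities are handled):

import numpy as np

def ambiguousness_sketch(proba: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Base-2 entropy of each row of predicted class probabilities.
    Values range from 0 (fully confident) to 1 (maximally ambiguous for two classes)."""
    p = np.clip(proba, eps, 1.0)           # avoid log2(0) for very confident predictions
    return -np.sum(p * np.log2(p), axis=1)

# Example: a confident prediction and a 50/50 prediction (illustrative values).
print(ambiguousness_sketch(np.array([[0.97, 0.03], [0.5, 0.5]])))  # approx. [0.19, 1.0]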

Calculating cumulative error rate and cumulative accuracy helps in understanding how the model’s performance evolves over the dataset.

Visualizing the relationship between ambiguousness and error provides insights into how uncertainty in the model’s predictions correlates with its accuracy. This can help in identifying patterns and understanding the conditions under which the model performs well or poorly.

from functions import calculate_ambiguousness

errors = np.abs(y_pred - np.array(test_rankings))

ambiguousness: np.ndarray = calculate_ambiguousness(y_pred_proba)
df = pd.DataFrame({"Ambiguousness": ambiguousness, "Error": errors}).sort_values(by="Ambiguousness")
df['Cumulative error rate'] = df['Error'].expanding().mean()
# Calculate cumulative accuracy
df['Cumulative accuracy'] = 1 - df['Cumulative error rate']
df.head()


# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Error"],
                    mode="markers",
                    name="Error",
                    marker=dict(size=9)))
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Cumulative accuracy"],
                    mode="lines",
                    name="Cum. accuracy",
                    line = dict(width = 3, dash = 'dash')))
fig.update_layout(
    title={
        'text': f"Error vs Ambiguousness</br></br><sub>n={num_test_schedules}</sub>",
        'y': 0.95,  # Keep the title slightly higher
        'x': 0.02,
        'xanchor': 'left',
        'yanchor': 'top'
    },
    xaxis_title="Ambiguousness",
    yaxis_title="Error / Accuracy",
    hoverlabel=dict(font=dict(color='white')),
    margin=dict(t=70)  # Add more space at the top of the chart
)
fig.show()

1.4.8 Hyperparameter Optimization

In the initial model the choice of hyperparameters was based on default values, examples from demos, or trial and error. To improve the model’s performance, we applied hyperparameter optimization: a grid search with cross-validation over a predefined set of hyperparameters for the XGBoost model. The best hyperparameters were selected based on the model’s performance on the validation set and then used to train the final model.
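
The search that produced best_trial_params.json is not included in this notebook. As a sketch of the grid search with cross-validation described above (the grid below is purely hypothetical and not the one actually used):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
import xgboost as xgb

# Hypothetical search grid -- not the grid behind best_trial_params.json.
param_grid = {
    "max_depth": [4, 5, 6],
    "learning_rate": [0.1, 0.2, 0.3],
    "n_estimators": [100, 300, 500],
    "subsample": [0.7, 0.8, 0.9],
}

search = GridSearchCV(
    estimator=xgb.XGBClassifier(tree_method="hist"),
    param_grid=param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=94),
    n_jobs=-1,
)
search.fit(X, y)  # X, y: the full feature matrix and rankings from Step 4

print(search.best_params_, search.best_score_)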

from functions import compare_json

with open("best_trial_params.json", "r") as f:
    best_trial_params = json.load(f)
    
differences = compare_json(model_params, best_trial_params)

params_tbl = pd.DataFrame(differences)
params_tbl.rename(index={'json1_value': 'base parameters', 'json2_value': 'optimized parameters'}, inplace=True)
print(params_tbl)
                      max_depth     gamma  subsample  colsample_bytree  learning_rate  n_estimators
base parameters               6  0.100000   0.800000          0.800000       0.100000           100
optimized parameters          5  0.304548   0.781029          0.922528       0.239488           490
# Fit the model on the entire dataset
# Initialize the XGBClassifier without early stopping here

# Load the best trial parameters from a JSON file.
with open("best_trial_params.json", "r") as f:
    best_trial_params = json.load(f)

start = time.time()

clf = xgb.XGBClassifier(
    tree_method="hist",
    max_depth=best_trial_params["max_depth"],
    min_child_weight=best_trial_params["min_child_weight"],
    gamma=best_trial_params["gamma"],
    subsample=best_trial_params["subsample"],
    colsample_bytree=best_trial_params["colsample_bytree"],
    learning_rate=best_trial_params["learning_rate"],
    n_estimators=best_trial_params["n_estimators"],
)

clf.fit(X, y)
end = time.time()
modeling_time = end - start
print(f"\nTraining time: {modeling_time} seconds\n")

# Calculate and print the training accuracy
training_accuracy = clf.score(X, y)
print(f"Training accuracy: {training_accuracy * 100:.2f}%")

Training time: 0.36020803451538086 seconds

Training accuracy: 99.97%
# Predict the target for new data
y_pred = clf.predict(X_new)

# Probability estimates
start = time.time()
y_pred_proba = clf.predict_proba(X_new)
end = time.time()
prediction_time = end - start
print(f"\nPrediction time: {prediction_time} seconds\n")

print(f"test_rankings = {np.array(test_rankings)[:6]}, \ny_pred = {y_pred[:6]}, \ny_pred_proba = \n{y_pred_proba[:6]}")

Prediction time: 0.004562854766845703 seconds

test_rankings = [0 1 0 0 1 0], 
y_pred = [0 1 0 0 1 0], 
y_pred_proba = 
[[9.5283890e-01 4.7161117e-02]
 [1.6450882e-05 9.9998355e-01]
 [9.9937177e-01 6.2821433e-04]
 [9.9999166e-01 8.3718114e-06]
 [8.5286498e-03 9.9147135e-01]
 [9.9326497e-01 6.7350427e-03]]
errors = np.abs(y_pred - np.array(test_rankings))
ambiguousness: np.ndarray = calculate_ambiguousness(y_pred_proba)
df = pd.DataFrame({"Ambiguousness": ambiguousness, "Error": errors, "Schedules": test_neighbors, "Objectives": test_objectives}).sort_values(by="Ambiguousness")
df['Cumulative error rate'] = df['Error'].expanding().mean()
# Calculate cumulative accuracy
df['Cumulative accuracy'] = 1 - df['Cumulative error rate']
df.head()
Ambiguousness Error Schedules Objectives Cumulative error rate Cumulative accuracy
757 3.321928e-09 0 ([5, 0, 2, 2, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,... [29.27167190264852, 22.08056364258747] 0.0 1.0
85 3.321928e-09 0 ([5, 3, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0,... [36.82335597592348, 22.700339918266497] 0.0 1.0
181 2.913796e-06 0 ([7, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,... [39.958752516119404, 27.90929658917552] 0.0 1.0
768 2.913796e-06 0 ([6, 1, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,... [47.25699782218697, 32.81150132439508] 0.0 1.0
353 2.913796e-06 0 ([4, 4, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0,... [36.869197819247226, 26.235646300158415] 0.0 1.0
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Error"],
                    mode="markers",
                    name="Error",
                    marker=dict(size=9),
                    customdata=df[["Schedules", "Objectives"]],
                    hovertemplate=
                        "Ambiguousness: %{x} <br>" +
                        "Error: %{y} <br>" +
                        "Schedules: %{customdata[0][0]} / %{customdata[0][1]} <br>" +
                        "Objectives: %{customdata[1]} <br>"
                    ))
                  
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Cumulative accuracy"],
                    mode="lines",
                    name="Cum. accuracy",
                    line = dict(width = 3, dash = 'dash')))
fig.update_layout(
    title={
        'text': f"Error vs Ambiguousness</br></br><sub>n={num_test_schedules}</sub>",
        'y': 0.95,  # Keep the title slightly higher
        'x': 0.02,
        'xanchor': 'left',
        'yanchor': 'top'
    },
    xaxis_title="Ambiguousness",
    yaxis_title="Error / Accuracy",
    hoverlabel=dict(font=dict(color='white')),
    margin=dict(t=70)  # Add more space at the top of the chart
)
fig.show()

1.5 Results

We wanted to test whether an XGBoost classification model could be used to assess and rank the quality of pairs of schedules. For performance benchmarking we use the conventional calculation method, which evaluates each schedule directly using Lindley recursions.

We trained the XGBoost ranking model with a limited set of features (schedules) and labels (objectives). The total number of possible schedules is approximately 52 million. For training and validation, we sampled 20,000 schedules. Generating the feature and label set took a total of 18.9963 seconds, with the calculation of objective values accounting for 17.0961 seconds.

The model demonstrates strong and consistent performance, with high accuracy on both the training and test folds, good generalization, and stability. Total training time for the final model was 0.3602 seconds. The evaluation of 1000 test schedules took 0.0046 seconds with the XGBoost model versus 0.8984 seconds with the conventional method, an improvement of roughly 196×.

1.6 Discussion

The chart below compares the cumulative number of schedule evaluations each method completes over time, taking into account the ranking model’s one-off training time.

training_time = round(modeling_time, 4)
conventional_time = round(evaluation_time, 4)
xgboost_time = round(prediction_time, 4)

# Define time values for plotting
time_values = np.linspace(0, training_time + 0.1, 1000)  # from 0 to just past the training time

# Ranking model: no evaluations until training completes, then 1000 per xgboost_time seconds
method1_evaluations = np.where(time_values >= training_time, (time_values - training_time) / xgboost_time * 1000, 0)

# Conventional method: 1000 evaluations per conventional_time seconds from t = 0
method2_evaluations = time_values / conventional_time * 1000

# Create line chart
fig = go.Figure()

# Add method 1 trace
fig.add_trace(go.Scatter(x=time_values, y=method1_evaluations, mode='lines', name='Ranking model'))

# Add method 2 trace
fig.add_trace(go.Scatter(x=time_values, y=method2_evaluations, mode='lines', name='Conventional method'))

# Update layout
fig.update_layout(
    title="Speed comparison between XGBoost ranking model and conventional method",
    xaxis_title="Time (seconds)",
    yaxis_title="Number of Evaluations",
    legend_title="Methods",
    template="plotly_white"
)

fig.show()
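
To make the trade-off in the chart explicit: the conventional method produces 1000 evaluations per conventional_time seconds from the start, while the ranking model produces 1000 evaluations per xgboost_time seconds only after its one-off training time. Setting the two cumulative counts equal gives the break-even point, which with the timings reported above is roughly 0.36 seconds, or about 400 conventional evaluations.

# Solve (t - training_time) / xgboost_time = t / conventional_time for t.
break_even = training_time * conventional_time / (conventional_time - xgboost_time)
print(f"Break-even after ~{break_even:.3f} s, "
      f"i.e. ~{1000 * break_even / conventional_time:.0f} conventional evaluations")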

1.7 Timeline

*This experiment was started on 25-07-2024. The completion date was 28-08-2024.*

1.8 References

Ho, Y-C, C G Cassandras, C-H Chen, and L Dai. 2000. “Ordinal Optimisation and Simulation.” Journal of the Operational Research Society 51 (4): 490–500. https://doi.org/10.1057/palgrave.jors.2600906.