4  Large instance XGBoost classification model for pairwise ranking with Bailey-Welch rule

4.1 Objective

Objective: Test the performance of an XGBoost model trained to rank pairs of schedules sampled from a neighborhood around a quasi-optimal initial schedule (Bailey-Welch).

4.2 Background

In a previous experiment we developed a machine learning model using XGBoost that can evaluate two neighboring schedules and rank them according to preference. For evaluation, random schedules were sampled from the full solution set.

The full solution set, however, contains many schedules that are obviously not optimal. Adding them to the training set would provide the model with little useful knowledge. Therefore, in this experiment we only sample pairs of schedules taken from the vicinity of a ‘good’ starting point.

4.3 Hypothesis

An XGBoost ranking model achieves superior computational efficiency compared to evaluating each element of a pair individually, leading to faster overall performance in ranking tasks.

4.4 Methodology

4.4.1 Tools and Materials

We use packages from Scikit-learn to prepare the training data and evaluate the model, and the XGBClassifier interface from the XGBoost library.

import time
import math
import json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.base import clone
import xgboost as xgb
from xgboost.callback import TrainingCallback
import plotly.graph_objects as go
import pickle
import random
from scipy.optimize import minimize
from itertools import combinations

4.4.2 Experimental Design

To compare an XGBoost Machine Learning model with a simple evaluation of each individual element of the pair, we will use a pairwise ranking approach. The objective is to rank two neighboring schedules according to preference.
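Concretely, each pair is encoded by concatenating the two schedules into one feature vector, and the label is the index of the schedule with the lower objective value. A minimal illustration with made-up values (the actual construction follows in Step 3):

pair = ([2, 1, 1, 2], [1, 2, 0, 3])  # two neighboring schedules (toy example)
features = pair[0] + pair[1]         # [2, 1, 1, 2, 1, 2, 0, 3]
label = 0                            # index of the better schedule (assumed here for illustration)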

from functions import compute_convolutions, bailey_welch_schedule

N = 22 # Number of patients
T = 20 # Number of intervals
d = 5 # Length of each interval
max_s = 20 # Maximum service time
q = 0.20 # Probability of a scheduled patient not showing up
w = 0.1 # Weight for the waiting time in objective function
l = 10 # Target mean service time (also the number of bins in the service time distribution)
num_schedules = 300000 # Number of schedules to sample

# Create service time distribution
def generate_weighted_list(max_s, l, i):
    """
    Generates a service time probability distribution using optimization.

    This function creates a discrete probability distribution over T possible
    service times (from 1 to T). It uses optimization (SLSQP) to find a
    distribution whose weighted average service time is as close as possible
    to a target value 'l', subject to the constraint that the probabilities
    sum to 1 and each probability is between 0 and 1.

    After finding the distribution, it sorts the probabilities: the first 'i'
    probabilities (corresponding to service times 1 to i) are sorted in
    ascending order, and the remaining probabilities (service times i+1 to T)
    are sorted in descending order.

    Note:
        - This function relies on a globally defined integer 'T', representing
          the maximum service time considered (or number of probability bins).
        - The parameter 'max_s' is accepted but not used directly within this
          function's optimization or sorting logic as shown. It might be
          related to how 'T' is determined externally.
        - Requires NumPy and SciPy libraries (specifically scipy.optimize.minimize).

    Args:
        max_s (any): Maximum service time parameter (currently unused in the
                     provided function body's core logic).
        l (float): The target weighted average service time for the distribution.
        i (int): The index determining the sorting split point. Probabilities
                 for service times 1 to 'i' are sorted ascendingly, and
                 probabilities for service times 'i+1' to 'T' are sorted
                 descendingly. Must be between 1 and T-1 for meaningful sorting.

    Returns:
        numpy.ndarray: An array of size T+1. The first element (index 0) is 0.
                       Elements from index 1 to T represent the calculated
                       and sorted probability distribution, summing to 1.
                       Returns None if optimization fails.
    """
    # Initialize an array of l+1 values, starting with zero
    # Index 0 is unused for probability, indices 1 to l hold the distribution
    values = np.zeros(l + 1)

    # --- Inner helper function for optimization ---
    def objective(x):
        """Objective function: Squared difference between weighted average and target l."""
        # Calculate weighted average: sum(index * probability) / sum(probability)
        # Since sum(probability) is constrained to 1, it simplifies.
        weighted_avg = np.dot(np.arange(1, l + 1), x) # Corresponds to sum(k * P(ServiceTime=k))
        return (weighted_avg - l) ** 2

    # --- Constraints for optimization ---
    # Constraint 1: The sum of the probabilities (x[0] to x[l-1]) must be 1
    constraints = ({
        'type': 'eq',
        'fun': lambda x: np.sum(x) - 1
    })

    # Bounds: Each probability value x[k] must be between 0 and 1
    # Creates a list of l tuples, e.g., [(0, 1), (0, 1), ..., (0, 1)]
    bounds = [(0, 1)] * l

    # Initial guess: Use Dirichlet distribution to get a random distribution that sums to 1
    # Provides a starting point for the optimizer. np.ones(l) gives equal weights initially.
    initial_guess = np.random.dirichlet(np.ones(l))

    # --- Perform Optimization ---
    # Minimize the objective function subject to the sum and bounds constraints
    # using the Sequential Least Squares Programming (SLSQP) method.
    result = minimize(objective, initial_guess, method='SLSQP', bounds=bounds, constraints=constraints)

    # Check if optimization was successful
    if not result.success:
        print(f"Warning: Optimization failed! Message: {result.message}")
        # Handle failure case, e.g., return None or raise an error
        return None # Or potentially return a default distribution

    # Assign the optimized probabilities (result.x) to the correct slice of the values array
    # result.x contains the l probabilities for service times 1 to l.
    values[1:] = result.x

    # --- Reorder the values based on the index 'i' ---
    # Ensure 'i' is within a valid range for slicing and sorting
    if not (0 < i < l):
       print(f"Warning: Index 'i' ({i}) is outside the valid range (1 to {T-1}). Sorting might be trivial.")
       # Adjust i or handle as an error depending on requirements
       i = max(1, min(i, l - 1)) # Clamp i to a safe range for demonstration

    # Sort the first 'i' probabilities (indices 1 to i) in ascending order
    first_part = np.sort(values[1:i+1])
    # Sort the remaining 'l-i' probabilities (indices i+1 to l) in descending order
    second_part = np.sort(values[i+1:])[::-1] # [::-1] reverses the sorted array

    # Combine the sorted parts back into the 'values' array
    values[1:i+1] = first_part
    values[i+1:] = second_part

    # Return the final array with the sorted probability distribution
    return values

i = 5  # First 5 probabilities sorted in ascending order, the rest in descending order
s = generate_weighted_list(max_s, l, i)
print(s)
print("Sum:", np.sum(s[1:]))  # This should be 1
print("Weighted service time:", np.dot(np.arange(len(s)), s))  # This should be close to l
initial_x = bailey_welch_schedule(T, d, N, s)
print(f"Initial schedule: {initial_x}")
convolutions = compute_convolutions(s, N, q)
file_path_parameters = f"datasets/parameters_{N}_{T}_{l}.pkl"
with open(file_path_parameters, 'wb') as f:
    pickle.dump({
      'N': N,
      'T': T,
      'd': d,
      'max_s': max_s,
      'q': q,
      'w': w,
      'l': l,
      'num_schedules': num_schedules,
      'convolutions': convolutions
      }, f)
    print(f"Data saved successfully to '{file_path_parameters}'")
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 1.00000000e+00 1.68307590e-11
 1.20362054e-11 7.26149696e-12 2.43707832e-12]
Sum: 1.0000000000385656
Weighted service time: 6.000000000303828
Initial schedule: [2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 5]
Data saved successfully to 'datasets/parameters_22_20_10.pkl'

We will create a random set of pairs of neighboring schedules from within the neighborhood around the schedule [2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 5]. This starting point was generated with the Bailey-Welch rule, which books two patients at the start of the session and spreads the remaining patients over the subsequent intervals.

A neighbor of a schedule x is a schedule x’ in which individual patients have been shifted one interval to the left. E.g., ([2,1,1,2], [1,2,0,3]) are neighbors and ([2,1,1,2], [2,1,3,0]) are not, because [1,2,0,3] - [2,1,1,2] = [-1, 1, -1, 1] and [2,1,3,0] - [2,1,1,2] = [0, 0, 2, -2]: each (-1, +1) pair in the difference corresponds to a single patient moving one interval, whereas a ±2 entry does not.
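To make this check concrete, the difference vector can be computed directly (the helper below is illustrative and not part of functions.py):

import numpy as np

def schedule_difference(x, x_prime):
    # Element-wise difference x' - x between two schedules
    return (np.array(x_prime) - np.array(x)).tolist()

print(schedule_difference([2, 1, 1, 2], [1, 2, 0, 3]))  # [-1, 1, -1, 1] -> neighbors
print(schedule_difference([2, 1, 1, 2], [2, 1, 3, 0]))  # [0, 0, 2, -2] -> not neighbors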

Service times will have a discrete distribution. The probability of a scheduled patient not showing up will be \(q = 0.2\).

The objective function will be the weighted average of the total waiting time of all patients and overtime. The model will be trained to predict which of the two neighboring schedules has the lowest objective value. The prediction time will be recorded. Then the same schedules will be evaluated by computing the objective value and then ranked.
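In symbols, the objective used throughout is

\[ C(x) = w \cdot W(x) + (1 - w) \cdot O(x), \qquad w = 0.1, \]

where \(W(x)\) is the expected total waiting time and \(O(x)\) the expected overtime of schedule \(x\); this matches the combination w * result[0] + (1 - w) * result[1] applied to the outputs of calculate_objective_serv_time_lookup in Step 2.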

4.4.3 Variables

  • Independent Variables: A list of tuples with pairs of neighboring schedules.
  • Dependent Variables: A list with rankings for each tuple of pairwise schedules. E.g., if the rank for ([2,1,1], [1,1,2]) equals 0, the schedule with index 0 ([2,1,1]) has the lowest objective value (see the sketch below).
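For illustration, the label follows directly from the two objective values of a pair; a minimal sketch with made-up numbers:

import numpy as np

objectives_pair = [31.2, 30.7]          # objective values for (schedule_0, schedule_1), made up for illustration
rank = int(np.argmin(objectives_pair))  # 1 -> schedule_1 has the lower objective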

4.4.4 Data Collection

The data set will be generated using simulation: random schedules will be drawn from the neighborhood of the initial (Bailey-Welch) schedule, and for each sampled schedule a random neighboring schedule will be created.

4.4.5 Sample Size and Selection

Sample Size: The total population size equals \({{N + T -1}\choose{N}} \approx 2.45 \times 10^{11}\) (about 244,663 million). For this experiment we will be using a relatively small sample of 300,000 pairs of schedules.
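As a quick check, the population size can be computed directly with the parameters defined above:

print(math.comb(N + T - 1, N))  # C(41, 22) = 244662670200 ≈ 2.45e11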

Sample Selection: The samples will be drawn at random from the neighborhood of the initial schedule by applying randomly chosen combinations of shift vectors, so that the sample reflects the combinatorial structure of the region of the solution space that the search will actually traverse.

4.4.6 Experimental Procedure

The experiment involves multiple steps, beginning with data preparation and concluding with model evaluation. The diagram below illustrates the sequence of steps.

graph TD
    A["From population"] -->|"Sample"| B["Random subset"]
    B --> |Create neighbors| C["Features: Schedule pairs"]
    C --> |Calculate objectives| D["Objective values"]
    D --> |Rank objectives| E["Labels: Rankings"]
    E --> |"Split dataset"| F["Training set"]
    E --> |"Split dataset"| G["Test set"]
    F --> |"Train"| H["Model"]
    H["Model"] --> |"Apply"| G["Test set"]
    G["Test set"] --> |"Evaluate"| I["Performance"]

Step 1: Create pairs of neighboring schedules. A set of 300,000 schedules will be sampled from the neighborhood of the initial schedule. For each sampled schedule a neighbor will be created, forming a pair. The order of the neighbors will be randomly switched to create a more diverse training set. The time taken to sample the schedules and create the neighbors will be recorded.

from functions import get_v_star, get_neighborhood

def sample_neighbors_list(x: list[int], v_star: np.ndarray, all: bool = True) -> tuple[list[int], list[int]] | list[int]:
    """
    Create a pair of schedules that are from the same neighborhood.

    Parameters:
      x (list[int]): A schedule, i.e. a list of integers with |x| = T and sum N.
      v_star (np.ndarray): Precomputed shift vectors V* of length T.
      all (bool): If True, return the pair (x, x'); if False, return only the
                  neighbor x'. (The name shadows the built-in all().)

    Returns:
      tuple[list[int], list[int]] | list[int]: A pair of schedules, or a single neighbor.
    """
    T = len(x)

    # Precompute binomial coefficients (weights for random.choices)
    binom_coeff = [math.comb(T, i) for i in range(1, T)]

    # Choose a random value of i with the corresponding probability
    i = random.choices(range(1, T), weights=binom_coeff)[0]

    # Instead of generating the full list of combinations, sample one directly
    j = random.sample(range(T), i)
    
    x_p = x.copy()
    for k in j:
        x_temp = np.array(x_p) + v_star[k]
        if np.all(x_temp >= 0):
            x_p = x_temp.astype(int).tolist()
    if all:
        return x, x_p
    else:    
        return x_p

start = time.time()
v_star = get_v_star(T)
# Sample a set of schedules from the neighborhood of the initial schedule
neighbors_selection = [sample_neighbors_list(initial_x, v_star, all = False) for i in range(num_schedules)] # This can be done in parallel to improve speed
print(len(neighbors_selection))
# For the sampled schedules, create the neighbors
neighbors_list = [sample_neighbors_list(schedule, v_star) for schedule in neighbors_selection]
# Randomly switch the order of the neighbors
neighbors_list = [neighbor if random.random() < 0.5 else neighbor[::-1] for neighbor in neighbors_list]
end = time.time()
h = random.choices(range(num_schedules), k=7)
print(f"Sampled schedules: {h}")
for i in h:
    original_schedule = neighbors_list[i][0]
    neighbor_schedule = neighbors_list[i][1]
    difference = [int(x - y) for x, y in zip(neighbors_list[i][0], neighbors_list[i][1])]
    print(f"Neighbors\n{original_schedule}\n{neighbor_schedule}\n{difference}")
training_set_feat_time = end - start
print(f"\nProcessing time: {training_set_feat_time} seconds\n")
300000
Sampled schedules: [69775, 15273, 299789, 83343, 173983, 181155, 68011]
Neighbors
[2, 3, 0, 1, 1, 0, 1, 0, 2, 1, 0, 0, 1, 1, 2, 1, 0, 1, 1, 4]
[2, 2, 1, 0, 1, 1, 0, 1, 2, 1, 0, 0, 1, 1, 2, 0, 1, 1, 0, 5]
[0, 1, -1, 1, 0, -1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 1, -1]
Neighbors
[2, 0, 2, 1, 0, 0, 1, 2, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 1, 6]
[1, 1, 2, 0, 1, 0, 1, 1, 1, 1, 0, 0, 2, 1, 2, 0, 0, 1, 0, 7]
[1, -1, 0, 1, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1, 1, 0, -1, 1, -1]
Neighbors
[1, 1, 2, 1, 0, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 6]
[1, 0, 2, 1, 1, 0, 0, 1, 2, 0, 2, 0, 0, 1, 1, 2, 1, 0, 0, 7]
[0, 1, 0, 0, -1, 1, 0, 0, 0, 0, -1, 1, 0, 0, 0, -1, 0, 1, 0, -1]
Neighbors
[3, 0, 2, 1, 0, 1, 1, 1, 0, 2, 0, 1, 0, 2, 0, 2, 0, 1, 0, 5]
[2, 0, 2, 1, 0, 1, 1, 1, 0, 2, 0, 2, 0, 1, 1, 1, 0, 1, 0, 6]
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 1, -1, 1, 0, 0, 0, -1]
Neighbors
[0, 1, 2, 1, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 0, 1, 1, 0, 2, 6]
[1, 1, 2, 1, 1, 0, 0, 1, 1, 2, 1, 0, 0, 2, 0, 1, 1, 0, 2, 5]
[-1, 0, 0, 0, 0, 0, 1, -1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Neighbors
[3, 1, 0, 2, 0, 1, 1, 0, 2, 1, 0, 1, 0, 2, 0, 2, 0, 0, 1, 5]
[3, 0, 0, 2, 0, 1, 1, 0, 2, 1, 0, 2, 0, 1, 1, 1, 0, 0, 2, 5]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 1, -1, 1, 0, 0, -1, 0]
Neighbors
[2, 0, 1, 2, 0, 0, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 1, 1, 1, 5]
[2, 0, 1, 2, 0, 1, 0, 2, 1, 1, 0, 2, 0, 0, 2, 0, 1, 2, 0, 5]
[0, 0, 0, 0, 0, -1, 1, -1, 1, -1, 1, -1, 0, 1, 0, 0, 0, -1, 1, 0]

Processing time: 74.44435214996338 seconds

Step 2: For each schedule in each pair calculate the objective. For each pair save the index of the schedule that has the lowest objective value.

from functions import calculate_objective_serv_time_lookup

objectives_schedule_1 = [
    w * result[0] + (1 - w) * result[1]
    for neighbor in neighbors_list
    for result in [calculate_objective_serv_time_lookup(neighbor[0], d, convolutions)]
]
start = time.time()
objectives_schedule_2 = [
    w * result[0] + (1 - w) * result[1]
    for neighbor in neighbors_list
    for result in [calculate_objective_serv_time_lookup(neighbor[1], d, convolutions)]
]
end = time.time()
training_set_lab_time = end - start
objectives = [[obj, objectives_schedule_2[i]] for i, obj in enumerate(objectives_schedule_1)]
rankings = np.argmin(objectives, axis=1).tolist()
for i in range(5):
    print(f"Objectives: {objectives[i]}, Ranking: {rankings[i]}")

print(f"\nProcessing time: {training_set_lab_time} seconds\n")

# Step 1: Flatten the objectives into a 1D array
flattened_data = [value for sublist in objectives for value in sublist]

# Step 2: Find the index of the minimum value
min_index = np.argmin(flattened_data)

# Step 3: Convert that index back to the original 2D structure
row_index = min_index // 2  # Assuming each inner list has 2 values
col_index = min_index % 2

print(f"The minimum objective value is at index [{row_index}][{col_index}].\nThis is schedule: {neighbors_list[row_index][col_index]} with objective value {objectives[row_index][col_index]}.")

file_path_best_schedule = f"datasets/best_schedule_{N}_{T}_{l}.pkl"
with open(file_path_best_schedule, 'wb') as f:
    pickle.dump({'best_schedule':neighbors_list[row_index][col_index], 'objective': objectives[row_index][col_index]}, f)
    print(f"Data saved successfully to '{file_path_best_schedule}'")

print(f"\nAverage ranking: {np.mean(rankings)}\n")

# Saving neighbors_list and objectives to a pickle file
file_path_neighbors = f"datasets/neighbors_and_objectives_{N}_{T}_{l}.pkl"
with open(file_path_neighbors, 'wb') as f:
    pickle.dump({'neighbors_list': neighbors_list, 'objectives': objectives, 'rankings': rankings}, f)
    print(f"Data saved successfully to '{file_path_neighbors}'")
Objectives: [35.553919980073246, 32.33726491738459], Ranking: 1
Objectives: [30.01311862226904, 30.477549128030365], Ranking: 0
Objectives: [32.42887527378183, 30.48485009020675], Ranking: 1
Objectives: [40.1069462538209, 31.984975323612133], Ranking: 1
Objectives: [30.124409036404096, 30.668188818704948], Ranking: 0

Processing time: 127.02678227424622 seconds

The minimum objective value is at index [263821][1].
This is schedule: [2, 2, 0, 2, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3] with objective value 24.028625804839624.
Data saved successfully to 'datasets/best_schedule_22_20_10.pkl'

Average ranking: 0.50062

Data saved successfully to 'datasets/neighbors_and_objectives_22_20_10.pkl'

Step 3: Create training and test sets.

# Prepare the dataset
X = []
for neighbors in neighbors_list:
    X.append(neighbors[0] + neighbors[1])

X = np.array(X)
y = np.array(rankings)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the XGBoost model.

flowchart TD
    A[Start] --> B[Initialize StratifiedKFold]
    B --> C[Initialize XGBClassifier]
    C --> D[Set results as empty list]
    D --> E[Loop through each split of cv split]
    E --> F[Get train and test indices]
    F --> G[Split X and y into X_train, X_test, y_train, y_test]
    G --> H[Clone the classifier]
    H --> I[Call fit_and_score function]
    I --> J[Fit the estimator]
    J --> K[Score on training set]
    J --> L[Score on test set]
    K --> M[Return estimator, train_score, test_score]
    L --> M
    M --> N[Append the results]
    N --> E
    E --> O[Loop ends]
    O --> P[Print results]
    P --> Q[End]

class CustomCallback(TrainingCallback):
    def __init__(self, period=10):
        self.period = period

    def after_iteration(self, model, epoch, evals_log):
        if (epoch + 1) % self.period == 0:
            print(f"Epoch {epoch}, Evaluation log: {evals_log['validation_0']['logloss'][epoch]}")
        return False
    
def fit_and_score(estimator, X_train, X_test, y_train, y_test):
    """Fit the estimator on the train set and score it on both sets"""
    estimator.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0)

    train_score = estimator.score(X_train, y_train)
    test_score = estimator.score(X_test, y_test)

    return estimator, train_score, test_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=94)

# Initialize the XGBClassifier (early stopping is set via early_stopping_rounds below)
# Load the base model parameters from a JSON file.
with open("model_params.json", "r") as f:
    model_params = json.load(f)
    
# Initialize the EarlyStopping callback with a validation dataset
# (defined for reference; it is commented out in the callbacks list below, and
# early stopping is applied via early_stopping_rounds instead)
early_stop = xgb.callback.EarlyStopping(
    rounds=10, metric_name='logloss', data_name='validation_0', save_best=True
)

clf = xgb.XGBClassifier(
    tree_method="hist",
    max_depth=model_params["max_depth"],
    min_child_weight=model_params["min_child_weight"],
    gamma=model_params["gamma"],
    subsample=model_params["subsample"],
    colsample_bytree=model_params["colsample_bytree"],
    learning_rate=model_params["learning_rate"],
    n_estimators=model_params["n_estimators"],
    early_stopping_rounds=9,
    #callbacks=[CustomCallback(period=50), early_stop],
    callbacks=[CustomCallback(period=50)],
)
print("Params: ")
for key, value in model_params.items():
    print(f" {key}: {value}")

start = time.time()
results = []

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    est, train_score, test_score = fit_and_score(
        clone(clf), X_train, X_test, y_train, y_test
    )
    results.append((est, train_score, test_score))
end = time.time()
training_time = end - start
print(f"\nTraining time: {training_time} seconds\n")
Params: 
 max_depth: 6
 min_child_weight: 1
 gamma: 0.1
 subsample: 0.8
 colsample_bytree: 0.8
 learning_rate: 0.1
 n_estimators: 100
Epoch 49, Evaluation log: 0.3335401283745809
Epoch 99, Evaluation log: 0.24401652437256852
Epoch 49, Evaluation log: 0.33205071952318926
Epoch 99, Evaluation log: 0.23965719920785827
Epoch 49, Evaluation log: 0.3323474727777454
Epoch 99, Evaluation log: 0.2458199597942381
Epoch 49, Evaluation log: 0.3376562827277618
Epoch 99, Evaluation log: 0.2460464604153424
Epoch 49, Evaluation log: 0.33256069475908295
Epoch 99, Evaluation log: 0.24362079493092606

Training time: 12.293036937713623 seconds

Step 5: To evaluate the performance of the XGBoost ranking model, we will use Stratified K-Fold Cross-Validation with 5 splits, ensuring each fold maintains the same class distribution as the original dataset. Using StratifiedKFold(n_splits=5, shuffle=True, random_state=94), the dataset will be divided into five folds. In each iteration, the model will be trained on four folds and evaluated on the remaining fold. A custom callback, CustomCallback(period=50), will print the evaluation log every 50 epochs.

The fit_and_score function will fit the model and score it on both the training and test sets, storing the results for each fold. This provides insight into the model’s performance across different subsets of the data, helps in understanding how well the model generalizes to unseen data and identifies potential overfitting or underfitting issues. The overall processing time for the cross-validation will also be recorded.

# Print results
for i, (est, train_score, test_score) in enumerate(results):
    print(f"Fold {i+1} - Train Score (Accuracy): {train_score:.4f}, Test Score (Accuracy): {test_score:.4f}")
Fold 1 - Train Score (Accuracy): 0.9296, Test Score (Accuracy): 0.9273
Fold 2 - Train Score (Accuracy): 0.9315, Test Score (Accuracy): 0.9309
Fold 3 - Train Score (Accuracy): 0.9269, Test Score (Accuracy): 0.9273
Fold 4 - Train Score (Accuracy): 0.9292, Test Score (Accuracy): 0.9287
Fold 5 - Train Score (Accuracy): 0.9277, Test Score (Accuracy): 0.9267

Training the model on the entire dataset provides a final model that has learned from all available data. Recording the training time helps in understanding the computational efficiency and scalability of the model with the given hyperparameters.

# Fit the model on the entire dataset
# Initialize the XGBClassifier without early stopping here

start = time.time()

clf = xgb.XGBClassifier(
    tree_method="hist",
    max_depth=model_params["max_depth"],
    min_child_weight=model_params["min_child_weight"],
    gamma=model_params["gamma"],
    subsample=model_params["subsample"],
    colsample_bytree=model_params["colsample_bytree"],
    learning_rate=model_params["learning_rate"],
    n_estimators=model_params["n_estimators"],
)

clf.fit(X, y)
end = time.time()
modeling_time = end - start
clf.save_model('models/classifier_large_instance.json')

# Calculate and print the training accuracy
training_accuracy = clf.score(X, y)
print(f"Training accuracy: {training_accuracy * 100:.2f}%\n")

print(f"\nTraining time: {modeling_time} seconds\n")
Training accuracy: 92.98%


Training time: 2.0916380882263184 seconds

4.4.7 Validation

Generating test schedules and calculating their objectives and rankings allows us to create a new dataset for evaluating the model’s performance on unseen data.

num_test_schedules = 1000

#test_schedules = random_combination_with_replacement(T, N, num_test_schedules)
test_schedules = [sample_neighbors_list(initial_x, v_star, all = False) for i in range(num_test_schedules)]

test_neighbors = [sample_neighbors_list(test_schedule, v_star) for test_schedule in test_schedules] # This can be done in parallel to improve speed

print(f"Sampled: {len(test_schedules)} schedules\n")

test_objectives_schedule_1 = [
    w * result[0] + (1 - w) * result[1]
    for test_neighbor in test_neighbors
    for result in [calculate_objective_serv_time_lookup(test_neighbor[0], d, convolutions)]
]
# Start time measurement for the evaluation
start = time.time()
test_objectives_schedule_2 = [
    w * result[0] + (1 - w) * result[1]
    for test_neighbor in test_neighbors
    for result in [calculate_objective_serv_time_lookup(test_neighbor[1], d, convolutions)]
]
test_rankings = [0 if test_obj < test_objectives_schedule_2[i] else 1 for i, test_obj in enumerate(test_objectives_schedule_1)]
end = time.time()
evaluation_time = end - start

# Combine the objectives for each pair for later processing
test_objectives = [[test_obj, test_objectives_schedule_2[i]] for i, test_obj in enumerate(test_objectives_schedule_1)]

print(f"\nEvaluation time: {evaluation_time} seconds\n")

for i in range(6):
    print(f"Neighbors: {test_neighbors[i]},\nObjectives: {test_objectives[i]}, Ranking: {test_rankings[i]}\n")
Sampled: 1000 schedules


Evaluation time: 0.4115619659423828 seconds

Neighbors: ([1, 2, 1, 0, 1, 1, 1, 0, 1, 2, 0, 1, 0, 1, 2, 1, 0, 0, 2, 5], [1, 3, 0, 1, 1, 1, 0, 0, 2, 1, 0, 1, 0, 1, 2, 1, 0, 1, 1, 5]),
Objectives: [33.18772614060374, 32.014184657492976], Ranking: 1

Neighbors: ([2, 0, 2, 0, 1, 1, 1, 0, 2, 1, 0, 1, 1, 0, 2, 0, 1, 1, 0, 6], [2, 1, 1, 1, 1, 1, 0, 0, 3, 1, 0, 0, 1, 0, 2, 0, 2, 0, 0, 6]),
Objectives: [32.927719304007255, 35.647250503738896], Ranking: 0

Neighbors: ([3, 0, 1, 1, 2, 0, 1, 0, 1, 2, 1, 0, 0, 1, 1, 2, 0, 1, 0, 5], [2, 1, 1, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 2, 1, 2, 0, 0, 0, 6]),
Objectives: [29.945637131579648, 35.49276948970122], Ranking: 0

Neighbors: ([3, 1, 0, 1, 1, 0, 2, 1, 1, 1, 0, 1, 0, 2, 1, 1, 0, 1, 1, 4], [3, 2, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 2, 0, 0, 2, 0, 4]),
Objectives: [27.95496167081272, 28.433288634960313], Ranking: 0

Neighbors: ([2, 0, 1, 2, 0, 0, 1, 1, 2, 1, 0, 1, 1, 0, 2, 1, 0, 1, 0, 6], [1, 1, 1, 1, 0, 0, 2, 0, 3, 0, 0, 1, 1, 1, 2, 0, 1, 0, 1, 6]),
Objectives: [33.43479200225032, 35.90095274787456], Ranking: 0

Neighbors: ([1, 2, 0, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 1, 1, 2, 0, 0, 2, 5], [0, 2, 0, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 2, 1, 0, 1, 2, 5]),
Objectives: [32.698734609826815, 36.30741609531623], Ranking: 0

Making predictions on new data and comparing them to the actual rankings provides an evaluation of the model’s performance in practical applications. Recording the prediction time helps in understanding the model’s efficiency during inference.

input_X = test_neighbors
X_new = []
for test_neighbor in input_X:
    X_new.append(test_neighbor[0] + test_neighbor[1])
X_new = np.array(X_new)  # convert to a 2D array for prediction

# Predict the target for new data
y_pred = clf.predict(X_new)

# Probability estimates
start = time.time()
y_pred_proba = clf.predict_proba(X_new)
end = time.time()
prediction_time = end - start
print(f"\nPrediction time: {prediction_time} seconds\n")

print(f"test_rankings = {np.array(test_rankings)[:6]}, \ny_pred = {y_pred[:6]}, \ny_pred_proba = \n{y_pred_proba[:6]}")

Prediction time: 0.006228923797607422 seconds

test_rankings = [1 0 0 0 0 0], 
y_pred = [1 0 0 0 0 0], 
y_pred_proba = 
[[0.19833893 0.8016611 ]
 [0.8240117  0.17598829]
 [0.9714385  0.02856148]
 [0.8815096  0.11849042]
 [0.8032509  0.19674908]
 [0.96251136 0.03748867]]

Calculating the ambiguousness of the predicted probabilities helps in understanding the model’s confidence in its predictions. High ambiguousness indicates uncertain predictions, while low ambiguousness indicates confident predictions.

Ambiguousness is calculated using the formula for entropy:

\[ H(X) = - \sum_{i} p(x_i) \log_b p(x_i) \]

Where in our case:

  • \(H(X)\) is the ambiguousness of the random variable \(X\) - the set of probability scores for the predicted rankings,

  • \(p(x_i)\) is probability score \(x_i\),

  • \(\log_b\) is the logarithm with base \(b\) (here \(\log_2\), as there are two possible outcomes),

  • The sum is taken over all possible outcomes of \(X\).
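The function calculate_ambiguousness used below is imported from functions; a minimal sketch of the row-wise base-2 entropy it is assumed to compute:

import numpy as np

def binary_entropy(y_pred_proba: np.ndarray) -> np.ndarray:
    # Base-2 entropy per row of class probabilities (assumed behavior of calculate_ambiguousness)
    p = np.clip(y_pred_proba, 1e-12, 1.0)  # guard against log2(0)
    return -np.sum(p * np.log2(p), axis=1)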

Calculating cumulative error rate and cumulative accuracy helps in understanding how the model’s performance evolves over the dataset.

Visualizing the relationship between ambiguousness and error provides insights into how uncertainty in the model’s predictions correlates with its accuracy. This can help in identifying patterns and understanding the conditions under which the model performs well or poorly.

from functions import calculate_ambiguousness

errors = np.abs(y_pred - np.array(test_rankings))

ambiguousness: np.ndarray = calculate_ambiguousness(y_pred_proba)
df = pd.DataFrame({"Ambiguousness": ambiguousness, "Error": errors}).sort_values(by="Ambiguousness")
df['Cumulative error rate'] = df['Error'].expanding().mean()
# Calculate cumulative accuracy
df['Cumulative accuracy'] = 1 - df['Cumulative error rate']
df.head()


# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Error"],
                    mode="markers",
                    name="Error",
                    marker=dict(size=9)))
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Cumulative accuracy"],
                    mode="lines",
                    name="Cum. accuracy",
                    line = dict(width = 3, dash = 'dash')))
fig.update_layout(
    title={
        'text': f"Error vs Ambiguousness</br></br><sub>n={num_test_schedules}</sub>",
        'y': 0.95,  # Keep the title slightly higher
        'x': 0.02,
        'xanchor': 'left',
        'yanchor': 'top'
    },
    xaxis_title="Ambiguousness",
    yaxis_title="Error / Accuracy",
    hoverlabel=dict(font=dict(color='white')),
    margin=dict(t=70)  # Add more space at the top of the chart
)
fig.show()

4.4.8 Hyperparameter Optimization

In the initial model the choice of hyperparameters was based on default values, examples from demos, or trial and error. To improve the model’s performance, we applied hyperparameter optimization: a grid search with cross-validation over a predefined set of hyperparameters. The best hyperparameters were selected based on the model’s performance on the validation set and then used to train the final model.
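The grid-search code itself is not shown here; a minimal sketch of how it could be run with the GridSearchCV import from above (the parameter grid values are illustrative, not the ones actually used):

param_grid = {
    "max_depth": [4, 5, 6],
    "learning_rate": [0.1, 0.2, 0.3],
    "n_estimators": [100, 250, 500],
}
grid = GridSearchCV(
    xgb.XGBClassifier(tree_method="hist"),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=94),
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)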

from functions import compare_json

with open("best_trial_params.json", "r") as f:
    best_trial_params = json.load(f)
    
differences = compare_json(model_params, best_trial_params)

params_tbl = pd.DataFrame(differences)
params_tbl.rename(index={'json1_value': 'base parameters', 'json2_value': 'optimized parameters'}, inplace=True)
print(params_tbl)
                      max_depth     gamma  subsample  colsample_bytree  \
base parameters               6  0.100000   0.800000          0.800000   
optimized parameters          5  0.304548   0.781029          0.922528   

                      learning_rate  n_estimators  
base parameters            0.100000           100  
optimized parameters       0.239488           490  
# Fit the model on the entire dataset
# Initialize the XGBClassifier without early stopping here

# Load the best trial parameters from a JSON file.
with open("best_trial_params.json", "r") as f:
    best_trial_params = json.load(f)

start = time.time()

clf = xgb.XGBClassifier(
    tree_method="hist",
    max_depth=best_trial_params["max_depth"],
    min_child_weight=best_trial_params["min_child_weight"],
    gamma=best_trial_params["gamma"],
    subsample=best_trial_params["subsample"],
    colsample_bytree=best_trial_params["colsample_bytree"],
    learning_rate=best_trial_params["learning_rate"],
    n_estimators=best_trial_params["n_estimators"],
)

clf.fit(X, y)
end = time.time()
modeling_time = end - start
print(f"\nTraining time: {modeling_time} seconds\n")

# Calculate and print the training accuracy
training_accuracy = clf.score(X, y)
print(f"Training accuracy: {training_accuracy * 100:.2f}%")

Training time: 8.635821104049683 seconds

Training accuracy: 97.32%
# Predict the target for new data
y_pred = clf.predict(X_new)

# Probability estimates
start = time.time()
y_pred_proba = clf.predict_proba(X_new)
end = time.time()
prediction_time = end - start
print(f"\nPrediction time: {prediction_time} seconds\n")

print(f"test_rankings = {np.array(test_rankings)[:6]}, \ny_pred = {y_pred[:6]}, \ny_pred_proba = \n{y_pred_proba[:6]}")

Prediction time: 0.005888938903808594 seconds

test_rankings = [1 0 0 0 0 0], 
y_pred = [1 0 0 0 0 0], 
y_pred_proba = 
[[8.6118579e-03 9.9138814e-01]
 [9.9961799e-01 3.8200631e-04]
 [9.9999726e-01 2.7135134e-06]
 [8.9641535e-01 1.0358467e-01]
 [9.8878163e-01 1.1218370e-02]
 [9.9978304e-01 2.1696255e-04]]
errors = np.abs(y_pred - np.array(test_rankings))
ambiguousness: np.ndarray = calculate_ambiguousness(y_pred_proba)
df = pd.DataFrame({"Ambiguousness": ambiguousness, "Error": errors, "Schedules": test_neighbors, "Objectives": test_objectives}).sort_values(by="Ambiguousness")
df['Cumulative error rate'] = df['Error'].expanding().mean()
# Calculate cumulative accuracy
df['Cumulative accuracy'] = 1 - df['Cumulative error rate']
df.head()
Ambiguousness Error Schedules Objectives Cumulative error rate Cumulative accuracy
282 1.952094e-08 0 ([1, 2, 1, 1, 0, 1, 0, 2, 1, 0, 1, 0, 2, 0, 1,... [33.27605065469817, 43.90588213428762] 0.0 1.0
881 7.576995e-08 0 ([1, 2, 0, 2, 1, 0, 1, 1, 0, 2, 0, 0, 2, 1, 0,... [31.50709134156461, 39.045204664821995] 0.0 1.0
959 7.632419e-08 0 ([2, 0, 2, 1, 0, 1, 0, 2, 0, 2, 0, 0, 2, 0, 1,... [34.461392149557454, 42.47948500004395] 0.0 1.0
435 1.470183e-07 0 ([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 2, 0, 1,... [33.00896278203578, 42.07269588005636] 0.0 1.0
821 1.835731e-07 0 ([1, 1, 2, 1, 0, 0, 2, 0, 1, 1, 1, 1, 0, 1, 1,... [29.9172095021296, 36.93160018047193] 0.0 1.0
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Error"],
                    mode="markers",
                    name="Error",
                    marker=dict(size=9),
                    customdata=df[["Schedules", "Objectives"]],
                    hovertemplate=
                        "Ambiguousness: %{x} <br>" +
                        "Error: %{y} <br>" +
                        "Schedules: %{customdata[0][0]} / %{customdata[0][1]} <br>" +
                        "Objectives: %{customdata[1]} <br>"
                    ))
                  
fig.add_trace(go.Scatter(x=df["Ambiguousness"], y=df["Cumulative accuracy"],
                    mode="lines",
                    name="Cum. accuracy",
                    line = dict(width = 3, dash = 'dash')))
fig.update_layout(
    title={
        'text': f"Error vs Ambiguousness</br></br><sub>n={num_test_schedules}</sub>",
        'y': 0.95,  # Keep the title slightly higher
        'x': 0.02,
        'xanchor': 'left',
        'yanchor': 'top'
    },
    xaxis_title="Ambiguousness",
    yaxis_title="Error / Accuracy",
    hoverlabel=dict(font=dict(color='white')),
    margin=dict(t=70)  # Add more space at the top of the chart
)
fig.show()

4.5 Results

We wanted to test whether an XGBoost classification model could be used to assess and rank the quality of pairs of schedules. For performance benchmarking we use the conventional calculation method utilizing Lindley recursions.

We trained the XGBoost ranking model with a limited set of features (schedules) and labels (objectives). The total number of possible schedules is approximately \(2.45 \times 10^{11}\) (about 244,663 million). For training and evaluation, we sampled 600,000 schedules and their corresponding neighbors. Generating the feature and label set took a total of 201.4711 seconds, with the calculation of objective values accounting for 127.0268 seconds.

The model demonstrates strong and consistent performance, with high accuracy in training, testing and validation (96.6%), good generalization, and stability. Total training time for the final model was 8.6358 seconds. Evaluating 1000 test schedules took 0.0059 seconds with the XGBoost model versus 0.4116 seconds with the conventional method, a speedup of roughly 69×.

4.6 Discussion
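The chart below compares the cumulative number of schedule evaluations over time for the two methods. The ranking model first has to amortize its one-off training time; once trained, it evaluates schedules roughly 69 times faster than the conventional method, so its line overtakes the conventional one shortly after training completes.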

training_time = round(modeling_time, 4)
conventional_time = round(evaluation_time, 4)
xgboost_time = round(prediction_time, 4)

# Define time values for plotting
time_values = np.linspace(0, training_time + 0.1, 1000)  # from 0 to just past the model's training time

# Ranking model: zero evaluations until training completes, then 1000 evaluations per prediction-batch time
method1_evaluations = np.where(time_values >= training_time, (time_values - training_time) / xgboost_time * 1000, 0)

# Conventional method: 1000 evaluations per evaluation-batch time, from the start
method2_evaluations = time_values / conventional_time * 1000

# Create line chart
fig = go.Figure()

# Add method 1 trace
fig.add_trace(go.Scatter(x=time_values, y=method1_evaluations, mode='lines', name='Ranking model'))

# Add method 2 trace
fig.add_trace(go.Scatter(x=time_values, y=method2_evaluations, mode='lines', name='Conventional method'))

# Update layout
fig.update_layout(
    title="Speed comparison between XGBoost ranking model and conventional method",
    xaxis_title="Time (seconds)",
    yaxis_title="Number of Evaluations",
    legend_title="Methods",
    template="plotly_white"
)

fig.show()

4.7 Timeline

This experiment was started on 26-04-2025. The expected completion date is 26-04-2025.

4.8 References