AIPL 2026 IPL Match Outcome Forecasting

Markdown

Cell 71 — Conceptual Framework: Two-Stage Decomposition

Key Conceptual Points

Introduces the mathematical theory behind the **Version 7 Conditional Two-Stage Model**.
Decomposes a 4-class problem into hierarchical sub-tasks: predicting the winner side (Stage 1), followed by predicting margins conditional on the winner (Stage 2).
Resolves a critical flaw in single-stage models: batting-first margins (run-based) and chasing margins (wicket-based) are distributed differently, so each gets its own margin model.

🎤 Speaking Notes (For Praful)

"Here, I set up the architectural framework for Version 7. Instead of force-fitting a single 4-class model, I decouple the target mathematically. Stage 1 predicts which side wins (Team A vs Team B). Stage 2 uses two separate models: Model A predicts run margins for Team A, and Model B predicts wicket margins for Team B. We then recombine these using conditional probability, which keeps the model leakage-free and highly specialized."

🔄 Connection & Variable Flow

⬅️ Inputs: Conceptual feedback and validation losses from Version 5 and Version 6.

➡️ Outputs: The system design for the subsequent code blocks (Cells 72-80).

⛓️ Continuity: This cell acts as our architectural blueprint. We write out the math first to justify why we are splitting the target variable before writing any Python code.

We can try one final advanced version, but we’ll keep it leakage-safe. V7 will split the margin model by winner side: A-win margin model and B-win margin model, instead of one generic big/small model.

Yes. Let’s try V7 to beat V6.

Your V6:

V6 best = 1.264503

To go below 1.25, the next serious idea is:

Stage 1: Predict winner side → A win or B win
Stage 2A: If A wins, predict A_small vs A_big
Stage 2B: If B wins, predict B_small vs B_big

This is better than V6 because V6 used one generic big/small model for both A and B. But margin behavior can be different:

A_big pattern ≠ B_big pattern

So V7 uses:

P(A_small) = P(A wins) × P(A small | A wins)
P(A_big) = P(A wins) × P(A big | A wins)
P(B_small) = P(B wins) × P(B small | B wins)
P(B_big) = P(B wins) × P(B big | B wins)

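This is still leakage-safe: the winner-side filter only selects which rows each margin model trains on, and at prediction time all three models see the same feature columns with every target column dropped.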
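
As a quick check of the four products above, here is a minimal sketch with made-up probabilities for a single match. The function name `combine_one_match` and the input values are illustrative assumptions, not part of the notebook. It shows that the four class probabilities always sum to 1 and that the two A classes add back up to P(A wins).

```python
import math

def combine_one_match(p_a_win, p_a_big_given_a, p_b_big_given_b):
    """Return [P(A_small), P(A_big), P(B_small), P(B_big)] for one match."""
    p_b_win = 1.0 - p_a_win
    return [
        p_a_win * (1.0 - p_a_big_given_a),  # A wins by a small margin
        p_a_win * p_a_big_given_a,          # A wins by a big margin
        p_b_win * (1.0 - p_b_big_given_b),  # B wins by a small margin
        p_b_win * p_b_big_given_b,          # B wins by a big margin
    ]

# Illustrative inputs only; they are not outputs of any model in this notebook.
probs = combine_one_match(p_a_win=0.6, p_a_big_given_a=0.3, p_b_big_given_b=0.5)

print([round(p, 4) for p in probs])
print("sum of all four classes:", round(sum(probs), 10))
print("A_small + A_big:", round(probs[0] + probs[1], 10))
```

Because each stage outputs a proper binary distribution, no extra renormalization is needed in exact arithmetic; the renormalization in the notebook's helper only matters after clipping.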

V7 CELL 1 — Build conditional labels

Code

Cell 72 — Label Construction & Mapping

Code Logic & Technical Explanation

Copies the public training dataset from the Version 3 cells to preserve previous feature engineering.
Applies helper mapping functions to isolate binary labels: target_side_A (Winner Side), target_A_big (Large run margin), and target_B_big (Large wicket margin).
Prints target distributions to inspect class proportions and verify the correctness of the split mapping.
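
To make the label mapping concrete, the sketch below applies the same three rules to a small hand-written list of targets. The toy list is an assumption for illustration; plain dictionaries stand in for the DataFrame so the snippet runs without pandas.

```python
def is_a_win(label):
    return 1 if str(label).startswith("A_") else 0

def is_a_big(label):
    return 1 if str(label) == "A_big" else 0

def is_b_big(label):
    return 1 if str(label) == "B_big" else 0

# Toy targets, not real match outcomes.
toy_targets = ["A_small", "A_big", "B_small", "B_big", "A_big"]

rows = [
    {
        "target": t,
        "target_side_A": is_a_win(t),
        "target_A_big": is_a_big(t),
        "target_B_big": is_b_big(t),
    }
    for t in toy_targets
]

for row in rows:
    print(row)

# target_A_big is 0 on every B-win row, so it only carries margin
# information inside the A-win subset (and vice versa for target_B_big).
a_big_on_b_wins = [r["target_A_big"] for r in rows if r["target_side_A"] == 0]
print("target_A_big on B-win rows:", a_big_on_b_wins)
```

This is why the margin labels are called conditional: outside the matching winner subset they are constant zeros and would teach a model nothing about margins.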

🎤 Speaking Notes (For Praful)

"In this cell, I copy the feature-engineered dataset from Version 3 and construct my binary labels. I map the multi-class labels into target_side_A to know who wins, and then create conditional indicators target_A_big and target_B_big to represent the margins. Printing the value counts ensures that the class boundaries are correct and balanced."

🔄 Connection & Variable Flow

⬅️ Inputs: train_df_v3_public (DataFrame containing all engineered rolling, phase, and venue features).

➡️ Outputs: train_df_v7_public (The primary training DataFrame with new binary target labels).

⛓️ Continuity: This cell serves as the data entry point for V7. It takes the feature-engineered dataset from V3 and appends the binary targets required by the stage-wise classifiers.

Python Code

# ============================================================
# VERSION 7: CONDITIONAL TWO-STAGE MODEL
# Stage 1: A/B winner
# Stage 2A: A_small vs A_big only on A-win matches
# Stage 2B: B_small vs B_big only on B-win matches
# ============================================================

from catboost import CatBoostClassifier
from sklearn.metrics import log_loss
import numpy as np
import pandas as pd

train_df_v7_public = train_df_v3_public.copy()

def is_A_win(label):
    return 1 if str(label).startswith("A_") else 0

def is_A_big(label):
    return 1 if str(label) == "A_big" else 0

def is_B_big(label):
    return 1 if str(label) == "B_big" else 0

train_df_v7_public["target_side_A"] = train_df_v7_public["target"].apply(is_A_win)

# Conditional labels
train_df_v7_public["target_A_big"] = train_df_v7_public["target"].apply(is_A_big)
train_df_v7_public["target_B_big"] = train_df_v7_public["target"].apply(is_B_big)

print("Full target distribution:")
print(train_df_v7_public["target"].value_counts())

print("\nSide distribution:")
print(train_df_v7_public["target_side_A"].value_counts(normalize=True))

print("\nA-win matches:")
print(train_df_v7_public[train_df_v7_public["target_side_A"] == 1]["target"].value_counts())

print("\nB-win matches:")
print(train_df_v7_public[train_df_v7_public["target_side_A"] == 0]["target"].value_counts())

Code

Cell 73 — Time-Based Validation Splitting

Code Logic & Technical Explanation

Defines metadata and label columns to exclude from training features.
Enforces a **strict chronological validation split**: all matches from 2025 onwards are held out for validation.
Includes a safety fallback: if fewer than 30 matches fall on or after 2025-01-01, the cutoff switches to the date at the 85% position of the dataset's Date column.
Replaces missing values in categorical fields with 'Unknown' and records the positional indices of the categorical columns for CatBoost.
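
The split rule with its fallback can be sketched without pandas as follows. The function name `temporal_val_mask` and the toy dates are assumptions for illustration. The sketch sorts the dates before picking the fallback cutoff so the result does not depend on row order.

```python
from datetime import date, timedelta

def temporal_val_mask(dates, cutoff, min_val_rows=30, fallback_frac=0.85):
    """Return a list of booleans: True where a row belongs to validation."""
    mask = [d >= cutoff for d in dates]
    if sum(mask) < min_val_rows:
        # Fallback: take the date sitting at the 85% position of the
        # chronologically sorted dates and hold out everything from there on.
        ordered = sorted(dates)
        cutoff = ordered[int(len(ordered) * fallback_frac)]
        mask = [d >= cutoff for d in dates]
    return mask

# Toy schedule: 100 consecutive days in 2024, so nothing reaches 2025
# and the fallback branch is exercised.
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(100)]
mask = temporal_val_mask(dates, cutoff=date(2025, 1, 1))

dev_dates = [d for d, is_val in zip(dates, mask) if not is_val]
val_dates = [d for d, is_val in zip(dates, mask) if is_val]
print("dev rows:", len(dev_dates), "val rows:", len(val_dates))
print("latest dev date:", max(dev_dates), "earliest val date:", min(val_dates))

# Same dates in reverse order give the same number of validation rows.
mask_reversed = temporal_val_mask(list(reversed(dates)), cutoff=date(2025, 1, 1))
```

The property worth checking after any such split is that the latest development date is strictly earlier than the earliest validation date.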

🎤 Speaking Notes (For Praful)

"Here, I implement a strict time-based validation split. Matches on or after January 1, 2025, are held out. Using a temporal split instead of standard cross-validation is vital in sports analytics because it prevents temporal leakage. I also prepare my categorical features here, replacing nulls with 'Unknown' so CatBoost can interpret them natively."

🔄 Connection & Variable Flow

⬅️ Inputs: train_df_v7_public (From Cell 72).

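➡️ Outputs: dev_df_v7, val_df_v7 (chronological splits), X_dev_v7, X_val_v7 (Winner Side features), y_side_dev_v7, y_side_val_v7 (Winner Side targets), y_4class_val_v7 (4-class validation target IDs), drop_cols_v7, cat_cols_v7, and cat_features_v7 (Categorical indices).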

⛓️ Continuity: Once labels are defined in Cell 72, this cell splits the dataset chronologically. It isolates features and targets for the Stage 1 Winner Side model and defines the validation boundary.

Python Code

# ============================================================
# VERSION 7: TIME VALIDATION SPLIT
# ============================================================

drop_cols_v7 = [
    "target",
    "target_side_A",
    "target_A_big",
    "target_B_big",
    "Date",
    "match_id"
]

val_mask_v7 = train_df_v7_public["Date"] >= pd.Timestamp("2025-01-01")

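if val_mask_v7.sum() < 30:
    # Fallback: use the date at the 85% position. Sort first so the cutoff
    # is chronological even if the DataFrame is not ordered by Date.
    cutoff_idx = int(len(train_df_v7_public) * 0.85)
    cutoff_date = train_df_v7_public["Date"].sort_values().iloc[cutoff_idx]
    val_mask_v7 = train_df_v7_public["Date"] >= cutoff_date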

dev_df_v7 = train_df_v7_public[~val_mask_v7].copy()
val_df_v7 = train_df_v7_public[val_mask_v7].copy()

X_dev_v7 = dev_df_v7.drop(columns=drop_cols_v7)
X_val_v7 = val_df_v7.drop(columns=drop_cols_v7)

y_side_dev_v7 = dev_df_v7["target_side_A"].values
y_side_val_v7 = val_df_v7["target_side_A"].values

y_4class_val_v7 = val_df_v7["target"].map(CLASS_TO_ID).values

cat_cols_v7 = X_dev_v7.select_dtypes(include=["object"]).columns.tolist()

for c in cat_cols_v7:
    X_dev_v7[c] = X_dev_v7[c].fillna("Unknown").astype(str)
    X_val_v7[c] = X_val_v7[c].fillna("Unknown").astype(str)

cat_features_v7 = [X_dev_v7.columns.get_loc(c) for c in cat_cols_v7]

print("V7 Dev:", dev_df_v7.shape, dev_df_v7["Date"].min(), dev_df_v7["Date"].max())
print("V7 Val:", val_df_v7.shape, val_df_v7["Date"].min(), val_df_v7["Date"].max())
print("V7 Features:", X_dev_v7.shape[1])
print("V7 categorical:", cat_cols_v7)

print("\nV7 validation distribution:")
print(val_df_v7["target"].value_counts())

Code

Cell 74 — Conditional Subsets Extraction

Code Logic & Technical Explanation

Partitions the main Dev/Val datasets into two specialized, conditional training subsets.
A-win Subset: Filters rows where Team A won (target_side_A == 1) to isolate features and the target_A_big target.
B-win Subset: Filters rows where Team B won (target_side_A == 0) to isolate features and the target_B_big target.
Applies category string normalization to both sets to guarantee alignment.

🎤 Speaking Notes (For Praful)

"To train my Stage 2 models, I partition my split datasets. I build the A-win subset containing only matches won by Team A, which will train Model A on run-margins. Similarly, I build the B-win subset for Team B chases. This ensures the margin models only learn from matches where that specific team actually won, isolating the margin signals."

🔄 Connection & Variable Flow

⬅️ Inputs: dev_df_v7, val_df_v7 (From Cell 73) and targets (From Cell 72).

➡️ Outputs: X_dev_A_v7, y_dev_A_big_v7 (For Model A); and X_dev_B_v7, y_dev_B_big_v7 (For Model B).

⛓️ Continuity: This cell slices the primary chronological datasets into the conditional sub-matrices. It filters the rows so Model A trains only on matches where Team A won, and Model B trains only on matches where Team B won.

Python Code

# ============================================================
# VERSION 7: CONDITIONAL DATASETS
# ============================================================

# A-win subset for A_small vs A_big
dev_A_v7 = dev_df_v7[dev_df_v7["target_side_A"] == 1].copy()
val_A_v7 = val_df_v7[val_df_v7["target_side_A"] == 1].copy()

X_dev_A_v7 = dev_A_v7.drop(columns=drop_cols_v7)
y_dev_A_big_v7 = dev_A_v7["target_A_big"].values

X_val_A_v7 = val_A_v7.drop(columns=drop_cols_v7)
y_val_A_big_v7 = val_A_v7["target_A_big"].values

# B-win subset for B_small vs B_big
dev_B_v7 = dev_df_v7[dev_df_v7["target_side_A"] == 0].copy()
val_B_v7 = val_df_v7[val_df_v7["target_side_A"] == 0].copy()

X_dev_B_v7 = dev_B_v7.drop(columns=drop_cols_v7)
y_dev_B_big_v7 = dev_B_v7["target_B_big"].values

X_val_B_v7 = val_B_v7.drop(columns=drop_cols_v7)
y_val_B_big_v7 = val_B_v7["target_B_big"].values

for df in [X_dev_A_v7, X_val_A_v7, X_dev_B_v7, X_val_B_v7]:
    for c in cat_cols_v7:
        df[c] = df[c].fillna("Unknown").astype(str)

print("A conditional dev:", X_dev_A_v7.shape, "val:", X_val_A_v7.shape)
print("A conditional target:", pd.Series(y_dev_A_big_v7).value_counts(normalize=True))

print("\nB conditional dev:", X_dev_B_v7.shape, "val:", X_val_B_v7.shape)
print("B conditional target:", pd.Series(y_dev_B_big_v7).value_counts(normalize=True))

Code

Cell 75 — Mathematical Calibration Helpers

Code Logic & Technical Explanation

Defines numerical and probability utility functions for calibration and final prediction.
normalize_probs: Clips predictions to [1e-8, 1.0] and rescales each row to sum to 1, avoiding division by zero and log(0).
apply_gamma: Raises each probability to the power gamma and renormalizes. This is equivalent to **Temperature Scaling** with temperature 1/gamma applied to the log-probabilities: gamma > 1 sharpens the distribution, gamma < 1 flattens it.
blend_prior: Shrinks probability vectors towards the baseline prior frequencies, limiting the log-loss penalty for overconfident mistakes.
combine_conditional_probs: Multiplies the Stage 1 winner probabilities element-wise by the Stage 2 conditional margin probabilities to reconstruct the final 4-class output.
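
The effect of the two calibration steps is easier to see on one probability vector. The sketch below re-implements the helpers for a single row in plain Python (the notebook's versions operate on NumPy arrays) and adds a hypothetical `softmax_with_temperature` function to show that the power transform matches temperature scaling with T = 1/gamma. The input vector and the uniform prior are illustrative assumptions.

```python
import math

def normalize(p, eps=1e-8):
    """Clip to [eps, 1] and rescale so the values sum to 1."""
    clipped = [min(max(x, eps), 1.0) for x in p]
    total = sum(clipped)
    return [x / total for x in clipped]

def apply_gamma(p, gamma):
    """Raise probabilities to the power gamma, then renormalize."""
    return normalize([x ** gamma for x in normalize(p)])

def blend_prior(p, prior, alpha):
    """Convex combination of the prediction and a fixed prior."""
    return normalize([alpha * x + (1 - alpha) * q for x, q in zip(normalize(p), prior)])

def softmax_with_temperature(log_p, temperature):
    """Softmax of log-probabilities divided by a temperature."""
    scaled = [x / temperature for x in log_p]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

p = [0.40, 0.30, 0.20, 0.10]             # illustrative prediction
sharp = apply_gamma(p, 2.0)              # gamma > 1: more confident
flat = apply_gamma(p, 0.5)               # gamma < 1: less confident
via_temperature = softmax_with_temperature([math.log(x) for x in p], temperature=1 / 2.0)
shrunk = blend_prior(sharp, [0.25] * 4, alpha=0.9)  # pull back toward a uniform prior

print("sharp :", [round(x, 4) for x in sharp])
print("flat  :", [round(x, 4) for x in flat])
print("temp  :", [round(x, 4) for x in via_temperature])
print("shrunk:", [round(x, 4) for x in shrunk])
```

Sharpening and prior blending pull in opposite directions, which is presumably why the later grid search tunes gamma and alpha jointly rather than one at a time.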

🎤 Speaking Notes (For Praful)

"This cell houses my mathematical calibration functions. `normalize_probs` secures numerical stability by clipping values. `apply_gamma` scales probability temperatures to adjust model confidence. `blend_prior` blends our predictions with the training priors to prevent extreme losses. Finally, `combine_conditional_probs` implements the probability multiplication formula to stack our outputs into the 4-class format."

🔄 Connection & Variable Flow

⬅️ Inputs: Mathematical logic for calibration.

➡️ Outputs: Helper functions (normalize_probs, apply_gamma, blend_prior, binary_prob_positive, combine_conditional_probs).

⛓️ Continuity: This is a stateless utility cell. It defines the mathematical calibration tools that are used in Cell 76 to evaluate validation predictions, Cell 77 to run grid search, and Cell 79 for final test inference.

Python Code

# ============================================================
# VERSION 7: HELPERS
# ============================================================

def normalize_probs(p):
    p = np.clip(np.asarray(p), 1e-8, 1)
    p = p / p.sum(axis=1, keepdims=True)
    return p

def apply_gamma(p, gamma):
    p = normalize_probs(p)
    p = p ** gamma
    return normalize_probs(p)

def blend_prior(p, prior, alpha):
    p = normalize_probs(p)
    prior = np.asarray(prior).reshape(1, -1)
    return normalize_probs(alpha * p + (1 - alpha) * prior)

def binary_prob_positive(model, X):
    p = model.predict_proba(X)
    return np.asarray(p)[:, 1]

def combine_conditional_probs(p_A_win, p_A_big_given_A, p_B_big_given_B):
    p_A_win = np.clip(np.asarray(p_A_win), 1e-6, 1 - 1e-6)
    p_B_win = 1 - p_A_win

    p_A_big_given_A = np.clip(np.asarray(p_A_big_given_A), 1e-6, 1 - 1e-6)
    p_B_big_given_B = np.clip(np.asarray(p_B_big_given_B), 1e-6, 1 - 1e-6)

    p_A_small_given_A = 1 - p_A_big_given_A
    p_B_small_given_B = 1 - p_B_big_given_B

    out = np.vstack([
        p_A_win * p_A_small_given_A,
        p_A_win * p_A_big_given_A,
        p_B_win * p_B_small_given_B,
        p_B_win * p_B_big_given_B
    ]).T

    return normalize_probs(out)

Code

Cell 76 — Model Training & Hyperparameter Search

Code Logic & Technical Explanation

Sets up three distinct hyperparameter configurations with varying tree depths (2 and 3), learning rates, random seeds, and L2 leaf regularization.
Trains **three separate CatBoostClassifiers** (Winner Side, Model A, Model B) for each configuration on the development sets.
Uses validation sets for early stopping (use_best_model=True, od_wait=80) to limit overfitting. Because the same validation rows also score the model, the reported losses are somewhat optimistic.
Combines validation outputs using combine_conditional_probs and prints the validation log-loss for each config.
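
For context on the log-loss numbers, the sketch below computes multi-class log loss by hand from its standard definition. The function name `multiclass_log_loss` and the toy labels are assumptions. A uniform 4-class forecast scores ln 4 (about 1.386), which is the reference point against which a validation loss such as V6's 1.264503 can be read.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p_true = min(max(row[label], eps), 1.0)  # guard against log(0)
        total += -math.log(p_true)
    return total / len(y_true)

y_true = [0, 1, 2, 3, 1]  # toy class IDs

# A forecaster that knows nothing assigns 0.25 to every class.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in y_true]
uniform_loss = multiclass_log_loss(y_true, uniform)

# An overconfident forecaster that always backs class 0.
overconfident = [[0.97, 0.01, 0.01, 0.01] for _ in y_true]
overconfident_loss = multiclass_log_loss(y_true, overconfident)

print("uniform loss:", round(uniform_loss, 6), "ln 4:", round(math.log(4), 6))
print("overconfident loss:", round(overconfident_loss, 6))
```

The second case shows why the pipeline clips and blends with a prior: a single confident miss costs -ln(0.01), roughly 4.6, which outweighs several well-predicted rows.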

🎤 Speaking Notes (For Praful)

"Here, I define my model configurations and run the training loop. For each config, I train the Winner Side model, Model A, and Model B. I enable early stopping with validation sets to prevent overfitting, combine the predictions using my conditional formula, and evaluate the log loss. The raw validation scores are stored to help in ensembling."

🔄 Connection & Variable Flow

⬅️ Inputs: Stage 1 datasets (From Cell 73), Stage 2 datasets (From Cell 74), math helpers (From Cell 75), and CLASS_TO_ID mapping.

➡️ Outputs: Trained validation models (v7_side_models, etc.) and validation prediction vectors (v7_preds).

⛓️ Continuity: This cell brings together data preparation (Cells 73-74) and mathematical helpers (Cell 75). It trains three CatBoost models for each hyperparameter configuration and logs their raw validation performance.

Python Code

# ============================================================
# VERSION 7: TRAIN SIDE + CONDITIONAL MARGIN MODELS
# ============================================================

v7_configs = [
    {
        "name": "v7_depth2",
        "depth": 2,
        "learning_rate": 0.035,
        "l2_leaf_reg": 12,
        "random_strength": 2.0,
        "bagging_temperature": 0.9,
        "iterations": 900,
        "seed": 42
    },
    {
        "name": "v7_depth2_reg",
        "depth": 2,
        "learning_rate": 0.030,
        "l2_leaf_reg": 18,
        "random_strength": 2.5,
        "bagging_temperature": 1.0,
        "iterations": 1000,
        "seed": 99
    },
    {
        "name": "v7_depth3",
        "depth": 3,
        "learning_rate": 0.025,
        "l2_leaf_reg": 10,
        "random_strength": 2.0,
        "bagging_temperature": 0.8,
        "iterations": 900,
        "seed": 42
    }
]

v7_side_models = {}
v7_A_margin_models = {}
v7_B_margin_models = {}
v7_preds = {}
v7_results = []

for cfg in v7_configs:
    print("\nTraining config:", cfg["name"])

    # Winner side model
    side_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="Logloss",
        iterations=cfg["iterations"],
        depth=cfg["depth"],
        learning_rate=cfg["learning_rate"],
        l2_leaf_reg=cfg["l2_leaf_reg"],
        random_strength=cfg["random_strength"],
        bagging_temperature=cfg["bagging_temperature"],
        border_count=128,
        random_seed=cfg["seed"],
        od_type="Iter",
        od_wait=80,
        verbose=100
    )

    side_model.fit(
        X_dev_v7,
        y_side_dev_v7,
        cat_features=cat_features_v7,
        eval_set=(X_val_v7, y_side_val_v7),
        use_best_model=True
    )

    # A margin model: A_small vs A_big
    A_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="Logloss",
        iterations=cfg["iterations"],
        depth=cfg["depth"],
        learning_rate=cfg["learning_rate"],
        l2_leaf_reg=cfg["l2_leaf_reg"],
        random_strength=cfg["random_strength"],
        bagging_temperature=cfg["bagging_temperature"],
        border_count=128,
        random_seed=cfg["seed"] + 11,
        od_type="Iter",
        od_wait=80,
        verbose=100
    )

    A_model.fit(
        X_dev_A_v7,
        y_dev_A_big_v7,
        cat_features=cat_features_v7,
        eval_set=(X_val_A_v7, y_val_A_big_v7),
        use_best_model=True
    )

    # B margin model: B_small vs B_big
    B_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="Logloss",
        iterations=cfg["iterations"],
        depth=cfg["depth"],
        learning_rate=cfg["learning_rate"],
        l2_leaf_reg=cfg["l2_leaf_reg"],
        random_strength=cfg["random_strength"],
        bagging_temperature=cfg["bagging_temperature"],
        border_count=128,
        random_seed=cfg["seed"] + 22,
        od_type="Iter",
        od_wait=80,
        verbose=100
    )

    B_model.fit(
        X_dev_B_v7,
        y_dev_B_big_v7,
        cat_features=cat_features_v7,
        eval_set=(X_val_B_v7, y_val_B_big_v7),
        use_best_model=True
    )

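    # Score every validation row with all three models: the margin models were
    # trained on winner-specific subsets, but the conditional formula needs
    # P(big | A wins) and P(big | B wins) for each match.
    p_A_win = binary_prob_positive(side_model, X_val_v7)
    p_A_big_given_A = binary_prob_positive(A_model, X_val_v7)
    p_B_big_given_B = binary_prob_positive(B_model, X_val_v7)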

    p_4 = combine_conditional_probs(
        p_A_win=p_A_win,
        p_A_big_given_A=p_A_big_given_A,
        p_B_big_given_B=p_B_big_given_B
    )

    raw_loss = log_loss(y_4class_val_v7, p_4, labels=[0, 1, 2, 3])

    name = cfg["name"]

    v7_side_models[name] = side_model
    v7_A_margin_models[name] = A_model
    v7_B_margin_models[name] = B_model
    v7_preds[name] = p_4

    res = {
        "name": name,
        "raw_loss": raw_loss,
        "side_iter": side_model.best_iteration_,
        "A_margin_iter": A_model.best_iteration_,
        "B_margin_iter": B_model.best_iteration_
    }

    v7_results.append(res)
    print("\nV7 result:", res)

print("\nAll V7 raw results:")
for r in v7_results:
    print(r)

Code

Cell 77 — Ensembling & Calibration Grid Search

Code Logic & Technical Explanation

Ensembles predictions from the three major pipeline versions: V5 (baseline ensemble), V6 (first two-stage model), and V7 (this conditional model).
Rebuilds the optimal calibrated validation outputs for V5 and V6 using their respective configs.
Executes a multi-dimensional grid search over ensembling weights, temperature scale gamma, and prior blending weight alpha.
Selects and saves the configuration that minimizes multi-class log loss on the validation set.
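
The search loop can be sketched in plain Python on toy data. Everything here (the two toy models, the small grids, the uniform prior, and the names `mix_models` and `grid_search`) is an illustrative assumption; the helpers are single-row re-implementations of the notebook's NumPy versions.

```python
import math

def normalize(p, eps=1e-8):
    clipped = [min(max(x, eps), 1.0) for x in p]
    total = sum(clipped)
    return [x / total for x in clipped]

def apply_gamma(p, gamma):
    return normalize([x ** gamma for x in normalize(p)])

def blend_prior(p, prior, alpha):
    return normalize([alpha * x + (1 - alpha) * q for x, q in zip(normalize(p), prior)])

def log_loss(y_true, probs):
    return -sum(math.log(row[y]) for y, row in zip(y_true, probs)) / len(y_true)

def mix_models(model_preds, weights):
    """Weighted average of several models' per-row probability vectors."""
    total = sum(weights)
    mixed = []
    for rows in zip(*model_preds):  # one tuple of rows per match
        mixed.append(normalize([
            sum(w * r[k] for w, r in zip(weights, rows)) / total
            for k in range(len(rows[0]))
        ]))
    return mixed

def grid_search(y_true, model_preds, weight_sets, gammas, alphas, prior):
    best = {"loss": float("inf")}
    for weights in weight_sets:
        mixed = mix_models(model_preds, weights)  # computed once per weight set
        for gamma in gammas:
            sharpened = [apply_gamma(r, gamma) for r in mixed]
            for alpha in alphas:
                final = [blend_prior(r, prior, alpha) for r in sharpened]
                loss = log_loss(y_true, final)
                if loss < best["loss"]:
                    best = {"loss": loss, "weights": weights, "gamma": gamma, "alpha": alpha}
    return best

# Toy validation set: four matches, two models.
y_true = [0, 1, 2, 3]
model_a = [[0.5, 0.2, 0.2, 0.1], [0.3, 0.4, 0.2, 0.1], [0.2, 0.2, 0.4, 0.2], [0.1, 0.2, 0.3, 0.4]]
model_b = [[0.4, 0.3, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1], [0.3, 0.3, 0.3, 0.1], [0.25, 0.25, 0.25, 0.25]]

weight_sets = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
best = grid_search(y_true, [model_a, model_b], weight_sets,
                   gammas=[1.0, 1.5, 2.0], alphas=[1.0, 0.9], prior=[0.25] * 4)
baseline = log_loss(y_true, mix_models([model_a, model_b], (1.0, 0.0)))

print("best:", best)
print("model_a alone, uncalibrated:", round(baseline, 6))
```

Since the grid includes the uncalibrated single-model settings, the selected loss can never be worse than those baselines on the data used for the search. The same property means the selected loss is tuned to that validation set and may not transfer fully to unseen matches.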

🎤 Speaking Notes (For Praful)

"In this cell, I ensemble my models. I reconstruct the validation predictions from Version 5 and Version 6, and perform a grid search over blend weights, temperature scaling gamma, and prior blending alpha. Blending these different models and tuning the calibration parameters allows us to achieve a highly competitive log loss."

🔄 Connection & Variable Flow

⬅️ Inputs: v7_preds (From Cell 76), V5/V6 historical validation predictions, math helpers (From Cell 75), and ground truth target IDs (From Cell 73).

➡️ Outputs: best_v7_public (Dictionary with optimal ensembling weights, calibration gamma, and alpha).

⛓️ Continuity: This cell optimizes post-processing. It blends the newly trained V7 models with the predictions of V5 and V6, and grid-searches the calibration constants (gamma, alpha) to minimize validation log loss.

Python Code

# ============================================================
# VERSION 7 FAST BLEND SEARCH
# Replaces the slow interrupted V7 blend cell
# ============================================================

from sklearn.metrics import log_loss
import numpy as np

# Rebuild V5 validation probabilities
p_v5_val = np.zeros_like(list(v5_val_raw_preds.values())[0])

for name, w in best_v5_public["weights"].items():
    p_v5_val += w * v5_val_raw_preds[name]

p_v5_val = p_v5_val / sum(best_v5_public["weights"].values())
p_v5_val = apply_gamma(p_v5_val, best_v5_public["gamma"])
p_v5_val = blend_prior(p_v5_val, prior_dev_v5, best_v5_public["alpha"])
p_v5_val = normalize_probs(p_v5_val)

# Rebuild V6 validation probabilities
p_v6_raw_best = v6_two_stage_preds[best_v6_public["two_stage_model"]]

p_v6_mix = (
    best_v6_public["blend_with_v5"] * p_v5_val
    + (1 - best_v6_public["blend_with_v5"]) * p_v6_raw_best
)

p_v6_mix = normalize_probs(p_v6_mix)
p_v6_mix = apply_gamma(p_v6_mix, best_v6_public["gamma"])

prior_dev_v6_4class = np.bincount(
    dev_df_v6["target"].map(CLASS_TO_ID).values,
    minlength=4
) / len(dev_df_v6)

p_v6_val = blend_prior(
    p_v6_mix,
    prior_dev_v6_4class,
    best_v6_public["alpha"]
)

p_v6_val = normalize_probs(p_v6_val)

print("V5 check:", log_loss(y_4class_val_v7, p_v5_val, labels=[0, 1, 2, 3]))
print("V6 check:", log_loss(y_4class_val_v7, p_v6_val, labels=[0, 1, 2, 3]))

prior_dev_v7_4class = np.bincount(
    dev_df_v7["target"].map(CLASS_TO_ID).values,
    minlength=4
) / len(dev_df_v7)

best_v7_public = {
    "loss": 999,
    "v7_model": None,
    "w_v5": None,
    "w_v6": None,
    "w_v7": None,
    "gamma": None,
    "alpha": None
}

# Much smaller, smarter grid
weight_sets = [
    # mostly V6, small V7
    (0.0, 0.90, 0.10),
    (0.0, 0.80, 0.20),
    (0.0, 0.70, 0.30),

    # V6 + V5 + V7
    (0.10, 0.80, 0.10),
    (0.10, 0.70, 0.20),
    (0.10, 0.60, 0.30),

    # mostly V7
    (0.0, 0.40, 0.60),
    (0.0, 0.30, 0.70),
    (0.0, 0.20, 0.80),

    # pure checks
    (0.0, 1.00, 0.0),
    (0.0, 0.0, 1.0),
]

gamma_grid = np.linspace(1.6, 2.6, 21)
alpha_grid = [1.0, 0.95, 0.90]

for v7_name, p_v7_raw in v7_preds.items():
    p_v7_raw = normalize_probs(p_v7_raw)

    for w_v5, w_v6, w_v7 in weight_sets:
        total = w_v5 + w_v6 + w_v7

        p_mix = (
            w_v5 * p_v5_val +
            w_v6 * p_v6_val +
            w_v7 * p_v7_raw
        ) / total

        p_mix = normalize_probs(p_mix)

        for gamma in gamma_grid:
            pg = apply_gamma(p_mix, gamma)

            for alpha in alpha_grid:
                pf = blend_prior(pg, prior_dev_v7_4class, alpha)
                loss = log_loss(y_4class_val_v7, pf, labels=[0, 1, 2, 3])

                if loss < best_v7_public["loss"]:
                    best_v7_public = {
                        "loss": loss,
                        "v7_model": v7_name,
                        "w_v5": w_v5,
                        "w_v6": w_v6,
                        "w_v7": w_v7,
                        "gamma": gamma,
                        "alpha": alpha
                    }

print("BEST V7 FAST PUBLIC VALIDATION RESULT:")
print(best_v7_public)

print("\nCompare:")
print("V5:", best_v5_public["loss"])
print("V6:", best_v6_public["loss"])
print("V7 fast:", best_v7_public["loss"])

Code

Cell 78 — Full-Data Model Retraining

Code Logic & Technical Explanation

Prepares features and targets on 100% of the public dataset (combining development and validation data).
Extracts the optimal tree iteration numbers (best_iteration_ + 1) for each model from the validation runs in Cell 76.
Trains the final Stage 1, Stage 2A, and Stage 2B CatBoostClassifiers on the complete dataset with those fixed iteration counts, since no held-out rows remain for early stopping.
Computes and prints the final target class prior distributions.
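
The class prior is simply the share of each outcome in the training labels. A minimal sketch, assuming the class order is A_small, A_big, B_small, B_big (the notebook's actual `CLASS_ORDER` is defined in an earlier cell not shown here) and using a toy label list:

```python
from collections import Counter
import math

# Assumed ordering; it matches the column order produced by
# combine_conditional_probs, but the real CLASS_ORDER lives elsewhere.
CLASS_ORDER = ["A_small", "A_big", "B_small", "B_big"]

def class_prior(labels, class_order=CLASS_ORDER):
    """Relative frequency of each class, in a fixed class order."""
    counts = Counter(labels)
    n = len(labels)
    # Classes absent from the data get frequency 0, like bincount(minlength=4).
    return [counts.get(c, 0) / n for c in class_order]

toy_labels = ["A_small"] * 4 + ["A_big"] * 2 + ["B_small"] * 3 + ["B_big"] * 1
prior = class_prior(toy_labels)

print(dict(zip(CLASS_ORDER, prior)))
```

Keeping the order fixed matters because the prior is later blended position by position with the model's probability vector.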

🎤 Speaking Notes (For Praful)

"To prepare my final model for deployment, I retrain on the complete dataset—combining train and validation rows. I extract the exact best iteration numbers from my validation runs to prevent overfitting, fit my final three CatBoost models on all available data, and compute the final training prior for our prior blending step."

🔄 Connection & Variable Flow

⬅️ Inputs: train_df_v7_public (From Cell 72), v7_side_models, v7_A_margin_models, v7_B_margin_models (From Cell 76), and best_v7_public (From Cell 77).

➡️ Outputs: final_side_model_v7, final_A_margin_model_v7, final_B_margin_model_v7 (Final models trained on all public data), and prior_all_v7_public.

⛓️ Continuity: Once the optimal hyperparameters and ensembling weights are locked in (Cell 77), this cell prepares the final deployable models. It retrains the three estimators on 100% of the public data using the specific iteration thresholds recorded during validation.

Python Code

# ============================================================
# VERSION 7: FINAL TRAINING ON ALL PUBLIC DATA
# Run only if V7 beats V6.
# ============================================================

X_all_v7_public = train_df_v7_public.drop(columns=drop_cols_v7)

y_side_all_v7 = train_df_v7_public["target_side_A"].values

A_all_df_v7 = train_df_v7_public[train_df_v7_public["target_side_A"] == 1].copy()
B_all_df_v7 = train_df_v7_public[train_df_v7_public["target_side_A"] == 0].copy()

X_A_all_v7 = A_all_df_v7.drop(columns=drop_cols_v7)
y_A_big_all_v7 = A_all_df_v7["target_A_big"].values

X_B_all_v7 = B_all_df_v7.drop(columns=drop_cols_v7)
y_B_big_all_v7 = B_all_df_v7["target_B_big"].values

cat_cols_v7_final = X_all_v7_public.select_dtypes(include=["object"]).columns.tolist()

for df in [X_all_v7_public, X_A_all_v7, X_B_all_v7]:
    for c in cat_cols_v7_final:
        df[c] = df[c].fillna("Unknown").astype(str)

cat_features_v7_final = [X_all_v7_public.columns.get_loc(c) for c in cat_cols_v7_final]

chosen_v7_name = best_v7_public["v7_model"]
chosen_v7_cfg = None

for cfg in v7_configs:
    if cfg["name"] == chosen_v7_name:
        chosen_v7_cfg = cfg
        break

print("Chosen V7 config:", chosen_v7_cfg)

old_side = v7_side_models[chosen_v7_name]
old_A = v7_A_margin_models[chosen_v7_name]
old_B = v7_B_margin_models[chosen_v7_name]

side_iters = old_side.best_iteration_ + 1 if old_side.best_iteration_ is not None else 150
A_iters = old_A.best_iteration_ + 1 if old_A.best_iteration_ is not None else 100
B_iters = old_B.best_iteration_ + 1 if old_B.best_iteration_ is not None else 100

print("Final iters:", side_iters, A_iters, B_iters)

final_side_model_v7 = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    iterations=side_iters,
    depth=chosen_v7_cfg["depth"],
    learning_rate=chosen_v7_cfg["learning_rate"],
    l2_leaf_reg=chosen_v7_cfg["l2_leaf_reg"],
    random_strength=chosen_v7_cfg["random_strength"],
    bagging_temperature=chosen_v7_cfg["bagging_temperature"],
    border_count=128,
    random_seed=chosen_v7_cfg["seed"],
    verbose=100
)

final_A_margin_model_v7 = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    iterations=A_iters,
    depth=chosen_v7_cfg["depth"],
    learning_rate=chosen_v7_cfg["learning_rate"],
    l2_leaf_reg=chosen_v7_cfg["l2_leaf_reg"],
    random_strength=chosen_v7_cfg["random_strength"],
    bagging_temperature=chosen_v7_cfg["bagging_temperature"],
    border_count=128,
    random_seed=chosen_v7_cfg["seed"] + 11,
    verbose=100
)

final_B_margin_model_v7 = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    iterations=B_iters,
    depth=chosen_v7_cfg["depth"],
    learning_rate=chosen_v7_cfg["learning_rate"],
    l2_leaf_reg=chosen_v7_cfg["l2_leaf_reg"],
    random_strength=chosen_v7_cfg["random_strength"],
    bagging_temperature=chosen_v7_cfg["bagging_temperature"],
    border_count=128,
    random_seed=chosen_v7_cfg["seed"] + 22,
    verbose=100
)

final_side_model_v7.fit(
    X_all_v7_public,
    y_side_all_v7,
    cat_features=cat_features_v7_final
)

final_A_margin_model_v7.fit(
    X_A_all_v7,
    y_A_big_all_v7,
    cat_features=cat_features_v7_final
)

final_B_margin_model_v7.fit(
    X_B_all_v7,
    y_B_big_all_v7,
    cat_features=cat_features_v7_final
)

prior_all_v7_public = np.bincount(
    train_df_v7_public["target"].map(CLASS_TO_ID).values,
    minlength=4
) / len(train_df_v7_public)

print("Final V7 public prior:")
print(dict(zip(CLASS_ORDER, prior_all_v7_public)))

Code

Cell 79 — Inference Wrappers Construction

Code Logic & Technical Explanation

Defines clean inference wrappers to generate predictions for test matches.
predict_v7_raw_public: Implements feature engineering alignment (imputing missing columns/categories) and calculates Stage 1 + Stage 2 probabilities.
predict_known_toss_public_v7: Houses the complete prediction logic. It obtains predictions from V5, V6, and V7, blends them using the optimal ensembling weights, and applies calibration transformations.
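
The alignment step in `predict_v7_raw_public` can be illustrated on a single dictionary row. The function name `align_feature_row` and the column names below are hypothetical. The sketch treats `None` as the missing marker, whereas the notebook relies on pandas `fillna`.

```python
import math

def align_feature_row(row, train_columns, cat_columns):
    """Return a new row with exactly the training columns, in training order."""
    aligned = {}
    for col in train_columns:
        value = row.get(col)  # None if the column is absent
        if col in cat_columns:
            # Missing categories become the same placeholder used in training.
            aligned[col] = "Unknown" if value is None else str(value)
        else:
            # Missing numeric features become NaN.
            aligned[col] = math.nan if value is None else value
    return aligned

# Hypothetical training schema.
train_columns = ["team_a", "team_b", "venue", "elo_diff", "toss_winner_is_A"]
cat_columns = {"team_a", "team_b", "venue"}

# Incoming row: no venue, no elo_diff, plus a column the model never saw.
raw_row = {"team_b": "MI", "team_a": "CSK", "toss_winner_is_A": 1, "extra": 123}
aligned = align_feature_row(raw_row, train_columns, cat_columns)

print(aligned)
```

Reordering to the training column order is the important part when categorical features are registered by position, as they are here: a shifted column would be silently misread as a different feature.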

🎤 Speaking Notes (For Praful)

"Here, I build my final inference wrappers. `predict_v7_raw_public` takes any raw match row, handles missing categories, runs our final three V7 models, and combines them. `predict_known_toss_public_v7` combines predictions from V5, V6, and V7, applies our optimized ensembling weights, scales the probabilities with gamma, and blends them with the prior."

🔄 Connection & Variable Flow

⬅️ Inputs: Final models and priors (From Cell 78), best_v7_public parameters (From Cell 77), math helpers (From Cell 75), and prediction wrappers from V5 and V6.

➡️ Outputs: predict_v7_raw_public and predict_known_toss_public_v7 (Inference functions for the test sets).

⛓️ Continuity: This cell packages the models, ensembling weights, and calibration parameters into clean, unified prediction functions. It connects all previous stages into a single prediction API ready for deployment.

Python Code

# ============================================================
# VERSION 7: PUBLIC PREDICTION FUNCTIONS
# ============================================================

def predict_v7_raw_public(feature_df):
    feature_df = feature_df.copy()

    for c in cat_cols_v7_final:
        if c not in feature_df.columns:
            feature_df[c] = "Unknown"
        feature_df[c] = feature_df[c].fillna("Unknown").astype(str)

    for c in X_all_v7_public.columns:
        if c not in feature_df.columns:
            feature_df[c] = np.nan

    feature_df = feature_df[X_all_v7_public.columns]

    p_A_win = binary_prob_positive(final_side_model_v7, feature_df)
    p_A_big_given_A = binary_prob_positive(final_A_margin_model_v7, feature_df)
    p_B_big_given_B = binary_prob_positive(final_B_margin_model_v7, feature_df)

    return combine_conditional_probs(
        p_A_win=p_A_win,
        p_A_big_given_A=p_A_big_given_A,
        p_B_big_given_B=p_B_big_given_B
    )

def predict_known_toss_public_v7(date, team_a, team_b, venue, city, toss_winner, toss_decision):
    feat = make_feature_row_v2(
        date=date,
        team_a=team_a,
        team_b=team_b,
        venue=venue,
        city=city,
        toss_winner_is_A=int(toss_winner == team_a),
        toss_decision_bat=int(toss_decision == "bat")
    )

    feature_df = pd.DataFrame([feat])

    # V5 probability
    p_v5 = predict_raw_public_v5(feature_df)
    p_v5 = postprocess_public_v5(p_v5)

    # V6 probability
    p_v6 = predict_known_toss_public_v6(
        date=date,
        team_a=team_a,
        team_b=team_b,
        venue=venue,
        city=city,
        toss_winner=toss_winner,
        toss_decision=toss_decision
    ).reshape(1, -1)

    # V7 probability
    p_v7 = predict_v7_raw_public(feature_df)

    w_v5 = best_v7_public["w_v5"]
    w_v6 = best_v7_public["w_v6"]
    w_v7 = best_v7_public["w_v7"]

    total = w_v5 + w_v6 + w_v7

    p_mix = (w_v5 * p_v5 + w_v6 * p_v6 + w_v7 * p_v7) / total
    p_mix = normalize_probs(p_mix)

    p_mix = apply_gamma(p_mix, best_v7_public["gamma"])
    p_mix = blend_prior(p_mix, prior_all_v7_public, best_v7_public["alpha"])

    return normalize_probs(p_mix)[0]
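
The recombination and calibration steps called by the two functions above can be sketched in isolation. The helper bodies below are assumptions for illustration only: the notebook's own `combine_conditional_probs`, `apply_gamma`, and `blend_prior` are defined in earlier cells and may differ. The sketch assumes the class order `[A_small, A_big, B_small, B_big]` used by the submission code, treats gamma as a power applied to each probability before renormalizing, and treats alpha as the weight given to the prior.

```python
import numpy as np

def combine_conditional_probs_sketch(p_A_win, p_A_big_given_A, p_B_big_given_B):
    """Hypothetical recombination: P(class) = P(side) * P(margin | side)."""
    p_A_win = np.asarray(p_A_win, dtype=float)
    p_A_big = np.asarray(p_A_big_given_A, dtype=float)
    p_B_big = np.asarray(p_B_big_given_B, dtype=float)
    p_B_win = 1.0 - p_A_win
    # Assumed column order: A_small, A_big, B_small, B_big
    return np.column_stack([
        p_A_win * (1.0 - p_A_big),
        p_A_win * p_A_big,
        p_B_win * (1.0 - p_B_big),
        p_B_win * p_B_big,
    ])

def normalize_rows(p, eps=1e-12):
    """Clip away zeros and rescale each row to sum to 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    return p / p.sum(axis=1, keepdims=True)

def apply_gamma_sketch(p, gamma):
    """Assumed form of gamma calibration: raise to a power, then renormalize."""
    return normalize_rows(np.power(normalize_rows(p), gamma))

def blend_prior_sketch(p, prior, alpha):
    """Assumed form of prior blending: convex mix of prediction and prior."""
    prior = np.asarray(prior, dtype=float)
    return normalize_rows((1.0 - alpha) * np.asarray(p, dtype=float) + alpha * prior)

# One match: P(A wins) = 0.6, P(big | A wins) = 0.25, P(big | B wins) = 0.5
p_demo = combine_conditional_probs_sketch([0.6], [0.25], [0.5])
p_flat = apply_gamma_sketch(p_demo, gamma=0.5)            # gamma < 1 flattens
p_prior = blend_prior_sketch(p_demo, [0.25] * 4, alpha=1.0)  # alpha = 1 returns the prior
```

Because the four products share the two side probabilities, each row sums to 1 by construction, so the final `normalize_probs` call in the wrapper only has to absorb rounding and the effect of the blend.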

Code

Cell 80 — Submission Generation & Audit Assertions

Code Logic & Technical Explanation

Generates outcome predictions for all 53 rows in the competition test set.
Public matches (toss known) go through the V7 public known-toss wrapper; private matches (toss unknown) go through the V2 pre-toss marginalization model.
Any match ID in the sample file that has no prediction falls back to the V7 prior (prior_all_v7_public), with a printed warning.
Runs strict assertions: the shape is exactly 53x5, match IDs are in the same order as the sample file, all probabilities lie within [0.0, 1.0], and each row sums to 1.0 within a tolerance of 1e-6.
Saves the final results to submission_v7.csv.
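
The idea behind pre-toss marginalization can be illustrated with a minimal sketch: when the toss is unknown, predict under every toss scenario and average the results, weighted by how likely each scenario is. Everything below is hypothetical. `predict_private_pretoss_v2` is defined in Version 2 and is not reproduced here; the function name `marginalize_over_toss`, the uniform scenario weights, and the toy predictor are assumptions made only to show the technique.

```python
import numpy as np

def marginalize_over_toss(predict_known_toss, toss_probs=None):
    """Average a known-toss predictor over the four (winner, decision) scenarios."""
    scenarios = [(w, d) for w in (1, 0) for d in (1, 0)]
    if toss_probs is None:
        # Assumption: each toss winner / decision combination is equally likely
        toss_probs = {s: 0.25 for s in scenarios}
    total = np.zeros(4)
    for w, d in scenarios:
        p = predict_known_toss(toss_winner_is_A=w, toss_decision_bat=d)
        total += toss_probs[(w, d)] * np.asarray(p, dtype=float)
    return total / total.sum()

def toy_predictor(toss_winner_is_A, toss_decision_bat):
    """Stand-in model: winning the toss shifts 0.05 of mass per class to that side."""
    shift = 0.05 if toss_winner_is_A else -0.05
    return np.array([0.25, 0.25, 0.25, 0.25]) + np.array([shift, shift, -shift, -shift])

# Uniform toss scenarios: the toss effect cancels out
p_pretoss = marginalize_over_toss(toy_predictor)

# Scenario weights that say Team A always wins the toss
a_always = {(1, 1): 0.5, (1, 0): 0.5, (0, 1): 0.0, (0, 0): 0.0}
p_a_toss = marginalize_over_toss(toy_predictor, a_always)
```

With uniform weights the toy toss effect averages out, which is the behaviour one would expect from a symmetric pre-toss model; skewed weights would only be justified if there were evidence about the toss before the match.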

🎤 Speaking Notes (For Praful)

"This is the final step. I loop through the test set. For public matches where the toss is known, I use my ensembled V7 public model. For private matches where the toss is unknown, I use my V2 pre-toss marginalization model. I run strict assertions to verify shape and normalization, and export my final submission to submission_v7.csv."

🔄 Connection & Variable Flow

⬅️ Inputs: public_lb, schedule, sample (Datasets), predict_known_toss_public_v7 (From Cell 79), predict_private_pretoss_v2 (From Version 2), plus prior_all_v7_public, normalize_probs, and CLASS_ORDER (defined in earlier cells).

➡️ Outputs: submission_v7.csv (The final output submission file).

⛓️ Continuity: This is the final destination of the pipeline. It calls the prediction functions defined in Cell 79 on the test sets, performs formatting assertions, and writes the final predictions to disk.

Python Code

# ============================================================
# VERSION 7: CREATE FINAL HYBRID SUBMISSION
# Public rows = V7 public model
# Private rows = V2 symmetric pre-toss model
# ============================================================

pred_dict_v7 = {}

# Public rows: V7 public known-toss model
for _, r in public_lb.iterrows():
    match_id = str(r["match_id"])

    p = predict_known_toss_public_v7(
        date=r["date"],
        team_a=r["team_a"],
        team_b=r["team_b"],
        venue=r["venue"],
        city=r["city"],
        toss_winner=r["toss_winner"],
        toss_decision=r["toss_decision"]
    )

    pred_dict_v7[match_id] = p

# Private rows: V2 pre-toss marginalization remains safer
for _, r in schedule.iterrows():
    match_id = str(r["match_id"])

    p = predict_private_pretoss_v2(
        date=r["date"],
        team_a=r["team_a"],
        team_b=r["team_b"],
        venue=r["venue"],
        city=r["city"]
    )

    pred_dict_v7[match_id] = p

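# Reset the index so the positional .loc[i, ...] writes below hit the intended rows
submission_v7 = sample.copy().reset_index(drop=True)
submission_v7["match_id"] = submission_v7["match_id"].astype(str)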

for i, match_id in enumerate(submission_v7["match_id"]):
    if match_id not in pred_dict_v7:
        print("Missing prediction for:", match_id, "using V7 prior fallback")
        p = prior_all_v7_public.copy()
    else:
        p = pred_dict_v7[match_id]

    p = normalize_probs(np.asarray(p).reshape(1, -1))[0]

    submission_v7.loc[i, "A_small"] = p[0]
    submission_v7.loc[i, "A_big"] = p[1]
    submission_v7.loc[i, "B_small"] = p[2]
    submission_v7.loc[i, "B_big"] = p[3]

# Hard checks
assert submission_v7.shape == (53, 5), f"Wrong shape: {submission_v7.shape}"
assert submission_v7["match_id"].astype(str).tolist() == sample["match_id"].astype(str).tolist()
assert np.all(submission_v7[CLASS_ORDER].values >= 0)
assert np.all(submission_v7[CLASS_ORDER].values <= 1)
assert np.allclose(submission_v7[CLASS_ORDER].sum(axis=1).values, 1.0, atol=1e-6)

submission_v7.to_csv("submission_v7.csv", index=False)

print("submission_v7.csv saved successfully")
print("Rows:", submission_v7.shape)
print("Probability row sum min:", submission_v7[CLASS_ORDER].sum(axis=1).min())
print("Probability row sum max:", submission_v7[CLASS_ORDER].sum(axis=1).max())

print("\nAverage probabilities:")
print(submission_v7[CLASS_ORDER].mean())

display(submission_v7.head())
display(submission_v7.tail())
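
The hard checks above stop at the first failing assertion. As a complement, a minimal sketch of a helper that collects every problem in one pass is shown below. The function name `audit_submission`, its return format, and the toy frames are hypothetical and not part of the notebook; the checks mirror the assertions above, plus an explicit missing-value check.

```python
import numpy as np
import pandas as pd

def audit_submission(sub, sample, class_cols, atol=1e-6):
    """Return a list of problems found in `sub`; an empty list means all checks passed."""
    problems = []
    if sub.shape != sample.shape:
        problems.append(f"shape {sub.shape} != expected {sample.shape}")
    if sub["match_id"].astype(str).tolist() != sample["match_id"].astype(str).tolist():
        problems.append("match_id order differs from the sample file")
    vals = sub[class_cols].to_numpy(dtype=float)
    if np.isnan(vals).any():
        problems.append("missing probabilities found")
    if (vals < 0).any() or (vals > 1).any():
        problems.append("probabilities outside [0, 1]")
    if not np.allclose(vals.sum(axis=1), 1.0, atol=atol):
        problems.append("rows do not sum to 1")
    return problems

# Toy data standing in for the real sample file and submission
cols = ["A_small", "A_big", "B_small", "B_big"]
sample_toy = pd.DataFrame({"match_id": ["m1", "m2"], **{c: 0.25 for c in cols}})

good = sample_toy.copy()
bad = sample_toy.copy()
bad.loc[0, "A_small"] = 0.5  # first row now sums to 1.25

good_report = audit_submission(good, sample_toy, cols)
bad_report = audit_submission(bad, sample_toy, cols)
```

Returning a list rather than raising makes it easier to see all issues at once while debugging; the strict assertions remain the right choice as the final gate before writing the CSV.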
