Cell 71 - Conceptual Framework: Two-Stage Decomposition
Key Conceptual Points
- Introduces the mathematical theory behind the **Version 7 Conditional Two-Stage Model**.
- Decomposes a 4-class problem into hierarchical sub-tasks: predicting the winner side (Stage 1), followed by predicting margins conditional on the winner (Stage 2).
- Resolves a critical flaw in single-stage models: batting-first margins (run-based) and chasing margins (wicket-based) follow different distributions, so each gets its own margin model.
🎤 Speaking Notes (For Praful)
"Here, I set up the architectural framework for Version 7. Instead of force-fitting a single 4-class model, I decouple the target mathematically. Stage 1 predicts which side wins (Team A vs Team B). Stage 2 uses two separate models: Model A predicts whether Team A's run margin is small or big, and Model B predicts whether Team B's wicket margin is small or big. We then recombine these using conditional probability, which keeps the model leakage-free and highly specialized."
🔗 Connection & Variable Flow
The best V6 validation log loss was 1.264503. To go below 1.25, the next serious idea is:
- Stage 1: Predict winner side → A win or B win
- Stage 2A: If A wins, predict A_small vs A_big
- Stage 2B: If B wins, predict B_small vs B_big
This should improve on V6 because V6 used one generic big/small model for both A and B, while margin behavior can differ between the two sides:
A_big pattern ≠ B_big pattern
So V7 uses:
P(A_small) = P(A wins) × P(A small | A wins)
P(A_big) = P(A wins) × P(A big | A wins)
P(B_small) = P(B wins) × P(B small | B wins)
P(B_big) = P(B wins) × P(B big | B wins)
This is still leakage-safe: the winner label only selects which rows train each Stage 2 model and is never used as an input feature.
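As a quick numeric check of the decomposition above, the following minimal sketch multiplies three stage probabilities and confirms that the four class probabilities sum to 1. The numbers are made up for illustration and are not model outputs.

```python
import numpy as np

# Illustrative stage outputs for a single match (made-up values).
p_a_wins = 0.60          # Stage 1: P(A wins)
p_a_big_given_a = 0.30   # Stage 2A: P(A big | A wins)
p_b_big_given_b = 0.45   # Stage 2B: P(B big | B wins)

p_b_wins = 1.0 - p_a_wins

# Class order: A_small, A_big, B_small, B_big
p_4class = np.array([
    p_a_wins * (1.0 - p_a_big_given_a),
    p_a_wins * p_a_big_given_a,
    p_b_wins * (1.0 - p_b_big_given_b),
    p_b_wins * p_b_big_given_b,
])

print(p_4class)
print(p_4class.sum())
```

Because each Stage 2 pair (small, big) sums to 1 within its side, the four products always sum to P(A wins) + P(B wins) = 1.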
Cell 72 - Label Construction & Mapping
Code Logic & Technical Explanation
- Copies the public training dataset from the Version 3 cells to preserve previous feature engineering.
- Applies helper mapping functions to derive binary labels: `target_side_A` (winner side), `target_A_big` (large run margin), and `target_B_big` (large wicket margin).
- Prints target distributions to inspect class proportions and verify the correctness of the split mapping.
🎤 Speaking Notes (For Praful)
"In this cell, I copy the feature-engineered dataset from Version 3 and construct my binary labels. I map the multi-class labels into target_side_A to know who wins, and then create conditional indicators target_A_big and target_B_big to represent the margins. Printing the value counts lets me check that the labels map correctly and see how balanced the classes are."
🔗 Connection & Variable Flow
- Inputs: train_df_v3_public (DataFrame containing all engineered rolling, phase, and venue features).
- Outputs: train_df_v7_public (the primary training DataFrame with new binary target labels).
# ============================================================
# VERSION 7: CONDITIONAL TWO-STAGE MODEL
# Stage 1: A/B winner
# Stage 2A: A_small vs A_big only on A-win matches
# Stage 2B: B_small vs B_big only on B-win matches
# ============================================================
from catboost import CatBoostClassifier
from sklearn.metrics import log_loss
import numpy as np
import pandas as pd
train_df_v7_public = train_df_v3_public.copy()
def is_A_win(label):
    return 1 if str(label).startswith("A_") else 0
def is_A_big(label):
    return 1 if str(label) == "A_big" else 0
def is_B_big(label):
    return 1 if str(label) == "B_big" else 0
train_df_v7_public["target_side_A"] = train_df_v7_public["target"].apply(is_A_win)
# Conditional labels
train_df_v7_public["target_A_big"] = train_df_v7_public["target"].apply(is_A_big)
train_df_v7_public["target_B_big"] = train_df_v7_public["target"].apply(is_B_big)
print("Full target distribution:")
print(train_df_v7_public["target"].value_counts())
print("\nSide distribution:")
print(train_df_v7_public["target_side_A"].value_counts(normalize=True))
print("\nA-win matches:")
print(train_df_v7_public[train_df_v7_public["target_side_A"] == 1]["target"].value_counts())
print("\nB-win matches:")
print(train_df_v7_public[train_df_v7_public["target_side_A"] == 0]["target"].value_counts())
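To see what the three binary labels look like, here is a minimal sketch on a toy frame. The four rows are invented for illustration, and the mapping uses vectorized string comparisons in place of `apply`, which is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Toy 4-class labels (invented rows, one per class).
toy = pd.DataFrame({"target": ["A_small", "A_big", "B_small", "B_big"]})

# 1 when Team A wins, 0 when Team B wins.
toy["target_side_A"] = toy["target"].str.startswith("A_").astype(int)
# Margin indicators; each is only meaningful on its own side's wins.
toy["target_A_big"] = (toy["target"] == "A_big").astype(int)
toy["target_B_big"] = (toy["target"] == "B_big").astype(int)

print(toy)
```

Note that `target_A_big` is 0 on every B-win row, which is why the Stage 2A model is trained only on rows where `target_side_A == 1` (and likewise for Stage 2B).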
Cell 73 - Time-Based Validation Splitting
Code Logic & Technical Explanation
- Defines metadata and label columns to exclude from training features.
- Enforces a **strict chronological validation split**: all matches from 2025 onwards are held out for validation.
- Includes a safety fallback: if 2025 matches are fewer than 30, it automatically defaults to an 85% chronological percentile cutoff date.
- Replaces missing values in categorical fields with `'Unknown'` and registers the categorical column indices for CatBoost.
🎤 Speaking Notes (For Praful)
"Here, I implement a strict time-based validation split. Matches on or after January 1, 2025, are held out. Using a temporal split instead of standard cross-validation is vital in sports analytics because it prevents temporal leakage. I also prepare my categorical features here, replacing nulls with 'Unknown' so CatBoost can interpret them natively."
🔗 Connection & Variable Flow
- Inputs: train_df_v7_public (from Cell 72).
- Outputs: X_dev_v7, X_val_v7 (winner-side features), y_side_dev_v7, y_side_val_v7 (winner-side targets), and cat_features_v7 (categorical column indices).
# ============================================================
# VERSION 7: TIME VALIDATION SPLIT
# ============================================================
drop_cols_v7 = [
    "target",
    "target_side_A",
    "target_A_big",
    "target_B_big",
    "Date",
    "match_id"
]
val_mask_v7 = train_df_v7_public["Date"] >= pd.Timestamp("2025-01-01")
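if val_mask_v7.sum() < 30:
    # Fallback: hold out roughly the most recent 15% of matches.
    # Sorting the dates keeps the positional cutoff chronological
    # even if the frame itself is not ordered by date.
    cutoff_idx = int(len(train_df_v7_public) * 0.85)
    cutoff_date = train_df_v7_public["Date"].sort_values().iloc[cutoff_idx]
    val_mask_v7 = train_df_v7_public["Date"] >= cutoff_date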
dev_df_v7 = train_df_v7_public[~val_mask_v7].copy()
val_df_v7 = train_df_v7_public[val_mask_v7].copy()
X_dev_v7 = dev_df_v7.drop(columns=drop_cols_v7)
X_val_v7 = val_df_v7.drop(columns=drop_cols_v7)
y_side_dev_v7 = dev_df_v7["target_side_A"].values
y_side_val_v7 = val_df_v7["target_side_A"].values
y_4class_val_v7 = val_df_v7["target"].map(CLASS_TO_ID).values
cat_cols_v7 = X_dev_v7.select_dtypes(include=["object"]).columns.tolist()
for c in cat_cols_v7:
    X_dev_v7[c] = X_dev_v7[c].fillna("Unknown").astype(str)
    X_val_v7[c] = X_val_v7[c].fillna("Unknown").astype(str)
cat_features_v7 = [X_dev_v7.columns.get_loc(c) for c in cat_cols_v7]
print("V7 Dev:", dev_df_v7.shape, dev_df_v7["Date"].min(), dev_df_v7["Date"].max())
print("V7 Val:", val_df_v7.shape, val_df_v7["Date"].min(), val_df_v7["Date"].max())
print("V7 Features:", X_dev_v7.shape[1])
print("V7 categorical:", cat_cols_v7)
print("\nV7 validation distribution:")
print(val_df_v7["target"].value_counts())
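The split-with-fallback logic can be exercised in isolation. The sketch below is a hedged illustration on synthetic dates; `time_split_mask` and its parameter names are hypothetical and not part of the notebook.

```python
import pandas as pd

def time_split_mask(dates, cutoff="2025-01-01", min_val_rows=30, fallback_frac=0.85):
    """Boolean mask of validation rows (hypothetical helper for illustration)."""
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    mask = dates >= pd.Timestamp(cutoff)
    if mask.sum() < min_val_rows:
        # Too few recent rows: use the date at the 85% position
        # of the chronologically sorted dates as the cutoff instead.
        cutoff_idx = int(len(dates) * fallback_frac)
        cutoff_date = dates.sort_values().iloc[cutoff_idx]
        mask = dates >= cutoff_date
    return mask

# 100 daily matches in 2024 and none in 2025, so the fallback is used.
toy_dates = pd.date_range("2024-01-01", periods=100, freq="D")
mask = time_split_mask(toy_dates)

print("validation rows:", int(mask.sum()))
print("first validation date:", toy_dates[mask.values].min())
```

In either branch every development row predates every validation row, which is the property the temporal split relies on.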
Cell 74 - Conditional Subsets Extraction
Code Logic & Technical Explanation
- Partitions the main Dev/Val datasets into two specialized, conditional training subsets.
- A-win Subset: Filters rows where Team A won (`target_side_A == 1`) to isolate features and the `target_A_big` target.
- B-win Subset: Filters rows where Team B won (`target_side_A == 0`) to isolate features and the `target_B_big` target.
- Applies category string normalization to both sets to guarantee alignment.
🎤 Speaking Notes (For Praful)
"To train my Stage 2 models, I partition my split datasets. I build the A-win subset containing only matches won by Team A, which will train Model A on run-margins. Similarly, I build the B-win subset for Team B chases. This ensures the margin models only learn from matches where that specific team actually won, isolating the margin signals."
🔗 Connection & Variable Flow
- Inputs: dev_df_v7, val_df_v7 (from Cell 73) and targets (from Cell 72).
- Outputs: X_dev_A_v7, y_dev_A_big_v7 (for Model A); and X_dev_B_v7, y_dev_B_big_v7 (for Model B).
# ============================================================
# VERSION 7: CONDITIONAL DATASETS
# ============================================================
# A-win subset for A_small vs A_big
dev_A_v7 = dev_df_v7[dev_df_v7["target_side_A"] == 1].copy()
val_A_v7 = val_df_v7[val_df_v7["target_side_A"] == 1].copy()
X_dev_A_v7 = dev_A_v7.drop(columns=drop_cols_v7)
y_dev_A_big_v7 = dev_A_v7["target_A_big"].values
X_val_A_v7 = val_A_v7.drop(columns=drop_cols_v7)
y_val_A_big_v7 = val_A_v7["target_A_big"].values
# B-win subset for B_small vs B_big
dev_B_v7 = dev_df_v7[dev_df_v7["target_side_A"] == 0].copy()
val_B_v7 = val_df_v7[val_df_v7["target_side_A"] == 0].copy()
X_dev_B_v7 = dev_B_v7.drop(columns=drop_cols_v7)
y_dev_B_big_v7 = dev_B_v7["target_B_big"].values
X_val_B_v7 = val_B_v7.drop(columns=drop_cols_v7)
y_val_B_big_v7 = val_B_v7["target_B_big"].values
for df in [X_dev_A_v7, X_val_A_v7, X_dev_B_v7, X_val_B_v7]:
    for c in cat_cols_v7:
        df[c] = df[c].fillna("Unknown").astype(str)
print("A conditional dev:", X_dev_A_v7.shape, "val:", X_val_A_v7.shape)
print("A conditional target:", pd.Series(y_dev_A_big_v7).value_counts(normalize=True))
print("\nB conditional dev:", X_dev_B_v7.shape, "val:", X_val_B_v7.shape)
print("B conditional target:", pd.Series(y_dev_B_big_v7).value_counts(normalize=True))
Cell 75 - Mathematical Calibration Helpers
Code Logic & Technical Explanation
- Defines numerical and probability utility functions for calibration and final prediction.
- `normalize_probs`: Clips predictions to `[1e-8, 1.0]` and renormalizes each row to sum to 1, preventing division-by-zero and log(0) errors.
- `apply_gamma`: Implements **Temperature Scaling** by raising each probability to the power `gamma` and renormalizing (`gamma > 1` sharpens the distribution, `gamma < 1` flattens it).
- `blend_prior`: Shrinks extreme probability vectors towards baseline prior frequencies, safeguarding log-loss performance.
- `combine_conditional_probs`: Multiplies the Stage 1 winner probabilities element-wise with the Stage 2 conditional margin predictions to reconstruct the final 4-class output.
🎤 Speaking Notes (For Praful)
"This cell houses my mathematical calibration functions. `normalize_probs` secures numerical stability by clipping values. `apply_gamma` scales probability temperatures to adjust model confidence. `blend_prior` blends our predictions with the training priors to prevent extreme losses. Finally, `combine_conditional_probs` implements the probability multiplication formula to stack our outputs into the 4-class format."
🔗 Connection & Variable Flow
- Outputs: helper functions (normalize_probs, apply_gamma, blend_prior, combine_conditional_probs).
# ============================================================
# VERSION 7: HELPERS
# ============================================================
def normalize_probs(p):
    # Clip away exact zeros, then rescale each row to sum to 1
    p = np.clip(np.asarray(p), 1e-8, 1)
    p = p / p.sum(axis=1, keepdims=True)
    return p
def apply_gamma(p, gamma):
    # Power scaling: gamma > 1 sharpens, gamma < 1 flattens
    p = normalize_probs(p)
    p = p ** gamma
    return normalize_probs(p)
def blend_prior(p, prior, alpha):
    # Convex blend of the prediction (weight alpha) and the class prior
    p = normalize_probs(p)
    prior = np.asarray(prior).reshape(1, -1)
    return normalize_probs(alpha * p + (1 - alpha) * prior)
def binary_prob_positive(model, X):
    # Probability of the positive class from a binary classifier
    p = model.predict_proba(X)
    return np.asarray(p)[:, 1]
def combine_conditional_probs(p_A_win, p_A_big_given_A, p_B_big_given_B):
    p_A_win = np.clip(np.asarray(p_A_win), 1e-6, 1 - 1e-6)
    p_B_win = 1 - p_A_win
    p_A_big_given_A = np.clip(np.asarray(p_A_big_given_A), 1e-6, 1 - 1e-6)
    p_B_big_given_B = np.clip(np.asarray(p_B_big_given_B), 1e-6, 1 - 1e-6)
    p_A_small_given_A = 1 - p_A_big_given_A
    p_B_small_given_B = 1 - p_B_big_given_B
    # Columns follow the class order A_small, A_big, B_small, B_big
    out = np.vstack([
        p_A_win * p_A_small_given_A,
        p_A_win * p_A_big_given_A,
        p_B_win * p_B_small_given_B,
        p_B_win * p_B_big_given_B
    ]).T
    return normalize_probs(out)
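To make the effect of the two calibration knobs concrete, here is a minimal self-contained sketch. It restates compact versions of the helpers so it runs on its own; the prediction and the uniform prior are made-up values.

```python
import numpy as np

def normalize_probs(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-8, 1)
    return p / p.sum(axis=1, keepdims=True)

def apply_gamma(p, gamma):
    return normalize_probs(normalize_probs(p) ** gamma)

def blend_prior(p, prior, alpha):
    prior = np.asarray(prior, dtype=float).reshape(1, -1)
    return normalize_probs(alpha * normalize_probs(p) + (1 - alpha) * prior)

p = np.array([[0.40, 0.30, 0.20, 0.10]])    # made-up 4-class prediction
prior = np.array([0.25, 0.25, 0.25, 0.25])  # made-up uniform prior

sharp = apply_gamma(p, 2.0)          # gamma > 1 pushes mass to the top class
flat = apply_gamma(p, 0.5)           # gamma < 1 pulls classes closer together
shrunk = blend_prior(p, prior, 0.9)  # moves 10% of the weight onto the prior

print("gamma=2.0:", sharp.round(3))
print("gamma=0.5:", flat.round(3))
print("alpha=0.9:", shrunk.round(3))
```

Raising probabilities to a power and renormalizing is equivalent to dividing the log-probabilities by a temperature of `1 / gamma`, which is why the notes describe it as temperature scaling.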
Cell 76 - Model Training & Hyperparameter Search
Code Logic & Technical Explanation
- Sets up three distinct hyperparameter configurations with varying tree depths (2 and 3), learning rates, random seeds, and L2 leaf regularization.
- Trains **three separate CatBoostClassifiers** (Winner Side, Model A, Model B) for each configuration on the development sets.
- Uses validation sets for early stopping (`use_best_model=True`, `od_wait=80`) to limit overfitting to the training data.
- Combines validation outputs using `combine_conditional_probs` and prints the validation log-loss for each config.
🎤 Speaking Notes (For Praful)
"Here, I define my model configurations and run the training loop. For each config, I train the Winner Side model, Model A, and Model B. I enable early stopping with validation sets to prevent overfitting, combine the predictions using my conditional formula, and evaluate the log loss. The raw validation scores are stored to help in ensembling."
🔗 Connection & Variable Flow
- Inputs: CLASS_TO_ID mapping.
- Outputs: trained models (v7_side_models, etc.) and validation prediction vectors (v7_preds).
# ============================================================
# VERSION 7: TRAIN SIDE + CONDITIONAL MARGIN MODELS
# ============================================================
v7_configs = [
    {
        "name": "v7_depth2",
        "depth": 2,
        "learning_rate": 0.035,
        "l2_leaf_reg": 12,
        "random_strength": 2.0,
        "bagging_temperature": 0.9,
        "iterations": 900,
        "seed": 42
    },
    {
        "name": "v7_depth2_reg",
        "depth": 2,
        "learning_rate": 0.030,
        "l2_leaf_reg": 18,
        "random_strength": 2.5,
        "bagging_temperature": 1.0,
        "iterations": 1000,
        "seed": 99
    },
    {
        "name": "v7_depth3",
        "depth": 3,
        "learning_rate": 0.025,
        "l2_leaf_reg": 10,
        "random_strength": 2.0,
        "bagging_temperature": 0.8,
        "iterations": 900,
        "seed": 42
    }
]
v7_side_models = {}
v7_A_margin_models = {}
v7_B_margin_models = {}
v7_preds = {}
v7_results = []
for cfg in v7_configs:
    print("\nTraining config:", cfg["name"])
    # Winner side model
    side_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="Logloss",
        iterations=cfg["iterations"],
        depth=cfg["depth"],
        learning_rate=cfg["learning_rate"],
        l2_leaf_reg=cfg["l2_leaf_reg"],
        random_strength=cfg["random_strength"],
        bagging_temperature=cfg["bagging_temperature"],
        border_count=128,
        random_seed=cfg["seed"],
        od_type="Iter",
        od_wait=80,
        verbose=100
    )
    side_model.fit(
        X_dev_v7,
        y_side_dev_v7,
        cat_features=cat_features_v7,
        eval_set=(X_val_v7, y_side_val_v7),
        use_best_model=True
    )
    # A margin model: A_small vs A_big
    A_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="Logloss",
        iterations=cfg["iterations"],
        depth=cfg["depth"],
        learning_rate=cfg["learning_rate"],
        l2_leaf_reg=cfg["l2_leaf_reg"],
        random_strength=cfg["random_strength"],
        bagging_temperature=cfg["bagging_temperature"],
        border_count=128,
        random_seed=cfg["seed"] + 11,
        od_type="Iter",
        od_wait=80,
        verbose=100
    )
    A_model.fit(
        X_dev_A_v7,
        y_dev_A_big_v7,
        cat_features=cat_features_v7,
        eval_set=(X_val_A_v7, y_val_A_big_v7),
        use_best_model=True
    )
    # B margin model: B_small vs B_big
    B_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="Logloss",
        iterations=cfg["iterations"],
        depth=cfg["depth"],
        learning_rate=cfg["learning_rate"],
        l2_leaf_reg=cfg["l2_leaf_reg"],
        random_strength=cfg["random_strength"],
        bagging_temperature=cfg["bagging_temperature"],
        border_count=128,
        random_seed=cfg["seed"] + 22,
        od_type="Iter",
        od_wait=80,
        verbose=100
    )
    B_model.fit(
        X_dev_B_v7,
        y_dev_B_big_v7,
        cat_features=cat_features_v7,
        eval_set=(X_val_B_v7, y_val_B_big_v7),
        use_best_model=True
    )
    # Score every validation row with all three models: the margin models
    # supply P(big | side wins), which Stage 1 then weights by P(side wins).
    p_A_win = binary_prob_positive(side_model, X_val_v7)
    p_A_big_given_A = binary_prob_positive(A_model, X_val_v7)
    p_B_big_given_B = binary_prob_positive(B_model, X_val_v7)
    p_4 = combine_conditional_probs(
        p_A_win=p_A_win,
        p_A_big_given_A=p_A_big_given_A,
        p_B_big_given_B=p_B_big_given_B
    )
    raw_loss = log_loss(y_4class_val_v7, p_4, labels=[0, 1, 2, 3])
    name = cfg["name"]
    v7_side_models[name] = side_model
    v7_A_margin_models[name] = A_model
    v7_B_margin_models[name] = B_model
    v7_preds[name] = p_4
    res = {
        "name": name,
        "raw_loss": raw_loss,
        "side_iter": side_model.best_iteration_,
        "A_margin_iter": A_model.best_iteration_,
        "B_margin_iter": B_model.best_iteration_
    }
    v7_results.append(res)
    print("\nV7 result:", res)
print("\nAll V7 raw results:")
for r in v7_results:
    print(r)
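The `raw_loss` reported for each config is the multi-class log loss of the combined 4-class probabilities. As a sketch of what that metric computes, here is a NumPy-only version on made-up predictions; `multiclass_log_loss` is a hypothetical helper written for this note, not the scikit-learn function used in the cell.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    probs = probs / probs.sum(axis=1, keepdims=True)
    y_true = np.asarray(y_true)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

# Two made-up matches; class ids follow A_small=0, A_big=1, B_small=2, B_big=3.
y = [0, 3]
probs = [
    [0.42, 0.18, 0.22, 0.18],
    [0.10, 0.10, 0.20, 0.60],
]

loss = multiclass_log_loss(y, probs)
uniform = multiclass_log_loss(y, np.full((2, 4), 0.25))

print("example loss:", round(loss, 4))
print("uniform-guess loss:", round(uniform, 4))
```

A uniform guess over four classes always scores ln 4 ≈ 1.386, which gives a reference point for reading losses such as the V6 value of 1.264503.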
Cell 77 - Ensembling & Calibration Grid Search
Code Logic & Technical Explanation
- Ensembles predictions from the three major pipeline versions: V5 (baseline ensemble), V6 (first two-stage model), and V7 (this conditional model).
- Rebuilds the optimal calibrated validation outputs for V5 and V6 using their respective configs.
- Executes a multi-dimensional grid search over ensembling weights, temperature scale `gamma`, and prior blending weight `alpha`.
- Selects and saves the configuration that minimizes multi-class log loss on the validation set.
🎤 Speaking Notes (For Praful)
"In this cell, I ensemble my models. I reconstruct the validation predictions from Version 5 and Version 6, and perform a grid search over blend weights, temperature scaling gamma, and prior blending alpha. Blending these different models and tuning the calibration parameters lets me pick the combination with the lowest validation log loss."
🔗 Connection & Variable Flow
- Inputs: v7_preds (from Cell 76), V5/V6 historical validation predictions, math helpers (from Cell 75), and ground truth target IDs (from Cell 73).
- Outputs: best_v7_public (dictionary with optimal ensembling weights, calibration gamma, and alpha).
# ============================================================
# VERSION 7 FAST BLEND SEARCH
# Replaces the slow interrupted V7 blend cell
# ============================================================
from sklearn.metrics import log_loss
import numpy as np
# Rebuild V5 validation probabilities
p_v5_val = np.zeros_like(list(v5_val_raw_preds.values())[0])
for name, w in best_v5_public["weights"].items():
    p_v5_val += w * v5_val_raw_preds[name]
p_v5_val = p_v5_val / sum(best_v5_public["weights"].values())
p_v5_val = apply_gamma(p_v5_val, best_v5_public["gamma"])
p_v5_val = blend_prior(p_v5_val, prior_dev_v5, best_v5_public["alpha"])
p_v5_val = normalize_probs(p_v5_val)
# Rebuild V6 validation probabilities
p_v6_raw_best = v6_two_stage_preds[best_v6_public["two_stage_model"]]
p_v6_mix = (
    best_v6_public["blend_with_v5"] * p_v5_val
    + (1 - best_v6_public["blend_with_v5"]) * p_v6_raw_best
)
p_v6_mix = normalize_probs(p_v6_mix)
p_v6_mix = apply_gamma(p_v6_mix, best_v6_public["gamma"])
prior_dev_v6_4class = np.bincount(
    dev_df_v6["target"].map(CLASS_TO_ID).values,
    minlength=4
) / len(dev_df_v6)
p_v6_val = blend_prior(
    p_v6_mix,
    prior_dev_v6_4class,
    best_v6_public["alpha"]
)
p_v6_val = normalize_probs(p_v6_val)
print("V5 check:", log_loss(y_4class_val_v7, p_v5_val, labels=[0, 1, 2, 3]))
print("V6 check:", log_loss(y_4class_val_v7, p_v6_val, labels=[0, 1, 2, 3]))
prior_dev_v7_4class = np.bincount(
    dev_df_v7["target"].map(CLASS_TO_ID).values,
    minlength=4
) / len(dev_df_v7)
best_v7_public = {
    "loss": 999,
    "v7_model": None,
    "w_v5": None,
    "w_v6": None,
    "w_v7": None,
    "gamma": None,
    "alpha": None
}
# Much smaller, smarter grid
# Each tuple is (w_v5, w_v6, w_v7)
weight_sets = [
    # mostly V6, small V7
    (0.0, 0.90, 0.10),
    (0.0, 0.80, 0.20),
    (0.0, 0.70, 0.30),
    # V6 + V5 + V7
    (0.10, 0.80, 0.10),
    (0.10, 0.70, 0.20),
    (0.10, 0.60, 0.30),
    # mostly V7
    (0.0, 0.40, 0.60),
    (0.0, 0.30, 0.70),
    (0.0, 0.20, 0.80),
    # pure checks
    (0.0, 1.00, 0.0),
    (0.0, 0.0, 1.0),
]
gamma_grid = np.linspace(1.6, 2.6, 21)
alpha_grid = [1.0, 0.95, 0.90]
for v7_name, p_v7_raw in v7_preds.items():
    p_v7_raw = normalize_probs(p_v7_raw)
    for w_v5, w_v6, w_v7 in weight_sets:
        total = w_v5 + w_v6 + w_v7
        p_mix = (
            w_v5 * p_v5_val +
            w_v6 * p_v6_val +
            w_v7 * p_v7_raw
        ) / total
        p_mix = normalize_probs(p_mix)
        for gamma in gamma_grid:
            pg = apply_gamma(p_mix, gamma)
            for alpha in alpha_grid:
                pf = blend_prior(pg, prior_dev_v7_4class, alpha)
                loss = log_loss(y_4class_val_v7, pf, labels=[0, 1, 2, 3])
                if loss < best_v7_public["loss"]:
                    best_v7_public = {
                        "loss": loss,
                        "v7_model": v7_name,
                        "w_v5": w_v5,
                        "w_v6": w_v6,
                        "w_v7": w_v7,
                        "gamma": gamma,
                        "alpha": alpha
                    }
print("BEST V7 FAST PUBLIC VALIDATION RESULT:")
print(best_v7_public)
print("\nCompare:")
print("V5:", best_v5_public["loss"])
print("V6:", best_v6_public["loss"])
print("V7 fast:", best_v7_public["loss"])
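The calibration part of the search can be tried without any trained models. The sketch below runs the same gamma/alpha loop on synthetic, deliberately underconfident predictions; the data, the grid bounds, and the `log_loss_4` helper are assumptions made for illustration, and the prior is taken from the toy labels for brevity rather than from a separate development set.

```python
import numpy as np

def normalize_probs(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-8, 1)
    return p / p.sum(axis=1, keepdims=True)

def log_loss_4(y, p):
    # Mean negative log-probability of the true class
    return float(-np.mean(np.log(p[np.arange(len(y)), y])))

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 4, size=n)
# The synthetic "model" picks the true class about 60% of the time,
# otherwise a random class, and always reports only 0.4 confidence.
guess = np.where(rng.random(n) < 0.6, y, rng.integers(0, 4, size=n))
p_raw = np.full((n, 4), 0.2)
p_raw[np.arange(n), guess] = 0.4
prior = np.bincount(y, minlength=4) / n

best = {"loss": np.inf, "gamma": None, "alpha": None}
for gamma in np.linspace(1.0, 3.0, 21):
    pg = normalize_probs(p_raw ** gamma)
    for alpha in [1.0, 0.95, 0.90]:
        pf = normalize_probs(alpha * pg + (1 - alpha) * prior)
        loss = log_loss_4(y, pf)
        if loss < best["loss"]:
            best = {"loss": loss, "gamma": float(gamma), "alpha": alpha}

baseline = log_loss_4(y, normalize_probs(p_raw))
print("uncalibrated:", round(baseline, 4))
print("best:", best)
```

Because the grid contains the identity setting (`gamma = 1.0`, `alpha = 1.0`), the selected loss can never be worse than the uncalibrated one on the data used for the search.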
Cell 78 - Full-Data Model Retraining
Code Logic & Technical Explanation
- Prepares features and targets on 100% of the public dataset (combining development and validation data).
- Extracts the optimal tree iteration numbers (`best_iteration_ + 1`) for each model from the validation runs in Cell 76.
- Trains the final Stage 1, Stage 2A, and Stage 2B CatBoostClassifiers on the complete dataset, with iteration counts fixed in advance because no validation rows remain for early stopping.
- Computes and prints the final target class prior distributions.
🎤 Speaking Notes (For Praful)
"To prepare my final model for deployment, I retrain on the complete dataset, combining train and validation rows. I reuse the best iteration counts from my validation runs to limit overfitting, fit my final three CatBoost models on all available data, and compute the final training prior for our prior blending step."
🔗 Connection & Variable Flow
- Inputs: train_df_v7_public (from Cell 72), v7_side_models, v7_A_margin_models, v7_B_margin_models (from Cell 76), and best_v7_public (from Cell 77).
- Outputs: final_side_model_v7, final_A_margin_model_v7, final_B_margin_model_v7 (final models trained on all public data), and prior_all_v7_public.
# ============================================================
# VERSION 7: FINAL TRAINING ON ALL PUBLIC DATA
# Run only if V7 beats V6.
# ============================================================
X_all_v7_public = train_df_v7_public.drop(columns=drop_cols_v7)
y_side_all_v7 = train_df_v7_public["target_side_A"].values
A_all_df_v7 = train_df_v7_public[train_df_v7_public["target_side_A"] == 1].copy()
B_all_df_v7 = train_df_v7_public[train_df_v7_public["target_side_A"] == 0].copy()
X_A_all_v7 = A_all_df_v7.drop(columns=drop_cols_v7)
y_A_big_all_v7 = A_all_df_v7["target_A_big"].values
X_B_all_v7 = B_all_df_v7.drop(columns=drop_cols_v7)
y_B_big_all_v7 = B_all_df_v7["target_B_big"].values
cat_cols_v7_final = X_all_v7_public.select_dtypes(include=["object"]).columns.tolist()
for df in [X_all_v7_public, X_A_all_v7, X_B_all_v7]:
    for c in cat_cols_v7_final:
        df[c] = df[c].fillna("Unknown").astype(str)
cat_features_v7_final = [X_all_v7_public.columns.get_loc(c) for c in cat_cols_v7_final]
chosen_v7_name = best_v7_public["v7_model"]
chosen_v7_cfg = None
for cfg in v7_configs:
    if cfg["name"] == chosen_v7_name:
        chosen_v7_cfg = cfg
        break
print("Chosen V7 config:", chosen_v7_cfg)
old_side = v7_side_models[chosen_v7_name]
old_A = v7_A_margin_models[chosen_v7_name]
old_B = v7_B_margin_models[chosen_v7_name]
side_iters = old_side.best_iteration_ + 1 if old_side.best_iteration_ is not None else 150
A_iters = old_A.best_iteration_ + 1 if old_A.best_iteration_ is not None else 100
B_iters = old_B.best_iteration_ + 1 if old_B.best_iteration_ is not None else 100
print("Final iters:", side_iters, A_iters, B_iters)
final_side_model_v7 = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    iterations=side_iters,
    depth=chosen_v7_cfg["depth"],
    learning_rate=chosen_v7_cfg["learning_rate"],
    l2_leaf_reg=chosen_v7_cfg["l2_leaf_reg"],
    random_strength=chosen_v7_cfg["random_strength"],
    bagging_temperature=chosen_v7_cfg["bagging_temperature"],
    border_count=128,
    random_seed=chosen_v7_cfg["seed"],
    verbose=100
)
final_A_margin_model_v7 = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    iterations=A_iters,
    depth=chosen_v7_cfg["depth"],
    learning_rate=chosen_v7_cfg["learning_rate"],
    l2_leaf_reg=chosen_v7_cfg["l2_leaf_reg"],
    random_strength=chosen_v7_cfg["random_strength"],
    bagging_temperature=chosen_v7_cfg["bagging_temperature"],
    border_count=128,
    random_seed=chosen_v7_cfg["seed"] + 11,
    verbose=100
)
final_B_margin_model_v7 = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    iterations=B_iters,
    depth=chosen_v7_cfg["depth"],
    learning_rate=chosen_v7_cfg["learning_rate"],
    l2_leaf_reg=chosen_v7_cfg["l2_leaf_reg"],
    random_strength=chosen_v7_cfg["random_strength"],
    bagging_temperature=chosen_v7_cfg["bagging_temperature"],
    border_count=128,
    random_seed=chosen_v7_cfg["seed"] + 22,
    verbose=100
)
# No eval_set here: all public rows are used for fitting
final_side_model_v7.fit(
    X_all_v7_public,
    y_side_all_v7,
    cat_features=cat_features_v7_final
)
final_A_margin_model_v7.fit(
    X_A_all_v7,
    y_A_big_all_v7,
    cat_features=cat_features_v7_final
)
final_B_margin_model_v7.fit(
    X_B_all_v7,
    y_B_big_all_v7,
    cat_features=cat_features_v7_final
)
prior_all_v7_public = np.bincount(
    train_df_v7_public["target"].map(CLASS_TO_ID).values,
    minlength=4
) / len(train_df_v7_public)
print("Final V7 public prior:")
print(dict(zip(CLASS_ORDER, prior_all_v7_public)))
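The class prior computed at the end of this cell is just the relative frequency of each class id. A minimal sketch on invented labels is shown below; `CLASS_ORDER` and `CLASS_TO_ID` are defined earlier in the notebook, so the definitions here are assumptions that follow the column order used by `combine_conditional_probs`.

```python
import numpy as np

# Assumed to match the notebook's earlier definitions.
CLASS_ORDER = ["A_small", "A_big", "B_small", "B_big"]
CLASS_TO_ID = {name: i for i, name in enumerate(CLASS_ORDER)}

# Eight invented match outcomes.
labels = ["A_small", "A_small", "B_big", "A_big",
          "B_small", "A_small", "B_big", "B_small"]
ids = np.array([CLASS_TO_ID[x] for x in labels])

# minlength=4 keeps the vector at length 4 even if a class never occurs.
prior = np.bincount(ids, minlength=4) / len(ids)

print(dict(zip(CLASS_ORDER, prior)))
```

This vector is what `blend_prior` mixes into the model output at inference time.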
Cell 79 - Inference Wrappers Construction
Code Logic & Technical Explanation
- Defines clean inference wrappers to generate predictions for test matches.
- `predict_v7_raw_public`: Aligns the input features with the training columns (imputing missing columns/categories) and calculates Stage 1 + Stage 2 probabilities.
- `predict_known_toss_public_v7`: Houses the complete prediction logic. It obtains predictions from V5, V6, and V7, blends them using the optimal ensembling weights, and applies calibration transformations.
🎤 Speaking Notes (For Praful)
"Here, I build my final inference wrappers. `predict_v7_raw_public` takes any raw match row, handles missing categories, runs our final three V7 models, and combines them. `predict_known_toss_public_v7` combines predictions from V5, V6, and V7, applies our optimized ensembling weights, scales the probabilities with gamma, and blends them with the prior."
🔗 Connection & Variable Flow
- Inputs: best_v7_public parameters (from Cell 77), math helpers (from Cell 75), and prediction wrappers from V5 and V6.
- Outputs: predict_v7_raw_public and predict_known_toss_public_v7 (inference functions for the test sets).
# ============================================================
# VERSION 7: PUBLIC PREDICTION FUNCTIONS
# ============================================================
def predict_v7_raw_public(feature_df):
    feature_df = feature_df.copy()
    # Categorical columns: create if absent, then fill nulls and cast to str
    for c in cat_cols_v7_final:
        if c not in feature_df.columns:
            feature_df[c] = "Unknown"
        feature_df[c] = feature_df[c].fillna("Unknown").astype(str)
    # Any other missing training column is added as NaN
    for c in X_all_v7_public.columns:
        if c not in feature_df.columns:
            feature_df[c] = np.nan
    # Match the training column order exactly
    feature_df = feature_df[X_all_v7_public.columns]
    p_A_win = binary_prob_positive(final_side_model_v7, feature_df)
    p_A_big_given_A = binary_prob_positive(final_A_margin_model_v7, feature_df)
    p_B_big_given_B = binary_prob_positive(final_B_margin_model_v7, feature_df)
    return combine_conditional_probs(
        p_A_win=p_A_win,
        p_A_big_given_A=p_A_big_given_A,
        p_B_big_given_B=p_B_big_given_B
    )
def predict_known_toss_public_v7(date, team_a, team_b, venue, city, toss_winner, toss_decision):
    feat = make_feature_row_v2(
        date=date,
        team_a=team_a,
        team_b=team_b,
        venue=venue,
        city=city,
        toss_winner_is_A=int(toss_winner == team_a),
        toss_decision_bat=int(toss_decision == "bat")
    )
    feature_df = pd.DataFrame([feat])
    # V5 probability
    p_v5 = predict_raw_public_v5(feature_df)
    p_v5 = postprocess_public_v5(p_v5)
    # V6 probability
    p_v6 = predict_known_toss_public_v6(
        date=date,
        team_a=team_a,
        team_b=team_b,
        venue=venue,
        city=city,
        toss_winner=toss_winner,
        toss_decision=toss_decision
    ).reshape(1, -1)
    # V7 probability
    p_v7 = predict_v7_raw_public(feature_df)
    w_v5 = best_v7_public["w_v5"]
    w_v6 = best_v7_public["w_v6"]
    w_v7 = best_v7_public["w_v7"]
    total = w_v5 + w_v6 + w_v7
    p_mix = (w_v5 * p_v5 + w_v6 * p_v6 + w_v7 * p_v7) / total
    p_mix = normalize_probs(p_mix)
    p_mix = apply_gamma(p_mix, best_v7_public["gamma"])
    p_mix = blend_prior(p_mix, prior_all_v7_public, best_v7_public["alpha"])
    return normalize_probs(p_mix)[0]
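The column-alignment step at the top of `predict_v7_raw_public` is easy to get subtly wrong, so here is a standalone sketch of the same idea. The feature names and the `align_features` helper are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training schema (not the notebook's real feature list).
train_columns = ["venue", "toss_winner_is_A", "elo_diff"]
cat_cols = ["venue"]

def align_features(feature_df, train_columns, cat_cols):
    """Return a copy with exactly the training columns, in training order."""
    out = feature_df.copy()
    for c in train_columns:
        if c not in out.columns:
            # Missing categoricals become "Unknown", missing numerics NaN.
            out[c] = "Unknown" if c in cat_cols else np.nan
    for c in cat_cols:
        out[c] = out[c].fillna("Unknown").astype(str)
    # Selecting by the training list drops extras and fixes the order.
    return out[train_columns]

row = pd.DataFrame([{"elo_diff": 12.5, "extra_col": 1, "venue": None}])
aligned = align_features(row, train_columns, cat_cols)

print(aligned)
```

Keeping the column order identical to training matters here because the categorical features were registered with CatBoost by positional index.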
Cell 80 - Submission Generation & Audit Assertions
Code Logic & Technical Explanation
- Generates outcome predictions for all 53 rows in the competition test set.
- For public matches, it uses the V7 public known-toss wrapper. For private matches (unknown toss), it uses the V2 pre-toss marginalization model.
- Runs strict assertions: verifies the shape is exactly 53x5, checks match ID ordering, guarantees values are within `[0.0, 1.0]`, and confirms rows sum to 1.0.
- Saves the final results to `submission_v7.csv`.
🎤 Speaking Notes (For Praful)
"This is the final step. I loop through the test set. For public matches where the toss is known, I use my ensembled V7 public model. For private matches where the toss is unknown, I use my V2 pre-toss marginalization model. I run strict assertions to verify shape and normalization, and export my final submission to submission_v7.csv."
🔗 Connection & Variable Flow
- Inputs: public_lb, schedule, sample (datasets), predict_known_toss_public_v7 (from Cell 79), and predict_private_pretoss_v2 (from Version 2).
- Outputs: submission_v7.csv (the final output submission file).
# ============================================================
# VERSION 7: CREATE FINAL HYBRID SUBMISSION
# Public rows = V7 public model
# Private rows = V2 symmetric pre-toss model
# ============================================================
pred_dict_v7 = {}
# Public rows: V7 public known-toss model
for _, r in public_lb.iterrows():
    match_id = str(r["match_id"])
    p = predict_known_toss_public_v7(
        date=r["date"],
        team_a=r["team_a"],
        team_b=r["team_b"],
        venue=r["venue"],
        city=r["city"],
        toss_winner=r["toss_winner"],
        toss_decision=r["toss_decision"]
    )
    pred_dict_v7[match_id] = p
# Private rows: V2 pre-toss marginalization remains safer
for _, r in schedule.iterrows():
    match_id = str(r["match_id"])
    p = predict_private_pretoss_v2(
        date=r["date"],
        team_a=r["team_a"],
        team_b=r["team_b"],
        venue=r["venue"],
        city=r["city"]
    )
    pred_dict_v7[match_id] = p
submission_v7 = sample.copy()
submission_v7["match_id"] = submission_v7["match_id"].astype(str)
for i, match_id in enumerate(submission_v7["match_id"]):
    if match_id not in pred_dict_v7:
        print("Missing prediction for:", match_id, "using V7 prior fallback")
        p = prior_all_v7_public.copy()
    else:
        p = pred_dict_v7[match_id]
    p = normalize_probs(np.asarray(p).reshape(1, -1))[0]
    submission_v7.loc[i, "A_small"] = p[0]
    submission_v7.loc[i, "A_big"] = p[1]
    submission_v7.loc[i, "B_small"] = p[2]
    submission_v7.loc[i, "B_big"] = p[3]
# Hard checks
assert submission_v7.shape == (53, 5), f"Wrong shape: {submission_v7.shape}"
assert submission_v7["match_id"].astype(str).tolist() == sample["match_id"].astype(str).tolist()
assert np.all(submission_v7[CLASS_ORDER].values >= 0)
assert np.all(submission_v7[CLASS_ORDER].values <= 1)
assert np.allclose(submission_v7[CLASS_ORDER].sum(axis=1).values, 1.0, atol=1e-6)
submission_v7.to_csv("submission_v7.csv", index=False)
print("submission_v7.csv saved successfully")
print("Rows:", submission_v7.shape)
print("Probability row sum min:", submission_v7[CLASS_ORDER].sum(axis=1).min())
print("Probability row sum max:", submission_v7[CLASS_ORDER].sum(axis=1).max())
print("\nAverage probabilities:")
print(submission_v7[CLASS_ORDER].mean())
display(submission_v7.head())
display(submission_v7.tail())
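The audit assertions can be bundled into a reusable check. The following is a minimal sketch on a two-row toy submission; `check_submission` is a hypothetical helper, and `CLASS_ORDER` is assumed to match the notebook's earlier definition.

```python
import numpy as np
import pandas as pd

CLASS_ORDER = ["A_small", "A_big", "B_small", "B_big"]  # assumed column order

def check_submission(sub, sample, n_rows):
    """Raise AssertionError if the frame is not a valid probability submission."""
    assert sub.shape == (n_rows, 1 + len(CLASS_ORDER)), f"Wrong shape: {sub.shape}"
    assert sub["match_id"].astype(str).tolist() == sample["match_id"].astype(str).tolist()
    probs = sub[CLASS_ORDER].to_numpy(dtype=float)
    assert np.isfinite(probs).all(), "NaN or inf in probabilities"
    assert ((probs >= 0) & (probs <= 1)).all(), "Probability outside [0, 1]"
    assert np.allclose(probs.sum(axis=1), 1.0, atol=1e-6), "Rows must sum to 1"
    return True

# Toy sample file and a matching, valid submission (invented values).
sample = pd.DataFrame({"match_id": ["m1", "m2"]})
good = sample.assign(
    A_small=[0.40, 0.25],
    A_big=[0.10, 0.25],
    B_small=[0.30, 0.25],
    B_big=[0.20, 0.25],
)

print(check_submission(good, sample, n_rows=2))
```

For the real file the call would use `n_rows=53`, matching the shape assertion in the cell above.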