Predicting the winner for English Premier League football matches
Category > Machine Learning
Aug 20, 2022
The goal of this project is to predict the winner of a English Premier League match using machine learning. In a previous project, we web-scrapped the relevant data for EPL matches. The project can be found here. The web-scrapped data was saved into a file named "matches.csv". We directly use this file to import the data and build a machine learning model to predict the winner of a match.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
match_df = pd.read_csv("matches.csv", index_col=0)
Exploratory Data Analysis¶
match_df.head()
match_df.tail()
match_df.shape
As we can notice, the index labels are not correct. Currently, it represents the match number of each season. We are not interested in that. So, we reset the index and drop the exiting column 0 which was infered as index column while reading data.
match_df.reset_index(drop=True, inplace=True)
match_df.head()
match_df.tail()
As our goal is to make prediction for winners, we enough data for each time. It is not the case that the same 20 teams contest with each other in every season. It operates on a system of promotion and relegation with the English Football League (EFL). However, in this project, we only want to make predictions for those teams, which regularly participate in each season. For these teams, we will have enough data to make accurate prediction
match_df.groupby(["team", "season"])["date"].count().unstack()
As we can observe, there are 5 teams in these dataset, who didn't play each season. We can delete the data for these teams from our dataset.
all_season_teams = match_df.groupby(["team", "season"])["date"].count().unstack().dropna().index.to_list()
all_season_teams.remove('Wolves')
match_df = match_df[match_df["team"].isin(all_season_teams) & match_df["opponent"].isin(all_season_teams)]
match_df.info()
As we can observe, there are some missing values for four columns: attendance, notes, dist, fk. The column 'notes' has no values in it. We can safely drop this column. For other columns, we need to take a more deep look to handle the missing data. Moreover, the datatype of 'date' column needs to be converted to a datetime object. The format of values in the 'time' column needs to be corrected.
match_df.drop('notes', axis=1, inplace=True)
match_df['hour'] = match_df['time'].str.replace(r':.+', "", regex=True).astype(int)
match_df.drop('time', axis=1, inplace=True)
match_df['date'] = pd.to_datetime(match_df['date'])
Now let's analyze the missing data. As we observed during the exploratory data analysis, there are some missing values for three columns: attendance, dist, fk. As only one data is missing from the 'dist' column, it is probably missing completely at random. As the 'dist' column had some outliers, we can replace the missing value with the median value of the 'dist' column. We will do that later in the feature engineering stage. Now let's focus on the other two columns. A lot of data are missing. This can't happen at random. Let's find the source of the missing data
match_df["fk_Null"] = np.where(match_df["fk"].isnull(), 1, 0)
match_df["attendance_Null"] = np.where(match_df["attendance"].isnull(), 1, 0)
match_df.groupby(["season"])[["fk_Null", "attendance_Null"]].sum()
As we can notice, all the missing data in the column 'fk' comes from the match records for season 2018. A similar pattern is observed for the 'attendance' column. There may be some problem with the data collection. For this reason, we again web-scrap the match data for 2018, 2020, and 2021.
match_df.drop(["fk_Null", "attendance_Null"], axis=1, inplace=True)
season18 = pd.read_csv("matches18.csv", index_col=0)
season18.head()
season18["fk"].isnull().sum()
As we can observe, this dataset has no missing value in the 'fk' column. So, we can replace the missing values of 'fk' column in 'match_df' using the values from this dataset
season18 = season18[season18["team"].isin(all_season_teams) & season18["opponent"].isin(all_season_teams)]
match_df.loc[match_df['season']==2018, 'fk'] = season18['fk'].to_list()
match_df['fk'].isnull().sum()
Now let's focus on the 'attendance' column. It is noticeable that the data is missing for season 2020 and 2021. This missing values can directly be attributed to the COVID-19 restrictions imposed by the UK government. Therefore, no actual data exists for correcting these missing values. Thus we need to impute the missing values for this column later in the feature engineering stage
Now let's observe the statistics of category variables
match_df.describe(include='O')
As we can observe, the column 'comp' and 'match report' has only one unique category. So, these columns won't add any valuable information in the prediction process. Moreover, the features 'referee' and 'captain' are not generalizable. It is highly liklely that a new referee or captain can be found in the test dataset, which may hurt the performance of machine learning algorithm. We can safely drop these columns. Other categorical columns need to be encoded properly in the feature engineering stage.
match_df.drop(['comp', 'match report', 'referee', 'captain'], axis=1, inplace=True)
match_df.head()
It's time to perform exploratory data analysis on numerical variables
match_df.describe(exclude=['O', 'datetime64'])
numerical_columns = match_df.select_dtypes(include=np.number).columns.tolist()
numerical_columns
nrow = 7
ncol = 2
index = 0
fig, axs = plt.subplots(nrow, ncol, figsize = (10,25))
for i in range(nrow):
for j in range(ncol):
label = numerical_columns[index]
sns.boxplot(match_df[label], ax=axs[i][j]).set_title(label)
index += 1
fig.tight_layout()
plt.show()
nrow = 7
ncol = 2
index = 0
fig, axs = plt.subplots(nrow, ncol, figsize = (10,25))
for i in range(nrow):
for j in range(ncol):
label = numerical_columns[index]
sns.histplot(match_df[label], ax=axs[i][j]).set_title(label)
index += 1
fig.tight_layout()
plt.show()
From the boxplots and histograms, it is clear that outliers are present in the dataset. But as these data are collected from actual matches, and there was no data collection error, we cannot just remove the outliers. Instead, we need to use a classification algorithm that is not sensitive to outliers.
match_df.columns
Before moving to the feature engineering step, let's split the dataset into training and test data.
match_df.groupby(["season"])['date'].count()
We reserve the match records for season 2022 as the test data. All other seasons are included in the training dataset
train_data = match_df[match_df["season"] != 2022]
test_data = match_df[match_df["season"] == 2022]
train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)
print(train_data.shape)
print(test_data.shape)
train_data.head()
Feature Engineering¶
Now let's focus on the 'attendance' column. It is noticeable that the data is missing for season 2020 and 2021. This missing values can directly be attributed to the COVID-19 restrictions imposed by the UK government. Therefore, no actual data exists for correcting these missing values. However, we can estimate the match attendance for these missing values based on prior data. For that purpose, we try to build a regression model. Before that, we need encode the categorical variables.
categorical_columns = train_data.select_dtypes(include="O").columns.tolist()
categorical_columns
train_data["round"].unique()
As we can observe, the matchweek number is the only valuable information in this column. Thus, we can created a new column named "MatchWeek" and fill that column with the matchweek number corresponding to each match. After that, we can drop this column. We need to do the same for X_test
train_data["match_week"] = train_data["round"].str.split().str[1].astype(int)
train_data.head()
train_data["day"].unique()
This column tells us about the day of each match. We can assign each day of the week an unique number
train_data["dayofweek"] = train_data["date"].dt.dayofweek
train_data.head()
However, it is still not the right way to encode days of the week if we want to use the data to train machine learning models! In reality, Saturday is closer to Monday than Wednesday. Encoding days of the week as numbers changes the sense of data. We donβt want to lose the information about the circular nature of weeks and the actual distance between the days. Therefore, we can encode the day of week feature as βpointsβ on a circle: 0Β° = Monday, 51.5Β° = Tuesday, etc.
There is one problem. We know that it is a circle, but for a machine learning model, the difference between Sunday and Monday is 308.5Β° instead of 51.5Β°. That is wrong.
To solve the problem we have to calculate the cosinus and sinus values of the degree. We need both because both functions produce duplicate outputs for difference inputs, but when we use them together we get unique pairs of values:
train_data['day_sin'] = np.sin(train_data['dayofweek'] * (2 * np.pi / 7))
train_data['day_cos'] = np.cos(train_data['dayofweek'] * (2 * np.pi / 7))
train_data.drop("dayofweek", axis=1, inplace=True)
train_data["venue"].unique()
As there are only two unique values and there is no sense of order in this column, we can use one-hot encoding to encode the venue type
train_data = pd.get_dummies(train_data, columns=["venue"], drop_first=True)
train_data.head()
!pip install category_encoders
from category_encoders.hashing import HashingEncoder
train_data["opponent"].unique()
train_data["team"].unique()
def hash_encode(col, hash_size=8):
ec = HashingEncoder([col], n_components=hash_size)
ec.fit(train_data)
return ec
transformed_opponent = hash_encode("opponent", 5).transform(train_data)
transformed_team = hash_encode("team", 5).transform(train_data)
name_map = lambda x: {"col_0": x+"_0", "col_1": x+"_1", "col_2": x+"_2", "col_3": x+"_3", "col_4": x+"_4"}
opponent_map = name_map("opponent")
team_map = name_map("team")
transformed_opponent.rename(opponent_map, axis=1, inplace=True)
transformed_team.rename(team_map, axis=1, inplace=True)
train_data = pd.concat([train_data, transformed_opponent[['opponent_0', 'opponent_1', 'opponent_2', 'opponent_3', 'opponent_4']]], axis=1)
train_data = pd.concat([train_data, transformed_team[['team_0', 'team_1', 'team_2', 'team_3', 'team_4']]], axis=1)
train_data.columns
train_data.head()
train_data["formation"].unique()
We can observe there are some non-numeric characters. We need to remove this characters. Moreover, we can extract the number of defenders, midfielders, and strikers from this column.
train_data["formation"] = train_data["formation"].str.replace("β", "")
train_data["num_defender"] = train_data["formation"].str.split(pat="-").str[0].astype(int)
train_data["num_striker"] = train_data["formation"].str.split(pat="-").str[-1].astype(int)
offensive_midfield_mapper = lambda x: x[4] if len(x) == 7 else 0
center_midfield_mapper = lambda x: x[2] if len(x) == 5 else 0
train_data["offensive_midfielder"] = train_data["formation"].apply(offensive_midfield_mapper).astype(int)
train_data["center_midfielder"] = train_data["formation"].apply(center_midfield_mapper).astype(int)
train_data.head()
train_data["result"].unique()
As the results are ordinal data (L < D < W), we can use ordinal encoding for the target variable
result_map = {"L":0, "D":1, "W":2}
train_data["target"] = train_data["result"].map(result_map)
Finally, we are done with feature encoding. As we found during the exploratory data analysis, there are some missing values left in two columns: attendance, and dist. Let's deal with these missing values one by one.
train_data.isnull().sum()
As only one data is missing from the 'dist' column, it is probably missing completely at random. As the 'dist' column had some outliers, we can replace the missing value with the median value of the 'dist' column
median_dist = train_data["dist"].median()
median_dist
train_data["dist"].fillna(median_dist, inplace=True)
train_data["dist"].isnull().sum()
Now let's focus on the 'attendance' column. The match attendance depends on a number of other factors like day, team, opponent, venue_type, etc. So, we can use MCIE method to impute the missing values for this column.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import HuberRegressor
numerical_columns = train_data.select_dtypes(include=np.number).columns.tolist()
numerical_columns
estimator = HuberRegressor(max_iter=15000)
imputer = IterativeImputer(estimator=estimator, random_state=2022, skip_complete=True)
imputed_train_data = imputer.fit_transform(train_data[numerical_columns])
train_data['attendance'] = imputed_train_data[:, 5]
train_data.head()
train_data["attendance"].plot.hist(bins=30)
train_data.info()
As we can notice, there are no missing values in the dataset. However, the dataset doesn't have any column capturing the past performance of a teamHowever, we haven't yet considered the historical performance of a team. It may have a significant influence on the winner prediction. We use the 'rolling average' method to capture the past performance of each time. For that purpose, we temporarily merge the training and test dataset.
def rolling_averages(group, cols, new_cols):
group = group.sort_values('date')
rolling_stats = group[cols].rolling(3, closed='left').mean()
group[new_cols] = rolling_stats
group = group.dropna(subset=new_cols)
return group
cols = ["gf", 'ga', 'sh', 'sot', 'dist', 'poss', 'fk', 'pk', 'pkatt']
new_cols = [f"{c}_rolling" for c in cols]
train_data = train_data.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))
train_data.index = train_data.index.droplevel()
train_data.index = range(train_data.shape[0])
train_data.info()
train_data.head()
train_data.tail()
It is important to realize that we want to predict the result of current match based on the data of previous matches. For that purpose, we have just added some features that represent past performance. Now, we can delete all those data for current matches that may provide the machine learning model a hint about the winner
train_data.drop(["gf", 'ga', 'sh', 'sot', 'dist', 'poss', 'fk', 'pk', 'pkatt'], axis=1, inplace=True)
print(train_data.shape)
print(test_data.shape)
drop_cols = categorical_columns + ["season"]
drop_cols.remove("venue")
train_data.drop(drop_cols, axis=1, inplace=True)
print(train_data.shape)
Now, we perform standardization to complete the feature engineering process. As outliers are present in the dataset, we apply robustscaling method
X_train = train_data.drop("target", axis=1)
y_train = train_data["target"]
from sklearn.preprocessing import RobustScaler
cols = X_train.columns.to_list()
cols.remove("date")
index_labels = X_train['date']
scaler = RobustScaler(unit_variance=True)
X_train = pd.DataFrame(scaler.fit_transform(X_train.drop("date", axis=1)), columns=cols, index=index_labels)
X_train.head()
X_train.describe()
ax = X_train.plot.box(figsize=(15, 10))
ax.set_xticklabels(labels=cols, rotation=90)
Feature Selection¶
corr = X_train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(h_neg=10, h_pos=240, as_cmap=True)
sns.heatmap(corr, mask=mask,
center=0, cmap=cmap, linewidths=1,
annot=False, fmt=".2f")
# Create positive correlation matrix
corr_df = X_train.corr().abs()
# Create and apply mask
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]
print(to_drop)
Before taking any action, let's double check if these columns have very high variance inflation factor
from statsmodels.stats.outliers_influence import variance_inflation_factor
def find_vif(df):
vif_data = pd.DataFrame()
vif_data["feature"] = df.columns
vif_data["VIF"] = [variance_inflation_factor(df.values, i)
for i in range(len(df.columns))]
return vif_data
vif_xtrain = find_vif(X_train)
print(vif_xtrain)
X_train_reduced = X_train.drop(['opponent_0', 'opponent_1', 'opponent_2', 'opponent_3', 'opponent_4'], axis=1)
vif_xtrain_reduced = find_vif(X_train_reduced)
print(vif_xtrain_reduced)
Even after removing all feature encodings for the 'oponnent' column, the VIF is extremely high for the 'team' column. IT suggests that some part of the hash encoding for the 'team' column may be redundant. So, we try dropping 'team_0' from the columns
X_train_reduced = X_train.drop(['opponent_0', 'opponent_1', 'opponent_2', 'opponent_3', 'opponent_4'] + ["team_0"], axis=1)
vif_xtrain_reduced = find_vif(X_train_reduced)
print(vif_xtrain_reduced)
vif_xtrain_reduced["VIF"].mean()
Now, the VIF for each feature is less than 10. Moreover, the mean VIF is less than 6. So, we don't have any multi-collinear column in the dataset.
X_train = X_train_reduced.copy()
X_train.head()
Now, let's try other feature selection method to check if the number of features can be further reduced
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
print(mutual_info)
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)
mask_MI = ~(X_train.columns.isin(mutual_info[mutual_info == 0].index))
mask_MI
X_train.columns[mask_MI]
As it is clear, there are a few columns which provide zero gain in mutual information. These are the candidate columns that can be dropped. But before taking action, let'try other methods
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rfe_rf = RFE(estimator=RandomForestClassifier(), n_features_to_select=20, verbose=0)
rfe_rf.fit(X_train,y_train)
rf_mask = rfe_rf.support_
X_train.columns[rf_mask]
from sklearn.ensemble import GradientBoostingRegressor
rfe_gb = RFE(estimator=GradientBoostingRegressor(), n_features_to_select=20, step=5)
rfe_gb.fit(X_train, y_train)
gb_mask = rfe_gb.support_
X_train.columns[gb_mask]
from sklearn.ensemble import ExtraTreesClassifier
etc=ExtraTreesClassifier()
etc.fit(X_train, y_train)
ranked_features=pd.Series(etc.feature_importances_, index=X_train.columns)
top_features = ranked_features.nlargest(20)
etc_mask = X_train.columns.isin(top_features.index)
X_train.columns[etc_mask]
votes = np.sum([mask_MI, rf_mask, gb_mask, etc_mask], axis=0)
mask = votes >= 3
selected_col = X_train.columns[mask].to_list()
selected_col
selected_col.extend(["team_3"])
selected_col
X_train = X_train[selected_col]
X_train.shape
Pipeline Creation¶
Now we have completed all the preprocessing steps on the training data. However, we need to apply all these same steps on the validation data and test. And we need to do so without data leakage. To automate the preprocessing step and prevent data leakge, we build a pipeline for machine learning model. For that purpose, we again start with the unprocessed data, which will be fed to the pipeline.
match_df["dayofweek"] = match_df["date"].dt.dayofweek
match_df.drop("day", axis=1, inplace=True)
train_data = match_df[match_df["season"] != 2022]
test_data = match_df[match_df["season"] == 2022]
train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)
print(train_data.shape)
print(test_data.shape)
from sklearn.pipeline import Pipeline, FeatureUnion, clone, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.compose import ColumnTransformer, make_column_selector
class SelectColumnsTransfomer(BaseEstimator, TransformerMixin):
""" A DataFrame transformer that provides column selection
Allows to select columns by name from pandas dataframes in scikit-learn
pipelines.
Parameters
----------
columns : list of str, names of the dataframe columns to select
Default: []
"""
def __init__(self, columns=[]):
self.columns = columns
def transform(self, X, **transform_params):
""" Selects columns of a DataFrame
Parameters
----------
X : pandas DataFrame
Returns
----------
trans : pandas DataFrame
contains selected columns of X
"""
trans = X[self.columns].copy()
return trans
def fit(self, X, y=None, **fit_params):
""" Do nothing function
Parameters
----------
X : pandas DataFrame
y : default None
Returns
----------
self
"""
return self
class DataFrameFunctionTransformer(BaseEstimator, TransformerMixin):
""" A DataFrame transformer providing imputation or function application
Parameters
----------
impute : Boolean, default False
func : function that acts on an array of the form [n_elements, 1]
"""
def __init__(self, func):
self.func = func
def transform(self, X, **transformparams):
""" Transforms a DataFrame
Parameters
----------
X : DataFrame
Returns
----------
trans : pandas DataFrame
Transformation of X
"""
trans = pd.DataFrame(X).apply(self.func).copy()
return trans
def fit(self, X, y=None, **fitparams):
""" Fixes the values to impute or does nothing
Parameters
----------
X : pandas DataFrame
y : not used, API requirement
Returns
----------
self
"""
return self
class DataFrameFeatureUnion(BaseEstimator, TransformerMixin):
""" A DataFrame transformer that unites several DataFrame transformers
Fit several DataFrame transformers and provides a concatenated
Data Frame
Parameters
----------
list_of_transformers : list of DataFrameTransformers
"""
def __init__(self, list_of_transformers):
self.list_of_transformers = list_of_transformers
def transform(self, X, **transformparamn):
""" Applies the fitted transformers on a DataFrame
Parameters
----------
X : pandas DataFrame
Returns
----------
concatted : pandas DataFrame
"""
concatted = pd.concat([transformer.transform(X)
for transformer in
self.fitted_transformers_], axis=1).copy()
return concatted
def fit(self, X, y=None, **fitparams):
""" Fits several DataFrame Transformers
Parameters
----------
X : pandas DataFrame
y : not used, API requirement
Returns
----------
self : object
"""
self.fitted_transformers_ = []
for transformer in self.list_of_transformers:
fitted_trans = clone(transformer).fit(X, y=None, **fitparams)
self.fitted_transformers_.append(fitted_trans)
return self
class ToDummiesTransformer(BaseEstimator, TransformerMixin):
""" A Dataframe transformer that provide dummy variable encoding
"""
def __init__(self):
self.drop_first = True
def transform(self, X, **transformparams):
""" Returns a dummy variable encoded version of a DataFrame
Parameters
----------
X : pandas DataFrame
Returns
----------
trans : pandas DataFrame
"""
trans = pd.get_dummies(X, drop_first = self.drop_first).copy()
return trans
def fit(self, X, y=None, **fitparams):
""" Do nothing operation
Returns
----------
self : object
"""
return self
class RoundTransformer(BaseEstimator, TransformerMixin):
""" A Dataframe transformer to extract match week from the 'round' column
"""
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X, **transformparams):
X = X.squeeze() # convert to a series to perform string operations
match_week = X.str.split().str[1].astype(int)
return match_week.to_frame(name="match_week").copy()
class DayTransformer(BaseEstimator, TransformerMixin):
""" A Dataframe transformer for feature encoding the 'day' column
"""
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X, **transformparams):
day_sin = np.sin(X * (2 * np.pi / 7))
day_cos = np.cos(X * (2 * np.pi / 7))
trans_day = np.concatenate((day_sin, day_cos), axis=1)
return pd.DataFrame(trans_day, columns=["day_sin", "day_cos"]).copy()
class FeatureHasher(BaseEstimator, TransformerMixin):
def __init__(self, hash_size, name_map=None):
self.name_map = name_map
self.hash_size = hash_size
self.encoder = HashingEncoder(n_components=self.hash_size, return_df=True)
def fit(self, X, y=None, **fit_params):
self.encoder.fit(X)
return self
def transform(self, X, **transformparams):
trans_x = self.encoder.transform(X)
if self.name_map:
trans_x.rename(self.name_map, axis=1, inplace=True)
return trans_x.copy()
from category_encoders import OrdinalEncoder
class TargetEncoder(BaseEstimator, TransformerMixin):
def __init__(self, col_name, mapping):
self.col_name = col_name
self.mapping = mapping
self.cols_mapping = [{
"col": self.col_name,
"mapping": self.mapping
}]
self.encoder = OrdinalEncoder(mapping=self.cols_mapping, return_df=True)
def fit(self, X, y=None, **fit_params):
self.encoder.fit(X)
return self
def transform(self, X, **transformparams):
return self.encoder.transform(X).copy()
class FormationTransformer(BaseEstimator, TransformerMixin):
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X, **transformparams):
X = X.squeeze()
formation = X.str.replace("β", "")
num_defender = formation.str.split(pat="-").str[0].astype(int)
num_striker = formation.str.split(pat="-").str[-1].astype(int)
offensive_midfield_mapper = lambda x: x[4] if len(x) == 7 else 0
center_midfield_mapper = lambda x: x[2] if len(x) == 5 else 0
offensive_midfielder = formation.apply(offensive_midfield_mapper).astype(int)
center_midfielder = formation.apply(center_midfield_mapper).astype(int)
return pd.DataFrame(dict(num_defender=num_defender, num_striker=num_striker,\
offensive_midfielder=offensive_midfielder,\
center_midfielder=center_midfielder)).copy()
class RollingAverageComputer(BaseEstimator, TransformerMixin):
def __init__(self, columns, group_by, sort_by):
self.cols = columns
self.group_by = group_by
self.sort_by = sort_by
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X, **transformparams):
def rolling_averages(group, cols, new_cols, sort_by):
group = group.sort_values(sort_by)
rolling_stats = group[cols].rolling(3, closed='left').mean()
group[new_cols] = rolling_stats
group = group.dropna(subset=new_cols)
return group
new_cols = [f"{c}_rolling" for c in self.cols]
trans_X = X.groupby(self.group_by).apply(lambda x: rolling_averages(x, self.cols, new_cols, self.sort_by))
trans_X.index = trans_X.index.droplevel()
trans_X.index = range(trans_X.shape[0])
return trans_X.copy()
class FeatureSelector(BaseEstimator, TransformerMixin):
def __init__(self, select_columns):
self.to_select = select_columns
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X, **transformparams):
trans_X = X[self.to_select]
return trans_X.copy()
from sklearn.preprocessing import RobustScaler
class Standardizer(BaseEstimator, TransformerMixin):
def __init__(self, scaler):
self.scaler = scaler
def fit(self, X, y=None, **fit_params):
self.cols = X.columns
self.scaler.fit(X)
return self
def transform(self, X, **transformparams):
trans_X = pd.DataFrame(scaler.transform(X), columns=self.cols)
return trans_X.copy()
class ImputeTransformer(BaseEstimator, TransformerMixin):
def __init__(self, imputer, columns):
self.imputer = imputer
self.cols = columns
def fit(self, X, y=None, **fit_params):
if not self.cols:
self.cols = X.select_dtypes(include=np.number).columns.tolist()
self.imputer.fit(X[self.cols])
self.other_cols = X.columns.drop(self.cols).to_list()
return self
def transform(self, X, **transformparams):
array = self.imputer.transform(X[self.cols])
trans_X = pd.DataFrame(array, columns=self.cols)
trans_X = pd.concat([trans_X, X[self.other_cols]], axis=1)
return trans_X.copy()
numerical_cols = train_data.select_dtypes(include=np.number).columns.tolist()
rolling_cols = ["gf", 'ga', 'sh', 'sot', 'dist', 'poss', 'fk', 'pk', 'pkatt']
selected_features = ['xg', 'xga', 'attendance', 'hour',\
'match_week', 'day_sin', 'day_cos',\
'team_1', 'team_2', 'team_3', 'team_4',\
'num_striker', 'center_midfielder',\
'gf_rolling', 'ga_rolling', 'sh_rolling',\
'sot_rolling', 'dist_rolling', 'poss_rolling',\
'fk_rolling', 'result']
preprocessor = Pipeline([
('feature_encoding', DataFrameFeatureUnion([
Pipeline([
('extract', SelectColumnsTransfomer(['round'])),
('week_day', RoundTransformer()),
]),
Pipeline([
('extract', SelectColumnsTransfomer(['dayofweek'])),
('week_day', DayTransformer())
]),
Pipeline([
('extract', SelectColumnsTransfomer(['venue'])),
('one_hot', ToDummiesTransformer())
]),
Pipeline([
('extract', SelectColumnsTransfomer(['opponent'])),
('hasher', FeatureHasher(5, {"col_0": "opponent"+"_0", "col_1": \
"opponent"+"_1", "col_2": "opponent"+"_2", \
"col_3": "opponent"+"_3", "col_4": "opponent"+"_4"}))
]),
Pipeline([
('extract', SelectColumnsTransfomer(['team'])),
('hasher', FeatureHasher(5, {"col_0": 'team'+"_0", "col_1": \
'team'+"_1", "col_2": 'team'+"_2", \
"col_3": 'team'+"_3", "col_4": 'team'+"_4"}))
]),
Pipeline([
('extract', SelectColumnsTransfomer(['formation'])),
('formation_transformer', FormationTransformer())
]),
Pipeline([
('extract', SelectColumnsTransfomer(['result'])),
('formation_transformer', TargetEncoder("result", dict([('L', 0), ('D', 1), ('W', 2)])))
]),
Pipeline([
('extract', SelectColumnsTransfomer(["team", "date"] + numerical_cols))
])
])),
('imputation', Pipeline([
('median_imputation', ImputeTransformer(SimpleImputer(strategy='median'), ['dist'])),
('MICE', ImputeTransformer(IterativeImputer(estimator=HuberRegressor(max_iter=15000), \
random_state=2022, skip_complete=True), None))
])),
('rolling_average', RollingAverageComputer(rolling_cols, 'team', 'date')),
('feature_selection', FeatureSelector(select_columns = selected_features))
])
processed_train_data = preprocessor.fit_transform(train_data)
processed_test_data = preprocessor.transform(test_data)
processed_train_data.head()
processed_test_data.head()
processed_train_data.shape
processed_test_data.shape
X_train = processed_train_data.drop("result", axis=1)
y_train = processed_train_data["result"]
X_test = processed_test_data.drop("result", axis=1)
y_test = processed_test_data["result"]
Baseline Model¶
from sklearn.svm import SVC,LinearSVC,NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
# Evaluation & CV Libraries
from sklearn.metrics import precision_score,accuracy_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV,RepeatedStratifiedKFold
models =[("SVC", SVC()),('KNN',KNeighborsClassifier(n_neighbors=10)),
("DTC", DecisionTreeClassifier()),("GNB", GaussianNB()),
("NuSVC", NuSVC()),("BNB", BernoulliNB()),
('RF',RandomForestClassifier()),('ADA',AdaBoostClassifier()),
('XGB',GradientBoostingClassifier())]
results = []
names = []
finalResults = []
for name, classifier in models:
model = Pipeline([
('standardizer', RobustScaler(unit_variance=True)),
(name, classifier)
])
model.fit(X_train, y_train)
model_results = model.predict(X_test)
score = precision_score(y_test, model_results, average='micro')
results.append(score)
names.append(name)
finalResults.append((name,score))
finalResults.sort(key=lambda k:k[1],reverse=True)
finalResults
Based on the performance of these baseline models, we select the top three ML algorithms and try improve their performance through hyperparameter optimization.
Hyperparameter tuning¶
Let's first define the hyperparameter space for each algorithm.
np.random.seed(2022)
model_params = {
'RF':
{
'model':RandomForestClassifier,
'params':
{
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
'max_features': ['log2', 'sqrt', None],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10, 20, 30, 40, 50, 60],
'n_estimators': [100, 200, 400, 600, 800, 1000, 1200, 1400]
}
},
'SVC':
{
'model':SVC,
'params':
{
'kernel': ['linear', 'rbf'],
'gamma': [0.01, 0.1, 1, 10, 100],
'C': [0.001, 0.01, 0.1, 1, 10, 100]
}
},
'ADA':
{
'model':AdaBoostClassifier,
'params':
{
'learning_rate': 10**(-4 * np.random.rand(20)),
'n_estimators': [10, 50, 100, 200, 350, 500],
'algorithm': ['SAMME', 'SAMME.R']
}
},
'XGB':
{
'model':GradientBoostingClassifier,
'params':
{
'learning_rate': 10**(-3 * np.random.rand(25)),
'n_estimators': [10, 50, 100, 200, 400, 600, 800, 1000],
'max_depth':np.random.randint(3, 10, 5),
'subsample':np.random.uniform(0.6, 1.0, 20)
}
}
}
It is important to remember that we are dealing with time-series data. Here the chronological performance of each team matters. So, we need to be careful when performing cross-validation for hyperparameter tuning. We could have used the TimeSeriesSplit function provided by sklearn. But it doesn't allow us to define the number of records that we want to include in the training and validation set. We have found the following code from https://medium.com/eatpredlove/time-series-cross-validation-a-walk-forward-approach-in-python-8534dd1db51a very useful for our purpose.
import numpy as np
class expanding_window(object):
'''
Parameters
----------
Note that if you define a horizon that is too far, then subsequently the split will ignore horizon length
such that there is validation data left. This similar to Prof Rob hyndman's TsCv
initial: int
initial train length
horizon: int
forecast horizon (forecast length). Default = 1
period: int
length of train data to add each iteration
'''
def __init__(self,initial= 1,horizon = 1,period = 1):
self.initial = initial
self.horizon = horizon
self.period = period
def split(self,data):
'''
Parameters
----------
Data: Training data
Returns
-------
train_index ,test_index:
index for train and valid set similar to sklearn model selection
'''
self.data = data
self.counter = 0 # for us to iterate and track later
data_length = data.shape[0] # rows
data_index = list(np.arange(data_length))
output_train = []
output_test = []
# append initial
output_train.append(list(np.arange(self.initial)))
progress = [x for x in data_index if x not in list(np.arange(self.initial)) ] # indexes left to append to train
output_test.append([x for x in data_index if x not in output_train[self.counter]][:self.horizon] )
# clip initial indexes from progress since that is what we are left
while len(progress) != 0:
temp = progress[:self.period]
to_add = output_train[self.counter] + temp
# update the train index
output_train.append(to_add)
# increment counter
self.counter +=1
# then we update the test index
to_add_test = [x for x in data_index if x not in output_train[self.counter] ][:self.horizon]
output_test.append(to_add_test)
# update progress
progress = [x for x in data_index if x not in output_train[self.counter]]
# clip the last element of output_train and output_test
output_train = output_train[:-1]
output_test = output_test[:-1]
# mimic sklearn output
index_output = [(train,test) for train,test in zip(output_train,output_test)]
return index_output
from tqdm import tqdm
from sklearn.model_selection import ParameterSampler
tscv = expanding_window(500, 100, 100)
index_output = tscv.split(train_data)
best_param = {}
best_score = {}
for model_name, params in model_params.items():
print(f"Starting parameter tuning for {model_name}")
initializer = params['model']
param_distribution = params['params']
total_eval = 50
param_configs = ParameterSampler(param_distribution, total_eval, random_state=2022)
# training_scores = []
# validation_scores = []
best_so_far = -np.inf
for param in tqdm(param_configs):
train_score_fold = []
dev_score_fold = []
for train_index, val_index in index_output:
training_data = train_data.iloc[train_index, :]
validation_data = train_data.iloc[val_index, :]
validation_data = validation_data.reset_index(drop=True)
processed_training_data = preprocessor.fit_transform(training_data)
processed_validation_data = preprocessor.transform(validation_data)
X_train, y_train = processed_training_data.drop("result", axis=1), processed_training_data["result"]
X_val, y_val = processed_validation_data.drop("result", axis=1), processed_validation_data["result"]
model = Pipeline([
('standardization', RobustScaler(unit_variance=True)),
(model_name, initializer(**param))
])
model.fit(X_train, y_train)
val_pred = model.predict(X_val)
train_pred = model.predict(X_train)
dev_score = precision_score(y_val, val_pred, average='micro')
train_score = precision_score(y_train, train_pred, average='micro')
train_score_fold.append(train_score)
dev_score_fold.append(dev_score)
mean_train_score = np.mean(train_score_fold)
mean_dev_score = np.mean(dev_score_fold)
print(f"{param}: {mean_dev_score}")
# training_scores.append(mean_train_score)
# validation_scores.append(mean_dev_score)
if mean_dev_score > best_so_far:
print(f"Best score improved from {best_so_far} to {mean_dev_score}")
best_score[model_name] = mean_dev_score
best_param[model_name] = param
best_so_far = mean_dev_score
print("\n")
best_param
best_score
As we can observe, Random Forest performs the best on the validation dataset after hyperparameter tuning. Let's build the final model using Random Forest algorithm with these hypertuned parameter values and evaluate the model on the test data after preprocessing
processed_train_data = preprocessor.fit_transform(train_data)
processed_test_data = preprocessor.transform(test_data)
X_train = processed_train_data.drop("result", axis=1)
y_train = processed_train_data["result"]
X_test = processed_test_data.drop("result", axis=1)
y_test = processed_test_data["result"]
X_train.head()
classifier = RandomForestClassifier(**best_param["RF"])
best_model = Pipeline([
('standardizer', RobustScaler(unit_variance=True)),
('RF', classifier)
])
best_model.fit(X_train, y_train)
best_model_results = best_model.predict(X_test)
score = precision_score(y_test, best_model_results, average='micro')
print(f"Precision score: {score}")
accuracy = accuracy_score(y_test, best_model_results)
print(f"Accuracy score: {score}")
Conclusions¶
Our best performing results seem to be on par with related work that had used multiple seasonsβ data, such as this one. We exceed their best performing model, which managed to have an accuracy of 0.52. Data extraction, cleaning, dealing with missing data, and feature engineering took a lot of our time and was definitely the most challenging and cumbersome aspect of the project. Using Random forest, we were able to get a precision score of 0.55. As observed from the precision score, it is very difficult to predict the winner of a EPL match. There are other factors like player transfers, managerial changes before a seasonβs beginning, player morale etc that play a role but not accounted for by us.