Web scraping data for English Premier League football matches
Category > Data Collection
Jun 11, 2022

In this project, we're going to work through a machine learning project on English Premier League (EPL) football matches. The final goal is to predict the winner of each match. First, we'll use web scraping to collect the EPL match results from this page. Let's download the HTML for that page and then explore it in the web browser's inspector. We want to extract the first table, the League Table, which lists every team in the league along with its stats. In particular, we need the URL for each team so that we can grab its match log for the season.
import requests

URL = "https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats"
response = requests.get(URL)
if response.status_code == 200:
    content = response.content
else:
    print("Couldn't download the web page")
Exploring the page in the browser's inspector, we identify which HTML tag is associated with the team URLs. After some exploration, we find the id of the League Table element, which lets us select it with BeautifulSoup.
from bs4 import BeautifulSoup
parser = BeautifulSoup(content, 'html.parser')
league_table = parser.select('#results2021-202291_overall')[0]
We notice that the table cells containing the URL for each team's statistics have a special attribute data-stat="team". We use this information to select the desired cells and scrape the URLs from them.
team_data = league_table.find_all("td", attrs={'data-stat': 'team'})
team_stat_URL = {}
for team in team_data:
    # The table only contains partial URLs. We add the domain name to get the full URL
    URL = "https://fbref.com" + team.select('a')[0]['href']
    team_name = team.select('a')[0].text
    team_stat_URL[team_name] = URL
print(team_stat_URL)
Now that we have a list of URLs, one for each team, we can get the stats we want. Let's start with the first team: Manchester City. After exploring the team's web page, we decide to parse the table named "Scores & Fixtures" for our analysis. The parsed table is read into a pandas DataFrame for convenience.
import pandas as pd

link = team_stat_URL['Manchester City']
response_MC = requests.get(link)
if response_MC.status_code == 200:
    content_MC = response_MC.content
    score_tables = pd.read_html(content_MC, match="Scores & Fixtures")
    score_df = score_tables[0]
else:
    print("The page couldn't be downloaded for team {}".format("Manchester City"))
score_df.head()
score_df.tail()
Here is a brief description of the columns in the table whose meaning can't easily be inferred from their names (a quick check of these columns follows the list):
- Comp : Competition
- Round : Phase of the competition
- GF : Goals for (scored by the team)
- GA : Goals against (conceded by the team)
- xG : Expected goals
- xGA : Expected goals allowed
- Poss : Possession, measured as the percentage of passes attempted
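As a quick sanity check, we can look at just these columns and their dtypes. This is an optional sketch that assumes score_df was built as above.

cols = ["Comp", "Round", "GF", "GA", "xG", "xGA", "Poss"]
print(score_df[cols].dtypes)  # GF and GA may be parsed as strings if any value contains extra text
score_df[cols].head()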
One thing the scores and fixtures table lacks is detail about each match, such as the number of shots, shots on target, free kicks, and penalty kicks. Some of these stats can be found in the table under the Shooting tab. Let's find and download the table containing the shooting stats for Manchester City and read it into a pandas DataFrame.
MC_parser = BeautifulSoup(content_MC, 'html.parser')
# After exploring the web page, we find that the desired URL sits inside <a> tags within the <div> tag with class="filter"
parsed_links = MC_parser.select(".filter a")
shooting_tab_link = ["https://fbref.com" + link['href'] for link in parsed_links if link.text == "Shooting"]
print(shooting_tab_link)
response_shooting = requests.get(*shooting_tab_link)
if response_shooting.status_code == 200:
    shooting_html = response_shooting.content
    shooting_tables = pd.read_html(shooting_html, match="Shooting ")
    shooting_df = shooting_tables[0]
else:
    print("Couldn't download the shooting page")
shooting_df.head()
The DataFrame has a multi-level column index, which is not important for our purpose, so we can drop the top level. After that we have two DataFrames: one with the match scores and fixtures and one with the shooting stats. Since both refer to the same matches, we can combine them.
shooting_df.columns = shooting_df.columns.droplevel()
shooting_df.head()
Both the score and shooting DataFrames share several columns. The columns unique to the shooting DataFrame are listed below:
- Sh : Total shots (does not include penalty kicks)
- SoT : Shots on target (does not include penalty kicks)
- Dist : Average distance from goal of the shots taken
- FK : Shots from free kicks
- PK : Penalty kicks made
- PKatt : Penalty kicks attempted
These unique columns are merged into the score DataFrame.
team_data = score_df.merge(shooting_df[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
team_data.head()
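As a quick optional check (assuming the two DataFrames were built as above), we can confirm the merge kept one row per match. If the three lengths below differ, some matches appear in only one of the tables and were dropped by the merge.

print(len(score_df), len(shooting_df), len(team_data))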
Now let's repeat these steps for every team across the last five EPL seasons.
import time
import re
import pandas as pd

years = list(range(2022, 2017, -1))
all_matches = []

for year in years:
    for team in team_stat_URL:
        link = team_stat_URL[team]
        print(link)
        # Keep retrying until the scores page downloads successfully
        response_status_code = 401
        while response_status_code != 200:
            response = requests.get(link)
            if response.status_code == 200:
                content = response.content
                score_tables = pd.read_html(content, match="Scores & Fixtures")
                score_df = score_tables[0]
                response_status_code = 200
            else:
                print("The page couldn't be downloaded for team {}. Trying again".format(team))
                time.sleep(1)
        parser = BeautifulSoup(content, 'html.parser')
        # The shooting URL sits inside <a> tags within the <div> tag with class="filter"
        parsed_links = parser.select(".filter a")
        shooting_tab_link = ["https://fbref.com" + link['href'] for link in parsed_links if link.text == "Shooting"]
        # Keep retrying until the shooting page downloads successfully
        response_shooting_status_code = 401
        while response_shooting_status_code != 200:
            response_shooting = requests.get(*shooting_tab_link)
            if response_shooting.status_code == 200:
                shooting_html = response_shooting.content
                shooting_tables = pd.read_html(shooting_html, match="Shooting ")
                shooting_df = shooting_tables[0]
                response_shooting_status_code = 200
            else:
                print("Couldn't download the shooting page for team {}. Trying again".format(team))
                time.sleep(1)
        shooting_df.columns = shooting_df.columns.droplevel()
        try:
            team_data = score_df.merge(shooting_df[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
            print(f"{team} : {year} : {team_data.shape}")
        except ValueError:
            continue
        except KeyError as e:
            print(f"Column {e.args[0]} missing from the dataframe. So adding an extra column for consistency")
            # The FK column is occasionally missing; add it as an empty column and redo the merge
            shooting_df["FK"] = None
            team_data = score_df.merge(shooting_df[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
        # Our goal is to predict winners of EPL matches, so ignore any matches outside the Premier League
        team_data = team_data[team_data["Comp"] == "Premier League"]
        # Add extra columns to keep track of the team name and season
        team_data["Season"] = year
        team_data["Team"] = team
        all_matches.append(team_data)
        time.sleep(1)
        # Update the link in team_stat_URL so the next pass scrapes the previous season
        season_links = parser.select(".prevnext a")
        prev_season_link = "https://fbref.com" + season_links[0]['href']
        team_stat_URL[team] = prev_season_link
    # Long pause between seasons to avoid being rate limited
    time.sleep(300)
match_df = pd.concat(all_matches)
match_df.columns = [c.lower() for c in match_df.columns]
match_df.to_csv("matches.csv")
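With the data saved, a natural next step toward the prediction goal is to load the CSV back and derive a simple target column from the match result. The snippet below is a minimal sketch: it assumes the scraped table contains the lowercased "date", "result", "opponent", and "team" columns produced above, and the win/not-win target encoding is just one possible choice.

# Reload the scraped data and create a binary target: 1 if the team won, 0 otherwise
matches = pd.read_csv("matches.csv", index_col=0)
matches["date"] = pd.to_datetime(matches["date"])
matches["target"] = (matches["result"] == "W").astype(int)
print(matches.shape)
matches[["date", "team", "opponent", "result", "target"]].head()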