Finding Heavy Traffic Indicators
May 05, 2022
Finding Heavy Traffic Indicators on I-94
In this project, we are going to analyze a dataset about the westbound traffic on the I-94 Interstate highway. The dataset is available here.
The goal of this project is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.
The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west). This means that the results of our analysis will be about the westbound traffic in the proximity of that station. In other words, we should avoid generalizing our results for the entire I-94 highway.
# Import all libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Read the dataset and get some insight
traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
traffic.head()
traffic.tail()
traffic.info()
traffic.describe(include='all')
The descriptive statistics alone show that the 'temp' and 'rain_1h' columns contain some outliers.
- The minimum temperature is 0 K (absolute zero). Such a temperature cannot occur on Earth, so these readings must be erroneous.
- Similarly, the mean value of the 'rain_1h' column is 0.33 mm, but the maximum value is 9831.30 mm, which is not a plausible hourly rainfall.
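A minimal sketch of how such suspicious rows could be flagged. The toy rows and the two cutoff constants are assumptions made for this illustration; only the 0 K and 9831.3 mm values echo the extremes mentioned above.

```python
import pandas as pd

# Toy rows for illustration; only 0.0 K and 9831.3 mm echo the extremes noted above.
sample = pd.DataFrame({
    "temp": [288.3, 0.0, 275.1, 301.4],     # Kelvin
    "rain_1h": [0.0, 0.25, 9831.3, 1.2],    # millimetres
})

# Hypothetical plausibility cutoffs, chosen only for this sketch.
MIN_TEMP_K = 200.0
MAX_RAIN_MM = 500.0

def flag_implausible(df):
    """Return a boolean Series that is True for rows outside the cutoffs."""
    bad_temp = df["temp"] < MIN_TEMP_K
    bad_rain = df["rain_1h"] > MAX_RAIN_MM
    return bad_temp | bad_rain

suspect = flag_implausible(sample)
print(sample[suspect])
```

Flagging rather than deleting keeps the decision about what to do with these rows separate from the detection step.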
Moreover, the date_time column holds non-numeric values. We need to convert them to either datetime objects or integer values for later convenience. To get some insight into the 'traffic_volume' column, let's plot a histogram.
traffic['traffic_volume'].hist()
As we can observe, the histogram is bimodal. This suggests that the data either comes from two different sources or can be separated into two distinct groups. From the descriptive statistics, we know that:
- About 25% of the time, there were 1,193 cars or fewer passing the station each hour — this probably occurs during the night, or when a road is under construction.
- About 25% of the time, the traffic volume was four times as much (4,933 cars or more).
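The two figures above are the 25th and 75th percentiles reported by describe(). As a small sketch of how they can be read off directly, here is the same computation on a toy series (the numbers are made up and are not the dataset's values):

```python
import pandas as pd

# Toy hourly counts for illustration; not taken from the real dataset.
volume = pd.Series([300, 500, 900, 1200, 2800, 3400, 4500, 4900, 5200, 6100])

q1 = volume.quantile(0.25)   # a quarter of the values lie at or below this
q3 = volume.quantile(0.75)   # a quarter of the values lie at or above this

print(f"25th percentile: {q1:.0f}")
print(f"75th percentile: {q3:.0f}")
print(f"ratio: {q3 / q1:.1f}")
```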
This possibility that nighttime and daytime might influence traffic volume gives our analysis an interesting direction: comparing daytime with nighttime data. For this purpose, we divide the dataset into two parts:
- Daytime data: hours from 7 a.m. to 7 p.m. (12 hours)
- Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours)
First, we transform the date_time column to datetime for our convenience.
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic.head(2)
traffic.tail(2)
print(traffic['date_time'].dt.hour.head(2))
type(traffic['date_time'].dt.hour)
hour = traffic['date_time'].dt.hour
# Daytime covers 7:00 through 18:59; nighttime is everything else
daytime = traffic[(hour >= 7) & (hour < 19)].copy()
nighttime = traffic[(hour >= 19) | (hour < 7)].copy()
print(daytime.head(2)['date_time'])
print(daytime.tail(2)['date_time'])
print(nighttime.head(2)['date_time'])
print(nighttime.tail(2)['date_time'])
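As a quick sanity check on the split, the sketch below applies the same hour boundaries to a toy frame with one timestamp per hour and confirms that every hour lands in exactly one group. The frame and the names is_day and is_night are assumptions made for this illustration.

```python
import pandas as pd

# One toy timestamp per hour of a single day (illustrative, not real data).
toy = pd.DataFrame({
    "date_time": pd.to_datetime([f"2020-01-01 {h:02d}:00:00" for h in range(24)])
})

hour = toy["date_time"].dt.hour
is_day = (hour >= 7) & (hour < 19)   # 7:00 through 18:59
is_night = ~is_day                   # 19:00 through 6:59

day_hours = hour[is_day].tolist()
night_hours = hour[is_night].tolist()
print("day:", day_hours)
print("night:", night_hours)
```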
# Plot the histograms of traffic_volume for both day and night
plt.figure(figsize=(8,3))
plt.subplot(1, 2, 1)
plt.hist(daytime['traffic_volume'])
plt.title("Day time traffic volume")
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.ylim([0, 8000])
plt.xlim([0, 8000])
plt.subplot(1, 2, 2)
plt.hist(nighttime['traffic_volume'])
plt.title("Night time traffic volume")
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.ylim([0, 8000])
plt.xlim([0, 8000])
plt.tight_layout()
plt.show()
From the histogram above, it is obvious that the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we have decided to only focus on the daytime data moving forward.
One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.
We're going to look at a few line plots showing how the traffic volume changed according to the following parameters:
- Month
- Day of the week
- Time of day
Let's start by getting average traffic volume for each month and generating the line plot.
daytime['month'] = daytime['date_time'].dt.month
by_month = daytime.groupby('month').mean(numeric_only=True)
by_month.head(12)
plt.plot(by_month.index, by_month['traffic_volume'])
plt.show()
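The grouping step can be seen in miniature on a toy frame. The rows below are made up; the point is only to show how .dt.month and groupby combine, and that numeric_only=True leaves text columns out of the average.

```python
import pandas as pd

# Toy hourly records; made-up numbers, not from the dataset.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-01-04 09:00:00", "2016-01-05 09:00:00",
        "2016-07-04 09:00:00", "2016-07-05 09:00:00",
    ]),
    "traffic_volume": [4000, 4400, 5000, 5200],
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],   # a text column
})

toy["month"] = toy["date_time"].dt.month
# numeric_only=True skips text columns such as weather_main.
by_month_toy = toy.groupby("month").mean(numeric_only=True)
print(by_month_toy["traffic_volume"])
```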
As we can see, the daytime traffic volume is usually lighter in cold months and heavier in warm months, with one interesting exception: July. Is there anything special about July? Is traffic significantly less heavy in July each year? Let's check.
daytime['year'] = daytime['date_time'].dt.year
july_data = daytime[daytime['month'] == 7]
july_data_grouped = july_data.groupby('year').mean(numeric_only=True)
july_data_grouped.head()
plt.plot(july_data_grouped.index, july_data_grouped['traffic_volume'])
plt.show()
Typically, the traffic is pretty heavy in July, similar to the other warm months. The only exception we see is 2016, which had a sharp decrease in traffic volume. One possible reason for this is road construction.
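Besides reading the line plot, an unusual year can also be picked out numerically. The sketch below compares each year's July average with the median across years; the values are made up and only shaped like the pattern described above.

```python
import pandas as pd

# Toy July averages per year; illustrative numbers, not the real results.
july_by_year = pd.Series(
    {2013: 4900.0, 2014: 4850.0, 2015: 4950.0, 2016: 3900.0, 2017: 4980.0},
    name="traffic_volume",
)

# The median is barely affected by a single unusual year, so it makes a stable baseline.
deviation = july_by_year - july_by_year.median()
odd_year = deviation.abs().idxmax()
print(deviation)
print("largest deviation:", odd_year)
```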
As a tentative conclusion here, we can say that warm months generally show heavier traffic compared to cold months. In a warm month, you can expect a traffic volume close to 5,000 cars for each daytime hour.
Now, let's get the traffic volume averages for each day of the week.
daytime['dayofweek'] = daytime['date_time'].dt.dayofweek
by_dayofweek = daytime.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
plt.plot(by_dayofweek.index, by_dayofweek['traffic_volume'])
plt.xticks([0,1,2,3,4,5,6], ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'], rotation = 30)
plt.show()
From this figure, we have found that the traffic volume is significantly heavier on business days compared to the weekends.
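To put a number on that gap, the daily averages can be collapsed into two group means. This is a minimal sketch on made-up values indexed the same way as dayofweek (0 is Monday, 6 is Sunday). Note that it averages the daily averages, so it ignores how many rows each day contributes.

```python
import pandas as pd

# Toy average volumes indexed by dayofweek (0 = Monday ... 6 = Sunday); made-up numbers.
by_day = pd.Series([4900, 5200, 5300, 5300, 5300, 3900, 3400], index=range(7))

is_weekend = by_day.index >= 5          # Saturday and Sunday
business_mean = by_day[~is_weekend].mean()
weekend_mean = by_day[is_weekend].mean()
print(f"business days: {business_mean:.0f}, weekend: {weekend_mean:.0f}")
```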
We'll now generate a line plot for the time of day. The weekends, however, will drag down the average values, so we're going to look at the averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.
daytime['hour'] = daytime['date_time'].dt.hour
business_days = daytime[daytime['dayofweek'] <= 4].copy() # 4 == Friday
weekend = daytime[daytime['dayofweek'] >= 5].copy() # 5 == Saturday
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
plt.figure(figsize = (8,3))
plt.subplot(1, 2, 1)
plt.plot(by_hour_business.index, by_hour_business['traffic_volume'])
plt.title('Business days')
plt.xlabel('Hour')
plt.ylabel('Traffic volume')
plt.ylim([1000, 7000])
plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend.index, by_hour_weekend['traffic_volume'])
plt.title('Weekends')
plt.xlabel('Hour')
plt.ylabel('Traffic volume')
plt.ylim([1000, 7000])
plt.tight_layout()
plt.show()
At each hour of the day, the traffic volume is generally higher on business days than on weekends. As one might expect, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.
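Rush hours can also be located programmatically rather than by eye. The sketch below treats an hour as a peak when its average exceeds both neighbouring hours. The hourly values are made up, and the peak rule is just one simple assumption among several reasonable ones.

```python
import pandas as pd

# Toy hourly averages for business days (hours 7..18); made-up numbers.
by_hour = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4800, 4900, 5100, 5600, 6200, 5800, 4400],
    index=range(7, 19),
)

# A local peak is an hour whose average exceeds both neighbours;
# the missing neighbour at each edge is treated as 0, i.e. lower.
prev_hour = by_hour.shift(1, fill_value=0)
next_hour = by_hour.shift(-1, fill_value=0)
peaks = by_hour[(by_hour > prev_hour) & (by_hour > next_hour)]
print(peaks)
```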
To summarize, we found a few time-related indicators of heavy traffic:
- The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
- The traffic is usually heavier on business days compared to weekends.
- On business days, the rush hours are around 7 and 16.
Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description.
A few of these columns are numerical, so let's start by looking up their correlation values with traffic_volume.
daytime.select_dtypes('number').corr()['traffic_volume']
Temperature shows the strongest correlation among the weather columns, with a value of just +0.13. The other relevant columns (rain_1h, snow_1h, clouds_all) don't show any strong correlation with traffic_volume.
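For readers who want to see what the correlation value measures, here is a small sketch that computes the Pearson coefficient by hand and compares it with pandas. The data is made up and the helper pearson_r is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy paired observations; made-up numbers, not from the dataset.
toy = pd.DataFrame({
    "temp": [270.0, 280.0, 285.0, 290.0, 300.0],
    "traffic_volume": [4500, 5100, 4700, 5000, 4900],
})

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual = pearson_r(toy["temp"], toy["traffic_volume"])
builtin = toy["temp"].corr(toy["traffic_volume"])
print(round(manual, 3), round(builtin, 3))
```

A value near 0 means the two columns have little linear relationship, while values near +1 or -1 indicate a strong one.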
Let's generate a scatter plot to visualize the correlation between temp and traffic_volume.
plt.scatter(daytime['temp'], daytime['traffic_volume'])
plt.show()
Two outliers at 0 K stretch the x-axis and squash the rest of the points together. Let's generate the scatter plot again with the x-axis limited so that these outliers fall outside the view.
plt.scatter(daytime['temp'], daytime['traffic_volume'])
plt.xlim([230, 320])
plt.show()
We can conclude that temperature doesn't look like a solid indicator of heavy traffic.
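Limiting the axis only hides the 0 K points from the plot. If we wanted to drop them from the data as well, a sketch could look like the following. The rows are made up and the "temp > 0" rule is an assumption for this illustration.

```python
import pandas as pd

# Toy rows; the two 0 K readings stand in for the outliers, everything else is made up.
toy = pd.DataFrame({
    "temp": [0.0, 0.0, 265.0, 275.0, 285.0, 295.0],
    "traffic_volume": [4800, 5000, 4600, 5100, 4700, 5000],
})

valid = toy[toy["temp"] > 0]    # hypothetical rule: treat any 0 K reading as invalid

r_all = toy["temp"].corr(toy["traffic_volume"])
r_valid = valid["temp"].corr(valid["traffic_volume"])
print(f"with outliers: {r_all:.2f}, without: {r_valid:.2f}")
```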
Let's now look at the other weather-related columns: weather_main and weather_description.
by_weather_main = daytime.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime.groupby('weather_description').mean(numeric_only=True)
by_weather_main.plot.barh(y='traffic_volume', legend = None)
plt.show()
It looks like there's no weather type where traffic volume exceeds 5,000 cars. This makes finding a heavy traffic indicator more difficult. Let's also group by weather_description, which has a more granular weather classification.
by_weather_description.plot.barh(y='traffic_volume', legend = None,
figsize=(5, 10))
plt.show()
It looks like there are three weather types where traffic volume exceeds 5,000:
- Shower snow
- Light rain and snow
- Proximity thunderstorm with drizzle
It's not clear why these weather types have the highest average traffic values — this is bad weather, but not that bad. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bike or walking.
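One thing that may be worth checking, and which this analysis did not verify, is how many hourly records stand behind each average. A mean built on a handful of rows can be high by chance. The sketch below shows the mechanics on made-up rows; it says nothing about the real counts.

```python
import pandas as pd

# Toy rows; made-up numbers that only illustrate the mechanics.
toy = pd.DataFrame({
    "weather_description": ["sky is clear"] * 5 + ["shower snow"],
    "traffic_volume": [4500, 4700, 4800, 4600, 4900, 5700],
})

# Report the mean together with the number of rows it is based on.
summary = (
    toy.groupby("weather_description")["traffic_volume"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)
print(summary)
```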
Conclusions
In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators:
- Time indicators
  - The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
  - The traffic is usually heavier on business days compared to the weekends.
  - On business days, the rush hours are around 7 and 16.
- Weather indicators
  - Shower snow
  - Light rain and snow
  - Proximity thunderstorm with drizzle