Exploring Hacker News Posts
Mar 13, 2022

Goal of the project¶
Hacker News is a site where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result. We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, product, or just generally something interesting. The goal of this project is to answer the following two questions:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
Data Collection¶
To answer our questions, we will analyze the Hacker News posts dataset, which can be downloaded from this link. Note that the dataset has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Let's first read the dataset as a list of lists.
from csv import reader

opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)

hn_header = hn[0]  # Separate the header from the dataset for future analysis
hn = hn[1:]

def explore_data(dataset, header=[], rows_and_column=True):
    # Print the header (if given) and the first five rows of the dataset
    if header:
        print(header)
        print('\n')
    for row in dataset[:5]:
        print(row)
        print('\n')
    # Optionally report the dataset's dimensions
    if rows_and_column:
        total_column = len(header) if header else len(dataset[0])
        print("The number of rows: ", len(dataset))
        print("The number of columns: ", total_column)

explore_data(hn, hn_header)
Below are the descriptions of the columns:
| Column | Description |
|---|---|
| id | The unique identifier from Hacker News for the post |
| title | The title of the post |
| url | The URL that the post links to, if the post has a URL |
| num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | The number of comments that were made on the post |
| author | The username of the person who submitted the post |
| created_at | The date and time at which the post was submitted (the time zone is US Eastern Time) |
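To make the column layout concrete, here is a small sketch (not part of the original analysis) that pairs the header with the first data row:

# Hypothetical helper: print one row of hn next to the column names
for name, value in zip(hn_header, hn[0]):
    print("{}: {}".format(name, value))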
Data Cleaning¶
In this project, we are only interested in posts whose titles start with either Ask HN or Show HN. Thus we create new lists of lists containing just the data for those posts.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Exploring ASK posts...")
explore_data(ask_posts)
print('\n')
print("Exploring SHOW posts...")
explore_data(show_posts)
Now let's check whether any rows have missing data. If data is missing from a row, we will delete that row.
def check_and_remove(dataset, total_column):
    # Remove rows that have fewer columns than expected and return them
    removed_items = []
    for row in dataset[:]:  # iterate over a copy so removing rows is safe
        if len(row) < total_column:
            removed_items.append(row)
            dataset.remove(row)
    return removed_items

removed_ask_post_rows = check_and_remove(ask_posts, 7)
print("Total rows deleted from ask_posts: ",
      len(removed_ask_post_rows))

removed_show_post_rows = check_and_remove(show_posts, 7)
print("Total rows deleted from show_posts: ",
      len(removed_show_post_rows))
Excellent! No data is missing from any row. Next, we need to check whether there is any duplicate data.
def check_duplicate(dataset):
    # Collect rows whose post id has already been seen
    duplicate_rows = []
    already_added = []
    for row in dataset:
        post_id = row[0]
        if post_id not in already_added:
            already_added.append(post_id)
        else:
            duplicate_rows.append(row)
    return duplicate_rows

duplicate_ask_posts = check_duplicate(ask_posts)
duplicate_show_posts = check_duplicate(show_posts)

print("Total duplicates found in ask_posts: ", len(duplicate_ask_posts))
print("Total duplicates found in show_posts: ", len(duplicate_show_posts))
Bravo! No duplicates were found in the dataset either, which makes our lives easier, doesn't it? Our dataset is now almost clean. Even the dates in the created_at column follow a consistent format, so there is no need to standardize the date entries.
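If you want to verify that claim yourself, a quick sketch like the one below (not part of the original analysis) tries to parse every created_at value with a single format and counts any entries that do not match:

import datetime as dt

# Hypothetical sanity check: every created_at value should parse with one format
bad_dates = []
for row in ask_posts + show_posts:
    try:
        dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    except ValueError:
        bad_dates.append(row[6])

print("Entries that do not match the expected format: ", len(bad_dates))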
Data Analysis¶
Now that we have cleaned the data, we will start analyzing the dataset to find answers to our questions. First, let's determine whether ask posts or show posts receive more comments on average.
def find_avg_comment(dataset):
    # Average number of comments per post in the given dataset
    total_comments = 0
    for post in dataset:
        num_comment = int(post[4])
        total_comments += num_comment
    avg_comments = total_comments / len(dataset)
    return avg_comments

avg_ask_comments = find_avg_comment(ask_posts)
print("The average number of comments on ask posts: {:.2f}".format(avg_ask_comments))

avg_show_comments = find_avg_comment(show_posts)
print("The average number of comments on show posts: {:.2f}".format(avg_show_comments))

avg_other_comments = find_avg_comment(other_posts)
print("The average number of comments on other posts: {:.2f}".format(avg_other_comments))
As we can observe, ask posts receive more comments than show posts on average. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
import datetime as dt

result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comment = post[4]
    result_list.append([created_at, num_comment])

print(result_list[:5])
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment_num = int(row[1])
    dt_object = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    post_hour = dt_object.hour
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = comment_num
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += comment_num

# Print the number of posts by hour in descending order
print({k: v for k, v in sorted(counts_by_hour.items(),
                               key=lambda elem: elem[1], reverse=True)})

# Print the number of comments by hour in descending order
print({k: v for k, v in sorted(comments_by_hour.items(),
                               key=lambda elem: elem[1], reverse=True)})
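As a side note, the same two tallies could be built more compactly with collections.defaultdict; this is just an equivalent sketch, not the approach used above:

from collections import defaultdict

# Equivalent tallies using defaultdict (same result as the loop above)
counts_alt = defaultdict(int)
comments_alt = defaultdict(int)
for date, num in result_list:
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").hour
    counts_alt[hour] += 1
    comments_alt[hour] += int(num)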
avg_by_hour = []
for key, val in comments_by_hour.items():
    avg = val / counts_by_hour[key]
    avg_by_hour.append([key, avg])

print(avg_by_hour[:4])
sorted_by_avg_comment = sorted(avg_by_hour, key=lambda elem: elem[1],
                               reverse=True)

print("Top 5 hours for Ask Posts Comments:")
for elem in sorted_by_avg_comment[:5]:
    dt_object = dt.datetime.strptime(str(elem[0]), "%H")
    hour = dt_object.strftime("%H:%M")
    avg_comment = elem[1]
    print("{0}: {1:.2f} average comments per post".format(hour, avg_comment))
Conclusions¶
Now we know during which hours we should create a post to have a higher chance of receiving comments. But if we refer back to the documentation for the dataset, we will find that the times are given in the US Eastern time zone, so we need to convert them to the time zone we live in. For example, I live in Arizona, USA; 3:00 PM Eastern is equivalent to 12:00 PM in my time zone (while daylight saving time is in effect). So I may submit an ask post on HN at 12:00 PM to maximize the chance of receiving comments.
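That conversion can also be done programmatically. The sketch below uses Python's zoneinfo module (available in Python 3.9+); the example date is arbitrary, and the target zone America/Phoenix is just an illustration that can be swapped for your own time zone:

from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical example: convert 3:00 PM US Eastern to Arizona time (America/Phoenix)
eastern_time = datetime(2016, 8, 4, 15, 0, tzinfo=ZoneInfo("America/New_York"))
local_time = eastern_time.astimezone(ZoneInfo("America/Phoenix"))
print(local_time.strftime("%I:%M %p"))  # 12:00 PM (Eastern daylight time on this date)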