Exploring Hacker News Posts
Mar 13, 2022

Goal of the project¶
Hacker News is a site where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result. We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, product, or just generally something interesting. The goal of this project is to answer the following two questions:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
Data Collection¶
To answer our questions, we will analyze the Hacker News posts dataset, which can be downloaded from this link. Note that the dataset has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Let's first read the dataset as a list of lists.
from csv import reader

opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)

hn_header = hn[0]  # Separate the header from the dataset for future analysis
hn = hn[1:]

def explore_data(dataset, header=[], rows_and_column=True):
    # Print the header (if given) and the first five rows of the dataset
    if header:
        print(header)
        print('\n')
    for row in dataset[:5]:
        print(row)
        print('\n')
    # Optionally report the dataset's dimensions
    if rows_and_column:
        total_column = len(header) if header else len(dataset[0])
        print("The number of rows: ", len(dataset))
        print("The number of columns: ", total_column)

explore_data(hn, hn_header)
Below are the descriptions of the columns:
| Column | Description |
|---|---|
| id | The unique identifier from Hacker News for the post |
| title | The title of the post |
| url | The URL that the post links to, if the post has a URL |
| num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | The number of comments that were made on the post |
| author | The username of the person who submitted the post |
| created_at | The date and time at which the post was submitted (the time zone is US Eastern Time) |
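To make the column layout concrete, here is a small sketch (not part of the original analysis) that pairs the header with the first data row:

# Hypothetical helper: print one row of hn next to the column names
for name, value in zip(hn_header, hn[0]):
    print("{}: {}".format(name, value))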
Data Cleaning¶
In this project, we are only interested in posts whose titles start with either Ask HN or Show HN. Thus we create new lists of lists containing just the data for those posts.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Exploring ASK posts...")
explore_data(ask_posts)
print('\n')
print("Exploring SHOW posts...")
explore_data(show_posts)
Now let's check whether any rows have missing data. If data is missing from a row, we will delete that row.
def check_and_remove(dataset, total_column):
    # Remove rows that have fewer columns than expected and return them
    removed_items = []
    for row in dataset[:]:  # iterate over a copy so removing rows is safe
        if len(row) < total_column:
            removed_items.append(row)
            dataset.remove(row)
    return removed_items

removed_ask_post_rows = check_and_remove(ask_posts, 7)
print("Total rows deleted from ask_posts: ",
      len(removed_ask_post_rows))

removed_show_post_rows = check_and_remove(show_posts, 7)
print("Total rows deleted from show_posts: ",
      len(removed_show_post_rows))
Excellent! No data is missing from any row. Next, we need to check whether there is any duplicate data.
def check_duplicate(dataset):
    # Collect rows whose post id has already been seen
    duplicate_rows = []
    already_added = []
    for row in dataset:
        post_id = row[0]
        if post_id not in already_added:
            already_added.append(post_id)
        else:
            duplicate_rows.append(row)
    return duplicate_rows

duplicate_ask_posts = check_duplicate(ask_posts)
duplicate_show_posts = check_duplicate(show_posts)

print("Total duplicates found in ask_posts: ", len(duplicate_ask_posts))
print("Total duplicates found in show_posts: ", len(duplicate_show_posts))
Bravo! No duplicates were found in the dataset either, which makes our lives easier, doesn't it? Our dataset is now almost clean. Even the dates in the created_at column follow a consistent format, so there is no need to standardize the date entries.
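If you want to verify that claim yourself, a quick sketch like the one below (not part of the original analysis) tries to parse every created_at value with a single format and counts any entries that do not match:

import datetime as dt

# Hypothetical sanity check: every created_at value should parse with one format
bad_dates = []
for row in ask_posts + show_posts:
    try:
        dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    except ValueError:
        bad_dates.append(row[6])

print("Entries that do not match the expected format: ", len(bad_dates))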
Data Analysis¶
Now that we have cleaned the data, we will start analyzing the dataset to find answers to our questions. First, let's determine whether ask posts or show posts receive more comments on average.
def find_avg_comment(dataset):
    # Average number of comments per post in the given dataset
    total_comments = 0
    for post in dataset:
        num_comment = int(post[4])
        total_comments += num_comment
    avg_comments = total_comments / len(dataset)
    return avg_comments

avg_ask_comments = find_avg_comment(ask_posts)
print("The average number of comments on ask posts: {:.2f}".format(avg_ask_comments))

avg_show_comments = find_avg_comment(show_posts)
print("The average number of comments on show posts: {:.2f}".format(avg_show_comments))

avg_other_comments = find_avg_comment(other_posts)
print("The average number of comments on other posts: {:.2f}".format(avg_other_comments))
As we can observe, ask posts receive more comments than show posts on average. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
import datetime as dt

result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comment = post[4]
    result_list.append([created_at, num_comment])

print(result_list[:5])
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment_num = int(row[1])
    dt_object = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    post_hour = dt_object.hour
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = comment_num
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += comment_num

# Print the number of posts by hour in descending order
print({k: v for k, v in sorted(counts_by_hour.items(),
                               key=lambda elem: elem[1], reverse=True)})

# Print the number of comments by hour in descending order
print({k: v for k, v in sorted(comments_by_hour.items(),
                               key=lambda elem: elem[1], reverse=True)})
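As a side note, the same two tallies could be built more compactly with collections.defaultdict; this is just an equivalent sketch, not the approach used above:

from collections import defaultdict

# Equivalent tallies using defaultdict (same result as the loop above)
counts_alt = defaultdict(int)
comments_alt = defaultdict(int)
for date, num in result_list:
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").hour
    counts_alt[hour] += 1
    comments_alt[hour] += int(num)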
avg_by_hour = []
for key, val in comments_by_hour.items():
    avg = val / counts_by_hour[key]
    avg_by_hour.append([key, avg])

print(avg_by_hour[:4])
sorted_by_avg_comment = sorted(avg_by_hour, key=lambda elem: elem[1],
                               reverse=True)

print("Top 5 hours for Ask Posts Comments:")
for elem in sorted_by_avg_comment[:5]:
    dt_object = dt.datetime.strptime(str(elem[0]), "%H")
    hour = dt_object.strftime("%H:%M")
    avg_comment = elem[1]
    print("{0}: {1:.2f} average comments per post".format(hour, avg_comment))
Conclusions¶
Now we know during which hours we should create a post to have a higher chance of receiving comments. But if we refer back to the documentation for the dataset, we will find that the times are given in the US Eastern time zone, so we need to convert them to the time zone we live in. For example, I live in Arizona, USA; 3:00 PM Eastern is equivalent to 12:00 PM in my time zone (while daylight saving time is in effect). So I may submit an ask post on HN at 12:00 PM to maximize the chance of receiving comments.
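That conversion can also be done programmatically. The sketch below uses Python's zoneinfo module (available in Python 3.9+); the example date is arbitrary, and the target zone America/Phoenix is just an illustration that can be swapped for your own time zone:

from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical example: convert 3:00 PM US Eastern to Arizona time (America/Phoenix)
eastern_time = datetime(2016, 8, 4, 15, 0, tzinfo=ZoneInfo("America/New_York"))
local_time = eastern_time.astimezone(ZoneInfo("America/Phoenix"))
print(local_time.strftime("%I:%M %p"))  # 12:00 PM (Eastern daylight time on this date)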