From the outside, Reddit appears autonomous: submit a link, upvote or downvote people’s submissions, or start a subreddit, ad infinitum. But behind the scenes, a group of cyber-urban-explorers is emerging who have made it their hobby to discover the inner workings of Reddit, seeking to reveal the secrets of the elusive Reddit admins whose soft power rules the network.
Randy Olson is one such enthusiast. A PhD student at Michigan State University, Olson helped create a subreddit discovery tool that Co.Design reported on last year. Earlier this year on Co.Labs, Olson reverse engineered what he called the "window of virality" of Reddit posts. (“A 12-hour-old post needs roughly 3x the score to match the hotness of a six-hour-old post!” he wrote.)
Now he has revealed the inner workings of the recent subreddit coup, which caused some of the site's most popular subreddits (such as /r/Technology) to be delisted from the homepage.
For many people, Reddit is the source for discovering new things on the web. Reddit.com receives about 22 million unique visits per day, according to the analytics intelligence company Alexa. And it is the 57th most popular site on the web. If a link does well on Reddit’s home page, then it is bound to catch on with millions of web users.
So, around the week of May 7th, when Reddit changed which subreddits would show up on the main page, the Internet took notice. The default subreddits are the subreddit links that visitors see on Reddit’s front page when they aren’t logged in and what they are automatically subscribed to when they first create an account on Reddit. Olson saw a jump in popularity in some of the subreddits he had been tracking because of the change.
“It just so happens that the time period I sampled over covered the week that the Reddit admins made a huge change to the default set of subreddits,” he wrote on his blog.
Of the subreddits he had been tracking, a few of the new defaults saw an immediate increase in virality. The more viral the posts were in a subreddit, the redder Olson’s heatmaps would become. Bluer shades meant the opposite. Some heatmaps went directly from blue to red, like /r/Art and /r/Documentaries, but Olson noted that /r/tifu had just turned a deeper shade of red.
Olson says that most of the existing default subreddits were not affected, but a few he was tracking did drop in virality. For instance, the /r/videos heatmap turned a deeper shade of blue after the defaults changed, but Olson notes that it had already been losing traction in the weeks beforehand.
Reddit determines the popularity of a post across the site with an algorithm it calls the “hot ranking.” Under this ranking, Reddit calculates a post’s popularity by subtracting the number of downvotes from the number of upvotes and factoring in the age of the post.
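For readers who want to see the idea in code, here is a minimal Python sketch of a hot ranking of this kind. It is modelled on the version of the formula that has circulated from Reddit's open source code, not on anything stated in this article: the function name hot_score, the 45,000-second divisor, and the epoch offset are all assumptions for illustration, and the site's live ranking may differ.

```python
from math import log10

# Assumed constants, taken from the widely circulated open-source version
# of the ranking; the article itself does not state them.
REDDIT_EPOCH = 1134028003   # Unix timestamp used as the zero point
DECAY_SECONDS = 45000       # seconds of age "worth" one factor of ten in score


def hot_score(ups, downs, submitted_at):
    """Return a hotness value for a post submitted at Unix time `submitted_at`."""
    score = ups - downs                       # net votes
    order = log10(max(abs(score), 1))         # votes count on a log scale
    sign = (score > 0) - (score < 0)          # +1, 0, or -1
    age_term = (submitted_at - REDDIT_EPOCH) / DECAY_SECONDS
    return round(sign * order + age_term, 7)  # newer posts get a larger age term


if __name__ == "__main__":
    now = 1400000000            # an arbitrary reference time
    six_hours = 6 * 3600
    newer = hot_score(100, 0, now - six_hours)       # six-hour-old post
    older = hot_score(302, 0, now - 2 * six_hours)   # 12-hour-old post
    print(newer, older)
```

Under these assumed constants, six extra hours of age costs 21,600 / 45,000 = 0.48 on the log scale, which is a factor of about three in score. That is consistent with Olson's remark that a 12-hour-old post needs roughly 3x the score of a six-hour-old one.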
Olson determines the “virality” of Reddit posts within a subreddit by considering two measures. First, he takes a baseline “hotness” score, where he sums the hotness of 25 new posts that have no upvotes in the subreddit. Second, he sums together the hotness of the top 25 posts in the subreddit. Then, he divides the top posts’ summed hotness by the baseline hotness of the subreddit.
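The ratio described above is simple enough to sketch in a few lines of Python. This is an illustration of the description in this article, not Olson's actual code: the function name virality and its parameters are hypothetical, and the hotness values are assumed to have been collected already.

```python
def virality(top_hotness, baseline_hotness, n=25):
    """Ratio of summed top-post hotness to summed baseline hotness.

    top_hotness:      hotness values of posts in the subreddit
    baseline_hotness: hotness values of new posts with no upvotes
    n:                how many posts to sum on each side (25 in the article)
    """
    # Sum the n hottest posts in the subreddit.
    top_sum = sum(sorted(top_hotness, reverse=True)[:n])
    # Sum the first n baseline posts.
    baseline_sum = sum(baseline_hotness[:n])
    if baseline_sum == 0:
        raise ValueError("baseline hotness sums to zero; ratio is undefined")
    return top_sum / baseline_sum


if __name__ == "__main__":
    # Made-up hotness values, three posts per side for brevity.
    print(virality([12.0, 11.5, 11.0], [10.0, 10.0, 10.0], n=3))
```

A ratio near 1 would mean the top posts are barely hotter than brand-new ones, while larger values would correspond to the redder cells in Olson's heatmaps.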
Olson has been active on Reddit for three years. He has tried out other social bookmarking sites, like Digg and StumbleUpon, but was drawn to Reddit’s culture of hoarding.
“I’ve also just been just generally interested in it as a complex system. Maybe we can use that to our benefit if the goal is to, you know, make our posts catch on better or something like that,” Olson says.
During the workday, he is pursuing a doctorate in computer science at Michigan State University. Olson’s interest in Reddit mainly occupies his free time. Over spring break in 2013, he decided to analyze the behavior that got him hooked on Reddit in the first place. Using Reddit’s API and PRAW, a Python wrapper for it, Olson was able to access all of the submissions and user comments on Reddit’s site. He let his scraper run for about a month and a half to assemble a dataset that covered the years 2005 through 2013.
One of the perks of being a doctoral student in MSU’s computer science department is having access to the department’s supercomputers. Olson estimates that he downloaded around 100 GB of data during that initial data scrape. He would not have been able to manage the scrape and data without the extra computing space.
Since then, Olson has been publishing his revelations about Reddit on his personal blog, focusing on a few variables, like a link’s timing, context, and title. He quickly realized that the time of the day during which you submit a post greatly influences the success of a Reddit link submission.
From his analyses, he has seen that a new post needs to be submitted to a subreddit when posting activity is low. A new post has the best chance of getting upvoted when activity on Reddit is at its lowest. This generally happens at around 7:00 a.m. EST.
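Finding a quiet hour like this from a set of scraped submission times can be sketched as follows. This is a hypothetical helper, not Olson's method: the name quietest_hour is made up for illustration, timestamps are assumed to be Unix times, and EST is treated as a fixed UTC-5 offset with no daylight saving adjustment.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: EST as a fixed UTC-5 offset, ignoring daylight saving time.
EST = timezone(timedelta(hours=-5))


def quietest_hour(submission_times, tz=EST):
    """Return the hour of day (0-23) with the fewest submissions.

    submission_times: iterable of Unix timestamps, one per submitted post.
    Ties are broken in favor of the earliest hour.
    """
    # Count how many posts were submitted in each local hour of the day.
    counts = Counter(
        datetime.fromtimestamp(ts, tz=tz).hour for ts in submission_times
    )
    # Hours with no submissions at all count as zero.
    return min(range(24), key=lambda hour: counts[hour])
```

Run over a large sample of submissions, a function like this would point to the low-activity window the analysis describes, though the exact hour would depend on the data and the subreddit.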
Additionally, Olson says, a post has to make it into a subreddit’s top 25 posts before it can make the jump to Reddit’s default front page.
Another Reddit enthusiast did a similar data scrape, but he was able to do it much faster than Olson. Jason Baumgartner created his own API to pull posts and comments from the site, but his version has a much higher bandwidth than Reddit’s API does. He hosts the data on his own 128 GB server.
“My API runs faster mainly due to a lot of customized code and also because there is currently less load on my server than there is on Reddit,” Baumgartner says.
Reddit is completely open source, but Baumgartner’s version of the Reddit API was not written in Python, the language Reddit itself uses. He built his API on a MariaDB database and uses PHP to handle the requests to it. To deal with the posts and comments in real time, he wrote a script in Perl.
And Baumgartner did not stop there. He created a site called RedditAnalytics, which shows his real-time analysis as well as his custom search functionality. The search tool was created with the open source Sphinx search engine.
Giving Reddit’s bare site a more in-depth, analytical dimension can take many visual forms. When Olson and his colleague created the subreddit discovery visualization, they used an open source template. And it shows. When a user clicks on some of the nodes in the graphic, the phrase “Reticulating splines” appears, an empty reference to the Sims game series. The feature was just left over from the visualization template.
To create the graphics for his virality heatmaps, Olson just used the imshow() function in Python.
At the end of the day, it is hard to say whether some deliberate changes to the setting of a website can breathe new life into its content. But when data pundits get ahold of inside information, that all changes. Olson has submitted his data study to a journal, and it is still under review. For now, his data is open to the public.
[Image: Flickr user Eva Blue]