Constructing a detailed chronology of the key events in a person’s life is normally a largely manual task—the type of work best left to the likes of Walter Isaacson. But what if an algorithm could be used to generate biographical information in a fraction of the time it would take a flesh-and-blood researcher? And what if it could be done simply by using machine learning tools to analyze a person’s tweets?
This is exactly the problem that Cornell University professor Claire Cardie and computer science PhD Jiwei Li set out to solve with their "Timeline Generation" project: an innovative algorithm which uses the data contained in tweets to generate customized biographies of individual Twitter users.
"What I found was that while it is very easy to find biographical information about pop stars and athletes, it is far more difficult to keep track of events in the lives of non-famous individuals," Li says. "Where the lives of celebrities are very well documented, in the case of non-celebrities there is a real lack of available data. The aim was to create an algorithm to generate timelines for these users, detailing all of the significant events in their lives."
Although it’s only at an experimental stage—there’s no public-facing tool we can try out—the duo’s research may hold the key to one day turning data from one’s presence on the social web into a biographical narrative.
"Twitter is amazing for researchers, because it gives us access to publicly available biographical information of any person you happen to know the handle for," Li continues. "Many people tweet about everything that happens to them—and the shorter 140-character limit forces them to be concise."
So how exactly does the algorithm work?
Using Twitter feeds for topic extraction is not an entirely new idea. For the Timeline Generation project, however, the technical challenge was not just about pulling a few relevant keyword tweets out of a noisy mass of unfiltered information, but about sorting and classifying each tweet a person posts into different categories.
"We started off by just dividing tweets into ‘public’ and ‘private’ posts," says Cardie. Public posts were considered to be those events like the U.S. election or the NBA Finals, which most people have an opinion of, but are considered non-specific in terms of individual users.
"We don’t use a dataset of important public events, but rather look at what is being tweeted about across a large number of users," says Li. "If our set of users has a large overlap in the topics they are discussing, our model can determine that this is likely to be a public topic."
Public and private posts didn’t turn out to be granular enough, though, so Li and Cardie filtered the posts again, this time into "general" and "specific" categories. A general event might be something relevant to an individual, but which tells you relatively little about the key events in their life. General events were tweeted about regularly over a long period of time, which reflects how predictable they are.
A specific event, on the other hand, would likely be the subject of a large number of tweets over a relatively short period. "A new job or a wedding is something that will likely be tweeted about a lot, but for a short timespan," Cardie says. "Your yoga class, however, or complaints about the commute to work might recur over a much longer timeframe, therefore being classed as more general in nature."
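The distinction Cardie draws, many tweets in a short burst versus mentions spread over months, can be sketched as a simple rule of thumb in Python. This is an illustrative heuristic rather than the method used in the project: the function name `classify_topic` and the parameters `window_days`, `burst_share`, and `min_mentions` are assumptions chosen for the example, as are the sample dates.

```python
from datetime import date, timedelta


def classify_topic(dates, window_days=14, burst_share=0.8, min_mentions=3):
    """Label a topic 'specific' if most of its mentions fall in one short window.

    `dates` holds the dates of one user's tweets about a single topic.
    All thresholds are arbitrary values picked for illustration.
    """
    if len(dates) < min_mentions:
        return "general"  # too few mentions to count as a burst

    ordered = sorted(dates)
    window = timedelta(days=window_days)

    # Sliding window: find the largest number of mentions within `window_days`.
    best = 0
    start = 0
    for end in range(len(ordered)):
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)

    return "specific" if best / len(ordered) >= burst_share else "general"


# Hypothetical examples: a wedding tweeted about within one week,
# and a yoga class mentioned roughly once a month.
wedding = [date(2013, 6, d) for d in (1, 2, 2, 3, 5)]
yoga = [date(2013, m, 5) for m in range(1, 7)]

print(classify_topic(wedding))  # specific
print(classify_topic(yoga))     # general
```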
Of course, Twitter’s main strength is not just as a micro-blogging tool, but as a social network as well—meaning that people aren’t tweeting in isolation, but as part of a wider community. This, in turn, helps to answer the problem of missing data. If a person only tweets a few times a week—or even a day—it can be difficult to establish patterns in what they are talking about. Because of this, Li and Cardie decided to incorporate Twitter’s social aspects by also including data generated by followers.
"Our model takes this into account in an implicit way," Cardie continues. "The tweets used to generate each user biography come not just from your own tweets, but also the tweets that you retweet, as well as those belonging to the people that follow you. All of these are considered equally as important as the tweets that you post yourself."
The more information the algorithm has access to, the more accurate the biography Timeline Generation can create.
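As a rough illustration of the pooling Cardie describes, the Python sketch below merges a user's own tweets, the tweets they retweeted, and their followers' tweets into one collection in which every tweet counts once. It is a simplification under stated assumptions, not the project's code: tweets are represented as hypothetical `(author, text)` tuples, and the function name `pooled_tweets` is invented for the example.

```python
def pooled_tweets(own, retweeted, from_followers):
    """Merge three tweet sources into one list, dropping exact duplicates.

    Each tweet is an (author, text) tuple. Every source is treated equally:
    no tweet is weighted more heavily because of where it came from.
    """
    seen = set()
    pooled = []
    for source in (own, retweeted, from_followers):
        for tweet in source:
            if tweet not in seen:
                seen.add(tweet)
                pooled.append(tweet)
    return pooled


# Hypothetical sample data for a user called "ana".
own = [("ana", "Started my new job today!")]
retweeted = [("bo", "Congrats to Ana on the new role")]
from_followers = [
    ("bo", "Congrats to Ana on the new role"),  # duplicate of the retweet
    ("cy", "Drinks for Ana tonight"),
]

corpus = pooled_tweets(own, retweeted, from_followers)
print(len(corpus))  # 3
```

Even when the user posts only once about the new job, the followers' tweets add further evidence for the same event.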
Li started the Timeline Generation project by looking at celebrities—seeing whether he could create an algorithm that could, for instance, keep a Wikipedia page or personal website up to the minute by incorporating new information on a constant basis. Having done this, he realized that looking at non-celebrities may be a better application of the Timeline Generation algorithm, since these are the individuals who prove tricky to research by conventional means.
The resulting algorithm, Li and Cardie explain, could have applications across a broad range of areas. An entrepreneur wanting to know more about a competitor’s past could use the Timeline Generation tool to quickly construct a biography of that individual. Employees, meanwhile, could get access to information about their bosses, while bosses could more easily keep on top of the personal lives of their employees, perhaps knowing when a wedding or other significant life event is coming up.
Li acknowledges that their approach may raise privacy concerns. Although Twitter feeds are public by design, the idea of a person’s personal feed being used for easy reconnaissance is not one that is likely to be welcomed by everyone.
"The amount of information that can be extracted about an individual this way is something we really haven’t had to think about a great deal in the past," he says. "If you wanted to know about me previously, you didn’t have the access to do it. You could, of course, read through my tweets one at a time—but that might take too long, particularly if you’re an employer wanting to know about the hundreds of individuals who work for you."
"With our algorithm it can be done easily and, more important, quickly," he continues. "The question of privacy is one of the biggest challenges in scaling this work. As far as I’m concerned, it’s the only reason this couldn’t be adopted on a larger scale."
Moving forward, another area to think about concerns how to present the narrative information the algorithm pulls to the surface—whether this could be achieved through data visualization techniques, or even using algorithms to translate it into a readable narrative.
"At this stage, we haven’t put too much thought into the ways this information could be output," Cardie says. "But there are a number of different approaches we could potentially take—particularly when it comes to making the information accessible to users."
So could the best-seller biographies of the future be written by a bot? It might sound farfetched, but if recent history has proven anything, it is that betting against algorithms is a bad idea.