2013-11-12

Co.Labs

Can Artificial Intelligence Like IBM's Watson Do Investigative Journalism?

Winning Jeopardy was just a proof of concept. Now IBM’s artificial brain has moved on to conquering health care and, next, journalism.



Two years ago, the two greatest Jeopardy champions of all time got obliterated by a computer called Watson. It was a great victory for artificial intelligence: the system racked up more than three times the earnings of its next meat-brained competitor. For IBM’s Watson, the successor to Deep Blue, which famously defeated chess champion Garry Kasparov, becoming a Jeopardy champion was a modest proof of concept. The big challenge for Watson, and the goal for IBM, is to adapt the core question-answering technology to more significant domains, like health care.

WatsonPaths, IBM’s medical-domain offshoot announced last month, is able to derive medical diagnoses from a description of symptoms. From this chain of evidence, it’s able to present an interactive visualization to doctors, who can interrogate the data, further question the evidence, and better understand the situation. It’s an essential feedback loop used by diagnosticians to help decide which information is extraneous and which is essential, thus making it possible to home in on a most-likely diagnosis.

WatsonPaths scours millions of unstructured texts, like medical textbooks, dictionaries, and clinical guidelines, to develop a set of ranked hypotheses. The doctors’ feedback is added back into the brute-force information retrieval capabilities to help further train the system. That’s the AI part, which also provides transparency for the system’s diagnosis. Eventually, this "knowledge" will be used to articulate uncertainty, identifying information gaps and asking questions to help it gather more evidence.

Health care is just the beginning for Watson. Other disciplines that rely on evidentiary reasoning from unstructured documents or the Deep Web, including law, education, and finance, are also on the road map. But let’s consider another potential domain here, perhaps less lucrative than the others, but nonetheless important: news and journalism.

Media startup Vocativ identifies hot news stories by trawling the depths of the web, data-mining the vast seas of unindexed documents for information that might point to a story lead. Often journalists pair up with analysts, manually exploring data from different perspectives. The Associated Press’s Overview Project aims to build better visualization and analysis tools for investigative journalists to make sense of huge document sets.

What if much of this could be automated? A cognitive computer, like Watson, could search reams of evidence, generate hypotheses, and collect supporting and/or contradicting evidence. Potential news stories would be presented to journalists and analysts, who would weigh the evidence, assess its accuracy, and decide which story ideas to pass on to an editor for further pursuit. In this scenario, Watson would be providing a well-sourced tip.

Adapting Watson to new domains isn’t easy. According to a paper from IBM Research that describes the application of Watson in health care, the system has to be able to parse and understand the format of a variety of domain-specific documents. Then it needs to be re-trained so that it learns how to weigh different sources of evidence, and any special-purpose taxonomies or logic that drive the domain also need to be accessible to the system. For investigative journalism, documents might include interview transcripts, legal codes and statutes, social networks, other news articles, PDFs from Freedom of Information Act (FOIA) requests, or even document dumps from sources like WikiLeaks. Through an iterative process, the system would have to be trained, going back and forth with editors as it suggested stories and was told "yea" or "nay," each new vote modulating how the system weighs and integrates evidence.
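To make that back-and-forth concrete, here is a minimal sketch of how an editor's vote could nudge the weight given to each kind of source. It is not how Watson is actually trained; the source-type names, the learning rate, and the simple additive update are all assumptions made for illustration.

```python
def score_lead(evidence, weights):
    """Score a story lead as the weighted sum of its evidence.

    `evidence` maps a (hypothetical) source type to how strongly that
    source supports the lead; `weights` maps source types to trust.
    """
    return sum(weights.get(src, 0.0) * strength
               for src, strength in evidence.items())


def update_weights(weights, evidence, approved, learning_rate=0.1):
    """Return new weights after one editor vote on a suggested lead.

    An approval raises the weight of every source type that backed
    the lead; a rejection lowers it. Untouched sources keep their weight.
    """
    direction = 1.0 if approved else -1.0
    updated = dict(weights)  # leave the caller's dict unchanged
    for src, strength in evidence.items():
        updated[src] = updated.get(src, 0.0) + direction * learning_rate * strength
    return updated


# Illustrative starting point: every source type is trusted equally.
weights = {"foia_pdf": 1.0, "interview_transcript": 1.0, "social_network": 1.0}

# An editor approves a lead built on a FOIA document...
weights = update_weights(weights, {"foia_pdf": 1.0}, approved=True)
# ...and rejects one that rested on social-network chatter.
weights = update_weights(weights, {"social_network": 1.0}, approved=False)
```

After those two votes, a lead backed by a FOIA document would outscore an otherwise identical one backed by social-network data, which is the sense in which each vote modulates how evidence is weighed.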

Given a lot of re-engineering for Watson, how might an acumen for investigative reporting play out in a real-world news scenario? Earlier this year the International Consortium of Investigative Journalists (ICIJ) published a database of 2.5 million leaked documents about the offshore holdings and accounts of more than 100,000 entities, including emails, PDFs, spreadsheets, images, and four large databases packed with information about offshore companies, trusts, intermediaries, and other individuals involved with those companies. Undaunted, 112 reporters spent 15 months analyzing the data, a lot of human time and effort.

For Watson, ingesting all 2.5 million unstructured documents is the easy part. For this, it would extract references to real-world entities, like corporations and people, and start looking for relationships between them, essentially building up context around each entity. This could be connected out to open-entity databases like Freebase, to provide even more context. A journalist might orient the system’s "attention" by indicating which politicians or tax-dodging tycoons might be of most interest. Other texts, like relevant legal codes in the target jurisdiction or news reports mentioning the entities of interest, could also be ingested and parsed.
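The "building up context" step can be pictured as a co-occurrence graph: every time two entities turn up in the same document, the link between them gets a little stronger. The sketch below is a toy version under stated assumptions. It matches against a hand-supplied list of entity names rather than doing real entity extraction, and the company and person names are invented.

```python
from collections import Counter
from itertools import combinations


def find_entities(text, known_entities):
    """Return the known entities mentioned in `text`, in sorted order.

    Plain case-insensitive substring matching stands in for a real
    entity extractor here.
    """
    lowered = text.lower()
    return sorted(e for e in known_entities if e.lower() in lowered)


def cooccurrence_graph(documents, known_entities):
    """Count how many documents mention each pair of entities together."""
    edges = Counter()
    for text in documents:
        found = find_entities(text, known_entities)
        # Sorting above keeps each pair in one consistent order.
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return edges


# Invented example data.
known = {"Acme Holdings", "Beta Trust", "Jane Roe"}
docs = [
    "Acme Holdings wired funds to Jane Roe.",
    "Jane Roe chairs Acme Holdings and Beta Trust.",
]
graph = cooccurrence_graph(docs, known)
```

A journalist steering the system's "attention" toward a particular person would, in this picture, simply start from that person's node and look at the strongest edges first.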

Watson would then draw on its domain-adapted logic to generate evidence, like "IF corporation A is associated with offshore tax-free account B, AND the owner of corporation A is married to an executive of corporation C, THEN add a tiny bit of inference of tax evasion by corporation C." There would be many of these types of rules, perhaps hundreds, and probably written by the journalists themselves to help the system identify meaningful and newsworthy relationships. Other rules might be garnered from common sense reasoning databases, like MIT’s ConceptNet. At the end of the day (or probably just a few seconds later), Watson would spit out 100 leads for reporters to follow. The first step would be to peer behind those leads to see the relevant evidence, rate its accuracy, and further train the algorithm. Sure, those follow-ups might still take months, but it wouldn’t be hard to beat the 15 months the ICIJ took in its investigation.
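The example rule above translates almost directly into code. What follows is a minimal sketch, assuming a hypothetical fact store of simple relation pairs and an arbitrary weight of 0.05 for "a tiny bit" of inference; none of these names or numbers come from IBM.

```python
from dataclasses import dataclass, field


@dataclass
class Facts:
    """Hypothetical fact store: each relation is a set of pairs."""
    offshore_accounts: set = field(default_factory=set)  # (corporation, account)
    owners: set = field(default_factory=set)             # (person, corporation)
    executives: set = field(default_factory=set)         # (person, corporation)
    marriages: set = field(default_factory=set)          # (person, person)


def married(facts, a, b):
    """Marriage is symmetric, so check both orderings."""
    return (a, b) in facts.marriages or (b, a) in facts.marriages


def spouse_offshore_rule(facts, weight=0.05):
    """IF corporation A has an offshore account AND A's owner is married
    to an executive of corporation C, THEN add `weight` of suspicion to C."""
    scores = {}
    corps_with_accounts = {corp for corp, _ in facts.offshore_accounts}
    for owner, corp_a in facts.owners:
        if corp_a not in corps_with_accounts:
            continue
        for executive, corp_c in facts.executives:
            if married(facts, owner, executive):
                scores[corp_c] = scores.get(corp_c, 0.0) + weight
    return scores


def rank_leads(facts, rules, top_n=100):
    """Sum the scores from every rule and return the strongest leads."""
    totals = {}
    for rule in rules:
        for entity, score in rule(facts).items():
            totals[entity] = totals.get(entity, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Invented example data.
facts = Facts(
    offshore_accounts={("Corp A", "Account B")},
    owners={("Alice", "Corp A")},
    executives={("Bob", "Corp C")},
    marriages={("Alice", "Bob")},
)
leads = rank_leads(facts, [spouse_offshore_rule])
```

With hundreds of such rules in the list, each contributing its own small weight, the ranked output is the list of leads a reporter would then open up and check against the underlying evidence.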

Watson isn’t going to "solve" investigative journalism, as if it were a great jigsaw puzzle, but it might speed things up and help us deal with scale, and it might help identify overlooked starting points and leads for journalists to delve into. Still, as much as Watson appears to be smart, it lacks human traits, like creativity, judgment, empathy, and ethics. Document dumps in an investigation can be messy and hard for anyone to interpret and make sense of. All of the logic and data might suggest a person is using an offshore account to evade taxes, but the world can be a nuanced place, and we’ll still need people driving these big cognitive appliances to make the final call.

As big data and algorithms grow to exert more power on society, it stands to reason that their power might also be directed back toward holding more traditional institutions accountable. Building thinking machines that can help investigate fraud, abuse, negligence, and incompetence in government or corporations could help amplify the volume and impact of investigative journalism. But if news organizations are serious about our watchdog function, we’ll need to invest in developing ambitious new technologies, not just adapting off-the-shelf toolkits. It took IBM five years to build that first Jeopardy-winning version of Watson and it’s taking years more to adapt the technology to other new domains. Would media companies, philanthropists, or foundations fund the journalistic version of Watson, or could IBM one day be publishing competitive news scoops instead?

Nick Diakopoulos is a Tow Fellow at the Columbia University Journalism School working on applications of data and computational journalism. He is also a consultant specializing in research, design, and development for computational media applications. Areas of expertise include data visualization, social computing, and news. Find him on Twitter: @ndiakopoulos

[Image: Flickr user Davey Rockwell]





1 Comment

  • Steve Ardire

    > Adapting Watson to new domains isn’t easy. According to a paper from IBM Research that describes the application of Watson in health care, the system has to be able to parse and understand the format of a variety of domain-specific documents. Then it needs to be re-trained so that it learns how to weigh different sources of evidence, and any special-purpose taxonomies or logic that drive the domain also need to be accessible to the system.

    @saffrontech learns incrementally so does not have this problem or burden of @ibmwatson