NYU Builds Data-Sharing Network For Scientists--But Is It Legal?

If data-sharing is going to catch on, then science, technology and medical publishing will need to embrace the modern mantra of Open Everything--whether it’s technically within the boundaries of the law or not. Databrary, a video data-sharing site being developed at NYU, shows the enormous potential of allowing people to see behind the scenes of scientific studies.

You pay for science. Your tax dollars fund the national agencies that finance research. Yet you can't see most results of the science your dollars support--from cancer treatments to robotics--without paying the price.

Journals like Science and Nature will charge you $20 or more for access to a report just a few pages long. Open Science advocates like NYU developmental psychologist Karen Adolph believe scientific information should be free, like books in a library, and Adolph is determined to do something about it.

Labs tend to be secretive places, where researchers guard their data from competitors and seldom share their full methods. Scientists often publish selectively--sharing their successes while hiding null results, a practice known as publication bias. Scientific elitism like this is worse than unfair: It's dangerous.

Adolph says science needs to be open, and her project Databrary, a sharing platform announced by NYU this month, may free scientists to discover more powerful findings, faster--if it doesn’t get hamstrung by antiquated privacy laws first.

Here’s Why You Should Care About Open Data

Adolph is bothered by another layer of opacity in science: Researchers don't often share the raw data on which their published papers’ results are based, making them hard to reproduce. Opening up data and methods is what she hopes her new online library will do--allowing researchers to share, browse, tag, critique, and reanalyze video clips across labs.

Say your doctor prescribes you a pill like Abilify, a mood stabilizer that was the No. 2 top-selling drug last year, and you want to read about what the chemical does to your brain: not some slick ad copy written by the company that sells the dope, but the peer-reviewed scientific evidence itself. You're out of luck: That'll cost you $71, plus tax.

Your health is at stake here--not to mention your hunger for information. Shouldn't that science be everyone's right to read, seeing as your tax dollars funded the work? And shouldn't the raw data and procedures the company used to prove the drug's effects be open to the scientific community at large, to reinterpret and replicate? The same patient one psychiatrist calls "cured" another might call "sedated": Why should we take labs at their word?

If the government gets on board with transparent science, data sharing could become the new normal, by law--leveling the playing field for universities, hospitals, and companies alike. Free online publishers like the Public Library of Science (PLoS) are increasingly attracting the work of top scientists, and the Public Access Policy of the National Institutes of Health requires all NIH-funded research published since 2008 to be made available on PubMed Central, the free online database, within a year of publication. But Adolph believes we need much more.

How Government Is Teaming Up With Scientists To Set Data Free

Open-source science has been Adolph's priority since the '90s, when she worked on software for video labeling and data visualization. DataVyu (formerly OpenSHAPA) is mostly used by developmental psychologists like Adolph who study how infants learn--but in principle it can be used by anybody analyzing video, for whatever purpose.

"Everything we're doing is open," says Adolph. "Every line of code is on GitHub... All our administrative documents, operating procedures, everything is up there with all its bells and whistles, all its pimples and blemishes... So we are really moving forward with the intent that it's all just open: open Science, open sharing, open source."

When the Obama administration decided it wanted to invest public money in data-sharing, to make scientific research faster and more efficient, it approached Adolph for her expertise in open science. In 2011, Adolph organized a conference of 35 behavioral researchers, computer scientists, and library scientists to come up with a way to share video data while protecting subject privacy. She invited representatives from every federal agency she knew.

"There's a whole list of worries people have, and many of them got raised at that workshop," Adolph says. "Will I get credited? What if I'm not done using my data?... What if people find things wrong in my data, and I'm sort of outed? But the people at the NIH and NSF kept telling us: It's going to happen. So the choice is: Researchers can figure out how to do it, or it will happen through the government. But one way or another, people are going to have to figure out how to share data."

The result of the two-day National Science Foundation workshop was a team headed by Adolph, along with Rick Gilmore, an associate professor at Penn State who studies vision and brain development, and David Millman, NYU's Director of Digital Library Technology Services.

The NSF and NIH awarded Adolph grants for the project: $2,443,500 and $786,677, for the first year of a five-year grant. These federal funds are more transparent than much of the science process, because they're tax dollars: We can see where our money goes--why not what comes out of it?

YouTube For Scientists: The Rawest Data Sharing

"Raw data" may bring to mind a spreadsheet of rows and columns--but that's not the only kind Adolph wants. Sure, Databrary can deal with "flat" numbers, but what she's after first is data in its rawest form: videos, labeled with nothing but the age and sex of subjects.

Data in a spreadsheet, she explains, is only useful insofar as you know what the labels mean, and are interested in the same question as the person who built the database. But video shows reality and lets you ask whatever question you want.

Film has been standard in developmental science for a century. From early pioneers like Yale's Arnold Gesell, who tracked babies from womb to walking, and Myrtle McGraw's 1938 film Growth: A Study of Johnny & Jimmy, to MIT's Deb Roy, who filmed the first 90,000 hours of his son's life to analyze how he learned language, studies of kids have begun with movies.

Since babies can't talk, Adolph explains, studying them is kind of like studying animals: You have to infer what they're attending to, thinking, or feeling from what they do: what they look at, who they move toward or away from. "Looking time," as a result, is one of the main variables in developmental psych: how long a baby looked at "Display A" versus "Display B," or at his mother versus an experimenter, for example.

Videos are used to study how children learn walking and coordination (the topic of Adolph's lab), as well as language, social attachment, and self-control. In the "marshmallow test," one famous example, videos showed that a kid's ability to hold off on eating a treat often predicts academic success later, in high school and college.

I'll Name Your Data However I Want To

Databrary's name is intentionally broad, to include not just video but all kinds of data streams, from physiological measurements like brain scans or blood tests, to spreadsheets or questionnaire data--IQ, personality, or mental health checklists used to diagnose psych patients, for example--or even transcripts of talk or text from media. Data-sharing efforts for brain scans, like OpenfMRI, HumanConnectome.org, and Neuroshare, have influenced the design of Databrary.

"What I'm saying is: By opening the data up, and allowing transparency, the field can police itself," Adolph says. "We'll have a better basis for deciding what's good science. So we as the builders, or the developers of this repository, aren't going to decide what's good science. We're just going to open up the science and allow the community to decide where the promising areas of growth really are."

Databrary videos are meant to be categorized democratically--"bottom-up" rather than "top-down." They're defined by users rather than the librarians or video creators.

The only mandatory labels for a video will be the age and sex of people in it, plus links to any papers published on the data. The "meta-data" attached to the video will define what it is "about"--i.e., what information different scientists have dug out of it. If Adolph posts videos of children crawling and walking, tagged for "falls," a language scientist might tag the same video every time the baby speaks, or one interested in social bonding might tag it for the moments when the kid approaches his mom. And pretty soon, a single video will sprout a forest of papers around it, covering a whole range of behavioral research.
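
To make the idea concrete, here is a minimal sketch in Python of how this kind of bottom-up tagging could be modeled. It is not Databrary's actual schema: the names `Video`, `Tag`, `add_tag`, and `tags_by_label`, and the layout of the fields, are assumptions invented for illustration. Only the mandatory labels (age, sex, linked papers) and the notion of different researchers tagging the same footage come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Tag:
    """One user-supplied annotation on a stretch of video (hypothetical)."""
    label: str       # e.g. "fall", "speech", "approaches mom"
    start_s: float   # where the tagged moment begins, in seconds
    end_s: float     # where it ends, in seconds
    tagged_by: str   # the researcher or lab that added the tag


@dataclass
class Video:
    """A shared clip: a few mandatory labels plus open-ended user tags."""
    age_months: int                                        # mandatory
    sex: str                                               # mandatory
    paper_links: List[str] = field(default_factory=list)  # papers using the clip
    tags: List[Tag] = field(default_factory=list)         # added by any user

    def add_tag(self, label: str, start_s: float, end_s: float,
                tagged_by: str) -> Tag:
        """Attach a new tag; any authorized user may add one."""
        if end_s < start_s:
            raise ValueError("a tag cannot end before it starts")
        tag = Tag(label, start_s, end_s, tagged_by)
        self.tags.append(tag)
        return tag

    def tags_by_label(self) -> Dict[str, List[Tag]]:
        """Group tags by label, so each lab can pull out its own question."""
        grouped: Dict[str, List[Tag]] = defaultdict(list)
        for tag in self.tags:
            grouped[tag.label].append(tag)
        return dict(grouped)


# One clip, tagged by two labs asking different questions of the same footage.
clip = Video(age_months=13, sex="F")
clip.add_tag("fall", 12.0, 13.5, tagged_by="motor lab")
clip.add_tag("speech", 20.0, 22.0, tagged_by="language lab")
clip.add_tag("fall", 41.0, 42.0, tagged_by="motor lab")
```

The point of the design is that the clip itself carries almost no interpretation. Each research question lives in the tags, which can keep accumulating without anyone having to agree on a fixed set of columns in advance.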

Databrary may become a "YouTube for Scientists" of many kinds. Neuroscientists sometimes use video to record the positioning of brain-scanning equipment. Doctors use it to record patients with movement disorders, before and after a surgery or drug intervention, or children with developmental problems. Animal researchers use video to record procedures like surgeries on rodent or monkey brains, so that other scientists can see exactly what part of the brain they tracked. Education researchers routinely use video to study classroom lessons, too, finding patterns in teacher and student behavior. With this new tool and the proper permissions from participants, all this video could become open for critique and reanalysis, and could inspire questions in young scientists.

Data-Sharing: The Times Are A-Changing

Open data is already the convention in a few sciences, because of shared technology and cost. Astronomers, for example, pool data from a small number of powerful, expensive telescopes worldwide. Particle physicists share data too, as do earth scientists. Genomics has had an open-data policy from the beginning: Whenever a species' genome is sequenced, funders and journals generally require that it be deposited in GenBank, an open repository.

"[In science], just like in any other industry, cultures can change," Adolph says. "So that's part of what we're trying to do, is to be part of this new wave--changing the culture of behavioral science to make it more open, in the way that [other sciences] have moved. I think there's still plenty of room to compete, even if we share...You may even get more citations by opening up your lab rather than by keeping it closed."

The Clinical Problem: Trading Privacy For Transparency?

Transparency in science is trickiest with medical research. This makes it harder for Databrary to help the very people who have the most to gain from open data: sick people.

Privacy is an issue in all of the videos, since subjects are identifiable by face and voice. Databrary videos won't be public, but shared with a group of authorized researchers who have signed agreements with Databrary, to keep confidential the identities of the people on the videos. People in the videos must give written permission for their videos to be shared. Kids' videos can be shared by caregivers, but medical records are a different story--more strictly regulated by the government.

In building the community of potential Databrary contributors, Adolph and her collaborators contacted around 120 behavioral scientists. Of those, only a few declined to participate. Some were clinical researchers who study children with developmental disorders like autism, so-called "protected populations." Government regulations under HIPAA (the Health Insurance Portability and Accountability Act) particularly restrict the sharing of hospital records and other forms of private health information: psychiatric diagnoses, sexual histories, or medical illnesses, for example.

"The irony is: If you're a mother, and you have an autistic child, [you're] the most eager to get that data shared and let people figure it out--similar to a cancer cure," says Lisa Steiger, NYU's Community Engagement and Outreach officer for Databrary. "[Mothers of disabled kids] are the most eager to have their data shared and reused and analyzed more deeply. And yet they're the ones who are the least likely--[because] it's the most challenging to have their [HIPAA-protected] data shared. We don't know that it's impossible. We just have to figure out how to navigate that."

Privacy laws governing health care have not kept up with modern technology. Patients still need to be protected, of course, and Databrary wants to find a way to protect privacy while enabling data sharing.

"In this digital age, everything's changing," Adolph says. "One of the last things to change has to be people's comfort at having certain kind of private data shared. Institutional review boards [the bodies at universities, colleges, and hospitals that decide if a study can be performed]--they're lagging behind. They're from the days long before YouTube and Instagram."

"We're in a time now where I have to remind my teenage daughter every day: You better be careful what you text and what images you post of yourself, because those are digital files now, out in the wild. People are much more comfortable sharing videos and pictures of themselves. Most basic research is pretty harmless."

There's an irony in a government that requires extreme privacy protections in hospitals while spying on its own citizens through the NSA. In any case, participants can authorize their videos to be shown to other parties.

"Obviously, the restrictions were put in place with good intent," says Dylan Simon, the software developer of Databrary. "These are vulnerable populations, and you don't want people taking advantage of them. But I don't think they were put in place with the current technological landscape, where there are cameras everywhere, in mind. They were put in place so that dangerous people couldn't find [patients'] addresses and go and stalk them and take advantage of them, not so that researchers couldn't do science."

[Image: Flickr user Eric Fischer]