Data Sharing and Reuse: Expanding Our Concept of Collaboration
When we consider the spectrum of ways in which scientists are now interacting and collaborating with colleagues, a new set of collaborative practices comes into view: data sharing and reuse. While conventionally not considered teamwork, it is a highly discussed approach that expands our concept of collaboration. Yet there remains a lack of knowledge about how this work should be carried out efficiently and effectively.
A recent editorial in the New England Journal of Medicine sparked a firestorm when its authors used the term “research parasites” to describe scientists who use the data of others “for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited”. Scientists and others on Twitter responded by roundly and swiftly condemning the editorial and accusing its authors of being more interested in their own careers than in advancing scientific knowledge. Furthermore, the Twittersphere generally mocked and dismissed the concerns raised by the authors about being scooped or data being used improperly, considering such risks irrelevant and improbable.
As calls for data sharing grow increasingly loud and urgent, including recent remarks from Vice President Biden, and sentiment in the scientific community generally turns toward an open science model, it is important to take a step back and look at what we are actually asking scientists to do and what kinds of support they might need when sharing and reusing data.
The ubiquity of Information and Communication Technologies (ICTs) has led to a dramatic increase in the practice of sharing and reusing data in scientific research, especially in large team science initiatives. Consortia such as the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO), funded by the National Cancer Institute, bring together existing data from multiple studies, then work collaboratively to harmonize those data and possibly integrate new data. In doing so, they are able to answer questions individual studies are underpowered to investigate. But data sharing and reuse can be tricky, for both technological and social reasons, as discussed below. Groups must negotiate how to work together and to what degree to share, then spend substantial amounts of time interpreting and understanding one another’s data. As more collaborative projects engage in data sharing and reuse to push the boundaries of what is possible when working in teams, the SciTS community can help ease some of the challenges.
While Big Data gets all the attention, the “small data” of many studies, when combined or used in new ways, may hold even greater potential for discovery. What sometimes gets lost in the discussions about data sharing challenges is that the goal is not really sharing the data. The goal is for others to reuse those data in a new and exciting way. As such, it is crucial that we consider the entire ecosystem of data sharing and reuse.
Data Sharing Terminology
While the term data seems straightforward, it is surprisingly difficult to pinpoint exactly what is meant by the term. In general, we describe data as measurements made and stored, but what counts as data in one field may not be regarded as such in another. In fact, different disciplines may regard the same measurements as data in one instance and metadata (data about data) in another. For example, consider oceanographers collecting water samples at various points in the ocean. A scientist concerned with the sea life present in each sample may be interested in the census of creatures and consider water temperature as just metadata about that collection. However, a scientist studying the water temperatures themselves would clearly use those measurements as data.
The terms data sharing, data deposit, data reuse, and data pooling (or harmonization) are often used interchangeably and yet may mean different things. Here, I define data sharing as the act of preparing one’s own study data for use by someone else, such as another investigator or study team, and transmitting those data. In general, data sharing is a one-time event from one investigator or study team to another. Large studies such as the Women’s Health Initiative or Nurses’ Health Study have procedures in place to streamline this process, but data sets are still generally prepared in response to a specific data request. Data sharing may require varying degrees of data cleaning and documentation, depending on the purpose for which the data will be used.
Data sharing is distinct from data deposit, which is the act of preparing one’s data and depositing it into a repository of some sort, such as dbGaP. In general, depositing data into a repository requires transforming the data into a standard format, with standard metadata attached. Data reuse takes place when an investigator or study team uses data someone else collected and prepared for sharing or deposit. Reusing data requires developing sufficient understanding of the data to use it in a way that is scientifically appropriate; this is not always an easy task. Finally, data pooling or harmonization involves combining data from several studies in a harmonized manner in order to perform statistical analyses in the integrated data set. As each of these activities requires different actions and expertise to complete, it is critical that we keep these distinctions in mind as we seek to develop solutions for the challenges discussed below.
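To make the idea of pooling or harmonization more concrete, here is a minimal Python sketch. It assumes two hypothetical studies that recorded the same concepts with different variable names, codes, and units; every field name, code, and value below is invented for illustration and does not come from any real study or consortium. Real harmonization involves far more judgment than a lookup table can capture.

```python
# Hypothetical records from two studies that coded the same concepts differently.
study_a = [
    {"id": "A1", "sex": "F", "height_cm": 165.0},
    {"id": "A2", "sex": "M", "height_cm": 180.0},
]
study_b = [
    {"pid": 7, "gender": 2, "height_in": 65.0},
    {"pid": 8, "gender": 1, "height_in": 70.0},
]

# Assumed coding schemes; in practice these come from each study's documentation.
SEX_CODES_A = {"F": "female", "M": "male"}
SEX_CODES_B = {1: "male", 2: "female"}
CM_PER_INCH = 2.54


def harmonize_a(rec):
    """Map a Study A record onto the shared (harmonized) variable definitions."""
    return {
        "study": "A",
        "participant": rec["id"],
        "sex": SEX_CODES_A[rec["sex"]],
        "height_cm": rec["height_cm"],
    }


def harmonize_b(rec):
    """Map a Study B record onto the same definitions, converting inches to cm."""
    return {
        "study": "B",
        "participant": str(rec["pid"]),
        "sex": SEX_CODES_B[rec["gender"]],
        "height_cm": round(rec["height_in"] * CM_PER_INCH, 1),
    }


# The pooled data set: every record now uses the same names, codes, and units.
pooled = [harmonize_a(r) for r in study_a] + [harmonize_b(r) for r in study_b]

for row in pooled:
    print(row)
```

Even in this toy case, the mapping functions encode knowledge (what the code 2 means in Study B, which unit a height was recorded in) that has to come from documentation or from the original investigators, which is why reuse depends so heavily on understanding someone else’s data.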
Much has been written in the scientific literature about the opportunities presented by increased data sharing and reuse. These benefits include better and more rigorous science, training benefits to junior researchers and greater efficiency.
Reproducibility of results is a major issue in scientific research, as new studies are built on the results of previous work. When data are shared, outside researchers can analyze a data set independently and confirm findings before beginning new studies dependent on previous results[8-11]. Such transparency and openness help prevent fraudulent data and cherry-picking of results and may help improve research practices. New explorations also can support wider conversations around the analyses that allow for different interpretations and generate new conclusions.
Acceleration of discoveries is a major potential benefit of data sharing and reuse, which can spur innovation by allowing researchers to utilize existing data in new analyses, explore secondary hypotheses or combine data sets for pooled analyses in unexpected ways[13-15]. Data sharing and reuse also support exploration of new questions and enable inter- and trans-disciplinary research[10, 16]. Goodman notes that systematic reviews, often considered the most persuasive form of scientific evidence, can be vastly improved by access to the underlying data rather than just to the reported results of studies. For individual scientists, data sharing can also result in more citations for their work.
Junior investigators would also benefit greatly from increased data sharing. Attempting to replicate existing findings with original data sets would be a great training experience. Furthermore, access to underlying data could prevent graduate students from beginning new projects based on previous results that were weak or incorrect, saving time and frustration. Not only do real-world data allow junior researchers to work on substantive issues, they also provide training grounds for learning about data management issues. No data set is perfect and researchers must learn how to deal with missing and inconsistent data.
Finally, data sharing and reuse present opportunities to make science more cost-effective, efficient and ethical. Collecting data can be expensive and time-consuming. Utilizing existing data provides the opportunity to investigate questions, especially exploratory hypotheses, with a smaller investment of research dollars. Ferguson et al. go further and call it wasteful to collect new data that duplicate existing data sets. Increasingly, scientific leaders are also emphasizing our ethical obligation to maximize participants’ research contributions by utilizing existing data to its fullest potential.
There are, of course, also tremendous challenges in both sharing and reusing data. These challenges fall into five categories: (1) lack of infrastructural support for data sharing and reuse activities; (2) misalignments in the incentive system; (3) privacy, ethical and legal issues; (4) cultural issues; and (5) lack of research around the work of sharing and reusing data.
Lack of infrastructural support for data sharing and reuse activities. In the context of data sharing and reuse, infrastructural support includes: funding to support data activities such as data documentation and setting up data management platforms, and the skilled staff to engage in those activities, as well as policies and guidelines around when and how to share and reuse data. Though difficult to quantify, it appears that much data sharing and reuse takes place in a peer-to-peer manner, what Poldrack and Gorgolewski call the “data bazaar”. This free-form type of sharing and reuse is expensive and time-consuming, and generally done with little or no additional financial support. Vickers notes a lack of guidance on how to prepare a dataset for sharing, which further complicates both sharing and reuse, as data recipients may also struggle with understanding the data. Data are complex and collected with a specific purpose in mind, and are organized in a database designed for a specific question. As such, it can be difficult to convey the full context of the data to someone who is unfamiliar with the data[5, 21].
Technology infrastructure issues are a major impediment to scaled-up data sharing and reuse. Many independent, non-integrated data repositories exist, including disciplinary-specific databases and institutional repositories, but most data creators are left on their own for data management. Without an obvious, reliable, sustainable storage infrastructure available to researchers, it would be a massive challenge to share data widely, given the difficulties associated with storing, managing, archiving and retrieving so much data[22, 23], not to mention the lack of highly trained data managers. Even if such a repository existed, the lack of ontologies for describing data remains a problem[15, 16, 23]. Shared data that are not discoverable through an organized retrieval system are less useful.
Misalignments in the incentive system. A second major challenge to data sharing and reuse is the lack of credit for data sharing activities in the academic incentive system. Most scientists are part of the tenure systems at their institutions and thus under pressure to publish new results as frequently as possible. As discussed above, data sharing is time-consuming, and the time spent on it is not generally acknowledged as a scientific activity that counts toward tenure and promotion. Furthermore, data sharing is expensive, and in the current tight funding climate, few grant-funded investigators want to spend precious funds preparing data for someone else’s research, even in exchange for authorship credit.
Privacy, ethical and legal issues. Challenges related to privacy and confidentiality are further impediments to a strong system of data sharing and reuse[22, 25]. While newer studies, begun in the era of widespread reuse, may have consent forms that allow free sharing of data with other researchers, many older studies do not. Furthermore, deidentification or anonymization of data is not fool-proof, leaving researchers open to accusations of failing to preserve patients’ privacy. Dove et al. describe issues with conflicting privacy laws and frameworks, especially in international research. Data use agreements between institutions can be difficult and time-consuming to negotiate to the point that they can inhibit or delay sharing significantly.
Cultural issues. The culture of science can also inhibit data sharing. Miller writes that, in addition to the issues of credit, it can be “disconcerting” to allow another scientist free rein with one’s data. Scientists whose next grant or promotion is dependent on being the first to publish an exciting new discovery have a legitimate fear of being scooped or proven wrong by someone using their own data. Furthermore, allowing someone else to capitalize on data collected over many years can feel like giving away one’s work product and intellectual property[28, 29]. Such fears leave many scientists reluctant to embrace data sharing and reuse fully, especially given the issues with credit discussed above. To be clear, this is not an argument against data sharing and reuse, but ignoring these issues does not lead to smart design.
Lack of research around the work of sharing and reusing data. Finally, the reality is that we know very little about how scientists preserve, manage, share and reuse data. There simply has been too little research on these questions to be able to design and develop the type of infrastructural support that is needed. A very basic question that must be addressed is: “what do we consider data?” This question may seem trivial and academic, but it is actually quite difficult and important, and it leads us to ask which primary scientific products are required to reproduce a given study. What counts as data is, of course, very discipline-specific, but it needs to be considered in order to develop a system to support data sharing. For example, are the software and analysis code used to analyze the data part of the data? As noted above, we do not have usable guidelines for documenting a study, primarily because we do not yet understand exactly what information a scientist requires in order to reuse data appropriately[5, 11].
If we set as our goal, then, taking full advantage of the opportunities offered by data sharing and reuse while minimizing the challenges, we can begin to design, build, and evaluate a robust infrastructural solution to support the practice of data sharing and reuse. There are numerous solutions proposed in the literature, most of which focus primarily on incentivizing data sharers, with much less attention paid to tools to support the users of those data. While no system will meet the needs of every researcher[14, 18], there are characteristics of a system that can substantially improve the current experience. Ferguson et al. call for a system of data that are discoverable, accessible, intelligible, assessable and useable. Pisani and AbouZahr describe the massive investments in infrastructure made by the Wellcome Trust and the National Institutes of Health (NIH) to support long-term data sharing for the Human Genome Project and the extraordinary results of that sharing and reuse. Ontologies and data standards are an important part of such infrastructure proposals[15, 16, 23]. The investment in strong, supportive, targeted infrastructure is a common theme among such successful projects.
Requiring researchers to deposit their data as a condition of funding and/or publication is a frequently proposed solution[11, 31]. However, current requirements are rarely enforced and those researchers who do not comply are rarely sanctioned[19, 31]. Poldrack and Gorgolewski suggest that the NIH track data sharing in the same way as it currently tracks grant-related publications, and other researchers agree that some sort of tracking is essential[11, 18, 31]. However, requiring and tracking deposits addresses only one small piece of the puzzle. As discussed earlier, even if researchers have shared their data, there is no guarantee that anyone will be able to use them appropriately.
Solutions aimed at changing the culture and practice of science are also proposed. Both Miller and Ferguson et al. suggest that scientists start thinking of data as a primary scientific product, critical to the rigor and reproducibility of their science[18, 23]. To that end, many authors have discussed the advantages of direct citation of data sets in a repository through digital object identifiers (DOIs) or publication of “data papers,” citable manuscripts describing a data set[12, 18, 32, 33]. Such citations also meet the goal of making data more discoverable. Recent changes to the NIH biosketch allow for listing data sets as scientific products, akin to publications. Pisani and AbouZahr make the case for workforce development in the area of data management, arguing that few scientists receive such training even though it is critical to a successful data sharing and reuse system.
While these solutions are all designed to address various aspects of the opportunities and challenges discussed above, it is crucial that we define more precisely what the aim of increasing data sharing should be. Is it to save time and money? To take full advantage of existing data sets and maximize participant contributions? To increase the power of our analyses? Answering this question is not simply an academic exercise, but, in essence, defines the solutions. We need to think carefully about the kinds of data used in our research, the kinds of studies for which data already exist or are being collected and the kinds of questions we want to be able to answer in the future.
Furthermore, we must think carefully about the entire ecosystem of research, considering how any changes to one component of the system might impact the remainder. The barriers to data sharing and reuse described here are intertwined and interact in complex ways. While it may seem straightforward to start requiring all data from NIH-funded studies, for example, to be deposited into a generic repository, such a solution raises many issues. How will that repository be chosen? Who will pay for it and ensure its sustainability? How will the work to prepare the data for submission be funded? Do major research centers and universities have enough data management staff to handle the load of depositing a wide variety of datasets? What kind of documentation and metadata will be required? These are all questions that need to be answered in a systematic way by the scientific community before moving forward.
Data are just one part of the system of collaborative research, a part that cannot be considered without thinking through the ramifications for the rest of the system. Practices around data sharing and reuse are developing and evolving, but they are doing so without the benefit of empirical research to support specific design choices. Scaling up data sharing and reuse will require investment of time and money and will not be simple. Further investment in research to understand the work of data sharing and reuse and a focus on user-centered design are critical to designing a system that is functional and usable.
1. Longo, D.L. and J.M. Drazen, Data Sharing. New England Journal of Medicine, 2016. 374(3): p. 276-277.
2. Investments to Launch the Next Phase of Cancer Research. [cited 2016 February 18]; Available from: http://www.cancer.gov/research/key-initiatives/biden-cancer-initiative.
3. Fortier, I., et al., Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies. International Journal of Epidemiology, 2011. 40(5): p. 1314-1328.
4. Rolland, B., et al., Toward Rigorous Data Harmonization in Cancer Epidemiology Research: One Approach. American Journal of Epidemiology, 2015. 182(12): p. 1033-1038.
5. Rolland, B. and C.P. Lee, Beyond trust and reliability: reusing data in collaborative cancer epidemiology research, in Proceedings of the 2013 conference on Computer supported cooperative work. 2013, ACM: San Antonio, Texas, USA. p. 435-444.
6. Renear, A.H., S. Sacchi, and K.M. Wickett, Definitions of dataset in the scientific and technical literature, in Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47. 2010, American Society for Information Science: Pittsburgh, Pennsylvania. p. 1-4.
7. Ioannidis, J.A. and M.J. Khoury, Assessing value in biomedical research: The PQRST of appraisal and reward. JAMA, 2014. 312(5): p. 483-484.
8. Krumholz, H.M., Why data sharing should be the expected norm. BMJ, 2015. 350: p. h599.
9. van Panhuis, W.G., et al., A systematic review of barriers to data sharing in public health. BMC Public Health, 2014. 14: p. 1144.
10. Tenopir, C., et al., Data Sharing by Scientists: Practices and Perceptions. PLoS ONE, 2011. 6(6): p. e21101.
11. Vickers, A., Whose data set is it anyway? Sharing raw data from randomized trials. Trials, 2006. 7(1): p. 15.
12. Poldrack, R.A. and K.J. Gorgolewski, Making big data open: data sharing in neuroimaging. Nat Neurosci, 2014. 17(11): p. 1510-7.
13. Chokshi, D.A., M. Parker, and D.P. Kwiatkowski, Data sharing and intellectual property in a genomic epidemiology network: policies for large-scale research collaboration. Bull World Health Organ, 2006. 84(5): p. 382-7.
14. Goodman, S.N., Clinical trial data sharing: what do we do now? Ann Intern Med, 2015. 162(4): p. 308-9.
15. Van Horn, J.D. and C.A. Ball, Domain-Specific Data Sharing in Neuroscience: What do we have to learn from each other? Neuroinformatics, 2008. 6(2): p. 117-121.
16. Gaheen, S., et al., caNanoLab: data sharing to expedite the use of nanotechnology in biomedicine. Comput Sci Discov, 2013. 6(1): p. 014010.
17. Piwowar, H.A., R.S. Day, and D.B. Fridsma, Sharing detailed research data is associated with increased citation rate. PLoS ONE, 2007. 2(3).
18. Ferguson, A.R., et al., Big data from small data: data-sharing in the ‘long tail’ of neuroscience. Nat Neurosci, 2014. 17(11): p. 1442-7.
19. Savage, C.J. and A.J. Vickers, Empirical study of data sharing by authors publishing in PLoS journals. PLoS ONE, 2009. 4(9): p. e7078.
20. Bietz, M., E. Baumer, and C. Lee, Synergizing in Cyberinfrastructure Development. Computer Supported Cooperative Work (CSCW), 2010. 19(3): p. 245-281.
21. Birnholtz, J.P. and M.J. Bietz, Data at work: supporting sharing in science and engineering, in Proceedings of the 2003 international ACM SIGGROUP conference on Supporting group work. 2003, ACM: Sanibel Island, Florida, USA. p. 339-348.
22. Robinson, P.N., Genomic data sharing for translational research and diagnostics. Genome Med, 2014. 6(9): p. 78.
23. Miller, G.W., Data sharing in toxicology: beyond show and tell. Toxicol Sci, 2015. 143(1): p. 3-5.
24. Pisani, E. and C. AbouZahr, Sharing health data: good intentions are not enough. Bull World Health Organ, 2010. 88(6): p. 462-6.
25. Sarpatwari, A., et al., Ensuring patient privacy in data sharing for postapproval research. N Engl J Med, 2014. 371(17): p. 1644-9.
26. Institute of Medicine, Digital Infrastructure for the Learning Health System: The Foundation for Continuous Improvement in Health and Health Care: Workshop Series Summary. 2011, The National Academies Press: Washington, DC.
27. Dove, E.S., A.M. Tasse, and B.M. Knoppers, What are some of the ELSI challenges of international collaborations involving biobanks, global sample collection, and genomic data sharing and how should they be addressed? Biopreserv Biobank, 2014. 12(6): p. 363-4.
28. Zimmerman, A., Not by metadata alone: the use of diverse forms of knowledge to locate data for reuse. International Journal on Digital Libraries, 2007. 7(1-2): p. 5-16.
29. Faniel, I.M. and T.E. Jacobsen, Reusing Scientific Data: How Earthquake Engineering Researchers Assess the Reusability of Colleagues’ Data. Computer Supported Cooperative Work (CSCW), 2010. 19(3-4): p. 355-375.
30. Bajorath, J., On data sharing in computational drug discovery and the need for data notes. F1000Res, 2014. 3: p. 280.
31. Chan, A.-W., et al., Increasing value and reducing waste: addressing inaccessible research. The Lancet. 383(9913): p. 257-266.
32. Neumann, J. and J. Brase, DataCite and DOI names for research data. Journal of Computer-Aided Molecular Design, 2014. 28(10): p. 1035-1041.
33. Force, M. and N. Robinson, Encouraging data citation and discovery with the Data Citation Index. Journal of Computer-Aided Molecular Design, 2014. 28(10): p. 1043-1048.
34. SF424 (R&R) Application Guide for NIH and Other PHS Agencies, U.S. Department of Health and Human Services, Public Health Service, Editor. 2014. p. I-89.
About the Author
Betsy Rolland, PhD MLIS MPH, is a Cancer Prevention Fellow at the National Cancer Institute in the Healthcare Delivery Research Program, Health Systems and Interventions Research Branch.