class: left, middle, inverse, title-slide # A Discussion on Issues around Making Data Available Alongside Research ###
Nicholas Tierney, Monash University
&
Karthik Ram, Berkeley Institute of Data Science, UC Berkeley
###
Sage Bionetworks
Wednesday 12th February, 2020
njt-sage.netlify.com
nj_tierney
--- layout: true <div class="my-footer"><span>njt-sage.netlify.com β’ @nj_tierney</span></div> --- .large[ > "Data! data! data!" he cried impatiently. "I can't make bricks without clay." > --Sherlock Holmes (The Adventure of the Copper Beeches by Sir Arthur Conan Doyle) ] --- class: inverse, middle, center .large[ This talk represents work that is currently in progress with Karthik Ram: **"A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility"** _Feedback, discussion, questions encouraged throughout talk!_ ] --- class: inverse, middle, center .vlarge[We need data] -- .large[We need it to do our job] -- .large[(though not strictly true for theory, but you get what I mean)] --- # Benefits of Sharing data .large[ π Makes work transparent β Increases trust π Increases visibility
Independent validation β»οΈ Reproducibility ] --- class: middle, invert, center # Research isn't often shared It's often really available to the authors --- # Rowhani-Farid & Barnett, 2016 .large[ Out of 160 randomly sampled BMJ papers: - 3 included data in the paper - 7/157 research articles shared their data sets - For 21 clinical trials bound by the BMJ data sharing policy, 24% shared data ] --- # Stodden, Seiler and Ma, 2018 .large[ _Science_ (the journal) made clause for Authors to provide data with papers Authors evaluated reproducibility of 204 papers after clause issued. ] --- class: inverse, middle, center # Your Turn: .large[ What are the resistance points to data sharing at Sage Bionetworks? Is data shared publicly? What is the data sharing process? ] --- # Responses to Stodden, Seiler and Ma, 2018 <img src="img/stodden-table-1.png" width="80%" style="display: block; margin: auto;" /> --- # Responses to Stodden, Seiler and Ma, 2018 .large[ πΏ > When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π’ > I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π > We do not typically share our internal data or code with people outside our collaboration. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ πΏ > The code we wrote is the accumulated product of years of effort by [redacted] and myself. Also, the data we processed was collected painstakingly over a long period by collaborators, and so we will need to ask permission from them too. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π’ > Normally we do not provide this kind of information to people we do not know. It might be that you want to check the data analysis, and that might be of some use to us, but only if you publish your findings while properly referring to us. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π > Thank you for your interest in our paper. For the [redacted] calculations I used my own code, and there is no public version of this code, which could be downloaded. Since this code is not very user-friendly and is under constant development I prefer not to share this code ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π > Our program [redacted] is available here [URL redacted] (documentation and tutorials were included) ] --- # Responses to Stodden, Seiler and Ma, 2018 .medium[ π > If you go to [URL redacted], under the publications, I have a link to the gitHub repository. I donβt know if I have all of the raw simulated data, but I certainly have the processed data used to make the plots. What do you need? All of the simulated data could of course be regenerated from the code. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π > Please find attached a .zip file called [redacted].zip that has the custom MATLAB [redacted] analysis code. If you run Masterrunfigureone.m this will generate several panels from the paper. ] --- # Responses to Stodden, Seiler and Ma, 2018 .large[ π > In the next email I will enclose the custom image analysis software. This can also be accessed from [URL redacted] where there is a manual and tutorial. ] --- # Sharing data? .large[ Plenty of research says it is important Fata sharing should be FAIR (findable, Accessible, Interoperable, and Reusable) These don't precicely tell you how to share data ] --- # Why not share data? .large[ There are indeed good reasons to not share data: 1. Privacy concerns (e.g., human subjects, locations of critically endangered species) 1. May put the authors at a competitive disadvantage ( but data can be embargoed for reasonable periods of time) ] --- # Why not share data? .large[ "If you can't do something right, don't do it" - This ^^ is wrong - you **can** provide something, even if it is just simulated data. - Sharing data (in most cases) has a net positive benefit ] --- # Another way to think about sharing data .pull-left[ Mountain <img src="img/rock.jpg" width="461" style="display: block; margin: auto;" /> ] -- .pull-right[ Ramp <img src="img/ramp.jpg" width="269" style="display: block; margin: auto;" /> ] ??? It should instead be an "on-ramp" It can feel like a wall or a mountain we need to climb. These require special tools and knowledge. --- # An on ramp to sharing data .medium[ 1. Analysis ready data: Final data used in analysis 1. README: A Human readable description of the data 1. Data dictionary: Human readable dictionary of data contents 1. Raw data: The original/first data provided 1. Scripts: To clean raw data ready for analysis 1. License: How to use and share the data 1. Citation: How you want your data to be cited 1. Machine readable meta data: Make your data searchable ] --- # Analysis ready data: Final data used in analysis .left-code[ ``` project βββ data βββ crime.csv ``` ] -- .right-plot.medium[ - Dataset(s) in the form used in analysis. - Ideally in "tidy data" form. - Plain-text format, `.csv`, `.tsv`, `.txt` - Binary formats discouraged - e.g., `.rda`, `.rds`, `.sav`, `.dta` ] --- # README: A Human readable description of the data .left-code[ ``` project βββ data | βββ crime.csv βββ README.md ``` ] .right-plot[ - Top level of the data repository, (optionally for each dataset) - The who, what, when, where, why, of your data. - Guides user how to understand this directory. - Handy when no reliable standards - `.md` allows you to take advantage of `markdown` ] --- # Data dictionary: Human readable dictionary of data contents .left-code[ ``` project βββ data β βββ crime.csv β βββ crime-dictionary.csv βββ README.md ``` ] .right-plot[ - Human readable description, context, and structure of the data - Helps familiarise user with data - It should contain: - **variable names** - **variable labels** - **variable codes**, and - special values for **missing data** ] --- # Data Dictionary: Human readable dictionary of data contents <img src="figures/print-table-1.png" width="936" style="display: block; margin: auto;" /> --- # Raw data: The original/first data provided .left-code[ ``` project βββ data β βββ crime.csv β βββ crime-dictionary.csv βββ data-raw β βββ crime-raw.dat βββ README.md ``` ] .right-plot[ - usually **first format** of data provided before any tidying or cleaning. - **If the raw data is a practical size to share**, it should be shared in a folder called `data-raw`. - Should be in the form that was first received, even if it is in binary or some proprietary format. - Option to include data dictionaries of the raw data can be provided in `data-raw`. ] --- # Scripts: To clean raw data .left-code[ ``` project βββ data β βββ crime.csv β βββ crime-dictionary.csv βββ data-raw β βββ crime-raw.dat β βββ clean-crime.R β βββ other-steps.md βββ README.md ``` ] .right-plot[ - Code used to clean and tidy the raw data. - `clean-crime.R` - Ideally involves only scripted languages - If other practical steps were taken to clean up the data, these should be recorded in a plain text or markdown file. - `other-steps.md` ] --- # License: How to use and share the data .left-code[ ``` project βββ data β βββ crime.csv β βββ crime-dictionary.csv βββ data-raw β βββ crime-raw.dat β βββ clean-crime.R β βββ other-steps.md βββ README.md βββ LICENSE ``` ] .right-plot[ - Data + license clearly establishes how everyone to modify, use, and share data. - Two licenses well suited for data sharing: 1. CCBY: enforce attribution and credit required, no warranty. 2. CC0: public domain. No ownership or warranty - Provide LICENSE file with entire license in the top level of directory. - `use_cc0_license()` - `use_ccby_license()` ] --- # Citation: How to cite your data .left-code[ ``` project βββ data β βββ crime.csv β βββ crime-dictionary.csv βββ data-raw β βββ crime-raw.dat β βββ clean-crime.R β βββ other-steps.md βββ README.md (reference DOI) βββ CITATION βββ LICENSE ``` ] .right-plot[ - A Digital Object Identifier (DOI) uniquely + permanently identifies a digital object ( paper, poster, or software) - DOIs are minted by repositories like Dryad or Zenodo for free. - Put the DOI in a reference format like BibTex (zenodo does this for you) ] --- # Citation: example ``` @software{housing-data, author = {Tony Pino, Nicholas Tierney}, title = {njtierney/melb-housing-data: Added LICENSE.md file}, month = feb, year = 2019, publisher = {Zenodo}, version = {1.0.1}, doi = {10.5281/zenodo.2575545}, url = {https://doi.org/10.5281/zenodo.2575545} } ``` --- # Machine readable meta data: Make your data searchable .left-code[ ``` project βββ data β βββ crime.csv β βββ crime-dictionary.csv β βββ metadata β βββ access.csv β βββ attributes.csv β βββ biblio.csv β βββ creators.csv β βββ dataspice.json βββ data-raw β βββ crime-raw.dat β βββ clean-crime.R β βββ other-steps.md βββ README.md (reference DOI here) βββ CITATION βββ LICENSE ``` ] .right-plot[ - Helps ensure data types are preserved. - Allows data to be indexed and searched online with google datasets search using JSON-LD. - To create appropriate metadata, we recommend metadata generators such as `dataspice` or `codebook` - Metadata should be provided in a folder called "metadata", which should be provided for every dataset. ] --- .large[ [Example Machine readable meta data: housing data](https://github.com/njtierney/melb-housing-data) ] ``` library(dataspice) create_spice(here::here("data")) prep_attributes() prep_access() edit_access() edit_attributes() edit_biblio() edit_creators() write_spice() ``` --- # Actually sharing the data .large[ Now that you've created your data folder, you need to **get it somewhere online** Two options to discuss: 1. Putting the data online 1. Sharing as an R package ] --- # Your Turn: .large[ How does Sage Bionetworks distribute data? ] --- class: inverse, center, middle # Online repositories: Zenodo & Dryad --- # Zenodo .large[ - Launched in 2013 in a joint collaboration between openAIRE and CERN - Free, archival location to deposit datasets - File size limit is 50gb for individual files - Able to accommodate larger file sizes upon request ] --- # Dryad .large[ - The Dryad Digital Repository takes data from any field of research, and perform human quality control and assistance of the data - Can link data with a journal publication, in exchange for a data publishing fee. ] --- # Linking zenodo and GitHub .large[ You can also link Zenodo with Github Zenodo updates with new DOI at every "release" (Helps avoid managing many moving pieces) See this article on github, [making your code citable](https://guides.github.com/activities/citable-code/) (Thanks to [Arfon Smith](https://www.arfon.org/)) ] --- # Sharing data as an R package .large[ e.g., [`nycflights13`](https://github.com/hadley/nycflights13), [`eechidna`](https://docs.ropensci.org/eechidna/), [more](https://ropenscilabs.github.io/OZdatasets/) ] .pull-left.large[ Pros - installable - documentation - share data cleaning - great for R users ] .pull-right.large[ Cons - Size: ! >= 5Mb (CRAN) - Doesn't help others outside R ] --- # Sharing data as an R package .medium[ - Most (but not all!) data shared as an R package, or with an R package is for teaching purposes ] -- .medium[ - Sharing data like this allows you to create a "research compendium", with centralised code + paper + computing environment + data. ] -- .medium[ - See more about this in Karthik Ram's talk: ["How To Make Your Data Analysis Notebooks More Reproducible"](https://github.com/karthik/rstudio2019) ] -- .medium[ - Note the directory structure based is on R packages ] --- # Online "data" journals .large[ - Provides familiar mechanism for citation - But journals don't yet have a good way to outline how to share data - Often looks like a "mini paper" with the methods, and isn't always about the data, but about collection methods. - Can link to a Zenodo or Dryad repository. ] --- # Online "data" journals .large[ Some example data journals: - Nature: Scientific Data - Data in Brief - Data ] --- # Take homes .large[ - You don't have to do every single thing to publish your data - Take small steps - get the data somewhere first, add more detail as you go ] --- # Future Directions .large[ - Currently working on a proposal for "datadevtools" - a set of developer tools to facilitate sharing data - These tools can then be used to assess "shareability" of data ] --- # Discussion Questions .large[ * Do you curate data? * What are the common painpoints of curating data / collaborating on data? * How do you manage data releases? * What are the resistance points to data sharing? * Is data shared publicly, is there a process? * How do you distribute data? ] --- # Thanks .large[ - Karthik Ram - Miles McBain - Anna Kystalli - Daniella Lowenberg - ACEMS International Mobility Programme - Helmsley Charitable Trust - Gordon and Betty Moore Foundation - Sloan Foundation ] --- # References .large[ - [making your code citable](https://guides.github.com/activities/citable-code/) - [Stodden et al](https://www.pnas.org/content/115/11/2584) - [Rowhani-Farid & Barnett](https://bmjopen.bmj.com/content/6/10/e011784.abstract) ] --- # Colophon .large[ - Slides made using [xaringan](https://github.com/yihui/xaringan) - Extended with [xaringanthemer](https://github.com/gadenbuie/xaringanthemer) - Colours taken + modified from [lorikeet theme from ochRe](https://github.com/ropenscilabs/ochRe) - Header font is **Josefin Sans** - Body text font is **Montserrat** - Code font is **Fira Mono** - template available: [njtierney/njt-talks](github.com/njtierney/njt-talks) ] --- # Learning more .large[
[paper (on github)](https://github.com/karthik/ddd)
[talk](https://njt-numbat-data.netlify.com/#1)
nj_tierney
njtierney
nicholas.tierney@gmail.com ] --- .vhuge[ **End.** ]