"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
--Sherlock Holmes (The Adventure of the Copper Beeches by Sir Arthur Conan Doyle)
This talk represents work that is currently in progress with Karthik Ram:
"A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility"
Feedback, discussion, questions encouraged throughout talk!
We need data
We need it to do our job
(not strictly true for theory, but you get what I mean)
🔎 Makes work transparent
✅ Increases trust
🔈 Increases visibility
Independent validation
♻️ Reproducibility
It's often really available to the authors
Out of 160 randomly sampled BMJ papers:
Science (the journal) introduced a clause requiring authors to provide data with papers
The study's authors evaluated the reproducibility of 204 papers after the clause was issued.
What are the resistance points to data sharing at Sage Bionetworks?
Is data shared publicly?
What is the data sharing process?
😿
When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it.
😢
I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.
🙍
We do not typically share our internal data or code with people outside our collaboration.
😿
The code we wrote is the accumulated product of years of effort by [redacted] and myself. Also, the data we processed was collected painstakingly over a long period by collaborators, and so we will need to ask permission from them too.
😢
Normally we do not provide this kind of information to people we do not know. It might be that you want to check the data analysis, and that might be of some use to us, but only if you publish your findings while properly referring to us.
😭
Thank you for your interest in our paper. For the [redacted] calculations I used my own code, and there is no public version of this code, which could be downloaded. Since this code is not very user-friendly and is under constant development I prefer not to share this code
🎉
Our program [redacted] is available here [URL redacted] (documentation and tutorials were included)
🎉
If you go to [URL redacted], under the publications, I have a link to the gitHub repository. I don’t know if I have all of the raw simulated data, but I certainly have the processed data used to make the plots. What do you need? All of the simulated data could of course be regenerated from the code.
🎉
Please find attached a .zip file called [redacted].zip that has the custom MATLAB [redacted] analysis code. If you run Masterrunfigureone.m this will generate several panels from the paper.
🎉
In the next email I will enclose the custom image analysis software. This can also be accessed from [URL redacted] where there is a manual and tutorial.
Plenty of research says it is important
Data sharing should be FAIR (Findable, Accessible, Interoperable, and Reusable)
These don't precisely tell you how to share data
There are indeed good reasons to not share data:
Privacy concerns (e.g., human subjects, locations of critically endangered species)
May put the authors at a competitive disadvantage (but data can be embargoed for reasonable periods of time)
"If you can't do something right, don't do it"
This ^^ is wrong - you can provide something, even if it is just simulated data.
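A minimal sketch of what "just simulated data" could look like in R. The column names (suburb, year, offences) and the file name crime-simulated.csv are made up for illustration; the idea is only that a simulated file with the same shape as the real data lets others run the code.

```r
# Hypothetical example: the real data cannot be shared, so we publish a
# simulated stand-in with the same column names and types.
set.seed(2019)  # fixed seed so the simulated file is reproducible

n <- 100
crime_sim <- data.frame(
  # made-up column names, chosen only for illustration
  suburb   = sample(c("A", "B", "C"), size = n, replace = TRUE),
  year     = sample(2010:2018, size = n, replace = TRUE),
  offences = rpois(n, lambda = 20),
  stringsAsFactors = FALSE
)

# Written to a temporary folder here; in a real project this would be data/
out_file <- file.path(tempdir(), "crime-simulated.csv")
write.csv(crime_sim, out_file, row.names = FALSE)
```

It is worth saying in the README that the file is simulated, so nobody mistakes it for the real thing.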
Sharing data (in most cases) has a net positive benefit
Mountain
Ramp
It should instead be an "on-ramp"
It can feel like a wall or a mountain we need to climb.
These require special tools and knowledge.
project
└── data
    └── crime.csv
.csv, .tsv, .txt
.rda, .rds, .sav, .dta
project
├── data
│   └── crime.csv
└── README.md
.md allows you to take advantage of markdown
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
└── README.md
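One way a data dictionary like crime-dictionary.csv could be put together in R is sketched here. The contents of crime.csv are not shown in these slides, so the data frame and the descriptions are hypothetical stand-ins.

```r
# Hypothetical stand-in for data/crime.csv
crime <- data.frame(
  suburb   = c("A", "B"),
  year     = c(2017L, 2018L),
  offences = c(12L, 30L),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage class, and a human-written description
crime_dictionary <- data.frame(
  variable    = names(crime),
  class       = unname(vapply(crime, function(x) class(x)[1], character(1))),
  description = c(
    "Suburb where the offence was recorded",
    "Calendar year of the record",
    "Number of recorded offences"
  ),
  stringsAsFactors = FALSE
)

# Written to a temporary folder here; in a real project this would be data/
dict_file <- file.path(tempdir(), "crime-dictionary.csv")
write.csv(crime_dictionary, dict_file, row.names = FALSE)
```

The variable names and classes come from the data itself; only the descriptions need to be typed by hand.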
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   └── crime-raw.dat
└── README.md
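The folder layout can be set up by hand, but as a rough sketch it could also be scripted in base R. The layout is built under a temporary directory here, and the files are empty placeholders.

```r
# Build the layout under a temporary directory so nothing is overwritten;
# in practice `root` would be your project folder.
root <- file.path(tempdir(), "project")

dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data-raw"), recursive = TRUE, showWarnings = FALSE)

# Empty placeholders for the files named in the tree
placeholders <- file.path(
  root,
  c("README.md",
    "data/crime.csv",
    "data/crime-dictionary.csv",
    "data-raw/crime-raw.dat")
)
file.create(placeholders)

# List what was created, relative to the project root
list.files(root, recursive = TRUE)
```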
data-raw
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
└── README.md
clean-crime.R
other-steps.md
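A sketch of the kind of thing a script like clean-crime.R might contain. The format of crime-raw.dat is not given in these slides, so the pipe-delimited layout, the column names, and the cleaning steps are all assumptions for illustration.

```r
# Stand-in for data-raw/crime-raw.dat; the real file's format is not given,
# so a pipe-delimited layout is assumed purely for illustration.
raw_file <- file.path(tempdir(), "crime-raw.dat")
writeLines(
  c("Suburb|Year|Offences",
    " A |2017|12",
    "B|2018|NA",
    "B|2018|30"),
  raw_file
)

raw <- read.table(raw_file, header = TRUE, sep = "|",
                  stringsAsFactors = FALSE)

# Cleaning steps: lower-case names, trim whitespace, drop incomplete rows
names(raw) <- tolower(names(raw))
raw$suburb <- trimws(raw$suburb)
crime <- raw[complete.cases(raw), ]

# Written to a temporary folder here; in a real project this would be data/
clean_file <- file.path(tempdir(), "crime.csv")
write.csv(crime, clean_file, row.names = FALSE)
```

Anything that cannot be scripted (manual edits, emails to collaborators) is what a notes file like other-steps.md is for.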
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md
└── LICENSE
use_cc0_license()
use_ccby_license()
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md (reference DOI)
├── CITATION
└── LICENSE
@software{housing-data,
  author    = {Tony Pino, Nicholas Tierney},
  title     = {njtierney/melb-housing-data: Added LICENSE.md file},
  month     = feb,
  year      = 2019,
  publisher = {Zenodo},
  version   = {1.0.1},
  doi       = {10.5281/zenodo.2575545},
  url       = {https://doi.org/10.5281/zenodo.2575545}
}
project
├── data
│   ├── crime.csv
│   ├── crime-dictionary.csv
│   └── metadata
│       ├── access.csv
│       ├── attributes.csv
│       ├── biblio.csv
│       ├── creators.csv
│       └── dataspice.json
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md (reference DOI here)
├── CITATION
└── LICENSE
dataspice or codebook
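library(dataspice)
# create the metadata template files
create_spice(here::here("data"))
# pre-fill the templates from the data files
prep_attributes()
prep_access()
# fill in the remaining fields interactively
edit_access()
edit_attributes()
edit_biblio()
edit_creators()
# write dataspice.json from the metadata tables
write_spice()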
Now that you've created your data folder, you need to get it somewhere online
Two options to discuss:
How does Sage Bionetworks distribute data?
You can also link Zenodo with GitHub
Zenodo updates with new DOI at every "release" (Helps avoid managing many moving pieces)
See the GitHub article "Making Your Code Citable" (Thanks to Arfon Smith)
e.g., nycflights13, eechidna, more
Pros
Cons
Some example data journals:
End.
"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
--Sherlock Holmes (The Adventure of the Copper Beeches by Sir Arthur Conan Doyle)