Description: FAIRly big: A framework for computationally reproducible processing of large-scale data

This title appears in the Scientific Report: 2022

Personal Name(s):	Wagner, Adina S. (Corresponding author)
	Waite, Laura K. / Wierzba, Małgorzata / Hoffstaedter, Felix / Waite, Alexander Q. / Poldrack, Benjamin / Eickhoff, Simon B. / Hanke, Michael
Contributing Institute:	Gehirn & Verhalten; INM-7
Published in:	Scientific Data, 9 (2022) 1, p. 80
Imprint:	London: Nature Publ. Group, 2022
PubMed ID:	35277501
DOI:	10.1038/s41597-022-01163-2
Document Type:	Journal Article
Research Program:	Neuroscientific Data Analytics and AI
Link:	Full text (Open Access)
	JuSER publication portal

Please use one of the following identifiers in citations: http://dx.doi.org/10.1038/s41597-022-01163-2 or http://hdl.handle.net/2128/31105

Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, as well as be re-executed independent of the original computing infrastructure. We demonstrate the framework's performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).