Facilitating the sharing of data analysis results through in-depth provenance capture (2021)

This title appears in the Scientific Report 2021.
Personal Name(s): Köhler, Cristiano (corresponding author); Ulianych, Danylo; Gerkin, Richard C.; Davison, Andrew P.; Grün, Sonja; Denker, Michael
Contributing Institute: Computational and Systems Neuroscience (IAS-6); Jara-Institut Brain structure-function relationships (INM-10); Computational and Systems Neuroscience (INM-6)
Imprint: 2021
Conference: 5th HBP Student Conference on Interdisciplinary Brain Research, online, 2021-02-01 to 2021-02-04
Document Type: Poster
Research Program: Helmholtz School for Data Science in Life, Earth and Energy (HDS LEE); Theory, modelling and simulation; Connectivity and Activity; Neuroscientific Foundations; Digitization of Neuroscience and User-Community Building; Helmholtz Analytics Framework; Human Brain Project Specific Grant Agreement 3; Human Brain Project Specific Grant Agreement 2
Portal: Publikationsportal JuSER
INTRODUCTION/MOTIVATION

Workflows for the analysis of electrophysiology activity data are typically composed of multiple steps. In the simplest case, these comprise several scripts executed in sequence, with several dependencies on data and parameter sets. However, workflows can become increasingly complex during the course of an analysis project: researchers may investigate alternative analysis paths or adjust workflow components in response to new hypotheses or additional experimental data. Given this complexity and iterative nature, robust tools forming the basis of the workflow are necessary [1] to fully document its execution and improve the reproducibility of the results. Provenance is the capture and characterization of data manipulations and parameters throughout the workflow [2]. This requires complete and self-explanatory descriptions of the generated data, together with a method that minimizes the need to manually track the workflow execution while maximizing the information content of the provenance trail. While frameworks exist to structure the input data and associated metadata, a similar representation for the outputs of the analysis part of the workflow is missing. Moreover, workflow management systems capture only limited provenance information, as they do not record details about the functions used inside each analysis script. Finally, the workflow output lacks information pertaining to its generation. Therefore, to satisfy the requirements of a practically useful provenance trail, existing tools must be extended to implement a data model that captures analysis outputs and their detailed provenance and, ultimately, represents the analysis and its results in accordance with the FAIR principles [3].

METHODS

We focus on two open-source tools for the analysis of electrophysiology data developed in EBRAINS. The Neo (RRID:SCR_000634) framework provides an object model to standardize neural activity data acquired from different sources [4].
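As a loose illustration of such a data model, a result object can carry its own provenance metadata (timestamp, unique identifier, parameters) alongside the data. This is a minimal sketch, not the actual Elephant implementation; all class and attribute names are hypothetical.

```python
# Hypothetical sketch: analysis outputs as objects that carry their own
# provenance metadata. Names are illustrative, not the actual Elephant API.
import uuid
from datetime import datetime, timezone


class AnalysisResult:
    """Base class for standardized analysis outputs with provenance metadata."""

    def __init__(self, data, parameters=None):
        self.data = data
        self.parameters = dict(parameters or {})
        self.id = uuid.uuid4()                      # unique identifier
        self.created = datetime.now(timezone.utc)   # creation timestamp


class TimeHistogram(AnalysisResult):
    """A specific result type: a time histogram is a kind of histogram."""

    def __init__(self, counts, bin_size, parameters=None):
        super().__init__(counts, parameters)
        self.bin_size = bin_size


hist = TimeHistogram([3, 5, 2], bin_size=0.005, parameters={"units": "s"})
print(hist.id, hist.created, hist.parameters)
```

Because the identifier and timestamp are attached at construction time, the result object stays self-describing when it is stored or shared.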
Elephant (RRID:SCR_003833) is a Python toolbox for the analysis of electrophysiology data [5]. We implemented two synergistic prototype solutions that extend the functionality of these tools with respect to (i) the systematic standardization of analysis results and (ii) the automatic capture of provenance information during the execution of a Python analysis script. Both solutions are under development and are being incorporated as new functionality into the Elephant package.

The first solution represents the output of Elephant functions in a data model inspired by Neo. Objects for specific analysis results (e.g., a time histogram) inherit from a base Python class that supports the storage of provenance information such as timestamps and unique identifiers. The second solution is a provenance tracker implemented as a function decorator. It identifies the objects that are input to and output from the function, creating unique hashes, and also captures timestamps, statement code lines, and additional function parameters. Extended dependencies between objects (such as indexing and attribute access) are mapped by analyzing the abstract syntax tree (AST) obtained from the code.

RESULTS AND DISCUSSION

The solutions presented here capture provenance during the analysis of electrophysiology data with minimal user intervention. The data objects support a hierarchical standardization of the output of Elephant functions (e.g., a time histogram is a specific type of histogram) while encapsulating all the information about the generation of an analysis output. Therefore, these objects can easily be reused or shared, eliminating the need to manually annotate the output of the analysis with the corresponding parameters. The new objects also seamlessly extend the functionality of the Neo classes currently used as output of Elephant functions, and can be integrated into the existing code bases with minimal disruption.
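The decorator-based tracker described above can be sketched as follows. This is a simplified illustration assuming hash-based object identity and an in-memory record list; the actual Elephant implementation additionally analyzes the AST and captures code lines, which are omitted here. All names are hypothetical.

```python
# Sketch of a provenance-tracking function decorator: it hashes the inputs
# and output of each call and records them with a timestamp and parameters.
import functools
import hashlib
import pickle
from datetime import datetime, timezone

provenance_records = []  # captured records, one per tracked function call


def _object_hash(obj):
    """Identify an object by the SHA-1 of its pickled byte representation."""
    return hashlib.sha1(pickle.dumps(obj)).hexdigest()


def provenance(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        inputs = [_object_hash(a) for a in args]
        result = func(*args, **kwargs)
        provenance_records.append({
            "function": func.__name__,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "inputs": inputs,
            "params": dict(kwargs),
            "output": _object_hash(result),
        })
        return result
    return wrapper


@provenance
def time_histogram(spike_times, bin_size=0.005):
    """Toy stand-in for an Elephant analysis function."""
    n_bins = int(max(spike_times) / bin_size) + 1
    counts = [0] * n_bins
    for t in spike_times:
        counts[int(t / bin_size)] += 1
    return counts


counts = time_histogram([0.001, 0.004, 0.011], bin_size=0.005)
print(provenance_records[-1]["function"], counts)  # time_histogram [2, 0, 1]
```

Because the decorator wraps the call transparently, existing analysis scripts need no changes beyond applying it to the functions of interest.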
Additionally, we describe how to capture provenance information throughout a Python analysis script using decorators. These track the Elephant and user-defined functions used in the script while mapping the inputs to the outputs. We demonstrate how the captured information can be used to build a graph showing the steps followed in the script, which can be stored as metadata. The analysis results obtained with and without the two solutions are compared, highlighting the potential benefits for reproducibility and data re-use.

The provenance tracker and the standard data objects capture and manage distinct aspects of the provenance information; in the end, both solutions are complementary. On the one hand, the decorator focuses on building the provenance trail and the relationships between the different steps of the analysis within the script. On the other hand, the standard objects focus on the representation of the data, standardizing information that is shared among the outputs of different functions and storing the relevant provenance information as metadata. Ultimately, these two developments aim to increase data interoperability and reusability in accordance with the FAIR principles.

REFERENCES

[1] Denker, M. and Grün, S. (2016). Designing Workflows for the Reproducible Analysis of Electrophysiological Data. In Brain-Inspired Computing, Amunts, K. et al., eds. (Cham: Springer International Publishing), pp. 58-72.
[2] Ragan, E.D. et al. (2016). Characterizing Provenance in Visualization and Data Analysis: An Organizational Framework of Provenance Types and Purposes. IEEE Transactions on Visualization and Computer Graphics 22(1):31-40.
[3] Wilkinson, M.D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018.
[4] Garcia, S. et al. (2014). Neo: an object model for handling electrophysiology data in multiple formats. Frontiers in Neuroinformatics 8:10.
[5] Elephant: http://python-elephant.org
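The graph construction mentioned in the abstract can be sketched with the standard library alone, assuming each captured record is a dict with "function", "inputs", and "output" hash fields. The record format and hash values here are illustrative, not the actual Elephant metadata schema.

```python
# Sketch: assemble captured provenance records into a directed graph of the
# analysis steps (data node -> function node -> output node). Hash values
# and the record format are illustrative placeholders.
records = [
    {"function": "time_histogram", "inputs": ["spikes_a1"], "output": "hist_f2"},
    {"function": "plot_histogram", "inputs": ["hist_f2"], "output": "figure_c3"},
]

edges = []
for rec in records:
    for source in rec["inputs"]:
        edges.append((source, rec["function"]))     # data flows into the function
    edges.append((rec["function"], rec["output"]))  # function produces the output

for src, dst in edges:
    print(f"{src} -> {dst}")
```

Stored as metadata alongside the results, such an edge list documents the full chain of steps by which each output was derived.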