Published on Sun May 23 2021

A proteomics sample metadata representation for multiomics integration and big data analysis.

Dai, C., Fullgrabe, A., Pfeuffer, J., Solovyeva, E., Deng, J., Moreno, P., Kamatchinathan, S., Jaiswal Kundu, D., George, N., Fexova, S., Gruning, B., Foll, M. C., Griss, J., Vaudel, M., Audain, E., Locard-Paulet, M., Turewicz, M., Eisenacher, M., Uszkoreit, J., Van Den Bossche, T., Schwammle, V., Webel, H., Schulze, S., Bouyssie, D., Jayaram, S., Kumar Duggineni, V., Samaras, P., Wilhelm, M., Choi, M., Wang, M., Kohlbacher, O., Brazma, A., Papatheodorou, I., Bandeira, N., Deutsch, E. W., Vizcaino, J. A., Bai, M., Levitsky, L., Sachsenberg, T., Perez-Riverol, Y.

The amount of public proteomics data is increasing at an extraordinary rate. Hundreds of datasets are submitted each month to ProteomeXchange repositories, representing many types of proteomics studies that focus on different aspects such as quantitative experiments, post-translational modifications, protein-protein interactions, or subcellular localization, among many others. For every proteomics dataset, two levels of data are captured: the dataset description and the data files (encoded in different file formats). Whereas the dataset description and data file formats are supported by all ProteomeXchange partner repositories, there is no standardized format to properly describe the sample metadata and their relationship with the dataset files in a way that fully allows their understanding or reanalysis. It is left to the user's choice whether or not to provide an ad hoc document containing this information. Therefore, in many cases, understanding the study design and data requires going back to the associated publication. This can be tedious and may be restricted in the case of non-open access publications. In many cases, this problem limits the generalization and reuse of public proteomics data.

Here we present a standard representation for sample metadata tailored to proteomics datasets, produced by the HUPO Proteomics Standards Initiative and supported by ProteomeXchange resources. We repurposed the existing MAGE-TAB data format, used routinely in the transcriptomics field, to represent and annotate proteomics datasets. MAGE-TAB-Proteomics defines a set of annotation rules that the datasets submitted to ProteomeXchange should follow, ranging from sample properties to data analysis protocols. We also introduce a crowdsourcing project that enabled the manual curation of over 200 public datasets using MAGE-TAB-Proteomics. In addition, we describe an ecosystem of tools and libraries that were developed to validate and submit sample metadata-related information to ProteomeXchange. We expect that these tools will improve the reproducibility of published results and facilitate the reanalysis and integration of public proteomics datasets.
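
To make the idea of linking sample metadata to data files concrete, the following is a minimal sketch of a tab-delimited sample table and a simple validation pass over it. It is not the validation tooling described above. The column names, the example rows, and the function `find_problems` are assumptions chosen for illustration in the general style of MAGE-TAB sample tables; the actual annotation rules are defined by the specification, not by this sketch.

```python
import csv
import io

# Hypothetical set of required columns, chosen for illustration only.
REQUIRED_COLUMNS = (
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[data file]",
)

# Illustrative table: one row per sample-to-file relationship.
# The second row deliberately lacks a data file.
EXAMPLE_TABLE = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\tHomo sapiens\trun 1\tfile_01.raw\n"
    "sample 2\tHomo sapiens\trun 2\t\n"
)


def find_problems(text):
    """Return a list of human-readable problems found in a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []

    # First check: every required column must be present in the header.
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems

    # Second check: every required cell must be non-empty.
    for row_number, row in enumerate(reader, start=1):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"row {row_number}: empty value in '{column}'")
    return problems


for problem in find_problems(EXAMPLE_TABLE):
    print(problem)
```

In this sketch, each row ties one sample to one data file, so a reader of the table can reconstruct the study design without consulting the publication. A real validator would also check values against controlled vocabularies, which is omitted here.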