___________________________________________________________________________

TITLE: Ocean Autotrophic Prokaryotes Abundance – 0.2 - 200 µm size fraction
___________________________________________________________________________

1.- INTRODUCTION

The geodatabases of Autotrophic Prokaryotes information for 428 samples from the Malaspina and Tara Oceans oceanographic expeditions were created in 2022 within the framework of the EU-funded AtlantECO project. The database contains points representing the Autotrophic Prokaryotes Abundance in the 0.2 - 200 µm size fraction across the global ocean. The abundance was estimated from 16S rRNA gene amplicon sequencing reads.

2.- METHODOLOGY USED

The metabarcoding dataset integrates quality-controlled sequencing data from the Tara Oceans and Malaspina expeditions. The primers used for the V4-V5 region of the 16S rRNA gene were 515FB: GTGYCAGCMGCCGCGGTAA and 926R: CCGYCAATTYMTTTRAGTTT (Quince et al., 2011; Parada et al., 2016). Amplicon sequences were processed using the DADA2 pipeline (Callahan et al., 2016; Lee, 2019) to characterize Amplicon Sequence Variants (ASVs), which were used as a proxy for microbial species (Callahan, McMurdie & Holmes, 2017). Following Callahan et al. (2016), each sequencing project was analyzed separately because different runs can have different error profiles. The quality of the samples was explored, and the trimming and filtering parameters were chosen according to Callahan et al. (2016).

After merging the runs, the taxonomic classification was performed using the IDTAXA algorithm implemented in the DECIPHER package for the R programming language (Murali et al., 2018), with the SILVA database (SILVA SSU r138 2019) as a reference (Karst et al., 2018). Only samples with more than 10,000 reads were analyzed. We kept ASVs with 50 or more reads distributed in at least three samples, and ASVs with fewer than 50 reads distributed in more than three samples.
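The sample and ASV thresholds above can be sketched in a few lines of Python. This is only one possible reading of the rule, not the code used to build the dataset: the function name `filter_counts`, the nested-dictionary layout of the count table, and the choice to compute ASV totals after removing low-depth samples are all assumptions made for illustration.

```python
def filter_counts(counts, min_sample_reads=10_000, read_cutoff=50):
    """Apply the sample-depth and ASV filters to a count table.

    `counts` maps sample name -> {ASV name -> read count}.
    This layout is hypothetical; the real table format may differ.
    """
    # Keep only samples with more than `min_sample_reads` reads in total.
    samples = {
        name: asvs
        for name, asvs in counts.items()
        if sum(asvs.values()) > min_sample_reads
    }

    # Total reads per ASV and number of samples where the ASV occurs.
    totals, occupancy = {}, {}
    for asvs in samples.values():
        for asv, reads in asvs.items():
            if reads > 0:
                totals[asv] = totals.get(asv, 0) + reads
                occupancy[asv] = occupancy.get(asv, 0) + 1

    # Assumed reading of the rule: abundant ASVs need at least three
    # samples, rare ASVs need more than three samples.
    kept = set()
    for asv, total in totals.items():
        if total >= read_cutoff and occupancy[asv] >= 3:
            kept.add(asv)
        elif total < read_cutoff and occupancy[asv] > 3:
            kept.add(asv)

    return {
        name: {asv: reads for asv, reads in asvs.items() if asv in kept}
        for name, asvs in samples.items()
    }
```

Both cutoffs are exposed as parameters so that a different reading of the thresholds can be tested without editing the function body.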
We filtered the Autotrophic Prokaryotes and estimated their abundance as the number of reads.

3.- DATASET DESCRIPTION

Data type: Autotrophic Prokaryotes Abundance based on 16S rRNA gene amplicon sequencing
Latitude/Longitude Format: decimal degrees
Geographic area covered by the dataset: Global Ocean
Depth range covered by the dataset: Min 3 m, Max 4000 m
Time period covered by the dataset: 15-09-2009 to 27-10-2013
Dataset format: csv (comma-separated values)
Date of dataset creation: 27-12-2022
Raw dataset repository: ENA (European Nucleotide Archive) and MARBITS (Marine Bioinformatics Platform at ICM-CSIC)

4.- MAIN VARIABLE DESCRIPTION

MeasurementTypeID: Autotrophic Prokaryotes Abundance based on 16S rRNA gene amplicon sequencing
MeasurementValue: Ocean Autotrophic Prokaryotes Abundance
MeasurementID: Abundance

All metadata provided:

    1.- ProjectID: Name of the overarching project (AtlantECO_H2020_GA#210591007)
    2.- ProjectWP: Work Package within the overarching project (WP2, WP4 etc.)
    3.- DataSilo: Name of data silo within WP (plastics, carbon, metaG, AmpSeq, microscopy, imaging etc.)
    4.- ContactName: String with names of the people in charge of the dataset - 2 names minimum
    5.- ContactAdress: String with the valid email addresses of the people in ContactName.
    6.- occurrenceID: AtlantECO-specific identifier
    7.- decimalLatitude: Geographic latitude in decimal degrees, following the -90/+90 WGS84 Spatial Reference System (SRS).
    8.- decimalLongitude: Geographic longitude in decimal degrees, following the -180/+180 WGS84 SRS.
    9.- geodeticDatum: SRS of the spatial coordinates; give 'WGS84' only.
    10.- CoordUncertainty: Uncertainty estimate of the decimal coordinates; in meters.
    11.- CountryCode: ISO3166-1-alpha-2 code for the country the observation belongs to.
    12.- eventDate: Date and time of the sampling event using the extended ISO 8601 format with hyphens (e.g. ‘YYYY-MM-DDTHH:MM:SS’). If the time was not recorded, then just add ‘T00:00:00’ (e.g. ‘2017-09-23T00:00:00’).
    13.- eventDateInterval: Numeric value indicating the duration of the in-situ sampling event (event duration, not the time spent in the lab analyzing the sample)
    14.- eventDateIntervalUnit: Unit of eventDateInterval (e.g. seconds, minutes, hours).
    15.- Year: Year of the sampling event (YYYY format).
    16.- Month: Month of the sampling event (MM format).
    17.- Day: Day of the sampling event (DD format).
    18.- Bathymetry: Depth of the seafloor at sampling event, in meters, ≤ 0.
    19.- BathySource: String indicating whether Bathymetry was measured at sampling event or inferred a posteriori from NOAA or GEBCO
    20.- HabitatType: String indicating the type of habitat the sample was taken from (e.g. open ocean water column, river plume, river, coral reef, mangrove…)
    21.- LonghurstProvince: Longhurst Province the sample was taken from (one of 56 possible four-letter geocodes). You can get this information using the Python script available here.
    22.- Depth: Sample depth, in meters below the local sea surface; > 0, or 0 at the surface.
    23.- DepthAcurracy: Single term that describes the accuracy of the collection depth, in meters.
    24.- DepthIntegral: Depth span below sea surface, in meters; > 0, or 0 at the surface.
    25.- MinDepth: Minimum depth for depth-integrated quantities, in meters, > 0.
    26.- MaxDepth: Maximum depth for depth-integrated quantities, in meters, > 0.
    27.- ParentEventID: Describes the parent event, which is composed of one or more sub-sampling (child) events (eventID below).
    28.- eventID: Biosample Accession Number
    29.- InstitutionCode: Custodian institution for the data record
    30.- SourceArchive: Online archive where the data is stored.
    31.- OrigCollectionCode: Bioproject accession number given in ENA
    32.- OrigCollectionID: Run accession number given in ENA
    33.- BiblioCitation: String indicating the bibliographic citation associated with the data (when possible).
    34.- BiblioCitationDOI: DOI of the bibliographic citation (when possible).
    35.- DateDataAccess: Date at which the data was downloaded from the SourceArchive; in the ISO 8601 format
    36.- OrigScientificName: Instrument platform used for sample sequencing
    37.- ScientificName: marine amplicon sequencing [species]
    38.- WoRMS_ID: Not_applicable
    39.- TaxonRank: Not_applicable
    40.- Kingdom: marine amplicon sequencing [species]/Bac&Archae/Euk
    41.- Phylum: Not_applicable
    42.- Order: Not_applicable
    43.- Class: Not_applicable
    44.- Family: Not_applicable
    45.- Genus: Not_applicable
    46.- Species: Not_applicable
    47.- Subspecies: Not_applicable
    48.- LifeForm: Not_applicable
    49.- AssocTaxa: Not_applicable
    51.- measurementType: Read count (Abundance)
    54.- measurementUnit: Number of reads
    55.- measurementAcurracy: Not_applicable
    56.- measurementValueID: Instrument platform used for sample sequencing
    57.- Biomass_mgCm3: Not_applicable
    58.- BiomassConvFactor: Not_applicable
    59.- basisOfRecord: Nature of the data record
    60.- SamplingProtocol: Indicates the sampling protocol used to make the measurement. In this database, this variable refers to the upper and lower thresholds of the size fraction
    61.- SampleAmount: Seawater filtered
    62.- SampleAmountUnit: Unit corresponding to the SampleAmount
    63.- SampleEffort: Sample filtration time
    64.- DeterminedBy: Name(s) of the people, groups or organizations who made the measurement
    65.- DeterminedDate: The date on which the sample was added to the public repository
    66.- Note: Oceanographic Expedition ID + ORG. group target (e.g. Viruses, Protist, Prok), which is related to the size fraction
    67.- Surf_temperature: Sea surface temperature in Celsius
    68.- Temperature: Temperature measured in Celsius at sampling event (day and depth resolved).
    69.- Surf_salinity: Same as above but for salinity.
    70.- Salinity: -
    71.- Surf_nitrate_micromol: Same as above but for Nitrates (NO3) concentration, in micromolar (µM).
    72.- Nitrate_micromol: -
    73.- Surf_phosphate_micromol: Same as above but for Phosphates (PO4 3-) concentration, in micromolar (µM).
    74.- Phosphate_micromol: -
    75.- Surf_Chla_mgm3: Same as above but for Chlorophyll-a concentration, in mg.m-3.
    76.- Chla_mgm3: -
    77.- Flag: to_be_defined

5.- LINKS

Link to the csv table
https://drive.google.com/file/d/1aBhD6zM8r6HXMT7KutkypFQi1IKWyOVM/view?usp=share_link

Link to the nc table
https://drive.google.com/file/d/1vEek_MN67dkpiNjobzBX3icFbny2lYax/view?usp=share_link

6.- CONTRIBUTORS

Hugo Sarmento - UFSCar - Brazil - hugo.sarmento@gmail.com
Clara Arboleda - UFSCar - Brazil - claraarboledab@gmail.com
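
A few of the field definitions in section 4 (coordinate ranges, the eventDate format, non-negative Depth) can be checked automatically after downloading the csv table. The sketch below uses only the Python standard library. It assumes that the csv header uses exactly the field names listed in section 4 and that these four fields are filled with plain numbers and dates. The helper names `check_record` and `check_table` are hypothetical, and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import datetime


def check_record(row):
    """Return a list of problems found in one csv row (empty if none)."""
    problems = []
    lat = float(row["decimalLatitude"])
    lon = float(row["decimalLongitude"])
    if not -90.0 <= lat <= 90.0:
        problems.append("decimalLatitude out of range")
    if not -180.0 <= lon <= 180.0:
        problems.append("decimalLongitude out of range")
    try:
        # Extended ISO 8601 with hyphens, as described for eventDate.
        datetime.strptime(row["eventDate"], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        problems.append("eventDate is not YYYY-MM-DDTHH:MM:SS")
    if float(row["Depth"]) < 0:
        problems.append("Depth is negative")
    return problems


def check_table(handle):
    """Map 1-based data row numbers to their problems, skipping clean rows."""
    report = {}
    for number, row in enumerate(csv.DictReader(handle), start=1):
        problems = check_record(row)
        if problems:
            report[number] = problems
    return report


# Invented rows for illustration only; they are not real records.
sample = io.StringIO(
    "decimalLatitude,decimalLongitude,eventDate,Depth\n"
    "-23.5,-41.2,2011-03-05T00:00:00,3\n"
    "95.0,10.0,2011-03-05,40\n"
)
report = check_table(sample)
print(report)
```

To check the real table, open the downloaded file with `open(path, newline="")` and pass the handle to `check_table`. Rows with empty or non-numeric values in these fields would raise an error in this sketch and would need extra handling.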