_______________________________________________________________

TITLE: Ocean Microbial Functional Diversity – 0.2-3 µm size fraction
_______________________________________________________________

1.- INTRODUCTION

The geodatabase on Ocean Microbial Functional Diversity was created in 2022 within the framework of the EU-funded AtlantECO project. It results from the compilation and analysis of existing metagenomic data from several oceanographic expeditions (Tara Oceans, Malaspina, OSD 14, bioGEOTRACES and Anaconda). The database contains points representing the functional diversity of microbes in the 0.2-3 µm size fraction across the global ocean. Functional diversity was estimated with the Shannon index computed on KEGG orthologs (KOs).

2.- METHODOLOGY USED

The raw sequences were analyzed with version 5.0 of the MGnify pipeline. For each sample (n = 451), the functional table containing KO abundances was downloaded from MGnify, and all tables were joined into a single table. A second table was generated by discarding samples with fewer than 10,000 reads. This table was then normalized to an equal sampling depth (10,294 reads) to produce the final table, from which functional diversity was estimated with the Shannon index in the R software environment (vegan R package).

3.- DATASET DESCRIPTION

Data type: This table contains the functional diversity (Shannon index) based on KEGG orthologs (KOs)
Latitude/Longitude format: decimal degrees
Geographic area covered by the dataset: Global Ocean
Depth range covered by the dataset: min = 0 m; max = 5160 m; mean = 335 m
Time period covered by the dataset: 2009 - 2014
Dataset format: csv and nc
Date of dataset creation: 20/11/22
Raw dataset repository: MGnify / local server, UFSCar

4.- MAIN VARIABLE DESCRIPTION

MeasurementTypeID: Functional Diversity based on KEGG orthologs (KOs)
MeasurementValue: Functional Diversity
MeasurementID: Shannon Index

All metadata provided:

1.- ProjectID: Name of the overarching project (AtlantECO_H2020_GA#210591007)
2.- ProjectWP: Work Package within the overarching project (WP2, WP4, etc.)
3.- DataSilo: Name of the data silo within the WP (plastics, carbon, metaG, microscopy, imaging, etc.)
4.- ContactName: String with the names of the people in charge of the dataset; 2 names minimum
5.- ContactAdress: String with the valid email addresses of the people in ContactName
6.- occurrenceID: AtlantECO-specific identifier
7.- decimalLatitude: Geographic latitude in decimal degrees (-90/+90), following the WGS84 Spatial Reference System (SRS)
8.- decimalLongitude: Geographic longitude in decimal degrees (-180/+180), following the WGS84 SRS
9.- geodeticDatum: SRS of the spatial coordinates; give 'WGS84' only
10.- CoordUncertainty: Uncertainty estimate of the decimal coordinates, in meters
11.- CountryCode: ISO 3166-1 alpha-2 code of the country the observation belongs to
12.- eventDate: Date and time of the sampling event in the extended ISO 8601 format with hyphens (e.g. ‘YYYY-MM-DDTHH:MM:SS’). If the time was not recorded, ‘T00:00:00’ is appended (e.g. ‘2017-09-23T00:00:00’)
13.- eventDateInterval: Numeric value giving the duration of the in-situ sampling event (event duration, not the time spent in the lab analyzing the sample)
14.- eventDateIntervalUnit: Unit of eventDateInterval (e.g. seconds, minutes, hours)
15.- Year: Year of the sampling event (YYYY format)
16.- Month: Month of the sampling event (MM format)
17.- Day: Day of the sampling event (DD format)
18.- Bathymetry: Depth of the seafloor at the sampling event, in meters, ≤ 0
19.- BathySource: String indicating whether Bathymetry was measured at the sampling event or inferred a posteriori from NOAA or GEBCO
20.- HabitatType: String indicating the type of habitat the sample was taken from (e.g. open ocean water column, river plume, river, coral reef, mangrove…)
21.- LonghurstProvince: Longhurst Province the sample was taken from (one of 56 possible four-letter geocodes). You can get this information using the Python script available here.
22.- Depth: Sample depth, in meters below the local sea surface; > 0; = 0 at the surface
23.- DepthAcurracy: Single term that describes the accuracy of the collection depth, in meters
24.- DepthIntegral: Depth span below the sea surface, in meters; > 0; = 0 at the surface
25.- MinDepth: Minimum depth for depth-integrated quantities, in meters, > 0
26.- MaxDepth: Maximum depth for depth-integrated quantities, in meters, > 0
27.- ParentEventID: Identifier of the parent event, which is composed of one or more sub-sampling (child) events (see eventID below)
28.- eventID: Biosample accession number
29.- InstitutionCode: Custodian institution for the data record
30.- SourceArchive: Online archive where the data are stored
31.- OrigCollectionCode: Bioproject accession number given in ENA
32.- OrigCollectionID: Run accession number given in ENA
33.- BiblioCitation: String indicating the bibliographic citation associated with the data (when possible)
34.- BiblioCitationDOI: DOI of the bibliographic citation (when possible)
35.- DateDataAccess: Date on which the data were downloaded from the SourceArchive, in the ISO 8601 format
36.- OrigScientificName: Instrument platform used for sample sequencing
37.- ScientificName: marine metagenome [species]
38.- WoRMS_ID: Not_applicable
39.- TaxonRank: Not_applicable
40.- Kingdom: marine metagenome [species]/Bac&Archae/Euk
41.- Phylum: Not_applicable
42.- Order: Not_applicable
43.- Class: Not_applicable
44.- Family: Not_applicable
45.- Genus: Not_applicable
46.- Species: Not_applicable
47.- Subspecies: Not_applicable
48.- LifeForm: Not_applicable
49.- AssocTaxa: Not_applicable
51.- measurementType: Functional Diversity
54.- measurementUnit: Shannon Index Value
55.- measurementAcurracy: Not_applicable
56.- measurementValueID: Instrument platform used for sample sequencing
57.- Biomass_mgCm3: Not_applicable
58.- BiomassConvFactor: Not_applicable
59.- basisOfRecord: Nature of the data record
60.- SamplingProtocol: Sampling protocol used to make the measurement. In this database, this variable gives the upper and lower thresholds of the size fraction
61.- SampleAmount: Amount of seawater filtered
62.- SampleAmountUnit: Unit corresponding to SampleAmount
63.- SampleEffort: Sample filtration time
64.- DeterminedBy: Name(s) of the people, groups or organizations who made the measurement
65.- DeterminedDate: Date on which the sample was added to the public repository
66.- Note: Oceanographic expedition ID + target organism group (e.g. Viruses, Protist, Prok), which is related to the size fraction
67.- Surf_temperature: Sea surface temperature, in degrees Celsius
68.- Temperature: Temperature measured at the sampling event (day and depth resolved), in degrees Celsius
69.- Surf_salinity: Same as above but for salinity
70.- Salinity: -
71.- Surf_nitrate_micromol: Same as above but for nitrate (NO3) concentration, in micromolar (µM)
72.- Nitrate_micromol: -
73.- Surf_phosphate_micromol: Same as above but for phosphate (PO4^3-) concentration, in micromolar (µM)
74.- Phosphate_micromol: -
75.- Surf_Chla_mgm3: Same as above but for chlorophyll-a concentration, in mg.m-3
76.- Chla_mgm3: -
77.- Flag: to_be_defined

5.- LINKS

Link to the csv table: https://drive.google.com/file/d/1v4QfYPaz8GQeguLZRC_hUv7DUaXc3IAS/view?usp=share_link
Link to the nc table: https://drive.google.com/file/d/1U7Vf5xPBUDxJA8hCapUUxrdriSO6zgp5/view?usp=share_link

6.- CONTRIBUTORS

Paula Huber - UFSCar - Brazil - mariapaulahuber@gmail.com
Hugo Sarmento - UFSCar - Brazil - hugo.sarmento@gmail.com
Lorna Richardson - EMBL-EBI - UK - lornar@ebi.ac.uk
Rob Finn - EMBL-EBI - UK - rdf@ebi.ac.uk
Fabio Benedetti - ETH Zürich - fabio.benedetti@usys.ethz.ch
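
7.- ILLUSTRATIVE EXAMPLE

The filtering, normalization and diversity steps described in section 2 can be illustrated with a minimal Python sketch. This is not the code used to build the dataset (that analysis was run in R with the vegan package). The function names `rarefy` and `shannon`, the toy KO counts, and the choice of random subsampling without replacement as the normalization method are assumptions made for illustration only. The natural logarithm is used, which is the default base of the Shannon index in vegan.

```python
import math
import random
from collections import Counter

MIN_READS = 10_000   # samples below this read count are discarded (section 2)
DEPTH = 10_294       # equal sampling depth quoted in section 2


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a {KO: count} dict.

    Assumption: the "normalization to an equal sampling depth" is a random
    subsampling of reads; the source does not name the exact method.
    """
    total = sum(counts.values())
    if total < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    # Expand to one list entry per read, then draw without replacement.
    reads = [ko for ko, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over KOs with non-zero counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)


# Hypothetical toy samples (KO identifiers and counts are made up).
samples = {
    "sample_A": {"K00001": 6000, "K00002": 3000, "K00003": 2500},
    "sample_B": {"K00001": 400, "K00002": 100},  # too shallow, discarded
}

# Keep only samples with at least MIN_READS reads, rarefy, then score.
kept = {name: c for name, c in samples.items()
        if sum(c.values()) >= MIN_READS}
diversity = {name: shannon(rarefy(c, DEPTH)) for name, c in kept.items()}
print(diversity)
```

Fixing the seed makes the subsampling reproducible. Expanding the counts to one entry per read keeps the sketch simple, but it costs memory proportional to the total read count, so a vectorized approach would be preferable for large tables.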