GBIF Public Datasets on Google Cloud

This document describes the data format and gives simple examples for getting started with the GBIF monthly snapshots stored on Google Cloud.

BigQuery

The latest snapshot is available as a public dataset in Google BigQuery. See the BigQuery description.

The snapshot includes all CC0, CC-BY and CC-BY-NC licensed occurrence data published through GBIF.

Cloud Storage

Periodic snapshots since April 2021 are stored as public data in GCS in the bucket public-datasets-gbif.

Within the bucket, the periodic occurrence snapshots are stored in occurrence/YYYY-MM-DD, where YYYY-MM-DD corresponds to the date of the snapshot. Data are stored in Parquet format, described below.

The snapshot includes all CC0, CC-BY and CC-BY-NC licensed occurrence data published through GBIF.

Each snapshot contains a citation.txt with instructions on how best to cite the data, and the data files themselves in Parquet format: occurrence.parquet/*.

Therefore, the data files for the first snapshot are at

gs://public-datasets-gbif/occurrence/2021-04-13/occurrence.parquet/*

and the citation information is at

gs://public-datasets-gbif/occurrence/2021-04-13/citation.txt
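The path layout above can be captured in a small helper. This is a minimal sketch that only builds the `gs://` strings from a snapshot date; the function name `snapshot_uris` is an illustrative choice, not part of any GBIF or Google tooling, and the code does not check that a snapshot exists for the given date.

```python
from datetime import date

BUCKET = "public-datasets-gbif"  # bucket name as given in the text


def snapshot_uris(snapshot: date) -> dict[str, str]:
    """Return the gs:// locations of one snapshot's data files and citation."""
    # Snapshots live under occurrence/YYYY-MM-DD, which is the ISO date format.
    base = f"gs://{BUCKET}/occurrence/{snapshot.isoformat()}"
    return {
        "data": f"{base}/occurrence.parquet/*",
        "citation": f"{base}/citation.txt",
    }


if __name__ == "__main__":
    for name, uri in snapshot_uris(date(2021, 4, 13)).items():
        print(f"{name}: {uri}")
```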

Schema

The Parquet file schema is described here.

Most field names correspond to terms from the Darwin Core standard, and have been interpreted by GBIF's systems to align taxonomy, location, dates etc. Additional information may be retrieved using the GBIF API.

| Field¹ | Type | Nullable | Description |
|---|---|---|---|
| gbifid | BigInt | N | GBIF's identifier for the occurrence. |
| datasetkey | String (UUID) | N | GBIF's UUID for the dataset containing this occurrence. |
| publishingorgkey | String (UUID) | N | GBIF's UUID for the organization publishing this occurrence. |
| occurrencestatus | String | N | See dwc:occurrenceStatus. Either the value PRESENT or ABSENT. Many users will wish to filter for PRESENT data. |
| basisofrecord | String | N | See dwc:basisOfRecord. One of PRESERVED_SPECIMEN, FOSSIL_SPECIMEN, LIVING_SPECIMEN, OBSERVATION, HUMAN_OBSERVATION, MACHINE_OBSERVATION, MATERIAL_SAMPLE, MATERIAL_CITATION, OCCURRENCE. |
| kingdom | String | Y | See dwc:kingdom. This field has been aligned with the GBIF backbone taxonomy. |
| phylum | String | Y | See dwc:phylum. This field has been aligned with the GBIF backbone taxonomy. |
| class | String | Y | See dwc:class. This field has been aligned with the GBIF backbone taxonomy. |
| order | String | Y | See dwc:order. This field has been aligned with the GBIF backbone taxonomy. |
| family | String | Y | See dwc:family. This field has been aligned with the GBIF backbone taxonomy. |
| genus | String | Y | See dwc:genus. This field has been aligned with the GBIF backbone taxonomy. |
| species | String | Y | See dwc:species. This field has been aligned with the GBIF backbone taxonomy. |
| infraspecificepithet | String | Y | See dwc:infraspecificEpithet. This field has been aligned with the GBIF backbone taxonomy. |
| taxonrank | String | Y | See dwc:taxonRank. This field has been aligned with the GBIF backbone taxonomy. |
| scientificname | String | Y | See dwc:scientificName. This field has been aligned with the GBIF backbone taxonomy. |
| verbatimscientificname | String | Y | The scientific name as provided by the data publisher. |
| verbatimscientificnameauthorship | String | Y | The scientific name authorship provided by the data publisher. |
| taxonkey | Integer | Y | The numeric identifier for the taxon in GBIF's backbone taxonomy corresponding to scientificname. |
| specieskey | Integer | Y | The numeric identifier for the taxon in GBIF's backbone taxonomy corresponding to species. |
| typestatus | String array | ³ | See dwc:typeStatus. |
| countrycode | String | Y | See dwc:countryCode. GBIF's interpretation has set this to an ISO 3166-1 alpha-2 code. |
| locality | String | Y | See dwc:locality. |
| stateprovince | String | Y | See dwc:stateProvince. |
| decimallatitude | Double | Y | See dwc:decimalLatitude. GBIF's interpretation has normalized this to a WGS84 coordinate. |
| decimallongitude | Double | Y | See dwc:decimalLongitude. GBIF's interpretation has normalized this to a WGS84 coordinate. |
| coordinateuncertaintyinmeters | Double | Y | See dwc:coordinateUncertaintyInMeters. |
| coordinateprecision | Double | Y | See dwc:coordinatePrecision. |
| elevation | Double | Y | See dwc:elevation. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| elevationaccuracy | Double | Y | See dwc:elevationAccuracy. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| depth | Double | Y | See dwc:depth. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| depthaccuracy | Double | Y | See dwc:depthAccuracy. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| eventdate | Timestamp | Y | See dwc:eventDate. GBIF's interpretation has normalized this value to an ISO 8601 date with a local time. |
| year | Integer | Y | See dwc:year. |
| month | Integer | Y | See dwc:month. |
| day | Integer | Y | See dwc:day. |
| individualcount | Integer | Y | See dwc:individualCount. |
| establishmentmeans | String | Y | See dwc:establishmentMeans. |
| occurrenceid | String | ² | See dwc:occurrenceID. |
| institutioncode | String | ² | See dwc:institutionCode. |
| collectioncode | String | ² | See dwc:collectionCode. |
| catalognumber | String | ² | See dwc:catalogNumber. |
| recordnumber | String | Y | See dwc:recordNumber. |
| recordedby | String array | ³ | See dwc:recordedBy. |
| identifiedby | String array | ³ | See dwc:identifiedBy. |
| dateidentified | Timestamp | Y | See dwc:dateIdentified. An ISO 8601 date. |
| mediatype | String array | ³ | See dwc:mediaType. May contain StillImage, MovingImage or Sound (from an enumeration), detailing whether the occurrence has this media available. |
| issue | String array | ³ | A list of issues encountered by GBIF in processing this record. |
| license | String | N | See dwc:license. Either CC0_1_0, CC_BY_4_0 or CC_BY_NC_4_0. |
| rightsholder | String | Y | See dwc:rightsHolder. |
| lastinterpreted | Timestamp | N | The ISO 8601 date when the record was last processed by GBIF. Data are reprocessed for several reasons, including changes to the backbone taxonomy, so this date is not necessarily the date the occurrence record last changed. |

¹ Field names are lower case, but in later snapshots this may change to camelCase, for consistency with Darwin Core and the GBIF API.

² Either occurrenceID, or institutionCode + collectionCode + catalogNumber, or both, will be present on every record.

³ The array may be empty.
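To illustrate how the nullability rules and footnotes play out on individual records, here is a minimal sketch that treats a record as a plain dictionary keyed by the lower-case field names. The helper names and the sample record are made up for illustration; how a record is actually loaded (BigQuery client, Parquet reader, etc.) is left open.

```python
def is_present_with_coordinates(record: dict) -> bool:
    """True for PRESENT records that carry both coordinates (which are nullable)."""
    return (
        record.get("occurrencestatus") == "PRESENT"
        and record.get("decimallatitude") is not None
        and record.get("decimallongitude") is not None
    )


def has_identifier(record: dict) -> bool:
    """Check footnote 2: an occurrenceid, or the full institution/collection/catalog triplet."""
    triplet = all(
        record.get(key)
        for key in ("institutioncode", "collectioncode", "catalognumber")
    )
    return bool(record.get("occurrenceid")) or triplet


if __name__ == "__main__":
    # Invented sample record, not real GBIF data.
    sample = {
        "occurrencestatus": "PRESENT",
        "decimallatitude": 55.68,
        "decimallongitude": 12.59,
        "occurrenceid": "urn:example:1",
        "recordedby": [],  # arrays may be empty (footnote 3)
    }
    print(is_present_with_coordinates(sample), has_identifier(sample))
```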

Change history

Snapshots from 2021-04-13 included only CC0 and CC-BY data.

From 2022-04-01, snapshots include all data (CC0, CC-BY, CC-BY-NC). Filter using the license column if you need to exclude CC-BY-NC data.
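As a sketch of the license filter, the following helper builds a BigQuery SQL string that counts records per kingdom while optionally excluding CC-BY-NC data. The license values come from the schema table and the table name is the public table used later in this document; the function name is hypothetical, and the code only produces the query text without running it.

```python
TABLE = "bigquery-public-data.gbif.occurrences"  # public table named in this document
OPEN_LICENSES = ("CC0_1_0", "CC_BY_4_0")  # license column values other than CC_BY_NC_4_0


def count_by_kingdom_sql(exclude_nc: bool = True) -> str:
    """Build a per-kingdom count query, optionally restricted to CC0 and CC-BY records."""
    where = ""
    if exclude_nc:
        allowed = ", ".join(f"'{value}'" for value in OPEN_LICENSES)
        where = f"WHERE license IN ({allowed})\n"
    return (
        "SELECT kingdom, COUNT(*) AS c\n"
        f"FROM `{TABLE}`\n"
        f"{where}"
        "GROUP BY kingdom"
    )


if __name__ == "__main__":
    print(count_by_kingdom_sql())
```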

From 2022-05-01, the timestamp fields eventDate, dateIdentified and lastInterpreted have a timestamp type, rather than the previous string type. The fields identifiedby, recordedby and typestatus are changed from a string type to a string array.
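Code that reads several snapshots may need to know which type to expect for the changed fields. The sketch below encodes only the 2022-05-01 change described above; the function name and the type labels it returns are illustrative, and it deliberately rejects fields that the change history does not mention.

```python
from datetime import date

TYPE_CHANGE = date(2022, 5, 1)  # first snapshot with the new types
TIMESTAMP_FIELDS = ("eventdate", "dateidentified", "lastinterpreted")
ARRAY_FIELDS = ("identifiedby", "recordedby", "typestatus")


def expected_type(field: str, snapshot: date) -> str:
    """Return the type of a changed field in the snapshot taken on the given date."""
    if field not in TIMESTAMP_FIELDS + ARRAY_FIELDS:
        raise KeyError(f"{field} is not covered by the 2022-05-01 type change")
    if snapshot < TYPE_CHANGE:
        return "string"  # all six fields were plain strings before the change
    return "timestamp" if field in TIMESTAMP_FIELDS else "string array"


if __name__ == "__main__":
    print(expected_type("eventdate", date(2021, 4, 13)))
    print(expected_type("recordedby", date(2022, 5, 1)))
```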

Getting started with BigQuery

BigQuery provides a pay-per-query SQL service on Google Cloud, particularly well suited for producing summary counts from GBIF data. The following steps describe how to get started using BigQuery on the GBIF dataset.

  1. Open the GBIF Occurrences dataset in BigQuery
  2. Run a query, by pasting the following command in the editor window
```sql
SELECT kingdom, count(*) AS c
FROM `bigquery-public-data.gbif.occurrences`
GROUP BY kingdom;
```

Results (your numbers will differ since the data is updated every month):

| Row | kingdom | c |
|---|---|---|
| 1 | Plantae | 193391585 |
| 2 | incertae sedis | 4223619 |
| 3 | Archaea | 226905 |
| 4 | Fungi | 12210273 |
| 5 | Viruses | 42019 |
| 6 | Bacteria | 13313089 |
| 7 | Animalia | 1064194305 |
| 8 | Protozoa | 793176 |
| 9 | Chromista | 9440667 |
  3. Your results should show in the browser, and can also be saved in Google Drive, as CSV, as a new BigQuery table etc.
  4. The amount of data scanned will be shown under "Execution Details", which is used to calculate the billing.
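Since billing follows the amount of data scanned, a rough cost estimate is simple arithmetic. The sketch below assumes a price quoted per TiB scanned and ignores any free tier, minimum charge or rounding; the rate is a parameter the caller must supply, and the value used in the example is a placeholder, not an actual Google Cloud price.

```python
TIB = 2**40  # bytes in one tebibyte


def estimate_query_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Estimate the cost of one query from the bytes scanned and a per-TiB rate."""
    if bytes_scanned < 0 or price_per_tib < 0:
        raise ValueError("bytes_scanned and price_per_tib must be non-negative")
    return bytes_scanned / TIB * price_per_tib


if __name__ == "__main__":
    # Placeholder rate of 10.0 per TiB, chosen only to make the arithmetic visible.
    print(estimate_query_cost(bytes_scanned=2**39, price_per_tib=10.0))
```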

Using an older snapshot with BigQuery

The public table contains the latest data from GBIF. You can create private tables from older snapshots if you need to.

Create a new BigQuery dataset, and create a table within that dataset. Import from public-datasets-gbif/occurrence/2022-05-01/occurrence.parquet/*, selecting Parquet as the file format and changing the date as required.
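The same import can also be done from the command line with the `bq` tool. The sketch below only assembles the `bq load` invocation for a given snapshot; the dataset and table names are hypothetical, the dataset is assumed to exist already, and the command is printed rather than executed.

```python
import shlex

BUCKET = "public-datasets-gbif"  # bucket name as given in the text


def bq_load_command(snapshot: str, dataset: str, table: str) -> list[str]:
    """Build the bq argument list that loads one snapshot's Parquet files into a table."""
    uri = f"gs://{BUCKET}/occurrence/{snapshot}/occurrence.parquet/*"
    return ["bq", "load", "--source_format=PARQUET", f"{dataset}.{table}", uri]


if __name__ == "__main__":
    # "my_gbif" and "occurrence_2022_05_01" are example names, not existing resources.
    command = bq_load_command("2022-05-01", "my_gbif", "occurrence_2022_05_01")
    print(shlex.join(command))  # shlex.join quotes the wildcard for safe pasting into a shell
```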

Downloading/mirroring the data

A monthly snapshot is roughly 180GB in size.

The GCS buckets are public, and can be accessed anonymously using the GCS APIs, gsutil CLI tool, or tools like rclone.

```shell
# gsutil CLI (part of the Google Cloud SDK)
gsutil ls gs://public-datasets-gbif/occurrence/
```

```ini
# rclone configuration
[gcs]
type = google cloud storage
project_number = [set by you]
service_account_file = [set by you]
```

```shell
# rclone commands
rclone ls gcs:public-datasets-gbif
rclone sync -v gcs:public-datasets-gbif/occurrence/2021-04-13/ ./gbif_2021-04-13/
```
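For small objects such as citation.txt, anonymous access also works over plain HTTPS without any Google tooling. The following is a minimal sketch that assumes the standard `https://storage.googleapis.com/<bucket>/<object>` address form for public objects; the helper names are illustrative, and fetching the multi-gigabyte Parquet files this way is not recommended.

```python
from urllib.parse import quote
from urllib.request import urlopen

BUCKET = "public-datasets-gbif"  # bucket name as given in the text


def public_url(object_name: str, bucket: str = BUCKET) -> str:
    """Return the anonymous HTTPS address of an object in a public bucket."""
    # quote() keeps "/" unescaped by default, so the object path stays readable.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


def fetch_text(object_name: str) -> str:
    """Download a small text object anonymously and decode it as UTF-8."""
    with urlopen(public_url(object_name)) as response:
        return response.read().decode("utf-8")


# Example (requires network access):
#     print(fetch_text("occurrence/2021-04-13/citation.txt"))
```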