DataViz projects on the topic of COVID-19 genomic sequencing.
Most pages show Australia by default; other countries are available for selection.
Most pages now show lineages assigned by Nextclade (nextclade.org), an alternative lineage classification tool. Alternates of most pages are also available showing the original GISAID lineages if preferred, but most experts recommend Nextclade for quicker and more precise calls, especially for newer lineages. Pages with (nextclade) in the page title show Nextclade lineages; otherwise GISAID lineages are shown. The page navigation is at bottom-centre, e.g. < 2 of 30 >.
Visualise the geographical spread of a selected lineage. Use the play control at bottom for an animated view of the spread.
Locations are approximate - typically by reporting state/province or country. Bubble sizes are driven by the % of the total set of samples selected.
Rolls up the evolutionary tree of lineages from the highest level ancestors (far left) to the most evolved descendants (far right). Each segment of a vertical column shows the count for that lineage plus all its descendants. Slicers for the range of Levels and Minimum # of Samples can be used to produce a more focussed output.
The rollup logic is a bit heavy, so please be patient with this page.
Jeff Gilchrist wrote an excellent thread explaining how to drive this Sankey page.
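The rollup idea behind the Sankey page can be sketched in a few lines of Python. This is only an illustration, not the Power BI logic used in the report: the function name `rollup_counts`, the `parent_of` mapping and the sample counts are all made up for the example. Each lineage's own count is added to itself and to every ancestor, so each node ends up with its own samples plus those of all its descendants.

```python
from collections import defaultdict


def rollup_counts(parent_of, own_counts):
    """Return {lineage: own count + counts of all its descendants}.

    parent_of maps a lineage to its direct parent (roots are absent).
    own_counts maps a lineage to the samples assigned exactly to it.
    """
    totals = defaultdict(int)
    for lineage, count in own_counts.items():
        node = lineage
        seen = set()  # guards against accidental cycles in parent_of
        # Walk up to the root, crediting this lineage's samples to each ancestor.
        while node is not None and node not in seen:
            seen.add(node)
            totals[node] += count
            node = parent_of.get(node)
    return dict(totals)


# Hypothetical parent links (full, un-aliased names) and invented counts.
parent_of = {
    "BA.2.86.1": "BA.2.86",
    "BA.2.86.1.1": "BA.2.86.1",
    "BA.2.86.3": "BA.2.86",
}
own_counts = {"BA.2.86": 10, "BA.2.86.1": 5, "BA.2.86.1.1": 40, "BA.2.86.3": 2}

totals = rollup_counts(parent_of, own_counts)
print(totals["BA.2.86"])    # 57 = 10 + 5 + 40 + 2
print(totals["BA.2.86.1"])  # 45 = 5 + 40
```

Walking up from every lineage costs roughly (number of lineages × tree depth), which hints at why a rollup over thousands of lineages can be slow.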
Track the weekly progress of a selected lineage for any combination of Continents and Countries. Shows the counts of that lineage vs the overall total, by week collected, also as a %.
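As a rough sketch of the weekly tracking calculation, assuming each sample is just a (collection date, lineage) pair and that weeks start on Monday (both are assumptions; the report may bucket weeks differently), the counts and percentage could be derived like this. The function names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d (assumed week boundary)."""
    return d - timedelta(days=d.weekday())


def weekly_share(samples, lineage):
    """Return {week_start: (lineage count, total count, percent)}."""
    totals = Counter()
    hits = Counter()
    for collected, name in samples:
        wk = week_start(collected)
        totals[wk] += 1
        if name == lineage:
            hits[wk] += 1
    return {
        wk: (hits[wk], totals[wk], 100.0 * hits[wk] / totals[wk])
        for wk in sorted(totals)
    }


# Invented samples for illustration only.
samples = [
    (date(2024, 1, 1), "JN.1"),
    (date(2024, 1, 3), "JN.1"),
    (date(2024, 1, 4), "EG.5.1"),
    (date(2024, 1, 5), "XBB.1.5"),
    (date(2024, 1, 8), "JN.1"),
    (date(2024, 1, 10), "JN.1"),
    (date(2024, 1, 12), "JN.1"),
    (date(2024, 1, 14), "EG.5.1"),
]

for wk, (n, total, pct) in weekly_share(samples, "JN.1").items():
    print(wk, n, total, f"{pct:.0f}%")
```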
International data on COVID-19 genomic sequencing, for analysis and reporting on variant prevalence by country, by region and even globally.
Global data gathered from GISAID. Sequence data is processed through the Nextclade CLI to produce the generally preferred Nextclade Lineage classifications.
I'm mainly following the visualisation style I first saw presented by Trevor Bedford. The main features are clean, simple line charts, filtered by default to the top 7 series in the selected data. For each chart point, the frequency of that lineage in the last 7 days is calculated, always comparing to all the sequencing data available for that country/location.
Other pages presented include showing a single Lineage by Country or by Location. The top 7 lineages in the selected Continent/Countries/Locations will be shown, with frequency calculated as above.
The Lineage growth comparison (log) page was suggested by Uffe Poulsen, based on a chart produced by Alex Selby.
The main gisaid dataset only presents data for the last few months, to save processing time. It is typically refreshed weekly. The "gisaid - archive" dataviz presents all the historical data, but is only refreshed monthly at best.
The available sites presenting data on genomic sequencing are typically limited to country or global perspectives, with limited interactivity and often using overly complex visualisations. Each site has its own visualisation style. They are each updated independently.
In this project, the data from those sources is presented in an interactive data visualisation tool: Power BI. This allows interactive filtering of the data in the table, for easier analysis.
A page is presented for each data source (now only gisaid, but formerly also microreact, nextstrain, UCSC and cdgn), and the gisaid data has alternate pages showing either the Nextclade lineage classifications, or GISAID's own lineage classifications.
Earlier lineages are translated into the commonly known variant names (e.g. Delta) following the WHO naming. More recent lineages are grouped into "clans", roughly following the work of the Variant Trackers group, e.g. T. Ryan Gregory. These are grouped using the field Lineage L2; for example, the Lineage L2 "clan" BA.2.86.* includes the BA.2.86 lineage and all its descendants. The Lineage L2 "clans" are mutually exclusive, so XBB.1.9.* excludes all of the EG.5.* lineages.
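One way to get mutually exclusive "clans" is to expand each lineage to its full (un-aliased) Pango name and then pick the most specific clan root that is an ancestor. The sketch below assumes that approach; it is not taken from the report itself. `ALIASES` is a tiny illustrative subset of the Pango alias key, and `CLANS`, `unalias`, `assign_clan` and the "Other" fallback label are hypothetical names invented for this example.

```python
# Illustrative subset of Pango aliases; the real alias key is far larger.
ALIASES = {"EG": "XBB.1.9.2", "JN": "BA.2.86.1"}

# Hypothetical clan roots (full names) mapped to their display labels.
CLANS = {
    "XBB.1.9": "XBB.1.9.*",
    "XBB.1.9.2.5": "EG.5.*",
    "BA.2.86": "BA.2.86.*",
}


def unalias(lineage):
    """Expand a leading alias, e.g. EG.5.1 -> XBB.1.9.2.5.1."""
    head, _, tail = lineage.partition(".")
    full = ALIASES.get(head, head)
    return f"{full}.{tail}" if tail else full


def assign_clan(lineage):
    """Return the label of the deepest clan root containing this lineage."""
    full = unalias(lineage)
    matches = [
        root for root in CLANS
        if full == root or full.startswith(root + ".")
    ]
    if not matches:
        return "Other"  # assumed fallback label
    # The longest root is the most specific, which keeps clans exclusive.
    return CLANS[max(matches, key=len)]


print(assign_clan("EG.5.1"))     # EG.5.*
print(assign_clan("XBB.1.9.1"))  # XBB.1.9.*
print(assign_clan("JN.1"))       # BA.2.86.*
```

Because EG.5.1 expands to XBB.1.9.2.5.1, it matches both XBB.1.9 and XBB.1.9.2.5; choosing the longest match is what keeps it out of the XBB.1.9.* clan.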
The default country selection for most pages is Australia. As well as being where I live, Australia has a relatively high proportion of genomes sequenced vs total COVID-19 cases.
The user can choose any alternative country, and also filter the date range or Lineages included. It is possible to combine multiple countries, even all data for a continent or globally. However, note that the sampling in most datasets is heavily skewed to a handful of countries.
The primary visual on each page is a line chart showing the Lineage Frequency. This is calculated as a moving average over the prior 7 days, compared to all the other lineages present in the data, regardless of selections. To keep the line charts clean, only the seven most frequently occurring Lineages are shown (dynamically determined). Alternate pages compare Countries or Locations for a selected Lineage, again typically showing the top seven.
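A minimal Python sketch of that frequency calculation follows. It is not the report's Power BI measure, and it assumes the 7-day window includes the chart date itself. The function names and sample data are hypothetical. The denominator counts every sample in the window, and the top-N filter is applied only afterwards for display.

```python
from collections import Counter
from datetime import date, timedelta


def rolling_frequency(samples, as_of, window_days=7):
    """Share of each lineage among ALL samples in the window ending at as_of."""
    start = as_of - timedelta(days=window_days - 1)
    counts = Counter(
        name for collected, name in samples if start <= collected <= as_of
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def top_lineages(samples, n=7):
    """The n most frequent lineages across the selected samples."""
    return [name for name, _ in Counter(name for _, name in samples).most_common(n)]


# Invented samples for illustration only.
samples = [
    (date(2024, 1, 1), "XBB.1.5"),
    (date(2024, 1, 2), "EG.5.1"),
    (date(2024, 1, 8), "JN.1"),
    (date(2024, 1, 9), "EG.5.1"),
    (date(2024, 1, 10), "JN.1"),
    (date(2024, 1, 12), "JN.1"),
]

freq = rolling_frequency(samples, as_of=date(2024, 1, 14))
# Only the top lineages are charted, but their frequencies were computed
# against every lineage in the window.
shown = {name: freq.get(name, 0.0) for name in top_lineages(samples, n=2)}
print(shown)  # {'JN.1': 0.75, 'EG.5.1': 0.25}
```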
The gray inverted column chart below each line chart shows the counts of all genomes sequenced over the same period. A typical pattern is that the sample volume drops for more recent
An interactive table at the bottom right lists the individual observations presented by each dataset.
From gisaid.org we gather their EpiCoV metadata dataset. For most countries, this dataset is the most complete and up-to-date available.
Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. DOI:10.1002/gch2.1018 PMCID: 31565258
Using the nextclade CLI tool, we classify the gisaid samples to obtain the nextclade pango lineage. These offer an alternative to the pango lineages presented by gisaid. New lineages are typically defined first in nextclade, and its classifications are preferred by some experts.
THIS REPORT IS NOT HEALTH ADVICE - REFER TO YOUR LOCAL HEALTH AUTHORITY.
Contributions, issues, feature requests and sponsorship are all welcome!
Give a ⭐️ if you like this project!