outbreak.info is a project to enable the tracking of SARS-CoV-2 variants during the COVID-19 pandemic. This R package offers access to the data we have gathered and calculated, so that you can replicate the visualizations on outbreak.info. Here, we’ll outline some of the basic functionality of the R package for accessing the genomic (SARS-CoV-2 variant) data, the research library, and the epidemiology data (COVID-19 cases and deaths).
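The core functionality within the suite of Variant Prevalence Functions includes accessing data for the three tasks covered below: the prevalence of a variant over time, the cumulative prevalence of a variant by location, and the mutations found in a lineage.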
Before we start, we need to provide our GISAID credentials to access the genomic data. If you don’t have a GISAID account, you can register for one on their website. It may take a day or two for the account to become active.
# Install the package, if you haven't already, using devtools
devtools::install_github("outbreak-info/R-outbreak-info")
# Package imports
library(outbreakinfo)
#> Warning: replacing previous import 'crayon::%+%' by 'ggplot2::%+%' when loading
#> 'outbreakinfo'
#> Warning: replacing previous import 'jsonlite::flatten' by 'purrr::flatten' when
#> loading 'outbreakinfo'
# Not needed; just used to tidy / visualize some of the outputs
library(dplyr)
library(knitr)
library(lubridate)
library(ggplot2)
# Authenticate yourself using your GISAID credentials.
authenticateUser()
Tracking how the prevalence of variants changes over time is vital to understanding the evolution of SARS-CoV-2. We can calculate this prevalence for any lineage, mutation, combination of mutations, or lineage with added mutations. We’ll access the data and then plot the results.
# Function to grab all the data for the prevalence of B.1.1.7 in Texas
b117_tx = getPrevalence(pangolin_lineage = "B.1.1.7", location = "Texas")
#> Retrieving data...
# Accessing just the first row:
t(b117_tx[1,])
#>                       1
#> date                  "2020-07-21"
#> total_count           "440"
#> lineage_count         "1"
#> total_count_rolling   "379.2857"
#> lineage_count_rolling "0.1428571"
#> proportion            "0.0003766478"
#> proportion_ci_lower   "1.294751e-06"
#> proportion_ci_upper   "0.006601559"
#> lineage               "B.1.1.7"
#> location              "Texas"
# Plotting it:
plotPrevalenceOverTime(b117_tx, title = "B.1.1.7 prevalence in Texas")
Variables are described in the Genomics Data Dictionary.
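As a quick check on how the columns relate, the first row printed above is consistent with proportion being the ratio of the two rolling columns, and with the rolling values being 7-day averages (1/7 ≈ 0.1428571). This relationship is inferred from the printed numbers rather than taken from the package documentation, so treat the sketch below as an illustration; the Genomics Data Dictionary remains the authoritative description.

```r
# Values copied from the first row of b117_tx shown above
lineage_count_rolling <- 0.1428571
total_count_rolling   <- 379.2857
reported_proportion   <- 0.0003766478

# ASSUMPTION (inferred from the numbers, not from the package docs):
# proportion = lineage_count_rolling / total_count_rolling
estimated_proportion <- lineage_count_rolling / total_count_rolling

# Compare with the reported value, allowing for rounding in the printed output
matches_reported <- isTRUE(all.equal(estimated_proportion, reported_proportion,
                                     tolerance = 1e-5))
print(matches_reported)
```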
In addition to viewing the change in prevalence of a single lineage over time, you can query more complex variants. These include combinations of lineages, like the Delta variant; lineages with additional mutation(s), like B.1.1.7 with S:E484K; individual mutations, like S:E484K; and groups of mutations, like S:E484K and S:P681R. View the Variant Tracker and Location Tracker vignettes for more advanced examples.
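A minimal sketch of how three of those query types might be expressed. It assumes that getPrevalence accepts a mutations argument alongside pangolin_lineage; that argument name is an assumption here, so check ?getPrevalence or the Variant Tracker vignette before relying on it. The queries are only assembled as argument lists, so nothing is sent to the API until you run the commented-out line after authenticating. Combinations of lineages such as Delta are left to the vignettes.

```r
# Argument lists for three of the query types described above.
# ASSUMPTION: `mutations` is the name of the getPrevalence() argument that
# takes one or more mutations; verify with ?getPrevalence.
variant_queries <- list(
  lineage_with_mutation = list(pangolin_lineage = "B.1.1.7", mutations = "S:E484K"),
  single_mutation       = list(mutations = "S:E484K"),
  mutation_group        = list(mutations = c("S:E484K", "S:P681R"))
)

# Show which arguments each query would pass
str(variant_queries)

# To run the queries (requires library(outbreakinfo) and authenticateUser()):
# results <- lapply(variant_queries, function(args) do.call(getPrevalence, args))
```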
To compare the prevalence of a particular variant between locations, you can access the data through the getCumulativeBySubadmin function. Note that there are options to supply a location, like "United States" to view the prevalence broken down by U.S. state, and/or to restrict the calculation to the past n days. See the Variant Tracker vignette for more details.
# Calculate cumulative prevalence of B.1.1.7 by country
b117_world = getCumulativeBySubadmin(pangolin_lineage = "B.1.1.7")
#> Retrieving data...
# Filter down the data to view a few countries
b117_world %>%
  filter(name %in% c("Canada", "United Kingdom", "Australia", "New Zealand")) %>%
  select(name, proportion, proportion_ci_lower, proportion_ci_upper) %>%
  arrange(desc(proportion)) %>%
  knitr::kable()
What does a lineage name like B.1.1.7 actually mean? The getMutationsByLineage function lets you pull the prevalence of every mutation found in the sequences assigned to B.1.1.7 or other lineages, and plotMutationHeatmap lets you compare those prevalences in a heatmap:
char_muts = getMutationsByLineage(pangolin_lineage = c("B.1.1.7", "B.1.351", "B.1.617.2", "P.1"))
#> Retrieving data...
#> Retrieving data...
#> Retrieving data...
#> Retrieving data...
plotMutationHeatmap(char_muts, title = "Mutations with at least 75% prevalence in Variants of Concern", lightBorders = FALSE)
All the Variant Prevalence Functions provide documentation on their functionality and examples.
All COVID-19 research metadata can be accessed through the main function getResourcesData, which searches across a series of COVID-19 resources of various types. For instance, you can find all research on seroprevalence, including publications, clinical trials, datasets, and more:
# Get the resources metadata
# Use `fetchAll = TRUE` to get all the results, not just the first 10
# Use double quotes around "sero-prevalence" to look for that exact phrase.
# Without quotes, the query will search for "sero" or "prevalence", not their combination.
# Combine terms using OR or AND (capitalization is required!)
seroprevalence = getResourcesData(query = 'seroprevalence OR "sero-prevalence"', fetchAll = TRUE, fields = c("name", "description", "@type", "date", "curatedBy", "journalName", "funding", "url"))
# Accessing just the first row:
t(seroprevalence[1,])
#>             [,1]
#> @type       "Publication"
#> _id         "pmid35317419"
#> curatedBy   list,4
#> date        19023
#> journalName "Health science reports"
#> name        "Sero-prevalence of SARS-CoV-2 in certain cities of Kazakhstan."
#> url         "https://www.doi.org/10.1002/hsr2.562"
#> funding     NA
#> description "NA"
#> _ignored    NA
# Plot the increase in seroprevalence research over time
# Roll up the number of resources by week
resources_by_date = seroprevalence %>%
  mutate(year = lubridate::year(date), iso_week = lubridate::isoweek(date))
# Count the number of new resources per week.
resources_per_week = resources_by_date %>%
  count(`@type`, iso_week, year) %>%
  # convert from iso week back to a date
  mutate(iso_date = lubridate::parse_date_time(paste(year, iso_week, "Mon", sep = "-"), "Y-W-a"))
ggplot(resources_per_week, aes(x = iso_date, y = n)) +
  geom_col(fill = "#66c2a5") +
  scale_x_datetime(date_labels = "%b %Y", name = "week") +
  ggtitle("Seroprevalence research by week", subtitle = paste0("Number of resources in outbreak.info's Research Library as of ", format(Sys.Date(), "%d %B %Y")))
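In the first row printed above, date appears as 19023, most likely because transposing the data frame drops the Date class and leaves the underlying count of days since 1970-01-01. The base-R sketch below decodes that value and shows one alternative way to roll a date back to the Monday of its week. The helper name week_start is hypothetical and is not part of outbreakinfo.

```r
# A Date in R is stored as the number of days since 1970-01-01
first_date <- as.Date(19023, origin = "1970-01-01")
print(first_date)

# Hypothetical helper: roll a Date back to the Monday that starts its week.
# format(x, "%u") gives the weekday as a number, with Monday = 1 and Sunday = 7.
week_start <- function(x) {
  x - (as.integer(format(x, "%u")) - 1L)
}

# A Monday, a Thursday, and a Sunday from the same week
example_dates <- as.Date(c("2022-01-31", "2022-02-03", "2022-02-06"))
print(week_start(example_dates))
```

Grouping by a week-start date like this avoids pairing a calendar year with an ISO week number, which can disagree in the first and last days of a year.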
# Visualize the breakdown of seroprevalence research
resources_by_type = seroprevalence %>%
  count(`@type`) %>%
  arrange(n)
# Order the levels in the bar chart
resources_by_type$`@type` = factor(resources_by_type$`@type`, resources_by_type %>% pull(`@type`))
ggplot(resources_by_type, aes(x = `@type`, y = n, fill = `@type`)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c(Publication = "#e15759", ClinicalTrial = "#b475a3", Dataset = "#126b93", Protocol = "#59a14f")) +
  ggtitle("Seroprevalence research by type of resource", subtitle = paste0("Number of resources in outbreak.info's Research Library as of ", format(Sys.Date(), "%d %B %Y"))) +
  theme_minimal() +
  theme(axis.title = element_blank(), legend.position = "none", panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())
All the Research Library Functions provide documentation on their functionality and examples.
Epidemiology data in outbreak.info include daily case and death data for World Bank regions, individual countries, states/provinces, U.S. metropolitan areas, and U.S. counties. These functions do not require a GISAID account or a call to authenticateUser() before use.
All case and death data can be accessed through the main function getEpiData and plotted with plotEpiData. For instance, you can compare the cases per capita in a few major metropolitan areas:
# Get the epi data
epi_metro = getEpiData(name = c("Detroit-Warren-Dearborn, MI", "New Orleans-Metairie, LA"))
# Accessing just the first row:
t(epi_metro[1,])
#> NA
# Get the epi data and plot it.
plotEpiData(locations = c("Detroit-Warren-Dearborn, MI", "New Orleans-Metairie, LA"), variable = "confirmed_rolling_per_100k")
#> detroit-warren-dearborn, mi not found. Please check spelling.
#> new orleans-metairie, la not found. Please check spelling.
#> "confirmed_rolling_per_100k is not a valid API field"
#> NULL
Variables are described in the Epidemiology Data Dictionary.
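The variable name confirmed_rolling_per_100k suggests a rolling average of daily confirmed cases scaled to 100,000 residents. The sketch below shows that arithmetic on made-up numbers. The 7-day window, the toy case counts and population, and the helper name rolling_per_100k are all assumptions for illustration, not values or code from outbreak.info; the Epidemiology Data Dictionary defines the actual calculation.

```r
# Hypothetical helper: trailing rolling mean of daily cases per 100,000 residents
rolling_per_100k <- function(daily_cases, population, window = 7) {
  # With sides = 1, stats::filter averages the current day and the
  # (window - 1) days before it; the first (window - 1) entries are NA
  # because their window is incomplete
  rolling <- stats::filter(daily_cases, rep(1 / window, window), sides = 1)
  as.numeric(rolling) / population * 1e5
}

# Made-up example: 8 days of case counts in a place with 1,000,000 residents
toy_cases <- c(70, 70, 70, 70, 70, 70, 70, 140)
toy_rate  <- rolling_per_100k(toy_cases, population = 1e6)
print(toy_rate)
```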
All the Cases & Deaths Functions provide documentation on their functionality and examples.