Lineage Prevalence Analysis

The outbreak_data package contains multiple endpoints that collect information on SARS-CoV-2 lineages. Combining data from several endpoints lets you conduct your own analysis of how SARS-CoV-2 has progressed. On this page, you’ll find a few example workflows that demonstrate how to collect, manipulate, and visualize lineage prevalence data.

Here is how we would collect data on all the XBB lineages circulating in India within a chosen date range:

# Perform authentication if you haven't already
from outbreak_data import authenticate_user
authenticate_user.authenticate_new_user()

# Import outbreak_data package
from outbreak_data import outbreak_data as od

# Get the prevalence of all circulating XBB lineages in India
data = od.prevalence_by_location("IND", startswith='xbb')
# Convert prevalence values from fractions to percentages
data['prevalence_rolling'] = data['prevalence_rolling'] * 100
# Sort by date, then keep only the rows within the date range of interest
data = data.sort_values(by="date")
data = data.loc[data["date"].between("2022-09-12", "2023-03-31")]

# Use the visualization package of your choice to create an area graph from your data
import altair as alt

# Graph of results
alt.Chart(data, title="Lineage Prevalence in India").mark_area().encode(
    x='date:T',
    y=alt.Y('prevalence_rolling:Q'),
    color='lineage:N')

Output:

            date  total_count  lineage_count     lineage  prevalence  \
2022-09-12            0              0    xbb.1.16    0.000000
2022-09-12            0              0     xbb.2.3    0.000000
2022-09-12          152              2       xbb.1    0.013158
2022-09-13            0              0     xbb.2.3    0.000000
2022-09-13            0              0    xbb.1.16    0.000000
 ...         ...          ...            ...         ...         ...
2023-03-31          196              2   xbb.2.3.2    0.010204
2023-03-31          196             29  xbb.1.16.1    0.147959
2023-03-31          196              1       xbb.1    0.005102
2023-03-31          196              7  xbb.1.16.2    0.035714
2023-03-31          196             15     xbb.2.3    0.076531

      prevalence_rolling
          0.000000
          0.000000
          0.003451
          0.000000
          0.000000
 ...                  ...
          0.031184
          0.144578
          0.014174
          0.045358
          0.084337
[985 rows x 6 columns]
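
The prevalence column in the output above is consistent with lineage_count divided by total_count (for example, 2 / 152 ≈ 0.013158). The following is a minimal sketch that recomputes it on a few rows copied by hand from that table, treating days with no sequences as 0 instead of dividing by zero. The column names prevalence_check and prevalence_pct are illustrative and are not part of the package.

```python
import pandas as pd

# A few rows copied by hand from the output above (not fetched from the API)
rows = pd.DataFrame({
    "date": ["2022-09-12", "2022-09-12", "2023-03-31", "2023-03-31"],
    "lineage": ["xbb.1.16", "xbb.1", "xbb.1.16.1", "xbb.2.3"],
    "total_count": [0, 152, 196, 196],
    "lineage_count": [0, 2, 29, 15],
})

# Divide only where sequences exist; days with no sequences stay at 0.0
has_data = rows["total_count"] > 0
rows["prevalence_check"] = 0.0
rows.loc[has_data, "prevalence_check"] = (
    rows.loc[has_data, "lineage_count"] / rows.loc[has_data, "total_count"]
)

# Express the fraction as a percentage for plotting
rows["prevalence_pct"] = rows["prevalence_check"] * 100
print(rows.round(6))
```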

Note

The Vega-Altair visualization package is used here for demonstration purposes. However, any Python visualization package can be used to create graphical representations of the data.
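
Some plotting packages expect one column per series instead of the long layout (one row per date and lineage) shown above. As a rough sketch, assuming a frame with date, lineage, and prevalence_rolling columns, pandas can reshape it as follows. The values here are made up for illustration.

```python
import pandas as pd

# Small hand-made frame with the same long layout as the query result
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-03-30", "2023-03-30", "2023-03-31", "2023-03-31"]
    ),
    "lineage": ["xbb.1", "xbb.2.3", "xbb.1", "xbb.2.3"],
    "prevalence_rolling": [1.2, 8.0, 1.4, 8.4],
})

# One row per date, one column per lineage; missing combinations become 0
wide_df = (
    long_df.pivot(index="date", columns="lineage", values="prevalence_rolling")
    .fillna(0)
    .sort_index()
)
print(wide_df)

# With matplotlib installed, wide_df.plot.area() would draw a stacked area chart
```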

Finding the Most Prevalent Lineages

If we want to determine and plot the top four most prevalent lineages in India, we can make a query and use a few simple commands to create a table that shows us what these lineages are:

data = od.prevalence_by_location("IND")
most_prev = data.groupby('lineage').apply(max)  # Column-wise maximum for each lineage
most_prev = most_prev.mask(most_prev == '').dropna(how='any')  # Drop rows with unknown (empty) values
most_prev = most_prev.iloc[:4]  # Keep the first four rows
print(most_prev)

Output:

                  date  total_count  lineage_count    lineage  prevalence  \
 lineage
 ba.2       2023-04-20         5668           1445       ba.2    0.822785
 ba.2.10.1  2023-04-19         5668             93  ba.2.10.1    0.285714
 bq.1.1     2023-03-27          402              7     bq.1.1    0.428571
 ch.1.1     2023-02-13          119              4     ch.1.1    0.400000

            prevalence_rolling
 lineage
 ba.2                 0.677541
 ba.2.10.1            0.095541
 bq.1.1               0.156863
 ch.1.1               0.066667
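
Note that `.iloc[:4]` keeps the first four rows in index order and does not rank them. If an explicit ranking is wanted, one possible approach is to take each lineage's peak rolling prevalence and keep the largest values. This sketch assumes peak prevalence_rolling is the ranking criterion, which the original does not state, and uses placeholder lineage names and made-up values.

```python
import pandas as pd

# Illustrative values only; a real frame would come from
# od.prevalence_by_location("IND"). Lineage names are placeholders.
df = pd.DataFrame({
    "lineage": ["a.1", "a.1", "b.2", "b.2", "c.3", "c.3", "", ""],
    "prevalence_rolling": [0.10, 0.30, 0.05, 0.60, 0.20, 0.25, 0.90, 0.80],
})

# Drop rows whose lineage is unknown (empty string)
known = df[df["lineage"] != ""]

# Peak rolling prevalence per lineage, then the two highest peaks
peak = known.groupby("lineage")["prevalence_rolling"].max()
top = peak.nlargest(2)
print(top)
```

Replacing `nlargest(2)` with `nlargest(4)` would keep four lineages, matching the example above.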

Next we’ll collect the prevalence data on each of the four lineages:

# Retrieve the official data on the prevalences of these lineages using `daily_prev()`
d1 = od.daily_prev('ba.2', "IND")
d2 = od.daily_prev('ba.2.10.1', "IND")
d3 = od.daily_prev('bq.1.1', "IND")
d4 = od.daily_prev('ch.1.1', "IND")

# Formatting for creating the graph
d1['lineage'] = 'ba.2'
d2['lineage'] = 'ba.2.10.1'
d3['lineage'] = 'bq.1.1'
d4['lineage'] = 'ch.1.1'

# Group together data from each lineage
import pandas as pd
data = pd.concat([d1, d2, d3, d4])

# Pick a date range to analyze
data = data.sort_values(by="date")
data = data.loc[data["date"].between("2022-09-12", "2023-03-31")]
# Convert proportions to percentages, stored under the column name used in the graph
data['proportion (%)'] = data['proportion'] * 100

# Graph using your preferred visualization package
import altair as alt
alt.Chart(data, title="Top 4 Most Prevalent Lineages in India").mark_area().encode(
    x='date:T',
    y=alt.Y('proportion (%):Q'),
    color='lineage:N')
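
The four separate queries and label assignments above could also be written as a loop, which scales more easily to a longer list of lineages. The sketch below uses a hypothetical stand-in function, fake_daily_prev, in place of od.daily_prev so that it runs without network access or authentication. It assumes only that the real call returns a frame with date and proportion columns, as used above.

```python
import pandas as pd

def fake_daily_prev(lineage, location):
    """Hypothetical stand-in for od.daily_prev; returns a tiny fixed frame."""
    return pd.DataFrame({
        "date": ["2023-03-30", "2023-03-31"],
        "proportion": [0.10, 0.20],
    })

lineages = ["ba.2", "ba.2.10.1", "bq.1.1", "ch.1.1"]

# Fetch each lineage, tag its rows with the lineage name, and stack the results
frames = [fake_daily_prev(name, "IND").assign(lineage=name) for name in lineages]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

In real use, `fake_daily_prev` would be swapped for `od.daily_prev`, with the rest of the workflow unchanged.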