Viral Mutation Analysis

Lineage_Mutations Heatmap

A basic but important question: how do we define a lineage? Which mutations appear consistently in most sequences within a lineage? We can start by plotting some of the characteristic mutations of XBB, here taken to be those present in at least 80% of its sequences. Better still, we can do this for multiple related lineages and compare them using a heatmap:

# Perform authentication if you haven't already
from outbreak_data import authenticate_user
authenticate_user.authenticate_new_user()

# Import outbreak_data package
from outbreak_data import outbreak_data as od


# pandas is needed to combine the per-lineage results
import pandas as pd

# Collect the first 15 characteristic mutations and their prevalences for each lineage
lineages = ["xbb", "xbb.1", "xbb.1.5", "xbb.1.16", "xbb.1.9", "xbb.2.3"]

frames = []
for name in lineages:
    lin = od.lineage_mutations(name, freq = 0.80)
    lin["lineage"] = name
    frames.append(lin.iloc[:15])

# Link each of the findings together for plotting
df = pd.concat(frames)

# Express prevalence as a percentage and label the column accordingly
df["prevalence"] = df["prevalence"] * 100
df = df.rename(columns={'prevalence': 'prevalence %'})

# Plot mutation heatmap
import altair as alt

alt.Chart(df).mark_rect().encode(
    x = "mutation:N",
    y = "lineage:N",
    color = 'prevalence %:Q')
Output
         mutation  mutation_count  lineage_count  lineage   gene ref_aa alt_aa  \
 0      orf6:d61l            4198           4423      xbb   ORF6      D      L
 1         s:l24s            3919           4423      xbb      S      L      S
 2        n:r203k            4378           4423      xbb      N      R      K
 3        s:g339h            4054           4423      xbb      S      G      H
 4        s:v445p            3643           4423      xbb      S      V      P
 ...           ...             ...            ...      ...    ...    ...    ...
 10  orf1a:p3395h            3263           3279  xbb.2.3  ORF1a      P      H
 11       s:n969k            3261           3279  xbb.2.3      S      N      K
 12  orf1a:g1307s            3257           3279  xbb.2.3  ORF1a      G      S
 13       s:d614g            3257           3279  xbb.2.3      S      D      G
 14       s:q954h            3257           3279  xbb.2.3      S      Q      H

     codon_num codon_end          type  prevalence % change_length_nt
 0          61      None  substitution     94.912955             None
 1          24      None  substitution     88.605019             None
 2         203      None  substitution     98.982591             None
 3         339      None  substitution     91.657246             None
 4         445      None  substitution     82.364911             None
 ...        ...       ...           ...           ...              ...
 10       3395      None  substitution     99.512046             None
 11        969      None  substitution     99.451052             None
 12       1307      None  substitution     99.329064             None
 13        614      None  substitution     99.329064             None
 14        954      None  substitution     99.329064             None

[90 rows x 12 columns]

_images/mut_by_lin.png
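
Judging from the printed rows, the prevalence % column appears to be mutation_count divided by lineage_count, expressed as a percentage. The sketch below checks that reading against three rows copied from the output above. The helper name prevalence_percent is hypothetical and is not part of outbreak_data.

```python
# Rows copied from the printed output above:
# (mutation, mutation_count, lineage_count, printed prevalence %)
rows = [
    ("orf6:d61l", 4198, 4423, 94.912955),
    ("s:l24s", 3919, 4423, 88.605019),
    ("n:r203k", 4378, 4423, 98.982591),
]


def prevalence_percent(mutation_count, lineage_count):
    """Share of a lineage's sequences that carry the mutation, in percent."""
    return 100 * mutation_count / lineage_count


for mutation, count, total, printed in rows:
    computed = prevalence_percent(count, total)
    print(f"{mutation}: computed {computed:.6f}, printed {printed:.6f}")
```

If the two columns agree for every row, the freq = 0.80 argument can be read as a cutoff on this ratio.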

Comparing Lineages

To better understand the evolution of SARS-CoV-2, we may also want to know which mutations are shared between different viral lineages, and how many.

For this example we’ll look at three different lineages: B.1.1.7, P.1, and B.1.1.529. First, let’s get an idea of which mutations are shared between all three lineages:

lin1 = od.lineage_mutations("b.1.1.7")
lin1["lineage"] = "b.1.1.7"

lin2 = od.lineage_mutations("P.1")
lin2["lineage"] = "P.1"

lin3 = od.lineage_mutations("b.1.1.529")
lin3["lineage"] = "b.1.1.529"

# Find intersections

# Optional convenience helper: returns the elements common to both collections
def intersect(set1, set2):
    return set(set1).intersection(set2)

a = set(lin1['mutation'])
b = set(lin2['mutation'])
c = set(lin3['mutation'])

m12 = intersect(a, b)
m13 = intersect(a, c)
m23 = intersect(b, c)
m123 = intersect(m12, c)

print('Set 1.2:', m12)
print('Set 1.3:', m13)
print('Set 2.3:' , m23)
print('Set 1.2.3:', m123)
Output
 Set 1.2: {'orf1b:p314l', 'orf8:s84l', 'orf1a:del3675/3677', 'n:r203k', 's:d614g', 's:n501y', 'n:g204r'}
 Set 1.3: {'s:p681h', 'orf1b:p314l', 'orf8:s84l', 'n:r203k', 's:d614g', 'n:g204r'}
 Set 2.3: {'orf1b:p314l', 's:h655y', 'orf8:s84l', 'n:r203k', 's:d614g', 'n:g204r'}
 Set 1.2.3: {'orf1b:p314l', 'orf8:s84l', 'n:r203k', 's:d614g', 'n:g204r'}
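
Each pairwise set above also contains the mutations common to all three lineages. As a minimal sketch, the printed sets can be reduced to the mutations shared by exactly two lineages by subtracting Set 1.2.3. The sets below are copied from the output above; the only_* names are illustrative.

```python
# Sets copied from the printed output above
m12 = {'orf1b:p314l', 'orf8:s84l', 'orf1a:del3675/3677', 'n:r203k',
       's:d614g', 's:n501y', 'n:g204r'}
m13 = {'s:p681h', 'orf1b:p314l', 'orf8:s84l', 'n:r203k', 's:d614g', 'n:g204r'}
m23 = {'orf1b:p314l', 's:h655y', 'orf8:s84l', 'n:r203k', 's:d614g', 'n:g204r'}
m123 = {'orf1b:p314l', 'orf8:s84l', 'n:r203k', 's:d614g', 'n:g204r'}

# The three-way intersection is also the intersection of the pairwise sets
assert set.intersection(m12, m13, m23) == m123

# Mutations shared by exactly two of the three lineages
only_12 = m12 - m123  # B.1.1.7 and P.1, but not B.1.1.529
only_13 = m13 - m123  # B.1.1.7 and B.1.1.529, but not P.1
only_23 = m23 - m123  # P.1 and B.1.1.529, but not B.1.1.7

print('Only 1.2:', sorted(only_12))
print('Only 1.3:', sorted(only_13))
print('Only 2.3:', sorted(only_23))
```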

Finally, we can visualize these findings as a Venn diagram:

# Import libraries for venn diagram visual
from matplotlib_venn import venn3, venn3_circles
from matplotlib import pyplot as plt

# Use matplotlib_venn to draw the Venn diagram with custom visuals.
# The package counts the members of each region itself and plots accordingly.
vd3 = venn3(
    [a, b, c],
    set_labels = ('Alpha/B.1.1.7', 'Gamma/P.1', 'Omicron/B.1.1.529'),
    set_colors = ('#c4e6ff', '#F4ACB7', '#9D8189'),
    alpha = 0.8)

# Keep the outlines under their own name so the mutation set `c` is not overwritten
circles = venn3_circles([a, b, c], linestyle = '-.', linewidth = 2, color = 'grey')

# Enlarge the set labels and the region counts (empty regions have no label)
for text in vd3.set_labels:
    if text is not None:
        text.set_fontsize(16)
for text in vd3.subset_labels:
    if text is not None:
        text.set_fontsize(16)

plt.title('Number of Overlapping Mutations in 3 Variants', fontname = 'Helvetica',
          fontweight = 'bold', fontsize = 18, pad = 30,
          backgroundcolor = '#cbe7e3', color = 'black', style = 'normal')

# Emphasize the first circle with a thicker, dotted outline
circles[0].set_lw(7.0)
circles[0].set_ls(':')
plt.show()
_images/compare.png

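The counts in the diagram agree with the intersections we computed earlier, once we account for the fact that each region of a Venn diagram counts only the mutations exclusive to that region. For example, Set 1.2 holds seven mutations, but five of them also belong to Set 1.2.3, so the region shared only by Alpha and Gamma shows two. The diagram also shows the number of mutations unique to each lineage.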
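
To check a three-set diagram by hand, the seven exclusive region sizes can be computed directly with set operations. The sketch below uses small made-up sets rather than real lineage data, and the function name venn3_region_sizes is hypothetical (it is not part of matplotlib_venn).

```python
def venn3_region_sizes(a, b, c):
    """Count the members of each of the seven exclusive regions of three sets."""
    return {
        'A only': len(a - b - c),
        'B only': len(b - a - c),
        'C only': len(c - a - b),
        'A and B only': len((a & b) - c),
        'A and C only': len((a & c) - b),
        'B and C only': len((b & c) - a),
        'A, B and C': len(a & b & c),
    }


# Made-up placeholder sets, not real mutation data
a = {'m1', 'm2', 'm3', 'm4'}
b = {'m3', 'm4', 'm5'}
c = {'m4', 'm5', 'm6', 'm7'}

sizes = venn3_region_sizes(a, b, c)
for region, size in sizes.items():
    print(f'{region}: {size}')

# The regions do not overlap, so their sizes add up to the size of the union
assert sum(sizes.values()) == len(a | b | c)
```

Passing the three mutation sets from the example above in place of the placeholder sets should reproduce the numbers printed inside the diagram.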

Note

Please see the matplotlib venn diagram documentation and this article for more info on how to create these graphs.