Visualizing

Some of the most popular features of the original tcrdist code were its beautiful SVG based graphics. For instance, gene usage plots visualize the frequency with which different V and J genes are used within a T cell repertoire. Because some users may wish to make these plots without using the other feature of tcrdist2, we have built them as stand-alone features. That means if you can create a pandas DataFrame you can make a gene-usage plot.

_images/plotpairing4.svg
tcrdist.gene_pairing_plot.plot_pairings(cell_df, cols, count_col=None, use_color_gradients=True, other_frequency_threshold=0.01)

Diagram depicts the gene-segment pairing structure of the dataset. The four genes (cols) are arrayed left to right. Below each gene-type label (eg “VA”) is a color-stack showing all the TCR clones and how they break down into the different genes for that gene-type. Each clone is devoted a constant vertical height in pixels indicated in the text at the top (N pixels in “Nx y-pixel scale”). The curved segments joining neighboring gene-stacks show how the two gene distributions pair up, with the thickness of the segments corresponding to the number of clones having those two segments (scaled by the indicated y-pixel scale).

Column names with format: VA or JB (for Valpha and Jbeta) will have the A, B, G, D replaced with its greek character.

Parameters:
  • cell_df (pd.DataFrame) – Contains gene segment data, one row per clone, optionally with a counts or frequency column.
  • cols (list) – List of columns for displaying frequency and pairings, in order from left to right.
  • count_col (str) – Optionally provide a count or frequency column for weighting the rows/clones
Returns:

raw_svg – Raw SVG txt that can be written to a file.

Return type:

str

Notes

For calling this function

import tcrdist as td
td.plotting.plot_pairings()

is equivalent to:

import tcrdist as td
td.gene_pairing_plot.plot_pairings()

I want to plot gene pairings

A popular feature of tcrdist is the ability to visualize the gene usage pattern of a T cell repertoire specific for a particular epitope. This feature has been re-implemented in tcrdist2 and is part of the plotting module.

To illustrate how tcrdist.plotting.plot_pairings() is based on a data frame of gene usage we simulate some data.

Notice only a few columns are necessary: (i) one for each gene (ii) and one specifying the count.

An Example With Simulated Data

import tcrdist as td
import pandas as pd
import numpy as np
import IPython
from tcrdist import plotting

np.random.seed(110820)
n = 50
df = pd.DataFrame({'v_a_gene':np.random.choice(['TRAV14', 'TRAV12', 'TRAV3',
                                'TRAV23', 'TRAV11', 'TRAV6', 'TRAV89'], n),
                   'j_a_gene':np.random.choice(['TRAJ4', 'TRAJ2', 'TRAJ3',
                                'TRAJ5', 'TRAJ21', 'TRAJ13'], n),
                   'v_b_gene':np.random.choice(['TRBV14', 'TRBV12',
                                'TRBV3'], n),
                   'j_b_gene':np.random.choice(['TRBJ4', 'TRBJ2', 'TRBJ3',
                                'TRBJ5', 'TRBJ21', 'TRBJ13'], n)})

df = df.assign(count=1)
df.loc[:10, 'count'] = 10 # Sim expansion of the genes used in the first 10 rows

svg = plotting.plot_pairings(cell_df = df,
                                cols = ['j_a_gene', 'v_a_gene',
                                        'v_b_gene', 'j_b_gene'],
                                count_col='count')
IPython.display.SVG(data=svg)
_images/plotpairing1.svg

The same type of visualization can be made for bulk unpaired single-chain data, by specifying the cols list as follows. Note the plot depends on the gene order specified in the cols argument

svg = td.plotting.plot_pairings(cell_df = df,
                               cols = ['v_a_gene', 'j_a_gene'],
                               count_col='count')
IPython.display.SVG(data=svg)
_images/plotpairing2.svg

An Example With Real Data

The code below produces a figure similar to that in the tcrdist manuscript, where the dominance of the TRBV29*01 is particularly evident.

  1. Load the data from Dash et al. paper.
  2. Use a mapper to convert vdjDB data to tcrdist2 input pd.DataFrame.
  3. Subset the data to consider only mouse sequences specific to the ‘PA’ epitope.

To quickly get data associated with these and other snippets, see the section on Downloads. The datafile referenced here is (vdjDB_PMID28636592).

import tcrdist as td
import pandas as pd
import numpy as np
import IPython
from tcrdist import plotting

pd_df = pd.read_csv("vdjDB_PMID28636592.tsv", sep = "\t")       # 1
t_df = td.mappers.vdjdb_to_tcrdist2(pd_df = pd_df)              # 2

organism =  "MusMusculus"                                       # 3
epitope  =  "PA"

ind_org   = t_df.organism == organism
ind_epi   = t_df.epitope  == epitope

t_df_spec = t_df.loc[(ind_org)&(ind_epi),:]

svg = plotting.plot_pairings(cell_df = t_df_spec,
                                cols = ['j_a_gene', 'v_a_gene',
                                        'v_b_gene', 'j_b_gene'],
                                count_col='count',
                                other_frequency_threshold = 0.01)
IPython.display.SVG(data=svg)
_images/plotpairing4.svg