Tabulating Meta-Clonotypes¶
Step-by-Step Example¶
This page offers a step-by-step explanation of how to tabulate meta-clonotype-conformant clones in a bulk TCRb repertoire using tcrdist.repertoire.TCRrep.compute_sparse_rect_distances()
and tcrdist.join.join_by_dist().
For copying into a new script, all of the code discussed below is also collected in a single block (Full Example).
import multiprocessing
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.join import join_by_dist
import re
Meta-clonotypes can be learned from antigen-associated TCR data collected via tetramer sorting or activation marker enrichment. However, a single TCR may conform to multiple meta-clonotypes, which should be a consideration when applying a set of meta-clonotypes together. For instance, when tallying conformant TCRs in a repertoire, each TCR should be counted only once. This example illustrates an efficient approach to tabulating meta-clonotype-conformant sequences in bulk repertoires.
Determine the appropriate number of CPUs based on your system:
ncpus = min(multiprocessing.cpu_count(), 6)
This block of code downloads the files that will be used in this example.
- 2 bulk repertoires that have been preprocessed to include valid IMGT names and
- Meta-clonotypes described in (Mayer-Blackwell et al. 2021), from the data v2.0 release of SARS-CoV-2 peptide-associated CD8 TCRs (Nolan et al. 2020)
files = ['1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
download_and_extract_zip_file("ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip", source = "dropbox", dest = ".")
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"):
download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")
We will consider meta-clonotypes as the search sequences, for instance:
feature | v_b_gene | cdr3_b_aa | radius | regex | protein_coordinate | protein |
M_55_1E6+TRBV28*01+CASSLRTDHYEQYF+22+(S[RLMF][RK][ST][ND].YEQ) | TRBV28*01 | CASSLRTDHYEQYF | 22 | (S[RLMF][RK][ST][ND].YEQ) | ORF1ab 1316:1330 | ORF1ab |
M_55_1E6+TRBV28*01+CASSLRSDSYEQYF+10+(SL[RK][ST][ND]SYEQ) | TRBV28*01 | CASSLRSDSYEQYF | 10 | (SL[RK][ST][ND]SYEQ) | ORF1ab 1316:1330 | ORF1ab |
M_55_1E6+TRBV5-5*01+CASSPGQGAFTDTQYF+22+(S.G[QE]G[AS]F[ST]DTQ) | TRBV5-5*01 | CASSPGQGAFTDTQYF | 22 | (S.G[QE]G[AS]F[ST]DTQ) | ORF1ab 1316:1330 | ORF1ab |
M_55_1E6+TRBV28*01+CASSLKTDSYEQYF+16+(S[RLMF][RGKS][ST][ND][ANQS]YEQ) | TRBV28*01 | CASSLKTDSYEQYF | 16 | (S[RLMF][RGKS][ST][ND][ANQS]YEQ) | ORF1ab 1316:1330 | ORF1ab |
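Each meta-clonotype couples a centroid TRBV gene and CDR3 with a TCRdist radius and a CDR3 motif encoded as a regular expression. As a minimal sketch, using the first regex from the table above, this is how such a motif identifies conformant CDR3s:

```python
import re

# Motif regex from the first meta-clonotype in the table above
pattern = "(S[RLMF][RK][ST][ND].YEQ)"

# The centroid's own CDR3 conforms to the motif...
print(re.search(pattern, "CASSLRTDHYEQYF") is not None)   # True

# ...while an unrelated CDR3 from another meta-clonotype does not
print(re.search(pattern, "CASSPGQGAFTDTQYF") is not None)  # False
```

This is the same `re.search` test applied to every joined sequence pair later in this example.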
df_search = pd.read_csv("bioRxiv_v2_metaclonotypes.tsv", sep = "\t")
df_bulk = pd.read_csv(files[0], sep = "\t")
df_bulk = df_bulk.sort_values('count').reset_index(drop = True)
df_bulk['rank'] = df_bulk.index.to_list()
Note
‘rank’ refers to the rank abundance of each clone and ensures that we get an accurate estimate of clonal breadth without collapsing sequences that are identical at the amino acid level
from tcrdist.repertoire import TCRrep
tr = TCRrep(
cell_df = df_search,
organism = "human",
chains = ['beta'],
compute_distances= False)
tr.cpus = ncpus
tr_bulk = TCRrep(
cell_df = df_bulk,
organism = "human",
chains = ['beta'],
compute_distances= False)
This is the only computationally demanding step: TCRdist is computed between the 4500 TCRs in the search DataFrame and the thousands of TCRs in the bulk repertoire. The computation is spread across multiple cores, and only distances within the TCRdist specified by the radius argument are retained, avoiding a massive memory demand.
Note
The chunk_size is the number of rows computed on any one node at a time. The helper function get_safe_chunk will, in most instances, choose a chunk size that keeps memory demand below 1-2 GB per core. By using multiple cores, this relatively large computational task can be completed quickly, without the memory that would otherwise be required to store the full set of pairwise distances.
from tcrdist.breadth import get_safe_chunk
chunk_size = get_safe_chunk(tr.clone_df.shape[0], tr_bulk.clone_df.shape[0])
tr.compute_sparse_rect_distances(
df = tr.clone_df,
df2 = tr_bulk.clone_df,
radius = 36,
chunk_size = chunk_size)
Sequences in the meta-clonotype search set are joined by distance with similar sequences in the bulk repertoire. Similar to database joins on a common key, we can perform a “left”, “inner”, or “outer” join operation (see tcrdist.join.join_by_dist() for details).
from tcrdist.join import join_by_dist
df_join = join_by_dist(
how = 'inner',
csrmat = tr.rw_beta,
left_df = tr.clone_df,
right_df = tr_bulk.clone_df,
left_cols = tr.clone_df.columns.to_list(),
right_cols = tr_bulk.clone_df.columns.to_list(),
left_suffix = '_search',
right_suffix = '_bulk',
max_n= 1000,
radius = 36)
A DataFrame <df_join> is returned where, for every search sequence in the LEFT-hand DataFrame, the <max_n> closest neighbor sequences in the RIGHT-hand DataFrame within the <radius> argument are returned.
Warning
Note that this may include redundant hits, since a TCR clone in the bulk (RIGHT) DataFrame could be a neighbor of multiple search sequences in the meta-clonotype (LEFT) DataFrame. We address this redundancy in the next few steps.
First, we create new columns recording which sequence pairs meet certain meta-clonotype definitions:
- We assign True to a column RADIUS for each joined pair where the bulk sequence is within its meta-clonotype’s search radius.
- We use regex matching to determine whether each joined sequence pair matches the motif constraint, assigning True to a column MOTIF to indicate a motif-conformant CDR3.
- We assign True to a column RADIUSANDMOTIF to indicate that both the RADIUS and MOTIF conditions are met for a given joined sequence pair.
df_join['RADIUS'] = df_join.apply(lambda x: x['dist'] <= x['radius_search'], axis = 1)
import re
df_join['MOTIF'] = df_join.apply(lambda x: re.search(string = x['cdr3_b_aa_bulk'], pattern = x['regex_search']) is not None, axis = 1)
df_join['RADIUSANDMOTIF'] = df_join['RADIUS'] & df_join['MOTIF']
df_join['unique_clones'] = 1
Tip
The number of unique clones (breadth) can be found by adding a column of ones for unique clones; summed per group, this column represents the breadth of unique clones per ORF
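As a toy illustration of the tip above (with made-up frequencies, not data from this example), summing the indicator column per group yields breadth:

```python
import pandas as pd

# Toy joined results: three distinct bulk clones hitting two proteins
toy = pd.DataFrame({
    'protein_search': ['S', 'S', 'N'],
    'count_bulk':     [0.0001, 0.0002, 0.0003],
})
toy['unique_clones'] = 1  # one indicator per unique clone

# Summing the indicator per protein gives the breadth of unique clones
breadth = toy.groupby('protein_search')['unique_clones'].sum()
print(breadth['S'], breadth['N'])  # 2 1
```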
Suppose you are interested in all RADIUS + MOTIF conforming sequences.
- First, we query those joined pairs and sort them in ascending order by distance.
- Then, because a sequence could show up as a match to multiple search meta-clonotypes, we use the pandas groupby and head methods on the ‘rank_bulk’ variable. This ensures that a given bulk clone appears only once in the final result, even if it matches two meta-clonotypes.
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1)
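The sort/groupby/head(1) pattern above can be illustrated on a toy DataFrame (with invented values): when one bulk clone matches two meta-clonotypes, only its closest match survives:

```python
import pandas as pd

# Toy join result: the bulk clone with rank_bulk=7 matched two meta-clonotypes
toy_join = pd.DataFrame({
    'feature_search': ['M_A', 'M_B', 'M_C'],
    'rank_bulk':      [7,      7,     9],
    'dist':           [12,     4,     0],
})

# Sort by distance, then keep one row per bulk clone (its closest match)
dedup = toy_join.sort_values('dist', ascending=True).groupby('rank_bulk').head(1)
print(sorted(dedup['feature_search']))  # ['M_B', 'M_C']
```

The clone with rank_bulk=7 is retained only for its closer match (dist 4 to M_B), so no bulk clone is counted twice.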
Now suppose we are interested in tabulating the number of unique RADIUS+MOTIF-conformant clones per ORF. Note that ‘count_bulk’ stores the productive frequency, whereas ‘templates_bulk’ stores the raw read count, which may be useful for count regression modeling.
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1).\
groupby('protein_search').\
sum()[['count_bulk', 'templates_bulk','unique_clones']]
count_bulk templates_bulk unique_clones
protein_search
E 0.000013 1 1
M 0.000078 6 6
N 0.001320 102 42
ORF1ab 0.000919 71 59
ORF3a 0.000414 32 27
ORF7a 0.000052 4 3
ORF7b 0.000013 1 1
S 0.001346 104 41
Alternatively, suppose we are interested in tabulating the number of unique RADIUS+MOTIF-conformant clones per epitope in the spike protein. Note that in the final groupby we use ‘protein_coordinate_search’ instead of ‘protein_search’.
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1).\
query('protein_search == "S"').\
groupby('protein_coordinate_search').\
sum()[['count_bulk', 'templates_bulk']]
count_bulk templates_bulk
protein_coordinate_search
S 1056:1069 0.000220 17
S 1192:1201 0.000026 2
S 265:278 0.000789 61
S 55:69 0.000246 19
S 706:723 0.000013 1
S 83:98 0.000026 2
S 860:875 0.000026 2
tcrdist.join.join_by_dist()
can be used in multiple contexts where finding non-exact matches between sets of TCRs is useful.
Full Example¶
import multiprocessing
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.join import join_by_dist
import re
ncpus = min(multiprocessing.cpu_count(), 6)
files = ['1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
download_and_extract_zip_file("ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip", source = "dropbox", dest = ".")
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"):
download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")
df_search = pd.read_csv("bioRxiv_v2_metaclonotypes.tsv", sep = "\t")
f = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv'
df_bulk = pd.read_csv(f, sep = "\t")
# When one wants to track each clone individually regardless of identical TRBV-CDR3-TRBJ
df_bulk = df_bulk.sort_values('count').reset_index(drop = True)
df_bulk['rank'] = df_bulk.index.to_list()
from tcrdist.repertoire import TCRrep
tr = TCRrep(
cell_df = df_search,
organism = "human",
chains = ['beta'],
compute_distances= False)
tr.cpus = ncpus
tr_bulk = TCRrep(
cell_df = df_bulk,
organism = "human",
chains = ['beta'],
compute_distances= False)
chunk_size = get_safe_chunk(tr.clone_df.shape[0], tr_bulk.clone_df.shape[0])
tr.compute_sparse_rect_distances(
df = tr.clone_df,
df2 = tr_bulk.clone_df,
radius = 36,
chunk_size = chunk_size)
df_join = join_by_dist(
how = 'inner',
csrmat = tr.rw_beta,
left_df = tr.clone_df,
right_df = tr_bulk.clone_df,
left_cols = tr.clone_df.columns.to_list(),
right_cols = tr_bulk.clone_df.columns.to_list(),
left_suffix = '_search',
right_suffix = '_bulk',
max_n= 1000,
radius = 36)
df_join['RADIUS'] = df_join.apply(lambda x: x['dist'] <= x['radius_search'], axis = 1)
import re
df_join['MOTIF'] = df_join.apply(lambda x: re.search(string = x['cdr3_b_aa_bulk'],
pattern = x['regex_search']) is not None, axis = 1)
df_join['RADIUSANDMOTIF'] = df_join['RADIUS'] & df_join['MOTIF']
df_join['unique_clones'] = 1
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1).\
groupby('protein_search').\
sum()[['count_bulk', 'templates_bulk','unique_clones']]
-
tcrdist.repertoire.TCRrep.
compute_sparse_rect_distances
(self, df=None, df2=None, radius=50, chunk_size=100)¶ Computes a sparse rectangular distance matrix as a scipy.sparse csr matrix, retaining only distances <= radius. This function is crucial when working with bulk data. Values > radius are zeros and are not retained, saving memory. True zeros are represented as -1.
NOTE: if this function is called without explicitly supplying the df or df2 arguments, it computes all pairwise distances between clones in TCRrep.clone_df by default
Parameters: - df (DataFrame or None) – clone_df (will specify rows of the distance matrix)
- df2 (DataFrame or None) – clone_df (will specify columns of the distance matrix)
- radius (int) – The maximum radius within which distance information is retained in the scipy.sparse csr matrix. All distances greater than this radius are discarded, saving memory.
- chunk_size (int) – How many rows to compute at a time. To manage memory, computations are done on a chunk of rows at a time, and chunks are spread over multiple CPUs. See from tcrdist.breadth import get_safe_chunk if you want to tune the chunk size to the scale of the task.
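The sparse storage convention described above can be illustrated with a hand-built matrix (a sketch, not actual tcrdist output): distances beyond the radius are simply absent from the csr matrix, while true zero distances are stored as the sentinel -1 so they are not lost in the sparse representation:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hand-built stand-in for a sparse rectangular distance matrix:
# distances > radius were dropped (implicit zeros in the csr matrix),
# and a true distance of 0 is encoded as -1.
rw = csr_matrix(np.array([
    [-1, 12,  0],   # the 0 here means "beyond radius, dropped", not distance 0
    [ 0, 24, 36],
]))

# Recover the actual retained distances: map the -1 sentinel back to 0
dist = rw.data.copy()
dist[dist == -1] = 0
print(sorted(dist.tolist()))  # [0, 12, 24, 36]
```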
-
tcrdist.join.
join_by_dist
(csrmat, left_df, right_df, how='inner', left_cols=['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'subject'], right_cols=['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'subject'], left_suffix='_x', right_suffix='_y', max_n=5, radius=1, radius_list=None, sort_by_dist=True)¶ Join two sets of TCRs, based on a distance threshold encoded in a sparse matrix csrmat.
The TCRs in the Left DataFrame are joined with TCRs in the Right DataFrame, for up to max_n closest TCRs where the paired distance is less than that specified in the radius or radius_list arguments.
This is analogous to SQL-type joins, except that instead of matching common keys, the rows of the Left and Right DataFrames are merged based on finding similar TCRs within a specified radius of sequence divergence. Intuitively, this permits fuzzy matching between similar TCRs in any two TCR repertoire data sets.
Crucially, one must provide a scipy.sparse csr matrix, which can be pre-computed using
tcrdist.rep_funcs.compute_pws_sparse()
or tcrdist.repertoire.TCRrep.compute_sparse_rect_distances()
It is also possible to join using a unique radius for each sequence in the Left DataFrame via the radius_list argument instead of the fixed radius argument. If using a radius list, its length must match the number of rows in csrmat and the number of rows in the Left DataFrame (i.e., len(radius_list) == left_df.shape[0]).
Parameters: - csrmat (scipy.sparse.matrix) – rows must correspond to the index of left_df; columns must correspond to the index of right_df
- left_df (pandas.DataFrame) – Clone DataFrame
- right_df (pandas.DataFrame) – Clone DataFrame
- how (str) –
Must be one of ‘inner’, ‘left’, or ‘outer’, which determines the type of merge to be performed.
- ’inner’ uses the intersection of the top max_n neighbors between the Left and Right DataFrames, dropping rows where there is no match.
- ’left’ uses the top max_n rows from the Right DataFrame neighboring rows in the Left DataFrame, producing NAs where a TCR in the Left DataFrame has no neighbor in the Right DataFrame.
- ’outer’ performs a FULL OUTER JOIN, combining the top max_n neighbors of both the Left and Right DataFrames, producing NAs where a TCR in either DataFrame has no neighbor in the other.
- (hint: right joins are not possible, unless you switch the input DataFrame order and recompute the sparse matrix)
- left_cols (list) – all columns to include in left_df
- right_cols (list) – all columns to include in right_df
- left_suffix (str) – appends to left columns
- right_suffix (str) – appends to right columns
- max_n (int) – limit on the number of neighbors to join per row. For instance, if a TCR has 100 neighbors and max_n is 10, only the first 10 rows in the right_df will be included (this avoids a massive blowup in cases where knowing about a few neighbors would suffice)
- radius (int) – default is 1
Returns: left_right_df – concatenation of rows from the left and right DataFrames for all sequence pairs within the specified distance
Return type: pandas DataFrame
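The ‘inner’ join semantics can be sketched without tcrdist, using only the structure of the csr matrix (a simplified stand-in built with invented sequences and distances, not join_by_dist’s actual implementation):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy search (left) and bulk (right) clone tables
left_df = pd.DataFrame({'cdr3_b_aa': ['CASSLRTDHYEQYF', 'CASSPGQGAFTDTQYF']})
right_df = pd.DataFrame({'cdr3_b_aa': ['CASSLRSDSYEQYF', 'CASSQETQYF', 'CASSPGQGAFSDTQYF']})

# Toy sparse distances: rows index left_df, columns index right_df;
# only within-radius pairs are stored
csrmat = csr_matrix(np.array([
    [12, 0, 0],   # left 0 is within radius of right 0 only
    [ 0, 0, 6],   # left 1 is within radius of right 2 only
]))

# Inner-join style expansion: one output row per retained (left, right) pair
rows, cols = csrmat.nonzero()
pairs = pd.DataFrame({
    'cdr3_b_aa_search': left_df['cdr3_b_aa'].values[rows],
    'cdr3_b_aa_bulk': right_df['cdr3_b_aa'].values[cols],
    'dist': csrmat.data,
})
print(len(pairs))  # 2
```

Each stored distance becomes one joined row; TCRs with no within-radius neighbor contribute nothing, which is the ‘inner’ behavior described above.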