Tabulating Meta-Clonotypes¶
Step-by-Step Example¶
This page offers a step-by-step explanation of how to tabulate meta-clonotype-conformant clones in a bulk TCRb repertoire using tcrdist.repertoire.TCRrep.compute_sparse_rect_distances()
and tcrdist.join.join_by_dist().
For copying into a new script, all of the code discussed below is also collected in a single block (Full Example).
import multiprocessing
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.join import join_by_dist
import re
Meta-clonotypes can be learned from antigen-associated TCR data collected via tetramer sorting or activation marker enrichment. However, a single TCR may conform to multiple meta-clonotypes, which should be a consideration when applying a set of meta-clonotypes together. For instance, when tallying conformant TCRs in a repertoire, each TCR should be counted only once. This example illustrates an efficient approach to tabulating meta-clonotype-conformant sequences in bulk repertoires.
Determine the appropriate number of CPUs based on your system:
ncpus = min(multiprocessing.cpu_count(), 6)
This block of code downloads the files that will be used in this example.
- 2 bulk repertoires that have been preprocessed to include valid IMGT names and
- Meta-clonotypes described in (Mayer-Blackwell et al. 2021), from the data v2.0 release of SARS-CoV-2 peptide-associated CD8 TCRs (Nolan et al. 2020)
files = ['1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
download_and_extract_zip_file("ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip", source = "dropbox", dest = ".")
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"):
download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")
We will consider meta-clonotypes as the search sequences, for instance:
feature | v_b_gene | cdr3_b_aa | radius | regex | protein_coordinate | protein |
M_55_1E6+TRBV28*01+CASSLRTDHYEQYF+22+(S[RLMF][RK][ST][ND].YEQ) | TRBV28*01 | CASSLRTDHYEQYF | 22 | (S[RLMF][RK][ST][ND].YEQ) | ORF1ab 1316:1330 | ORF1ab |
M_55_1E6+TRBV28*01+CASSLRSDSYEQYF+10+(SL[RK][ST][ND]SYEQ) | TRBV28*01 | CASSLRSDSYEQYF | 10 | (SL[RK][ST][ND]SYEQ) | ORF1ab 1316:1330 | ORF1ab |
M_55_1E6+TRBV5-5*01+CASSPGQGAFTDTQYF+22+(S.G[QE]G[AS]F[ST]DTQ) | TRBV5-5*01 | CASSPGQGAFTDTQYF | 22 | (S.G[QE]G[AS]F[ST]DTQ) | ORF1ab 1316:1330 | ORF1ab |
M_55_1E6+TRBV28*01+CASSLKTDSYEQYF+16+(S[RLMF][RGKS][ST][ND][ANQS]YEQ) | TRBV28*01 | CASSLKTDSYEQYF | 16 | (S[RLMF][RGKS][ST][ND][ANQS]YEQ) | ORF1ab 1316:1330 | ORF1ab |
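Each meta-clonotype couples a centroid TRBV gene and CDR3 with a TCRdist radius and a CDR3 motif encoded as a regular expression. As a minimal sketch, using the first regex from the table above, this is how such a motif identifies conformant CDR3s:

```python
import re

# Motif regex from the first meta-clonotype in the table above
pattern = "(S[RLMF][RK][ST][ND].YEQ)"

# The centroid's own CDR3 conforms to the motif...
print(re.search(pattern, "CASSLRTDHYEQYF") is not None)   # True

# ...while an unrelated CDR3 from another meta-clonotype does not
print(re.search(pattern, "CASSPGQGAFTDTQYF") is not None)  # False
```

This is the same `re.search` test applied to every joined sequence pair later in this example.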
df_search = pd.read_csv("bioRxiv_v2_metaclonotypes.tsv", sep = "\t")
df_bulk = pd.read_csv(files[0], sep = "\t")
df_bulk = df_bulk.sort_values('count').reset_index(drop = True)
df_bulk['rank'] = df_bulk.index.to_list()
Note
‘rank’ refers to the rank abundance of each clone and ensures that we get an accurate estimate of clonal breadth without collapsing sequences that are identical at the amino acid level
from tcrdist.repertoire import TCRrep
tr = TCRrep(
cell_df = df_search,
organism = "human",
chains = ['beta'],
compute_distances= False)
tr.cpus = ncpus
tr_bulk = TCRrep(
cell_df = df_bulk,
organism = "human",
chains = ['beta'],
compute_distances= False)
This is the only computationally demanding step: TCRdist is computed between the 4500 TCRs in the search DataFrame and the thousands of TCRs in the bulk repertoire. The computation is spread across multiple cores, and only distances within the TCRdist specified by the radius argument are retained, avoiding a massive memory demand.
Note
The chunk_size is the number of rows computed on any one node at a time. The helper function get_safe_chunk will, in most instances, choose a chunk size that keeps memory demand below 1-2 GB per core. By using multiple cores, this relatively large computational task can be completed quickly, without the memory that would otherwise be required to store the full set of pairwise distances.
from tcrdist.breadth import get_safe_chunk
chunk_size = get_safe_chunk(tr.clone_df.shape[0], tr_bulk.clone_df.shape[0])
tr.compute_sparse_rect_distances(
df = tr.clone_df,
df2 = tr_bulk.clone_df,
radius = 36,
chunk_size = chunk_size)
Sequences in the meta-clonotype search set are joined by distance with similar sequences in the bulk repertoire. Similar to database joins on a common key, we can perform a “left”, “inner”, or “outer” join operation (see tcrdist.join.join_by_dist() for details).
from tcrdist.join import join_by_dist
df_join = join_by_dist(
how = 'inner',
csrmat = tr.rw_beta,
left_df = tr.clone_df,
right_df = tr_bulk.clone_df,
left_cols = tr.clone_df.columns.to_list(),
right_cols = tr_bulk.clone_df.columns.to_list(),
left_suffix = '_search',
right_suffix = '_bulk',
max_n= 1000,
radius = 36)
A DataFrame <df_join> is returned where, for every search sequence in the LEFT-hand DataFrame, the <max_n> closest neighbor sequences in the RIGHT-hand DataFrame within the <radius> argument are returned.
Warning
Note that this may include redundant hits, since a TCR clone in the bulk (RIGHT) DataFrame could be a neighbor of multiple search sequences in the meta-clonotype (LEFT) DataFrame. We address this redundancy in the next few steps.
First, we create new columns recording which sequence pairs meet certain meta-clonotype definitions:
- We assign True to a column RADIUS for each joined pair where the bulk sequence is within its meta-clonotype’s search radius.
- We use regex matching to determine whether each joined sequence pair matches the motif constraint, assigning True to a column MOTIF to indicate a motif-conformant CDR3.
- We assign True to a column RADIUSANDMOTIF to indicate that both the RADIUS and MOTIF conditions are met for a given joined sequence pair.
df_join['RADIUS'] = df_join.apply(lambda x: x['dist'] <= x['radius_search'], axis = 1)
import re
df_join['MOTIF'] = df_join.apply(lambda x: re.search(string = x['cdr3_b_aa_bulk'], pattern = x['regex_search']) is not None, axis = 1)
df_join['RADIUSANDMOTIF'] = df_join['RADIUS'] & df_join['MOTIF']
df_join['unique_clones'] = 1
Tip
The number of unique clones (breadth) can be found by adding a column of ones for unique clones; summed per group, this column represents the breadth of unique clones per ORF
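As a toy illustration of the tip above (with made-up frequencies, not data from this example), summing the indicator column per group yields breadth:

```python
import pandas as pd

# Toy joined results: three distinct bulk clones hitting two proteins
toy = pd.DataFrame({
    'protein_search': ['S', 'S', 'N'],
    'count_bulk':     [0.0001, 0.0002, 0.0003],
})
toy['unique_clones'] = 1  # one indicator per unique clone

# Summing the indicator per protein gives the breadth of unique clones
breadth = toy.groupby('protein_search')['unique_clones'].sum()
print(breadth['S'], breadth['N'])  # 2 1
```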
Suppose you are interested in all RADIUS + MOTIF conforming sequences.
- First, we query those joined pairs and sort them in ascending order by distance.
- Then, because a sequence could show up as a match to multiple search meta-clonotypes, we use the pandas groupby and head methods on the ‘rank_bulk’ variable. This ensures that a given bulk clone appears only once in the final result, even if it matches two meta-clonotypes.
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1)
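The sort/groupby/head(1) pattern above can be illustrated on a toy DataFrame (with invented values): when one bulk clone matches two meta-clonotypes, only its closest match survives:

```python
import pandas as pd

# Toy join result: the bulk clone with rank_bulk=7 matched two meta-clonotypes
toy_join = pd.DataFrame({
    'feature_search': ['M_A', 'M_B', 'M_C'],
    'rank_bulk':      [7,      7,     9],
    'dist':           [12,     4,     0],
})

# Sort by distance, then keep one row per bulk clone (its closest match)
dedup = toy_join.sort_values('dist', ascending=True).groupby('rank_bulk').head(1)
print(sorted(dedup['feature_search']))  # ['M_B', 'M_C']
```

The clone with rank_bulk=7 is retained only for its closer match (dist 4 to M_B), so no bulk clone is counted twice.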
Now suppose we are interested in tabulating the number of unique RADIUS+MOTIF-conformant clones per ORF. Note that ‘count_bulk’ stores the productive frequency, whereas ‘templates_bulk’ stores the raw read count, which may be useful for count regression modeling.
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1).\
groupby('protein_search').\
sum()[['count_bulk', 'templates_bulk','unique_clones']]
count_bulk templates_bulk unique_clones
protein_search
E 0.000013 1 1
M 0.000078 6 6
N 0.001320 102 42
ORF1ab 0.000919 71 59
ORF3a 0.000414 32 27
ORF7a 0.000052 4 3
ORF7b 0.000013 1 1
S 0.001346 104 41
Alternatively, suppose we are interested in tabulating the number of unique RADIUS+MOTIF-conformant clones per epitope in the spike protein. Note that in the final groupby we use ‘protein_coordinate_search’ instead of ‘protein_search’.
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1).\
query('protein_search == "S"').\
groupby('protein_coordinate_search').\
sum()[['count_bulk', 'templates_bulk']]
count_bulk templates_bulk
protein_coordinate_search
S 1056:1069 0.000220 17
S 1192:1201 0.000026 2
S 265:278 0.000789 61
S 55:69 0.000246 19
S 706:723 0.000013 1
S 83:98 0.000026 2
S 860:875 0.000026 2
tcrdist.join.join_by_dist()
can be used in multiple contexts where finding non-exact matches between sets of TCRs is useful.
Full Example¶
import multiprocessing
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.join import join_by_dist
import re
ncpus = min(multiprocessing.cpu_count(), 6)
files = ['1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
download_and_extract_zip_file("ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip", source = "dropbox", dest = ".")
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"):
download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")
df_search = pd.read_csv("bioRxiv_v2_metaclonotypes.tsv", sep = "\t")
f = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv'
df_bulk = pd.read_csv(f, sep = "\t")
# When one wants to track each clone individually regardless of identical TRBV-CDR3-TRBJ
df_bulk = df_bulk.sort_values('count').reset_index(drop = True)
df_bulk['rank'] = df_bulk.index.to_list()
from tcrdist.repertoire import TCRrep
tr = TCRrep(
cell_df = df_search,
organism = "human",
chains = ['beta'],
compute_distances= False)
tr.cpus = ncpus
tr_bulk = TCRrep(
cell_df = df_bulk,
organism = "human",
chains = ['beta'],
compute_distances= False)
chunk_size = get_safe_chunk(tr.clone_df.shape[0], tr_bulk.clone_df.shape[0])
tr.compute_sparse_rect_distances(
df = tr.clone_df,
df2 = tr_bulk.clone_df,
radius = 36,
chunk_size = chunk_size)
df_join = join_by_dist(
how = 'inner',
csrmat = tr.rw_beta,
left_df = tr.clone_df,
right_df = tr_bulk.clone_df,
left_cols = tr.clone_df.columns.to_list(),
right_cols = tr_bulk.clone_df.columns.to_list(),
left_suffix = '_search',
right_suffix = '_bulk',
max_n= 1000,
radius = 36)
df_join['RADIUS'] = df_join.apply(lambda x: x['dist'] <= x['radius_search'], axis = 1)
import re
df_join['MOTIF'] = df_join.apply(lambda x: re.search(string = x['cdr3_b_aa_bulk'],
pattern = x['regex_search']) is not None, axis = 1)
df_join['RADIUSANDMOTIF'] = df_join['RADIUS'] & df_join['MOTIF']
df_join['unique_clones'] = 1
df_join.query('RADIUSANDMOTIF').\
sort_values('dist', ascending = True).\
groupby(['rank_bulk']).\
head(1).\
groupby('protein_search').\
sum()[['count_bulk', 'templates_bulk','unique_clones']]
-
tcrdist.repertoire.TCRrep.
compute_sparse_rect_distances
(self, df=None, df2=None, radius=50, chunk_size=100)¶ Computes a sparse rectangular distance matrix as a scipy.sparse csr matrix, retaining only distances <= radius. This function is crucial when working with bulk data. Values > radius are zeros and are not retained, saving memory. True zeros are represented as -1.
NOTE: if this function is called without explicitly supplying the df or df2 arguments, it computes all pairwise distances between clones in TCRrep.clone_df by default
Parameters: - df (DataFrame or None) – clone_df (will specify rows of the distance matrix)
- df2 (DataFrame or None) – clone_df (will specify columns of the distance matrix)
- radius (int) – The maximum radius within which distance information is retained in the scipy.sparse csr matrix. All distances greater than this radius are discarded, saving memory.
- chunk_size (int) – How many rows to compute at a time. To manage memory, computations are done on a chunk of rows at a time, and chunks are spread over multiple CPUs. See from tcrdist.breadth import get_safe_chunk if you want to tune the chunk size to the scale of the task.
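The sparse storage convention described above can be illustrated with a hand-built matrix (a sketch, not actual tcrdist output): distances beyond the radius are simply absent from the csr matrix, while true zero distances are stored as the sentinel -1 so they are not lost in the sparse representation:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hand-built stand-in for a sparse rectangular distance matrix:
# distances > radius were dropped (implicit zeros in the csr matrix),
# and a true distance of 0 is encoded as -1.
rw = csr_matrix(np.array([
    [-1, 12,  0],   # the 0 here means "beyond radius, dropped", not distance 0
    [ 0, 24, 36],
]))

# Recover the actual retained distances: map the -1 sentinel back to 0
dist = rw.data.copy()
dist[dist == -1] = 0
print(sorted(dist.tolist()))  # [0, 12, 24, 36]
```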
-
tcrdist.join.
join_by_dist
(csrmat, left_df, right_df, how='inner', left_cols=['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'subject'], right_cols=['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'subject'], left_suffix='_x', right_suffix='_y', max_n=5, radius=1, radius_list=None, sort_by_dist=True)¶ Join two sets of TCRs, based on a distance threshold encoded in a sparse matrix csrmat.
The TCRs in the Left DataFrame are joined with TCRs in the Right DataFrame, for up to max_n closest TCRs where the paired distance is less than that specified in the radius or radius_list arguments.
This is analogous to SQL-type joins, except that instead of matching common keys, the rows of the Left and Right DataFrames are merged based on finding similar TCRs within a specified radius of sequence divergence. Intuitively, this permits fuzzy matching between similar TCRs in any two TCR repertoire data sets.
Crucially, one must provide a scipy.sparse csr matrix, which can be pre-computed using
tcrdist.rep_funcs.compute_pws_sparse()
or tcrdist.repertoire.TCRrep.compute_sparse_rect_distances()
It is also possible to join using a unique radius for each sequence in the Left DataFrame via the radius_list argument instead of the fixed radius argument. If using a radius list, its length must match the number of rows in csrmat and the number of rows in the Left DataFrame (i.e., len(radius_list) == left_df.shape[0]).
Parameters: - csrmat (scipy.sparse.matrix) – rows must correspond to the index of left_df; columns must correspond to the index of right_df
- left_df (pandas.DataFrame) – Clone DataFrame
- right_df (pandas.DataFrame) – Clone DataFrame
- how (str) –
Must be one of ‘inner’, ‘left’, or ‘outer’, which determines the type of merge to be performed.
- ’inner’ uses the intersection of the top max_n neighbors between the Left and Right DataFrames, dropping rows where there is no match.
- ’left’ uses the top max_n rows from the Right DataFrame neighboring rows in the Left DataFrame, producing NAs where a TCR in the Left DataFrame has no neighbor in the Right DataFrame.
- ’outer’ performs a FULL OUTER JOIN, combining the top max_n neighbors of both the Left and Right DataFrames, producing NAs where a TCR in either DataFrame has no neighbor in the other.
- (hint: right joins are not possible, unless you switch the input DataFrame order and recompute the sparse matrix)
- left_cols (list) – all columns to include in left_df
- right_cols (list) – all columns to include in right_df
- left_suffix (str) – appends to left columns
- right_suffix (str) – appends to right columns
- max_n (int) – limit on the number of neighbors to join per row. For instance, if a TCR has 100 neighbors and max_n is 10, only the first 10 rows in the right_df will be included (this avoids a massive blowup in cases where knowing about a few neighbors would suffice)
- radius (int) – default is 1
Returns: left_right_df – concatenation of rows from the left and right DataFrames for all sequence pairs within the specified distance
Return type: pandas DataFrame
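The ‘inner’ join semantics can be sketched without tcrdist, using only the structure of the csr matrix (a simplified stand-in built with invented sequences and distances, not join_by_dist’s actual implementation):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy search (left) and bulk (right) clone tables
left_df = pd.DataFrame({'cdr3_b_aa': ['CASSLRTDHYEQYF', 'CASSPGQGAFTDTQYF']})
right_df = pd.DataFrame({'cdr3_b_aa': ['CASSLRSDSYEQYF', 'CASSQETQYF', 'CASSPGQGAFSDTQYF']})

# Toy sparse distances: rows index left_df, columns index right_df;
# only within-radius pairs are stored
csrmat = csr_matrix(np.array([
    [12, 0, 0],   # left 0 is within radius of right 0 only
    [ 0, 0, 6],   # left 1 is within radius of right 2 only
]))

# Inner-join style expansion: one output row per retained (left, right) pair
rows, cols = csrmat.nonzero()
pairs = pd.DataFrame({
    'cdr3_b_aa_search': left_df['cdr3_b_aa'].values[rows],
    'cdr3_b_aa_bulk': right_df['cdr3_b_aa'].values[cols],
    'dist': csrmat.data,
})
print(len(pairs))  # 2
```

Each stored distance becomes one joined row; TCRs with no within-radius neighbor contribute nothing, which is the ‘inner’ behavior described above.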