Breadth Estimation
The breadth and depth of a disease-specific T-cell response.
This module concerns the estimation of clonal breadth, whether at the pathogen, protein, or epitope level. Once meta-clonotypes have been defined, they can be used to search bulk repertoires for biochemically similar TCRs that are likely to share antigen recognition. Because a single TCR clonotype may be conformant with multiple TCR meta-clonotypes, an accurate estimate of clonal breadth must avoid double counting such clonotypes.
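The double-counting issue can be illustrated with a minimal sketch. The hit sets and counts below are hypothetical, not real tcrdist3 output: each meta-clonotype matches a set of bulk clone indices, and breadth is computed over the union of those sets so a clone matched by several meta-clonotypes is counted once.

```python
# Hypothetical hit sets: indices of bulk clones conformant with each
# meta-clonotype (illustrative data only).
hits_per_metaclonotype = {
    "metaclonotype_A": {0, 5, 9},
    "metaclonotype_B": {5, 12},   # clone 5 also matches metaclonotype_A
    "metaclonotype_C": {9, 21},   # clone 9 also matches metaclonotype_A
}
# Union the hit sets so each bulk clone is counted exactly once.
unique_hits = set().union(*hits_per_metaclonotype.values())
N = 100  # total productive clonotypes in the bulk repertoire
breadth = len(unique_hits) / N
print(f"{len(unique_hits)} unique conformant clones, breadth = {breadth}")
# → 5 unique conformant clones, breadth = 0.05
```

Naively summing hits per meta-clonotype would give 7 instead of 5 and overstate breadth.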
To estimate the clonal breadth of antigen-associated TCRs within a bulk repertoire with (N) productive clonotypes and (M) total productive templates, we use a set of (X) previously defined antigen-associated meta-clonotypes, each defined by (i) a centroid TRBV gene and CDR3, (ii) a TCR-specific RADIUS, and (iii) a MOTIF.
1. Compute the TCRdist between each meta-clonotype centroid i in {1…X} and all bulk clones j in {1…N} using a rectangular search with tcrdist.repertoire.TCRrep.compute_sparse_rect_distances(), producing a sparse distance matrix.
2. Perform a long-form tabulation that records the network formed between all meta-clonotype centroids and the bulk sequences within each centroid's radius. This is performed with the function tcrdist.breadth.long_form_tabulation().
The network is represented as a Pandas DataFrame, where centroid sequences are recorded in the columns 'cdr3_b_aa', 'v_b_gene', and 'j_b_gene', and the conformant sequences in the bulk repertoire in 'cdr3_b_aa_hit', 'v_b_gene_hit', and 'j_b_gene_hit'. Crucially, the column 'MOTIF' indicates whether the CDR3 of the hit sequence is conformant with the regular expression in the column 'regex'.
3. The long-form Pandas DataFrame can then be used as input to the function tcrdist.breadth.estimate_breadth_and_depth(). The unit of analysis – that is, whether breadth refers to pathogen-, protein-, or epitope-specific breadth – can be specified with the argument breadth_cols. Crucially, when running tcrdist.breadth.long_form_tabulation(), the argument search_cols must include a column indicating the association between a meta-clonotype and a particular 'protein' or 'epitope', e.g., ['tag', 'protein', 'epitope', 'cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'pgen', 'regex', 'radius'].
Note that breadth and depth follow the definitions in Snyder et al. 2020; see the section "The breadth and depth of a disease-specific T-cell response."
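A simplified reading of those definitions can be sketched on a toy long-form table. The column name `count_hit` and all numbers below are hypothetical, and this is only a frequency-based approximation; estimate_breadth_and_depth() handles deduplication across meta-clonotypes and motif filtering internally.

```python
import pandas as pd

# Hypothetical long-form tabulation: one row per (meta-clonotype, bulk hit).
df_tab = pd.DataFrame({
    "tag":           ["mc1", "mc1", "mc2", "mc2"],
    "cdr3_b_aa_hit": ["CASSA", "CASSB", "CASSB", "CASSC"],
    "count_hit":     [3, 10, 10, 2],  # templates for each hit clone
})
N = 1000    # productive clonotypes in the bulk repertoire
M = 50000   # total productive templates

# Deduplicate hits shared by multiple meta-clonotypes before counting.
unique_hits = df_tab.drop_duplicates(subset=["cdr3_b_aa_hit"])
breadth = unique_hits.shape[0] / N            # fraction of unique clones
depth = unique_hits["count_hit"].sum() / M    # fraction of templates
print(breadth, depth)  # → 0.003 0.0003
```

Here CASSB is matched by both meta-clonotypes but contributes only once to both breadth and depth.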
Attention
Important: this page doubles as a live unit test, so testing_only is set to True and only a sample of the meta-clonotypes is used. Be sure to set testing_only to False when you run your own meta-clonotypes.
Tip
Notice that this example first shows how to estimate breadth for a single file, then proceeds to an example estimating breadth for a list of files in a loop. The default is to write out the breadth estimates as flat files, but you could instead store them in a list and pd.concat() them together.
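The collect-and-concatenate pattern mentioned above can be sketched as follows. `estimate_for_file` is a hypothetical stand-in for the per-file search/tabulate/estimate workflow shown below, returning a small placeholder DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the per-file breadth-estimation workflow.
def estimate_for_file(filename):
    # ... search, tabulate, and estimate breadth for one bulk file ...
    return pd.DataFrame({"protein": ["ORF1ab"],
                         "breadth": [0.001],
                         "file":    [filename]})

# Collect one DataFrame per bulk file, then combine them.
results = [estimate_for_file(f) for f in ["sample1.tsv", "sample2.tsv"]]
df_all = pd.concat(results, ignore_index=True)
print(df_all.shape)  # → (2, 3)
```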
Tip
The meta-clonotypes downloadable in this example come from: Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, Crawford JC, Souquette A, Gaevert JA, Hertz T, Thomas PG, Bradley PH, Fiore-Gartland A. TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA-restricted SARS-CoV-2 associated TCR features. bioRxiv (2020) doi:10.1101/2020.12.24.424260
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.breadth import long_form_tabulation
from tcrdist.breadth import estimate_breadth_and_depth
import multiprocessing
import time
import scipy.sparse
# For testing, use only a subset of meta-clonotypes;
# set testing_only = False for a full run with all meta-clonotypes.
testing_only = True
ncpus = min(multiprocessing.cpu_count(), 6)
# Download 2 bulk files for testing and demonstration purposes
files = [
'1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv',
'1349BW_unsorted_cc1000000_ImmunRACE_043020_003_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
    download_and_extract_zip_file(
        "ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip",
        source = "dropbox",
        dest = ".")
assert np.all([os.path.isfile(f) for f in files])
# Download a Meta-Clonotypes File, All Meta-Clonotypes from Pre-Print Manuscript
# (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7781332/)
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"):
    download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")
assert os.path.isfile('bioRxiv_v2_metaclonotypes.tsv')
# ███████╗███████╗ █████╗ ██████╗ ██████╗██╗ ██╗
# ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║ ██║
# ███████╗█████╗ ███████║██████╔╝██║ ███████║
# ╚════██║██╔══╝ ██╔══██║██╔══██╗██║ ██╔══██║
# ███████║███████╗██║ ██║██║ ██║╚██████╗██║ ██║
# ╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝
meta_clonotypes_filename = 'bioRxiv_v2_metaclonotypes.tsv'
meta_clonotypes = pd.read_csv(meta_clonotypes_filename , sep = "\t")
# Assign feature id as a tag for tracking by meta-clonotype definition
meta_clonotypes['tag'] = meta_clonotypes['feature'].copy()
cols = ['cdr3_b_aa','v_b_gene','j_b_gene',
'pgen','radius','regex','protein','protein_coordinate', 'tag']
df_search = meta_clonotypes[cols]
df_search = df_search.assign(count= 1)
if testing_only:
    # take a subsample to accelerate unit testing
    df_search = df_search.sample(100, random_state = 1)
tr_search = TCRrep(cell_df = df_search,
organism = 'human',
chains = ['beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
compute_distances = False)
tr_search.cpus = ncpus
# ██████╗ ██╗ ██╗██╗ ██╗ ██╗
# ██╔══██╗██║ ██║██║ ██║ ██╔╝
# ██████╔╝██║ ██║██║ █████╔╝
# ██╔══██╗██║ ██║██║ ██╔═██╗
# ██████╔╝╚██████╔╝███████╗██║ ██╗
# ╚═════╝ ╚═════╝ ╚══════╝╚═╝ ╚═╝
bulk_filename = files[0]
df_bulk = pd.read_csv( bulk_filename, sep = "\t")
valid_TRBV_CDR3_TRBJ = (df_bulk['valid_cdr3'] == True)&(df_bulk['v_b_gene'].notna())&(df_bulk['j_b_gene'].notna())
df_bulk = df_bulk[valid_TRBV_CDR3_TRBJ].reset_index(drop = True)
# Convert templates to counts
if 'templates' in df_bulk.columns:
    df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','templates','productive_frequency']].\
        rename(columns = {'templates':'count'})
else:
    df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','count','productive_frequency']]
tr_bulk = TCRrep(cell_df = df_bulk,
organism = 'human',
chains = ['beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
compute_distances = False)
# Get dimensions
search_clones = tr_search.clone_df.shape[0]
bulk_clones = tr_bulk.clone_df.shape[0]
# Get an ideal chunk size that will control memory usage
# target = 10**7 keeps memory under roughly 2GB per CPU.
ideal_chunk_size = get_safe_chunk(
search_clones = tr_search.clone_df.shape[0],
bulk_clones = tr_bulk.clone_df.shape[0],
target = 10**7)
print(f"IDEAL CHUNK SIZE {ideal_chunk_size},{ncpus} CPUS")
# ██████╗ ███████╗ ██████╗████████╗ ███████╗███████╗ █████╗ ██████╗ ██████╗██╗ ██╗
# ██╔══██╗██╔════╝██╔════╝╚══██╔══╝ ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║ ██║
# ██████╔╝█████╗ ██║ ██║ ███████╗█████╗ ███████║██████╔╝██║ ███████║
# ██╔══██╗██╔══╝ ██║ ██║ ╚════██║██╔══╝ ██╔══██║██╔══██╗██║ ██╔══██║
# ██║ ██║███████╗╚██████╗ ██║ ███████║███████╗██║ ██║██║ ██║╚██████╗██║ ██║
# ╚═╝ ╚═╝╚══════╝ ╚═════╝ ╚═╝ ╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝
tic = time.perf_counter()
tr_search.compute_sparse_rect_distances(
df = tr_search.clone_df,
df2 = tr_bulk.clone_df,
chunk_size = ideal_chunk_size,
radius = 37)
toc = time.perf_counter()
print(f"SEARCHING {search_clones} META-CLONOTYPES IN {bulk_clones} BULK CLONES IN {toc - tic:0.2f} sec.")
# ████████╗ █████╗ ██████╗ ██╗ ██╗██╗ █████╗ ████████╗███████╗
# ╚══██╔══╝██╔══██╗██╔══██╗██║ ██║██║ ██╔══██╗╚══██╔══╝██╔════╝
# ██║ ███████║██████╔╝██║ ██║██║ ███████║ ██║ █████╗
# ██║ ██╔══██║██╔══██╗██║ ██║██║ ██╔══██║ ██║ ██╔══╝
# ██║ ██║ ██║██████╔╝╚██████╔╝███████╗██║ ██║ ██║ ███████╗
# ╚═╝ ╚═╝ ╚═╝╚═════╝ ╚═════╝ ╚══════╝╚═╝ ╚═╝ ╚═╝ ╚══════╝
df_tab = long_form_tabulation(
clone_df1 = tr_search.clone_df,
clone_df2 = tr_bulk.clone_df,
csrmat = tr_search.rw_beta,
search_cols = ['tag', 'protein','protein_coordinate','cdr3_b_aa',
'v_b_gene', 'j_b_gene', 'pgen','regex', 'radius'])
# ██████╗ ██████╗ ███████╗ █████╗ ██████╗ ████████╗██╗ ██╗
# ██╔══██╗██╔══██╗██╔════╝██╔══██╗██╔══██╗╚══██╔══╝██║ ██║
# ██████╔╝██████╔╝█████╗ ███████║██║ ██║ ██║ ███████║
# ██╔══██╗██╔══██╗██╔══╝ ██╔══██║██║ ██║ ██║ ██╔══██║
# ██████╔╝██║ ██║███████╗██║ ██║██████╔╝ ██║ ██║ ██║
# ╚═════╝ ╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚═════╝ ╚═╝ ╚═╝ ╚═╝
df_breadth = estimate_breadth_and_depth(df=df_tab,
breadth_cols = ['protein'],
N =tr_bulk.clone_df.shape[0] ,
M= tr_bulk.clone_df['count'].sum(),
motif = True,
exact = False)
df_breadth = df_breadth.assign(file = files[0])
# ███╗ ███╗ █████╗ ███╗ ██╗██╗ ██╗ ███████╗██╗██╗ ███████╗███████╗
# ████╗ ████║██╔══██╗████╗ ██║╚██╗ ██╔╝ ██╔════╝██║██║ ██╔════╝██╔════╝
# ██╔████╔██║███████║██╔██╗ ██║ ╚████╔╝ █████╗ ██║██║ █████╗ ███████╗
# ██║╚██╔╝██║██╔══██║██║╚██╗██║ ╚██╔╝ ██╔══╝ ██║██║ ██╔══╝ ╚════██║
# ██║ ╚═╝ ██║██║ ██║██║ ╚████║ ██║ ██║ ██║███████╗███████╗███████║
# ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═╝ ╚═╝ ╚═╝╚══════╝╚══════╝╚══════╝
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.breadth import long_form_tabulation
from tcrdist.breadth import estimate_breadth_and_depth
import multiprocessing
import time
import scipy.sparse
ncpus = min(multiprocessing.cpu_count(), 6)
# For testing, use only a subset of meta-clonotypes;
# set testing_only = False for a full run with all meta-clonotypes.
testing_only = True
destination = "breadth_estimates"
if not os.path.isdir(destination):
    os.mkdir(destination)
files = [
'1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv',
'1349BW_unsorted_cc1000000_ImmunRACE_043020_003_gDNA_TCRB.tsv.tcrdist3.tsv']
# ███████╗███████╗ █████╗ ██████╗ ██████╗██╗ ██╗
# ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║ ██║
# ███████╗█████╗ ███████║██████╔╝██║ ███████║
# ╚════██║██╔══╝ ██╔══██║██╔══██╗██║ ██╔══██║
# ███████║███████╗██║ ██║██║ ██║╚██████╗██║ ██║
# ╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝
meta_clonotypes_filename = 'bioRxiv_v2_metaclonotypes.tsv'
meta_clonotypes = pd.read_csv(meta_clonotypes_filename , sep = "\t")
# Assign feature id as a tag for tracking by meta-clonotype definition
meta_clonotypes['tag'] = meta_clonotypes['feature'].copy()
cols = ['cdr3_b_aa','v_b_gene','j_b_gene',
'pgen','radius','regex','protein','protein_coordinate', 'tag']
df_search = meta_clonotypes[cols]
df_search = df_search.assign(count= 1)
if testing_only:
    # take a subsample to accelerate unit testing
    df_search = df_search.sample(100, random_state = 1)
tr_search = TCRrep(cell_df = df_search,
organism = 'human',
chains = ['beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
compute_distances = False)
tr_search.cpus = ncpus
for file in files:
    bulk_filename = file
    df_bulk = pd.read_csv(bulk_filename, sep = "\t")
    valid_TRBV_CDR3_TRBJ = (df_bulk['valid_cdr3'] == True)&(df_bulk['v_b_gene'].notna())&(df_bulk['j_b_gene'].notna())
    df_bulk = df_bulk[valid_TRBV_CDR3_TRBJ].reset_index(drop = True)
    # Convert templates to counts
    if 'templates' in df_bulk.columns:
        df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','templates','productive_frequency']].\
            rename(columns = {'templates':'count'})
    else:
        df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','count','productive_frequency']]
    tr_bulk = TCRrep(cell_df = df_bulk,
                     organism = 'human',
                     chains = ['beta'],
                     db_file = 'alphabeta_gammadelta_db.tsv',
                     compute_distances = False)
    # Get dimensions
    search_clones = tr_search.clone_df.shape[0]
    bulk_clones = tr_bulk.clone_df.shape[0]
    # Get an ideal chunk size that will control memory usage
    # target = 10**7 keeps memory under roughly 2GB per CPU.
    ideal_chunk_size = get_safe_chunk(tr_search.clone_df.shape[0], tr_bulk.clone_df.shape[0], target = 10**7)
    print(f"IDEAL CHUNK SIZE {ideal_chunk_size},{ncpus} CPUS")
    # ██████╗ ███████╗ ██████╗████████╗    ███████╗███████╗ █████╗ ██████╗  ██████╗██╗  ██╗
    # ██╔══██╗██╔════╝██╔════╝╚══██╔══╝    ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║  ██║
    # ██████╔╝█████╗  ██║        ██║       ███████╗█████╗  ███████║██████╔╝██║     ███████║
    # ██╔══██╗██╔══╝  ██║        ██║       ╚════██║██╔══╝  ██╔══██║██╔══██╗██║     ██╔══██║
    # ██║  ██║███████╗╚██████╗   ██║       ███████║███████╗██║  ██║██║  ██║╚██████╗██║  ██║
    # ╚═╝  ╚═╝╚══════╝ ╚═════╝   ╚═╝       ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
    tic = time.perf_counter()
    tr_search.compute_sparse_rect_distances(
        df = tr_search.clone_df,
        df2 = tr_bulk.clone_df,
        chunk_size = ideal_chunk_size,
        radius = 37)
    toc = time.perf_counter()
    print(f"SEARCHING {search_clones} META-CLONOTYPES IN {bulk_clones} BULK CLONES IN {toc - tic:0.2f} sec.")
    # ████████╗ █████╗ ██████╗ ██╗   ██╗██╗      █████╗ ████████╗███████╗
    # ╚══██╔══╝██╔══██╗██╔══██╗██║   ██║██║     ██╔══██╗╚══██╔══╝██╔════╝
    #    ██║   ███████║██████╔╝██║   ██║██║     ███████║   ██║   █████╗
    #    ██║   ██╔══██║██╔══██╗██║   ██║██║     ██╔══██║   ██║   ██╔══╝
    #    ██║   ██║  ██║██████╔╝╚██████╔╝███████╗██║  ██║   ██║   ███████╗
    #    ╚═╝   ╚═╝  ╚═╝╚═════╝  ╚═════╝ ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚══════╝
    df_tab = long_form_tabulation(
        clone_df1 = tr_search.clone_df,
        clone_df2 = tr_bulk.clone_df,
        csrmat = tr_search.rw_beta,
        search_cols = ['tag', 'protein','protein_coordinate','cdr3_b_aa',
                       'v_b_gene', 'j_b_gene', 'pgen','regex', 'radius'])
    # ██████╗ ██████╗ ███████╗ █████╗ ██████╗ ████████╗██╗  ██╗
    # ██╔══██╗██╔══██╗██╔════╝██╔══██╗██╔══██╗╚══██╔══╝██║  ██║
    # ██████╔╝██████╔╝█████╗  ███████║██║  ██║   ██║   ███████║
    # ██╔══██╗██╔══██╗██╔══╝  ██╔══██║██║  ██║   ██║   ██╔══██║
    # ██████╔╝██║  ██║███████╗██║  ██║██████╔╝   ██║   ██║  ██║
    # ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚═════╝    ╚═╝   ╚═╝  ╚═╝
    df_breadth = estimate_breadth_and_depth(df = df_tab,
        breadth_cols = ['protein'],
        N = tr_bulk.clone_df.shape[0],
        M = tr_bulk.clone_df['count'].sum(),
        motif = True,
        exact = False)
    df_breadth = df_breadth.assign(file = file)
    # Save rectangular sparse matrix, tabulation, and breadth estimates for future reference
    scipy.sparse.save_npz(os.path.join(destination, f"{file}.rw.npz"), tr_search.rw_beta)
    df_tab.to_csv(os.path.join(destination, f"{file}.tabulation.tsv"),
                  sep = '\t', index = True)
    df_breadth.to_csv(os.path.join(destination, f"{file}.breadth.tsv"),
                      sep = '\t', index = True)
|