Breadth Estimation

The breadth and depth of a disease-specific T-cell response.

This module concerns the estimation of clonal breadth, whether it be at the pathogen, protein, or epitope level. Once meta-clonotype have been defined they can be used to search for biochemically similar TCRs in bulk repertoires that are likely to share antigen recognition. It is possible that a single TCR clonotype may be conformant with multiple TCR meta-clonotypes, so an accurate estimate of clonal breadth must avoid double counting such clonotypes.

To estimate clonal breadth of antigen-associated TCRs within a bulk repertoires with (N) productive clonotypes and (M) total productive templates, we use a set of (X) previously defined antigen-associated meta-clonotypes (defined as a (i) Centroid TRV,CDR3, (ii) TCR-specific RADIUS, (iii) MOTIF.

1. Compute the TCRdist between each centroid TCRij for i {1…i…X} and all bulk clones {1…j..N} using rectangular search with the tcrdist.repertoires.TCRrep.compute_sparse_rect_distances(), producing a sparse distance matrix.

2. Next perform a long-form tabulation that records the network formed between all meta-clonotype centroids and bulk sequences within the specified radius. This is performed with the function tcrdist.breadth.long_form_tabulation().

The network is represented as a Pandas DataFrame. Where centroid sequences are recorded as “cdr3_b_aa”, “v_b_gene”, “j_b_gene” and the conformant sequence in the bulk repertoire is “cdr3_b_aa_hit”, ‘v_b_gene_hit’, ‘j_b_gene_hit’. Crucially there is a column “MOTIF” which indicates whether the CDR3 of the hit sequence is conformant with the regular expression in the column “regex”.

3. The long-form Pandas DataFrame can then be used as input to the function tcrdist.breadth.estimate_breadth_and_depth(). The unit of analysis – that is, whether breadth refers to pathogen, protein, or epitope specific breadth – can be specified in with the argument breadth_cols. Crucially, when running tcrdist.breadth.long_form_tabulation() the argument search_cols must include a column indicating the association between a metac-clonotype and a particular ‘protein’ or ‘epitope’ e.g., [‘tag’, ‘protein’, ‘epitope’, cdr3_b_aa’, ‘v_b_gene’, ‘j_b_gene’, ‘pgen’,’regex’, ‘radius’]

Note that breadth and depth follow the definitions in Synder et al. 2020

Synder et al. 2020, See section: The breadth and depth of a disease-specific T-cell response.

Attention

REALLY IMPORTANT: This is a live unit test, where testing_only is set to True, using only a sample of metaclonotypes. However, be sure to set testing_only to False when you run your own meta-clonotypes

Tip

Notice that this example first shows how to do this for a single file. Then in proceeds to an example showing how to estmate breadth for a list of files in a loop. The default is to write out the breadth estimates as flat files, but you could store these in a list and pd.concat() them together

Tip

The metaclonotype downloadable in this example come from Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, Crawford JC, Souquette A, Gaevert JA, Hertz T, Thomas PG, Bradley PH, Fiore-Gartland A. 2020 TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, LA-restricted SARS-CoV-2 associated TCR features. bioRxiv (2020) doi:10.1101/2020.12.24.424260

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
import numpy as np
import os 
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.breadth import long_form_tabulation
from tcrdist.breadth import estimate_breadth_and_depth
import multiprocessing
import time
import scipy.sparse
# For testing use only a subset of meta-clonotypes

ncpus = min(multiprocessing.cpu_count(), 6)
# Download 2 bulk files for testing and demonstration purposes
files = [
'1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv',
'1349BW_unsorted_cc1000000_ImmunRACE_043020_003_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
    download_and_extract_zip_file(
        "ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip", 
        source = "dropbox", 
        dest = ".")
    assert np.all([os.path.isfile(f) for f in files])
# Download a Meta-Clonotypes File, All Meta-Clonotypes from Pre-Print Manuscript 
# (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7781332/)
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"): 
    download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")
    assert os.path.isfile('bioRxiv_v2_metaclonotypes.tsv')
# ███████╗███████╗ █████╗ ██████╗  ██████╗██╗  ██╗
# ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║  ██║
# ███████╗█████╗  ███████║██████╔╝██║     ███████║
# ╚════██║██╔══╝  ██╔══██║██╔══██╗██║     ██╔══██║
# ███████║███████╗██║  ██║██║  ██║╚██████╗██║  ██║
# ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
meta_clonotypes_filename = 'bioRxiv_v2_metaclonotypes.tsv'
meta_clonotypes          = pd.read_csv(meta_clonotypes_filename , sep = "\t")
# Assign feature id as a tag for tracking by meta-clonotype definition
meta_clonotypes['tag'] = meta_clonotypes['feature'].copy()
cols = ['cdr3_b_aa','v_b_gene','j_b_gene',
        'pgen','radius','regex','protein','protein_coordinate', 'tag']
df_search = meta_clonotypes[cols]
df_search = df_search.assign(count= 1)

if testing_only: 
    # take a subsample to accelerated unit testing
    df_search = df_search.sample(100, random_state=1)
tr_search = TCRrep(cell_df = df_search,
                organism = 'human',
                chains = ['beta'],
                db_file = 'alphabeta_gammadelta_db.tsv',
                compute_distances = False)
tr_search.cpus = ncpus
# ██████╗ ██╗   ██╗██╗     ██╗  ██╗
# ██╔══██╗██║   ██║██║     ██║ ██╔╝
# ██████╔╝██║   ██║██║     █████╔╝ 
# ██╔══██╗██║   ██║██║     ██╔═██╗ 
# ██████╔╝╚██████╔╝███████╗██║  ██╗
# ╚═════╝  ╚═════╝ ╚══════╝╚═╝  ╚═╝
bulk_filename = files[0]
df_bulk = pd.read_csv( bulk_filename, sep = "\t")
valid_TRBV_CDR3_TRBJ = (df_bulk['valid_cdr3'] == True)&(df_bulk['v_b_gene'].notna())&(df_bulk['j_b_gene'].notna())
df_bulk = df_bulk[valid_TRBV_CDR3_TRBJ].reset_index(drop = True)
# Convert templates to counts
if 'templates' in df_bulk.columns:
    df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','templates',
    'productive_frequency']].\
    rename(columns = {'templates':'count'})
else:
    df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','count','productive_frequency']]
tr_bulk   = TCRrep(cell_df = df_bulk,
                organism = 'human',
                chains = ['beta'],
                db_file = 'alphabeta_gammadelta_db.tsv',
                compute_distances = False)
# Get dimensions
search_clones = tr_search.clone_df.shape[0] 
bulk_clones   = tr_bulk.clone_df.shape[0]
# Get and ideal chunksize that will control memory usage
# 10**7 will keep memory > 2GB per CPU. 
ideal_chunk_size = get_safe_chunk(
    search_clones = tr_search.clone_df.shape[0], 
    bulk_clones = tr_bulk.clone_df.shape[0],
    target = 10**7)
print(f"IDEAL CHUNK SIZE {ideal_chunk_size},{ncpus} CPUS")
# ██████╗ ███████╗ ██████╗████████╗    ███████╗███████╗ █████╗ ██████╗  ██████╗██╗  ██╗
# ██╔══██╗██╔════╝██╔════╝╚══██╔══╝    ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║  ██║
# ██████╔╝█████╗  ██║        ██║       ███████╗█████╗  ███████║██████╔╝██║     ███████║
# ██╔══██╗██╔══╝  ██║        ██║       ╚════██║██╔══╝  ██╔══██║██╔══██╗██║     ██╔══██║
# ██║  ██║███████╗╚██████╗   ██║       ███████║███████╗██║  ██║██║  ██║╚██████╗██║  ██║
# ╚═╝  ╚═╝╚══════╝ ╚═════╝   ╚═╝       ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
tic = time.perf_counter()
tr_search.compute_sparse_rect_distances(
    df = tr_search.clone_df, 
    df2 = tr_bulk.clone_df, 
    chunk_size = ideal_chunk_size, 
    radius = 37)
toc = time.perf_counter()
print(f"SEARCHING {search_clones} META-CLONOTYPES IN {bulk_clones} BULK CLONES IN {toc - tic:0.2f} sec.")
# ████████╗ █████╗ ██████╗ ██╗   ██╗██╗      █████╗ ████████╗███████╗
# ╚══██╔══╝██╔══██╗██╔══██╗██║   ██║██║     ██╔══██╗╚══██╔══╝██╔════╝
#    ██║   ███████║██████╔╝██║   ██║██║     ███████║   ██║   █████╗  
#    ██║   ██╔══██║██╔══██╗██║   ██║██║     ██╔══██║   ██║   ██╔══╝  
#    ██║   ██║  ██║██████╔╝╚██████╔╝███████╗██║  ██║   ██║   ███████╗
#    ╚═╝   ╚═╝  ╚═╝╚═════╝  ╚═════╝ ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚══════╝
df_tab = long_form_tabulation(
    clone_df1   = tr_search.clone_df,
    clone_df2   = tr_bulk.clone_df,
    csrmat      = tr_search.rw_beta, 
    search_cols = ['tag', 'protein','protein_coordinate','cdr3_b_aa', 
    'v_b_gene', 'j_b_gene', 'pgen','regex', 'radius'])
# ██████╗ ██████╗ ███████╗ █████╗ ██████╗ ████████╗██╗  ██╗
# ██╔══██╗██╔══██╗██╔════╝██╔══██╗██╔══██╗╚══██╔══╝██║  ██║
# ██████╔╝██████╔╝█████╗  ███████║██║  ██║   ██║   ███████║
# ██╔══██╗██╔══██╗██╔══╝  ██╔══██║██║  ██║   ██║   ██╔══██║
# ██████╔╝██║  ██║███████╗██║  ██║██████╔╝   ██║   ██║  ██║
# ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚═════╝    ╚═╝   ╚═╝  ╚═╝
df_breadth = estimate_breadth_and_depth(df=df_tab, 
    breadth_cols = ['protein'], 
    N =tr_bulk.clone_df.shape[0] , 
    M= tr_bulk.clone_df['count'].sum(), 
    motif = True,
    exact = False)
df_breadth = df_breadth.assign(file = files[0])
# ███╗   ███╗ █████╗ ███╗   ██╗██╗   ██╗    ███████╗██╗██╗     ███████╗███████╗
# ████╗ ████║██╔══██╗████╗  ██║╚██╗ ██╔╝    ██╔════╝██║██║     ██╔════╝██╔════╝
# ██╔████╔██║███████║██╔██╗ ██║ ╚████╔╝     █████╗  ██║██║     █████╗  ███████╗
# ██║╚██╔╝██║██╔══██║██║╚██╗██║  ╚██╔╝      ██╔══╝  ██║██║     ██╔══╝  ╚════██║
# ██║ ╚═╝ ██║██║  ██║██║ ╚████║   ██║       ██║     ██║███████╗███████╗███████║
# ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═╝       ╚═╝     ╚═╝╚══════╝╚══════╝╚══════╝
import numpy as np
import os 
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.breadth import long_form_tabulation
from tcrdist.breadth import estimate_breadth_and_depth
import multiprocessing
import time
import scipy.sparse
ncpus = min(multiprocessing.cpu_count(), 6)
destination = "breadth_estimates"
if not os.path.isdir(destination):
    os.mkdir(destination)
files = [
'1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv',
'1349BW_unsorted_cc1000000_ImmunRACE_043020_003_gDNA_TCRB.tsv.tcrdist3.tsv']

# ███████╗███████╗ █████╗ ██████╗  ██████╗██╗  ██╗
# ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║  ██║
# ███████╗█████╗  ███████║██████╔╝██║     ███████║
# ╚════██║██╔══╝  ██╔══██║██╔══██╗██║     ██╔══██║
# ███████║███████╗██║  ██║██║  ██║╚██████╗██║  ██║
# ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
meta_clonotypes_filename = 'bioRxiv_v2_metaclonotypes.tsv'
meta_clonotypes          = pd.read_csv(meta_clonotypes_filename , sep = "\t")
# Assign feature id as a tag for tracking by meta-clonotype definition
meta_clonotypes['tag'] = meta_clonotypes['feature'].copy()
cols = ['cdr3_b_aa','v_b_gene','j_b_gene',
        'pgen','radius','regex','protein','protein_coordinate', 'tag']
df_search = meta_clonotypes[cols]
df_search = df_search.assign(count= 1)
if testing_only: 
    # take a subsample to accelerated unit testing
    df_search = df_search.sample(100, random_state=1)
tr_search = TCRrep(cell_df = df_search,
                organism = 'human',
                chains = ['beta'],
                db_file = 'alphabeta_gammadelta_db.tsv',
                compute_distances = False)
tr_search.cpus = ncpus

for file in files:
    bulk_filename = file
    df_bulk = pd.read_csv( bulk_filename, sep = "\t")
    valid_TRBV_CDR3_TRBJ = (df_bulk['valid_cdr3'] == True)&(df_bulk['v_b_gene'].notna())&(df_bulk['j_b_gene'].notna())
    df_bulk = df_bulk[valid_TRBV_CDR3_TRBJ].reset_index(drop = True)
    # Convert templates to counts
    if 'templates' in df_bulk.columns:
        df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','templates','productive_frequency']].rename(columns = {'templates':'count'})
    else:
        df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','count','productive_frequency']]
    tr_bulk   = TCRrep(cell_df = df_bulk,
                    organism = 'human',
                    chains = ['beta'],
                    db_file = 'alphabeta_gammadelta_db.tsv',
                    compute_distances = False)
    # Get dimensions
    search_clones = tr_search.clone_df.shape[0] 
    bulk_clones   = tr_bulk.clone_df.shape[0]
    # Get and ideal chunksize that will control memory usage
    # 10**7 will keep memory > 2GB per CPU. 
    ideal_chunk_size = get_safe_chunk(tr_search.clone_df.shape[0], tr_bulk.clone_df.shape[0], target = 10**7)
    print(f"IDEAL CHUNK SIZE {ideal_chunk_size},{ncpus} CPUS")
    # ██████╗ ███████╗ ██████╗████████╗    ███████╗███████╗ █████╗ ██████╗  ██████╗██╗  ██╗
    # ██╔══██╗██╔════╝██╔════╝╚══██╔══╝    ██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║  ██║
    # ██████╔╝█████╗  ██║        ██║       ███████╗█████╗  ███████║██████╔╝██║     ███████║
    # ██╔══██╗██╔══╝  ██║        ██║       ╚════██║██╔══╝  ██╔══██║██╔══██╗██║     ██╔══██║
    # ██║  ██║███████╗╚██████╗   ██║       ███████║███████╗██║  ██║██║  ██║╚██████╗██║  ██║
    # ╚═╝  ╚═╝╚══════╝ ╚═════╝   ╚═╝       ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
    tic = time.perf_counter()
    tr_search.compute_sparse_rect_distances(
        df = tr_search.clone_df, 
        df2 = tr_bulk.clone_df, 
        chunk_size = ideal_chunk_size, 
        radius = 37)
    toc = time.perf_counter()
    print(f"SEARCHING {search_clones} META-CLONOTYPES IN {bulk_clones} BULK CLONES IN {toc - tic:0.2f} sec.")
    # ████████╗ █████╗ ██████╗ ██╗   ██╗██╗      █████╗ ████████╗███████╗
    # ╚══██╔══╝██╔══██╗██╔══██╗██║   ██║██║     ██╔══██╗╚══██╔══╝██╔════╝
    #    ██║   ███████║██████╔╝██║   ██║██║     ███████║   ██║   █████╗  
    #    ██║   ██╔══██║██╔══██╗██║   ██║██║     ██╔══██║   ██║   ██╔══╝  
    #    ██║   ██║  ██║██████╔╝╚██████╔╝███████╗██║  ██║   ██║   ███████╗
    #    ╚═╝   ╚═╝  ╚═╝╚═════╝  ╚═════╝ ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚══════╝
    df_tab = long_form_tabulation(
        clone_df1   = tr_search.clone_df,
        clone_df2   = tr_bulk.clone_df,
        csrmat      = tr_search.rw_beta, 
        search_cols = ['tag', 'protein','protein_coordinate','cdr3_b_aa', 
        'v_b_gene', 'j_b_gene', 'pgen','regex', 'radius'])
    
    # ██████╗ ██████╗ ███████╗ █████╗ ██████╗ ████████╗██╗  ██╗
    # ██╔══██╗██╔══██╗██╔════╝██╔══██╗██╔══██╗╚══██╔══╝██║  ██║
    # ██████╔╝██████╔╝█████╗  ███████║██║  ██║   ██║   ███████║
    # ██╔══██╗██╔══██╗██╔══╝  ██╔══██║██║  ██║   ██║   ██╔══██║
    # ██████╔╝██║  ██║███████╗██║  ██║██████╔╝   ██║   ██║  ██║
    # ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚═════╝    ╚═╝   ╚═╝  ╚═╝
    df_breadth = estimate_breadth_and_depth(df=df_tab, 
        breadth_cols = ['protein'], 
        N =tr_bulk.clone_df.shape[0] , 
        M= tr_bulk.clone_df['count'].sum(), 
        motif = True,
        exact = False)
    df_breadth = df_breadth.assign(file = files[0])
    # save rectangular sparse matrix, tabulation, and breadth esimates for future reference
    scipy.sparse.save_npz(os.path.join(destination, f"{file}.rw.npz"), tr_search.rw_beta)
    df_tab.to_csv(os.path.join(destination, f"{file}.tabulation.tsv"), 
        sep = '\t', index = True)
    df_breadth.to_csv(os.path.join(destination, f"{file}.breadth.tsv"),
        sep = '\t', index = True)