(Quasi)Public Clones¶
Public TCRs are shared clonotypes found in multiple individuals, arising from VDJ recombination biases and common selection pressures. Repertoire analyses often focuses on public clones; however finding public antigen-specific TCRs is not always possible because TCR repertoires are characterized by extreme diversity. As a consequence, only a small fraction of the repertoire can be assayed in a single sample, making it difficult to reproducibly sample TCR clonotypes from an individual, let alone reliably detect shared clonotypes in a population.
Enter, stage left, the quasi-public TCRs – two or more TCRs, with a high degree of biochemical similarity – that are found in two or more individuals. Identifying quasi public TCRs becomes useful when evaluating an antigen enriched repertoire putatively recognizing the same epitope.
Finding similar receptors from multiple individuals provides stronger evidence of shared epitope recognition and reveals mechanistic basis for CDR-peptide-MHC binding.
Moreover, meta-clonotypes are by definition more abundant than exact clonotype and thus can be more reliably be detected in a single bulk unenriched sample, facilitating more robust function comparisons across populations.
I am happy to use the defaults¶
For instance, you may want find all the (quasi)public collections of TCRs within a fixed radius <= 18 TCRdist units of each TCR in the antigen enriched input data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | """ tcrdist3 is particularly useful for finding what we term quasi-public meta-clonotypes, collections of biochemically similar TCRs recognizing the same peptide-MHC. The easist way to define meta-clonotypes is to compute pairwise distances between TCRs found in an antigen-enriched subrepertoire, abbreviated below as <aesr> """ import os import pandas as pd import numpy as np from tcrdist.repertoire import TCRrep from tcrdist.public import TCRpublic # <aesr_fn> antigen-enriched subrepertoire fn = os.path.join('tcrdist', 'data','covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv') # <aesr_df> antigen-enriched subrepertoire df = pd.read_csv(fn) # <tr> TCR repertoire tr = TCRrep( cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa']].copy(), organism = 'human', chains = ['beta'], db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = True) # <tp> TCRpublic class for reporting publicities, fixed radius 18, 'nsubject > 3' tp = TCRpublic( tcrrep = tr, output_html_name = "quasi_public_clones.html") # by calling, .report() an html report is made public = tp.report() # Also, the following datafarme are available # <clone_df> pd.DataFrame clone_df from tr.clone_df # with neighbors and summary measures appended public['clone_df'] # <nn_summary> pd.DataFrame with just summary measures public['nn_summary'] # <quasi_public_df> Non-redundant groups of quasipublic clones public['quasi_public_df'] |
In addition to the summary DataFrames returned, a HTML quasi-publicity report is generated, allowing for the inspection of logo-motifs formed from highly similar antigen-enriched TCR sequences found in multiple subjects.
I’d like to tweak a default parameter¶
If you want to add or subtract information from the report you can do so relatively easily. For instance, suppose you want to summarize cohort information and add that to the report.
Just like elsewhere in tcrdist3, the python object TCRpublic stores all options as attributes:
In [2]: tp.__dict__
Out[2]:
{'tcrrep': <tcrdist.repertoire.TCRrep at 0x13d5b9310>,
'organism': 'human',
'chain': 'beta',
'output_html_name': 'quasi_public_clones.html',
'pw_mat_str': 'pw_beta',
'cdr3_name': 'cdr3_b_aa',
'v_gene_name': 'v_b_gene',
'j_gene_name': 'j_b_gene',
'nr_filter': True,
'labels': ['clone_id',
'cdr3_b_aa',
'v_b_gene',
'j_b_gene',
'radius',
'neighbors',
'K_neighbors',
'nsubject',
'qpublic',
'cdr3_b_aa.summary',
'v_b_gene.summary',
'j_b_gene.summary',
'cdr3_b_aa.summary',
'subject.summary'],
'fixed_radius': False,
'radius': None,
'query_str': 'qpublic == True & K_neighbors > 5',
'kargs_member_summ': {'key_col': 'neighbors',
'count_col': 'count',
'addl_cols': ['subject'],
'addl_n': 4},
'kargs_motif': {'pwmat_str': 'pw_beta',
'cdr3_name': 'cdr3_b_aa',
'v_name': 'v_b_gene',
'gene_names': ['v_b_gene', 'j_b_gene']},
'tcrsampler': <tcrsampler.sampler.TCRsampler at 0x13d5b9850>}
The default only summarizes subjects ‘addl_cols’: [‘subject’], so adding an additional categorical variable to include in the summary is as easy as:
tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort']
tp.labels.append("cohort.summary")
You can also specify your standard for publicity. Instead of ‘qpublic == True & K_neighbors > 5’ you can ask to find super public meta-clonotypes, returning only those groups that satisfy : ‘nsubject > 8’
tp.query_str = 'nsubject > 8'
Here’s the a full example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | import os import pandas as pd import numpy as np from tcrdist.repertoire import TCRrep from tcrdist.public import TCRpublic # <aesr_fn> antigen-enriched subrepertoire aesr_fn = os.path.join( 'tcrdist', 'data', 'covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv') # <aesr_df> antigen-enriched subrepertoire aesr_df = pd.read_csv(aesr_fn) # <tr> TCR repertoire tr = TCRrep( cell_df = aesr_df[[ 'cohort', 'subject', 'v_b_gene', 'j_b_gene', 'cdr3_b_aa']].copy(), organism = 'human', chains = ['beta'], db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = True) # <tp> TCRpublic class for reporting publicities tp = TCRpublic( tcrrep = tr, output_html_name = "quasi_public_clones2.html") # set to True, if we want a universal radius tp.fixed_radius = True # must then specify maximum distance for finding similar TCRs tp.radius = 18 # set criteria for being quasi-public tp.query_str = 'nsubject > 6' # Add additional columns to be summarized in the report tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort'] # Add cohort.summary to the labels column so it shows up in the report tp.labels.append("cohort.summary") # by calling, .report() an html report is made public = tp.report() |
As you can see in this new a html quasi-publicity report , the report has a new column for summarizing the percentage of TCRs coming from each cohort in the study and the number of meta-clonotypes are fewer, since only those with TCRs drawn from more than 8 subject are reported.
I want my search radius to be sequence specific¶
The radius applied to each centroid can be specified in a column of the clone_df.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | """ Instead of enforcing a fixed radius, use a radius specific to each centroid, specified in an additional column. """ import os import pandas as pd import numpy as np from tcrdist.repertoire import TCRrep from tcrdist.public import TCRpublic fn = os.path.join( 'tcrdist', 'data', 'covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv') df = pd.read_csv(fn) tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], organism = "human", chains = ["beta"]) tp = TCRpublic( tcrrep = tr, output_html_name = "quasi_public_clones3.html") # set to True, if we want a universal radius tp.fixed_radius = False # must then specify maximum distance for finding similar TCRs tp.radius = None # set criteria for being quasi-public tp.query_str = 'nsubject > 5' # Add additional columns to be summarized in the report tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort'] # Add cohort.summary to the labels column so it shows up in the report tp.labels.append("cohort.summary") tp.labels.append("cdr3s") # Change number of subjects to display tp.kargs_member_summ['addl_n'] = 10 # by calling, .report() an html report is made public = tp.report() |
Notice that radius varies by row in this quasi-publicity report ,
Working from neighbor_diff output¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | """ Use values from neighborhood_diff """ import os import pandas as pd import numpy as np from tcrdist.repertoire import TCRrep from tcrdist.public import TCRpublic fn = os.path.join('tcrdist','data','covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv') df = pd.read_csv(fn) tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], organism = "human", chains = ["beta"]) from tcrdist.rep_diff import neighborhood_diff ndif = neighborhood_diff( clone_df= tr.clone_df, pwmat = tr.pw_beta, count_col = 'count', x_cols = ['cohort'], knn_radius = 25, test_method = "chi2") # Add neighbors and other columns of interest # from neighbor_diff result to the clone_df tr.clone_df = pd.concat([tr.clone_df, ndif[['neighbors', 'K_neighbors','val_0','ct_0','pvalue']] ], axis = 1) # Because neighors and K_neighbor are already added to the clone_df # TCRpublic.report() uses those instead of finding new ones. tp = TCRpublic( tcrrep = tr, output_html_name = "quasi_public_clones_with_ndif.html") # Add any columns neighbor_diff columns #that you want to display in the final report tp.labels.append('val_0') tp.labels.append('ct_0') tp.labels.append('pvalue') # chagne sort to be pvalue not publicity tp.sort_columns = ['pvalue'] # because you are sorting by pvalue, change to True tp.sort_ascending = True tp.report() |
I hate OOP just show me the functions¶
TCRpublic is for convenience. You can customize a lot including the background tcrsampler; but the power users may want to work with the underlying functions directly. Here are some examples of how:
I just want to quickly find neighbors and (quasi)public clones¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | """ Instead of enforcing a fixed radius, use a radius specific to each centroid, specified in an additional column. """ import os import pandas as pd import numpy as np from tcrdist.repertoire import TCRrep from tcrdist.public import TCRpublic fn = os.path.join('tcrdist','data','covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv') df = pd.read_csv(fn) tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], organism = "human", chains = ["beta"]) # NEIGHBORS from tcrdist.public import _neighbors_fixed_radius from tcrdist.public import _neighbors_variable_radius # returns lists of lists of all neighbors at fixed of variable radii _neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18) _neighbors_variable_radius(pwmat = tr.pw_beta, radius_list = tr.clone_df.radius) # returns the number (K) neighbors at fixed or vriable radii from tcrdist.public import _K_neighbors_fixed_radius from tcrdist.public import _K_neighbors_variable_radius _K_neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18) _K_neighbors_variable_radius(pwmat = tr.pw_beta, radius_list = tr.clone_df.radius) # First find neighbors by your favorite method tr.clone_df['neighbors'] = _neighbors_variable_radius( pwmat = tr.pw_beta, radius_list = tr.clone_df.radius) # Once neighbors are added to a clone_df you can easily determine publicity. tr.clone_df['nsubject'] = tr.clone_df['neighbors'].\ apply(lambda x: tr.clone_df['subject'].iloc[x].nunique()) tr.clone_df['qpublic'] = tr.clone_df['nsubject'].\ apply(lambda x: x > 1) |
I have neighbors and radii already, I want logos¶
Suppose you want to specify exactly what to include in a motif logo report. This example is slightly different then those above because we are going to use two inputs files. The first input file includes all of the TCRs in antigen enriched repertoire. The second file is a subset of the first, specifying exactly the TCRs centroids to report. Remember that any element of a clone_df can be included/excluded from the HTML report. Those fields to include can be specified as labels.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 | """ Report of meta-clonotypes using two dataframes. <df> has all TCRS <df2> has a subset of TCRS in <df>, specifiyint which are to be used as centroids. """ import os import pandas as pd import numpy as np from tcrdist.repertoire import TCRrep from tcrdist.public import TCRpublic from tcrdist.tree import _default_sampler_olga from progress.bar import IncrementalBar from palmotif import compute_pal_motif, svg_logo from tcrdist.public import make_motif_logo output_html_name = "custom_report.html" # <fn> filename for all TCRs in an antigen-enriched repertoire fn = os.path.join('tcrdist','data','covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv.bE5ctrl.centers.csv') df = pd.read_csv(fn, sep = ",") df = df[['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'pgen', 'max_radi']].\ rename(columns= {'max_radi':'radius'}).copy() # <fn>2 filename for priority TCRs fn2 = os.path.join('tcrdist','data','covid19', 'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv.bE5ctrl.centers.csv.ranked_centers.tsv') df2 = pd.read_csv(fn2, sep = "\t").\ rename(columns= {'max_radi':'radius'}).copy() # Compute distances between all TCRs tr = TCRrep(cell_df = df, organism = 'human', chains = ['beta'], compute_distances = True) # Initialize a tcrsampler, this will be used to make background motifs tcrsampler = _default_sampler_olga(organism = "human", chain = "beta")() # Iterate through each row of the df2, making a logo for each. svgs = list() svgs_raw = list() bar = IncrementalBar("Making Logos", max = df2.shape[0]) for i,r in df2.iterrows(): bar.next() svg,svg_raw=make_motif_logo(tcrsampler = tcrsampler, clone_df = tr.clone_df, pwmat = tr.pw_beta, centroid = r['cdr3_b_aa'], v_gene = r['v_b_gene'], radius = r['radius'], pwmat_str = 'pw_beta', cdr3_name = 'cdr3_b_aa', v_name = 'v_b_gene', gene_names = ['v_b_gene','j_b_gene']) svgs.append(svg) svgs_raw .append(svg_raw) bar.next(); bar.finish() df2['svg'] = svgs df2['svg_raw'] = svgs_raw def shrink(s): """reduce size of svg graphic""" s = s.replace('height="100%"', 'height="20%"') s = s.replace('width="100%"', 'width="20%"') return s # Choose columns to include in the report labels = [ 'cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'radius', 'regex', 'target_hits', 'nsubject', 'chi2joint'] with open(output_html_name, 'w') as output_handle: for i,r in df2.iterrows(): #import pdb; pdb.set_trace() svg, svg_raw = r['svg'],r['svg_raw'] output_handle.write("<br></br>") output_handle.write(shrink(svg)) output_handle.write(shrink(svg_raw)) output_handle.write("<br></br>") output_handle.write(pd.DataFrame(r[labels]).transpose().to_html()) output_handle.write("<br></br>") |
Will this work with sparse matrix options?¶
tcrdist3 has a memory efficient options for larger datasets that produce scipy.sparse rather than dense representations of distance relationships.
Currently you can’t call TCRpublic() on this sparse representation. However, here is an example of how you can achieve similar results via a script, reporting (quasi)Public meta-clonotypes from a sparse format.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | """ Making a meta-clonotype report from a scipy.sparse TCRdist matrix. """ import numpy as np import pandas as pd from tcrdist.repertoire import TCRrep from tcrdist.public import _neighbors_sparse_fixed_radius, _neighbors_sparse_variable_radius from tcrdist.summarize import test_for_subsets from tcrdist.tree import _default_sampler_olga from tcrdist.public import make_motif_logo_from_index df = pd.read_csv("dash.csv").query('epitope == "PA"') tr = TCRrep(cell_df = df, #(2) organism = 'mouse', chains = ['beta'], db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = False) # When setting the radius to 50, the sparse matrix # will convert any value > 50 to 0. True zeros are # repressented as -1. radius = 50 tr.cpus = 1 # Notice that we called .compute_sparse_rect_distances instead of .compute_distances tr.compute_sparse_rect_distances(df = tr.clone_df, radius = radius) # There are two functions for finding neighbors from a sparse TCRdist matrix. # For 1 fixed radius: _neighbors_sparse_fixed_radius() # For a radius per row: _neighbors_sparse_variable_radius() tr.clone_df['radius'] = 12 tr.clone_df['neighbors'] = \ _neighbors_sparse_variable_radius( csrmat = tr.rw_beta, #radius = 12) radius_list = tr.clone_df['radius'].to_list()) # <K_neighbors>the number of neighbors per TCR tr.clone_df['K_neighbors'] = tr.clone_df['neighbors'].apply(lambda x: len(x)) # <nsubject> the number of subject (nsubject) neighboring the TCR ( tr.clone_df['nsubject'] = tr.clone_df['neighbors'].apply(lambda x: len(tr.clone_df['subject'][x].unique())) # nsubject > 1 implies quasi-publicity) tr.clone_df['qpublic'] = tr.clone_df['nsubject'].apply(lambda x: x >1 ) # e.g., For the report, focus on TCRs with more than 5 neighboring subjects quasi_public_df = tr.clone_df.query('nsubject > 5').copy().\ sort_values('nsubject', ascending = False) # test_for_subsets()> allows us to remove TCRs with identical neighborhoods quasi_public_df['unique'] = test_for_subsets(quasi_public_df['neighbors']) quasi_public_df = quasi_public_df[quasi_public_df['unique'] == 1].copy() # declare a sampler for generating a backgrond comparison ts = _default_sampler_olga(organism = 'mouse', chain = 'beta')() # make a background-subtracted logo <svg> and raw log <svg_raw> for each TCR svgs = list() svgs_raw = list() for i,r in quasi_public_df.iterrows(): svg, svg_raw = make_motif_logo_from_index(tcrsampler = ts, ind = r['neighbors'], centroid = r['cdr3_b_aa'], clone_df = tr.clone_df, cdr3_name = 'cdr3_b_aa', v_name = 'v_b_gene', gene_names = ['v_b_gene','j_b_gene']) svgs.append(svg) svgs_raw.append(svg_raw) # Output a html report output_html_name = 'quasi_public_df_report.html' quasi_public_df['svg'] = svgs quasi_public_df['svg_raw'] = svgs_raw # Specific columns to include in the report labels = [ 'cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'radius', 'K_neighbors', 'nsubject'] def shrink(s): """reduce size of svg graphic""" s = s.replace('height="100%"', 'height="20%"') s = s.replace('width="100%"', 'width="20%"') return s with open(output_html_name, 'w') as output_handle: for i,r in quasi_public_df.iterrows(): #import pdb; pdb.set_trace() svg, svg_raw = r['svg'],r['svg_raw'] output_handle.write("<br></br>") output_handle.write(shrink(svg)) output_handle.write(shrink(svg_raw)) output_handle.write("<br></br>") output_handle.write(pd.DataFrame(r[labels]).transpose().to_html()) output_handle.write("<br></br>") |
For more on sparse matrices in tcrdist3 see the tab on ‘Working With Bulk Data’.