TCR Distances¶
To quickly get data associated with these and other snippets, see the section on Downloads. Many examples use the data file (dash.csv).
I am happy to use the defaults¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | """ If you just want a 'tcrdistances' using pre-set default setting. You can access distance matrices: tr.pw_alpha - alpha chain pairwise distance matrix tr.pw_beta - alpha chain pairwise distance matrix tr.pw_cdr3_a_aa - cdr3 alpha chain distance matrix tr.pw_cdr3_b_aa - cdr3 beta chain distance matrix """ import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], db_file = 'alphabeta_gammadelta_db.tsv') tr.pw_alpha tr.pw_beta tr.pw_cdr3_a_aa tr.pw_cdr3_b_aa |
I’d like to tweak a default parameter¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | """ If you want 'tcrdistances' with changes over some parameters. For instance you want to change the gap penalty on CDR3s to 5. """ import pwseqdist as pw import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], compute_distances = False, db_file = 'alphabeta_gammadelta_db.tsv') tr.kargs_a['cdr3_a_aa']['gap_penalty'] = 5 tr.kargs_b['cdr3_b_aa']['gap_penalty'] = 5 tr.compute_distances() tr.pw_alpha tr.pw_beta |
I want complete control¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | """ If want a 'tcrdistances' AND you want control over EVERY parameter. """ import pwseqdist as pw import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], compute_distances = False, db_file = 'alphabeta_gammadelta_db.tsv') metrics_a = { "cdr3_a_aa" : pw.metrics.nb_vector_tcrdist, "pmhc_a_aa" : pw.metrics.nb_vector_tcrdist, "cdr2_a_aa" : pw.metrics.nb_vector_tcrdist, "cdr1_a_aa" : pw.metrics.nb_vector_tcrdist} metrics_b = { "cdr3_b_aa" : pw.metrics.nb_vector_tcrdist, "pmhc_b_aa" : pw.metrics.nb_vector_tcrdist, "cdr2_b_aa" : pw.metrics.nb_vector_tcrdist, "cdr1_b_aa" : pw.metrics.nb_vector_tcrdist } weights_a= { "cdr3_a_aa" : 3, "pmhc_a_aa" : 1, "cdr2_a_aa" : 1, "cdr1_a_aa" : 1} weights_b = { "cdr3_b_aa" : 3, "pmhc_b_aa" : 1, "cdr2_b_aa" : 1, "cdr1_b_aa" : 1} kargs_a = { 'cdr3_a_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':3, 'ctrim':2, 'fixed_gappos': False}, 'pmhc_a_aa' : { 'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight':1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}, 'cdr2_a_aa' : { 'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}, 'cdr1_a_aa' : { 'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight':1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True} } kargs_b= { 'cdr3_b_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':3, 'ctrim':2, 'fixed_gappos': False}, 'pmhc_b_aa' : { 'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}, 'cdr2_b_aa' : { 'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight':1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}, 'cdr1_b_aa' : { 'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight':1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True} } tr.metrics_a = metrics_a tr.metrics_b = metrics_b tr.weights_a = weights_a tr.weights_b = weights_b tr.kargs_a = kargs_a tr.kargs_b = kargs_b |
Tip
- metrics - tcrdist3 allows specification of a unique metric to each CDR. You can even provide your own
- By default - tcrdist3 uses pwseqdist >= 0.2.0, by the same authors, for various distance metrics
- pw.metrics.nb_vector_tcrdist - a Numba accelerated metric that matches the approach in Dash et al. (2018)
- pw.metrics.nw_hamming_metric - a pairwise Needleman-Wunsch alignment followed by hamming distance (with multiprocessing enabled for faster run-time)
- pw.metrics.nw_metric - a reciprocal Needleman-Wunsch alignment score based dissimilarity using BLOSUM62
- The field - commonly uses editdistance to compare CDRs. For instance, the Cython enabled packages like python-Levenshtein metrics can be used seamlessly with tcrdist3
- Dictionary keys must match appropriate columns of the clone_df DataFrame.
Tip
- weights - typically the CDR3 is given a higher weight because it is in direct contact with antigen peptide.
- Dictionary keys must match appropriate columns of the clone_df DataFrame.
Tip
- kargs - tuneable parameters
- Tunable parameters are passed to metric as keyword-arguments. Some important parameters for pw.metrics.nb_vector_tcrdist are
- use_numba - must be set to true for metrics starting with ‘nb’.
- ctrim - number of amino acids to trim of c-terminal end of a CDR. (i.e. ctrim = 3 CASSQDFEQ-YF consider only positions after CAS-SQDFEQ-YF)
- ntrim - number of amino acids to trim of n-terminal end of a CDR. (i.e. ntrim = 2 CASSQDFEQ-YF consider only positions before “YF”, SQDFEQ-YF)
- gap_penalty - the penalty accrued for gaps in the pairwise alignment.
- ‘fixed_gappos’ - When sequences are of different length, there will be a gap position. If ‘fixed_gappos’ is False, then the metric inserts a single gap at an optimal position based on a BLOSUM62 scoring matrix. This is recommended for the CDR3, but it is not necessary when the CDR1, 2, and 2.5 are already imgt_aligned and of a fixed length.
- The default parameters match those used in Dash et al. (2018) *Quantifiable predictive features define epitope-specific T cell receptor repertoires
- Dictionary keys must match appropriate columns of the clone_df DataFrame.
Tip
Protip: CDR2.5 alpha and CDR2.5 beta, the pMHC-facing loop between CDR2 and CDR3, are referred to in tcrdist2 as pmhc_a and phmc_b, respectively.
I just want to count mismatches¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | """ If you want "tcrdistances" using a different metric. Here we illustrate the use a metric that uses the Needleman-Wunsch algorithm to align sequences and then calculate the number of mismatching positions (pw.metrics.nw_hamming_metric) This method doesn't rely on Numba so it can run faster using multiple cpus. """ import pwseqdist as pw import pandas as pd from tcrdist.repertoire import TCRrep import multiprocessing df = pd.read_csv("dash.csv") df = df.head(100) # for faster testing tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], use_defaults=False, compute_distances = False, cpus = 1, db_file = 'alphabeta_gammadelta_db.tsv') metrics_a = { "cdr3_a_aa" : pw.metrics.nw_hamming_metric , "pmhc_a_aa" : pw.metrics.nw_hamming_metric , "cdr2_a_aa" : pw.metrics.nw_hamming_metric , "cdr1_a_aa" : pw.metrics.nw_hamming_metric } metrics_b = { "cdr3_b_aa" : pw.metrics.nw_hamming_metric , "pmhc_b_aa" : pw.metrics.nw_hamming_metric , "cdr2_b_aa" : pw.metrics.nw_hamming_metric , "cdr1_b_aa" : pw.metrics.nw_hamming_metric } weights_a = { "cdr3_a_aa" : 1, "pmhc_a_aa" : 1, "cdr2_a_aa" : 1, "cdr1_a_aa" : 1} weights_b = { "cdr3_b_aa" : 1, "pmhc_b_aa" : 1, "cdr2_b_aa" : 1, "cdr1_b_aa" : 1} kargs_a = { 'cdr3_a_aa' : {'use_numba': False}, 'pmhc_a_aa' : { 'use_numba': False}, 'cdr2_a_aa' : { 'use_numba': False}, 'cdr1_a_aa' : { 'use_numba': False} } kargs_b = { 'cdr3_b_aa' : {'use_numba': False}, 'pmhc_b_aa' : { 'use_numba': False}, 'cdr2_b_aa' : { 'use_numba': False}, 'cdr1_b_aa' : { 'use_numba': False} } tr.metrics_a = metrics_a tr.metrics_b = metrics_b tr.weights_a = weights_a tr.weights_b = weights_b tr.kargs_a = kargs_a tr.kargs_b = kargs_b tr.compute_distances() tr.pw_cdr3_b_aa tr.pw_beta |
I want to use my own distance metric¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | """ If you want a tcrdistance, but you want to use your own metric. (A valid metric takes two strings and returns a numerical distance). def my_own_metric(s1,s2): return Levenshtein.distance(s1,s2) """ import pwseqdist as pw import pandas as pd from tcrdist.repertoire import TCRrep import multiprocessing df = pd.read_csv("dash.csv") df = df.head(100) # for faster testing tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], use_defaults=False, compute_distances = False, cpus = 1, db_file = 'alphabeta_gammadelta_db.tsv') metrics_a = { "cdr3_a_aa" : my_own_metric , "pmhc_a_aa" : my_own_metric , "cdr2_a_aa" : my_own_metric , "cdr1_a_aa" : my_own_metric } metrics_b = { "cdr3_b_aa" : my_own_metric , "pmhc_b_aa" : my_own_metric , "cdr2_b_aa" : my_own_metric, "cdr1_b_aa" : my_own_metric } weights_a = { "cdr3_a_aa" : 1, "pmhc_a_aa" : 1, "cdr2_a_aa" : 1, "cdr1_a_aa" : 1} weights_b = { "cdr3_b_aa" : 1, "pmhc_b_aa" : 1, "cdr2_b_aa" : 1, "cdr1_b_aa" : 1} kargs_a = { 'cdr3_a_aa' : {'use_numba': False}, 'pmhc_a_aa' : { 'use_numba': False}, 'cdr2_a_aa' : { 'use_numba': False}, 'cdr1_a_aa' : { 'use_numba': False} } kargs_b = { 'cdr3_b_aa' : {'use_numba': False}, 'pmhc_b_aa' : { 'use_numba': False}, 'cdr2_b_aa' : { 'use_numba': False}, 'cdr1_b_aa' : { 'use_numba': False} } tr.metrics_a = metrics_a tr.metrics_b = metrics_b tr.weights_a = weights_a tr.weights_b = weights_b tr.kargs_a = kargs_a tr.kargs_b = kargs_b tr.compute_distances() tr.pw_cdr3_b_aa tr.pw_beta |
I want tcrdistances, but I hate OOP¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | """ If you don't want to use OOP, but you I still want a multi-CDR tcrdistances on a single chain, using you own metric def my_own_metric(s1,s2): return Levenshtein.distance(s1,s2) """ import multiprocessing import pandas as pd from tcrdist.rep_funcs import _pws, _pw df = pd.read_csv("dash2.csv") metrics_b = { "cdr3_b_aa" : my_own_metric , "pmhc_b_aa" : my_own_metric , "cdr2_b_aa" : my_own_metric , "cdr1_b_aa" : my_own_metric } weights_b = { "cdr3_b_aa" : 1, "pmhc_b_aa" : 1, "cdr2_b_aa" : 1, "cdr1_b_aa" : 1} kargs_b = { 'cdr3_b_aa' : {'use_numba': False}, 'pmhc_b_aa' : { 'use_numba': False}, 'cdr2_b_aa' : { 'use_numba': False}, 'cdr1_b_aa' : { 'use_numba': False} } dmats = _pws(df = df , metrics = metrics_b, weights = weights_b, kargs = kargs_b , cpu = 1, uniquify= True, store = True) print(dmats.keys()) |
I hate OOP, and I only want distances for the CDR3¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | """ If you hate object oriented programming, just show me the functions. No problem. Maybe you only care about the CDR3 on the beta chain. def my_own_metric(s1,s2): return Levenshtein.distance(s1,s2) """ import multiprocessing import pandas as pd from tcrdist.rep_funcs import _pws, _pw df = pd.read_csv("dash2.csv") # dmat = _pw( metric = my_own_metric, seqs1 = df['cdr3_b_aa'].values, ncpus=2, uniqify=True, use_numba=False) |
I want tcrdistances but I want to keep my variable names¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | """ You want a 'tcrdistance' but you don't want to bother with the tcrdist3 framework. Note that the columns names are completely arbitrary under this framework, so one can directly compute a tcrdist on a AIRR, MIXCR, VDJTools, or other formated file without any reformatting. """ import multiprocessing import pandas as pd import pwseqdist as pw from tcrdist.rep_funcs import _pws, _pw df_airr = pd.read_csv("dash_beta_airr.csv") # Choose the metrics you want to apply to each CDR metrics = { 'cdr3_aa' : pw.metrics.nb_vector_tcrdist, 'cdr2_aa' : pw.metrics.nb_vector_tcrdist, 'cdr1_aa' : pw.metrics.nb_vector_tcrdist} # Choose the weights that are right for you. weights = { 'cdr3_aa' : 3, 'cdr2_aa' : 1, 'cdr1_aa' : 1} # Provide arguments for the distance metrics kargs = { 'cdr3_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':3, 'ctrim':2, 'fixed_gappos':False}, 'cdr2_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}, 'cdr1_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}} # Here are your distance matrices from tcrdist.rep_funcs import _pws dmats = _pws(df = df_airr, metrics = metrics, weights= weights, kargs=kargs, cpu = 1, store = True) dmats['tcrdist'] |
I want to use TCRrep but I want to keep my variable names¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | """ If you already have a clones file and want to compute 'tcrdistances' on a DataFrame with custom columns names. Set: 1. Assign TCRrep.clone_df 2. set infer_cdrs = False, 3. compute_distances = False 4. deduplicate = False 5. customize the keys for metrics, weights, and kargs with the lambda customize = lambda d : {new_cols[k]:v for k,v in d.items()} 6. call .calculate_distances() """ import pwseqdist as pw import pandas as pd from tcrdist.repertoire import TCRrep new_cols = {'cdr3_a_aa':'c3a', 'pmhc_a_aa':'pa', 'cdr2_a_aa':'c2a','cdr1_a_aa':'c1a', 'cdr3_b_aa':'c3b', 'pmhc_b_aa':'pb', 'cdr2_b_aa':'c2b','cdr1_b_aa':'c1b'} df = pd.read_csv("dash2.csv").rename(columns = new_cols) tr = TCRrep( cell_df = df, clone_df = df, #(1) organism = 'mouse', chains = ['alpha','beta'], infer_all_genes = True, infer_cdrs = False, #(2)s compute_distances = False, #(3) deduplicate=False, #(4) db_file = 'alphabeta_gammadelta_db.tsv') customize = lambda d : {new_cols[k]:v for k,v in d.items()} #(5) tr.metrics_a = customize(tr.metrics_a) tr.metrics_b = customize(tr.metrics_b) tr.weights_a = customize(tr.weights_a) tr.weights_b = customize(tr.weights_b) tr.kargs_a = customize(tr.kargs_a) tr.kargs_b = customize(tr.kargs_b) tr.compute_distances() #(6) # Notice that pairwise results now have custom names tr.pw_c3b tr.pw_c3a tr.pw_alpha tr.pw_beta |
I want distances from 1 TCR to many TCRs¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | (1) cell_df is asigned the first 10 cells in dash.csv (2) TCRrep with default settings. (3) compute rectangular distance between clone_df and df2. """ import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv").head(10) #(1) tr = TCRrep(cell_df = df, #(2) organism = 'mouse', chains = ['alpha','beta'], compute_distances = False) df2 = pd.read_csv("dash.csv") tr2 = TCRrep(cell_df = df2, #(2) organism = 'mouse', chains = ['alpha','beta'], compute_distances = False) tr.compute_rect_distances( df = tr.clone_df, df2 = tr2.clone_df) #(3) assert tr.rw_alpha.shape == (10,1920) assert tr.rw_beta.shape == (10,1920) |
I want to search a bulk dataset¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | """ Example of comparing two clone_df from separate repertoires of different sizes 1. Load a bulk beta-chain dataset with > 70 Clones. 2. Prepare a beta chain clones_df DataFrame with cdr3_b_aa, pmhc_b_aa, cdr2_b_aa, cdr1_b_aa columns. 3. Extract copy of processed dataframe. 4. Load data of interest into a TCRrep instance. 5. Compare the first 100 clones in primary dataset (100 Clones) against the the entire bulk dataset (77,538 Clones) Note: for large datasets you may want to turn store = False """ import pandas as pd from tcrdist.repertoire import TCRrep from tcrdist.setup_tests import download_and_extract_zip_file download_and_extract_zip_file('bulk.zip', source = "dropbox", dest = ".") df_bulk = pd.read_csv('bulk.csv', sep = ",") #(1) tr_bulk = TCRrep(cell_df = df_bulk, #(2) organism = 'mouse', chains = ['beta'], db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = False) df_search = tr_bulk.clone_df.copy() #(3) df = pd.read_csv("dash.csv").head(100) tr = TCRrep(cell_df = df, #(4) organism = 'mouse', chains = ['beta'], db_file = 'alphabeta_gammadelta_db.tsv') assert tr.pw_beta.shape == (100,100) assert df_search.shape == (77538, 8) tr.compute_rect_distances(df = tr.clone_df, df2 = df_search, store =True) #(5) assert tr.rw_beta.shape == (100, 77538) assert tr.rw_cdr3_b_aa.shape == (100, 77538) # see how many are tcrdistance less than 50 import numpy as np np.sum(tr.rw_beta < 50, axis = 1) |
I want something else¶
Let us know. We are happy to help. Get in touch with kmayerbl AT fredhutch DOT org