TCR Distances

To quickly get data associated with these and other snippets, see the section on Downloads. Many examples use the data file (dash.csv).

I am happy to use the defaults

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
"""
If you just want a 'tcrdistances' using pre-set default setting.

    You can access distance matrices:
        tr.pw_alpha     - alpha chain pairwise distance matrix
        tr.pw_beta      - alpha chain pairwise distance matrix
        tr.pw_cdr3_a_aa - cdr3 alpha chain distance matrix
        tr.pw_cdr3_b_aa - cdr3 beta chain distance matrix
"""
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

tr.pw_alpha
tr.pw_beta
tr.pw_cdr3_a_aa
tr.pw_cdr3_b_aa

I’d like to tweak a default parameter

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
"""  
If you want 'tcrdistances' with changes over some parameters.
For instance you want to change the gap penalty on CDR3s to 5. 
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')

tr.kargs_a['cdr3_a_aa']['gap_penalty'] = 5 
tr.kargs_b['cdr3_b_aa']['gap_penalty'] = 5 

tr.compute_distances()

tr.pw_alpha
tr.pw_beta

I want complete control

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
"""
If want a 'tcrdistances' AND you want control over EVERY parameter.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')

metrics_a = {
    "cdr3_a_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_a_aa" : pw.metrics.nb_vector_tcrdist}

metrics_b = {
    "cdr3_b_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_b_aa" : pw.metrics.nb_vector_tcrdist }

weights_a= { 
    "cdr3_a_aa" : 3,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}

weights_b = { 
    "cdr3_b_aa" : 3,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_a = {  
    'cdr3_a_aa' : 
        {'use_numba': True, 
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 
        'dist_weight': 1, 
        'gap_penalty':4, 
        'ntrim':3, 
        'ctrim':2, 
        'fixed_gappos': False},
    'pmhc_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr2_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr1_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True}
    }
kargs_b= {  
    'cdr3_b_aa' : 
        {'use_numba': True, 
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 
        'dist_weight': 1, 
        'gap_penalty':4, 
        'ntrim':3, 
        'ctrim':2, 
        'fixed_gappos': False},
    'pmhc_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr2_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr1_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True}
    }   

tr.metrics_a = metrics_a
tr.metrics_b = metrics_b

tr.weights_a = weights_a
tr.weights_b = weights_b

tr.kargs_a = kargs_a 
tr.kargs_b = kargs_b

Tip

  • metrics - tcrdist3 allows specification of a unique metric to each CDR. You can even provide your own
    • By default - tcrdist3 uses pwseqdist >= 0.2.0, by the same authors, for various distance metrics
      • pw.metrics.nb_vector_tcrdist - a Numba accelerated metric that matches the approach in Dash et al. (2018)
      • pw.metrics.nw_hamming_metric - a pairwise Needleman-Wunsch alignment followed by hamming distance (with multiprocessing enabled for faster run-time)
      • pw.metrics.nw_metric - a reciprocal Needleman-Wunsch alignment score based dissimilarity using BLOSUM62
    • The field - commonly uses editdistance to compare CDRs. For instance, the Cython enabled packages like python-Levenshtein metrics can be used seamlessly with tcrdist3
    • Dictionary keys must match appropriate columns of the clone_df DataFrame.

Tip

  • weights - typically the CDR3 is given a higher weight because it is in direct contact with antigen peptide.
    • Dictionary keys must match appropriate columns of the clone_df DataFrame.

Tip

  • kargs - tuneable parameters
    • Tunable parameters are passed to metric as keyword-arguments. Some important parameters for pw.metrics.nb_vector_tcrdist are
      • use_numba - must be set to true for metrics starting with ‘nb’.
      • ctrim - number of amino acids to trim of c-terminal end of a CDR. (i.e. ctrim = 3 CASSQDFEQ-YF consider only positions after CAS-SQDFEQ-YF)
      • ntrim - number of amino acids to trim of n-terminal end of a CDR. (i.e. ntrim = 2 CASSQDFEQ-YF consider only positions before “YF”, SQDFEQ-YF)
      • gap_penalty - the penalty accrued for gaps in the pairwise alignment.
      • ‘fixed_gappos’ - When sequences are of different length, there will be a gap position. If ‘fixed_gappos’ is False, then the metric inserts a single gap at an optimal position based on a BLOSUM62 scoring matrix. This is recommended for the CDR3, but it is not necessary when the CDR1, 2, and 2.5 are already imgt_aligned and of a fixed length.
    • The default parameters match those used in Dash et al. (2018) *Quantifiable predictive features define epitope-specific T cell receptor repertoires
    • Dictionary keys must match appropriate columns of the clone_df DataFrame.

Tip

Protip: CDR2.5 alpha and CDR2.5 beta, the pMHC-facing loop between CDR2 and CDR3, are referred to in tcrdist2 as pmhc_a and phmc_b, respectively.

I just want to count mismatches

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
"""
If you want "tcrdistances" using a different metric.

Here we illustrate the use a metric that uses the 
Needleman-Wunsch algorithm to align sequences and then 
calculate the number of mismatching positions (pw.metrics.nw_hamming_metric)

This method doesn't rely on Numba so it can run faster using multiple cpus.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing

df = pd.read_csv("dash.csv")
df = df.head(100) # for faster testing
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            use_defaults=False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')

metrics_a = {
    "cdr3_a_aa" : pw.metrics.nw_hamming_metric ,
    "pmhc_a_aa" : pw.metrics.nw_hamming_metric ,
    "cdr2_a_aa" : pw.metrics.nw_hamming_metric ,
    "cdr1_a_aa" : pw.metrics.nw_hamming_metric }

metrics_b = {
    "cdr3_b_aa" : pw.metrics.nw_hamming_metric ,
    "pmhc_b_aa" : pw.metrics.nw_hamming_metric ,
    "cdr2_b_aa" : pw.metrics.nw_hamming_metric ,
    "cdr1_b_aa" : pw.metrics.nw_hamming_metric  }

weights_a = { 
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}

weights_b = { 
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_a = {  
    'cdr3_a_aa' : 
        {'use_numba': False},
    'pmhc_a_aa' : {
        'use_numba': False},
    'cdr2_a_aa' : {
        'use_numba': False},
    'cdr1_a_aa' : {
        'use_numba': False}
    }
kargs_b = {  
    'cdr3_b_aa' : 
        {'use_numba': False},
    'pmhc_b_aa' : {
        'use_numba': False},
    'cdr2_b_aa' : {
        'use_numba': False},
    'cdr1_b_aa' : {
        'use_numba': False}
    }

tr.metrics_a = metrics_a
tr.metrics_b = metrics_b

tr.weights_a = weights_a
tr.weights_b = weights_b

tr.kargs_a = kargs_a 
tr.kargs_b = kargs_b

tr.compute_distances()

tr.pw_cdr3_b_aa
tr.pw_beta

I want to use my own distance metric

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
"""
If you want a tcrdistance, but you want to use your own metric. 
(A valid metric takes two strings and returns a numerical distance).  

def my_own_metric(s1,s2):   
    return Levenshtein.distance(s1,s2)
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing

df = pd.read_csv("dash.csv")
df = df.head(100) # for faster testing
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            use_defaults=False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')

metrics_a = {
    "cdr3_a_aa" : my_own_metric ,
    "pmhc_a_aa" : my_own_metric ,
    "cdr2_a_aa" : my_own_metric ,
    "cdr1_a_aa" : my_own_metric }

metrics_b = {
    "cdr3_b_aa" : my_own_metric ,
    "pmhc_b_aa" : my_own_metric ,
    "cdr2_b_aa" : my_own_metric,
    "cdr1_b_aa" : my_own_metric }

weights_a = { 
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}

weights_b = { 
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_a = {  
    'cdr3_a_aa' : 
        {'use_numba': False},
    'pmhc_a_aa' : {
        'use_numba': False},
    'cdr2_a_aa' : {
        'use_numba': False},
    'cdr1_a_aa' : {
        'use_numba': False}
    }
kargs_b = {  
    'cdr3_b_aa' : 
        {'use_numba': False},
    'pmhc_b_aa' : {
        'use_numba': False},
    'cdr2_b_aa' : {
        'use_numba': False},
    'cdr1_b_aa' : {
        'use_numba': False}
    }

tr.metrics_a = metrics_a
tr.metrics_b = metrics_b

tr.weights_a = weights_a
tr.weights_b = weights_b

tr.kargs_a = kargs_a 
tr.kargs_b = kargs_b

tr.compute_distances()

tr.pw_cdr3_b_aa
tr.pw_beta

I want tcrdistances, but I hate OOP

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
"""
If you don't want to use OOP, but you I still want a multi-CDR 
tcrdistances on a single chain, using you own metric 

def my_own_metric(s1,s2):   
    return Levenshtein.distance(s1,s2)    
"""
import multiprocessing
import pandas as pd
from tcrdist.rep_funcs import _pws, _pw

df = pd.read_csv("dash2.csv")

metrics_b = {
    "cdr3_b_aa" : my_own_metric ,
    "pmhc_b_aa" : my_own_metric ,
    "cdr2_b_aa" : my_own_metric ,
    "cdr1_b_aa" : my_own_metric }

weights_b = { 
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_b = {  
    'cdr3_b_aa' : 
        {'use_numba': False},
    'pmhc_b_aa' : {
        'use_numba': False},
    'cdr2_b_aa' : {
        'use_numba': False},
    'cdr1_b_aa' : {
        'use_numba': False}
    }

dmats =  _pws(df = df , 
            metrics = metrics_b, 
            weights = weights_b, 
            kargs   = kargs_b , 
            cpu     = 1, 
            uniquify= True, 
            store   = True)

print(dmats.keys())

I hate OOP, and I only want distances for the CDR3

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
"""
If you hate object oriented programming, just show me the functions. 
No problem. 

Maybe you only care about the CDR3 on the beta chain.

def my_own_metric(s1,s2):   
    return Levenshtein.distance(s1,s2)
"""  
import multiprocessing
import pandas as pd
from tcrdist.rep_funcs import _pws, _pw

df = pd.read_csv("dash2.csv")

# 
dmat = _pw( metric = my_own_metric,
            seqs1 = df['cdr3_b_aa'].values,
            ncpus=2,
            uniqify=True,
            use_numba=False)

I want tcrdistances but I want to keep my variable names

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
"""
You want a 'tcrdistance' but you don't want to bother with the tcrdist3 framework. 

Note that the columns names are completely arbitrary under this 
framework, so one can directly compute a tcrdist on a 
AIRR, MIXCR, VDJTools, or other formated file without any
reformatting.
""" 
import multiprocessing
import pandas as pd
import pwseqdist as pw
from tcrdist.rep_funcs import _pws, _pw  

df_airr = pd.read_csv("dash_beta_airr.csv")

# Choose the metrics you want to apply to each CDR
metrics = { 'cdr3_aa' : pw.metrics.nb_vector_tcrdist,
            'cdr2_aa' : pw.metrics.nb_vector_tcrdist,
            'cdr1_aa' : pw.metrics.nb_vector_tcrdist}

# Choose the weights that are right for you.
weights = { 'cdr3_aa' : 3,
            'cdr2_aa' : 1,
            'cdr1_aa' : 1}

# Provide arguments for the distance metrics 
kargs = {   'cdr3_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':3, 'ctrim':2, 'fixed_gappos':False},
            'cdr2_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True},
            'cdr1_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}}
            
# Here are your distance matrices
from tcrdist.rep_funcs import _pws

dmats = _pws(df = df_airr,
         metrics = metrics, 
         weights= weights, 
         kargs=kargs, 
         cpu = 1, 
         store = True)

dmats['tcrdist']

I want to use TCRrep but I want to keep my variable names

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
"""
If you already have a clones file and want 
to compute 'tcrdistances' on a DataFrame with 
custom columns names.

Set:
1. Assign TCRrep.clone_df
2. set infer_cdrs = False,
3. compute_distances = False
4. deduplicate = False
5. customize the keys for metrics, weights, and kargs with the lambda
    customize = lambda d : {new_cols[k]:v for k,v in d.items()} 
6. call .calculate_distances()
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

new_cols = {'cdr3_a_aa':'c3a', 'pmhc_a_aa':'pa', 'cdr2_a_aa':'c2a','cdr1_a_aa':'c1a',
            'cdr3_b_aa':'c3b', 'pmhc_b_aa':'pb', 'cdr2_b_aa':'c2b','cdr1_b_aa':'c1b'}

df = pd.read_csv("dash2.csv").rename(columns = new_cols) 

tr = TCRrep(
        cell_df = df,
        clone_df = df,              #(1)
        organism = 'mouse', 
        chains = ['alpha','beta'],
        infer_all_genes = True, 
        infer_cdrs = False,         #(2)s
        compute_distances = False,  #(3)
        deduplicate=False,          #(4)
        db_file = 'alphabeta_gammadelta_db.tsv')

customize = lambda d : {new_cols[k]:v for k,v in d.items()} #(5)
tr.metrics_a = customize(tr.metrics_a)
tr.metrics_b = customize(tr.metrics_b)
tr.weights_a = customize(tr.weights_a)
tr.weights_b = customize(tr.weights_b)
tr.kargs_a = customize(tr.kargs_a)
tr.kargs_b = customize(tr.kargs_b)

tr.compute_distances() #(6)

# Notice that pairwise results now have custom names 
tr.pw_c3b
tr.pw_c3a
tr.pw_alpha
tr.pw_beta

I want distances from 1 TCR to many TCRs

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

(1) cell_df is asigned the first 10 cells in dash.csv
(2) TCRrep with default settings.
(3) compute rectangular distance between clone_df and df2.
"""
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv").head(10)   #(1)
tr = TCRrep(cell_df = df,               #(2)
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False)
df2 = pd.read_csv("dash.csv")
tr2 = TCRrep(cell_df = df2,               #(2)
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False)

tr.compute_rect_distances( df = tr.clone_df, df2 = tr2.clone_df) #(3)

assert tr.rw_alpha.shape == (10,1920) 
assert tr.rw_beta.shape  == (10,1920)

I want to search a bulk dataset

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
"""
Example of comparing two clone_df from separate repertoires of different sizes

1. Load a bulk beta-chain dataset with > 70 Clones.
2. Prepare a beta chain clones_df DataFrame with cdr3_b_aa, pmhc_b_aa, cdr2_b_aa, cdr1_b_aa columns.
3. Extract copy of processed dataframe.
4. Load data of interest into a TCRrep instance.
5. Compare the first 100 clones in primary dataset (100 Clones) against the the entire bulk dataset (77,538 Clones)
    Note: for large datasets you may want to turn store = False
"""
import pandas as pd 
from tcrdist.repertoire import TCRrep
from tcrdist.setup_tests import download_and_extract_zip_file
download_and_extract_zip_file('bulk.zip', source = "dropbox", dest = ".")
df_bulk = pd.read_csv('bulk.csv', sep = ",")        #(1)
tr_bulk = TCRrep(cell_df = df_bulk,                 #(2)
            organism = 'mouse', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = False)
df_search = tr_bulk.clone_df.copy()                 #(3)

df = pd.read_csv("dash.csv").head(100)              

tr = TCRrep(cell_df = df,                           #(4)
            organism = 'mouse', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

assert tr.pw_beta.shape == (100,100)
assert df_search.shape  == (77538, 8)
tr.compute_rect_distances(df = tr.clone_df, df2 = df_search, store =True) #(5)

assert tr.rw_beta.shape == (100, 77538)
assert tr.rw_cdr3_b_aa.shape == (100, 77538)

# see how many are tcrdistance less than 50
import numpy as np
np.sum(tr.rw_beta < 50, axis = 1)

I want something else

Let us know. We are happy to help. Get in touch with kmayerbl AT fredhutch DOT org