tcrdist3 is an open-source python package that enables a broad array of T cell receptor sequence analyses. Some of the functionality and code is adapted from the original tcr-dist package which was released with the publication of Dash et al. Nature (2017). This package contains a new API for computing tcrdistance measures as well as new features for biomarker development, described in more detail in Mayer-Blackwell et al. bioRxiv (2020). The package has been expanded to include support for gamma-delta TCRs; it has also been recoded to increase CPU efficiency using numba, a high-performance just-in-time compiler.
pip install git+https://email@example.com
OSX Install Tips¶
If you are having trouble installing on OSX, here are some additional steps that users have done to install tcrdist3 on OS X from scratch:
- Install Homebrew: https://brew.sh (copy the command from the webpage into Terminal app)
- In Terminal intall the following (or parasail may not build):
brew install autoconf automake libtool
- Install miniconda for python3.8 for OS X from the .pkg .
- In the terminal and type:
conda create --name tcrdist3 python=3.8 conda activate tcrdist3 pip install tcrdist3 pip install notebook ipython ipython
Now in the iPython console you can copy and paste examples.
docker pull quay.io/kmayerb/tcrdist3:0.2.0
If you need more details, checkout out the page on the tcrdist3 Docker container.
A human alpha/beta T cell repertoire is huge, comprised of an estimated 10^11 to 10^13 total T cells, with the process of V(D)J recombination capable of generating as many as 10^15 to 10^20 unique receptor sequences. With the wide adoption of high-throughput Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) and single-cell sequencing technologies, immunologists are witnessing an explosion in the availability of TCR repertoire data.
TCR functional diversity is concentrated in complementarity defining regions (CDRs) that contact antigen-presenting molecules and enable receptor docking and antigenic specificity. Although software tools exist for TCR repertoire analysis we have developed a modern Python package that improves upon a biochemically aware TCR distance metric that can effectively reduce the dimensionality of repertoire data, enable statistical analyses that efficiently handle the scale of and molecular diversity in AIRR-seq datasets. tcrdist3 is designed as an interactive set of functional operations that can be interfaced with existing tools and used to construct flexible hypothesis-driven analysis pipelines within batch computing frameworks.
Loading a TCR Dataset¶
Loading a TCR dataset from most standard formats, including AIRR, MIXCR and Adaptive, and preparing it for analysis with tcrdist3 is simple. Tcrdist3 operates with a pandas.DataFrame, which can be created from CSV/TSV and many other tabular formats. Once the DataFrame is loaded its a matter of renaming columns to match tcrdist or telling tcrdist which columns to use.
[example of loading from a .csv file ].
1 2 3 4 5 6 7 8 9
import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = False)
For analyses that require the sequences of the CDRs encoded by the V genes, these sequences can be inferred from gene/allele names based on IMGT nomenclature (the nomenclature used by Adaptive Biotechnology is also supported as shown in an example on Cleaning Adaptive ImmunoSEQ Files).
Gene Level Analyses¶
Loading a TCR dataset will immediately enable gene-level analyses of repertoires, for example Sankey plots of V/D/J gene frequency or statistical tests for differential gene enrichment in two or more conditions:
1 2 3 4 5 6 7 8 9 10 11 12 13
from tcrdist.plotting import plot_pairings, _write_svg svg_PA = plot_pairings(tr.clone_df.loc[tr.clone_df.epitope == "PA"], cols = ['v_b_gene', 'j_b_gene','j_a_gene', 'v_a_gene'], count_col='count') svg_NP = plot_pairings(tr.clone_df.loc[tr.clone_df.epitope == "NP"], cols = ['v_b_gene', 'j_b_gene', 'j_a_gene', 'v_a_gene'], count_col='count') _write_svg(svg_PA, name = "PA_gene_usage_plot.svg", dest = ".") _write_svg(svg_NP, name = "NP_gene_usage_plot.svg", dest = ".")
PA and NP gene usage diagrams. Click to enlarge.
See the example gallery (Visualizing) for more examples of how to produce these diagrams.
1 2 3 4 5 6
import fishersapi fishersapi.fishers_frame(tr.clone_df.loc[tr.clone_df.epitope == "NP"], col_pairs=[('v_b_gene', 'j_b_gene'), ('v_a_gene', 'j_a_gene'), ('v_a_gene', 'v_b_gene'), ('j_a_gene', 'j_b_gene')])
Central to TCR repertoire analysis is the choice of a distance measure that relates unique TCRs to one another, based on the assumption that TCRs sharing a high degree of sequence similarity tend to recognize the same or similar MHC-bound peptides (pMHCs). In the absence of TCR distance-based sequence clustering, one is forced to treat each TCR clonotype as a unique biological entity. The major methodological drawback of considering each clonotype in isolation is s that it is rare to observe identical clonotypes in multiple samples from the same individual and even rarer across multiple individuals, making it difficult to discover novel biomarkers or understand generalized underpinnings of TCR:pMHC specificity. By relating pairs of TCRs with a sequence-based distance, the analyst can vastly reduce the effective diversity of a TCR repertoire and enable the application of distance-based statistical learning methods.
This concept of a TCR distance motivate our group’s initial analyses in Dash et al. (2017) and is well appreciated by the growing community of experts in immune repertoire analysis (Glanville et al. 2017, Madi et al. 2017, Pogorelyy and Shugay 2019, Huang et al. 2020). Tcrdist3 builds on our previous efforts and offers a unique, flexible, and efficient approach for computing distances among TCRs.
By initializing the TCRRep object with data, using default arguments, you have already computed distances among all pairs of unique TCRs. By default, TCRRep computes the distance published in Dash et al., which is based on amino-acid similarity in each of the CDR loops, with greater weight on the CDR3.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['beta'], db_file = 'alphabeta_gammadelta_db.tsv') """ tr.pw_beta array([[ 0., 162., 90., ..., 120., 75., 75.], [162., 0., 123., ..., 96., 126., 138.], [ 90., 123., 0., ..., 81., 120., 120.], ..., [120., 96., 81., ..., 0., 69., 69.], [ 75., 126., 120., ..., 69., 0., 12.], [ 75., 138., 120., ..., 69., 12., 0.]]) """
A unique feature of tcrdist3 is that all of the parameters of the distance metric can be adjusted (e.g. alpha-chain only, weights on CDR loops, etc.) or completely new user-defined metrics can be provided to calculate pairwise distances. The package comes with a distance based on Needleman-Wunsch global sequence alignment and a BLOSUM62 similarity matrix, as well as the Levenshtein/edit distance, which is employed by other TCR analysis packages such as TCRNET/VDJtools, ALICE, and GLIPH2. [example of a beta-only edit distance]
See the example gallery (TCR Distances) for many examples of the flexibility provided for computing TCR distances.
With pairwise distances defined, the next step is to group similar TCRs into neighborhoods (i.e., clusters of biochemically similar TCRs). This can be achieved with any number of distance-based clustering methods (basic clustering in python can be performed with sklearn.clustering [https://scikit-learn.org/stable/modules/clustering.html]). We have focused our initial efforts on nearest-neighbor (NN) and hierarchical clustering. The NN method forms a neighborhood around a TCR based on a pre-specified inclusion radius or a fixed number of neighborhood members; we suggest forming a neighborhood around every unique TCR using a relatively conservative radius that includes TCRs with <1-2 aa substitutions in the CDR3, such that all TCRs within the neighborhood likely share antigenic specificity [will be nice at some point to show a graph of how TCR distance relates to CDR3 hamming distance on average for distances from one sequence in the “head” of a single epitope, as a rough guide].
Fixed Radius Neighborhoods¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
""" Basic Specificity Neighborhoods based on a Fixed Radius """ import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['beta','alpha'], db_file = 'alphabeta_gammadelta_db.tsv') from tcrdist.rep_diff import neighborhood_diff # diff testing is pasted on binary comparison, so all epitope not 'PA' are set to 'X' tr.clone_df['PA'] = ['PA' if x == 'PA' else 'X' for x in tr.clone_df.epitope] # Larger Radius tr.nn_df = neighborhood_diff(clone_df= tr.clone_df, pwmat = tr.pw_beta + tr.pw_alpha, count_col = 'count', x_cols = ['PA'], knn_radius = 150) tr.nn_df[['K_neighbors', 'val_0', 'ct_0', 'val_2', 'ct_2', 'RR','OR', 'pvalue', 'FWERp','FDRq']].\ sort_values(['FDRq']).sort_values(['OR','ct_0'], ascending = False) # Smaller Radius tr.nn_df = neighborhood_diff(clone_df= tr.clone_df, pwmat = tr.pw_beta + tr.pw_alpha, count_col = 'count', x_cols = ['PA'], knn_radius = 75) tr.nn_df[['K_neighbors', 'val_0', 'ct_0', 'val_2', 'ct_2', 'RR','OR', 'pvalue', 'FWERp','FDRq']].\ sort_values(['FDRq']).sort_values(['OR','ct_0'], ascending = False)
The full sorted output table has a lot of information, but a takeaway from the first row can be gathered from a subset of the columns shown below. Notice that clone 1011 neighbors 93 unique clonetypes (K_neighbors), repressenting 165 sequences (ct_0) that all share PA-epitope specificity and 0 (‘ct_2’) that are not PA-specific.
|1011||93||(‘PA’, ‘MEM+’)||165||(‘X’, ‘MEM+’)||0||7.12||inf||2.16566-06||0.00308||7.262e-06|
|1050||82||(‘PA’, ‘MEM+’)||156||(‘X’, ‘MEM+’)||0||7.00||inf||1.92270e-06||0.00288||7.262e-06|
|1075||82||(‘PA’, ‘MEM+’)||156||(‘X’, ‘MEM+’)||0||7.00||inf||1.92270e-06||0.00288||7.262e-06|
|500||76||(‘PA’, ‘MEM+’)||144||(‘X’, ‘MEM+’)||0||6.85||inf||1.30396e-06||0.00224||7.262e-06|
|60||70||(‘PA’, ‘MEM+’)||127||(‘X’, ‘MEM+’)||0||6.64||inf||1.69754e-06||0.00269||7.262e-06|
|71||76||(‘PA’, ‘MEM+’)||124||(‘X’, ‘MEM+’)||0||6.61||inf||1.16703e-06||0.00204||7.262e-06|
By vizualizing the first 500 rows of the DataFrame output by neighborhood_diff at different thresshold radi, we see the number of similar clones clustering to each unique clone and the proportion of neighboring clones that are also PA-epitope specific (shown in blue), at a knn_radius = 150 (top) and at a knn_radius = 75 (bottom).
This dataset alone is not sufficient to estimate a threshold tcrdistance threshold likely to enrich for clustesr of epitope-specific TCRs. This is because the non-PA TCRs in this demo dataset, are also under strong selection to recognize another epitope. A more illustrative example would use a larger set of background sequences that might be expected in non-presorted repertoire. We will show later on how an improved analysis can be done with larger background datasets.
Alternatively, hierarchical clustering provides neighborhoods with a range of TCR diversity from pairs of similar TCRs near the leaves of the tree to larger neighborhoods near the root. While its unlikely that the larger nodes of a clustered repertoire will represent a single epitope specificity, evaluating a range of neighborhood sizes provides a more unbiased approach. The hierdiff module includes a plotting function for producing a dendrogram that uses the color of each neighborhood/node of the tree to represent a categorical meta-data variable such as the experimental condition under which the TCRs were observed; the SVG output enables embedding in a Jupyter notebook or html page and can provide a useful way to interactively explore the clusters.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
""" Basic Specificity Neighborhoods based on a Hierarchical Clustering """ import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['beta','alpha'], db_file = 'alphabeta_gammadelta_db.tsv') from tcrdist.rep_diff import hcluster_diff, member_summ from hierdiff import plot_hclust_props # diff testing is pasted on binary comparison, so all epitope not 'PA' are set to 'X' tr.clone_df['PA'] = ['PA' if x == 'PA' else 'X' for x in tr.clone_df.epitope] res, Z= hcluster_diff(tr.clone_df, tr.pw_beta, x_cols = ['PA'], count_col = 'count') res_summary = member_summ(res_df = res, clone_df = tr.clone_df, addl_cols=['epitope']) res_detailed = pd.concat([res, res_summary], axis = 1) html = plot_hclust_props(Z, title='PA Epitope Example', res=res_detailed, tooltip_cols=['cdr3_b_aa','v_b_gene', 'j_b_gene','epitope'], alpha=0.00001, colors = ['blue','gray'], alpha_col='pvalue') with open('hierdiff_example.html', 'w') as fh: fh.write(html)
The purpose of defining many, potentially overlapping and redundant neighborhoods, is to provide a comprehensive set of features that could potentially be associated with an experimental condition or phenotype of interest. The ultimate goal may be to filter out the most interesting TCRs and their neighborhoods for future investigation.
Like all statistical analyses, TCR analysis benefits from a properly designed experimental design with a testable prospective hypothesis. For instance, an experiment might seek to identify vaccine-induced TCRs in one individual with samples collected and sequenced before and after vaccination; the hypothesis would be that the vaccine induced one or more expansions among T cells recognizing a vaccine epitope. The functions tcrdist.stats.hier_diff and neighborhood_diff enable testing of this hypothesis by testing each neighborhood for enrichment or depletion of TCRs associated with a specific condition or variable; after multiplicity adjustment clusters with significant enrichment are good candidates for further investigation.
One may also be interested in a population-level question about whether there is a neighborhood of TCRs that are consistently or “publicly” induced by a vaccine; such a TCR might be a candidate biomarker for vaccine protection. The rep_diff functions can also accommodate basic experimental designs involving multiple subjects and categorical covariates. For more complicated experiments the count tables generated by nn_tally/hier_tally can also be exported for use in more sophisticated modeling frameworks (e.g. linear regression, mixed effects, etc.)
Save your TCRrep object using dill.
import dill import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], db_file = 'alphabeta_gammadelta_db.tsv') dill.dump(tr, open("yourfile.dill", mode='wb')) tr_reloaded = dill.load(open("yourfile.dill", mode='rb'))
For larger datasets, you may wish to first remove unwanted large attributes that will contribute to the overall file size before dill pickling. Moreover, you can achieve additional space saving by converting Numpy ndarray attributes to type int16, prior to saving. In the example below, these steps reduce the final filesize from 114 MB to 36 MB.
import dill import pandas as pd from tcrdist.repertoire import TCRrep df = pd.read_csv("dash.csv") tr = TCRrep(cell_df = df, organism = 'mouse', chains = ['alpha','beta'], db_file = 'alphabeta_gammadelta_db.tsv') """ Optional: remove CDR specific attributes prior to saving """ tr.pw_pmhc_a_aa = None tr.pw_cdr3_a_aa = None tr.pw_cdr2_a_aa = None tr.pw_cdr1_a_aa = None tr.pw_pmhc_b_aa = None tr.pw_cdr3_b_aa = None tr.pw_cdr2_b_aa = None tr.pw_cdr1_b_aa = None """ Optional: change numpy datatype to further reduces file sizes """ tr.pw_alpha = tr.pw_alpha.astype('int16') tr.pw_alpha = tr.pw_beta.astype('int16') dill.dump(tr, open("yourfile.dill", mode='wb')) tr_reloaded = dill.load(open("yourfile.dill", mode='rb'))
Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, Crawford JC, Souquette A, Gaevert JA, Hertz T, Thomas PG, Bradley PH, Fiore-Gartland A. 2020 TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA-restricted SARS-CoV-2 associated TCR features. bioRxiv (2020) doi:10.1101/2020.12.24.424260
Dash, P. et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547, 89–93 (2017)
Glanville, J. et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017)
Madi, A. et al. T cell receptor repertoires of mice and humans are clustered in similarity networks around conserved public CDR3 sequences. Elife 6, (2017)
Pogorelyy, M. V. & Shugay, M. A Framework for Annotation of Antigen Specificities in High-Throughput T-Cell Repertoire Sequencing Studies. Front. Immunol. 10, 2159 (2019)
Pogorelyy, M. V. et al. Detecting T cell receptors involved in immune responses from single repertoire snapshots. PLOS Biology vol. 17 e3000314 (2019)
Huang, H., Wang, C., Rubelt, F., Scriba, T. J. & Davis, M. M. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat. Biotechnol. (2020) doi:10.1038/s41587-020-0505-4