Inputs

Data

The tcrdist3 standard input is a Pandas DataFrame.

The header and first line of a typical input for a beta-chain analysis would look like this:

subject epitope count v_b_gene j_b_gene cdr3_b_aa cdr3_b_nucseq
s1 NP 1 TRBV1*01 TRBJ1-1*01 CACDSLGDKSSWDTRQMFF TGTGCCTGTGACTCGCTGGGGGATAAGAGCTCCTGGGACACCCGACAGATGTTTTTC
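As a minimal sketch, the row above could be built directly as a Pandas DataFrame. The values are copied from the example; the variable name df is only illustrative.

```python
import pandas as pd

# One row per observed TCR; column names carry the chain letter (b = beta).
df = pd.DataFrame(
    {
        "subject": ["s1"],
        "epitope": ["NP"],
        "count": [1],
        "v_b_gene": ["TRBV1*01"],
        "j_b_gene": ["TRBJ1-1*01"],
        "cdr3_b_aa": ["CACDSLGDKSSWDTRQMFF"],
        "cdr3_b_nucseq": [
            "TGTGCCTGTGACTCGCTGGGGGATAAGAGCTCCTGGGACACCCGACAGATGTTTTTC"
        ],
    }
)

print(df.columns.tolist())
```

In practice the same table would more often be read from a file (for example with pd.read_csv), as long as the column names follow the convention shown here.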

Column names reflect the chain under investigation:
  • a : alpha
  • b : beta
  • g : gamma
  • d : delta

Tcrdist3 requires only 3 input columns for single-chain analysis (e.g., cdr3_b_aa, v_b_gene, and j_b_gene for the beta chain) and 6 columns for paired-chain analysis (e.g., cdr3_b_aa, v_b_gene, j_b_gene, cdr3_a_aa, v_a_gene, and j_a_gene).
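The naming rule can be checked before handing data to tcrdist3. The sketch below derives the minimal column names from a list of chains; required_columns and missing_columns are hypothetical helpers written for this illustration and are not part of the tcrdist3 API.

```python
# Chain name -> single-letter code used in the column names.
CHAIN_LETTER = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d"}


def required_columns(chains):
    """Return the minimal column names for the given chains (hypothetical helper)."""
    cols = []
    for chain in chains:
        x = CHAIN_LETTER[chain]
        cols += [f"cdr3_{x}_aa", f"v_{x}_gene", f"j_{x}_gene"]
    return cols


def missing_columns(columns, chains):
    """List the required columns that are absent from `columns`."""
    return [c for c in required_columns(chains) if c not in columns]


print(required_columns(["beta"]))
print(missing_columns(["cdr3_b_aa", "v_b_gene"], ["beta"]))
```

Passing list(df.columns) as the first argument of missing_columns gives a quick pre-flight check on a DataFrame.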

The columns ‘cdr3_a_nucseq’ and ‘cdr3_b_nucseq’ are optional, but useful to include if you wish to prevent aggregation of multiple genetically distinct clones that are identical at the amino acid level (see Critical Information below).

For v_x_gene, include the full IMGT gene name and allele (e.g., TRBV1*01). If you don’t know the allele, use *01. An allele must be present, however, because the germline-encoded CDRs are inferred by matching the gene name against one of the id entries in the reference gene table.
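A small sketch of the "use *01 when the allele is unknown" advice follows. ensure_allele is a hypothetical helper, not a tcrdist3 function, and it only appends an allele; it does not check that the gene name itself is valid IMGT nomenclature.

```python
def ensure_allele(gene, default_allele="01"):
    """Append *01 to a gene name that lacks an allele (hypothetical helper)."""
    gene = gene.strip()
    # Names that already carry an allele (contain '*') are left untouched.
    return gene if "*" in gene else f"{gene}*{default_allele}"


print(ensure_allele("TRBV1"))
print(ensure_allele("TRBJ1-1*01"))
```

On a DataFrame this could be applied column-wise, for example df["v_b_gene"].map(ensure_allele).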

Tip

For paired analysis, supply each of these columns for both chains. TCR distances can be calculated without nucleotide sequences, but some other features require them.

The following column is expected; if it is omitted, every clone is assigned a count of 1:
  • ‘count’
The following are optional:
  • ‘epitope’
  • ‘subject’
The following are usually inferred from the germline reference V gene and should be supplied by the user only in advanced use cases:
  • ‘cdr1_a_aa’, ‘cdr1_b_aa’, ‘cdr1_g_aa’, or ‘cdr1_d_aa’
  • ‘cdr2_a_aa’, ‘cdr2_b_aa’, ‘cdr2_g_aa’, or ‘cdr2_d_aa’
  • ‘pmhc_a_aa’, ‘pmhc_b_aa’, ‘pmhc_g_aa’, or ‘pmhc_d_aa’ (pmhc = CDR2.5)

Tip

CDR2.5, the pMHC-facing loop between CDR2 and CDR3, is referred to in tcrdist3 as pmhc_a and pmhc_b for the alpha and beta chains, respectively.

Arguments

chain(s)

Most classes and functions in tcrdist3 require specification of the appropriate T cell receptor chains:

  • [‘alpha’], [‘beta’], [‘gamma’], or [‘delta’] for single-chain analysis,
  • [‘alpha’, ‘beta’] or [‘gamma’, ‘delta’] for paired-chain analysis.

organism

Most classes and functions in tcrdist3 require specification of an appropriate host organism. Currently only ‘human’ or ‘mouse’ are supported. This is required because reference TCR genes are organism specific.

db_file

The db_file is used by tcrdist3 to supply updated information about reference TCR germline sequences.

Critical Information

Tip

Please read this to understand what happens when you initialize a TCRrep instance.

Each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT-aligned CDR1, CDR2, and CDR2.5 sequences inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows.

Note that the order of the rows in the clone_df will not match the order in the cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure there are no missing values and no unrecognized V-gene names may use the deduplicate = False option, which transfers the cell_df row order directly to the clone_df without any row removal.)
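The grouping described above can be pictured with plain Pandas. This is only an illustration of the idea (group rows that agree on every column other than count, then sum the counts); it is not tcrdist3's internal code, and collapse is a hypothetical helper.

```python
import pandas as pd

cell_df = pd.DataFrame(
    {
        "subject": ["s1", "s2", "s1"],
        "count": [1, 2, 3],
        "v_b_gene": ["TRBV1*01"] * 3,
        "j_b_gene": ["TRBJ1-1*01"] * 3,
        "cdr3_b_aa": ["CACDSLGDKSSWDTRQMFF"] * 3,
    }
)


def collapse(df):
    """Group rows identical in every column except 'count' and sum the counts."""
    keys = [c for c in df.columns if c != "count"]
    return df.groupby(keys, as_index=False)["count"].sum()


# With the subject column, the two subjects stay on separate rows.
with_subject = collapse(cell_df)

# Without it, all three rows merge into a single clone.
without_subject = collapse(cell_df.drop(columns="subject"))

print(with_subject)
print(without_subject)
```

The comparison shows why the choice of columns passed in matters: dropping an annotation column changes which rows count as identical.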

  • Tcrdist3 requires only 3 input columns for single-chain analysis (e.g., cdr3_b_aa, v_b_gene, and j_b_gene for the beta chain) and 6 columns for paired-chain analysis (e.g., cdr3_b_aa, v_b_gene, j_b_gene, cdr3_a_aa, v_a_gene, and j_a_gene). More columns can be included depending on the application.
  • An optional count column will track the abundance of each clone. If no count column is provided, all clones are assigned a count of 1. Additional columns can be included if the user intends to distinguish input rows by donor information or epitope specificity annotations. The names of these columns are not pre-defined (e.g., subject, cell_type, visit, epitope).
  • By default (deduplicate = True), tcrdist3 uses all columns of the DataFrame passed to the cell_df argument to look for duplicated rows. Initializing a TCRrep instance automatically aggregates counts over duplicated rows.
  • This has practical consequences. For instance, if no subject column is included, identical clones from two or more individuals will be combined into a single row.
  • If a row has a missing value in any column, that row is excluded. Thus, do not include columns that have missing values. If you wish to retain every clonotype, adding an index column or the nucleotide sequence will prevent rows with identical amino acid sequences from being merged.
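One way to act on the last two points before initialization is sketched below with plain Pandas. The cell_type values and the row_id column name are made up for the example; any unique-valued column would serve the same purpose.

```python
import pandas as pd

cell_df = pd.DataFrame(
    {
        "v_b_gene": ["TRBV1*01"] * 3,
        "j_b_gene": ["TRBJ1-1*01"] * 3,
        "cdr3_b_aa": ["CACDSLGDKSSWDTRQMFF"] * 3,
        "cell_type": ["CD8", None, "CD8"],  # optional annotation with a gap
    }
)

# Columns that contain at least one missing value.
na_cols = cell_df.columns[cell_df.isna().any()].tolist()

# Drop incomplete annotation columns rather than lose the affected rows.
clean_df = cell_df.drop(columns=na_cols)

# Add a unique identifier so rows with identical sequences are not merged.
clean_df = clean_df.assign(row_id=range(len(clean_df)))

print(na_cols)
print(clean_df)
```

Whether to drop a column or fill its gaps depends on the analysis; the point is only that no NaN should reach the DataFrame passed as cell_df.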

Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data:

  • organism specifies the appropriate organism. Either the character string ‘human’ or ‘mouse’ must be specified.
  • chains specifies whether the TCRrep instance will evaluate single-chain or paired-chain data. Provide [‘alpha’] or [‘beta’] to the chains argument for single-chain analysis. For paired-chain analysis, supply [‘alpha’, ‘beta’]. Tcrdist3 also supports [‘gamma’], [‘delta’], or [‘gamma’, ‘delta’].

The organism and chains arguments ensure the correct lookup when appending CDR1, CDR2, and CDR2.5 sequences to the input cell_df DataFrame. To append these germline-encoded CDR sequences, tcrdist3 must recognize the user-supplied V gene names. The package uses IMGT nomenclature and a library of allele-specific reference genes.

  • cell_df contains the input TCR data. Only the relevant columns should be passed in the DataFrame to the cell_df argument. This is critical because a NaN (missing value) in any column will result in the corresponding row being removed from the analysis.
  • If the user wishes to retain clones identical at the amino acid level but with distinct CDR3 nucleotide junctions, the nucleotide sequence or another unique-valued column should be provided in the DataFrame passed to the cell_df argument.
  • Finally, remember that any row of cell_df with an unrecognized V gene name will be removed from the final clone_df. The rows of cell_df that were not integrated into clone_df can be inspected by calling TCRrep.show_incomplete() after initialization.

(Note: Advanced users who wish to add new genes not currently in the tcrdist3 library can do so by modifying the content of the ‘alphabeta_gammadelta_db.tsv’ file in the package source code (python3.8/site-packages/tcrdist/db/alphabeta_gammadelta_db.tsv).)
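Putting the arguments together, an initialization might look like the sketch below. build_tcrrep is a hypothetical wrapper, and the notes column is invented to show column selection. Passing db_file="alphabeta_gammadelta_db.tsv" assumes that the file named in the note above is the one to use. The import sits inside the function only so that the data-preparation part can run without tcrdist3 installed.

```python
import pandas as pd

raw_df = pd.DataFrame(
    {
        "count": [1],
        "v_b_gene": ["TRBV1*01"],
        "j_b_gene": ["TRBJ1-1*01"],
        "cdr3_b_aa": ["CACDSLGDKSSWDTRQMFF"],
        "notes": [None],  # hypothetical annotation with a missing value
    }
)

# Keep only the columns relevant to the analysis, so that a NaN in an
# unrelated column cannot cause the row to be dropped.
keep = ["count", "v_b_gene", "j_b_gene", "cdr3_b_aa"]
cell_df = raw_df[keep]


def build_tcrrep(cell_df, organism, chains):
    """Create a TCRrep from a prepared DataFrame (requires tcrdist3)."""
    from tcrdist.repertoire import TCRrep

    return TCRrep(
        cell_df=cell_df,
        organism=organism,  # 'human' or 'mouse', matching your data
        chains=chains,      # e.g. ['beta'] or ['alpha', 'beta']
        db_file="alphabeta_gammadelta_db.tsv",  # assumed reference file name
    )


# Example call (not executed here because it needs tcrdist3):
# tr = build_tcrrep(cell_df, organism="mouse", chains=["beta"])
# tr.clone_df            # deduplicated clones
# tr.show_incomplete()   # rows of cell_df not carried into clone_df
```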

Tip

The row order of NumPy arrays or SciPy sparse matrices (csrmats) containing computed pairwise distances will always match the row order of TCRrep.clone_df.
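Because of this shared row order, positions in a distance matrix can be used directly to index the clone table. The sketch below uses a toy clone_df and an invented 3 x 3 matrix named dmat in place of real tcrdist3 output; the sequences and distances are placeholders, and neighbors is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: row i of dmat corresponds to row i of clone_df.
clone_df = pd.DataFrame({"cdr3_b_aa": ["CASSLGF", "CASSLAF", "CASRPGF"]})
dmat = np.array(
    [
        [0, 12, 48],
        [12, 0, 36],
        [48, 36, 0],
    ]
)


def neighbors(i, radius):
    """Rows of clone_df within `radius` of clone i, excluding clone i itself."""
    hits = np.flatnonzero(dmat[i] <= radius)
    hits = hits[hits != i]
    return clone_df.iloc[hits]


print(neighbors(0, 24))
```

The same positional lookup does not hold against the cell_df, whose row order may differ from that of the clone_df.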

Tip

Getting new database files: a reference JSON is available at https://github.com/repseqio/library-imgt/releases. Data coming from the IMGT server may be used for academic research only, provided that it is referred to IMGT®, and cited as “IMGT®, the international ImMunoGeneTics information system® http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France).”