Adaptive ImmunoSEQ Data

Adaptive uses a naming convention that differs from IMGT nomenclature, which poses a formatting challenge when ImmunoSEQ files are used as inputs to tcrdist3. According to Adaptive’s technical team: “Adaptive’s nomenclature is more expanded to both facilitate alphanumeric sorting, and also specify the precision of the identification.” Adaptive’s technical team further explained the difference between the two naming systems, which we paraphrase here:

Both naming systems follow a [locus and family]-[gene]*[allele] convention, where IMGT naming prioritizes brevity, opting for “a single letter or number where possible” (except for alleles). IMGT also leaves out gene-level information when there is only one gene in the family. For instance, IMGT drops the gene-level info in naming TRBV15*02. By contrast, Adaptive uses the following three possible names:

  • A gene with allele-level identification: TCRBV15-01*02
  • Gene-level identification: TCRBV15-01
  • Family-level only: TCRBV15
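The three forms above can be told apart mechanically. The following is a minimal sketch, assuming Adaptive names always follow the [locus and family]-[gene]*[allele] pattern described here; the regular expression and the `precision_level` helper are illustrative and are not part of tcrdist3.

```python
import re

# Assumed pattern (not from tcrdist3): locus and family, then an optional
# "-gene" part, then an optional "*allele" part.
ADAPTIVE_NAME = re.compile(
    r"^(?P<family>TCR[AB][VDJ]\d+)(?:-(?P<gene>\d+))?(?:\*(?P<allele>\d+))?$"
)

def precision_level(name):
    """Return 'allele', 'gene', 'family', or 'unrecognized' for a name."""
    match = ADAPTIVE_NAME.match(name)
    if match is None:
        return "unrecognized"
    if match.group("allele"):
        return "allele"
    if match.group("gene"):
        return "gene"
    return "family"

# Classify the three example names from the list above.
levels = {
    name: precision_level(name)
    for name in ["TCRBV15-01*02", "TCRBV15-01", "TCRBV15"]
}
```

Names that do not fit the assumed pattern, including IMGT-style names such as TRBV15*02, are reported as 'unrecognized' rather than guessed at.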

Within the ‘bio_identity’ field, Adaptive’s output files can also contain names like TCRBV15-X, which appear when the gene-level assignment is ambiguous.
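To make the ambiguous case concrete, here is a small sketch that pulls the V-gene call out of a bio-identity string and flags the -X form. It assumes the field packs the CDR3 amino acid sequence, V gene, and J gene into one string separated by '+'; that layout, the example string, and the `split_bio_identity` helper are assumptions for illustration, so check them against your own file.

```python
def split_bio_identity(bio_identity):
    """Split an assumed 'CDR3+V+J' string into labeled parts."""
    parts = bio_identity.split("+")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '+'-separated parts, got {len(parts)}")
    cdr3, v_gene, j_gene = parts
    return {
        "cdr3": cdr3,
        "v_gene": v_gene,
        "j_gene": j_gene,
        # A trailing '-X' is taken to mean the gene within the family is ambiguous.
        "v_gene_ambiguous": v_gene.endswith("-X"),
    }

# Made-up example string, used only to exercise the helper.
example = split_bio_identity("CASSLGQAYEQYF+TCRBV15-X+TCRBJ02-07")
```

Flagging ambiguous calls up front makes it easier to decide whether to drop those rows or handle them by hand.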

tcrdist3 uses IMGT gene names throughout, so the first step in working with ImmunoSEQ files is name conversion. To avoid discarding many CDR3s when the V gene is not fully resolved, we often use Adaptive’s ‘bio_identity’ gene-level calls and set the allele to *01. Depending on your project’s goals, you may want to do this cleaning by hand, so let’s first look at how to convert Adaptive’s v_gene into its IMGT *01 equivalent:

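# Download the example file (the leading "!" runs a shell command in Jupyter/IPython)
!wget https://raw.githubusercontent.com/kmayerb/tcrdist3/master/Adaptive2020.tsv
import pandas as pd
from tcrdist.swap_gene_name import adaptive_to_imgt
adpt_input = pd.read_csv('Adaptive2020.tsv', sep='\t')
# Map each Adaptive V-gene name to its IMGT *01 name; names missing from the dictionary become None
adpt_input['v_b_gene'] = adpt_input['v_gene'].apply(lambda x: adaptive_to_imgt['human'].get(x))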
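Because the lookup uses `.get`, any Adaptive name absent from the dictionary silently becomes None. When cleaning by hand it can help to tally those names first. The sketch below uses a two-entry toy dictionary as a stand-in for `adaptive_to_imgt['human']`; the toy dictionary and the `convert_and_audit` helper are hypothetical and are not part of tcrdist3.

```python
from collections import Counter

# Toy stand-in for adaptive_to_imgt['human']; the real dictionary ships with tcrdist3.
toy_adaptive_to_imgt = {
    "TCRBV30": "TRBV30*01",
    "TCRBV15-01": "TRBV15*01",
}

def convert_and_audit(v_genes, mapping):
    """Convert names with mapping.get and count the names that had no match."""
    converted = [mapping.get(v) for v in v_genes]
    unmapped = Counter(v for v, c in zip(v_genes, converted) if c is None)
    return converted, unmapped

converted, unmapped = convert_and_audit(
    ["TCRBV30", "TCRBV15-01", "TCRBV15-X", "TCRBV15-X"],
    toy_adaptive_to_imgt,
)
```

The `unmapped` counter then shows which names need manual attention and how many rows each one affects.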

Cleaning Adaptive ImmunoSEQ Files

We also provide a one-line conversion function that works with recent ImmunoSEQ files containing the ‘bio_identity’ field, as shown here:

import_adaptive_file

import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt

df = import_adaptive_file(adaptive_filename = 'Adaptive2020.tsv')

tcrdist.adpt_funcs.import_adaptive_file(adaptive_filename, organism='human', chain='beta', return_valid_cdr3_only=True, count='productive_frequency', version_year=2020, sep='\t', subject=None, epitope=None, log=True, swap_imgt_dictionary=None, additional_cols=None, use_cols=['bio_identity', 'productive_frequency', 'templates', 'rearrangement'])

Prepare tcrdist3 input from 2020 Adaptive File containing ‘bio_identity’, ‘productive_frequency’, ‘templates’, and ‘rearrangement’.

Parameters:
  • adaptive_filename (str) – path to the Adaptive filename
  • version_year (int) – version year of the Adaptive file format
  • epitope (str or None) – name of epitope if known
  • subject (str or None) – If None, the filename will be used as the subject
  • count (str) – name of column to be used as count (could be ‘productive_frequency’ or ‘templates’)
  • sep (str) – separator in the Adaptive file
  • organism (str) – ‘human’ or ‘mouse’
  • chain (str) – ‘beta’ or ‘alpha’
  • log (bool) – If True, write a log.
  • swap_imgt_dictionary (dict or None) – If None, the default dictionary adaptive_to_imgt is used
  • additional_cols (None or List) – list of any additional columns you want to keep
  • use_cols (list) – list of columns to retain from the original input file; defaults to [‘bio_identity’, ‘productive_frequency’, ‘templates’, ‘rearrangement’]. Add ‘subject’ if you wish to retain the subject.
Returns: bulk_df

Return type: pd.DataFrame

After conversion, the resulting pandas DataFrame can be passed directly to tcrdist3.

Loading Adaptive ImmunoSEQ Files

import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt

df = import_adaptive_file(adaptive_filename = 'Adaptive2020.tsv')
# For larger datasets, make sure compute_distances is set to False, 
# see: https://tcrdist3.readthedocs.io/en/latest/bulkdata.html
tr = TCRrep(cell_df = df, 
            organism = 'human', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = False)

Look Up Adaptive Conversion

from tcrdist.adpt_funcs import adaptive_to_imgt

# Look up the *01 IMGT allele corresponding to an Adaptive gene name
assert adaptive_to_imgt['human']['TCRBV30'] == 'TRBV30*01'