Adaptive Biotechnology Data

Cleaning Adaptive Biotechnology Files

import_adaptive_file

import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt

df = import_adaptive_file(adaptive_filename = 'Adaptive2020.tsv')

Input

rearrangement extended_rearrangement bio_identity amino_acid templates frame_type rearrangement_type productive_frequency cdr1_start_index cdr1_rearrangement_length cdr2_start_index cdr2_rearrangement_length cdr3_start_index cdr3_length v_index n1_index d_index n2_index j_index v_deletions n2_insertions d3_deletions d5_deletions n1_insertions j_deletions chosen_j_allele chosen_j_family chosen_j_gene chosen_v_allele chosen_v_family chosen_v_gene d_allele d_allele_ties d_family d_family_ties d_gene d_gene_ties d_resolved j_allele j_allele_ties j_family j_family_ties j_gene j_gene_ties j_resolved v_allele v_allele_ties v_family v_family_ties v_gene v_gene_ties v_resolved
GATTCTGGAGTCCGCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGGGGAGGATCTAGACCTACGAGCAGTACTTCGGGCCG unknown X+TCRBV28-01+TCRBJ02-07 na 1135 Out VDJ na unknown unknown unknown unknown unknown 38 43 -1 51 57 64 9 no data 1 9 7 2 no data no data no data no data no data no data 02 no data TCRBD02 no data TCRBD02-01 no data TCRBD02-01*02 01 no data TCRBJ02 no data TCRBJ02-07 no data TCRBJ02-07*01 01 no data TCRBV28 no data TCRBV28-01 no data TCRBV28-01*01
TTGGAGCTGGACGACTCGGCCCTGTATCTCTGTGCCAGCAGCTTGGGTATGGGGACAGCCGCTAACTATGGCTACACCTTCGGTTCG ATGGGCCCTGGGCTCCTCTGCTGGGCGCTGCTTTGTCTCCTGGGAGCAGGCTCAGTGGAGACTGGAGTCACCCAAAGTCCCACACACCTGATCAAAACGAGAGGACAGCAAGTGACTCTGAGATGCTCTTCTCAGTCTGGGCACAACACTGTGTCCTGGTACCAACAGGCCCTGGGTCAGGGGCCCCAGTTTATCTTTCAGTATTATAGGGAGGAAGAGAATGGCAGAGGAAACTTCCCTCCTAGATTCTCAGGTCTCCAGTTCCCTAATTATAGCTCTGAGCTGAATGTGAACGCCTTGGAGCTGGACGACTCGGCCCTGTATCTCTGTGCCAGCAGCTTGGGTATGGGGACAGCCGCTAACTATGGCTACACCTTCGGTTCGGGGACCAGGTTAACCGTTGTAG CASSLGMGTAANYGYTF+TCRBV05-04+TCRBJ01-02 CASSLGMGTAANYGYTF 1300 In VDJ 0.0012208108813691113 135 15 201 18 327 51 30 46 51 58 61 no data 5 5 no data 3 no data 01 TCRBJ01 02 01 TCRBV05 04 01 no data TCRBD01 no data TCRBD01-01 no data TCRBD01-01*01 01 no data TCRBJ01 no data TCRBJ01-02 no data TCRBJ01-02*01 01 no data TCRBV05 no data TCRBV05-04 no data TCRBV05-04*01

Output

subject productive_frequency templates epitope cdr3_b_aa v_b_gene j_b_gene valid_cdr3 cdr3_b_nucseq
Adaptive2020.tsv 0.0012208108813691113 1300 X CASSLGMGTAANYGYTF TRBV5-4*01 TRBJ1-2*01 True TTGGAGCTGGACGACTCGGCCCTGTATCTCTGTGCCAGCAGCTTGGGTATGGGGACAGCCGCTAACTATGGCTACACCTTCGGTTCG
Adaptive2020.tsv 0.0015044146399640895 1602 X CASSQPGRTLYEQYF TRBV14*01 TRBJ2-7*01 True CAGCCTGCAGAACTGGAGGATTCTGGAGTTTATTTCTGTGCCAGCAGCCAACCGGGACGGACCTTGTACGAGCAGTACTTCGGGCCG
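
The relationship between the Input and Output columns can be sketched in plain Python. The helper names below (split_bio_identity, naive_imgt_name, looks_like_valid_cdr3) are hypothetical and are not part of tcrdist3. The string rule for gene names and the amino-acid check are assumptions made for illustration only: import_adaptive_file relies on a lookup table (adaptive_to_imgt), and real gene names include exceptions that a simple string rule does not capture.

```python
import re

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def split_bio_identity(bio_identity):
    """Split an Adaptive bio_identity string 'CDR3+V+J' into its three parts."""
    cdr3, v_gene, j_gene = bio_identity.split("+")
    return cdr3, v_gene, j_gene

def naive_imgt_name(adaptive_gene, allele="01"):
    """Illustrative only: 'TCRBV05-04' -> 'TRBV5-4*01'.

    Assumption: drop the 'C' from the 'TCR' prefix, strip zero padding,
    and append a default allele. This is NOT how tcrdist3 converts names.
    """
    name = adaptive_gene.replace("TCR", "TR", 1)
    # Remove zeros that pad the start of a number (a zero not preceded by a digit).
    name = re.sub(r"(?<![0-9])0+(?=[0-9])", "", name)
    return f"{name}*{allele}"

def looks_like_valid_cdr3(cdr3):
    """Assumed validity check: non-empty and made only of standard amino acids."""
    return bool(cdr3) and set(cdr3) <= AMINO_ACIDS

cdr3, v_gene, j_gene = split_bio_identity("CASSLGMGTAANYGYTF+TCRBV05-04+TCRBJ01-02")
print(cdr3, naive_imgt_name(v_gene), naive_imgt_name(j_gene), looks_like_valid_cdr3(cdr3))
```

Under these assumptions, the in-frame input row yields the cdr3_b_aa, v_b_gene, and j_b_gene values shown in the first Output row, while the out-of-frame row (bio_identity starting with X) fails the CDR3 check.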

Loading Adaptive Biotechnology Files

import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt

df = import_adaptive_file(adaptive_filename = 'Adaptive2020.tsv')
# For larger datasets, make sure compute_distances is set to False, 
# see: https://tcrdist3.readthedocs.io/en/latest/bulkdata.html
tr = TCRrep(cell_df = df, 
            organism = 'human', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', compute_distances = False)
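
The reason for disabling compute_distances on larger datasets can be seen with a rough size estimate: a dense all-against-all distance matrix grows with the square of the number of clones. The sketch below is simple arithmetic, not tcrdist3 code; the function name and the 2 bytes per entry are assumptions chosen for illustration.

```python
def pairwise_matrix_bytes(n_clones, bytes_per_entry=2):
    """Bytes needed to hold a dense n x n distance matrix.

    bytes_per_entry=2 is an assumed storage size, not a tcrdist3 setting.
    """
    return n_clones * n_clones * bytes_per_entry

# Compare a small repertoire with a bulk-sized one.
for n in (1_000, 100_000):
    gib = pairwise_matrix_bytes(n) / 2**30
    print(f"{n:>7} clones -> {gib:.3f} GiB")
```

At 100,000 clones and 2 bytes per entry, the matrix alone would need 2 x 10^10 bytes (about 20 GB), which is why bulk data is usually loaded without computing distances up front.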

Look Up Adaptive Conversion


from tcrdist.adpt_funcs import adaptive_to_imgt

# Look up the *01 IMGT allele corresponding to an Adaptive gene name
assert adaptive_to_imgt['human']['TCRBV30'] == 'TRBV30*01'
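
Because adaptive_to_imgt is indexed first by organism and then by Adaptive gene name, a lookup for a name that is not in the table would raise a KeyError. The following is a minimal sketch of a tolerant lookup, assuming the table behaves like a nested dictionary. The stand-in table adaptive_to_imgt_demo and the helper lookup_imgt are hypothetical; the stand-in contains only the one entry confirmed above so the example runs without tcrdist3 installed.

```python
# Stand-in for tcrdist.adpt_funcs.adaptive_to_imgt (hypothetical, single entry).
adaptive_to_imgt_demo = {"human": {"TCRBV30": "TRBV30*01"}}

def lookup_imgt(gene, organism="human", table=adaptive_to_imgt_demo):
    """Return the IMGT name for an Adaptive gene name, or None if unmapped."""
    return table.get(organism, {}).get(gene)

print(lookup_imgt("TCRBV30"))   # mapped name from the stand-in table
print(lookup_imgt("TCRBV99"))   # made-up gene name, so no mapping exists
```

With tcrdist3 installed, the real adaptive_to_imgt could be passed as the table argument in place of the stand-in.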