Workflow

Prep Adaptive Biotechnology File

The data used here are derived from Adaptive Biotechnology files; however, the input files have corrected column names (see Inputs), and the V and J genes have been renamed to match IMGT standard nomenclature. For more information, see Cleaning Adaptive Biotechnology Files.
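The renaming step can be pictured as a table lookup. The sketch below is only an illustration of that idea, not the cleaning code shipped with tcrdist3: the dictionary `ADAPTIVE_TO_IMGT`, the helper `to_imgt`, and the three example entries are assumptions made for this example, and a real table would need to cover every gene and allele.

```python
# Illustrative lookup table (an assumption for this sketch, not the real table).
ADAPTIVE_TO_IMGT = {
    "TCRBV20-01*01": "TRBV20-1*01",
    "TCRBV19-01*01": "TRBV19*01",
    "TCRBJ02-07*01": "TRBJ2-7*01",
}

def to_imgt(adaptive_name):
    """Return the IMGT name for an Adaptive-style gene name, or None if unmapped."""
    return ADAPTIVE_TO_IMGT.get(adaptive_name)

# Two made-up input rows; the second has a V gene missing from the table.
rows = [
    {"v_b_gene": "TCRBV20-01*01", "j_b_gene": "TCRBJ02-07*01"},
    {"v_b_gene": "TCRBV99-01*01", "j_b_gene": "TCRBJ02-07*01"},
]

cleaned = []
for row in rows:
    v, j = to_imgt(row["v_b_gene"]), to_imgt(row["j_b_gene"])
    if v is None or j is None:
        continue  # drop rows whose genes cannot be mapped
    cleaned.append({"v_b_gene": v, "j_b_gene": j})

print(cleaned)
```

A lookup is used rather than string manipulation because the two naming schemes do not differ in a uniform way; some genes lose a suffix entirely.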

Loading a TCR Dataset

"""
Load all the TCRs associated with a particular epitope in 
the Adaptive Biotechnology COVID19 Data Release 2
"""
import os
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import get_basic_centroids

path = os.path.join('tcrdist', 'data', 'covid19')
file = 'mira_epitope_16_1683_QYIKWPWYI_YEQYIKWPW_YEQYIKWPWY.tcrdist3.csv'
filename = os.path.join(path,file)

df = pd.read_csv(filename, sep = ",")

df = df[['cell_type','subject','v_b_gene','j_b_gene','cdr3_b_aa',
        'epitope', 'age', 'sex','cohort']]
        
df['count'] = 1

tr = TCRrep(cell_df = df, 
            organism = 'human', 
            chains = ['beta'])

Distances

By default, tcrdist3 calculates the pairwise distances between all TCRs in a repertoire (see the dedicated page on tcrdistances for more details). The distances are stored as 2D NumPy arrays, which are accessible as attributes of the TCRrep instance.

The weighted multi-CDR based distance is stored as the attribute TCRrep.pw_beta.

tr.pw_beta

Individual components, such as the distances between CDR3s, are also available, for instance as TCRrep.pw_cdr3_b_aa.

tr.pw_cdr3_b_aa
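A pairwise distance matrix is typically queried row by row: row i holds the distances from clone i to every clone. The sketch below shows that lookup on a made-up 4 x 4 matrix; the values, the helper `neighbors_within`, and the radii are assumptions for illustration. Plain Python lists are used so the example runs without dependencies, whereas the real matrices are NumPy arrays.

```python
# Toy symmetric distance matrix for four clones (made-up values).
pw = [
    [0, 12, 150, 300],
    [12, 0, 140, 280],
    [150, 140, 0, 90],
    [300, 280, 90, 0],
]

def neighbors_within(pw, i, radius):
    """Indices of clones whose distance to clone i is at most radius (clone i included)."""
    return [j for j, d in enumerate(pw[i]) if d <= radius]

print(neighbors_within(pw, 0, 50))   # [0, 1]
print(neighbors_within(pw, 2, 100))  # [2, 3]
```

With a NumPy array the same test could be written as `np.flatnonzero(pw[i] <= radius)`.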

Simple Clustering

The pairwise distance matrices can be hierarchically clustered. Each row of the centroids_df DataFrame is the centroid of a cluster of TCRs.

Some columns describe the cluster as a whole:

  • K_neighbors - the number of unique clones in each cluster
  • public - whether the cluster contains clones from multiple individuals
  • n_subjects - the number of subjects with a clone in the cluster
  • neighbors - for each clone in the cluster, the .iloc index of that clone in the tr.clone_df DataFrame
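The relationship between these columns can be sketched from their definitions. The example below is not the library's implementation: the helper `summarize_cluster`, the toy subject list, and the 'private' label for single-subject clusters are assumptions for illustration.

```python
# Hypothetical clone table: the position in the list plays the role of the .iloc index.
clone_subjects = ["s1", "s1", "s2", "s3", "s2"]

def summarize_cluster(neighbors, clone_subjects):
    """Derive cluster-level columns from the list of member clone indices."""
    subjects = {clone_subjects[i] for i in neighbors}
    return {
        "K_neighbors": len(neighbors),   # number of clones in the cluster
        "n_subjects": len(subjects),     # number of distinct subjects
        # 'private' is an assumed label for clusters drawn from a single subject
        "public": "public" if len(subjects) > 1 else "private",
    }

print(summarize_cluster([0, 1, 2], clone_subjects))
print(summarize_cluster([0, 1], clone_subjects))
```

The first call covers clones from two subjects and is therefore public; the second covers one subject only.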
tr = get_basic_centroids(tr, max_dist = 200)

tr.centroids_df

from tcrdist.rep_diff import hcluster_diff, member_summ
import hierdiff

# Label each clone as 'healthy' or 'covid' based on its cohort
tr.clone_df['covid'] = ['healthy' if x.find("Healthy") != -1 else "covid" for x in tr.clone_df.cohort]

# Hierarchically cluster the clones and test each cluster against the 'covid' label
res, Z = hcluster_diff(tr.clone_df, tr.pw_beta, x_cols = ['covid'], count_col = 'count')

# Summarize the member clones of each cluster (row of res)
res_summary = member_summ(res_df = res, clone_df = tr.clone_df, addl_cols=['cohort','subject'])

res_detailed = pd.concat([res, res_summary], axis = 1)

# Render the clustering tree as HTML
html = hierdiff.plot_hclust_props(Z,
            title='PA Epitope Example',
            res=res_detailed)

Example rows of tr.centroids_df:

| cell_type | subject | v_b_gene | j_b_gene | cdr3_b_aa | epitope | age | sex | cohort | cdr1_b_aa | cdr2_b_aa | pmhc_b_aa | count | clone_id | neighbors | K_neighbors | cluster_id | public | n_subjects | size_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PBMC | 178 | TRBV20-1*01 | TRBJ2-7*01 | CSARALEEGSYEQYF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 36.0 | M | COVID-19-Convalescent | DFQ......ATT | SNEG...SKA | A.SLTL | 1 | 4 | [3, 11, 24, 40, 144, 145, 147, 152, 157, 165, 175, 177, 178, 181, 183, 184, 185, 186, 187, 188, 189, 192, 193, 194, 195, 196, 197, 198, 199, 477, 481, 514, 515, 516, 517, 519, 520, 521, 522, 523, 524, 525, 526, 527, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 541, 542, 543, 544, 545, 547, 548, 549, 550, 551, 556, 951, 952, 1149, 1564] | 70 | 14 | public | 9 | 0 |
| naive_CD8 | 10881 | TRBV19*01 | TRBJ2-1*01 | CASSLGTGNEQFF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 39.0 | F | Healthy (No known exposure) | LNH.......DA | SQI....VND | E.KKES | 1 | 122 | [121, 405, 410, 424, 933, 1124, 1125, 1128, 1130, 1466, 1469, 1470, 1473, 1475, 1477, 1481, 1500, 1513, 1515, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1555, 1556, 1557, 1558, 1559] | 48 | 69 | public | 5 | 1 |
| naive_CD8 | 10881 | TRBV5-6*01 | TRBJ1-1*01 | CASSISGGTEAFF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 39.0 | F | Healthy (No known exposure) | SGH.......DT | YYE....EEE | F.PNYS | 1 | 274 | [273, 275, 279, 284, 285, 289, 290, 291, 292, 296, 297, 298, 301, 306, 307, 308, 309, 310, 311, 690, 695, 720, 723, 727, 728, 729, 733, 736, 759, 777, 778, 779, 781, 782, 783, 787, 1006, 1007, 1010, 1081, 1304, 1311, 1312, 1313, 1316] | 45 | 21 | public | 5 | 2 |
| PBMC | 1349 | TRBV19*01 | TRBJ2-5*01 | CASSIWGSPQETQYF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 61.0 | M | COVID-19-Convalescent | LNH.......DA | SQI....VND | E.KKES | 1 | 19 | [18, 119, 418, 419, 420, 421, 422, 423, 1126, 1127, 1129, 1418, 1419, 1421, 1467, 1482, 1496, 1497, 1498, 1499, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525] | 42 | 74 | public | 5 | 3 |
| PBMC | 1005703 | TRBV19*01 | TRBJ1-5*01 | CASSIDLGPGNQPQHF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 31.0 | F | COVID-19-Convalescent | LNH.......DA | SQI....VND | E.KKES | 1 | 71 | [70, 407, 409, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1468] | 42 | 73 | public | 3 | 4 |

Advanced Clustering