Workflow¶
Prep Adaptive Biotechnology File¶
The data used here are derived from Adaptive Biotechnology files; however, the input files contain corrected column names (see Inputs), and the V and J genes have been renamed to match IMGT standard nomenclature. For more information, see Cleaning Adaptive Biotechnology Files.
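As a rough illustration of what renaming genes to IMGT nomenclature involves, the sketch below converts a gene name written in an assumed Adaptive-style form (for example `TCRBV05-06*01`) to an IMGT-style form (`TRBV5-6*01`). The function name `adaptive_to_imgt_guess` and the input format are assumptions made for this example. This is a simplified heuristic, not the mapping tcrdist3 itself applies: names that do not follow this pattern need a curated lookup table.

```python
import re


def adaptive_to_imgt_guess(name):
    """Heuristically rewrite an Adaptive-style gene name in IMGT style.

    Hypothetical helper for illustration only:
    1. replace the leading 'TCR' prefix with 'TR';
    2. drop zero padding that directly follows 'V', 'J' or '-'.
    The allele suffix (e.g. '*01') is left untouched.
    """
    name = re.sub(r"^TCR", "TR", name)
    return re.sub(r"(?<=[VJ-])0+(?=\d)", "", name)


examples = ["TCRBV05-06*01", "TCRBJ02-07*01"]
converted = [adaptive_to_imgt_guess(x) for x in examples]
print(converted)  # ['TRBV5-6*01', 'TRBJ2-7*01']
```

Names that are already in IMGT style pass through unchanged, so the helper can be applied to a mixed column without harm.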
Loading a TCR Dataset¶
"""
Load all the TCRs associated with a particular epitope in
the Adaptive Biotechnology COVID19 Data Release 2
"""
import os
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import get_basic_centroids

# Relative path to the example input file
path = os.path.join('tcrdist', 'data', 'covid19')
file = 'mira_epitope_16_1683_QYIKWPWYI_YEQYIKWPW_YEQYIKWPWY.tcrdist3.csv'
filename = os.path.join(path, file)

df = pd.read_csv(filename, sep=",")
# Keep only the columns used in this workflow
df = df[['cell_type', 'subject', 'v_b_gene', 'j_b_gene', 'cdr3_b_aa',
         'epitope', 'age', 'sex', 'cohort']]
# Give every row a count of 1
df['count'] = 1

# Build the beta-chain repertoire
tr = TCRrep(cell_df=df,
            organism='human',
            chains=['beta'])
Distances¶
By default, tcrdist3 calculates the distances between all TCRs in a repertoire (see the dedicated page for more details about tcrdistances). The distances are stored as 2D NumPy arrays, which are accessible as attributes of the TCRrep object.
The weighted multi-CDR based distance is stored as the attribute TCRrep.pw_beta.
tr.pw_beta
Individual components, such as the distances between CDR3s, are also available, for instance as TCRrep.pw_cdr3_b_aa.
tr.pw_cdr3_b_aa
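To make the relationship between a weighted multi-CDR distance and its components concrete, here is a minimal sketch that combines per-region pairwise matrices into one matrix by a weighted sum. Plain nested lists stand in for the NumPy arrays, and the helper `combine`, the toy values, and the weights are all assumptions made for illustration; they are not necessarily the weights tcrdist3 uses.

```python
def combine(components, weights):
    """Return the element-wise weighted sum of equally sized square matrices.

    components: dict mapping a region name to a square matrix (list of lists)
    weights:    dict mapping the same region names to a numeric weight
    """
    names = list(components)
    n = len(components[names[0]])
    total = [[0] * n for _ in range(n)]
    for name in names:
        w = weights[name]
        mat = components[name]
        for i in range(n):
            for j in range(n):
                total[i][j] += w * mat[i][j]
    return total


# Toy pairwise distances between three TCRs (symmetric, zero diagonal).
components = {
    "cdr3": [[0, 12, 36], [12, 0, 24], [36, 24, 0]],
    "cdr1": [[0, 4, 8], [4, 0, 4], [8, 4, 0]],
}
weights = {"cdr3": 3, "cdr1": 1}  # illustrative weights only

pw = combine(components, weights)
print(pw[0][1])  # 3 * 12 + 1 * 4 = 40
```

Row i of such a matrix holds the distances from TCR i to every other TCR, so the entry at `[i][j]` can be read the same way whether the matrix is a nested list or a NumPy array.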
Simple Clustering¶
The pairwise distance matrices can be hierarchically clustered. Each row of the centroids_df DataFrame is the centroid of a cluster of TCRs.
Some columns describe the cluster as a whole:
- K_neighbors - the number of unique clones in each cluster
- public - whether the cluster contains clones from multiple individuals
- n_subjects - the number of subjects with a clone in the cluster
- neighbors - the .iloc index, in the tr.clone_df DataFrame, of each clone in the cluster
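The cluster-level columns listed above can be thought of as summaries of the neighbors list. The sketch below recomputes them for toy data, assuming that K_neighbors is the length of the neighbors list (as it is in the output table shown on this page) and that a cluster is public when its clones come from more than one subject. The helper `summarize_cluster` and the "private" label are hypothetical and used only for illustration.

```python
def summarize_cluster(neighbors, subjects):
    """Summarize one cluster.

    neighbors: positional indices of the clones in the cluster
    subjects:  subject identifier of every clone, by position
    """
    members = sorted(set(neighbors))
    cluster_subjects = {subjects[i] for i in members}
    return {
        "K_neighbors": len(members),
        "n_subjects": len(cluster_subjects),
        # 'private' is an assumed label for single-subject clusters
        "public": "public" if len(cluster_subjects) > 1 else "private",
    }


# Toy data: five clones from three subjects.
subjects = ["s1", "s1", "s2", "s3", "s2"]

shared = summarize_cluster([0, 2, 3], subjects)
single = summarize_cluster([0, 1], subjects)
print(shared)  # {'K_neighbors': 3, 'n_subjects': 3, 'public': 'public'}
print(single)  # {'K_neighbors': 2, 'n_subjects': 1, 'public': 'private'}
```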
# Compute cluster centroids (max_dist = 200) and inspect the result
tr = get_basic_centroids(tr, max_dist=200)
tr.centroids_df

from tcrdist.rep_diff import neighborhood_diff, hcluster_diff, member_summ
import hierdiff

# Label each clone as 'healthy' or 'covid' based on its cohort
tr.clone_df['covid'] = ['healthy' if 'Healthy' in x else 'covid' for x in tr.clone_df.cohort]

# Alternative neighborhood-based analysis (not run here)
# nd = neighborhood_diff(tr.clone_df, tr.pw_beta, x_cols=['covid'], count_col='count')

# Hierarchically cluster the clones and compare clusters on the 'covid' label
res, Z = hcluster_diff(tr.clone_df, tr.pw_beta, x_cols=['covid'], count_col='count')

# Summarize the members of each cluster and join the summary to the results
res_summary = member_summ(res_df=res, clone_df=tr.clone_df, addl_cols=['cohort', 'subject'])
res_detailed = pd.concat([res, res_summary], axis=1)
html = hierdiff.plot_hclust_props(Z,
                                  title='PA Epitope Example',
                                  res=res_detailed,
                                  # ... (the listing is truncated here)
                                  )
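To give a feel for the kind of tally a comparison of clusters on the 'covid' label rests on, the following sketch labels toy cohort strings in the same way as the listing above and counts each label inside and outside one cluster. It does not reproduce hcluster_diff; the helpers `label_cohort` and `cluster_counts` and the toy membership are assumptions made for illustration.

```python
from collections import Counter


def label_cohort(cohort):
    """Return 'healthy' if the cohort string mentions 'Healthy', else 'covid'."""
    return "healthy" if "Healthy" in cohort else "covid"


def cluster_counts(labels, members):
    """Count each label inside and outside a cluster.

    labels:  label of every clone, by position
    members: positional indices of the clones in the cluster
    """
    member_set = set(members)
    inside = Counter(labels[i] for i in member_set)
    outside = Counter(
        label for i, label in enumerate(labels) if i not in member_set
    )
    return inside, outside


# Toy cohort column using the two cohort strings seen in the output table.
cohorts = [
    "COVID-19-Convalescent",
    "COVID-19-Convalescent",
    "Healthy (No known exposure)",
    "COVID-19-Convalescent",
    "Healthy (No known exposure)",
    "Healthy (No known exposure)",
]
labels = [label_cohort(c) for c in cohorts]

inside, outside = cluster_counts(labels, members=[0, 1, 2])
print(dict(inside))   # counts of labels among clones 0, 1 and 2
print(dict(outside))  # counts of labels among the remaining clones
```

The four counts (each label, inside and outside) form a 2 x 2 table, which is the usual starting point for asking whether a cluster is enriched for one group.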
cell_type | subject | v_b_gene | j_b_gene | cdr3_b_aa | epitope | age | sex | cohort | cdr1_b_aa | cdr2_b_aa | pmhc_b_aa | count | clone_id | neighbors | K_neighbors | cluster_id | public | n_subjects | size_order |
PBMC | 178 | TRBV20-1*01 | TRBJ2-7*01 | CSARALEEGSYEQYF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 36.0 | M | COVID-19-Convalescent | DFQ……ATT | SNEG…SKA | A.SLTL | 1 | 4 | [3, 11, 24, 40, 144, 145, 147, 152, 157, 165, 175, 177, 178, 181, 183, 184, 185, 186, 187, 188, 189, 192, 193, 194, 195, 196, 197, 198, 199, 477, 481, 514, 515, 516, 517, 519, 520, 521, 522, 523, 524, 525, 526, 527, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 541, 542, 543, 544, 545, 547, 548, 549, 550, 551, 556, 951, 952, 1149, 1564] | 70 | 14 | public | 9 | 0 |
naive_CD8 | 10881 | TRBV19*01 | TRBJ2-1*01 | CASSLGTGNEQFF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 39.0 | F | Healthy (No known exposure) | LNH…….DA | SQI….VND | E.KKES | 1 | 122 | [121, 405, 410, 424, 933, 1124, 1125, 1128, 1130, 1466, 1469, 1470, 1473, 1475, 1477, 1481, 1500, 1513, 1515, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1555, 1556, 1557, 1558, 1559] | 48 | 69 | public | 5 | 1 |
naive_CD8 | 10881 | TRBV5-6*01 | TRBJ1-1*01 | CASSISGGTEAFF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 39.0 | F | Healthy (No known exposure) | SGH…….DT | YYE….EEE | F.PNYS | 1 | 274 | [273, 275, 279, 284, 285, 289, 290, 291, 292, 296, 297, 298, 301, 306, 307, 308, 309, 310, 311, 690, 695, 720, 723, 727, 728, 729, 733, 736, 759, 777, 778, 779, 781, 782, 783, 787, 1006, 1007, 1010, 1081, 1304, 1311, 1312, 1313, 1316] | 45 | 21 | public | 5 | 2 |
PBMC | 1349 | TRBV19*01 | TRBJ2-5*01 | CASSIWGSPQETQYF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 61.0 | M | COVID-19-Convalescent | LNH…….DA | SQI….VND | E.KKES | 1 | 19 | [18, 119, 418, 419, 420, 421, 422, 423, 1126, 1127, 1129, 1418, 1419, 1421, 1467, 1482, 1496, 1497, 1498, 1499, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525] | 42 | 74 | public | 5 | 3 |
PBMC | 1005703 | TRBV19*01 | TRBJ1-5*01 | CASSIDLGPGNQPQHF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 31.0 | F | COVID-19-Convalescent | LNH…….DA | SQI….VND | E.KKES | 1 | 71 | [70, 407, 409, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1468] | 42 | 73 | public | 3 | 4 |