Workflow

Prep Adaptive Biotechnology File

The data used here are derived from Adaptive Biotechnology data; however, the input files have corrected column names (see Inputs), and the V and J genes have been renamed to match IMGT standard nomenclature. For more information, see Cleaning Adaptive Biotechnology Files.

Loading a TCR Dataset

"""
Load all the TCRs associated with a particular epitope in 
the Adaptive Biotechnology COVID19 Data Release 2
"""
import os
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import get_basic_centroids

path = os.path.join('tcrdist', 'data', 'covid19')
file = 'mira_epitope_16_1683_QYIKWPWYI_YEQYIKWPW_YEQYIKWPWY.tcrdist3.csv'
filename = os.path.join(path,file)

df = pd.read_csv(filename, sep = ",")

df = df[['cell_type','subject','v_b_gene','j_b_gene','cdr3_b_aa',
        'epitope', 'age', 'sex','cohort']]
        
df['count'] = 1

tr = TCRrep(cell_df = df, 
            organism = 'human', 
            chains = ['beta'])

Distances

By default, tcrdist3 calculates the pairwise distances between all TCRs in a repertoire (see the dedicated tcrdistances page for more details). The distances are stored as 2D NumPy arrays, which are accessible as attributes of the TCRrep instance.

The weighted multi-CDR based distance is stored as the attribute TCRrep.pw_beta.

tr.pw_beta
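
As a rough picture of what such a matrix contains, the sketch below builds an all-against-all distance matrix for three CDR3 sequences from this dataset. It uses plain Levenshtein edit distance as a stand-in, which is an assumption for illustration only and is not the tcrdist metric; the helper names edit_distance and pairwise_matrix are hypothetical and not part of tcrdist3.

```python
def edit_distance(a, b):
    """Levenshtein distance; a simple stand-in, NOT the tcrdist metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pairwise_matrix(seqs, dist=edit_distance):
    """Return an n x n list of lists holding the distance for every pair."""
    return [[dist(a, b) for b in seqs] for a in seqs]


# CDR3 sequences taken from the example data on this page
cdr3s = ["CASSLGTGNEQFF", "CASSISGGTEAFF", "CASSIWGSPQETQYF"]
pw = pairwise_matrix(cdr3s)

for row in pw:
    print(row)
```

Entry pw[i][j] is the distance between sequences i and j, so the matrix is symmetric and has zeros on its diagonal.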

Individual components, such as the distances between CDR3 sequences, are also available; for instance, TCRrep.pw_cdr3_b_aa.

tr.pw_cdr3_b_aa
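
One way to read "weighted multi-CDR distance" is as a weighted sum of per-CDR component matrices. The sketch below shows only that arithmetic on toy numbers. The matrices, the weights, and the helper name weighted_sum are all assumptions made for illustration; they are not values or functions taken from tcrdist3.

```python
def weighted_sum(components, weights):
    """Combine same-shaped distance matrices into one weighted matrix."""
    names = list(components)
    n = len(components[names[0]])
    return [[sum(weights[name] * components[name][i][j] for name in names)
             for j in range(n)]
            for i in range(n)]


# Toy 3 x 3 component matrices (made-up numbers, not real tcrdist output)
components = {
    "cdr3": [[0, 4, 6], [4, 0, 2], [6, 2, 0]],
    "cdr1": [[0, 1, 3], [1, 0, 3], [3, 3, 0]],
    "cdr2": [[0, 2, 2], [2, 0, 0], [2, 0, 0]],
}

# Hypothetical weights: the CDR3 counts more than the other loops
weights = {"cdr3": 3, "cdr1": 1, "cdr2": 1}

combined = weighted_sum(components, weights)
print(combined)
```

Because every component is symmetric with a zero diagonal, the combined matrix keeps both properties.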

Simple Clustering

The pairwise distance matrices can be hierarchically clustered. Each row of the centroids_df DataFrame is the centroid of a cluster of TCRs.

Some columns describe the cluster as a whole:

K_neighbors - the number of unique clones in each cluster
public - whether the cluster contains clones from multiple individuals
n_subjects - the number of subjects with a clone in the cluster
neighbors - for each clone in the cluster, the .iloc index of that clone in the tr.clone_df DataFrame

tr = get_basic_centroids(tr, max_dist = 200)

tr.centroids_df
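
To make the cluster-level columns concrete, the sketch below derives K_neighbors, n_subjects, public, and neighbors from a toy distance matrix by collecting every clone within max_dist of each clone. This is a simplified illustration under stated assumptions, not the procedure get_basic_centroids actually follows; the function summarize_neighborhoods, the distances, and the subject labels are hypothetical.

```python
def summarize_neighborhoods(pw, subjects, max_dist):
    """For each clone, list the clones within max_dist and summarize them."""
    rows = []
    for i, dists in enumerate(pw):
        # Positional indices of all clones within the radius (includes i itself)
        neighbors = [j for j, d in enumerate(dists) if d <= max_dist]
        n_subjects = len({subjects[j] for j in neighbors})
        rows.append({
            "clone": i,
            "neighbors": neighbors,
            "K_neighbors": len(neighbors),
            "n_subjects": n_subjects,
            "public": n_subjects > 1,  # seen in more than one individual
        })
    return rows


# Toy symmetric distance matrix for four clones (made-up numbers)
pw = [
    [0, 50, 300, 120],
    [50, 0, 280, 90],
    [300, 280, 0, 310],
    [120, 90, 310, 0],
]
subjects = ["s1", "s2", "s3", "s1"]

summary = summarize_neighborhoods(pw, subjects, max_dist=200)
for row in summary:
    print(row)
```

In this toy case, clones 0, 1, and 3 fall in one another's neighborhoods and span two subjects, so they are flagged as public, while clone 2 has only itself as a neighbor.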

Advanced Clustering

from tcrdist.rep_diff import neighborhood_diff, hcluster_diff, member_summ
import hierdiff

# Label each clone 'healthy' if its cohort name contains "Healthy", otherwise 'covid'
tr.clone_df['covid'] = ['healthy' if x.find("Healthy") != -1 else "covid" for x in tr.clone_df.cohort]

# Alternative, neighborhood-based comparison:
#nd = neighborhood_diff(tr.clone_df, tr.pw_beta, x_cols = ['covid'], count_col = 'count')

# Hierarchically cluster the clones and compare the 'covid' labels across clusters
res, Z = hcluster_diff(tr.clone_df, tr.pw_beta, x_cols = ['covid'], count_col = 'count')

# Summarize the members of each cluster, carrying along the cohort and subject columns
res_summary = member_summ(res_df = res, clone_df = tr.clone_df, addl_cols=['cohort','subject'])

res_detailed = pd.concat([res, res_summary], axis = 1)

html = hierdiff.plot_hclust_props(Z,
            title='PA Epitope Example',
            res=res_detailed,
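
The labeling step and the per-cluster comparison in the code above can be pictured as a tally of labels within each cluster. The sketch below shows only that bookkeeping on toy data. It does not reproduce what hcluster_diff computes internally, and the names label_cohort and cluster_label_counts, along with the cluster assignments, are hypothetical.

```python
from collections import Counter


def label_cohort(cohort):
    """Map a cohort name to 'healthy' or 'covid', as in the code above."""
    return "healthy" if "Healthy" in cohort else "covid"


def cluster_label_counts(cluster_ids, labels, counts):
    """Sum clone counts per label within each cluster."""
    table = {}
    for cid, label, n in zip(cluster_ids, labels, counts):
        table.setdefault(cid, Counter())[label] += n
    return table


# Cohort names as they appear in the example data; cluster ids are made up
cohorts = [
    "COVID-19-Convalescent",
    "Healthy (No known exposure)",
    "COVID-19-Convalescent",
    "Healthy (No known exposure)",
    "COVID-19-Convalescent",
]
cluster_ids = [1, 1, 1, 2, 2]
counts = [1, 1, 1, 1, 1]

labels = [label_cohort(c) for c in cohorts]
table = cluster_label_counts(cluster_ids, labels, counts)
for cid, tally in sorted(table.items()):
    print(cid, dict(tally))
```

A table of this kind is the usual starting point for asking whether one label is over-represented in a cluster.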

Example rows of tr.centroids_df:

| cell_type | subject | v_b_gene | j_b_gene | cdr3_b_aa | epitope | age | sex | cohort | cdr1_b_aa | cdr2_b_aa | pmhc_b_aa | count | clone_id | neighbors | K_neighbors | cluster_id | public | n_subjects | size_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PBMC | 178 | TRBV20-1*01 | TRBJ2-7*01 | CSARALEEGSYEQYF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 36.0 | M | COVID-19-Convalescent | DFQ......ATT | SNEG...SKA | A.SLTL | 1 | 4 | [3, 11, 24, 40, 144, 145, 147, 152, 157, 165, 175, 177, 178, 181, 183, 184, 185, 186, 187, 188, 189, 192, 193, 194, 195, 196, 197, 198, 199, 477, 481, 514, 515, 516, 517, 519, 520, 521, 522, 523, 524, 525, 526, 527, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 541, 542, 543, 544, 545, 547, 548, 549, 550, 551, 556, 951, 952, 1149, 1564] | 70 | 14 | public | 9 | 0 |
| naive_CD8 | 10881 | TRBV19*01 | TRBJ2-1*01 | CASSLGTGNEQFF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 39.0 | F | Healthy (No known exposure) | LNH.......DA | SQI....VND | E.KKES | 1 | 122 | [121, 405, 410, 424, 933, 1124, 1125, 1128, 1130, 1466, 1469, 1470, 1473, 1475, 1477, 1481, 1500, 1513, 1515, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1555, 1556, 1557, 1558, 1559] | 48 | 69 | public | 5 | 1 |
| naive_CD8 | 10881 | TRBV5-6*01 | TRBJ1-1*01 | CASSISGGTEAFF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 39.0 | F | Healthy (No known exposure) | SGH.......DT | YYE....EEE | F.PNYS | 1 | 274 | [273, 275, 279, 284, 285, 289, 290, 291, 292, 296, 297, 298, 301, 306, 307, 308, 309, 310, 311, 690, 695, 720, 723, 727, 728, 729, 733, 736, 759, 777, 778, 779, 781, 782, 783, 787, 1006, 1007, 1010, 1081, 1304, 1311, 1312, 1313, 1316] | 45 | 21 | public | 5 | 2 |
| PBMC | 1349 | TRBV19*01 | TRBJ2-5*01 | CASSIWGSPQETQYF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 61.0 | M | COVID-19-Convalescent | LNH.......DA | SQI....VND | E.KKES | 1 | 19 | [18, 119, 418, 419, 420, 421, 422, 423, 1126, 1127, 1129, 1418, 1419, 1421, 1467, 1482, 1496, 1497, 1498, 1499, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525] | 42 | 74 | public | 5 | 3 |
| PBMC | 1005703 | TRBV19*01 | TRBJ1-5*01 | CASSIDLGPGNQPQHF | QYIKWPWYI,YEQYIKWPW,YEQYIKWPWY | 31.0 | F | COVID-19-Convalescent | LNH.......DA | SQI....VND | E.KKES | 1 | 71 | [70, 407, 409, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1468] | 42 | 73 | public | 3 | 4 |
