(Quasi)Public Clones

Public TCRs are shared clonotypes found in multiple individuals, arising from VDJ recombination biases and common selection pressures. Repertoire analyses often focus on public clones; however, finding public antigen-specific TCRs is not always possible because TCR repertoires are characterized by extreme diversity. As a consequence, only a small fraction of the repertoire can be assayed in a single sample, making it difficult to reproducibly sample TCR clonotypes from an individual, let alone reliably detect shared clonotypes in a population.

Enter, stage left, the quasi-public TCRs: two or more TCRs with a high degree of biochemical similarity that are found in two or more individuals. Identifying quasi-public TCRs becomes useful when evaluating an antigen-enriched repertoire putatively recognizing the same epitope.

Finding similar receptors in multiple individuals provides stronger evidence of shared epitope recognition and can reveal the mechanistic basis for CDR-peptide-MHC binding.

Moreover, meta-clonotypes are by definition more abundant than exact clonotypes and thus can be detected more reliably in a single bulk unenriched sample, facilitating more robust functional comparisons across populations.
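
To make the definition concrete, here is a minimal sketch in plain Python. The distances, subject labels, and the helper name neighbors_within are made up for illustration and are not part of tcrdist3. A clone counts as quasi-public when its neighborhood (all clones within the radius, itself included) contains TCRs from more than one subject.

```python
# Toy illustration of quasi-publicity (made-up distances and subjects).
# dist[i][j] is a pairwise distance between clone i and clone j.
dist = [
    [0, 12, 40, 60],
    [12, 0, 36, 55],
    [40, 36, 0, 16],
    [60, 55, 16, 0],
]
subjects = ["s1", "s2", "s1", "s1"]
radius = 18

def neighbors_within(dist_row, radius):
    """Indices of clones within `radius` of the centroid (centroid included)."""
    return [j for j, d in enumerate(dist_row) if d <= radius]

# One neighborhood per clone, then count distinct subjects in each.
neighbors = [neighbors_within(row, radius) for row in dist]
nsubject = [len({subjects[j] for j in nbrs}) for nbrs in neighbors]
qpublic = [n > 1 for n in nsubject]

print(neighbors)  # [[0, 1], [0, 1], [2, 3], [2, 3]]
print(nsubject)   # [2, 2, 1, 1]
print(qpublic)    # [True, True, False, False]
```

Clones 2 and 3 are similar to each other but come from the same subject, so their neighborhood is not quasi-public.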

I am happy to use the defaults

For instance, you may want to find all the (quasi)public collections of TCRs within a fixed radius (<= 18 TCRdist units) of each TCR in the antigen-enriched input data.

"""
tcrdist3 is particularly useful for finding 
what we term quasi-public meta-clonotypes, 
collections of biochemically similar TCRs 
recognizing the same peptide-MHC. 

The easiest way to define meta-clonotypes
is to compute pairwise distances between 
TCRs found in an antigen-enriched 
subrepertoire, abbreviated below as 
<aesr>
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic

    # <aesr_fn> antigen-enriched subrepertoire
fn = os.path.join('tcrdist', 'data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv')
    # <aesr_df> antigen-enriched subrepertoire
df = pd.read_csv(fn)
    # <tr> TCR repertoire
tr = TCRrep(
    cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa']].copy(), 
    organism = 'human', 
    chains = ['beta'], 
    db_file = 'alphabeta_gammadelta_db.tsv', 
    compute_distances = True)

    # <tp> TCRpublic class for reporting publicities, fixed radius 18, 'nsubject > 3'
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones.html")

    # calling .report() writes an HTML report
public = tp.report()

    # The following DataFrames are also available:
    # <clone_df> pd.DataFrame clone_df from tr.clone_df 
    # with neighbors and summary measures appended
public['clone_df']
    # <nn_summary> pd.DataFrame with just summary measures
public['nn_summary']
    # <quasi_public_df> Non-redundant groups of quasipublic clones
public['quasi_public_df']

In addition to the summary DataFrames returned, an HTML quasi-publicity report is generated, allowing inspection of the logo motifs formed from highly similar antigen-enriched TCR sequences found in multiple subjects.
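
The idea of non-redundant groups can be sketched in a few lines. This is an assumption-laden illustration, not tcrdist3's own filtering logic: the helper drop_redundant is hypothetical, and it simply flags a neighborhood as redundant when its members are already contained in a neighborhood kept earlier.

```python
def drop_redundant(neighborhoods):
    """Return one keep-flag per neighborhood: False when its member set
    is contained in a neighborhood that was already kept."""
    kept = []
    flags = []
    for nbrs in neighborhoods:
        members = frozenset(nbrs)
        redundant = any(members <= other for other in kept)
        flags.append(not redundant)
        if not redundant:
            kept.append(members)
    return flags

# Hypothetical neighborhoods, given as lists of clone indices.
neighborhoods = [[0, 1, 2], [1, 0, 2], [2, 1], [3, 4]]
print(drop_redundant(neighborhoods))  # [True, False, False, True]
```

Because the result depends on order, a sketch like this works best when the neighborhoods are sorted with the largest or most public ones first.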

I’d like to tweak a default parameter

If you want to add or subtract information from the report you can do so relatively easily. For instance, suppose you want to summarize cohort information and add that to the report.

Just like elsewhere in tcrdist3, the Python object TCRpublic stores all options as attributes:

In [2]: tp.__dict__
Out[2]:
{'tcrrep': <tcrdist.repertoire.TCRrep at 0x13d5b9310>,
 'organism': 'human',
 'chain': 'beta',
 'output_html_name': 'quasi_public_clones.html',
 'pw_mat_str': 'pw_beta',
 'cdr3_name': 'cdr3_b_aa',
 'v_gene_name': 'v_b_gene',
 'j_gene_name': 'j_b_gene',
 'nr_filter': True,
 'labels': ['clone_id',
  'cdr3_b_aa',
  'v_b_gene',
  'j_b_gene',
  'radius',
  'neighbors',
  'K_neighbors',
  'nsubject',
  'qpublic',
  'cdr3_b_aa.summary',
  'v_b_gene.summary',
  'j_b_gene.summary',
  'cdr3_b_aa.summary',
  'subject.summary'],
 'fixed_radius': False,
 'radius': None,
 'query_str': 'qpublic == True & K_neighbors > 5',
 'kargs_member_summ': {'key_col': 'neighbors',
  'count_col': 'count',
  'addl_cols': ['subject'],
  'addl_n': 4},
 'kargs_motif': {'pwmat_str': 'pw_beta',
  'cdr3_name': 'cdr3_b_aa',
  'v_name': 'v_b_gene',
  'gene_names': ['v_b_gene', 'j_b_gene']},
 'tcrsampler': <tcrsampler.sampler.TCRsampler at 0x13d5b9850>}

By default only subjects are summarized ('addl_cols': ['subject']), so adding another categorical variable to the summary is as easy as:

tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort']
tp.labels.append("cohort.summary")
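
As a rough picture of what a per-neighborhood summary of a categorical column could contain, here is a plain-Python sketch. The helper summarize_members, the cohort names, and the output format are all hypothetical; the top_n argument only loosely mirrors the role that addl_n appears to play.

```python
from collections import Counter

def summarize_members(values, top_n=4):
    """Percentage of neighborhood members per category, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, round(100 * n / total, 1))
            for cat, n in counts.most_common(top_n)]

# Hypothetical cohort label for each clone, and one centroid's neighbors.
cohort = ["acute", "convalescent", "acute", "healthy", "acute"]
neighbors = [0, 1, 2, 4]
print(summarize_members([cohort[i] for i in neighbors]))
# [('acute', 75.0), ('convalescent', 25.0)]
```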

You can also specify your own standard for publicity. Instead of 'qpublic == True & K_neighbors > 5', you can ask for super-public meta-clonotypes, returning only those groups that satisfy 'nsubject > 8':

tp.query_str = 'nsubject > 8'
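
To see how the two criteria differ, the sketch below applies plain-Python equivalents of both expressions to a few made-up rows. It assumes the query string is evaluated row by row against the clone_df columns; the values are invented for illustration.

```python
# Made-up summary rows, one per centroid.
rows = [
    {"clone_id": 1, "qpublic": True,  "K_neighbors": 12, "nsubject": 9},
    {"clone_id": 2, "qpublic": True,  "K_neighbors": 7,  "nsubject": 3},
    {"clone_id": 3, "qpublic": False, "K_neighbors": 20, "nsubject": 1},
    {"clone_id": 4, "qpublic": True,  "K_neighbors": 4,  "nsubject": 2},
]

# Plain-Python equivalent of 'qpublic == True & K_neighbors > 5'
default_hits = [r["clone_id"] for r in rows
                if r["qpublic"] and r["K_neighbors"] > 5]
# Plain-Python equivalent of 'nsubject > 8'
super_public = [r["clone_id"] for r in rows if r["nsubject"] > 8]

print(default_hits)  # [1, 2]
print(super_public)  # [1]
```

The stricter subject threshold keeps fewer centroids, which is why a report built with it is shorter.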

Here’s a full example:

import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic

# <aesr_fn> antigen-enriched subrepertoire
aesr_fn = os.path.join(
    'tcrdist',
    'data',
    'covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv')

# <aesr_df> antigen-enriched subrepertoire
aesr_df = pd.read_csv(aesr_fn)
# <tr> TCR repertoire
tr = TCRrep(
    cell_df = aesr_df[[
        'cohort',
        'subject',
        'v_b_gene', 
        'j_b_gene',
        'cdr3_b_aa']].copy(), 
    organism = 'human', 
    chains = ['beta'], 
    db_file = 'alphabeta_gammadelta_db.tsv', 
    compute_distances = True)
# <tp> TCRpublic class for reporting publicities 
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones2.html")
# set to True, if we want a universal radius
tp.fixed_radius = True
# must then specify maximum distance for finding similar TCRs
tp.radius = 18
# set criteria for being quasi-public
tp.query_str = 'nsubject > 6'
# Add additional columns to be summarized in the report
tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort']
# Add cohort.summary to the labels column so it shows up in the report
tp.labels.append("cohort.summary")
# calling .report() writes an HTML report
public = tp.report()

As you can see in this new HTML quasi-publicity report, there is a new column summarizing the percentage of TCRs coming from each cohort in the study, and there are fewer meta-clonotypes, since only those with TCRs drawn from more than 6 subjects (the query_str set above) are reported.

I want my search radius to be sequence specific

The radius applied to each centroid can be specified in a column of the clone_df.

"""
Instead of enforcing a fixed radius, 
use a radius specific to each
centroid, specified in an additional 
column.
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic

fn = os.path.join(
    'tcrdist',
    'data',
    'covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv')

df = pd.read_csv(fn)

tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], 
            organism = "human", 
            chains = ["beta"])

tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones3.html")

# set to False to use the radius stored in each row's 'radius' column
tp.fixed_radius = False
# no universal radius is needed when fixed_radius is False
tp.radius = None
# set criteria for being quasi-public
tp.query_str = 'nsubject > 5'
# Add additional columns to be summarized in the report
tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort']
# Add cohort.summary to the labels column so it shows up in the report
tp.labels.append("cohort.summary")
tp.labels.append("cdr3s")
# Change number of subjects to display
tp.kargs_member_summ['addl_n'] = 10
# calling .report() writes an HTML report
public = tp.report()

Notice that the radius varies by row in this quasi-publicity report.
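
The effect of a per-centroid radius can be sketched without tcrdist3. The distances and radii below are made up, and using <= as the inclusion rule is an assumption.

```python
# Made-up pairwise distances and one radius per centroid.
dist = [
    [0, 10, 20],
    [10, 0, 14],
    [20, 14, 0],
]
radius_per_clone = [12, 8, 22]

# Each row is searched with its own radius.
neighbors = [
    [j for j, d in enumerate(row) if d <= r]
    for row, r in zip(dist, radius_per_clone)
]
print(neighbors)  # [[0, 1], [1], [0, 1, 2]]
```

Note that neighborhoods are no longer symmetric: clone 1 falls inside clone 0's radius of 12, but clone 0 falls outside clone 1's radius of 8.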

Working from neighborhood_diff output

"""
Use values from neighborhood_diff
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic  
fn = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv')
df = pd.read_csv(fn)
tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], 
            organism = "human", 
            chains = ["beta"])

from tcrdist.rep_diff import neighborhood_diff
ndif = neighborhood_diff(   clone_df= tr.clone_df, 
                                pwmat = tr.pw_beta, 
                                count_col = 'count', 
                                x_cols = ['cohort'], 
                                knn_radius = 25, 
                                test_method = "chi2")
# Add neighbors and other columns of interest 
# from the neighborhood_diff result to the clone_df
tr.clone_df = pd.concat([tr.clone_df, ndif[['neighbors', 'K_neighbors','val_0','ct_0','pvalue']] ], axis = 1)
# Because neighbors and K_neighbors are already in the clone_df, 
# TCRpublic.report() uses those instead of finding new ones.
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones_with_ndif.html")
# Add any neighborhood_diff columns 
# that you want to display in the final report
tp.labels.append('val_0')
tp.labels.append('ct_0')
tp.labels.append('pvalue')
# change the sort to be by pvalue, not publicity
tp.sort_columns = ['pvalue']
# because you are sorting by pvalue, sort ascending (smallest first)
tp.sort_ascending = True
tp.report()
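
For intuition about the kind of comparison a "chi2" test makes, the sketch below computes the textbook Pearson chi-square statistic for a 2x2 table of neighborhood membership against cohort membership. It is not the code neighborhood_diff runs; the helper chi2_2x2 and the counts are hypothetical.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table

                 in cohort   not in cohort
        inside       a             b
        outside      c             d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical counts: 18 of the 20 clones inside the neighborhood come
# from the cohort, compared with 30 of the 80 clones outside it.
stat = chi2_2x2(a=18, b=2, c=30, d=50)
print(round(stat, 2))  # 17.67
```

A large statistic corresponds to a small p-value, which is why sorting the report by ascending pvalue puts the most cohort-skewed neighborhoods first.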

I hate OOP, just show me the functions

TCRpublic is for convenience. You can customize a lot, including the background tcrsampler, but power users may want to work with the underlying functions directly. Here are some examples:

I just want to quickly find neighbors and (quasi)public clones

"""
Find neighbors and (quasi)public clones 
directly with the functions underlying 
TCRpublic, at a fixed radius or at a 
radius specific to each centroid.
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic  
fn = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv')
df = pd.read_csv(fn)
tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], 
            organism = "human", 
            chains = ["beta"])

# NEIGHBORS
from tcrdist.public import _neighbors_fixed_radius
from tcrdist.public import _neighbors_variable_radius
# returns lists of lists of all neighbors at fixed or variable radii
_neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18)
_neighbors_variable_radius(pwmat = tr.pw_beta, radius_list = tr.clone_df.radius)

# returns the number (K) of neighbors at fixed or variable radii
from tcrdist.public import _K_neighbors_fixed_radius
from tcrdist.public import _K_neighbors_variable_radius
_K_neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18)
_K_neighbors_variable_radius(pwmat = tr.pw_beta, radius_list = tr.clone_df.radius)

# First find neighbors by your favorite method 
tr.clone_df['neighbors'] = _neighbors_variable_radius(
    pwmat = tr.pw_beta, 
    radius_list = tr.clone_df.radius)
# Once neighbors are added to a clone_df you can easily determine publicity. 
tr.clone_df['nsubject']   = tr.clone_df['neighbors'].\
    apply(lambda x: tr.clone_df['subject'].iloc[x].nunique())
tr.clone_df['qpublic']   = tr.clone_df['nsubject'].\
    apply(lambda x: x > 1)

I have neighbors and radii already, I want logos

Suppose you want to specify exactly what to include in a motif logo report. This example is slightly different than those above because we are going to use two input files. The first input file includes all of the TCRs in the antigen-enriched repertoire. The second file is a subset of the first, specifying exactly which TCR centroids to report. Remember that any column of a clone_df can be included in or excluded from the HTML report; the fields to include are specified as labels.

"""
Report of meta-clonotypes using two dataframes.
<df>  has all TCRs
<df2> has a subset of the TCRs in <df>, specifying which 
are to be used as centroids.
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic  
from tcrdist.tree import _default_sampler_olga
from progress.bar import IncrementalBar
from palmotif import compute_pal_motif, svg_logo
from tcrdist.public import make_motif_logo

output_html_name = "custom_report.html"
# <fn> filename for all TCRs in an antigen-enriched repertoire
fn = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv.bE5ctrl.centers.csv')
df = pd.read_csv(fn, sep = ",")
df = df[['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'pgen', 'max_radi']].\
    rename(columns= {'max_radi':'radius'}).copy()

# <fn2> filename for priority TCRs
fn2 = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv.bE5ctrl.centers.csv.ranked_centers.tsv')
df2 = pd.read_csv(fn2, sep = "\t").\
    rename(columns= {'max_radi':'radius'}).copy()

# Compute distances between all TCRs
tr = TCRrep(cell_df = df, 
    organism = 'human',
    chains = ['beta'], 
    compute_distances = True)

# Initialize a tcrsampler, this will be used to make background motifs
tcrsampler = _default_sampler_olga(organism = "human", chain = "beta")()

# Iterate through each row of the df2, making a logo for each.
svgs = list()
svgs_raw = list()
bar = IncrementalBar("Making Logos", max = df2.shape[0])
for i,r in df2.iterrows():
    bar.next()
    svg,svg_raw=make_motif_logo(tcrsampler = tcrsampler,
                        clone_df = tr.clone_df,
                        pwmat = tr.pw_beta,
                        centroid = r['cdr3_b_aa'],
                        v_gene = r['v_b_gene'],
                        radius = r['radius'],
                        pwmat_str = 'pw_beta',
                        cdr3_name = 'cdr3_b_aa',
                        v_name = 'v_b_gene',
                        gene_names = ['v_b_gene','j_b_gene'])
    svgs.append(svg)
    svgs_raw.append(svg_raw)
bar.next(); bar.finish()
df2['svg'] = svgs
df2['svg_raw'] = svgs_raw

def shrink(s):
    """reduce size of svg graphic"""
    s = s.replace('height="100%"', 'height="20%"')
    s = s.replace('width="100%"', 'width="20%"')
    return s

# Choose columns to include in the report
labels = [  'cdr3_b_aa', 
            'v_b_gene',
            'j_b_gene',
            'radius',
            'regex', 
            'target_hits',
            'nsubject',
            'chi2joint']

with open(output_html_name, 'w') as output_handle:
    for i,r in df2.iterrows():
        #import pdb; pdb.set_trace()
        svg, svg_raw = r['svg'],r['svg_raw']
        output_handle.write("<br></br>")
        output_handle.write(shrink(svg))
        output_handle.write(shrink(svg_raw))
        output_handle.write("<br></br>")
        output_handle.write(pd.DataFrame(r[labels]).transpose().to_html())
        output_handle.write("<br></br>")

Will this work with sparse matrix options?

tcrdist3 has memory-efficient options for larger datasets that produce scipy.sparse rather than dense representations of distance relationships.

Currently you can’t call TCRpublic() on this sparse representation. However, here is an example of how you can achieve similar results via a script, reporting (quasi)public meta-clonotypes from a sparse format.

"""
Making a meta-clonotype report from a 
scipy.sparse TCRdist matrix.
"""
import numpy as np
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.public import _neighbors_sparse_fixed_radius, _neighbors_sparse_variable_radius
from tcrdist.summarize import test_for_subsets
from tcrdist.tree import _default_sampler_olga
from tcrdist.public import make_motif_logo_from_index

df = pd.read_csv("dash.csv").query('epitope == "PA"')
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = False)
    # When setting the radius to 50, the sparse matrix 
    # will convert any value > 50 to 0. True zeros are 
    # represented as -1.
radius = 50
tr.cpus = 1
    # Notice that we called .compute_sparse_rect_distances instead of .compute_distances
tr.compute_sparse_rect_distances(df = tr.clone_df, radius = radius)

    # There are two functions for finding neighbors from a sparse TCRdist matrix. 
    # For 1 fixed radius: _neighbors_sparse_fixed_radius()
    # For a radius per row: _neighbors_sparse_variable_radius()
tr.clone_df['radius'] = 12 
tr.clone_df['neighbors'] = \
    _neighbors_sparse_variable_radius(
        csrmat = tr.rw_beta, 
        radius_list = tr.clone_df['radius'].to_list())

    # <K_neighbors> the number of neighbors per TCR
tr.clone_df['K_neighbors'] = tr.clone_df['neighbors'].apply(lambda x: len(x))
    # <nsubject> the number of subjects represented among each TCR's neighbors
    # (neighbors are positional indices, so select them with .iloc)
tr.clone_df['nsubject'] = tr.clone_df['neighbors'].apply(lambda x: len(tr.clone_df['subject'].iloc[x].unique()))
    # <qpublic> nsubject > 1 implies quasi-publicity
tr.clone_df['qpublic'] = tr.clone_df['nsubject'].apply(lambda x: x > 1)

    # e.g., For the report, focus on TCRs with more than 5 neighboring subjects 
quasi_public_df = tr.clone_df.query('nsubject > 5').copy().\
    sort_values('nsubject', ascending = False)
    # test_for_subsets() allows us to remove TCRs with identical neighborhoods
quasi_public_df['unique']  = test_for_subsets(quasi_public_df['neighbors'])
quasi_public_df = quasi_public_df[quasi_public_df['unique'] == 1].copy()
    # declare a sampler for generating a background comparison
ts = _default_sampler_olga(organism = 'mouse', chain = 'beta')()

    # make a background-subtracted logo <svg> and raw logo <svg_raw> for each TCR
svgs = list()
svgs_raw = list()
for i,r in quasi_public_df.iterrows():
    svg, svg_raw  = make_motif_logo_from_index(tcrsampler = ts,
                                               ind = r['neighbors'],
                                               centroid = r['cdr3_b_aa'],
                                               clone_df = tr.clone_df,
                                               cdr3_name = 'cdr3_b_aa',
                                               v_name = 'v_b_gene',
                                               gene_names = ['v_b_gene','j_b_gene'])
    svgs.append(svg)
    svgs_raw.append(svg_raw)

    # Output an HTML report
output_html_name = 'quasi_public_df_report.html'
quasi_public_df['svg'] = svgs
quasi_public_df['svg_raw'] = svgs_raw
    # Specify the columns to include in the report
labels = [  'cdr3_b_aa', 
            'v_b_gene',
            'j_b_gene',
            'radius', 
            'K_neighbors',
            'nsubject']

def shrink(s):
    """reduce size of svg graphic"""
    s = s.replace('height="100%"', 'height="20%"')
    s = s.replace('width="100%"', 'width="20%"')
    return s

with open(output_html_name, 'w') as output_handle:
    for i,r in quasi_public_df.iterrows():
        #import pdb; pdb.set_trace()
        svg, svg_raw = r['svg'],r['svg_raw']
        output_handle.write("<br></br>")
        output_handle.write(shrink(svg))
        output_handle.write(shrink(svg_raw))
        output_handle.write("<br></br>")
        output_handle.write(pd.DataFrame(r[labels]).transpose().to_html())
        output_handle.write("<br></br>")

For more on sparse matrices in tcrdist3 see the tab on ‘Working With Bulk Data’.
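
The storage convention noted in the script comments (distances beyond the radius are dropped and read back as 0, while true zeros are stored as -1) can be sketched in plain Python. The dict-per-row layout and the helpers to_sparse_rows and neighbors_sparse are illustrative stand-ins, not tcrdist3 or scipy.sparse functions.

```python
def to_sparse_rows(dist, radius):
    """Keep only entries <= radius; store true zeros as -1 so they are
    not confused with the implicit 0 that means 'beyond the radius'."""
    sparse = []
    for row in dist:
        kept = {}
        for j, d in enumerate(row):
            if d <= radius:
                kept[j] = -1 if d == 0 else d
        sparse.append(kept)
    return sparse

def neighbors_sparse(sparse, radius):
    """Neighbors within `radius`, decoding -1 back to a distance of 0."""
    return [
        sorted(j for j, v in kept.items() if (0 if v == -1 else v) <= radius)
        for kept in sparse
    ]

# Made-up distances; clones 0 and 1 are identical (distance 0).
dist = [
    [0, 0, 30, 80],
    [0, 0, 45, 90],
    [30, 45, 0, 12],
    [80, 90, 12, 0],
]
sparse = to_sparse_rows(dist, radius=50)
print(sparse[0])                     # {0: -1, 1: -1, 2: 30}
print(neighbors_sparse(sparse, 12))  # [[0, 1], [0, 1], [2, 3], [2, 3]]
```

Without the -1 convention, identical TCRs would be indistinguishable from pairs that were dropped for being too far apart.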