Working with Bulk Data

For analysis of large repertoires, such as those commonly generated from bulk PBMC TCR beta chain sequencing, tcrdist3 provides the option to fragment the pairwise distance computation into chunks, storing the intermediate results to disk. In practice, small distances between TCRs are usually the quantity of interest, so large TCR distances may be discarded. The remaining set of pairwise edges can then be saved in a sparse matrix format.
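The idea can be sketched in a few lines (illustrative only, not tcrdist3's internals; the cutoff of 50 and the matrix sizes are arbitrary). Note that tcrdist3's actual sparse output appears to store true zero distances as 1 so they are not confused with discarded entries, as the consistency checks later in this section show.

import numpy as np
from scipy import sparse

# One chunk of rows from a large pairwise distance matrix (toy values)
dense_chunk = np.random.randint(0, 200, size=(5, 1000)).astype(np.int16)
dense_chunk[dense_chunk > 50] = 0               # discard distances above the cutoff
sparse_chunk = sparse.csr_matrix(dense_chunk)   # store only the surviving edges
print(f"retained {sparse_chunk.nnz} of {dense_chunk.size} entries")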

This approach allows for the generation of TCR networks analogous to those commonly built by drawing edges between all TCRs within an edit distance of 1. The key advantage is that a TCR distance network forms edges using a biologically informed metric that can accommodate greater sequence divergence.
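A sparse distance matrix maps directly onto such a network: each stored entry is an edge weighted by TCRdist. A hedged sketch of the construction (the toy matrix D and the use of networkx are illustrative assumptions; S stands in for the sparse matrix returned by the computation shown later in this section):

import numpy as np
import networkx as nx
from scipy import sparse

# Toy sparse distance matrix standing in for the reassembled output S
D = np.array([[ 0, 12,  0],
              [12,  0, 30],
              [ 0, 30,  0]], dtype=np.int16)
S = sparse.csr_matrix(D)

coo = sparse.triu(S, k=1).tocoo()   # upper triangle: one edge per retained pair
G = nx.Graph()
G.add_weighted_edges_from(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(G.edges(data=True))           # edges weighted by TCR distance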

Tcrdist3's chunked computation of pairwise distances allows the user to find biochemically similar TCRs within large bulk datasets (>100K unique clones) or combined TCR datasets from multiple individuals, with the computationally intensive task of comparing all TCRs to one another distributed across multiple computing nodes. For instance, computing a tcrdistance network among 100,000 clones involves on the order of 10 billion distances.

Based on available compute resources, the user can configure the chunked computation to the size of their bulk dataset. The critical arguments are row_size (the number of rows of the pairwise matrix to compute in a given chunk) and pm_processes (the number of concurrent processes available).

[Illustrative Table of Memory Usage with Different Configurations]
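As a rough guide, the peak memory footprint scales with row_size × the number of clones × the number of concurrent processes. A back-of-the-envelope sketch (assuming, for illustration only, 2 bytes per stored distance; actual overheads will differ):

import math

n_clones     = 100_000
row_size     = 500       # rows of the pairwise matrix per chunk
pm_processes = 14        # concurrent worker processes

n_chunks        = math.ceil(n_clones / row_size)   # 200 chunks
total_distances = n_clones ** 2                    # ~1e10, as noted above
bytes_per_chunk = row_size * n_clones * 2          # ~100 MB per dense chunk
peak_gb         = bytes_per_chunk * pm_processes / 1e9
print(f"{n_chunks} chunks; ~{peak_gb:.1f} GB of dense chunks in flight")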

The argument max_distance determines the sparsity of the resulting distance matrix: any distance greater than max_distance is set to zero, introducing sparsity. The argument reassemble, if set to True, reassembles the distance matrix fragments written to disk and returns a sparse representation of the distance matrix for further analysis. The argument cleanup, if True, deletes the folder containing the fragments written to disk. If running compute_n_tally_out_of_memory, cleanup should be set to False so that TCR neighbors can also be tallied in chunks. This enables the user to tally the number of neighbors of each TCR associated with demographic or experimental categorical variables of interest (e.g., age, HLA genotype, pre/post-infection). The following examples demonstrate this procedure on medium- and bulk-sized datasets.

from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt
from tcrdist.rep_funcs import compute_pw_sparse_out_of_memory
from tcrdist.rep_funcs import compute_n_tally_out_of_memory

# Load a bulk beta-chain repertoire from an Adaptive file
df = import_adaptive_file(adaptive_filename = 'ADIRP0000895_TCRB.tsv',
                          organism = "human",
                          chain = "beta",
                          count = 'productive_frequency')

tr = TCRrep(cell_df = df,
        organism = 'human',
        chains = ['beta'],
        db_file = 'alphabeta_gammadelta_db.tsv',
        compute_distances = False,
        store_all_cdr = False)

# Compute beta-chain distances in 500-row chunks across 14 processes, writing
# fragments to disk; cleanup = False retains them for the tally step below
S, fragments = compute_pw_sparse_out_of_memory(tr = tr,
        row_size      = 500,
        pm_processes  = 14,
        pm_pbar       = True,
        max_distance  = 37,
        matrix_name   = 'rw_beta',
        reassemble    = True,
        cleanup       = False)


# Tally, for each clone, its neighbors within a TCR distance radius of 36,
# broken down by a categorical variable in the clone data
nn_tally_df_cohort = compute_n_tally_out_of_memory(fragments,
        matrix_name   = "rw_beta",
        pm_processes  = 14,
        to_file       = False,
        to_memory     = True,
        knn_radius    = 36,
        x_cols        = ['categorical_variable'])

# Test for associations between neighbor counts and the categorical variable
from hierdiff.association_testing import cluster_association_test
nn_associations = cluster_association_test(res = nn_tally_df_cohort, y_col = 'cmember', method = 'fishers')
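As a possible follow-up (hedged: this assumes the result table carries an 'FDRq' column, which is referenced at the end of this section, and the 0.05 cutoff is illustrative), one can filter to the significant associations:

# Keep associations passing an illustrative FDR threshold of 0.05
sig_associations = nn_associations[nn_associations['FDRq'] < 0.05]
sig_associations = sig_associations.sort_values('FDRq')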

Tcrdist3 is built to handle paired-chain data (i.e., single-cell data where both the alpha and beta chain sequences were recovered). Tcrdist3's chunked computation approach is appropriate for finding sets of quasi-public 'clonotypes', cells with non-identical but biochemically similar receptors, across multiple samples. It can compare a much larger number of clones than was previously possible.

This example makes use of the dash.csv dataset to show how the memory-conserving chunked computation approach compares to the standard approach, where all distances are retained in memory. For paired or single-chain datasets larger than 10K clones (i.e., >10^8 pairwise comparisons), we recommend the function compute_pw_sparse_out_of_memory2. Furthermore, by setting the function argument cleanup to False, one can subsequently call compute_n_tally_out_of_memory2 to use the chunked distance results, already written to disk, to tally neighbors based on categorical variables of interest.

import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.rep_funcs import compute_pw_sparse_out_of_memory2
from tcrdist.rep_funcs import compute_n_tally_out_of_memory2
from hierdiff.association_testing import cluster_association_test

df = pd.read_csv("dash.csv")
# Subsample 100 clones and compute paired-chain distances in memory, for
# comparison with the out-of-memory results below
tr = TCRrep(cell_df = df.sample(100, random_state = 1),
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = True,
            store_all_cdr = False)

# In the sparse out-of-memory output, true zero distances are stored as 1 so
# they are not lost among the discarded (sparsity) zeros; build reference
# matrices that follow the same convention
check_beta = tr.pw_beta.copy()
check_beta[check_beta == 0] = 1
check_alpha = tr.pw_alpha.copy()
check_alpha[check_alpha == 0] = 1
check_alpha_beta = check_beta + check_alpha


# assign = True replaces tr.pw_beta and tr.pw_alpha with the sparse results
S, fragments = compute_pw_sparse_out_of_memory2(tr = tr,
        row_size      = 50,
        pm_processes  = 1,
        pm_pbar       = True,
        max_distance  = 1000,
        reassemble    = True,
        cleanup       = False,
        assign        = True)

# The reassembled sparse matrices match the in-memory distances, with true
# zero distances stored as 1 in the sparse representation
assert np.all(tr.pw_beta == check_beta)
assert np.all(tr.pw_alpha == check_alpha)

# Tally each clone's neighbors (paired-chain distance <= 100) by epitope,
# reusing the distance fragments already written to disk
ndif1 = compute_n_tally_out_of_memory2(fragments,
        to_file = False,
        to_memory = True,
        pm_processes = 2,
        x_cols = ['epitope'],
        count_col = 'count',
        knn_neighbors = None,
        knn_radius = 100)

ndif1 = cluster_association_test(res = ndif1, y_col='cmember', method='chi2')


# Equivalent in-memory computation with neighborhood_diff, for comparison
from tcrdist.rep_diff import neighborhood_diff
ndif2 = neighborhood_diff(clone_df = tr.clone_df,
    pwmat = np.array(tr.pw_beta.todense() + tr.pw_alpha.todense()),
    count_col = 'count',
    x_cols = ['epitope'],
    knn_radius = 100,
    test_method = "chi2")

# The out-of-memory and in-memory results agree
assert ndif1.shape == ndif2.shape
assert np.all(ndif1['FDRq'].to_list() == ndif2['FDRq'].to_list())