Sparse Representation

For large datasets, you may want to set compute_distances to False and then use the sparse implementation. First, set tr.cpus as appropriate for your system. When computing distances with the sparse implementation, the argument radius is the maximum distance to be stored. All distances greater than radius are converted to 0 and therefore not stored, which reduces the memory required by the sparse format. The argument chunk_size tells tcrdist3 how many rows to compute at a time. For instance, if you have 100,000 x 100,000 clones, then a chunk size of 100 will compute a 100 x 100,000 block of distances on each node and store each of the 1,000 intermediate results in a sparse format before recombining them into a single scipy.sparse.csr_matrix. Larger chunk sizes result in less overhead, but chunk size should be tuned to the available memory. The results are stored as the object attributes rw_beta and rw_alpha. Because 0 means "not stored" in a sparse matrix, true 0 distances are represented as -1. The techniques for customizing the distance metric, such as changing trims and gap penalties, also apply when using the sparse implementation.
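The chunking and radius logic described above can be illustrated with a minimal sketch. This is not tcrdist3's implementation: toy_distance (absolute difference between integers) and sparse_distances are hypothetical names that stand in for the real metric and method, and the sketch runs in a single process rather than across CPUs. It only shows how row blocks are thresholded at radius, how true zeros are marked as -1, and how the blocks are stacked into one CSR matrix.

```python
import numpy as np
from scipy import sparse

def toy_distance(chunk, all_values):
    """Hypothetical stand-in metric (absolute difference), NOT the tcrdist metric."""
    return np.abs(chunk[:, None] - all_values[None, :])

def sparse_distances(values, radius, chunk_size):
    """Compute distances chunk_size rows at a time and keep only those <= radius."""
    values = np.asarray(values, dtype=np.int64)
    pieces = []
    for start in range(0, len(values), chunk_size):
        # Dense block: (rows in this chunk) x (all items)
        block = toy_distance(values[start:start + chunk_size], values).astype(np.int16)
        true_zero = block == 0          # remember true zeros before thresholding
        block[block > radius] = 0       # distances beyond the radius are not stored
        block[true_zero] = -1           # mark true zeros so they survive in sparse form
        pieces.append(sparse.csr_matrix(block))
    # Recombine the per-chunk results into a single CSR matrix
    return sparse.vstack(pieces, format="csr")

values = [0, 3, 3, 10, 40]
m = sparse_distances(values, radius=8, chunk_size=2)
print(m.shape, m.nnz)
```

In this toy case only 15 of the 25 entries are stored. With a real repertoire and a small radius the stored fraction is typically much lower, which is where the memory savings come from.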

import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = False)

tr.cpus = 2
tr.compute_sparse_rect_distances(radius = 50, chunk_size = 100)
tr.rw_beta
"""<1920x1920 sparse matrix of type '<class 'numpy.int16'>'
	with 108846 stored elements in Compressed Sparse Row format>
"""
print(tr.rw_beta)
"""
  (0, 0)  -1
  (1, 1)  -1
  (1, 470)  24
  (1, 472)  24
  (2, 2)  -1
  : :
  (1919, 1911)  24
  (1919, 1912)  38
  (1919, 1918)  12
  (1919, 1919)  -1
"""
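When reading such a matrix back, the -1 entries need to be mapped to 0 before the values are used as distances. The following is a small sketch under stated assumptions: rw is a hand-made matrix that follows the convention shown above (it is not real tcrdist3 output), and neighbors is a hypothetical helper, not part of the tcrdist3 API.

```python
import numpy as np
from scipy import sparse

# Hand-made example in the convention above:
# -1 marks a true zero distance, an absent entry means "beyond the radius".
rw = sparse.csr_matrix(np.array([
    [-1, 24,  0],
    [24, -1, 12],
    [ 0, 12, -1],
], dtype=np.int16))

def neighbors(rw, i):
    """Return (column indices, distances) stored for row i, mapping -1 back to 0."""
    row = rw.getrow(i)
    dists = row.data.copy()
    dists[dists == -1] = 0
    return row.indices, dists

cols, dists = neighbors(rw, 1)
print(cols, dists)
```

Only stored entries are returned, so every column in the result is within the radius of row i, including row i itself.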