Sparse Representation
For large datasets, you may want to set compute_distances to False and then use a sparse implementation. First, set tr.cpus as appropriate to your system. When computing distances with the sparse implementation, the argument radius is the maximum distance to be stored. All distances greater than radius are converted to 0, which reduces the memory required in a sparse format. The argument chunk_size tells tcrdist3 how many rows to compute at a time. For instance, if you have 100,000 clones (a 100,000 x 100,000 distance matrix), a chunk size of 100 computes a 100 x 100,000 block of distances on each node and stores each of the 1,000 intermediate results in a sparse format before recombining them into a single scipy.sparse.csr_matrix. Larger chunk sizes result in less overhead, but chunk size should be tuned based on available memory. The results are stored as the object attributes rw_beta and rw_alpha. True 0 distances are represented as -1, so they are not confused with the 0 used for distances beyond radius. The techniques for customizing the distance metric, such as changing trims and gap penalties, also apply when using the sparse implementation.
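To make the roles of radius and chunk_size concrete, here is a minimal sketch of the chunk, threshold, and recombine pattern described above. It is not tcrdist3's implementation: the function name chunked_sparse_distances is hypothetical, and a toy absolute-difference distance on integers stands in for the TCR distance.

```python
import numpy as np
from scipy import sparse


def chunked_sparse_distances(points, radius, chunk_size):
    """Illustrative only (hypothetical helper, not part of tcrdist3).

    points is a 1-D integer array standing in for clones; the toy
    distance between two points is |a - b|.
    """
    n = len(points)
    blocks = []
    for start in range(0, n, chunk_size):
        rows = points[start:start + chunk_size]
        # Dense block of shape (len(rows), n); only one block is dense at a time.
        d = np.abs(rows[:, None] - points[None, :]).astype(np.int16)
        # Encode true zero distances as -1 so they survive sparsification.
        d[d == 0] = -1
        # Distances beyond the radius become 0 and are therefore not stored.
        d[d > radius] = 0
        blocks.append(sparse.csr_matrix(d))
    # Recombine the per-chunk results into a single CSR matrix.
    return sparse.vstack(blocks, format="csr")


points = np.array([0, 3, 3, 10, 60])
m = chunked_sparse_distances(points, radius=5, chunk_size=2)
print(m.shape, m.nnz)
```

In this sketch a larger chunk_size means fewer, larger dense blocks, which is the overhead versus memory trade-off mentioned above.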
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = False)
tr.cpus = 2
tr.compute_sparse_rect_distances(radius = 50, chunk_size = 100)
tr.rw_beta
"""<1920x1920 sparse matrix of type '<class 'numpy.int16'>'
with 108846 stored elements in Compressed Sparse Row format>
"""
print(tr.rw_beta)
"""
(0, 0) -1
(1, 1) -1
(1, 470) 24
(1, 472) 24
(2, 2) -1
: :
(1919, 1911) 24
(1919, 1912) 38
(1919, 1918) 12
(1919, 1919) -1
"""
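Because true 0 distances are stored as -1, downstream code usually needs to map them back before using the values. The following is a minimal sketch using only scipy and numpy; the helper name stored_pairs is hypothetical, and a small toy matrix stands in for tr.rw_beta.

```python
import numpy as np
from scipy import sparse


def stored_pairs(rw):
    """Return (rows, cols, distances) for the stored entries of a sparse
    distance matrix, mapping the -1 sentinel back to a true distance of 0.
    Hypothetical helper, not part of tcrdist3.
    """
    coo = sparse.coo_matrix(rw)
    dist = np.where(coo.data == -1, 0, coo.data)
    return coo.row, coo.col, dist


# Toy stand-in for tr.rw_beta: clones 0 and 1 are 24 apart, clone 2 has
# no neighbor within the radius.
rw = sparse.csr_matrix(np.array([[-1, 24, 0],
                                 [24, -1, 0],
                                 [0, 0, -1]], dtype=np.int16))
rows, cols, dist = stored_pairs(rw)

# Neighbors of clone 0 within the radius, excluding itself.
neighbors_of_0 = cols[(rows == 0) & (cols != 0)]
print(neighbors_of_0, dist)
```

Pairs that do not appear among the stored entries are farther apart than radius; they should not be read as having a distance of 0.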