Integrate scRNA-seq datasets#
scRNA-seq data integration is the process of analyzing data from several scRNA-seq experiments together to uncover shared or distinct biological patterns.
Here, we'll demonstrate how to query two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.
Setup#
!lamin load test-scrna
found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
loaded instance: testuser1/test-scrna
import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
loaded instance: testuser1/test-scrna (lamindb 0.54.1)
ln.track()
notebook imports: anndata==0.9.2 lamindb==0.54.1 lnschema_bionty==0.31.2
record with similar name exist! did you mean to load it?
name | id | __ratio__ |
---|---|---|
scRNA-seq | Nv48yAceNSh8z8 | 90.0 |
Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-09-22 18:44:51, created_by_id='DzTjkKse')
Run(id='cDadTpYy2vMW7XBYQlb0', run_at=2023-09-22 18:44:51, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')
Access #
Query files by provenance metadata#
users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("scrna")
name | id | __ratio__ |
---|---|---|
Integrate scRNA-seq datasets | agayZTonayqAz8 | 90.0 |
scRNA-seq | Nv48yAceNSh8z8 | 90.0 |
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GzA3KMdHzowOYsClkbvy | yNdwkjSP | None | .h5ad | AnnData | Conde22 | None | 28049505 | WEFcMZxJNmMiUOFrcSTaig | md5 | Nv48yAceNSh8z8 | Nv39dIk0xeRfAOZAwfvB | None | 2023-09-22 18:44:21 | DzTjkKse |
D4Soc2iFauHfymG956ss | yNdwkjSP | None | .h5ad | AnnData | 10x reference pbmc68k | None | 660792 | a2V0IgOjMRHsCeZH169UOQ | md5 | Nv48yAceNSh8z8 | Nv39dIk0xeRfAOZAwfvB | None | 2023-09-22 18:44:45 | DzTjkKse |
Query files based on biological metadata#
assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.gamma_delta_t_cell,
)
query.df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D4Soc2iFauHfymG956ss | yNdwkjSP | None | .h5ad | AnnData | 10x reference pbmc68k | None | 660792 | a2V0IgOjMRHsCeZH169UOQ | md5 | Nv48yAceNSh8z8 | Nv39dIk0xeRfAOZAwfvB | None | 2023-09-22 18:44:45 | DzTjkKse |
GzA3KMdHzowOYsClkbvy | yNdwkjSP | None | .h5ad | AnnData | Conde22 | None | 28049505 | WEFcMZxJNmMiUOFrcSTaig | md5 | Nv48yAceNSh8z8 | Nv39dIk0xeRfAOZAwfvB | None | 2023-09-22 18:44:21 | DzTjkKse |
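Both files are returned because each one is annotated with all three labels. The keyword arguments are assumed here to combine conjunctively (logical AND). The plain-Python sketch below mimics that logic on hand-written records; `records` and `matches_all` are hypothetical names, the label values are truncated copies of the annotations shown on this page, and nothing in it calls the lamindb API.

```python
# Hypothetical stand-ins for registered files and their labels
# (label values copied, truncated, from the describe() output on this page).
records = [
    {
        "description": "Conde22",
        "species": {"human"},
        "experimental_factors": {"single-cell RNA sequencing", "10x 5' v1"},
        "cell_types": {"gamma-delta T cell", "mast cell"},
    },
    {
        "description": "10x reference pbmc68k",
        "species": {"human"},
        "experimental_factors": {"single-cell RNA sequencing"},
        "cell_types": {"gamma-delta T cell", "monocyte"},
    },
]


def matches_all(record, **required):
    """Return True only if the record carries every required label (AND)."""
    return all(value in record[field] for field, value in required.items())


hits = [
    r["description"]
    for r in records
    if matches_all(
        r,
        species="human",
        experimental_factors="single-cell RNA sequencing",
        cell_types="gamma-delta T cell",
    )
]
print(hits)
```

Under this reading, requiring a label that only one file carries, such as `mast cell`, would narrow the result to that single file.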
Transform #
Compare gene sets#
Get file objects:
query = ln.File.filter()
file1, file2 = query.list()
file1.describe()
File(id='GzA3KMdHzowOYsClkbvy', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-22 18:44:21)
Provenance:
  storage: Storage(id='yNdwkjSP', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-22 18:43:43, created_by_id='DzTjkKse')
  transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-09-22 18:44:45, created_by_id='DzTjkKse')
  run: Run(id='Nv39dIk0xeRfAOZAwfvB', run_at=2023-09-22 18:43:45, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-22 18:43:43)
Features:
  var: FeatureSet(id='2gQIre5ht93RP9Br7AxJ', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-22 18:44:16, modality_id='YVd1fHWO', created_by_id='DzTjkKse')
    'LINC01088', 'AP2S1', 'ADSL', 'USP16', 'None', 'None', 'SCAT2', 'ZNF45-AS1', 'LINC02132', 'XIRP2-AS1', ...
  obs: FeatureSet(id='ACQDyVarceSpQOe20uFE', n=4, registry='core.Feature', hash='Pku8H0niKZ8uYnQMyx1J', updated_at=2023-09-22 18:44:21, modality_id='zaCpJM7g', created_by_id='DzTjkKse')
    tissue (17, bionty.Tissue): 'caecum', 'bone marrow', 'lung', 'thymus', 'liver', 'mesenteric lymph node', 'lamina propria', 'jejunal epithelium', 'duodenum', 'thoracic lymph node', ...
    donor (12, core.ULabel): '582C', 'A35', 'D503', 'A29', 'A52', '640C', 'A31', 'D496', '621B', 'A36', ...
    cell_type (32, bionty.CellType): 'gamma-delta T cell', 'mast cell', 'non-classical monocyte', 'plasmablast', 'megakaryocyte', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'mucosal invariant T cell', 'plasmacytoid dendritic cell', 'progenitor cell', 'CD16-positive, CD56-dim natural killer cell, human', ...
    assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
Labels:
  species (1, bionty.Species): 'human'
  tissues (17, bionty.Tissue): 'caecum', 'bone marrow', 'lung', 'thymus', 'liver', 'mesenteric lymph node', 'lamina propria', 'jejunal epithelium', 'duodenum', 'thoracic lymph node', ...
  cell_types (32, bionty.CellType): 'gamma-delta T cell', 'mast cell', 'non-classical monocyte', 'plasmablast', 'megakaryocyte', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'mucosal invariant T cell', 'plasmacytoid dendritic cell', 'progenitor cell', 'CD16-positive, CD56-dim natural killer cell, human', ...
  experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
  ulabels (12, core.ULabel): '582C', 'A35', 'D503', 'A29', 'A52', '640C', 'A31', 'D496', '621B', 'A36', ...
file1.view_flow()
file2.describe()
File(id='D4Soc2iFauHfymG956ss', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=660792, hash='a2V0IgOjMRHsCeZH169UOQ', hash_type='md5', updated_at=2023-09-22 18:44:45)
Provenance:
  storage: Storage(id='yNdwkjSP', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-22 18:43:43, created_by_id='DzTjkKse')
  transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-09-22 18:44:45, created_by_id='DzTjkKse')
  run: Run(id='Nv39dIk0xeRfAOZAwfvB', run_at=2023-09-22 18:43:45, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-22 18:43:43)
Features:
  var: FeatureSet(id='GglELLiZwTYIyev6GwOp', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-09-22 18:44:45, modality_id='YVd1fHWO', created_by_id='DzTjkKse')
    'CYTL1', 'PSMC3', 'AP2S1', 'RHOC', 'PDAP1', 'TAGLN2', 'LBH', 'ADSL', 'CCL4', 'PLAC8', ...
  obs: FeatureSet(id='tfrfeotun53IO4o0g2Pj', n=1, registry='core.Feature', hash='k3ON0Ea-SwSaTVbRu7kE', updated_at=2023-09-22 18:44:45, modality_id='zaCpJM7g', created_by_id='DzTjkKse')
    cell_type (9, bionty.CellType): 'gamma-delta T cell', 'cytotoxic T cell', 'CD4-positive, alpha-beta T cell', 'CD24-positive, CD4 single-positive thymocyte', 'B cell, CD19-positive', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'CD16-positive, CD56-dim natural killer cell, human', 'monocyte'
  external: FeatureSet(id='l8GZYinuhuSSFpV55ch4', n=2, registry='core.Feature', hash='2DlkyLpMca3LGwfc7E2N', updated_at=2023-09-22 18:44:46, modality_id='zaCpJM7g', created_by_id='DzTjkKse')
    species (1, bionty.Species): 'human'
    assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
Labels:
  species (1, bionty.Species): 'human'
  cell_types (9, bionty.CellType): 'gamma-delta T cell', 'cytotoxic T cell', 'CD4-positive, alpha-beta T cell', 'CD24-positive, CD4 single-positive thymocyte', 'B cell, CD19-positive', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'CD16-positive, CD56-dim natural killer cell, human', 'monocyte'
  experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file2.view_flow()
Load files into memory:
file1_adata = file1.load()
file2_adata = file2.load()
The shared genes can be computed from the registered feature sets alone, without loading the files:
file1_genes = file1.features["var"]
file2_genes = file2.features["var"]
shared_genes = file1_genes & file2_genes
len(shared_genes)
749
shared_genes.list("symbol")[:10]
['AP2S1',
'ADSL',
'NIFK',
'LYL1',
'UPP1',
'AHSA1',
'JOSD2',
'ERP29',
'GYPC',
'NAP1L1']
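The `&` operator above is assumed to behave like a set intersection over the gene records linked to each file. As a rough illustration, the same operation on plain Python sets looks like this; the symbols are only the first few listed for each file in the `describe()` output (truncated excerpts, not the full feature sets), and `file1_symbols` and `file2_symbols` are hypothetical names.

```python
# Truncated excerpts of the gene symbols printed by describe() for each file.
file1_symbols = {"LINC01088", "AP2S1", "ADSL", "USP16", "SCAT2", "ZNF45-AS1"}
file2_symbols = {"CYTL1", "PSMC3", "AP2S1", "RHOC", "PDAP1", "TAGLN2", "LBH", "ADSL"}

# Intersection keeps only the symbols present in both sets.
shared_symbols = file1_symbols & file2_symbols
print(sorted(shared_symbols))
```

Because the comparison runs on registered metadata, it does not need the expression matrices in memory.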
Compare cell types#
file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()
shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['gamma-delta T cell', 'CD16-positive, CD56-dim natural killer cell, human']
We can now subset the two datasets by shared cell types:
file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]
file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
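The subsetting relies on a boolean mask built with pandas `isin`. A minimal sketch of that step on a toy table standing in for `adata.obs` follows; the cell barcodes and rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `adata.obs`; barcodes and annotations are made up.
obs = pd.DataFrame(
    {
        "cell_type": [
            "gamma-delta T cell",
            "monocyte",
            "mast cell",
            "gamma-delta T cell",
        ]
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)
shared_celltypes_names = [
    "gamma-delta T cell",
    "CD16-positive, CD56-dim natural killer cell, human",
]

# True where the cell's annotation is one of the shared cell types.
mask = obs["cell_type"].isin(shared_celltypes_names)
obs_subset = obs[mask]
print(obs_subset.index.tolist())
```

Indexing an AnnData object with such a mask returns a view; call `.copy()` on the subset if it needs to be modified independently of the original.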
Concatenate subsetted datasets:
adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 187 × 749
    obs: 'cell_type', 'file'
    obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type                                           file
CD16-positive, CD56-dim natural killer cell, human  Conde22                  114
gamma-delta T cell                                  Conde22                   66
                                                    10x reference pbmc68k      4
CD16-positive, CD56-dim natural killer cell, human  10x reference pbmc68k      3
dtype: int64
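The concatenated object has 749 variables, the same number as the shared genes computed earlier. This is consistent with `ad.concat` defaulting to `join="inner"`, which keeps only the variables present in every input. The sketch below reproduces the idea with plain pandas rather than anndata, as an analogy only: the tables and values are made up, and the `keys` argument plays the role of `label`/`keys` by recording each row's source.

```python
import pandas as pd

# Toy expression tables (rows = cells, columns = genes); all values are made up.
df1 = pd.DataFrame(
    [[1, 0, 2], [0, 3, 1]],
    index=["a1", "a2"],
    columns=["AP2S1", "ADSL", "USP16"],
)
df2 = pd.DataFrame([[4, 1]], index=["b1"], columns=["ADSL", "AP2S1"])

# join="inner" keeps only the columns found in every input table;
# keys/names add an index level that records which table a row came from.
combined = pd.concat(
    [df1, df2],
    join="inner",
    keys=["Conde22", "10x reference pbmc68k"],
    names=["file", "cell"],
)
counts = combined.index.get_level_values("file").value_counts().to_dict()
print(sorted(combined.columns))
print(counts)
```

Passing `join="outer"` instead would keep the union of genes and fill the gaps with missing values, which is usually not what an integration on shared genes needs.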
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
deleting instance testuser1/test-scrna
deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
instance cache deleted
deleted '.lndb' sqlite file
consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna