Concatenate datasets to a single array store¶
In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.
Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices defined by arbitrary metadata.
This is also what CELLxGENE does to create Census: a number of .h5ad files are concatenated into a single tiledbsoma array store (CELLxGENE: scRNA-seq).
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce
ln.track("oJN8WmVrxI8m0000")
Query the collection of h5ad files that we'd like to concatenate into a single array.
collection = ln.Collection.get(key="scrna/collection1", version="2")
collection.describe()
Prepare the AnnData objects¶
To concatenate the AnnData objects into a single tiledbsoma.Experiment, they need to have the same .var and .obs columns.
# load a number of AnnData objects that's small enough to fit into memory
adatas = [artifact.load() for artifact in collection.ordered_artifacts]
# compute the intersection of columns for these objects
var_columns = reduce(
    pd.Index.intersection, [adata.var.columns for adata in adatas]
)  # this only affects metadata columns of features (say, gene annotations)
var_raw_columns = reduce(
    pd.Index.intersection, [adata.raw.var.columns for adata in adatas]
)
obs_columns = reduce(
    pd.Index.intersection, [adata.obs.columns for adata in adatas]
)  # this only affects metadata columns of observations (cells)
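Optionally, you can first inspect which columns fall outside these intersections; the snippet below is only a quick sanity check (not required for the workflow) and assumes the adatas and intersections defined above.
# optional sanity check: list the columns that will be dropped from each AnnData
for i, adata in enumerate(adatas):
    dropped_obs = adata.obs.columns.difference(obs_columns)
    dropped_var = adata.var.columns.difference(var_columns)
    print(f"dataset {i}: dropping obs columns {list(dropped_obs)} and var columns {list(dropped_var)}")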
Prepare the AnnData objects for concatenation: prepare id fields, sanitize index names, intersect columns, and drop .obsp, .uns, and any columns that aren't part of the intersection.
for i, adata in enumerate(adatas):
    del adata.obsp  # not supported by tiledbsoma
    del adata.uns  # not supported by tiledbsoma
    adata.obs = adata.obs.filter(obs_columns)  # filter columns to intersection
    adata.obs["obs_id"] = (
        adata.obs.index
    )  # prepare a column for tiledbsoma to use as an index
    adata.obs["dataset"] = i
    adata.obs.index.name = None
    adata.var = adata.var.filter(var_columns)  # filter columns to intersection
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None
Create the array store¶
Save the AnnData objects in one array store referenced by an Artifact.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    adatas,
    description="tiledbsoma experiment",
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id",
    append_obsm_varm=True,
)
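If you'd like to confirm what was just registered, you can inspect the new artifact; this is an optional step.
# optional: inspect the registered array-store artifact
soma_artifact.describe()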
Note
Provenance is tracked by writing the current run.uid to tiledbsoma.Experiment.obs as lamin_run_uid.
If you know the tiledbsoma API, note that save_tiledbsoma_experiment() abstracts over both tiledbsoma.io.register_anndatas and tiledbsoma.io.from_anndata.
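As a minimal sketch of what this enables, you can read the lamin_run_uid column back from obs to see how many cells each run wrote; the column names assume the store created above.
# optional: check which run wrote which cells via the lamin_run_uid column
with soma_artifact.open() as soma_store:
    run_uids = (
        soma_store["obs"]
        .read(column_names=["obs_id", "lamin_run_uid"])
        .concat()
        .to_pandas()
    )
run_uids["lamin_run_uid"].value_counts()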
Query the array store¶
Here we query the obs from the array store.
with soma_artifact.open() as soma_store:
    obs = soma_store["obs"]
    var = soma_store["ms"]["RNA"]["var"]
    obs_columns_store = obs.schema.names
    var_columns_store = var.schema.names
    obs_store_df = obs.read().concat().to_pandas()
display(obs_store_df)
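Because all datasets now live in a single store, ad-hoc metadata slices don't require loading the full obs table. As a sketch, the query below uses the dataset column written above together with a tiledbsoma value_filter expression to read only the cells of the first dataset.
# a sketch of an ad-hoc metadata slice: read only the cells of the first dataset
with soma_artifact.open() as soma_store:
    obs_subset = (
        soma_store["obs"]
        .read(value_filter="dataset == 0")
        .concat()
        .to_pandas()
    )
obs_subset.head()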
Append to the array store¶
Prepare a new AnnData object to be appended to the store.
ln.core.datasets.anndata_with_obs().write_h5ad("adata_to_append.h5ad")
!lamin save adata_to_append.h5ad --description "adata to append"
adata = ln.Artifact.filter(description="adata to append").one().load()
adata.obs_names_make_unique()
adata.var_names_make_unique()
adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index
adata.obs["dataset"] = obs_store_df["dataset"].max()
# keep only the obs and var columns that already exist in the store schema
obs_columns_same = [
    obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store
]
adata.obs = adata.obs[obs_columns_same]
var_columns_same = [
    var_col for var_col in adata.var.columns if var_col in var_columns_store
]
adata.var = adata.var[var_columns_same]
adata.write_h5ad("adata_to_append.h5ad")
Append the AnnData object from disk by revising soma_artifact.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    ["adata_to_append.h5ad"],
    revises=soma_artifact,
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id",
)
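To verify the append, you can re-open the revised artifact and check that the number of cells grew and how they distribute over the dataset column; this is an optional check.
# optional check: inspect obs after the append
with soma_artifact.open() as soma_store:
    obs_after_append = soma_store["obs"].read().concat().to_pandas()
print(obs_after_append.shape)
obs_after_append["dataset"].value_counts()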
Update the array store¶
Add a new embedding to the existing array store.
# read the data matrix
with soma_artifact.open() as soma_store:
    ms_rna = soma_store["ms"]["RNA"]
    n_obs = len(soma_store["obs"])
    n_var = len(ms_rna["var"])
    X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()

# calculate PCA embedding from the queried `X`
pca_array = sc.pp.pca(X, n_comps=2)
Open the array store in write mode and add PCA.
with soma_artifact.open(mode="w") as soma_store:
    tiledbsoma.io.add_matrix_to_collection(
        exp=soma_store,
        measurement_name="RNA",
        collection_name="obsm",
        matrix_name="pca",
        matrix_data=pca_array,
    )
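To confirm that the embedding was written, you can read it back; the sketch below assumes the matrix was ingested as a sparse array (the default for this helper) and reuses the n_obs computed above.
# optional check: read the newly added PCA embedding back from the store
with soma_artifact.open() as soma_store:
    pca_readback = (
        soma_store["ms"]["RNA"]["obsm"]["pca"]
        .read()
        .coos((n_obs, 2))
        .concat()
        .to_scipy()
    )
pca_readback.shape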
See array store mutations¶
During the append and update operations above, the data in the array store changed. LaminDB automatically tracks these revisions, recording the number of objects, hashes, and provenance.
soma_artifact.versions.df()
View lineage of the array store¶
Visualize the data lineage of the array store.
soma_artifact.view_lineage()
Note
For the underlying API, see the tiledbsoma documentation.