Curate AnnData based on the CELLxGENE schema¶
This guide shows how to curate an AnnData object against the CELLxGENE schema v5.1.0 with the help of laminlabs/cellxgene.
Load your instance where you want to register the curated AnnData object:
# pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --modules bionty
import lamindb as ln
import bionty as bt
def get_semi_curated_dataset():
    adata = ln.core.datasets.anndata_human_immune_cells()
    adata.obs["sex_ontology_term_id"] = "PATO:0000384"
    adata.obs["organism"] = "human"
    adata.obs["sex"] = "unknown"
    # create some typos in the metadata
    adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories({"lung": "lungg"})
    # new donor ids
    adata.obs["donor"] = adata.obs["donor"].astype(str) + "-1"
    # drop animal cell
    adata = adata[adata.obs["cell_type"] != "animal cell", :]
    # remove columns that are reserved in the cellxgene schema
    adata.var.drop(columns=["feature_reference", "feature_biotype"], inplace=True)
    adata.raw.var.drop(
        columns=["feature_name", "feature_reference", "feature_biotype"], inplace=True
    )
    return adata
→ connected lamindb: testuser1/test-cellxgene-curate
Let’s start with an AnnData object that we’d like to inspect and curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.
adata = get_semi_curated_dataset()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata
Initially, the dataset does not pass CZI’s cellxgene-schema validator, so we need to curate it.
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1
Validate and curate metadata¶
We create a curator object that references the AnnData object. During instantiation, any lamindb.Feature records are saved.
curator = ln.curators.CellxGeneAnnDataCatManager(adata, schema_version="5.1.0")
validated = curator.validate()
✗ missing required obs columns 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'suspension_type', 'tissue_type'
→ consider initializing a Curate object with `defaults=cxg.CellxGeneAnnDataCatManager.cxg_categoricals_defaults` to automatically add these columns with default values
Let’s fix the “donor_id” column name:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
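The effect of the rename on the validator message can be sketched in plain Python. This is only an illustration of the check, not how the curator implements it: the required column names are copied from the message above, while `missing_columns` and the example column list are hypothetical stand-ins for `adata.obs.columns`.

```python
# Required obs columns, copied from the validator message above.
REQUIRED_OBS_COLUMNS = [
    "development_stage",
    "disease",
    "donor_id",
    "self_reported_ethnicity",
    "suspension_type",
    "tissue_type",
]


def missing_columns(columns, required=REQUIRED_OBS_COLUMNS):
    """Return the required column names that are absent, in the order given."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical column list standing in for `adata.obs.columns`.
obs_columns = ["donor", "tissue", "cell_type", "assay", "organism", "sex"]
before = missing_columns(obs_columns)

# Mirror the rename of "donor" to "donor_id".
renamed = ["donor_id" if name == "donor" else name for name in obs_columns]
after = missing_columns(renamed)

print("missing before rename:", before)
print("missing after rename:", after)
```

After the rename, "donor_id" no longer appears in the missing list, and the remaining columns are the ones that defaults can fill.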
For the missing columns, we can pass the default values suggested by CELLxGENE, which automatically adds these columns to the AnnData object:
ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults
Note
CELLxGENE requires the columns tissue, organism, and assay to have existing values from the ontologies. Therefore, these columns need to be added and populated manually.
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata,
    defaults=ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults,
    schema_version="5.1.0",
)
validated = curator.validate()
validated
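Conceptually, passing defaults adds each missing column with a constant value. The sketch below shows that idea in plain Python, assuming that columns already present are left untouched. The helper `apply_defaults` and the values in `example_defaults` are placeholders for illustration; inspect `cxg_categoricals_defaults` for the values the library actually uses.

```python
def apply_defaults(obs, defaults, n_obs):
    """Add each missing column filled with its default; keep existing columns."""
    filled = dict(obs)
    for column, value in defaults.items():
        if column not in filled:
            filled[column] = [value] * n_obs
    return filled


# Placeholder defaults for illustration only, not the library's real values.
example_defaults = {"disease": "unknown", "suspension_type": "unknown"}

# Hypothetical obs table with two cells, stored as column name -> values.
obs = {"donor_id": ["D1-1", "D2-1"], "disease": ["normal", "normal"]}

result = apply_defaults(obs, example_defaults, n_obs=2)
print(result)
```

Here the existing "disease" column keeps its values, while "suspension_type" is created and filled with the placeholder default.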
Remove unvalidated values¶
We remove all unvalidated genes. These genes may exist in a different Ensembl release but are not valid for the Ensembl version pinned by the CELLxGENE schema version used here (Ensembl release 110).
curator.non_validated
adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
curator = ln.curators.CellxGeneAnnDataCatManager(adata, schema_version="5.1.0")
✗ Could not find source: ExperimentalFactor
→ consider running `bionty.core.sync_public_sources()`
✗ Could not find source: CellType
→ consider running `bionty.core.sync_public_sources()`
✗ Could not find source: DevelopmentalStage
→ consider running `bionty.core.sync_public_sources()`
✗ Could not find source: Disease
→ consider running `bionty.core.sync_public_sources()`
✗ Could not find source: Phenotype
→ consider running `bionty.core.sync_public_sources()`
✗ Could not find source: Tissue
→ consider running `bionty.core.sync_public_sources()`
✗ Could not find source: Gene
→ consider running `bionty.core.sync_public_sources()`
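The gene filtering in this section applies the same mask to the main matrix and to `.raw`, so that gene names and matrix columns stay aligned. A minimal plain-Python sketch of that masking step follows; `drop_genes` and the gene identifiers are hypothetical and only illustrate the logic, whereas the real code relies on AnnData indexing.

```python
def drop_genes(var_names, matrix, unvalidated):
    """Drop unvalidated genes from the gene list and the matching matrix columns."""
    unvalidated = set(unvalidated)
    # Boolean mask: True for genes to keep, in var_names order.
    keep = [name not in unvalidated for name in var_names]
    kept_names = [name for name, k in zip(var_names, keep) if k]
    kept_matrix = [[value for value, k in zip(row, keep) if k] for row in matrix]
    return kept_names, kept_matrix


# Hypothetical gene identifiers and a 2-cell x 3-gene count matrix.
var_names = ["GENE_A", "GENE_B", "GENE_C"]
matrix = [
    [1, 0, 5],
    [2, 3, 0],
]

kept_names, kept_matrix = drop_genes(var_names, matrix, unvalidated=["GENE_B"])
print(kept_names)
print(kept_matrix)
```

Because the mask is computed once from the gene names, each row loses exactly the columns that belong to the removed genes.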
Register new metadata labels¶
Following the suggestions above, register the genes and labels that aren’t present in the current instance.
(Note that our instance is rather empty. Once you have filled up the registries, registering new labels won’t be needed frequently.)
An error is shown for the tissue label “lungg”, which is a typo and should be “lung”. Let’s fix it:
tissues = curator.lookup(public=True).tissue
tissues.lung
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
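When a typo is less obvious than “lungg”, fuzzy string matching can propose candidate labels before you look them up. The sketch below uses Python’s standard `difflib` module; it is independent of lamindb, and `suggest_label` together with the short list of tissue names is a hypothetical stand-in for a real ontology lookup.

```python
from difflib import get_close_matches


def suggest_label(value, valid_labels, cutoff=0.8):
    """Return the closest valid label, or None if nothing is similar enough."""
    matches = get_close_matches(value, valid_labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical subset of valid tissue names.
valid_tissues = ["lung", "liver", "blood", "spleen"]

suggestion = suggest_label("lungg", valid_tissues)
no_suggestion = suggest_label("xyz", valid_tissues)

print(suggestion)
print(no_suggestion)
```

A suggestion is only a hint: confirm it against the ontology, as done with the lookup above, before renaming categories.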
Let’s validate the object again:
validated = curator.validate()
validated
adata.obs.head()
Save artifact¶
artifact = curator.save_artifact(
    key=f"my_datasets/dataset-curated-against-cxg-{curator.schema_version}.h5ad"
)
artifact.describe()
Return an input h5ad file for cellxgene-schema¶
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1
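Outside a notebook, the same check can be run from a Python script. The sketch below wraps the command shown above with `subprocess`; the helper names `validate_command` and `run_validation` are hypothetical, and it assumes that `uvx` is on the PATH and that a non-zero exit code signals a failed validation.

```python
import os
import subprocess


def validate_command(h5ad_path):
    """Build the CLI call used in this guide for a given h5ad file."""
    return ["uvx", "cellxgene-schema", "validate", h5ad_path]


def run_validation(h5ad_path):
    """Run the validator and return True if it exits with code 0."""
    # Same headless matplotlib backend as in the notebook command.
    env = {**os.environ, "MPLBACKEND": "agg"}
    completed = subprocess.run(validate_command(h5ad_path), env=env)
    return completed.returncode == 0


cmd = validate_command("anndata_human_immune_cells_cxg.h5ad")
print(" ".join(cmd))
```

Calling `run_validation("anndata_human_immune_cells_cxg.h5ad")` would then execute the validator and report success as a boolean.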
Note
The curator class is designed to validate all metadata for adherence to ontologies. It does not reimplement all rules of the CELLxGENE schema, so we recommend running cellxgene-schema if full adherence beyond metadata is required.