Curate `AnnData` based on the CELLxGENE schema¶

This guide shows how to curate an AnnData object with the help of laminlabs/cellxgene against the CELLxGENE schema v5.1.0.

Load your instance where you want to register the curated AnnData object:

# pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --modules bionty

import lamindb as ln
import bionty as bt


def get_semi_curated_dataset():
    adata = ln.core.datasets.anndata_human_immune_cells()
    adata.obs["sex_ontology_term_id"] = "PATO:0000384"
    adata.obs["organism"] = "human"
    adata.obs["sex"] = "unknown"
    # create some typos in the metadata
    adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories({"lung": "lungg"})
    # new donor ids
    adata.obs["donor"] = adata.obs["donor"].astype(str) + "-1"
    # drop animal cell
    adata = adata[adata.obs["cell_type"] != "animal cell", :]
    # remove columns that are reserved in the cellxgene schema
    adata.var.drop(columns=["feature_reference", "feature_biotype"], inplace=True)
    adata.raw.var.drop(
        columns=["feature_name", "feature_reference", "feature_biotype"], inplace=True
    )
    return adata

→ connected lamindb: testuser1/test-cellxgene-curate

Let’s start with an AnnData object that we’d like to inspect and curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.

adata = get_semi_curated_dataset()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata

Initially, the dataset does not pass CZI’s cellxgene-schema validator, so we need to curate it.

!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1

Validate and curate metadata¶

We create a curator object that references the AnnData object. During instantiation, any `lamindb.Feature` records are saved.

curator = ln.curators.CellxGeneAnnDataCatManager(adata, schema_version="5.1.0")

validated = curator.validate()

✗ missing required obs columns 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'suspension_type', 'tissue_type'
    → consider initializing a Curate object with `defaults=cxg.CellxGeneAnnDataCatManager.cxg_categoricals_defaults` to automatically add these columns with default values

The dataset stores donor IDs in a column named “donor”. Let’s rename it to the required “donor_id”:

adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)

For the remaining missing columns, we can pass the default values suggested by CELLxGENE, which automatically adds them to the AnnData object:

ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults

Note

CELLxGENE requires columns tissue, organism, and assay to have existing values from the ontologies. Therefore, these columns need to be added and populated manually.
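
As a rough illustration of what passing defaults does conceptually, the sketch below finds the required columns that are absent and fills each with a constant value. It uses a plain dictionary of lists instead of `adata.obs`. The helper names `find_missing_columns` and `apply_defaults` are hypothetical, and the `EXAMPLE_DEFAULTS` values are stand-ins chosen to mirror the values visible in the `adata.obs` output of this guide; they are not the actual contents of `cxg_categoricals_defaults`.

```python
# Hypothetical defaults for illustration only; the real values live in
# ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults.
EXAMPLE_DEFAULTS = {
    "development_stage": "unknown",
    "disease": "normal",
    "self_reported_ethnicity": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}


def find_missing_columns(obs, required):
    """Return the required column names that are absent from `obs`, in order."""
    return [name for name in required if name not in obs]


def apply_defaults(obs, defaults, n_rows):
    """Return a copy of `obs` with every missing default column filled in."""
    filled = dict(obs)
    for name, value in defaults.items():
        if name not in filled:
            # Every cell receives the same default value.
            filled[name] = [value] * n_rows
    return filled


# A toy stand-in for adata.obs: column name -> list of per-cell values.
obs = {"donor_id": ["D496-1", "621B-1"], "tissue": ["blood", "spleen"]}
required = ["donor_id", "tissue", *EXAMPLE_DEFAULTS]

missing = find_missing_columns(obs, required)
obs_filled = apply_defaults(obs, EXAMPLE_DEFAULTS, n_rows=2)

print(missing)
print(find_missing_columns(obs_filled, required))
```

Columns that already exist are left untouched, which is why tissue, organism, and assay still have to be populated by hand.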

curator = ln.curators.CellxGeneAnnDataCatManager(
    adata,
    defaults=ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults,
    schema_version="5.1.0",
)

validated = curator.validate()
validated

Remove unvalidated values¶

We remove all unvalidated genes. These genes may exist in a different release of Ensembl but are not valid for the Ensembl version used by CELLxGENE schema 5.1.0 (Ensembl release 110).

curator.non_validated

{'tissue': ['lungg'],
 'var_index': ['ENSG00000230699',
  'ENSG00000241180',
  'ENSG00000226849',
  'ENSG00000272482',
  'ENSG00000264443',
  'ENSG00000242396',
  'ENSG00000237352',
  'ENSG00000269933',
  'ENSG00000286863',
  'ENSG00000285808',
  'ENSG00000261737',
  'ENSG00000230427',
  'ENSG00000226822',
  'ENSG00000273373',
  'ENSG00000259834',
  'ENSG00000224167',
  'ENSG00000256374',
  'ENSG00000234283',
  'ENSG00000263464',
  'ENSG00000203812',
  'ENSG00000272196',
  'ENSG00000237975',
  'ENSG00000235736',
  'ENSG00000272880',
  'ENSG00000227925',
  'ENSG00000238042',
  'ENSG00000237845',
  'ENSG00000270188',
  'ENSG00000287116',
  'ENSG00000236856',
  'ENSG00000226277',
  'ENSG00000237133',
  'ENSG00000224739',
  'ENSG00000230525',
  'ENSG00000227902',
  'ENSG00000237327',
  'ENSG00000285155',
  'ENSG00000232411',
  'ENSG00000239467',
  'ENSG00000225205',
  'ENSG00000272551',
  'ENSG00000280374',
  'ENSG00000226747',
  'ENSG00000272519',
  'ENSG00000236886',
  'ENSG00000229352',
  'ENSG00000286601',
  'ENSG00000227021',
  'ENSG00000259855',
  'ENSG00000233143',
  'ENSG00000228135',
  'ENSG00000273301',
  'ENSG00000237940',
  'ENSG00000271870',
  'ENSG00000237838',
  'ENSG00000286996',
  'ENSG00000223797',
  'ENSG00000233509',
  'ENSG00000269028',
  'ENSG00000239462',
  'ENSG00000286699',
  'ENSG00000273370',
  'ENSG00000261490',
  'ENSG00000251679',
  'ENSG00000249988',
  'ENSG00000272567',
  'ENSG00000270394',
  'ENSG00000249381',
  'ENSG00000272370',
  'ENSG00000272354',
  'ENSG00000251044',
  'ENSG00000248371',
  'ENSG00000251613',
  'ENSG00000272040',
  'ENSG00000182230',
  'ENSG00000249684',
  'ENSG00000233937',
  'ENSG00000248103',
  'ENSG00000204092',
  'ENSG00000261068',
  'ENSG00000236740',
  'ENSG00000236996',
  'ENSG00000232295',
  'ENSG00000271734',
  'ENSG00000236673',
  'ENSG00000227220',
  'ENSG00000236166',
  'ENSG00000112096',
  'ENSG00000285162',
  'ENSG00000228434',
  'ENSG00000229881',
  'ENSG00000286228',
  'ENSG00000237513',
  'ENSG00000285106',
  'ENSG00000226380',
  'ENSG00000270672',
  'ENSG00000225932',
  'ENSG00000244693',
  'ENSG00000283504',
  'ENSG00000283648',
  'ENSG00000268955',
  'ENSG00000272267',
  'ENSG00000255495',
  'ENSG00000253381',
  'ENSG00000254143',
  'ENSG00000253878',
  'ENSG00000259820',
  'ENSG00000226403',
  'ENSG00000229611',
  'ENSG00000233776',
  'ENSG00000269900',
  'ENSG00000283886',
  'ENSG00000261534',
  'ENSG00000237548',
  'ENSG00000239665',
  'ENSG00000256892',
  'ENSG00000249860',
  'ENSG00000271409',
  'ENSG00000224745',
  'ENSG00000261438',
  'ENSG00000231575',
  'ENSG00000260461',
  'ENSG00000234134',
  'ENSG00000255823',
  'ENSG00000248671',
  'ENSG00000254740',
  'ENSG00000254561',
  'ENSG00000282080',
  'ENSG00000256427',
  'ENSG00000286911',
  'ENSG00000287577',
  'ENSG00000246331',
  'ENSG00000287388',
  'ENSG00000276814',
  'ENSG00000271259',
  'ENSG00000287622',
  'ENSG00000255945',
  'ENSG00000261650',
  'ENSG00000256542',
  'ENSG00000230641',
  'ENSG00000275294',
  'ENSG00000236094',
  'ENSG00000237585',
  'ENSG00000223458',
  'ENSG00000261666',
  'ENSG00000280710',
  'ENSG00000203441',
  'ENSG00000230156',
  'ENSG00000275216',
  'ENSG00000215271',
  'ENSG00000286931',
  'ENSG00000258414',
  'ENSG00000258808',
  'ENSG00000277050',
  'ENSG00000273888',
  'ENSG00000258777',
  'ENSG00000258301',
  'ENSG00000258861',
  'ENSG00000259444',
  'ENSG00000260780',
  'ENSG00000244952',
  'ENSG00000259730',
  'ENSG00000258631',
  'ENSG00000258831',
  'ENSG00000273923',
  'ENSG00000259664',
  'ENSG00000259582',
  'ENSG00000261720',
  'ENSG00000277010',
  'ENSG00000260182',
  'ENSG00000262668',
  'ENSG00000232196',
  'ENSG00000260060',
  'ENSG00000260141',
  'ENSG00000261439',
  'ENSG00000260923',
  'ENSG00000215067',
  'ENSG00000263316',
  'ENSG00000262089',
  'ENSG00000273388',
  'ENSG00000264067',
  'ENSG00000272736',
  'ENSG00000214970',
  'ENSG00000263388',
  'ENSG00000262292',
  'ENSG00000256618',
  'ENSG00000221995',
  'ENSG00000226377',
  'ENSG00000273576',
  'ENSG00000267637',
  'ENSG00000283517',
  'ENSG00000282965',
  'ENSG00000286603',
  'ENSG00000265717',
  'ENSG00000278107',
  'ENSG00000273733',
  'ENSG00000273837',
  'ENSG00000286949',
  'ENSG00000256222',
  'ENSG00000280095',
  'ENSG00000278927',
  'ENSG00000278955',
  'ENSG00000224247',
  'ENSG00000272948',
  'ENSG00000233213',
  'ENSG00000277352',
  'ENSG00000239446',
  'ENSG00000231566',
  'ENSG00000256045',
  'ENSG00000228906',
  'ENSG00000228139',
  'ENSG00000261773',
  'ENSG00000237563',
  'ENSG00000228890',
  'ENSG00000226362',
  'ENSG00000278198',
  'ENSG00000273496',
  'ENSG00000277666',
  'ENSG00000278782',
  'ENSG00000277761']}

adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
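
The filtering step above can be sketched without AnnData. The snippet below is a minimal illustration, assuming the gene indices are plain Python lists. `keep_mask` is a hypothetical helper, the two Ensembl IDs are taken from the non-validated output above, and the `GENE_*` entries are placeholders. The point is that one exclusion set is applied to both the main and the raw gene index so the two stay consistent.

```python
def keep_mask(var_index, non_validated):
    """Return one boolean per gene: True if the gene should be kept."""
    excluded = set(non_validated)  # set membership checks are fast
    return [gene not in excluded for gene in var_index]


def apply_mask(var_index, mask):
    """Keep only the genes whose mask entry is True."""
    return [gene for gene, keep in zip(var_index, mask) if keep]


non_validated = ["ENSG00000230699", "ENSG00000241180"]

# Placeholder indices: the raw layer often carries more genes than the main one.
var_index = ["ENSG00000230699", "GENE_A", "GENE_B"]
raw_var_index = ["ENSG00000230699", "ENSG00000241180", "GENE_A", "GENE_B", "GENE_C"]

# The same exclusion list is used for both indices.
filtered = apply_mask(var_index, keep_mask(var_index, non_validated))
filtered_raw = apply_mask(raw_var_index, keep_mask(raw_var_index, non_validated))

print(filtered)
print(filtered_raw)
```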

curator = ln.curators.CellxGeneAnnDataCatManager(adata, schema_version="5.1.0")

✗ Could not find source: ExperimentalFactor
    → consider running `bionty.core.sync_public_sources()`

✗ Could not find source: CellType
    → consider running `bionty.core.sync_public_sources()`

✗ Could not find source: DevelopmentalStage
    → consider running `bionty.core.sync_public_sources()`

✗ Could not find source: Disease
    → consider running `bionty.core.sync_public_sources()`

✗ Could not find source: Phenotype
    → consider running `bionty.core.sync_public_sources()`

✗ Could not find source: Tissue
    → consider running `bionty.core.sync_public_sources()`

✗ Could not find source: Gene
    → consider running `bionty.core.sync_public_sources()`

Register new metadata labels¶

Following the suggestions above, we register genes and labels that aren’t present in the current instance.

(Note that our instance is rather empty. Once you have filled up the registries, registering new labels will rarely be needed.)

Validation reports an error for the tissue label “lungg”, which is a typo for “lung”. Let’s fix it:

tissues = curator.lookup(public=True).tissue
tissues.lung

adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
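
The lookup above works when you already know the intended term. As a complementary sketch, and not something the curator is documented to do, Python's standard `difflib` module can suggest a likely correction for a non-validated label. The list of valid tissue names below is a small hand-picked sample from this guide, and `suggest_label` is a hypothetical helper.

```python
from difflib import get_close_matches


def suggest_label(value, valid_labels, cutoff=0.8):
    """Return the closest valid label, or None if nothing is similar enough."""
    matches = get_close_matches(value, valid_labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A hand-picked sample of tissue names, not the full ontology.
valid_tissues = ["blood", "lung", "spleen", "thoracic lymph node"]

suggestion = suggest_label("lungg", valid_tissues)
no_suggestion = suggest_label("xyz", valid_tissues)

print(suggestion)
print(no_suggestion)
```

A fairly high cutoff keeps the helper conservative: it is better to return nothing than to silently map a label to the wrong ontology term, so any suggestion should still be reviewed before renaming categories.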

Let’s validate the object again:

validated = curator.validate()
validated

adata.obs.head()

	donor_id	tissue	cell_type	assay	sex_ontology_term_id	organism	sex	development_stage	disease	self_reported_ethnicity	suspension_type	tissue_type
CZINY-0109_CTGGTCTAGTCTGTAC	D496-1	blood	classical monocyte	10x 3' v3	PATO:0000384	human	unknown	unknown	normal	unknown	cell	tissue
CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT	621B-1	thoracic lymph node	T follicular helper cell	10x 5' v2	PATO:0000384	human	unknown	unknown	normal	unknown	cell	tissue
Pan_T7935491_CTGGTCTGTACATGTC	A29-1	spleen	memory B cell	10x 5' v1	PATO:0000384	human	unknown	unknown	normal	unknown	cell	tissue
Pan_T7980367_GGGCATCCAGGTGGAT	A36-1	lung	alveolar macrophage	10x 5' v1	PATO:0000384	human	unknown	unknown	normal	unknown	cell	tissue
Pan_T7935494_ATCATGGTCTACCTGC	A29-1	mesenteric lymph node	naive thymus-derived CD4-positive, alpha-beta ...	10x 5' v1	PATO:0000384	human	unknown	unknown	normal	unknown	cell	tissue

Save artifact¶

artifact = curator.save_artifact(
    key=f"my_datasets/dataset-curated-against-cxg-{curator.schema_version}.h5ad"
)

artifact.describe()

Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: 6NwnoewTLbs9NUbB0000          hash: t1Fw4r_WZzCxCdPtca_XXh
│   ├── size: 52.1 MB                      n_observations: 1626
│   ├── space: all                         branch: main
│   ├── created_at: 2025-07-14 06:41:35    created_by: testuser1 (Test User1)
│   ├── key: my_datasets/dataset-curated-against-cxg-5.1.0.h5ad
│   └── storage location / path: 
│       /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/6NwnoewTLbs9NUbB0000.
│       h5ad
├── Dataset features
│   ├── var • 36283                     [bionty.Gene]                                                              
│   │   MIR1302-2HG                     float                                                                      
│   │   FAM138A                         float                                                                      
│   │   OR4F5                           float                                                                      
│   │   OR4F29                          float                                                                      
│   │   OR4F16                          float                                                                      
│   │   LINC01409                       float                                                                      
│   │   FAM87B                          float                                                                      
│   │   LINC01128                       float                                                                      
│   │   LINC00115                       float                                                                      
│   │   FAM41C                          float                                                                      
│   └── obs • 10                        [Feature]                                                                  
│       assay                           cat[bionty.ExperimentalFactor]     10x 3' v3, 10x 5' v1, 10x 5' v2         
│       cell_type                       cat[bionty.CellType]               CD16-negative, CD56-bright natural kill…
│       development_stage               cat[bionty.DevelopmentalStage]     unknown                                 
│       disease                         cat[bionty.Disease]                normal                                  
│       organism                        cat[bionty.Organism]               human                                   
│       self_reported_ethnicity         cat[bionty.Ethnicity]              unknown                                 
│       sex_ontology_term_id            cat[bionty.Phenotype]              male                                    
│       suspension_type                 cat[ULabel]                        cell                                    
│       tissue                          cat[bionty.Tissue]                 blood, bone marrow, caecum, duodenum, i…
│       tissue_type                     cat[ULabel]                        tissue                                  
└── Labels
    └── .organisms                      bionty.Organism                    human                                   
        .tissues                        bionty.Tissue                      blood, thoracic lymph node, spleen, mes…
        .cell_types                     bionty.CellType                    classical monocyte, T follicular helper…
        .diseases                       bionty.Disease                     normal                                  
        .phenotypes                     bionty.Phenotype                   male                                    
        .experimental_factors           bionty.ExperimentalFactor          10x 3' v3, 10x 5' v2, 10x 5' v1         
        .developmental_stages           bionty.DevelopmentalStage          unknown                                 
        .ethnicities                    bionty.Ethnicity                   unknown                                 
        .ulabels                        ULabel                             tissue, cell

Return an input h5ad file for cellxgene-schema¶

title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg

adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")

!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1

Note

The curator is designed to validate all metadata for adherence to ontologies. It does not reimplement every rule of the CELLxGENE schema, so we recommend running the cellxgene-schema validator if full adherence beyond metadata is required.
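
If you prefer to run the validator from Python rather than through a notebook shell escape, a thin wrapper around `subprocess` is one option. This is a minimal sketch that assumes `uvx` is on your `PATH`, as in the commands used in this guide, and that the CLI exits with a non-zero status when validation fails, as the `|| exit 1` idiom implies. `build_validate_command` and `run_validator` are hypothetical helper names.

```python
import os
import subprocess


def build_validate_command(h5ad_path):
    """Build the same command line this guide runs in the shell."""
    return ["uvx", "cellxgene-schema", "validate", str(h5ad_path)]


def run_validator(h5ad_path):
    """Run the validator and return True if it exited with status 0."""
    # MPLBACKEND=agg mirrors the environment variable set in the shell commands.
    env = {**os.environ, "MPLBACKEND": "agg"}
    result = subprocess.run(build_validate_command(h5ad_path), env=env, check=False)
    return result.returncode == 0


# Only build and show the command here; call run_validator(...) to execute it.
command = build_validate_command("anndata_human_immune_cells_cxg.h5ad")
print(" ".join(command))
```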
