Protein¶

lamindb provides access to the following public Protein ontologies through bionty:

UniProt

Here we show how to access and search Protein ontologies to standardize new data.

import bionty as bt
import pandas as pd

PublicOntology objects¶

Let us create a public ontology accessor with the .public() method, which chooses a default public ontology source from Source. The result is a PublicOntology object, which you can think of as a public registry:

proteins = bt.Protein.public(organism="human")
proteins

→ connected lamindb: testuser1/test-public-ontologies

PublicOntology
Entity: Protein
Organism: human
Source: uniprot, 2024-03
#terms: 204088

As for registries, you can export the ontology as a DataFrame:

df = proteins.df()
df.head()

	uniprotkb_id	name	description	length	synonyms	gene_symbol	ensembl_gene_ids
0	A0A023HJ61	Ras-related protein Rab-4A		121		RAB4A	None
1	A0A023HN28	SRSF3	USP6 fusion protein	16		None	None
2	A0A023I7F4	Cytochrome b		380		CYTB	None
3	A0A023I7H2	NADH-ubiquinone oxidoreductase chain 5		603	EC 7.1.1.2	ND5	None
4	A0A023I7H5	ATP synthase subunit a		226		ATP6	None

Unlike registries, you can also export it as a Pronto object via public.ontology.

Look up terms¶

As for registries, terms can be looked up with auto-complete:

lookup = proteins.lookup()

The . accessor exposes normalized terms as attributes: keys are lower case and contain only alphanumeric characters and underscores:

lookup.ac3

Protein(uniprotkb_id='Q8IX81', name='AC3', description='', length=502, synonyms='', gene_symbol=None, ensembl_gene_ids=None)
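
To make the normalization concrete, here is a minimal sketch of how a term could be turned into a lookup key. The helper name normalize_key and the exact rule (lower-casing, then collapsing every run of other characters into one underscore) are assumptions for illustration; the rules lamindb applies internally may differ in detail.

```python
import re


def normalize_key(term: str) -> str:
    """Hypothetical helper: turn a term into an attribute-friendly key."""
    # Lower-case the term, then replace each run of characters that are
    # not digits or ASCII letters with a single underscore.
    key = re.sub(r"[^0-9a-z]+", "_", term.lower())
    # Drop underscores left over at either end.
    return key.strip("_")


print(normalize_key("AC3"))
print(normalize_key("Ras-related protein Rab-4A"))
```

Under this assumed rule, "AC3" maps to the key used in lookup.ac3 above, while a name containing hyphens and spaces maps to a single underscore-separated key.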

To look up the exact original strings, convert the lookup object to dict and use the [] accessor:

lookup_dict = lookup.dict()
lookup_dict["AC3"]

Protein(uniprotkb_id='Q8IX81', name='AC3', description='', length=502, synonyms='', gene_symbol=None, ensembl_gene_ids=None)

By default, the name field is used to generate lookup keys. You can specify another field to look up:

lookup = proteins.lookup(proteins.gene_symbol)

lookup.rab4a

! 3 records found for 'rab4a'. Returning based on keep='first'.

Protein(uniprotkb_id='A0A023HJ61', name='Ras-related protein Rab-4A', description='', length=121, synonyms='', gene_symbol='RAB4A', ensembl_gene_ids=None)
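
The warning above appears because several records share one gene symbol, so the key is ambiguous and only the first record is returned. The following sketch mimics that keep='first' behavior with a plain dictionary. It is not the lamindb implementation; build_lookup is a hypothetical helper, and the three records are copied from the RABL2B search results shown later on this page.

```python
# Three records that share a gene symbol (values taken from the
# RABL2B search output further down this page).
records = [
    {"uniprotkb_id": "A0A8I5KRY9", "gene_symbol": "RABL2B", "length": 58},
    {"uniprotkb_id": "A0A8I5KX29", "gene_symbol": "RABL2B", "length": 51},
    {"uniprotkb_id": "A8MXF6", "gene_symbol": "RABL2B", "length": 165},
]


def build_lookup(records, field):
    """Hypothetical helper: map lower-cased field values to records."""
    lookup = {}
    for record in records:
        key = record[field].lower()
        # setdefault only stores the record if the key is new,
        # so the first record seen for a key wins.
        lookup.setdefault(key, record)
    return lookup


lookup_by_symbol = build_lookup(records, "gene_symbol")
print(lookup_by_symbol["rabl2b"]["uniprotkb_id"])
```

If you need every record for an ambiguous key, filtering the exported DataFrame on the field is usually a better fit than a lookup object.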

Search terms¶

Search behaves in the same way as it does for registries:

proteins.search("RAS").head(3)

	uniprotkb_id	name	description	length	synonyms	gene_symbol	ensembl_gene_ids
189295	Q96PV0	Ras	Rap GTPase-activating protein SynGAP	1343	Neuronal RasGAP\|Synaptic Ras GTPase-activating...	SYNGAP1	ENST00000395071.6 [Q96PV0-4];ENST00000414753.6...
16769	A0A140T8W4	Ras	Rap GTPase-activating protein SynGAP	487		SYNGAP1	ENST00000355818.3;
158749	P20936	Ras GTPase-activating protein 1		1047	GAP\|GTPase-activating protein\|RasGAP\|Ras p21 p...	RASA1	ENST00000274376.11 [P20936-1];ENST00000456692....

By default, search also covers synonyms and all other fields containing strings:

proteins.search("member of RAS oncogene family like 2B").head(3)

	uniprotkb_id	name	description	length	gene_symbol	ensembl_gene_ids
71092	A0A8I5KRY9	RAB	member of RAS oncogene family like 2B	58	RABL2B	ENST00000685352.1;
71734	A0A8I5KX29	RAB	member of RAS oncogene family like 2B	51	RABL2B	ENST00000690024.1;
81780	A8MXF6	RAB	member of RAS oncogene family like 2B	165	RABL2B	ENST00000395591.5;

To restrict the search to a specific field, pass it as the field argument (by default, all fields containing strings are searched):

proteins.search(
    "RABL2B",
    field=proteins.gene_symbol,
).head()

	uniprotkb_id	name	description	length	gene_symbol	ensembl_gene_ids
71092	A0A8I5KRY9	RAB	member of RAS oncogene family like 2B	58	RABL2B	ENST00000685352.1;
71734	A0A8I5KX29	RAB	member of RAS oncogene family like 2B	51	RABL2B	ENST00000690024.1;
81780	A8MXF6	RAB	member of RAS oncogene family like 2B	165	RABL2B	ENST00000395591.5;
101750	C9JFZ0	RAB	member of RAS oncogene family like 2B	20	RABL2B	ENST00000413505.1;
119224	F2Z2T3	RAB	member of RAS oncogene family like 2B	99	RABL2B	ENST00000395590.5;
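
The difference between searching all string fields and searching one field can be sketched with a small in-memory example. This is a simplified case-insensitive substring match, not the ranking that proteins.search performs; the function below is a hypothetical stand-in, and the records are copied from the tables on this page.

```python
# A few records copied from the outputs shown on this page.
records = [
    {"uniprotkb_id": "A0A023HJ61", "name": "Ras-related protein Rab-4A",
     "description": "", "synonyms": "", "gene_symbol": "RAB4A"},
    {"uniprotkb_id": "A0A023I7H2", "name": "NADH-ubiquinone oxidoreductase chain 5",
     "description": "", "synonyms": "EC 7.1.1.2", "gene_symbol": "ND5"},
    {"uniprotkb_id": "A8MXF6", "name": "RAB",
     "description": "member of RAS oncogene family like 2B", "gene_symbol": "RABL2B"},
]


def search(records, query, field=None):
    """Hypothetical stand-in: case-insensitive substring match."""
    needle = query.lower()
    hits = []
    for record in records:
        # Search one field if requested, otherwise every string-valued field.
        fields = [field] if field else [k for k, v in record.items() if isinstance(v, str)]
        if any(needle in record.get(f, "").lower() for f in fields):
            hits.append(record)
    return hits


# Matches in name and in description are both found.
print([r["uniprotkb_id"] for r in search(records, "ras")])
# A match that exists only in the synonyms field.
print([r["uniprotkb_id"] for r in search(records, "EC 7.1.1.2")])
# Restricting the search to the gene symbol.
print([r["uniprotkb_id"] for r in search(records, "RABL2B", field="gene_symbol")])
```

Restricting the field avoids hits that only match in a description or synonym, which matters when a query string such as a gene symbol also appears inside longer free-text values.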

Standardize Protein identifiers¶

Let us generate a DataFrame whose index stores a number of Protein identifiers, some of which are corrupted:

df_orig = pd.DataFrame(
    index=[
        "A0A024QZ08",
        "X6RLV5",
        "X6RM24",
        "A0A024QZQ1",
        "This protein does not exist",
    ]
)
df_orig


A0A024QZ08
X6RLV5
X6RM24
A0A024QZQ1
This protein does not exist

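We can check whether our values validate against a field of the ontology reference. Here we validate against the name field; because the index holds identifiers rather than protein names, none of the values validate: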

validated = proteins.validate(df_orig.index, proteins.name)
df_orig.index[~validated]

! 5 unique terms (100.00%) are not validated: 'A0A024QZ08', 'X6RLV5', 'X6RM24', 'A0A024QZQ1', 'This protein does not exist'

Index(['A0A024QZ08', 'X6RLV5', 'X6RM24', 'A0A024QZQ1',
       'This protein does not exist'],
      dtype='object')
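
Validation compares values against exactly one field, so the choice of field decides the outcome. The sketch below illustrates this with a toy reference built from the first rows of the exported DataFrame above; validate here is a hypothetical plain-Python function, not the PublicOntology method. With the real object, passing proteins.uniprotkb_id instead of proteins.name would be the analogous choice, though whether the accessions in df_orig exist in this release is not shown on this page.

```python
# Toy reference: three rows taken from the df.head() output above.
reference = [
    {"uniprotkb_id": "A0A023HJ61", "name": "Ras-related protein Rab-4A"},
    {"uniprotkb_id": "A0A023I7F4", "name": "Cytochrome b"},
    {"uniprotkb_id": "A0A023I7H5", "name": "ATP synthase subunit a"},
]


def validate(values, reference, field):
    """Hypothetical helper: True for each value found in `field`."""
    allowed = {record[field] for record in reference}
    return [value in allowed for value in values]


values = ["A0A023HJ61", "A0A023I7F4", "This protein does not exist"]

# Accessions compared against names: nothing validates.
by_name = validate(values, reference, "name")
# Accessions compared against accessions: only the corrupted entry fails.
by_id = validate(values, reference, "uniprotkb_id")

# Same idea as df_orig.index[~validated]: keep what did not validate.
not_validated = [v for v, ok in zip(values, by_id) if not ok]
print(by_name)
print(by_id)
print(not_validated)
```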

Ontology source versions¶

For any given entity, we can choose from a number of versions:

bt.Source.filter(entity="bionty.Protein").df()

	uid	entity	organism	name	in_db	currently_used	description	url	md5	source_website	space_id	dataframe_artifact_id	version	run_id	created_at	created_by_id	_aux	branch_id
id
10	3EYyGRYN	bionty.Protein	human	uniprot	False	True	Uniprot	s3://bionty-assets/df_human__uniprot__2024-03_...	None	https://www.uniprot.org	1	None	2024-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1
11	01RWXN2V	bionty.Protein	mouse	uniprot	False	True	Uniprot	s3://bionty-assets/df_mouse__uniprot__2024-03_...	None	https://www.uniprot.org	1	None	2024-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1

# only lists the sources that are currently used
bt.Source.filter(entity="bionty.Protein", currently_used=True).df()

	uid	entity	organism	name	in_db	currently_used	description	url	md5	source_website	space_id	dataframe_artifact_id	version	run_id	created_at	created_by_id	_aux	branch_id
id
10	3EYyGRYN	bionty.Protein	human	uniprot	False	True	Uniprot	s3://bionty-assets/df_human__uniprot__2024-03_...	None	https://www.uniprot.org	1	None	2024-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1
11	01RWXN2V	bionty.Protein	mouse	uniprot	False	True	Uniprot	s3://bionty-assets/df_mouse__uniprot__2024-03_...	None	https://www.uniprot.org	1	None	2024-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1

When creating a PublicOntology object, we can pass a specific source, which also fixes the version:

source = bt.Source.filter(
    name="uniprot", organism="human"
).first()
proteins = bt.Protein.public(source=source)
proteins

PublicOntology
Entity: Protein
Organism: human
Source: uniprot, 2024-03
#terms: 204088
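
The filter-then-first pattern used above can be pictured as selecting rows from the source table and taking the first match. The sketch below is a plain-Python analogy under stated assumptions: filter_sources and first are hypothetical helpers, not part of bionty, and the two rows are abbreviated copies of the Protein sources listed above.

```python
# Abbreviated copies of the two Protein sources listed above.
sources = [
    {"uid": "3EYyGRYN", "entity": "bionty.Protein", "organism": "human",
     "name": "uniprot", "version": "2024-03", "currently_used": True},
    {"uid": "01RWXN2V", "entity": "bionty.Protein", "organism": "mouse",
     "name": "uniprot", "version": "2024-03", "currently_used": True},
]


def filter_sources(sources, **conditions):
    """Hypothetical helper: keep rows matching every keyword condition."""
    return [s for s in sources if all(s.get(k) == v for k, v in conditions.items())]


def first(rows):
    """Hypothetical helper: first row, or None when nothing matched."""
    return rows[0] if rows else None


source = first(filter_sources(sources, name="uniprot", organism="human"))
print(source["uid"], source["version"])

# A filter that matches nothing yields None, so check before using it.
missing = first(filter_sources(sources, name="uniprot", organism="zebrafish"))
print(missing)
```

In this analogy an unmatched filter gives None rather than raising an error, so it is worth checking the result before passing it on as a source.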

The currently used ontologies can be displayed using:

bt.Source.filter(currently_used=True).df()

	uid	entity	organism	name	in_db	currently_used	description	url	md5	source_website	space_id	dataframe_artifact_id	version	run_id	created_at	created_by_id	_aux	branch_id
id
1	33TUF039	bionty.Organism	vertebrates	ensembl	False	True	Ensembl	https://ftp.ensembl.org/pub/release-112/specie...	None	https://www.ensembl.org	1	None	release-112	None	2025-07-14 06:41:44.843000+00:00	1	None	1
2	6bbVUTCS	bionty.Organism	bacteria	ensembl	False	True	Ensembl	https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacte...	None	https://www.ensembl.org	1	None	release-57	None	2025-07-14 06:41:44.843000+00:00	1	None	1
3	6s9nV6xh	bionty.Organism	fungi	ensembl	False	True	Ensembl	https://ftp.ensemblgenomes.ebi.ac.uk/pub/fungi...	None	https://www.ensembl.org	1	None	release-57	None	2025-07-14 06:41:44.843000+00:00	1	None	1
4	2PmTrc8x	bionty.Organism	metazoa	ensembl	False	True	Ensembl	https://ftp.ensemblgenomes.ebi.ac.uk/pub/metaz...	None	https://www.ensembl.org	1	None	release-57	None	2025-07-14 06:41:44.843000+00:00	1	None	1
5	7GPHh16S	bionty.Organism	plants	ensembl	False	True	Ensembl	https://ftp.ensemblgenomes.ebi.ac.uk/pub/plant...	None	https://www.ensembl.org	1	None	release-57	None	2025-07-14 06:41:44.843000+00:00	1	None	1
6	4tsksCMX	bionty.Organism	all	ncbitaxon	False	True	NCBItaxon Ontology	http://purl.obolibrary.org/obo/ncbitaxon/2023-...	None	https://github.com/obophenotype/ncbitaxon	1	None	2023-06-20	None	2025-07-14 06:41:44.843000+00:00	1	None	1
7	4UGNz3fr	bionty.Gene	human	ensembl	False	True	Ensembl	s3://bionty-assets/df_human__ensembl__release-...	None	https://www.ensembl.org	1	None	release-112	None	2025-07-14 06:41:44.843000+00:00	1	None	1
8	4r4fvV0S	bionty.Gene	mouse	ensembl	False	True	Ensembl	s3://bionty-assets/df_mouse__ensembl__release-...	None	https://www.ensembl.org	1	None	release-112	None	2025-07-14 06:41:44.843000+00:00	1	None	1
9	4RPA3Re0	bionty.Gene	saccharomyces cerevisiae	ensembl	False	True	Ensembl	s3://bionty-assets/df_saccharomyces cerevisiae...	None	https://www.ensembl.org	1	None	release-112	None	2025-07-14 06:41:44.843000+00:00	1	None	1
10	3EYyGRYN	bionty.Protein	human	uniprot	False	True	Uniprot	s3://bionty-assets/df_human__uniprot__2024-03_...	None	https://www.uniprot.org	1	None	2024-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1
11	01RWXN2V	bionty.Protein	mouse	uniprot	False	True	Uniprot	s3://bionty-assets/df_mouse__uniprot__2024-03_...	None	https://www.uniprot.org	1	None	2024-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1
12	3kDh8qAX	bionty.CellMarker	human	cellmarker	False	True	CellMarker	s3://bionty-assets/human_cellmarker_2.0_CellMa...	None	http://bio-bigdata.hrbmu.edu.cn/CellMarker	1	None	2.0	None	2025-07-14 06:41:44.843000+00:00	1	None	1
13	7bV5uJo3	bionty.CellMarker	mouse	cellmarker	False	True	CellMarker	s3://bionty-assets/mouse_cellmarker_2.0_CellMa...	None	http://bio-bigdata.hrbmu.edu.cn/CellMarker	1	None	2.0	None	2025-07-14 06:41:44.843000+00:00	1	None	1
14	6LyRtvz8	bionty.CellLine	all	clo	False	True	Cell Line Ontology	s3://bionty-assets/df_all__clo__2022-03-21__Ce...	None	https://bioportal.bioontology.org/ontologies/CLO	1	None	2022-03-21	None	2025-07-14 06:41:44.843000+00:00	1	None	1
16	3Uw2Va7a	bionty.CellType	all	cl	False	True	Cell Ontology	http://purl.obolibrary.org/obo/cl/releases/202...	None	https://obophenotype.github.io/cell-ontology	1	None	2024-08-16	None	2025-07-14 06:41:44.843000+00:00	1	None	1
17	MUtAGdL4	bionty.Tissue	all	uberon	False	True	Uberon multi-species anatomy ontology	http://purl.obolibrary.org/obo/uberon/releases...	None	http://obophenotype.github.io/uberon	1	None	2024-08-07	None	2025-07-14 06:41:44.843000+00:00	1	None	1
18	IGIkseWQ	bionty.Disease	all	mondo	False	True	Mondo Disease Ontology	http://purl.obolibrary.org/obo/mondo/releases/...	None	https://mondo.monarchinitiative.org	1	None	2025-06-03	None	2025-07-14 06:41:44.843000+00:00	1	None	1
19	4kswnHVF	bionty.Disease	human	doid	False	True	Human Disease Ontology	http://purl.obolibrary.org/obo/doid/releases/2...	None	https://disease-ontology.org	1	None	2024-05-29	None	2025-07-14 06:41:44.843000+00:00	1	None	1
21	2a1HvjdB	bionty.ExperimentalFactor	all	efo	False	True	The Experimental Factor Ontology	http://www.ebi.ac.uk/efo/releases/v3.70.0/efo.owl	None	https://bioportal.bioontology.org/ontologies/EFO	1	None	3.70.0	None	2025-07-14 06:41:44.843000+00:00	1	None	1
22	6S4qkDx1	bionty.Phenotype	all	pato	False	True	Phenotype And Trait Ontology	http://purl.obolibrary.org/obo/pato/releases/2...	None	https://github.com/pato-ontology/pato	1	None	2024-03-28	None	2025-07-14 06:41:44.843000+00:00	1	None	1
23	48fBFLmn	bionty.Phenotype	human	hp	False	True	Human Phenotype Ontology	https://github.com/obophenotype/human-phenotyp...	None	https://hpo.jax.org	1	None	2024-04-26	None	2025-07-14 06:41:44.843000+00:00	1	None	1
25	7Ent3V2y	bionty.Pathway	all	go	False	True	Gene Ontology	http://purl.obolibrary.org/obo/go/releases/202...	None	http://geneontology.org	1	None	2024-06-17	None	2025-07-14 06:41:44.843000+00:00	1	None	1
27	3rm9aOzL	BFXPipeline	all	lamin	False	True	Bioinformatics Pipeline	s3://bionty-assets/df_all__lamin__1.0.0__BFXpi...	None	https://lamin.ai	1	None	1.0.0	None	2025-07-14 06:41:44.843000+00:00	1	None	1
28	ugaIoIlj	Drug	all	dron	False	True	Drug Ontology	http://purl.obolibrary.org/obo/dron/releases/2...	None	https://bioportal.bioontology.org/ontologies/DRON	1	None	2024-08-05	None	2025-07-14 06:41:44.843000+00:00	1	None	1
30	1GbFkOdz	bionty.DevelopmentalStage	human	hsapdv	False	True	Human Developmental Stages	https://github.com/obophenotype/developmental-...	None	https://github.com/obophenotype/developmental-...	1	None	2024-05-28	None	2025-07-14 06:41:44.843000+00:00	1	None	1
31	10va5JSt	bionty.DevelopmentalStage	mouse	mmusdv	False	True	Mouse Developmental Stages	https://github.com/obophenotype/developmental-...	None	https://github.com/obophenotype/developmental-...	1	None	2024-05-28	None	2025-07-14 06:41:44.843000+00:00	1	None	1
32	MJRqduf9	bionty.Ethnicity	human	hancestro	False	True	Human Ancestry Ontology	http://purl.obolibrary.org/obo/hancestro/relea...	None	https://github.com/EBISPOT/hancestro	1	None	3.0	None	2025-07-14 06:41:44.843000+00:00	1	None	1
33	5JnVODh4	BioSample	all	ncbi	False	True	NCBI BioSample attributes	s3://bionty-assets/df_all__ncbi__2023-09__BioS...	None	https://www.ncbi.nlm.nih.gov/biosample/docs/at...	1	None	2023-09	None	2025-07-14 06:41:44.843000+00:00	1	None	1