You might recall, around the year 2000, a Grand Breakthrough in biology: the complete sequencing of human DNA. People hoped that proteins would be next. But now, about 20 years later, we're not at that level of understanding when it comes to PROTEINS rather than DNA... Why? Because proteins are devilishly complex.
If you're managing a database (relational or semantic) featuring proteins, you might think that the entities (records) of your database are simply "proteins", and that you're just going to need a number of fields (attributes) to describe those records... Right? Wrong!
A little basic biology brush-up for starters... Recall that DNA is a sequence of 4 "letters" (nucleotides.) Proteins are sequences of 20 "letters" (amino acids.) So, why the immense complexity of proteins?
There are 20^200 possible amino-acid sequences for a 200-residue protein, of which the natural evolutionary process has sampled only an infinitesimal subset. (Nature article)
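To get a feel for that number, here is a minimal sketch of the arithmetic in Python. The helper name `count_sequences` is made up for this illustration, and the DNA comparison (4 letters over the same length) is only there to give a sense of scale.

```python
import math


def count_sequences(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of the given length over an alphabet."""
    return alphabet_size ** length


# 20 amino acids, 200 residues (the figures quoted above)
protein_space = count_sequences(20, 200)

# For scale only: 4 nucleotides over the same length
dna_space = count_sequences(4, 200)

# Python integers are arbitrary-precision, so the counts are exact
protein_digits = len(str(protein_space))
dna_digits = len(str(dna_space))

print(f"20^200 has {protein_digits} decimal digits "
      f"(about 10^{200 * math.log10(20):.1f})")
print(f"4^200 has {dna_digits} decimal digits "
      f"(about 10^{200 * math.log10(4):.1f})")
```

Even at the same length, the 20-letter alphabet gives a space that is larger by roughly 140 orders of magnitude.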
For starters, DNA serves largely one role (the "coding"); by contrast, proteins serve a multitude of roles (such as structural, enzymatic, etc.)
DNA has a relatively regular structure (if we ignore "epigenetic modifications"); proteins have a gigantic variety of 3D shapes, and being modified is "the name of the game" for proteins
You probably know that DNA codes for proteins... but it's nowhere near that simple: the same DNA stretch can give rise to a variety of different proteins (a process referred to as "alternative splicing")
DNA is always in one place (in the cell nucleus in complex organisms); by contrast, proteins occur in a variety of locations, including many different "compartments" within the cell, and outside cells as well.
All cells (not counting red blood cells, which lack a nucleus) have essentially the same DNA. By contrast, proteins' presence varies by tissue type, and cell type.
The amount of DNA in a cell is fixed, except during cell replication. By contrast, protein concentrations constantly vary as reactions occur, as the organism ages, etc.
I won't keep going. But suffice it to say that any data modeling for proteins is going to be very complex. If you're used to pharmacological modeling of entities such as drug names, drug dosages, contraindications, doctor visits... well, those relatively intuitive entities now give way to a rather complex zoo of entities such as "GenomeEncodedEntity", "EntitySet", "ReactionlikeEvent".
The identity of cells and tissues therefore seems to be determined primarily by the abundance at which they express their constituent proteins, and perhaps by the manner in which the proteins are organized in the proteome, rather than the presence or absence of certain proteins. (Source)
The Reactome project
The well-curated biological dataset from the Reactome project focuses heavily on proteins:
In addition to phosphorylation and ubiquitination, proteins can be subjected to (among others) methylation, acetylation, glycosylation, oxidation and nitrosylation. Some proteins undergo all these modifications, often in time-dependent combinations. In Reactome, the un-modified and modified forms of a protein are distinct physical entities.
A glance at the data model will bring home some of this new complexity.
The schema further highlights the complex entities one is dealing with (if your browser window is large enough, it's shown on the left side bar in a gray box.)
A macromolecule’s function may depend on whether the molecule is free or complexed with specific other molecules. Reactome treats complexes as physical entities distinct from their components.
In Reactome, a molecule in one compartment is distinct from that molecule in another compartment. Thus, e.g., extracellular and cytosolic glucose are different Reactome entities.
Reactome uses data entities it calls ReferenceEntity to capture the invariant features of a molecule.
And it uses entities it calls PhysicalEntity to express the COMBINATION of a ReferenceEntity attribute (e.g., the "generic" Glycogen phosphorylase UniProt:P06737) PLUS attributes giving SPECIFIC conditional information (e.g., localization to the cytosol and phosphorylation on a serine residue.)
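To make the distinction concrete, here is a rough sketch in Python of how such a model could be laid out. The class names ReferenceEntity, PhysicalEntity and Complex are borrowed from Reactome, but every field name below (database, identifier, compartment, modifications, components) is an assumption made for this illustration, not Reactome's actual schema. The glucose identifier is a placeholder, and the complex is hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ReferenceEntity:
    """Invariant features of a molecule (field names are assumptions)."""
    database: str
    identifier: str
    name: str


@dataclass(frozen=True)
class PhysicalEntity:
    """A ReferenceEntity plus conditional information (assumed fields)."""
    reference: ReferenceEntity
    compartment: str
    modifications: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Complex:
    """A complex is its own entity, distinct from its components."""
    components: Tuple[PhysicalEntity, ...]
    compartment: str


# The "generic" protein mentioned in the text
pygl = ReferenceEntity("UniProt", "P06737", "Glycogen phosphorylase")

# Un-modified and modified forms are different physical entities
pygl_plain = PhysicalEntity(pygl, "cytosol")
pygl_phospho = PhysicalEntity(
    pygl, "cytosol", frozenset({"phosphorylation on serine"})
)

# Same molecule, different compartments: also different entities.
# The identifier here is a placeholder, not a real accession.
glucose = ReferenceEntity("example-db", "placeholder-id", "glucose")
glucose_cytosolic = PhysicalEntity(glucose, "cytosol")
glucose_extracellular = PhysicalEntity(glucose, "extracellular region")

# Hypothetical complex, for illustration only
example_complex = Complex((pygl_phospho, glucose_cytosolic), "cytosol")

print(pygl_plain == pygl_phospho)                      # False
print(glucose_cytosolic == glucose_extracellular)      # False
print(pygl_plain.reference == pygl_phospho.reference)  # True
```

The frozen dataclasses compare by value, so two physical entities are equal only if the reference, the compartment, and the set of modifications all match. That mirrors the idea that the "same" protein in a different state or location is a separate record.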