UPDATED Feb. 2024 - Graph databases have an easygoing laissez-faire attitude: "express yourself (almost) however you want"...
By contrast, relational databases come across like a micro-manager: "my way or the highway"...
Is there a way to take the best of both worlds and distance oneself from their respective excesses, as best suited for one's needs? A way to marry the flexibility of Graph Databases and the discipline of Relational Databases?
Let's Get Concrete
Consider a simple scenario with scientific data – entities such as Sample, Experiment, Study and Run Result – where Samples are used in Experiments, and where Experiments are part of Studies and produce Run Results.
That’s all very easy and intuitive to represent and store in a Labeled Graph Database such as Neo4j. For example, a rough draft might go like this:
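As a conceptual sketch (plain Python for illustration only – not actual Neo4j code; the node names and attributes are invented for this scenario), the rough draft might look like this:

```python
# Conceptual sketch of the labeled graph, in plain Python.
# In Neo4j these would be actual nodes (with labels) and relationships.

# Each node gets a label (its Class) plus some illustrative attributes
nodes = {
    "s1":  {"label": "Sample",     "name": "Sample A"},
    "e1":  {"label": "Experiment", "name": "Experiment 1"},
    "st1": {"label": "Study",      "title": "Pilot Study"},
    "r1":  {"label": "Run Result", "outcome": "success"},
}

# Relationships, as (from_node, relationship_name, to_node) triples
relationships = [
    ("s1", "used_in",  "e1"),    # a Sample is used in an Experiment
    ("e1", "part_of",  "st1"),   # an Experiment is part of a Study
    ("e1", "produces", "r1"),    # an Experiment produces a Run Result
]

for frm, rel, to in relationships:
    print(f'{nodes[frm]["label"]} -[{rel}]-> {nodes[to]["label"]}')
```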
The “labels” (black tags) represent the Class of the data – akin to the table names of relational databases, or Classes in RDF (Semantic Web technology).
Want to learn more about graph databases? See part 1 of this series: Graph Databases (Neo4j) - a revolution in modeling the real world!
Freedom but Anarchy
Everything so far is easy, intuitive and very flexible… BUT a major shortcoming is that it’s very “anarchic”: nothing is enforced. The data might include, or omit, any attributes (fields) or any relationships – possibly meaningless, excessive or missing! For example, we might have an aberrant link indicating that a Study “eats” a Sample!
Also, the database isn't storing any self-documenting data: all one can do is examine some data nodes (for example, pick a "Sample" node at random) and presume/hope that they're representative of all the data!
Please note that self-documenting data isn't just about creating an alternative to a user manual! It can also contain information for the User Interface, to avoid having to hard-wire into the UI a whole lot of knowledge, such as the fact that some field X can only take some specific values.
A possible remedy to the above shortcoming is to use the graph database itself to define the schema – and use a software library to help maintain it and optionally enforce it.
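A minimal sketch of that idea, in plain Python (hypothetical and greatly simplified – NeoSchema's actual design, discussed below, stores the schema in the graph itself):

```python
# Hypothetical sketch: the schema itself is just data -- here a set --
# declaring which relationships are allowed between which Classes.
allowed_links = {
    ("Sample",     "used_in",  "Experiment"),
    ("Experiment", "part_of",  "Study"),
    ("Experiment", "produces", "Run Result"),
}

def link_is_valid(from_class: str, rel_name: str, to_class: str) -> bool:
    """Enforce the schema: only declared relationships are permitted."""
    return (from_class, rel_name, to_class) in allowed_links

# A meaningful link passes...
print(link_is_valid("Sample", "used_in", "Experiment"))   # True
# ...while the aberrant "Study eats Sample" link gets rejected
print(link_is_valid("Study", "eats", "Sample"))           # False
```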
An advantage is that the schema can be anything that one wants: strict or lax, ad hoc or just about whatever one needs.
A disadvantage is that one has the responsibility to define, construct, debug and enforce such a schema. And there’s no standardization.
One approach could be to create and maintain an open-source library that:
- Provides enough generality for most purposes
- Is well-tested for functionality
- Puts a lot of thought into attaining efficiency of database operations
- Has been put to use under a variety of circumstances, to make sure it can handle an adequate range of real-world needs
- Feels intuitive and easy to use
- Is open-source, and thus can pave the way to standardization, and possibly be modified (or forked) in case of special needs
Well, that’s exactly what the NeoSchema library (which, as of Feb. 2024, is about to emerge from late Beta) aims to do! Its slogan is:
the "marriage" of the flexibility ("anything goes!") of graph databases,
and the discipline of relational databases
Currently, it’s being distributed as part of the open-source project BrainAnnex.org, but it can be used independently as well.
The entire technology stack is described in this video.
Here, I will primarily focus on a conceptual discussion of the Schema Layer. For nitty-gritty details, see its newly-expanded User Guide.
The NeoSchema library has been "battle-tested" in:
- the implementation of the core functionality of the Brain Annex web app (a multimedia content-management system)
- the work at a recent employer of mine, for import and management of large sets of highly-interlinked data from JSON files
An Example
How would such a schema layer be used in a scenario like our hypothetical Sample, Experiment, Study and Run Result data?
The first step is to use nodes labeled “CLASS” (dark green, below), to represent the various data Classes. The data nodes are linked to their respective Classes by means of a relationship named “SCHEMA”:
Now that the Class nodes are in place, all the schema information about each class can go there.
In particular, their relationships to other Class nodes, echoing the expected – or mandated – relationships between the data nodes (such as “used_in”, “part_of” and “produces” in our example). They can be seen in the next diagram.
Also, if the “Properties” (aka “attributes” or “fields”) of the data nodes (i.e. data records) are made into entities, that opens up a way to store plenty of information about them – such as their names, perhaps their data types, whether they are required, or even data to assist the User Interface in displaying them better. And that’s exactly what we do: the Properties are made into graph-database nodes labeled “PROPERTY”.
In the following diagrams, the “Properties” schema nodes are in orange – and also note the relationships among the Class nodes (dark green):
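To make the diagrams concrete, here is a plain-Python sketch of what the schema nodes encode – Class nodes, the relationships among them, and Property nodes carrying extra metadata. (All names, data types and the "required" flag are hypothetical examples, not NeoSchema's actual data model.)

```python
# Hypothetical sketch of the schema-in-the-graph idea: "CLASS" entries,
# relationships among them, and "PROPERTY" entries whose metadata can
# drive validation or assist the User Interface.

classes = {
    "Sample":     {"properties": [{"name": "name",    "dtype": "str",  "required": True}]},
    "Experiment": {"properties": [{"name": "name",    "dtype": "str",  "required": True},
                                  {"name": "date",    "dtype": "date", "required": False}]},
    "Study":      {"properties": [{"name": "title",   "dtype": "str",  "required": True}]},
    "Run Result": {"properties": [{"name": "outcome", "dtype": "str",  "required": True}]},
}

# Relationships among the Class nodes, mirroring the data-level links
class_links = [
    ("Sample",     "used_in",  "Experiment"),
    ("Experiment", "part_of",  "Study"),
    ("Experiment", "produces", "Run Result"),
]

def required_properties(class_name: str) -> list:
    """Which fields MUST be present on data nodes of the given Class."""
    return [p["name"] for p in classes[class_name]["properties"] if p["required"]]

print(required_properties("Experiment"))   # ['name']
```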
If additional schema metadata is desired on the relationships among the Classes – above and beyond their existence, name and direction – the most comprehensive way to store that is by turning them into schema entities, added as follows (light violet nodes):
Note that Relationship nodes (light violet) may in turn have Properties, if so desired. For example, the relationship “used_in”, from samples to experiments, has a hypothetical property named “quantity” in this example. [Feb. 2024 UPDATE: some slight changes have been made to this part of the design; for example, the violet nodes are now labeled "LINK" instead. Also, a special schema relationship named "INSTANCE_OF" was introduced. But this is too much detail for this blog entry, and gets discussed in the new User Guide]
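The idea of promoting a relationship to an entity of its own, so that it can carry Properties, can be sketched like this (again in hypothetical plain Python, using the "quantity" property from the example above):

```python
# Hypothetical sketch: a relationship promoted to a schema entity,
# so that metadata -- including its own Properties -- can be attached.
link_schema = {
    # (from Class, to Class) -> metadata about the relationship itself
    ("Sample", "Experiment"): {
        "name": "used_in",
        "properties": ["quantity"],   # e.g. how much of the Sample was used
    },
}

meta = link_schema[("Sample", "Experiment")]
print(meta["name"], meta["properties"])   # used_in ['quantity']
```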
Using It
In practical terms, how does one create and maintain the Schema Nodes in the graph database?
And how to add data nodes to the database while (optionally) verifying and enforcing data-integrity requirements?
The good news is that the open-source Python NeoSchema library provides the bulk of the implementation. It comes with years of development, extensive unit testing, and “battle-hardening” from real-life challenges in the two projects mentioned earlier. [Feb. 2024 UPDATE: this library is getting far more stable, and stylistically consistent - and it's about to emerge from a late Beta stage.]
The User Interface of the optional web app that comes out of the box with the Brain Annex project also provides a (late Beta-stage) editor to conveniently create and edit the Schema. There’s also a web API to do all the Schema management through JSON calls, if one so wishes.
Import of data from JSON files and Pandas data frames (and therefore CSV files) is also implemented in the NeoSchema library, which provides functions that handle the imports with Schema support. All enforcement previously declared in the Schema is automatically performed. Examples of options:
- the import is rejected if it doesn’t conform to the schema.
- undesired “extra” data (not declared in the Schema) is excluded from the imports.
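The two enforcement options just listed can be sketched in plain Python (the function and field names here are hypothetical illustrations – NeoSchema's actual import functions are described in its User Guide):

```python
# Hypothetical sketch of a schema-enforced import of one record.
declared_fields = {"name", "date"}     # fields declared in the Schema
required_fields = {"name"}             # the subset that is mandatory

def import_record(record: dict, strict: bool = True) -> dict:
    """Return the record to store, enforcing the schema.
    strict=True -> reject (raise) an import missing required fields.
    In all cases, undeclared "extra" fields are excluded."""
    missing = required_fields - record.keys()
    if strict and missing:
        raise ValueError(f"Import rejected; missing required fields: {missing}")
    # Exclude undesired "extra" data not declared in the Schema
    return {k: v for k, v in record.items() if k in declared_fields}

# The undeclared "color" field gets silently dropped on import
print(import_record({"name": "Experiment 1", "color": "blue"}))
```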
All said and done, there are 3 ways to use the Schema Layer:
- Directly, with python calls to methods in the NeoSchema library
- Using an API that accepts JSON requests
- Using a web app built on top of that API
All of them are part of the standard distribution of the BrainAnnex.org open-source project.
Dependencies
Guides
Tutorials
The following tutorials are in the form of annotated and illustrated JupyterLab notebooks, which come as part of the Brain Annex repository:
NeoSchema Tutorial 1 : basic Schema operations (Classes, Properties, Data Nodes)
NeoSchema Tutorial 2 : set up a simple Schema (Classes, Properties) and perform a data import (Data Nodes and relationships among them)
The above links are just VIEWS of Jupyter notebooks. If you want to actually run them from your machine, follow the instructions for installing the Brain Annex technology stack. (Note: at a later date, the NeoSchema library will get independently released, and you'll be able to simply "pip install" it.)
Here is what the first tutorial, above, will guide you to create, with just a handful of function calls to the NeoSchema library, using a JupyterLab notebook (or, alternatively, just a Python script):
How to Build on It
This article is focused on the Schema Layer (the "NeoSchema" library in the diagram below.) The next article in the series discusses the higher layers.