UPDATED Aug. 2023 - Graph databases have an easygoing laissez-faire attitude: "express yourself (almost) however you want"...
By contrast, relational databases come across with an attitude along the lines of a micro-manager: "my way or the highway"...
Is there a way to take the best of both worlds and distance oneself from their respective excesses, as best suited for one's needs?
This is part 5 of a 7-part series on Graph Databases and Neo4j.
part 1 : Graph Databases (Neo4j) - a revolution in modeling the real world!
part 2 : Neo4j Sandbox Tutorial : try Neo4j and learn Cypher the free and easy way
part 3 : Neo4j & Cypher Tutorial : Getting Started with a Graph Database and its Query Language
part 4 : Using Neo4j with Python : the Open-Source Library NeoAccess
part 5 : Using Schema in Graph Databases such as Neo4j
part 6 : Putting it All Together - a Technology Stack on Top of a Graph Database
part 7 : (SPECIAL TOPIC) Full-Text Search with the Neo4j Graph Database
Let's Get Concrete
Consider a simple scenario with four kinds of data: Sample, Experiment, Study and Run Result. Samples are used in Experiments, while Experiments are part of Studies and produce Run Results.
That’s all very easy and intuitive to represent and store in a Labeled Graph Database such as Neo4j. For example, a rough draft might go like this:
The “labels” (black tags) represent the Class of the data – akin to table names of relational databases, or Classes in RDF.
Want to learn more about graph databases? See part 1 of this series: Graph Databases (Neo4j) - a revolution in modeling the real world!
Freedom but Anarchy
Everything so far is easy, intuitive and very flexible… BUT a major shortcoming is that it’s very “anarchic”: no enforcement of anything. The data might include, or not include, any attributes (fields), or any relationships – possibly meaningless, excessive or missing! For example, we might have an aberrant link indicating that a Study “produces” a Sample!
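To make the point concrete, here is a toy, in-memory stand-in for a schema-less labeled graph. It is plain Python, not Neo4j itself, and the helper names `add_node` and `add_link`, as well as the property `name`, are made up for illustration. Nothing stops us from storing a node without any properties, or the aberrant link just mentioned:

```python
# Toy in-memory "labeled graph" with no schema: everything is accepted.
# (Hypothetical helpers, for illustration only; not Neo4j or NeoSchema code.)
nodes = {}   # node id -> {"label": ..., "props": {...}}
links = []   # list of (from_id, relationship_name, to_id)

def add_node(node_id, label, **props):
    """Store a node with a label and whatever properties were passed."""
    nodes[node_id] = {"label": label, "props": props}

def add_link(from_id, rel_name, to_id):
    """Store a directed, named link; no check on what it connects."""
    links.append((from_id, rel_name, to_id))

add_node(1, "Sample", name="S-1")
add_node(2, "Study")              # no properties at all: accepted
add_link(2, "produces", 1)        # a Study "produces" a Sample: accepted too!
```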
A possible remedy is to use the graph database itself to define the schema – and use a software library to help maintain it and optionally enforce it.
An advantage is that the schema can be anything that one wants: strict or lax, ad hoc or just about whatever one needs.
A disadvantage is that one has the responsibility to define, construct, debug and enforce such a schema. And there’s no standardization.
One remedy is to create and maintain an open-source library that:
- Provides enough generality for most purposes
- Is well-tested for functionality
- Has been put to use in a variety of settings, to make sure it can handle an adequate range of circumstances
- Is open-source, and thus can pave the way to standardization, and possibly be modified (or forked) in case of special needs
Well, that’s exactly what the NeoSchema library, currently in late Beta, aims to do! Its slogan is:
bring together the best of the flexibility ("anything goes!") of graph databases,
and the "law and order" aspect of relational databases
Currently, it’s being distributed as part of the open-source project BrainAnnex.org, but it can be used independently as well.
The entire technology stack is described in this video.
Here I will primarily focus on a conceptual discussion of the Schema Layer.
The NeoSchema library has been tested:
- in the implementation of the core functionality of the Brain Annex web app (a multimedia content-management system)
- in the work at a recent employer of mine, for import and management of large sets of highly-interlinked data
An Example
How would such a schema layer be used in a sample scenario like the hypothetical Sample, Experiment, Study, Run Result data introduced earlier?
The 1st step is to use nodes labeled “Class” (dark green), to represent the various data classes. The data nodes are linked to their respective classes by means of a relationship named “SCHEMA”:
Now that the Class nodes are in place, all the schema information about the class can go there.
In particular, the Class nodes carry relationships to other Class nodes, echoing the expected relationships between the data nodes (such as “used_in”, “part_of” and “produces” in our example). They can be seen in the next diagram.
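The linking of a data node to its Class node, by means of a “SCHEMA” relationship, could be expressed in Cypher along the following lines. This is only a sketch and not the NeoSchema implementation: the function name is hypothetical, and it assumes that Class nodes carry a `name` property and that the “SCHEMA” relationship points from the data node to the Class node.

```python
def data_node_with_class_query(class_name, properties):
    """
    Return (cypher, params) to create a data node and link it to its Class node.
    Assumptions (not confirmed by the article): Class nodes have a "name"
    property, and SCHEMA points from the data node to the Class node.
    """
    if "`" in class_name:
        raise ValueError("backticks are not allowed in class names")

    # Labels cannot be passed as Cypher parameters, so the label is
    # interpolated (back-quoted, to allow names such as "Run Result")
    cypher = (
        "MATCH (c:Class {name: $class_name}) "
        f"CREATE (d:`{class_name}` $props)-[:SCHEMA]->(c) "
        "RETURN d"
    )
    return cypher, {"class_name": class_name, "props": properties}

cypher, params = data_node_with_class_query("Sample", {"name": "S-1"})
print(cypher)
print(params)
```

The returned pair can then be handed to whatever is used to run Cypher, for instance a session of the official Neo4j Python driver.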
Also, if the “properties” (aka “attributes” or “fields”) are made into entities, that opens up a way to store plenty of information about them – such as their names, perhaps data types, or whether they are required, or even data to assist the User Interface to better display them. And that’s exactly what we’re going to do; the Properties are made into Neo4j nodes labeled “PROPERTY”.
In the following diagrams, the “properties” schema nodes are in orange – and also note the relationships among the Class nodes (dark green):
If additional schema metadata is desired on the Class relationships – above and beyond their existence, name and direction – the most comprehensive way to store that is by turning them into schema entities, added as follows (light violet nodes):
Note that Relationship nodes (light violet) may in turn have Properties, if so desired. For example, the relationship “used_in”, from samples to experiments, has a hypothetical property named “quantity” in this example.
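As a rough mental model, the schema information discussed so far can be mirrored by a small in-memory structure. The sketch below is not how NeoSchema stores things (it uses nodes and relationships in the database itself), and every property name other than “quantity” is invented for the sake of the example. It shows how such information makes it possible to tell a legitimate link from an aberrant one:

```python
# Classes and their Properties
# (all property names here are hypothetical, for illustration)
CLASS_PROPERTIES = {
    "Sample": ["name"],
    "Experiment": ["date"],
    "Study": ["title"],
    "Run Result": ["value"],
}

# Allowed relationships between Classes:
# (from_class, relationship_name, to_class) -> Properties of the relationship
CLASS_RELATIONSHIPS = {
    ("Sample", "used_in", "Experiment"): ["quantity"],
    ("Experiment", "part_of", "Study"): [],
    ("Experiment", "produces", "Run Result"): [],
}

def is_link_allowed(from_class, rel_name, to_class):
    """True if the schema declares this relationship between the two Classes."""
    return (from_class, rel_name, to_class) in CLASS_RELATIONSHIPS

print(is_link_allowed("Sample", "used_in", "Experiment"))   # True
print(is_link_allowed("Study", "produces", "Sample"))       # False
```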
Using It
In practical terms, how does one create and maintain the Schema Nodes in Neo4j?
And how does one add data nodes to Neo4j while (optionally) verifying and enforcing requirements?
The good news is that the open-source Python NeoSchema library provides the bulk of the implementation. It comes with a lot of unit testing, and is “battle-hardened” by real-life challenges from the two projects mentioned earlier.
The User Interface of Brain Annex also provides a (partially implemented) editor to conveniently create and edit the schema, plus an API to do all that through JSON calls, if one so wishes.
Import of data from JSON files is also implemented. The user can opt to do imports with or without schema support. If done with schema support, all enforcement is automatically performed; for example, the import is rejected if it doesn’t conform to the schema, or undesired “extra” data is excluded from the imports.
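The two enforcement behaviors just described (rejecting a non-conforming import, or dropping the undesired “extra” data) can be sketched in a few lines of plain Python. The function `check_record` and its parameters are hypothetical, chosen for illustration; this is not the NeoSchema API.

```python
def check_record(record, allowed, required=(), strict=True):
    """
    Validate one imported record (a dict) against the schema of its Class.
    strict=True : reject records carrying fields not in the schema
    strict=False: silently drop such "extra" fields
    A missing required field is always an error.
    (Hypothetical helper, for illustration only.)
    """
    missing = [f for f in required if f not in record]
    if missing:
        raise ValueError(f"missing required field(s): {missing}")

    extra = [f for f in record if f not in allowed]
    if extra and strict:
        raise ValueError(f"field(s) not in the schema: {extra}")

    # Keep only the fields that the schema knows about
    return {k: v for k, v in record.items() if k in allowed}

incoming = {"name": "S-1", "volume": 2.5, "color": "red"}
cleaned = check_record(incoming, allowed={"name", "volume"},
                       required=["name"], strict=False)
print(cleaned)   # {'name': 'S-1', 'volume': 2.5}
```

With `strict=True` (the default in this sketch), the same record would raise a `ValueError` because of the unexpected `color` field.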
Continually checking the schema through database operations can be slow, so the NeoSchema library optionally caches schema data in memory during large imports, which in tests has been shown to yield about an order of magnitude speedup.
All said and done, there are 3 ways to use the Schema Layer:
- Directly, with Python calls to methods in the NeoSchema library
- Using an API that accepts JSON requests
- Using a web app built on top of that API
Programmer's Guide to the NeoSchema library.
Tutorial
It's in the form of an annotated and illustrated JupyterLab notebook, which comes as part of the Brain Annex repository:
The above link is just a VIEW of a notebook. If you want to actually run it from your machine, follow the instructions for installing the Brain Annex technology stack. (Note: at a later date, the NeoSchema library will get independently released.)
Here's what this intro tutorial will guide you to create, with just a handful of function calls from the NeoSchema library, using a JupyterLab notebook (or, alternatively, just a Python script):
How to Build on It
This article is focused on the Schema Layer (the "NeoSchema" library in the diagram below.) The next article (part 6) in the series discusses the higher layers.