
Using Schema in Graph Databases such as Neo4j

UPDATED Feb. 2024 - Graph databases have an easygoing laissez-faire attitude: "express yourself (almost) however you want"...

By contrast, relational databases come across with an attitude like a micro-manager:  "my way or the highway"...

Is there a way to take the best of both worlds and distance oneself from their respective excesses, as best suited for one's needs?  A way to marry the flexibility of Graph Databases and the discipline of Relational Databases?


This article is part 5 of a growing, ongoing series on Graph Databases and Neo4j


Let's Get Concrete

Consider a simple scenario with scientific data involving Samples, Experiments, Studies and Run Results: Samples are used in Experiments, and Experiments are part of Studies and produce Run Results.

That’s all very easy and intuitive to represent and store in a Labeled Graph Database such as Neo4j.   For example, a rough draft might go like this:

 

The “labels” (black tags) represent the Class of the data – akin to the table names of relational databases, or Classes in RDF (semantic theory.)

Want to learn more about graph databases?   See part 1 of this series: Graph Databases (Neo4j) - a revolution in modeling the real world! 

Freedom but Anarchy

Everything so far is easy, intuitive and very flexible… BUT a major shortcoming is that it’s very “anarchic”: no enforcement of anything.   The data might include, or not include,  any attributes (fields),  or any relationships –  possibly meaningless, excessive or missing!  For example, we might have an aberrant link indicating that a Study “eats” a Sample!

Also, the database isn't storing any self-documenting data: all one can do is examine some data nodes (for example, pick a "Sample" node at random) and presume/hope that they're representative of all the data!

Please note that self-documenting data isn't just about creating an alternative to a user manual!  It can also contain information for the User Interface, to avoid having to hard-wire into the UI a whole lot of knowledge, such as the fact that some field X can only take some specific values.

A possible remedy to the above shortcoming is to use the graph database itself to define the schema – and use a software library to help maintain it and optionally enforce it.

An advantage is that the schema can be anything that one wants: strict or lax, ad hoc or just about whatever one needs.

A disadvantage is that one has the responsibility to define, construct, debug and enforce such a schema.   And there’s no standardization.

One approach could be to create and maintain an open-source library that:

  • Provides enough generality for most purposes
  • Is well-tested for functionality
  • Puts a lot of thought into attaining efficiency of database operations
  • Has been put to use in a variety of settings, to make sure it can handle an adequate range of circumstances
  • Feels intuitive and easy to use
  • Is open-source, and thus can pave the way to standardization, and possibly be modified (or forked) in case of special needs


Well, that’s exactly what the NeoSchema library (which, as of Feb. 2024, is about to emerge from late Beta) aims to do!  Its slogan is:

the "marriage" of the flexibility ("anything goes!") of graph databases,
and the discipline of relational databases

Currently, it’s being distributed as part of the open-source project BrainAnnex.org, but it can be used independently as well.

The entire technology stack is described in this video.

Here, I will primarily focus on a conceptual discussion of the Schema Layer.  For nitty-gritty details, see its newly-expanded User Guide.

The NeoSchema library has been "battle tested" in:

  1. the implementation of the core functionality of the Brain Annex web app (a multimedia content-management system)

  2. the work at a recent employer of mine, for import and management of large sets of highly-interlinked data from JSON files

An Example

How would such a schema layer be used in a scenario like the hypothetical Sample, Experiment, Study, Run Result data introduced earlier?

The first step is to use nodes labeled “CLASS” (dark green, below), to represent the various data Classes.  The data nodes are linked to their respective Classes by means of a relationship named “SCHEMA”:



Now that the Class nodes are in place, all the schema information about the class can go there.  

In particular, the Class nodes can store their relationships to other Class nodes, echoing the expected (or mandated) relationships between the data nodes, such as “used_in”, “part_of” and “produces” in our example.  They can be seen in the next diagram.
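As a rough illustration of the idea, here is a minimal sketch in plain Python, with a set of tuples standing in for the relationships among the Class nodes.  This is NOT the NeoSchema API: the names `ALLOWED_LINKS` and `link_is_allowed` are hypothetical, made up for this example.  It shows how the Class-level relationships can act as a whitelist against which a proposed link between data nodes is checked:

```python
# Hypothetical in-memory stand-in for the schema; NOT the NeoSchema API.
# Each entry mirrors a relationship between two Class nodes:
#   (from_class, relationship_name, to_class)
ALLOWED_LINKS = {
    ("Sample", "used_in", "Experiment"),
    ("Experiment", "part_of", "Study"),
    ("Experiment", "produces", "Run Result"),
}


def link_is_allowed(from_class: str, rel_name: str, to_class: str) -> bool:
    """Return True if the schema declares this relationship between the two Classes."""
    return (from_class, rel_name, to_class) in ALLOWED_LINKS


ok = link_is_allowed("Sample", "used_in", "Experiment")   # declared in the schema
aberrant = link_is_allowed("Study", "eats", "Sample")     # the aberrant link mentioned earlier
print(ok, aberrant)
```

Because the direction is part of each entry, a reversed link (an Experiment “used_in” a Sample) would be refused as well.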

Also, if the “Properties” (aka “attributes” or “fields”) of the data nodes (i.e. data records) are made into entities, that opens up a way to store plenty of information about them – such as their names, perhaps data types, or whether they are required, or even data to assist the User Interface to better display them.  And that’s exactly what we do; the Properties are made into graph-database nodes labeled “PROPERTY”.
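To make this more tangible, here is a small sketch in plain Python of what treating Properties as entities buys us.  It is not the NeoSchema API: the `PropertySpec` class, the `check_record` function and the "Sample" properties `sample_id` and `volume_ml` are all hypothetical, chosen only for illustration.  Each Property carries its own metadata (name, data type, whether it's required), and a data record can be checked against that metadata:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertySpec:
    """Hypothetical stand-in for a PROPERTY schema node; NOT the NeoSchema API."""
    name: str
    dtype: type = str
    required: bool = False


# Hypothetical Properties declared for the "Sample" Class
SAMPLE_PROPERTIES = [
    PropertySpec("sample_id", str, required=True),
    PropertySpec("volume_ml", float),
]


def check_record(record: dict, specs: list) -> list:
    """Return a list of problems found; an empty list means the record conforms."""
    problems = []
    declared = {spec.name for spec in specs}
    for spec in specs:
        if spec.name not in record:
            if spec.required:
                problems.append(f"missing required property '{spec.name}'")
        elif not isinstance(record[spec.name], spec.dtype):
            problems.append(f"property '{spec.name}' should be of type {spec.dtype.__name__}")
    for key in record:
        if key not in declared:
            problems.append(f"undeclared property '{key}'")
    return problems


good = check_record({"sample_id": "S-1", "volume_ml": 2.5}, SAMPLE_PROPERTIES)
bad = check_record({"volume_ml": "a lot", "color": "red"}, SAMPLE_PROPERTIES)
print(good)
print(bad)
```

The same metadata could just as well feed a User Interface, for example to mark required fields in a data-entry form.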

In the following diagrams, the “Properties” schema nodes are in orange – and also note the relationships among the Class nodes (dark green):


If additional schema metadata is desired on the relationships among the Classes – above and beyond their existence, name and direction – the most comprehensive way to store that is by turning them into schema entities, added as follows (light violet nodes):

Note that Relationship nodes (light violet) may in turn have Properties, if so desired.   For example, the relationship “used_in”, from samples to experiments, has a hypothetical property named “quantity” in this example.  [Feb. 2024 UPDATE: some slight changes have been made to this part of the design; for example, the violet nodes are now labeled "LINK" instead.  Also, a special schema relationship named "INSTANCE_OF" was introduced.  But this is too much detail for this blog entry, and it gets discussed in the new User Guide]

Using It

In practical terms, how to create and maintain the Schema Nodes in the graph database?

And how to add data nodes to the database while (optionally) verifying and enforcing data-integrity requirements?

The good news is that the open-source Python NeoSchema library provides the bulk of the implementation.  It comes with years of development and a lot of unit testing, and it has been “battle-hardened” by real-life challenges from the two projects mentioned earlier.  [Feb. 2024 UPDATE:  this library is getting far more stable and stylistically consistent, and it's about to emerge from a late Beta stage.]

The User Interface of the optional web app that comes out of the box with the Brain Annex project also provides a (late Beta-stage) editor to conveniently create and edit the Schema.  Plus a web API to do all the Schema management through JSON calls, if one so wishes.

Import of data from JSON files and Pandas data frames (and therefore CSV files) is also implemented in the NeoSchema library, which provides functions that handle the imports with Schema support.  All enforcement previously declared in the Schema is automatically performed.  Examples of options: 

  • the import is rejected if it doesn’t conform to the schema.
  • undesired “extra” data (not declared in the Schema)  is excluded from the imports.
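The two options above can be sketched as follows, in plain Python.  This is an illustration of the general idea only, not the NeoSchema API: the function `import_records`, its `extras` parameter and the field names are all hypothetical.

```python
def import_records(records, declared, extras="reject"):
    """
    Hypothetical sketch of a schema-aware import; NOT the NeoSchema API.

    :param records:  list of dicts (for example, rows parsed from a JSON file)
    :param declared: set of property names declared in the schema
    :param extras:   "reject" raises an error on undeclared fields;
                     "drop" silently removes them
    :return:         the list of records that would be stored
    """
    if extras not in ("reject", "drop"):
        raise ValueError(f"unknown value for extras: {extras!r}")

    accepted = []
    for rec in records:
        unknown = set(rec) - declared
        if unknown and extras == "reject":
            # Nothing is returned in this case: the whole import is refused
            raise ValueError(f"record {rec} has undeclared fields: {sorted(unknown)}")
        # Keep only the fields that the schema knows about
        accepted.append({k: v for k, v in rec.items() if k in declared})
    return accepted


rows = [{"sample_id": "S-1", "color": "red"}, {"sample_id": "S-2"}]
declared_fields = {"sample_id", "volume_ml"}

cleaned = import_records(rows, declared_fields, extras="drop")
print(cleaned)
```

With `extras="reject"`, the same call would raise an error at the first record, because "color" isn't declared in the (hypothetical) schema.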


All said and done, there are 3 ways to use the Schema Layer:

  1. Directly, with Python calls to methods in the NeoSchema library
  2. Using an API that accepts JSON requests
  3. Using a web app built on top of that API

All of them are part of the standard distribution of the BrainAnnex.org open-source project.

Dependencies

The NeoSchema library makes use of the NeoAccess library, and all that is based on an underlying Neo4j graph database.  [Feb. 2024 UPDATE:  NeoAccess was recently ported to another graph database, namely AWS Neptune.   It shouldn't be too hard to port NeoSchema to Neptune as well, if anyone is so inclined.]

Guides

User Guide

Reference Guide

Tutorials

The following tutorials are in the form of annotated and illustrated JupyterLab notebooks, which come as part of the Brain Annex repository:

NeoSchema Tutorial 1  : basic Schema operations (Classes, Properties, Data Nodes)

NeoSchema Tutorial 2  : set up a simple Schema (Classes, Properties) and perform a data import (Data Nodes and relationships among them)

The above links are just VIEWS of Jupyter notebooks.  If you want to actually run them from your machine, follow the instructions for installing the Brain Annex technology stack.  (Note: at a later date, the NeoSchema library will get independently released, and you'll be able to simply "pip install" it.)

Here is what the first tutorial, above, will guide you to create, with just a handful of function calls to the NeoSchema library, using a JupyterLab notebook (or, alternatively, just a Python script):

How to Build on It

This article is focused on the Schema Layer (the "NeoSchema" library in the diagram below.)  The next article in the series discusses the higher layers.





 
