
Using Schema in Graph Databases such as Neo4j

UPDATED Aug. 2023 - Graph databases have an easygoing laissez-faire attitude: "express yourself (almost) however you want"...

By contrast, relational databases come across with an attitude along the lines of a micro-manager:  "my way or the highway"...

Is there a way to take the best of both worlds and distance oneself from their respective excesses, as best suited for one's needs?

This is part 5 of a 7-part series on Graph Databases and Neo4j. 

part 1 : Graph Databases (Neo4j) - a revolution in modeling the real world!

part 2 : Neo4j Sandbox Tutorial : try Neo4j and learn Cypher the free and easy way

part 3 : Neo4j & Cypher Tutorial : Getting Started with a Graph Database and its Query Language

part 4 : Using Neo4j with Python : the Open-Source Library NeoAccess

part 5 : Using Schema in Graph Databases such as Neo4j

part 6 : Putting it All Together - a Technology Stack on Top of a Graph Database

part 7 : (SPECIAL TOPIC) Full-Text Search with the Neo4j Graph Database


Let's Get Concrete

Consider a simple scenario with four kinds of data: Sample, Experiment, Study and Run Result.  Samples are used in Experiments, while Experiments are part of Studies and produce Run Results.

That’s all very easy and intuitive to represent and store in a Labeled Graph Database such as Neo4j.   For example, a rough draft might go like this:

 

The “labels” (black tags) represent the Class of the data – akin to table names of relational databases, or Classes in RDF.   

Want to learn more about graph databases?   See part 1 of this series: Graph Databases (Neo4j) - a revolution in modeling the real world! 

Freedom but Anarchy

Everything so far is easy, intuitive and very flexible… BUT a major shortcoming is that it’s very “anarchic”: nothing is enforced.   A data node might include, or lack, any attributes (fields) or any relationships, which could be meaningless, excessive or missing!  For example, we might have an aberrant link indicating that a Study “produces” a Sample!
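
To make the point concrete, here is a minimal Python sketch of a schema-free labeled graph held in memory.  The `TinyGraph` class and its method names are hypothetical, made up for this illustration; they are not part of Neo4j or of any library.  Notice that nothing stops the aberrant link from being stored:

```python
class TinyGraph:
    """A toy, schema-free labeled property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}     # node id -> {"label": str, "props": dict}
        self.links = []     # (from_id, relationship_name, to_id)
        self._next_id = 1

    def add_node(self, label, **props):
        """Store a node with any label and any properties; nothing is checked."""
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_link(self, from_id, rel_name, to_id):
        """Store a relationship between two nodes; its meaning is not checked."""
        self.links.append((from_id, rel_name, to_id))


g = TinyGraph()
sample = g.add_node("Sample", name="S-1")
experiment = g.add_node("Experiment", date="2023-08-01")
study = g.add_node("Study")                  # no attributes at all: accepted
g.add_link(sample, "used_in", experiment)    # sensible
g.add_link(study, "produces", sample)        # aberrant, yet accepted just the same
```

The last call is exactly the kind of mistake that a schema layer is meant to catch.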

A possible remedy is to use the graph database itself to define the schema – and use a software library to help maintain it and optionally enforce it.

An advantage is that the schema can be anything that one wants: strict or lax, ad hoc or just about whatever one needs.

A disadvantage is that one has the responsibility to define, construct, debug and enforce such a schema.   And there’s no standardization.

One remedy is to create and maintain an open-source library that:

  • Provides enough generality for most purposes
  • Is well-tested for functionality
  • Has been put to use in a variety of settings, to make sure it can handle an adequate range of circumstances
  • Is open-source, and thus can pave the way to standardization, and possibly be modified (or forked) in case of special needs


Well, that’s exactly what the NeoSchema library, currently in late Beta, aims to do!  Its slogan is:  

bring together the best of the flexibility ("anything goes!") of graph databases,
and the "law and order" aspect of relational databases

Currently, it’s being distributed as part of the open-source project BrainAnnex.org, but it can be used independently as well.

The entire technology stack is described in this video.

Here I will primarily focus on a conceptual discussion of the Schema Layer.

The NeoSchema library has been tested:

  1. in the implementation of the core functionality of the Brain Annex web app (a multimedia content-management system)

  2. in the work at a recent employer of mine, for import and management of large sets of highly-interlinked data

An Example

How would such a schema layer be used in a scenario like the hypothetical Sample, Experiment, Study and Run Result data?

The 1st step is to use nodes labeled “Class” (dark green), to represent the various data classes.  The data nodes are linked to their respective classes by means of a relationship named “SCHEMA”:
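
As a rough sketch of this first step, the two helpers below build Cypher queries (as Python strings, with their parameters) to create a “Class” node and to add a data node linked to it by a “SCHEMA” relationship.  The function names, the `name` property on the Class node and the direction of the “SCHEMA” relationship (from data node to Class node) are my assumptions for illustration; this is not the NeoSchema API, nor necessarily its internal layout.

```python
def cypher_create_class(class_name):
    """Return (query, params) to create a node labeled "Class", unless already present."""
    query = "MERGE (c:Class {name: $class_name}) RETURN c"
    return query, {"class_name": class_name}


def cypher_add_data_node(class_name, props):
    """Return (query, params) to create a data node and link it to its Class node."""
    # Labels cannot be passed as Cypher parameters, so the label is placed in the
    # query text, quoted with backticks (which also allows spaces, as in "Run Result")
    if "`" in class_name:
        raise ValueError(f"Unsuitable class name: {class_name!r}")
    query = (
        "MATCH (c:Class {name: $class_name}) "
        f"CREATE (n:`{class_name}`)-[:SCHEMA]->(c) "   # assumed direction: data -> Class
        "SET n = $props "
        "RETURN n"
    )
    return query, {"class_name": class_name, "props": props}


class_query, class_params = cypher_create_class("Sample")
data_query, data_params = cypher_add_data_node("Sample", {"name": "S-1"})
```

Each (query, params) pair can then be passed to whatever executes Cypher for you; for example, with the official Neo4j Python driver, something along the lines of `session.run(data_query, data_params)`.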



Now that the Class nodes are in place, all the schema information about the class can go there.  

In particular, the Class nodes can hold relationships to other Class nodes, echoing the expected relationships between the data nodes (such as “used_in”, “part_of” and “produces”, in our example).  They can be seen in the next diagram.

Also, if the “properties” (aka “attributes” or “fields”) are made into entities, that opens up a way to store plenty of information about them – such as their names, perhaps data types, or whether they are required, or even data to assist the User Interface to better display them.  And that’s exactly what we’re going to do; the Properties are made into Neo4j nodes labeled “PROPERTY”.
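
To give a flavor of what such Property entities make possible, here is a small in-memory sketch of checking a data record against them.  Everything in it (`PropertyDef`, `check_record`, the `volume_ml` property) is hypothetical and only illustrates the idea; in the actual design, the Properties live in the database as nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    """Schema information about one property of a Class (hypothetical structure)."""
    name: str
    dtype: type = str
    required: bool = False


# Assumed properties for the "Sample" class, for illustration only
SAMPLE_PROPERTIES = [
    PropertyDef("name", str, required=True),
    PropertyDef("volume_ml", float),
]


def check_record(record, property_defs):
    """Return a list of problems found in the record; an empty list means it conforms."""
    problems = []
    allowed = {p.name: p for p in property_defs}
    for p in property_defs:
        if p.required and p.name not in record:
            problems.append(f"missing required property '{p.name}'")
    for key, value in record.items():
        if key not in allowed:
            problems.append(f"unexpected property '{key}'")
        elif not isinstance(value, allowed[key].dtype):
            problems.append(
                f"property '{key}' should be of type {allowed[key].dtype.__name__}"
            )
    return problems


good = check_record({"name": "S-1", "volume_ml": 2.5}, SAMPLE_PROPERTIES)
bad = check_record({"volume_ml": "lots", "color": "red"}, SAMPLE_PROPERTIES)
```

Here `good` comes back empty, while `bad` lists three problems: a missing required property, a value of the wrong type, and a property that the schema doesn't know about.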

In the following diagrams, the “properties” schema nodes are in orange – and also note the relationships among the Class nodes (dark green):


If additional schema metadata is desired on the Class relationships – above and beyond their existence, name and direction – the most comprehensive way to store that is by turning them into schema entities, added as follows (light violet nodes):

Note that Relationship nodes (light violet) may in turn have Properties, if so desired.   For example, the relationship “used_in”, from samples to experiments, has a hypothetical property named “quantity” in this example.
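
The relationships among Classes, and their optional Properties, can be sketched in plain Python as a lookup table.  The table below mirrors the example of this article, but the data structure and the `check_link` function are hypothetical, not taken from the NeoSchema library.

```python
# (from class, relationship name, to class) -> properties allowed on that relationship
ALLOWED_RELATIONSHIPS = {
    ("Sample", "used_in", "Experiment"): {"quantity"},
    ("Experiment", "part_of", "Study"): set(),
    ("Experiment", "produces", "Run Result"): set(),
}


def check_link(from_class, rel_name, to_class, props=None):
    """Return True if the schema allows this relationship, with these properties."""
    key = (from_class, rel_name, to_class)
    if key not in ALLOWED_RELATIONSHIPS:
        return False    # no such relationship between these two classes
    # Every property on the link must be among those declared in the schema
    return set(props or {}) <= ALLOWED_RELATIONSHIPS[key]


ok = check_link("Sample", "used_in", "Experiment", {"quantity": 3})
aberrant = check_link("Study", "produces", "Sample")
```

With this table, the earlier aberrant link (a Study that “produces” a Sample) is turned down, because the schema only declares “produces” from Experiment to Run Result.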

Using It

In practical terms, how does one create and maintain the Schema Nodes in Neo4j?

And how to add data nodes to Neo4j while (optionally) verifying and enforcing requirements?

The good news is that the open-source Python NeoSchema library provides the bulk of the implementation.  It comes with a lot of unit testing, and is “battle-hardened” by real-life challenges from the two projects mentioned earlier.

The User Interface of Brain Annex also provides a (partially implemented) editor to conveniently create and edit the schema.  Plus an API to do all that through JSON calls, if one so wishes.

Import of data from JSON files is also implemented.  The user can opt to do imports with or without schema support.  If done with schema support, all enforcement is automatically performed; for example, the import is rejected if it doesn’t conform to the schema, or undesired “extra” data is excluded from the imports.
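
The two enforcement behaviors just mentioned (rejecting a non-conforming import, or quietly excluding the “extra” data) can be illustrated with a short sketch.  The function `import_json`, its `strict` flag and the sample schema are all hypothetical; the actual import routines of the library may well look different.

```python
import json

# Hypothetical schema: the properties allowed for each class
ALLOWED_PROPERTIES = {"Sample": {"name", "volume_ml"}}


def import_json(json_text, class_name, strict=True):
    """Parse a JSON list of records and enforce the schema on each of them.

    strict=True  : reject the whole import if any record has undeclared fields
    strict=False : drop the undeclared fields and keep the rest
    """
    records = json.loads(json_text)
    allowed = ALLOWED_PROPERTIES[class_name]
    accepted = []
    for record in records:
        extras = sorted(set(record) - allowed)
        if extras and strict:
            raise ValueError(f"Import rejected: {class_name} does not allow {extras}")
        accepted.append({k: v for k, v in record.items() if k in allowed})
    return accepted


data = '[{"name": "S-1", "volume_ml": 2.5}, {"name": "S-2", "color": "red"}]'
lenient_result = import_json(data, "Sample", strict=False)
```

In lenient mode, the “color” field of the second record is simply left out; in strict mode, the same input raises an error and nothing gets imported.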

Continually checking the schema through database operations can be slow…  so, the NeoSchema library optionally caches schema data in memory during large imports, which in tests has been shown to yield about an order of magnitude speedup.
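
The general idea behind such caching can be shown with Python's standard `functools.lru_cache`.  This is only a sketch of the technique, not how NeoSchema implements it; the function names are made up, and the “database query” is a stand-in that merely counts how many times it gets called.

```python
from functools import lru_cache

stats = {"db_lookups": 0}


def fetch_allowed_properties_from_db(class_name):
    """Stand-in for a (slow) database query about the schema; hypothetical."""
    stats["db_lookups"] += 1
    return frozenset({"name", "volume_ml"}) if class_name == "Sample" else frozenset()


@lru_cache(maxsize=None)
def allowed_properties(class_name):
    """Cached version: the database is consulted only once per class name."""
    return fetch_allowed_properties_from_db(class_name)


# Check 1,000 records against the schema of the "Sample" class
records = [{"name": f"S-{i}"} for i in range(1000)]
all_conform = all(set(rec) <= allowed_properties("Sample") for rec in records)
```

All 1,000 records get checked, but the stand-in query runs just once.  The flip side is that a cache can go stale if the schema changes in the middle of an import, which is one reason to make it optional.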

All said and done, there are 3 ways to use the Schema Layer:

  1. Directly, with Python calls to methods in the NeoSchema library
  2. Using an API that accepts JSON requests
  3. Using a web app built on top of that API

Programmer's Guide to the NeoSchema library.

Tutorial

It's in the form of an annotated and illustrated JupyterLab notebook, which comes as part of the Brain Annex repository:

NeoSchema Tutorial 1

The above link is just a VIEW of a notebook.  If you want to actually run it from your machine, follow the instructions for installing the Brain Annex technology stack.  (Note: at a later date, the NeoSchema library will get independently released.)

Here's what this intro tutorial will guide you to create, with just a handful of function calls from the NeoSchema library, using a JupyterLab notebook (or, alternatively, just a Python script):

How to Build on It

This article is focused on the Schema Layer (the "NeoSchema" library in the diagram below.)  The next article (part 6) in the series discusses the higher layers.
