
Graph Databases (Neo4j) - a revolution in modeling the real world!

UPDATED Oct. 2023 - I was "married" to Relational Databases for many years... and it was a good "relationship" full of love and productivity - but SOMETHING WAS MISSING!

Let me backtrack.  

In college, I got a hint of the "pre-relational database" days...  Mercifully, that was largely before my time, but  - primarily through a class - I got a taste of what the world was like before relational databases.  It's an understatement to say: YUCK!

Gratitude for the power and convenience of Relational Databases and SQL - and relief at having narrowly averted life before it! - made me an instant mega-fan of that technology.  And for many years I held various jobs that, directly or indirectly, made use of MySQL and other relational databases - whether as a Database Administrator, Full-Stack Developer, Data Scientist, CTO or various other roles.

UPDATE: This article is now part 1 of a growing, ongoing series on Graph Databases and Neo4j

But there were thorns in the otherwise happy relationship

The root cause: THE REAL WORLD DOES NOT REALLY RESEMBLE THE TABLES of a relational database.

As a database expert, one is very used to that.  It becomes second nature to use lots of gimmicks - basically hacks - to make real data conform to relational databases.  We don't call them "hacks" : we glorify them with names such as "foreign keys", "join tables", "normal forms" - but, let's face it – they are ugly hacks!

Example

Relational databases are, sadly, so divorced from the workings of the real world that it's exceedingly easy to come up with a very basic example that feels like "square pegs in round holes":
  • We have movies, actors and directors
  • Movies contain actors; conversely, actors usually appear in multiple movies
  • Movies have one or more directors; conversely, directors may work in multiple movies

That's it!  A straightforward snapshot of a simple real-world scenario.

And yet, even for something so basic, we immediately need to roll out the exotic heavy tanks of relational databases:

  • Primary keys for movies, actors and directors
  • A "join table", containing "foreign keys", to handle the many-to-many relationship between movies and actors
  • Another "join table" to handle the many-to-many relationship between movies and directors

At this point, if you want to ask "which directors have worked with Tom Hanks in 2016", you are faced with a gnarly "join" of a whopping 5 tables : the 3 main tables and the 2 special "join tables".

All said and done, an SQL monstrosity such as :

SELECT directors.name
    FROM directors
        JOIN movies_directors ON directors.ID = movies_directors.director_ID
        JOIN movies ON movies_directors.movie_ID = movies.ID
        JOIN movies_actors ON movies.ID = movies_actors.movie_ID
        JOIN actors ON movies_actors.actor_ID = actors.ID
    WHERE actors.name = 'Tom Hanks'
    AND movies.release_date = 2016
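If you'd like to see the 5-table setup in action without installing anything, here's a minimal sketch using Python's built-in sqlite3 module.  The table and column names follow the query above; the "title" column, the movie titles and the director names are made-up placeholders of mine, just for illustration.

```python
import sqlite3

# Throwaway in-memory database; nothing is written to disk
conn = sqlite3.connect(":memory:")

conn.executescript("""
    -- The 3 main tables, each with a primary key
    CREATE TABLE movies    (ID INTEGER PRIMARY KEY, title TEXT, release_date INTEGER);
    CREATE TABLE actors    (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE directors (ID INTEGER PRIMARY KEY, name TEXT);

    -- The 2 "join tables" that carry the many-to-many relationships
    CREATE TABLE movies_actors (
        movie_ID INTEGER REFERENCES movies(ID),
        actor_ID INTEGER REFERENCES actors(ID));
    CREATE TABLE movies_directors (
        movie_ID    INTEGER REFERENCES movies(ID),
        director_ID INTEGER REFERENCES directors(ID));

    -- Made-up sample rows (placeholders, not real filmography data)
    INSERT INTO movies VALUES (1, 'Movie A', 2016), (2, 'Movie B', 2016), (3, 'Movie C', 2010);
    INSERT INTO actors VALUES (1, 'Tom Hanks'), (2, 'Some Other Actor');
    INSERT INTO directors VALUES (1, 'Director X'), (2, 'Director Y'), (3, 'Director Z');
    INSERT INTO movies_actors VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO movies_directors VALUES (1, 1), (2, 2), (3, 3);
""")

# The 5-table join, with the actor name and year passed as parameters
QUERY = """
    SELECT directors.name
    FROM directors
        JOIN movies_directors ON directors.ID = movies_directors.director_ID
        JOIN movies           ON movies_directors.movie_ID = movies.ID
        JOIN movies_actors    ON movies.ID = movies_actors.movie_ID
        JOIN actors           ON movies_actors.actor_ID = actors.ID
    WHERE actors.name = ? AND movies.release_date = ?
"""

result = [row[0] for row in conn.execute(QUERY, ("Tom Hanks", 2016))]
print(result)   # ['Director X']
```

Only "Director X" comes back: in this toy data, "Movie A" is the one 2016 movie with Tom Hanks in it.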

If you're an experienced database administrator or data scientist, you're likely to just shrug and say, "well, that's life."  Everyone else is more likely to say, "EEK and YUCK!"

What went wrong?

Why did a simple little question, such as "which directors have worked with Tom Hanks in 2016", turn into a monster of a query and an overly-complicated data model involving a whopping 5 tables?

Well, it's because relational databases are a poor tool for this job!  In fact, they are a poor tool for any job involving "relationships" (connections) between data entities ("movies", "actors" and "directors", in our example), especially "many-to-many relationships" (an actor in many movies, and conversely a movie with many actors.)

The word "relational" in the name "relational database" might give to the uninitiated the impression that they are the perfect tool to model relationships: far from that!  Relational Databases CAN model relationships, but are rather awkward at it!   Ever known someone who "COULD" be in a relationship, but is rather awkward at relationships?  Same story here!

Graph Databases to the Rescue

Several years ago, I worked at a Bioinformatics company where we brought together various large biological datasets.  I came across the astounding Reactome.org project, which I talk about in another blog entry.  I learned that they were using Neo4j, a "graph database" from a company quickly rising to prominence.

At the time, Graph Databases were still relatively new - though they had been around for several years.  I tested out Neo4j and scrutinized its parent company... and quickly saw the immense potential.  There was a good reason why the Reactome project was using Neo4j to deal with something as complex as the set of all reactions within human cells!

A rift arose inside the company where I then worked : a camp, led by me, favored using Neo4j; another camp favored using "Triplestores" (an older technology, at times referred to as RDF - I contrast it with graph databases in another post); and, to make it a full 3-ring circus, yet another camp wanted to just stick to relational databases!

Upper management worried that Neo4j was "too slow."  I proved them wrong by running the whole Reactome database on a run-of-the-mill laptop, with just a little extra memory installed!

Several years have now gone by... I'm long past that job, but Neo4j and other Graph Databases are here to stay - and in fact Neo4j's ecosystem has vastly grown and become extremely dynamic.

I have by now been in several jobs where Neo4j was a key player, and even had to turn down some other Neo4j jobs, such as an offer from the Bank of America.  The large companies, be it BoA, eBay, Walmart, Airbus, Toyota, Marriott Hotel, Verizon, GlaxoSmithKline pharmaceuticals [UPDATE: I worked there in 2021], AbbVie [UPDATE: I currently work there], etc, have been flocking to Neo4j and other graph databases (short listing or longer one), and smaller companies are beginning to take notice.

Needless to say, I'm proud to have "backed the right horse", starting at a time when Neo4j was not as popular!

What is a Graph Database?

It's a powerful, versatile way to model the real world; more than that, it's a massive seismic shift underfoot, which I regard as a game changer!  

By the way, the word "graph" in the name has nothing to do with math plots - in this context, a graph is a "network (mesh) of nodes", familiar from the friend-of-friend diagrams of social media.

Since graph databases are internally based on meshes of connections between "entities" (such as individual movies or actors or directors), establishing and following relationships is an extremely natural operation - rather than the contrived, painful counterpart in relational databases!

For example, returning to our earlier question of finding all directors who worked with Tom Hanks, it's now vastly easier to do that, using a Cypher query - a counterpart of SQL:

MATCH (a:actors) -- (m:movies) -- (d:directors)
WHERE a.name = "Tom Hanks"
AND m.release_date = 2016
RETURN d.name

That's it! 😁  We're asking to traverse the network, starting at the "Tom Hanks" node, labeled "actors", until we reach nodes labeled "directors", passing along the way through a "movies" node whose "release_date" field has the value 2016.  (Labels are reminiscent of relational-database table names, but more versatile.)

Traversing a graph database

Traversing a network is vastly more intuitive than taking a "subset of a Cartesian product" - the counterpart operation in relational databases!
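To make the idea of "traversing" concrete, here's a toy sketch in plain Python.  It is NOT how Neo4j works internally; it's just my illustration of the concept of hopping from node to node.  The dictionary layout, the function name and all the sample data (other than the actor's name) are assumptions made up for this example.

```python
from collections import defaultdict

# Each node has a label, plus optional properties (made-up sample data)
properties = {
    "Tom Hanks":        {"label": "actors"},
    "Some Other Actor": {"label": "actors"},
    "Movie A":          {"label": "movies", "release_date": 2016},
    "Movie B":          {"label": "movies", "release_date": 2016},
    "Movie C":          {"label": "movies", "release_date": 2010},
    "Director X":       {"label": "directors"},
    "Director Y":       {"label": "directors"},
    "Director Z":       {"label": "directors"},
}

# Connections between nodes
links = [
    ("Tom Hanks", "Movie A"), ("Tom Hanks", "Movie C"),
    ("Some Other Actor", "Movie B"),
    ("Movie A", "Director X"), ("Movie B", "Director Y"), ("Movie C", "Director Z"),
]

# Store every link in both directions, since the `--` pattern ignores direction
neighbors = defaultdict(set)
for a, b in links:
    neighbors[a].add(b)
    neighbors[b].add(a)


def directors_who_worked_with(actor, year):
    """Walk actor -> movie (released in the given year) -> director."""
    if properties[actor]["label"] != "actors":
        return []
    found = set()
    for movie in neighbors[actor]:
        node = properties[movie]
        if node["label"] != "movies" or node.get("release_date") != year:
            continue    # Not a movie, or not from the requested year
        for other in neighbors[movie]:
            if properties[other]["label"] == "directors":
                found.add(other)
    return sorted(found)


print(directors_who_worked_with("Tom Hanks", 2016))   # ['Director X']
```

Notice that we never look at the whole dataset: we start at one node and only follow the connections leading out of it.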

In case there was a danger of traveling through our database from "Tom Hanks" to a director by following other paths (such as friendships or common interests, if we stored such data), we can be more specific with the relationship names, appearing in the square brackets:

MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = "Tom Hanks"
AND m.release_date = 2016
RETURN d.name
Note: in part 3 of this series of articles, I'll present a tutorial on how to actually run the above query.
 
Modeling and querying of complex data is much easier - especially data involving many-to-many relationships among several entities ("tables".)  
 
Not only that, but making changes to the data schema is vastly easier; by contrast, with relational databases, a change to the database schema can easily be a "major surgery" to dread!  If your schema is not well-known ahead of time, and/or subject to frequent changes, graph databases are a true lifesaver!  That definitely applies to the biomedical-research field, the industry I usually work in.
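Here's a toy illustration, in plain Python, of why a schema change is so painless in a graph model.  Again, this is only a conceptual sketch and not Neo4j itself; the "producers" label, the "produced_by" relationship and the sample names are hypothetical, made up for the example.

```python
# A toy graph: nodes carry a label, edges carry a relationship name
nodes = {
    "Tom Hanks":  "actors",
    "Movie A":    "movies",
    "Director X": "directors",
}
edges = [
    ("Tom Hanks", "stars_in", "Movie A"),
    ("Movie A", "directed_by", "Director X"),
]


def related(start, relationship):
    """Return the nodes reachable from `start` along the named relationship."""
    return [end for (source, rel, end) in edges
            if source == start and rel == relationship]


# Later on, a brand-new kind of entity and a brand-new kind of relationship
# come along.  We simply add them; nothing that already exists is altered.
# (In a relational model, this would typically mean a new table plus a new join table.)
nodes["Producer P"] = "producers"
edges.append(("Movie A", "produced_by", "Producer P"))

print(related("Movie A", "directed_by"))   # ['Director X']
print(related("Movie A", "produced_by"))   # ['Producer P']
```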
 
The more intuitive quality of graph databases also makes it easier to involve people who aren't data engineers – such as data scientists, researchers or doctors – in brainstorming over modeling the data.

What is Neo4j?

Neo4j started out in Sweden
Neo4j is a company that has been quickly rising to prominence.  And for good reasons!  They have been instrumental in advancing Graph Database technology with their main product, also named Neo4j.

Neo4j has raised $116 Million in series D and E funding (source); so, it's deliciously well-funded... and they've put their large pot of gold to good use to create and polish a remarkable product, and to nurture a large ecosystem.   UPDATE: in June 2021, Neo4j announced another round of funding, a whopping $325M in Series F !

They started out in the period 2000-2007...  so, it's all relatively new but definitely not super-new.  As I like to say, "new enough to be innovative; but not so new as to risk being on the bleeding edge!"

From Sweden with love (EXTRA love if you get the Enterprise version!)
The company started out in Sweden...   Yep, just like IKEA!

Be aware of the existence of 2 versions of Neo4j : like many products nowadays, there's a community version - which is free and open-source - and a paid enterprise version.

Most small and medium companies will probably do just fine with the community version.  If that's the case, make sure to resist the company's many lures to rope you into the paid enterprise version: in particular, steer clear of the so-called "desktop version" and of the graphic tool "Bloom" - both of them require the enterprise version... and one can live eminently well without either of those tools.  Just install the community version of Neo4j on your Linux or Windows 10/11 desktop, or on your cloud VM.  Alternatives to the graphic tool Bloom are the included Neo4j browser interface, as well as the inexpensive 3rd-party tool Commander, and the free open-source project Brain Annex.

If you're a medium or large company, the enterprise version offers scalability and other advanced features.  Generously, Neo4j makes the enterprise version available for free to small and medium companies through their Startup Program.  (But first ask yourself if you really need the enterprise version!)

Alternatives to Neo4j

Neo4j is a graph database, but not the only game in town.  I have looked into some alternatives, but haven't found a compelling reason to jump ship...

It's vital to keep in mind that the query language matters hugely, too.  I personally find Cypher, introduced by the Neo4j company but later made open-source ("OpenCypher"), to be immensely powerful - leaving good ol' SQL in the dust in many ways!   It's elegant, it's easy, it's powerful - and can be used from Python and several other programming languages.
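As a taste of what using Cypher from Python can look like, here's a minimal sketch based on the official "neo4j" Python driver (installed separately, for example with pip).  The function name, the connection address and the credentials are placeholders of mine, and the query assumes the labels and relationship names from the "Tom Hanks" example earlier; adapt them to your own database.

```python
# The earlier query, with the literal values replaced by $parameters
DIRECTORS_QUERY = """
MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = $actor_name
AND m.release_date = $year
RETURN d.name AS director
"""


def find_directors(uri, user, password, actor_name, year):
    """Run the query against a Neo4j database and return the directors' names."""
    # Imported here, so that this file can be loaded even without the driver installed
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(DIRECTORS_QUERY, actor_name=actor_name, year=year)
            # Read the records while the session is still open
            return [record["director"] for record in result]
```

With a database running locally, a call might look like find_directors("bolt://localhost:7687", "neo4j", "your_password", "Tom Hanks", 2016), where the address and login are of course whatever applies to your own installation.  Passing the values as parameters, rather than pasting them into the query string, spares you from quoting headaches.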

A graph database that has been promoted by AWS is Neptune,  a cloud-based solution that lets users use both RDF (the older technology) and property graphs (similar to Neo4j.)  But it traps you in the AWS cloud; by contrast, you can take Neo4j and other graph databases to any cloud, as well as to your own desktop!   In July 2021, Neptune - playing catch up to public demand - at long last started offering support for the OpenCypher query language.

If you are a huge financial corporation, you might need a graph database that supposedly scales up better at huge sizes, such as TigerGraph.  But everyone else is probably just fine with Neo4j, which by the way can also scale up to multiple nodes (in its Enterprise version.)  Worse yet, TigerGraph shoots itself in the foot by not supporting the Cypher query language; they could – bear in mind that Cypher is open source, and various vendors have their own version.  Even AWS stepped down from its do-it-my-way throne, and started offering it in 2021!

Dear TigerGraph, please seriously consider providing support for OpenCypher.

[...] our interest in TigerGraph is deflated to essentially Zero... and, unless we come across a situation where we absolutely must switch, we'll just stick with Neo4j.
(from a message exchange I had with that company)

Also on my radar is Arango, a free and open-source database that can employ multiple modalities, including key/value, document, and graph data, with a common query language.

Remember : what really matters isn't just the product but the ecosystem behind it...  Python is a great example of that truth!  I'm pleased to have witnessed first-hand Neo4j's ecosystem blossom in recent years.  Except for the company's pushiness to steer users towards its Enterprise version, I have been quite pleased with Neo4j and the immense positive impact they're having - personally, it feels like "a new lease on life" after the long plateau in my "marriage" to relational databases!

Oct. 2023 UPDATE:  Years have gone by - and I still haven't found a compelling reason for using graph databases other than Neo4j...

Intro Mini-Tutorials

Keep in mind that the following mini-tutorials are more about the general concept of graph databases, as well as about the (open-source) Cypher query language...  so, they also broadly apply to other products besides Neo4j.

I vote this the best under 2-min intro to Neo4j that I've been able to find:

 Here's perhaps the best 2nd-level intro, again in less than 2 min:

And, finally, a funny, helpful 30-min intro, "Graph Databases Will Change Your Freakin' Life!":

Now, try it live on Neo4j's "sandbox" site!

I created a tutorial about that, in part 2 of this series of articles.

In part 3 I provide an introductory tutorial on the query language Cypher.  In particular, we'll be revisiting  – and actually running  – the "Tom Hanks" example from earlier.

Want to access Neo4j thru Python?   

Part 4 of this series of articles discusses an open-source library to easily use Neo4j from a Python script, app or notebook, with minimal graph-database expertise needed.

Putting it All Together : a Technology Stack on top of a Graph Database

One typically needs a full data-management solution, not just a database.

Part 5 discusses a library, and a conceptual layer, to impose an optional "Schema" (required or recommended/typical structure) on the data - either "strict" or "lax" (permissive), as needed for one's use case.

Part 6 discusses the full technology stack provided by the open-source project BrainAnnex.org.  In particular, a layer to define and manage an optional schema, as lax or strict as you need it to be.  Plus an API layer and a UI layer.

Later parts discuss special topics.


 
