Graph Databases (Neo4j) - a revolution in modeling the real world!

(UPDATED 11/2022) - I was "married" to Relational Databases for many years... and it was a good "relationship" full of love and productivity - but SOMETHING WAS MISSING!

Let me backtrack.  

In college, I got a hint of the "pre-relational database" days...  Mercifully, that was largely before my time, but  - primarily through a class - I got a taste of what the world was like before relational databases.  It's an understatement to say: YUCK!

Gratitude for the power and convenience of Relational Databases and SQL - and relief at having narrowly averted life before it! - made me an instant mega-fan of that technology.  And for many years I held various jobs that, directly or indirectly, made use of MySQL and other relational databases - whether as a Database Administrator, Full-Stack Developer, Data Scientist, CTO or various other roles.

But there were thorns in the otherwise happy relationship

The root cause: THE REAL WORLD DOES NOT REALLY RESEMBLE THE TABLES of a relational database.

As a database expert, one is very used to that.  It becomes second nature to use lots of gimmicks - basically hacks - to make real data conform to relational databases.  We don't call them "hacks": we glorify them with names such as "foreign keys", "join tables", "normal forms" - but, let's face it - they are ugly hacks!

Example

Relational databases are, sadly, so divorced from the workings of the real world that it's exceedingly easy to come up with a very basic example that feels like "square pegs in round holes":
  • We have movies, actors and directors
  • Movies contain actors; conversely, actors usually appear in multiple movies
  • Movies have one or more directors; conversely, directors may work in multiple movies

That's it!  A straightforward snapshot of a simple real-world scenario.

And yet, even for something so basic, we immediately need to roll out the exotic heavy tanks of relational databases:

  • Primary keys for movies, actors and directors
  • A "join table", containing "foreign keys", to handle the many-to-many relationship between movies and actors
  • Another "join table" to handle the many-to-many relationship between movies and directors

At this point, if you want to ask "which directors have worked with Tom Hanks in 2016", you are faced with a gnarly "join" of a whopping 5 tables : the 3 main tables and the 2 special "join tables".

All said and done, an SQL monstrosity such as :

SELECT directors.name
    FROM directors
        JOIN movies_directors ON directors.ID = movies_directors.director_ID
        JOIN movies           ON movies_directors.movie_ID = movies.ID
        JOIN movies_actors    ON movies.ID = movies_actors.movie_ID
        JOIN actors           ON movies_actors.actor_ID = actors.ID
    WHERE actors.name = 'Tom Hanks'
    AND movies.release_date = 2016
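
To make the five-table setup concrete, here is a minimal sketch that runs the same question against Python's built-in sqlite3 module.  The table and column names follow the query above; the exact column types, the helper function names and the handful of hand-entered sample rows are my own assumptions, just for illustration.  I also added DISTINCT and ORDER BY so that the output is stable.

```python
import sqlite3

# Assumed schema: 3 main tables plus the 2 "join tables" discussed above.
SCHEMA = """
CREATE TABLE movies    (ID INTEGER PRIMARY KEY, title TEXT, release_date INTEGER);
CREATE TABLE actors    (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE directors (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies_actors (
    movie_ID INTEGER REFERENCES movies(ID),
    actor_ID INTEGER REFERENCES actors(ID)
);
CREATE TABLE movies_directors (
    movie_ID    INTEGER REFERENCES movies(ID),
    director_ID INTEGER REFERENCES directors(ID)
);
"""

QUERY = """
SELECT DISTINCT directors.name
FROM directors
    JOIN movies_directors ON directors.ID = movies_directors.director_ID
    JOIN movies           ON movies_directors.movie_ID = movies.ID
    JOIN movies_actors    ON movies.ID = movies_actors.movie_ID
    JOIN actors           ON movies_actors.actor_ID = actors.ID
WHERE actors.name = ? AND movies.release_date = ?
ORDER BY directors.name
"""


def build_demo_db():
    """Create an in-memory database with a few sample rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                     [(1, "Sully", 2016), (2, "Inferno", 2016), (3, "Cast Away", 2000)])
    conn.executemany("INSERT INTO actors VALUES (?, ?)", [(1, "Tom Hanks")])
    conn.executemany("INSERT INTO directors VALUES (?, ?)",
                     [(1, "Clint Eastwood"), (2, "Ron Howard"), (3, "Robert Zemeckis")])
    # Every many-to-many link needs its own row in a join table
    conn.executemany("INSERT INTO movies_actors VALUES (?, ?)", [(1, 1), (2, 1), (3, 1)])
    conn.executemany("INSERT INTO movies_directors VALUES (?, ?)", [(1, 1), (2, 2), (3, 3)])
    return conn


def directors_who_worked_with(conn, actor_name, year):
    """Return the sorted names of directors of movies from `year` featuring `actor_name`."""
    rows = conn.execute(QUERY, (actor_name, year)).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(directors_who_worked_with(build_demo_db(), "Tom Hanks", 2016))
```

Notice how much scaffolding (five CREATE TABLE statements and two sets of link rows) is needed before the question can even be asked.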

If you're an experienced database administrator or data scientist, you're likely to just shrug and say, "well, that's life."  Everyone else is more likely to say, "EEK and YUCK!"

What went wrong?

Why did a simple little question, such as "which directors have worked with Tom Hanks in 2016", turn into a monster of a query and an overly-complicated data model involving a whopping 5 tables?

Well, it's because relational databases are a poor tool for this job!  In fact, they are a poor tool for any job involving "relationships" (connections) between data entities ("movies", "actors" and "directors", in our example), especially "many-to-many relationships" (an actor in many movies, and conversely a movie with many actors.)

The word "relational" in the name "relational database" might give to the uninitiated the impression that they are the perfect tool to model relationships: far from that!  Relational Databases CAN model relationships, but are rather awkward at it!   Ever known someone who "COULD" be in a relationship, but is rather awkward at relationships?  Same story here!

Graph Databases to the Rescue

Several years ago, I worked at a Bioinformatics company where we brought together various large biological datasets.  I came across the astounding Reactome.org project, which I talk about in another blog entry.  I learned that they were using Neo4j, a "graph database" from a company quickly rising to prominence.

At the time, Graph Databases were still relatively new - though they had been around for several years.  I tested out Neo4j and scrutinized its parent company... and quickly saw the immense potential.  There was a good reason why the Reactome project was using Neo4j to deal with something as complex as the set of all reactions within human cells!

A rift arose inside the company where I then worked : a camp, led by me, favored using Neo4j; another camp favored using "Triplestores" (an older technology, at times referred to as RDF - I contrast it with graph databases in another post); and, to make it a full 3-ring circus, yet another camp wanted to just stick to relational databases!

Upper management worried that Neo4j was "too slow."  I proved them wrong by running the whole Reactome database on a run-of-the-mill laptop, with just a little extra memory installed!

Several years have now gone by... I'm long past that job, but Neo4j and other Graph Databases are here to stay - and in fact Neo4j's ecosystem has vastly grown and become extremely dynamic.

I have by now been in several jobs where Neo4j was a key player, and even had to turn down some other Neo4j jobs, such as an offer from the Bank of America.  The large companies, be it BoA, eBay, Walmart, Airbus, Toyota, Marriott Hotel, Verizon, GlaxoSmithKline pharmaceuticals [UPDATE: I worked there in 2021], etc., have been flocking to Neo4j and other graph databases (short listings or longer one), and smaller companies are beginning to take notice.

Needless to say, I'm proud to have "backed the right horse", starting at a time when Neo4j was not as popular!

What is a Graph Database?

It's a powerful, versatile way to model the real world: more than that, it's a massive seismic shift underfoot, which I regard as a game changer!  

By the way, the word "graph" in the name has nothing to do with math plots - in this context, a graph is a "network (mesh) of nodes", familiar from the friend-of-friend diagrams of social media.

Since graph databases are internally based on meshes of connections between "entities" (such as individual movies or actors or directors), establishing and following relationships is an extremely natural operation - rather than the contrived, painful counterpart in relational databases!

For example, returning to our earlier question of finding all directors who worked with Tom Hanks, it's now vastly easier to do that, using a Cypher query - a counterpart of SQL:

MATCH (a:actors) -- (m:movies) -- (d:directors)
WHERE a.name = "Tom Hanks"
AND m.release_date = 2016
RETURN d.name

That's it! 😁  We're asking to traverse the network, starting at the "Tom Hanks" node, labeled "actors", until we reach nodes labeled as "directors", and requiring the path to pass through a "movies" node with a "release_date" field value of 2016.  (Labels are reminiscent of relational-database table names, but more versatile.)

Traversing a graph database

Traversing a network is vastly more intuitive than taking a "subset of a Cartesian product" - the counterpart operation in relational databases!

In case there were a danger of traveling through our database from "Tom Hanks" to a director by following other paths (such as friendships or common interests, if we stored such data), we can be more specific:

MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = "Tom Hanks"
AND m.release_date = 2016
RETURN d.name
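
To get a feel for what "traversing" means, here is a toy sketch in plain Python of the same walk: start at an actor node, follow "stars_in" links to movies from the right year, then follow "directed_by" links to directors.  Everything in it (the TinyGraph class, its method names, the node ids and the sample data) is made up for illustration; it is emphatically NOT how Neo4j works internally, just a picture of the idea.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory labeled property graph (illustration only, not Neo4j's design)."""

    def __init__(self):
        self.nodes = {}                      # node id -> (label, properties dict)
        self.out_edges = defaultdict(list)   # node id -> [(relationship type, target id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, source, rel_type, target):
        self.out_edges[source].append((rel_type, target))

    def follow(self, node_id, rel_type, target_label):
        """Yield ids of nodes with `target_label` reached by one `rel_type` link."""
        for edge_type, target in self.out_edges[node_id]:
            if edge_type == rel_type and self.nodes[target][0] == target_label:
                yield target


def directors_of(graph, actor_name, year):
    """Walk actor -[stars_in]-> movie -[directed_by]-> director, filtering by year."""
    found = set()
    for node_id, (label, props) in graph.nodes.items():
        if label != "actors" or props.get("name") != actor_name:
            continue
        for movie in graph.follow(node_id, "stars_in", "movies"):
            if graph.nodes[movie][1].get("release_date") != year:
                continue
            for director in graph.follow(movie, "directed_by", "directors"):
                found.add(graph.nodes[director][1]["name"])
    return sorted(found)


def build_demo_graph():
    """A few hand-entered sample nodes and links (illustration only)."""
    g = TinyGraph()
    g.add_node("a1", "actors", name="Tom Hanks")
    g.add_node("m1", "movies", title="Sully", release_date=2016)
    g.add_node("m2", "movies", title="Cast Away", release_date=2000)
    g.add_node("d1", "directors", name="Clint Eastwood")
    g.add_node("d2", "directors", name="Robert Zemeckis")
    g.add_edge("a1", "stars_in", "m1")
    g.add_edge("a1", "stars_in", "m2")
    g.add_edge("m1", "directed_by", "d1")
    g.add_edge("m2", "directed_by", "d2")
    return g


if __name__ == "__main__":
    print(directors_of(build_demo_graph(), "Tom Hanks", 2016))
```

No join tables anywhere: a many-to-many relationship is simply several links in and out of the same node.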
 
Modeling and querying of complex data is much easier - especially data involving many-to-many relationships among several entities ("tables".)  
 
Not only that, but making changes to the data schema is vastly easier; by contrast, with relational databases, a change to the database schema can easily be a "major surgery" to dread!  If your schema is not well-known ahead of time, and/or subject to frequent changes, graph databases are a true lifesaver!  That definitely applies to the biomedical-research field, the industry I usually work in.
 
The more intuitive quality of graph databases also makes it easier to involve non-IT people, such as scientists or doctors, in brainstorms over modeling the data.

What is Neo4j?

Neo4j started out in Sweden
Neo4j is a company that has been quickly rising to prominence.  And for good reasons!  They have been instrumental in advancing Graph Database technology with their main product, also named Neo4j.

Neo4j has raised $116 Million in series D and E funding (source); so, it's deliciously well-funded... and they've put their large pot of gold to good use to create and polish a remarkable product, and to nurture a large ecosystem.

They started out in the period 2000-2007...  so, it's all relatively new but definitely not super-new.  As I like to say, "new enough to be innovative; but not so new as to risk being on the bleeding edge!"

From Sweden with love (EXTRA love if you get the Enterprise version!)
The company started out in Sweden...   Yep, just like IKEA!

Be aware of the existence of 2 versions of Neo4j : like many products nowadays, there's a community version - which is free and open-source - and a paid enterprise version.

Most small and medium companies will probably do just fine with the community version.  If that's the case, make sure to resist the company's many lures to rope you into the paid enterprise version: in particular, steer clear of the so-called "desktop version" and of the graphic tool "Bloom" - both of them require the enterprise version... and one can live eminently well without either of those tools.  Just install the community version of Neo4j on your Linux or Windows 10 desktop, or on your cloud VM.  Alternatives to the graphic tool Bloom are the included Neo4j browser interface, as well as the inexpensive 3rd-party tool Commander, and the free open-source project Brain Annex.

If you're a medium or large company, the enterprise version offers scalability and other advanced features.  Generously, Neo4j makes the enterprise version available for free to small and medium companies through their Startup Program.  (But first ask yourself if you really need the enterprise version!)

Alternatives to Neo4j

Neo4j is a graph database, but not the only game in town.  I have looked into some alternatives, but haven't found a compelling reason to jump ship...

It's vital to keep in mind that the query language matters hugely, too.  I personally find Cypher, introduced by the Neo4j company but later made open-source ("OpenCypher"), to be immensely powerful - leaving good ol' SQL in the dust in many ways!   It's elegant, it's easy, it's powerful - and can be used from Python and several other programming languages.

A graph database that has been promoted by AWS is Neptune,  a cloud-based solution that lets users use both RDF (the older technology) and property graphs (similar to Neo4j.)  But it traps you in the AWS cloud; by contrast, you can take Neo4j and other graph databases to any cloud, as well as to your own desktop!   In July 2021, Neptune - playing catch up to public demand - at long last started offering support for the OpenCypher query language.

If you are a huge financial corporation, you might need a graph database that supposedly scales up better at huge sizes, such as TigerGraph.  But everyone else is probably just fine with Neo4j, which by the way can also scale up to multiple nodes (in its Enterprise version.)  Worse yet, TigerGraph shoots itself in the foot by not supporting the Cypher query language; they could - remember, Cypher is open source and various vendors have their own version.  Even AWS stepped down from its do-it-my-way throne, and started offering it in 2021!

Dear TigerGraph, please seriously consider providing support for OpenCypher.

[...] our interest in TigerGraph is deflated to essentially Zero... and, unless we come across a situation where we absolutely must switch, we'll just stick with Neo4j.
(from a message exchange I had with that company)

Also on my radar is Arango, a free and open-source database that can employ multiple modalities, including key/value, document, and graph data, with a common query language.

Remember : what really matters isn't just the product but the ecosystem behind it...  Python is a great example of that truth!  I'm pleased to have witnessed first-hand Neo4j's ecosystem blossom in recent years.  Except for the company's pushiness to steer users towards its Enterprise version, I have been quite pleased with Neo4j and the immense positive impact they're having - personally, it feels like "a new lease on life" after the long plateau in my "marriage" to relational databases!

Intro Mini-Tutorials

Keep in mind that the following mini-tutorials are more about the general concept of graph databases, as well as about the (open-source) Cypher query language...  so, they also broadly apply to other products besides Neo4j.

I vote this the best under 2-min intro to Neo4j that I've been able to find:

 Here's perhaps the best 2nd-level intro, again in less than 2 min:

And, finally, a funny, helpful 30-min intro, "Graph Databases Will Change Your Freakin' Life!":

Now, try it out live on Neo4j's free "sandbox" site!

Want to access Neo4j thru Python?   

Neo4j provides official support for a powerful but complex library called Neo4j Bolt Driver for Python (in some places referred to as the "Neo4j Python Driver".)
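
For a taste of what using the official driver looks like, here is a minimal sketch that runs the earlier "Tom Hanks" Cypher query from Python.  It assumes that the driver is installed (pip install neo4j) and that a Neo4j server is reachable; the function name, the connection details you pass in, and the labels and relationship names in the query are just the ones from this article's example, not anything your database is guaranteed to contain.

```python
# Cypher query with parameters ($actor_name, $year) instead of hard-wired values
DIRECTORS_QUERY = """
MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = $actor_name AND m.release_date = $year
RETURN DISTINCT d.name AS name
"""


def find_directors(uri, user, password, actor_name, year):
    """Run DIRECTORS_QUERY against the server at `uri`; return a list of director names."""
    # Imported here so that this module can be loaded even without the driver installed
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(DIRECTORS_QUERY, actor_name=actor_name, year=year)
            return [record["name"] for record in result]
    finally:
        driver.close()   # Always release the connection pool
```

A call might look like find_directors("bolt://localhost:7687", "neo4j", "your-password", "Tom Hanks", 2016), where the address and credentials are placeholders for your own setup.  Passing values as query parameters, rather than pasting them into the query string, avoids quoting headaches and injection risks.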

To make use of its power, but without getting bogged down with its complex low-level details,  I wrote a library to make Python interfacing to Neo4j easier, and released it to open source at the beginning of 2021 : https://github.com/BrainAnnex/neo4j-liaison .

As of Dec. 2021, it has been superseded by "NeoAccess", an expanded library that I released (source code on GitHub) as part of the new version of Brain Annex, also based on work that I and others did at GSK pharmaceuticals, and graciously made open source by the company.

The NeoAccess library also comes with an optional companion library, NeoSchema (source code, in late Beta as of Nov. 2022): a schema layer that harmoniously brings together the best of the flexibility ("anything goes!") of graph databases and the "law and order" aspect of relational databases!  (For details, see part 2 of this article.)

Other open-source libraries exist to access Neo4j from Python: users with simple, limited needs might benefit from Py2neo, and Django users might want to look at Neomodel (details about both.)

Putting it All Together : a Technology Stack on top of a Graph Database

One typically needs a full data-management solution, not just a database.  The Schema Layer, briefly mentioned in the previous section, as well as an API and a UI, are all discussed in "part 2": Using Schema in Graph Databases such as Neo4j

 
