
Graph Databases (Neo4j) - a revolution in modeling the real world!

(UPDATED 9/2022) - I was "married" to Relational Databases for many years... and it was a good "relationship" full of love and productivity - but SOMETHING WAS MISSING!

Let me backtrack.  

In college, I got a hint of the "pre-relational database" days...  Mercifully, that was largely before my time, but  - primarily through a class - I got a taste of what the world was like before relational databases.  It's an understatement to say: YUCK!

Gratitude for the power and convenience of Relational Databases and SQL - and relief at having narrowly averted life before it! - made me an instant mega-fan of that technology.  And for many years I held various jobs that, directly or indirectly, made use of MySQL and other relational databases - whether as a Database Administrator, Full-Stack Developer, Data Scientist, CTO or various other roles.

But there were thorns in the otherwise happy relationship

The root cause: THE REAL WORLD DOES NOT REALLY RESEMBLE THE TABLES of a relational database.

As a database expert, one is very used to that.  It becomes second nature to use lots of gimmicks - basically hacks - to make real data conform to relational databases.  We don't call them "hacks" : we glorify them with names such as "foreign keys", "join tables", "normal forms" - but, let's face it - they are ugly hacks!


Relational databases are, sadly, so divorced from the workings of the real world that it's exceedingly easy to come up with a very basic example that feels like "square pegs in round holes":
  • We have movies, actors and directors
  • Movies contain actors; conversely, actors usually appear in multiple movies
  • Movies have one or more directors; conversely, directors may work on multiple movies

That's it!  A straightforward snapshot of a simple real-world scenario.

And yet, even for something so basic, we immediately need to roll out the exotic heavy tanks of relational databases:

  • Primary keys for movies, actors and directors
  • A "join table", containing "foreign keys", to handle the many-to-many relationship between movies and actors
  • Another "join table" to handle the many-to-many relationship between movies and directors

At this point, if you want to ask "which directors have worked with Tom Hanks in 2016", you are faced with a gnarly "join" of a whopping 5 tables : the 3 main tables and the 2 special "join tables".

All said and done, an SQL monstrosity such as :

    -- The "name" columns are assumed here
    SELECT directors.name
    FROM directors
        JOIN movies_directors ON directors.ID = movies_directors.director_ID
        JOIN movies ON movies_directors.movie_ID = movies.ID
        JOIN movies_actors ON movies.ID = movies_actors.movie_ID
        JOIN actors ON movies_actors.actor_ID = actors.ID
    WHERE actors.name = 'Tom Hanks'
        AND movies.release_date = 2016

If you're an experienced database administrator or data scientist, you're likely to just shrug and say, "well, that's life."  Everyone else is more likely to say, "EEK and YUCK!"
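To make the 5-table setup concrete, here's a minimal sketch in Python, using the built-in sqlite3 module with an in-memory database.  The column names ("name", "title"), the helper function names, and the handful of sample rows are assumptions made purely for illustration; "release_date" is stored as a plain year, as in the query above.

```python
import sqlite3

# The 3 main tables plus the 2 "join tables".
# Column names such as "name" and "title" are assumptions for illustration.
SCHEMA = """
CREATE TABLE movies    (ID INTEGER PRIMARY KEY, title TEXT, release_date INTEGER);
CREATE TABLE actors    (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE directors (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies_actors (
    movie_ID INTEGER REFERENCES movies(ID),
    actor_ID INTEGER REFERENCES actors(ID)
);
CREATE TABLE movies_directors (
    movie_ID    INTEGER REFERENCES movies(ID),
    director_ID INTEGER REFERENCES directors(ID)
);
"""

# The 5-table join, with placeholders for the actor name and the year
QUERY = """
SELECT DISTINCT directors.name
FROM directors
    JOIN movies_directors ON directors.ID = movies_directors.director_ID
    JOIN movies           ON movies_directors.movie_ID = movies.ID
    JOIN movies_actors    ON movies.ID = movies_actors.movie_ID
    JOIN actors           ON movies_actors.actor_ID = actors.ID
WHERE actors.name = ?
  AND movies.release_date = ?
ORDER BY directors.name
"""


def build_sample_db():
    """Create an in-memory database holding a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                     [(1, "Sully", 2016), (2, "Inferno", 2016), (3, "Forrest Gump", 1994)])
    conn.executemany("INSERT INTO actors VALUES (?, ?)",
                     [(1, "Tom Hanks"), (2, "Felicity Jones")])
    conn.executemany("INSERT INTO directors VALUES (?, ?)",
                     [(1, "Clint Eastwood"), (2, "Ron Howard"), (3, "Robert Zemeckis")])
    # Join-table rows: (movie_ID, actor_ID) and (movie_ID, director_ID)
    conn.executemany("INSERT INTO movies_actors VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 1), (2, 2)])
    conn.executemany("INSERT INTO movies_directors VALUES (?, ?)",
                     [(1, 1), (2, 2), (3, 3)])
    return conn


def directors_for(conn, actor_name, year):
    """Return the names of directors who worked with the given actor in the given year."""
    return [row[0] for row in conn.execute(QUERY, (actor_name, year))]


if __name__ == "__main__":
    print(directors_for(build_sample_db(), "Tom Hanks", 2016))
```

Notice how much of the code is plumbing: two of the five tables, and four of the `INSERT` statements' worth of ID pairs, exist only to record "who is connected to whom".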

What went wrong?

Why did a simple little question, such as "which directors have worked with Tom Hanks in 2016", turn into a monster of a query and an overly-complicated data model involving a whopping 5 tables?

Well, it's because relational databases are a poor tool for this job!  In fact, they are a poor tool for any job involving "relationships" (connections) between data entities ("movies", "actors" and "directors", in our example), especially "many-to-many relationships" (an actor in many movies, and conversely a movie with many actors.)

The word "relational" in the name "relational database" might give the uninitiated the impression that they are the perfect tool to model relationships: far from it!  (The name actually comes from the mathematical notion of a "relation", essentially a table of rows, rather than from relationships between entities.)  Relational Databases CAN model relationships, but are rather awkward at it!   Ever known someone who "COULD" be in a relationship, but is rather awkward at relationships?  Same story here!

Graph Databases to the Rescue

Several years ago, I worked at a Bioinformatics company where we brought together various large biological datasets.  I came across the astounding Reactome project, which I talk about in another blog entry.  I learned that they were using Neo4j, a "graph database" from a company quickly rising to prominence.

At the time, Graph Databases were still relatively new - though they had been around for several years.  I tested out Neo4j and scrutinized its parent company... and quickly saw the immense potential.  There was a good reason why the Reactome project was using Neo4j to deal with something as complex as the set of all reactions within human cells!

A rift arose inside the company where I then worked : a camp, led by me, favored using Neo4j; another camp favored using "Triplestores" (an older technology, at times referred to as RDF - I contrast it with graph databases in another post); and, to make it a full 3-ring circus, yet another camp wanted to just stick to relational databases!

Upper management worried that Neo4j was "too slow."  I proved them wrong by running the whole Reactome database on a run-of-the-mill laptop, with just a little extra memory installed!

Several years have now gone by... I'm long past that job, but Neo4j and other Graph Databases are here to stay - and in fact Neo4j's ecosystem has vastly grown and become extremely dynamic.

I have by now held several jobs where Neo4j was a key player, and even had to turn down some other Neo4j jobs, such as an offer from Bank of America.  Large companies, be it BoA, eBay, Walmart, Airbus, Toyota, Marriott, Verizon, GlaxoSmithKline pharmaceuticals [UPDATE: I worked there in 2021], etc, have been flocking to Neo4j and other graph databases (short listing or longer one), and smaller companies are beginning to take notice.

Needless to say, I'm proud to have "backed the right horse", starting at a time when Neo4j was not as popular!

What is a Graph Database?

It's a powerful, versatile way to model the real world: more than that, it's a massive seismic shift underfoot, which I regard as a game changer!  

By the way, the word "graph" in the name has nothing to do with math plots - in this context, a graph is a "network (mesh) of nodes", familiar from the friend-of-friend diagrams of social media.

Since graph databases are internally based on meshes of connections between "entities" (such as individual movies or actors or directors), establishing and following relationships is an extremely natural operation - rather than the contrived, painful counterpart in relational databases!

For example, returning to our earlier question of finding all directors who worked with Tom Hanks, it's now vastly easier to do that, using a Cypher query - a counterpart of SQL:

MATCH (a:actors) -- (m:movies) -- (d:directors)
WHERE a.name = "Tom Hanks"
AND m.release_date = 2016
RETURN d.name

That's it! 😁  We're asking to traverse the network, starting at the "Tom Hanks" node, labeled "actors", until we reach nodes labeled as "directors", and requiring the path to pass through a "movies" node with a "release_date" field value of 2016.  (Labels are reminiscent of relational-database table names, but more versatile.)

Traversing a graph database

Traversing a network is vastly more intuitive than taking a "subset of a Cartesian product" - the counterpart operation in relational databases!
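To get a feel for what "traversing" means, here's a toy sketch in plain Python: a tiny in-memory graph with labeled nodes and named connections.  This is NOT how Neo4j is implemented internally; the class `TinyGraph`, the relationship names "stars_in" and "directed_by", and the sample data are all assumptions chosen for illustration.

```python
from collections import defaultdict


class TinyGraph:
    """A toy labeled graph: nodes carry a label and properties, edges carry a name."""

    def __init__(self):
        self.nodes = {}                  # node id -> (label, properties dict)
        self.edges = defaultdict(list)   # node id -> [(relationship name, target id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, source, rel, target):
        self.edges[source].append((rel, target))

    def follow(self, node_id, rel, target_label):
        """Yield the nodes reached from node_id via `rel` that carry `target_label`."""
        for name, target in self.edges[node_id]:
            if name == rel and self.nodes[target][0] == target_label:
                yield target


def directors_for(graph, actor_name, year):
    """Walk actor -> movie -> director, keeping only movies from the given year."""
    found = set()
    for node_id, (label, props) in graph.nodes.items():
        if label != "actors" or props.get("name") != actor_name:
            continue
        for movie in graph.follow(node_id, "stars_in", "movies"):
            if graph.nodes[movie][1].get("release_date") != year:
                continue
            for director in graph.follow(movie, "directed_by", "directors"):
                found.add(graph.nodes[director][1]["name"])
    return sorted(found)


def build_sample_graph():
    """A handful of sample nodes and connections."""
    g = TinyGraph()
    g.add_node("a1", "actors", name="Tom Hanks")
    g.add_node("a2", "actors", name="Felicity Jones")
    g.add_node("m1", "movies", title="Sully", release_date=2016)
    g.add_node("m2", "movies", title="Inferno", release_date=2016)
    g.add_node("m3", "movies", title="Forrest Gump", release_date=1994)
    g.add_node("d1", "directors", name="Clint Eastwood")
    g.add_node("d2", "directors", name="Ron Howard")
    g.add_node("d3", "directors", name="Robert Zemeckis")
    for movie in ("m1", "m2", "m3"):
        g.add_edge("a1", "stars_in", movie)
    g.add_edge("a2", "stars_in", "m2")
    g.add_edge("m1", "directed_by", "d1")
    g.add_edge("m2", "directed_by", "d2")
    g.add_edge("m3", "directed_by", "d3")
    return g


if __name__ == "__main__":
    print(directors_for(build_sample_graph(), "Tom Hanks", 2016))
```

No join tables and no ID bookkeeping: each connection is stored directly as a connection, and the question is answered by hopping from node to node.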

In case there were a danger of traveling through our database from "Tom Hanks" to a director by following other paths (such as friendships or common interests, if we stored such data), we can be more specific:

MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = "Tom Hanks"
AND m.release_date = 2016
RETURN d.name

Modeling and querying of complex data is much easier - especially data involving many-to-many relationships among several entities ("tables").

Not only that, but making changes to the data schema is vastly easier; by contrast, with relational databases, a change to the database schema can easily be a "major surgery" to dread!  If your schema is not well known ahead of time, and/or is subject to frequent changes, graph databases are a true lifesaver!  That definitely applies to the biomedical-research field, the industry I usually work in.

The more intuitive quality of graph databases also makes it easier to involve non-IT people, such as scientists or doctors, in brainstorms over modeling the data.

What is Neo4j?

Neo4j is a company that has been quickly rising to prominence.  And for good reasons!  They have been instrumental in advancing Graph Database technology with their main product, also named Neo4j.

Neo4j has raised $116 Million in series D and E funding (source); so, it's deliciously well-funded... and they've put their large pot of gold to good use to create and polish a remarkable product, and to nurture a large ecosystem.

They started out in the period 2000-2007...  so, it's all relatively new but definitely not super-new.  As I like to say, "new enough to be innovative; but not so new as to risk being on the bleeding edge!"

From Sweden with love (EXTRA love if you get the Enterprise version!)
The company started out in Sweden...   Yep, just like IKEA!

Be aware of the existence of 2 versions of Neo4j : like many products nowadays, there's a community version - which is free and open-source - and a paid enterprise version.

Most small and medium companies will probably do just fine with the community version.  If that's the case, make sure to resist the company's many lures to rope you into the paid enterprise version: in particular, steer clear of the so-called "desktop version" and of the graphic tool "Bloom" - both of them require the enterprise version... and one can live eminently well without either of those tools.  Just install the community version of Neo4j on your Linux or Windows 10 desktop, or on your cloud VM.  Alternatives to the graphic tool Bloom are the included Neo4j browser interface, as well as the inexpensive 3rd-party tool Commander, and the free open-source project Brain Annex.

If you're a medium or large company, the enterprise version offers scalability and other advanced features.  Generously, Neo4j makes the enterprise version available for free to small and medium companies through their Startup Program.  (But first ask yourself if you really need the enterprise version!)

Alternatives to Neo4j

Neo4j is a graph database, but not the only game in town.  I have looked into some alternatives, but haven't found a compelling reason to jump ship...

It's vital to keep in mind that the query language matters hugely, too.  I personally find Cypher, introduced by the Neo4j company but later made open-source ("OpenCypher"), to be immensely powerful - leaving good ol' SQL in the dust in many ways!   It's elegant, it's easy, it's powerful - and can be used from Python and several other programming languages.

A graph database that has been promoted by AWS is Neptune,  a cloud-based solution that lets users use both RDF (the older technology) and property graphs (similar to Neo4j.)  But it traps you in the AWS cloud; by contrast, you can take Neo4j and other graph databases to any cloud, as well as to your own desktop!   In July 2021, Neptune - playing catch up to public demand - at long last started offering support for the OpenCypher query language.

If you are a huge financial corporation, you might need a graph database that supposedly scales up better at huge sizes, such as TigerGraph.  But everyone else is probably just fine with Neo4j, which by the way can also scale up to multiple nodes (in its Enterprise version.)  Worse yet, TigerGraph shoots itself in the foot by not supporting the Cypher query language; they could - remember, Cypher is open source and various vendors have their own version.  Even AWS stepped down from its do-it-my-way throne, and started offering it in 2021!

Dear TigerGraph, please seriously consider providing support for OpenCypher.

[...] our interest in TigerGraph is deflated to essentially Zero... and, unless we come across a situation where we absolutely must switch, we'll just stick with Neo4j.
(from a message exchange I had with that company)

Also on my radar is Arango, a free and open-source database that can employ multiple modalities, including key/value, document, and graph data, with a common query language.

Remember : what really matters isn't just the product but the ecosystem behind it...  Python is a great example of that truth!  I'm pleased to have witnessed first-hand Neo4j's ecosystem blossom in recent years.  Except for the company's pushiness to steer users towards its Enterprise version, I have been quite pleased with Neo4j and the immense positive impact they're having - personally, it feels like "a new lease on life" after the long plateau in my "marriage" to relational databases!

Intro Mini-Tutorials

Keep in mind that the following mini-tutorials are more about the general concept of graph databases, as well as about the (open-source) Cypher query language...  so, they also broadly apply to other products besides Neo4j.

I vote this the best under 2-min intro to Neo4j that I've been able to find:

 Here's perhaps the best 2nd-level intro, again in less than 2 min:

And, finally, a funny, helpful 30-min intro, "Graph Databases Will Change Your Freakin' Life!":

Now, try it out live on Neo4j's free "sandbox" site!

Want to access Neo4j thru Python?   

Neo4j provides official support for a powerful but complex library called Neo4j Bolt Driver for Python (in some places referred to as the "Neo4j Python Driver".)
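For a taste of what using the official driver directly looks like, here's a minimal sketch (it needs `pip install neo4j` and a running Neo4j server to actually return anything).  The function name `find_directors`, the "name" property, and the connection details in the usage note are assumptions for illustration; the labels and relationship names are the ones from the movie example earlier in this post.

```python
# Cypher with $parameters: the values are passed separately from the query text,
# rather than being pasted into the string
QUERY = """
MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = $actor_name AND m.release_date = $year
RETURN DISTINCT d.name AS director
"""


def find_directors(uri, user, password, actor_name, year):
    """Run the query against a Neo4j server and return a list of director names."""
    # Imported here so that this module can be loaded even where the driver isn't installed
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(QUERY, actor_name=actor_name, year=year)
            # Each record exposes the RETURNed columns by name
            return [record["director"] for record in result]
    finally:
        driver.close()
```

A call might look like `find_directors("bolt://localhost:7687", "neo4j", "your-password", "Tom Hanks", 2016)`, where the address and credentials are placeholders for your own setup.  Even in this tiny example, you can see the driver/session/result bookkeeping that a higher-level library can hide.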

To make use of its power, but without getting bogged down with its complex low-level details,  I wrote a library to make Python interfacing to Neo4j easier, and released it to open source at the beginning of 2021 : .

As of Dec. 2021, it has been superseded by "NeoAccess", an expanded library that I released (source code on GitHub) as part of the new version of Brain Annex, also based on work that I and others did at GSK pharmaceuticals, and graciously made open source by the company.

The NeoAccess library also comes with an optional companion library, NeoSchema (source code, in late Beta as of Sept. 2022): a schema layer that harmoniously brings together the best of the flexibility ("anything goes!") of graph databases and the "law and order" aspect of relational databases!

Other open-source libraries exist: users with simple, limited needs might benefit from Py2neo, and Django users might want to look at Neomodel  (details about both.)

Putting it All Together : a Technology Stack on top of a Graph Database

For many practical use cases, one needs a full data-management solution, not just a database.  So, the next natural step is to add an API and possibly a UI.

Things start getting especially exciting and powerful when the UI is aware of the different data types (for example, as specified in the Schema layer described in the previous section), and has the capability to personalize the display and editing mode of data records based on their types ("classes") - perhaps with plugins to provide modularity and easy expansion.

Well, that's exactly what the new version (5) of the open-source project Brain Annex does!  As of Sept. 2022, it's in a late Beta stage.   One use case of such a system is to be a multimedia content management system, which is what the old ("vintage") versions of Brain Annex were, before switching to being Neo4j-based, which leads to a far more general system.
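As a rough illustration of the general idea of a type-aware display with plugins, here's a toy sketch in Python.  To be clear, this is NOT Brain Annex's actual code or design: the registry `RENDERERS`, the `renderer` decorator and the record fields are all hypothetical, made up for this example.

```python
# Hypothetical plugin registry: class name -> function that displays a record of that class
RENDERERS = {}


def renderer(class_name):
    """Decorator that registers a display function (a "plugin") for a data class."""
    def register(func):
        RENDERERS[class_name] = func
        return func
    return register


@renderer("movies")
def render_movie(record):
    return f"{record['title']} ({record['release_date']})"


@renderer("actors")
def render_actor(record):
    return f"Actor: {record['name']}"


def render_generic(record):
    """Fallback for classes with no dedicated plugin: list all fields alphabetically."""
    return ", ".join(f"{key}: {value}" for key, value in sorted(record.items()))


def render(class_name, record):
    """Pick the display function based on the record's class."""
    return RENDERERS.get(class_name, render_generic)(record)


if __name__ == "__main__":
    print(render("movies", {"title": "Sully", "release_date": 2016}))
    print(render("directors", {"name": "Clint Eastwood"}))
```

Supporting a new data type then means adding one small function, without touching the rest of the UI code.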

Armed with an API, and possibly a UI, one can for example create a standalone web app, or a control panel for an existing website or web app.  For example, this is how Brain Annex does it:


