Let me backtrack.
In college, I got a hint of the "pre-relational database" days... Mercifully, that was largely before my time, but - primarily through a class - I got a taste of what the world was like before relational databases. It's an understatement to say: YUCK!
Gratitude for the power and convenience of Relational Databases and SQL - and relief at having narrowly averted life before it! - made me an instant mega-fan of that technology. And for many years I held various jobs that, directly or indirectly, made use of MySQL and other relational databases - whether as a Database Administrator, Full-Stack Developer, Data Scientist, CTO or various other roles.
UPDATE: This article is now part of a series...
But there were thorns in the otherwise happy relationship
As a database expert, one is very used to that. It becomes second nature to use lots of gimmicks - basically hacks - to make real data conform to relational databases. We don't call them "hacks": we glorify them with names such as "foreign keys", "join tables", "normal forms" - but, let's face it - they are ugly hacks! Consider a simple example:
- We have movies, actors and directors
- Movies contain actors; conversely, actors usually appear in multiple movies
- Movies have one or more directors; conversely, directors may work in multiple movies
That's it! A straightforward snapshot of a simple real-world scenario.
And yet, even for something so basic, we immediately need to roll out the exotic heavy tanks of relational databases:
- Primary keys for movies, actors and directors
- A "join table", containing "foreign keys", to handle the many-to-many relationship between movies and actors
- Another "join table" to handle the many-to-many relationship between movies and directors
At this point, if you want to ask "which directors have worked with Tom Hanks in 2016", you are faced with a gnarly "join" of a whopping 5 tables : the 3 main tables and the 2 special "join tables".
All said and done, an SQL monstrosity such as:
SELECT DISTINCT directors.name
FROM directors
JOIN movies_directors ON directors.ID = movies_directors.director_ID
JOIN movies ON movies_directors.movie_ID = movies.ID
JOIN movies_actors ON movies.ID = movies_actors.movie_ID
JOIN actors ON movies_actors.actor_ID = actors.ID
WHERE actors.name = 'Tom Hanks'
AND movies.release_date = 2016
If you're an experienced database administrator or data scientist, you're likely to just shrug and say, "well, that's life." Everyone else is more likely to say, "EEK and YUCK!"
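To make that 5-table model concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite rather than MySQL, only so that it runs without a server). The table names and ID columns follow the query above; the `title` column, the `directors.name` column and every sample row are assumptions made up for illustration.

```python
import sqlite3

# In-memory database: 3 entity tables plus the 2 join tables.
# All rows below are invented sample data, not real filmographies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies    (ID INTEGER PRIMARY KEY, title TEXT, release_date INTEGER);
    CREATE TABLE actors    (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE directors (ID INTEGER PRIMARY KEY, name TEXT);

    -- Join tables: one row per (movie, actor) pair and per (movie, director) pair
    CREATE TABLE movies_actors (
        movie_ID INTEGER REFERENCES movies(ID),
        actor_ID INTEGER REFERENCES actors(ID)
    );
    CREATE TABLE movies_directors (
        movie_ID    INTEGER REFERENCES movies(ID),
        director_ID INTEGER REFERENCES directors(ID)
    );

    INSERT INTO movies    VALUES (1, 'Movie A', 2016), (2, 'Movie B', 2016), (3, 'Movie C', 2010);
    INSERT INTO actors    VALUES (1, 'Tom Hanks'), (2, 'Another Actor');
    INSERT INTO directors VALUES (1, 'Director X'), (2, 'Director Y'), (3, 'Director Z');
    INSERT INTO movies_actors    VALUES (1, 1), (3, 1), (2, 2);
    INSERT INTO movies_directors VALUES (1, 1), (2, 2), (3, 3);
""")

# The question "which directors worked with a given actor in a given year"
# needs all 5 tables, chained through the two join tables.
QUERY = """
    SELECT DISTINCT directors.name
    FROM directors
    JOIN movies_directors ON directors.ID = movies_directors.director_ID
    JOIN movies           ON movies_directors.movie_ID = movies.ID
    JOIN movies_actors    ON movies.ID = movies_actors.movie_ID
    JOIN actors           ON movies_actors.actor_ID = actors.ID
    WHERE actors.name = ? AND movies.release_date = ?
    ORDER BY directors.name
"""

result = [row[0] for row in conn.execute(QUERY, ("Tom Hanks", 2016))]
print(result)   # ['Director X']
```

Even with just three movies, most of the code is plumbing: two extra tables and four join conditions that exist only to connect the things we actually care about.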
What went wrong?
Why did a simple little question, such as "which directors have worked with Tom Hanks in 2016", turn into a monster of a query and an overly-complicated data model involving a whopping 5 tables?
Well, it's because relational databases are a poor tool for this job! In fact, they are a poor tool for any job involving "relationships" (connections) between data entities ("movies", "actors" and "directors", in our example), especially "many-to-many relationships" (an actor in many movies, and conversely a movie with many actors.)
The word "relational" in the name "relational database" might give the uninitiated the impression that these databases are the perfect tool to model relationships: far from it! Relational Databases CAN model relationships, but are rather awkward at it! Ever known someone who "COULD" be in a relationship, but is rather awkward at relationships? Same story here!
Graph Databases to the Rescue
[...] another blog entry. I learned that they were using Neo4j, a "graph database" from a company quickly rising to prominence.
At the time, Graph Databases were still relatively new - though they had been around for several years. I tested out Neo4j and scrutinized its parent company... and quickly saw the immense potential. There was a good reason why the Reactome project was using Neo4j to deal with something as complex as the set of all reactions within human cells!
A rift arose inside the company where I then worked : a camp, led by me, favored using Neo4j; another camp favored using "Triplestores" (an older technology, at times referred to as RDF - I contrast it with graph databases in another post); and, to make it a full 3-ring circus, yet another camp wanted to just stick to relational databases!
Upper management worried that Neo4j was "too slow." I proved them wrong by running the whole Reactome database on a run-of-the-mill laptop, with just a little extra memory installed!
Several years have now gone by... I'm long past that job, but Neo4j and other Graph Databases are here to stay - and in fact Neo4j's ecosystem has vastly grown and become extremely dynamic.
I have by now been in several jobs where Neo4j was a key player, and even had to turn down some other Neo4j jobs, such as an offer from the Bank of America. Large companies, be it BoA, eBay, Walmart, Airbus, Toyota, Marriott Hotel, Verizon, GlaxoSmithKline pharmaceuticals [UPDATE: I worked there in 2021], AbbVie [UPDATE: I currently work there], etc, have been flocking to Neo4j and other graph databases (short listing or longer one), and smaller companies are beginning to take notice.
Needless to say, I'm proud to have "backed the right horse", starting at a time when Neo4j was not as popular!
What is a Graph Database?
By the way, the word "graph" in the name has nothing to do with math plots - in this context, a graph is a "network (mesh) of nodes", familiar from the friend-of-friend diagrams of social media.
Since graph databases are internally based on meshes of connections between "entities" (such as individual movies or actors or directors), establishing and following relationships is an extremely natural operation - rather than the contrived, painful counterpart in relational databases!
For example, returning to our earlier question of finding all directors who worked with Tom Hanks, it's now vastly easier to do that, using a Cypher query - a counterpart of SQL:
MATCH (a:actors) -- (m:movies) -- (d:directors)
WHERE a.name = "Tom Hanks" AND m.release_date = 2016
RETURN DISTINCT d.name
That's it! 😁 We're asking to traverse the network, starting at the "Tom Hanks" node, labeled "actors", until we reach nodes labeled as "directors", and requiring that the path pass through a "movies" node with a "release_date" field value of 2016. (Labels are reminiscent of relational-database table names, but more versatile.)
Traversing a network is vastly more intuitive than taking a "subset of a Cartesian product" - the counterpart operation in relational databases!
In case there was a danger of traveling through our database from "Tom Hanks" to a director by following other paths (such as friendships or common interests, if we stored such data), we can be more specific with the relationship names, appearing in the square brackets:
MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = "Tom Hanks" AND m.release_date = 2016
RETURN DISTINCT d.name
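To get a feel for what "traversing the network" means, here is a toy model in plain Python. It is only a sketch of the idea, not how Neo4j stores or queries data internally: nodes carry a label and properties, relationships are named links, and the question is answered by hopping from node to node. The labels and relationship names follow the Cypher above; the `title` property and all sample data are made up.

```python
from collections import defaultdict

# Nodes: id -> (label, properties). Invented sample data.
nodes = {
    "a1": ("actors",    {"name": "Tom Hanks"}),
    "a2": ("actors",    {"name": "Another Actor"}),
    "m1": ("movies",    {"title": "Movie A", "release_date": 2016}),
    "m2": ("movies",    {"title": "Movie B", "release_date": 2016}),
    "m3": ("movies",    {"title": "Movie C", "release_date": 2010}),
    "d1": ("directors", {"name": "Director X"}),
    "d2": ("directors", {"name": "Director Y"}),
    "d3": ("directors", {"name": "Director Z"}),
}

# Relationships: (start node, relationship name, end node)
edges = [
    ("a1", "stars_in", "m1"), ("a1", "stars_in", "m3"), ("a2", "stars_in", "m2"),
    ("m1", "directed_by", "d1"), ("m2", "directed_by", "d2"), ("m3", "directed_by", "d3"),
]

# Index the relationships by (start node, name), so that following a link
# from a node is a direct lookup rather than a search through all links.
outgoing = defaultdict(list)
for start, rel_name, end in edges:
    outgoing[(start, rel_name)].append(end)


def follow(node_id, rel_name):
    """Return the ids of the nodes reached from node_id via one rel_name link."""
    return outgoing.get((node_id, rel_name), [])


def directors_who_worked_with(actor_name, year):
    """Walk actor -[stars_in]-> movie -[directed_by]-> director."""
    found = set()
    for node_id, (label, props) in nodes.items():
        if label != "actors" or props["name"] != actor_name:
            continue
        for movie_id in follow(node_id, "stars_in"):
            if nodes[movie_id][1]["release_date"] != year:
                continue
            for director_id in follow(movie_id, "directed_by"):
                found.add(nodes[director_id][1]["name"])
    return sorted(found)


print(directors_who_worked_with("Tom Hanks", 2016))   # ['Director X']
```

Notice that there are no join tables anywhere: the links themselves are first-class data, and the query is just a walk along them.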
What is Neo4j?
Neo4j has raised $116 Million in series D and E funding (source); so, it's deliciously well-funded... and they've put their large pot of gold to good use to create and polish a remarkable product, and to nurture a large ecosystem. UPDATE: in June 2021, Neo4j announced another round of funding, a whopping $325M in Series F !
They started out in the period 2000-2007... so, it's all relatively new but definitely not super-new. As I like to say, "new enough to be innovative; but not so new as to risk being on the bleeding edge!"
Be aware of the existence of 2 versions of Neo4j : like many products nowadays, there's a community version - which is free and open-source - and a paid enterprise version.
Most small and medium companies will probably do just fine with the community version. If that's the case, make sure to resist the company's many lures to rope you into the paid enterprise version: in particular, steer clear of the so-called "desktop version" and of the graphic tool "Bloom" - both of them require the enterprise version... and one can live eminently well without either of those tools. Just install the community version of Neo4j on your Linux or Windows 10/11 desktop, or on your cloud VM. Alternatives to the graphic tool Bloom are the included Neo4j browser interface, as well as the inexpensive 3rd-party tool Commander, and the free open-source project Brain Annex.
If you're a medium or large company, the enterprise version offers scalability and other advanced features. Generously, Neo4j makes the enterprise version available for free to small and medium companies through their Startup Program. (But first ask yourself if you really need the enterprise version!)
Alternatives to Neo4j
Neo4j is a graph database, but not the only game in town. I have looked into some alternatives, but haven't found a compelling reason to jump ship...
It's vital to keep in mind that the query language matters hugely, too. I personally find Cypher, introduced by the Neo4j company but later made open-source ("OpenCypher"), to be immensely powerful - leaving good ol' SQL in the dust in many ways! It's elegant, it's easy, it's powerful - and can be used from Python and several other programming languages.
A graph database that has been promoted by AWS is Neptune, a cloud-based solution that lets users use both RDF (the older technology) and property graphs (similar to Neo4j.) But it traps you in the AWS cloud; by contrast, you can take Neo4j and other graph databases to any cloud, as well as to your own desktop! In July 2021, Neptune - playing catch up to public demand - at long last started offering support for the OpenCypher query language.
If you are a huge financial corporation, you might need a graph database that supposedly scales up better at huge sizes, such as TigerGraph. But everyone else is probably just fine with Neo4j, which by the way can also scale up to multiple nodes (in its Enterprise version). Worse yet, TigerGraph shoots itself in the foot by not supporting the Cypher query language. They could: bear in mind that Cypher is open source, and various vendors have their own version. Even AWS stepped down from its do-it-my-way throne, and started offering it in 2021!
Dear TigerGraph, please seriously consider providing support for OpenCypher.
[...] our interest in TigerGraph is deflated to essentially Zero... and, unless we come across a situation where we absolutely must switch, we'll just stick with Neo4j.
(from a message exchange I had with that company)
Also on my radar is Arango, a free and open-source database that can employ multiple modalities, including key/value, document, and graph data, with a common query language.
Remember : what really matters isn't just the product but the ecosystem behind it... Python is a great example of that truth! I'm pleased to have witnessed first-hand Neo4j's ecosystem blossom in recent years. Except for the company's pushiness to steer users towards its Enterprise version, I have been quite pleased with Neo4j and the immense positive impact they're having - personally, it feels like "a new lease on life" after the long plateau in my "marriage" to relational databases!
Keep in mind that the following mini-tutorials are more about the general concept of graph databases, as well as about the (open-source) Cypher query language... so, they also broadly apply to other products besides Neo4j.
I vote this the best under 2-min intro to Neo4j that I've been able to find:
Here's perhaps the best 2nd-level intro, again in less than 2 min:
And, finally, a funny, helpful 30-min intro, "Graph Databases Will Change Your Freakin' Life!":
Now, try it live on Neo4j's "sandbox" site!
Want to access Neo4j thru Python?
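As a quick taste, here is a minimal sketch that runs the earlier Cypher query from Python using Neo4j's official driver (installed with `pip install neo4j`). It assumes a running Neo4j server whose data uses the same labels and relationship names as the examples in this article; the function name, the `$actor_name` and `$year` parameters and the `director` alias are my own choices for illustration.

```python
# The earlier query, with the actor and year turned into parameters
CYPHER_QUERY = """
MATCH (a:actors) -[:stars_in]-> (m:movies) -[:directed_by]-> (d:directors)
WHERE a.name = $actor_name AND m.release_date = $year
RETURN DISTINCT d.name AS director
"""


def find_directors(uri, user, password, actor_name, year):
    """Run the query against a Neo4j server and return the directors' names."""
    # Imported here so that this module still loads on machines
    # where the driver isn't installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(
                CYPHER_QUERY, {"actor_name": actor_name, "year": year}
            )
            return [record["director"] for record in result]
```

With a local server on the default Bolt port, a call might look like `find_directors("bolt://localhost:7687", "neo4j", "your-password", "Tom Hanks", 2016)`; the URI and credentials here are placeholders to replace with your own.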
Putting it All Together : a Technology Stack on top of a Graph Database
One typically needs a full data-management solution, not just a database.
Part 5 discusses the full technology stack provided by the open-source project BrainAnnex.org. In particular, it offers a layer to define and manage an optional schema, as lax or strict as you need it to be, plus an API layer and a UI layer.