Full-Text Search with the Neo4j Graph Database

(UPDATED May 2024) Now that we have discussed a full technology stack based on Neo4j (or other graph databases), and that we have a design and implementation available from the open-source project BrainAnnex.org, what next? What shall we build on top?

Well, how about Full-Text Search?

This article is part of a growing, ongoing series on Graph Databases and Neo4j

Full-Text Searching/Indexing

The Brain Annex open-source project includes an implementation of a design that uses the convenient services of its Schema Layer to index word-based documents in Neo4j.

The Python class FullTextIndexing (source code) provides the necessary methods. It can parse both plain-text and HTML documents (the latter are used, for example, in "formatted notes"); parsing of PDF files and other formats will be added at a later date.

No grammatical analysis (stemming or lemmatizing) is done on the text. However, a long list of common words ("stop words") gets stripped away, which substantially pares down the text to the words that are useful for searching purposes.

For example:

Mr. Joe&sons
A Long–Term business! Find it at > (http://example.com/home)
Visit Joe's "NOW!"

gets pared down (and also stripped of all HTML) into:
['mr', 'joe', 'sons', 'long', 'term', 'business', 'find', 'example', 'home', 'visit']
Several common terms, such as "a", "it", "at" and "now", got dropped, and the repeated "joe" appears only once.

Of course, you could tweak that list of common terms, to better suit your own use case...
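
To make the paring-down more concrete, here is a minimal sketch of the general idea. It is NOT the actual code of the FullTextIndexing class: the function name, the regular expressions and the tiny stop-word list are all assumptions made up for illustration (the real list of common words is much longer).

```python
import html
import re

# Hypothetical, tiny stop-word list, for illustration only;
# the list used by the actual project is much longer
STOP_WORDS = {"a", "an", "the", "it", "at", "to", "of", "and", "now", "http", "https", "com"}


def extract_unique_words(text: str) -> list[str]:
    """Return the unique, lower-case, non-common words of a plain-text or HTML string"""
    no_tags = re.sub(r"<[^>]*>", " ", text)    # Crude removal of HTML tags
    plain = html.unescape(no_tags).lower()     # Decode entities such as &quot;
    words = re.findall(r"[a-z]+", plain)       # Keep only runs of ASCII letters

    seen = set()
    result = []
    for w in words:
        # Drop single letters, stop words, and repeated words
        if len(w) > 1 and w not in STOP_WORDS and w not in seen:
            seen.add(w)
            result.append(w)
    return result


print(extract_unique_words("<p>Hello to the <b>world</b>!</p>"))
```

Tweaking the list of common terms then amounts to editing STOP_WORDS. Note that this sketch only handles ASCII letters, and that its tag-stripping is deliberately simplistic.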

The diagram below shows an example of the Neo4j internal storage of the text indexing for two hypothetical documents, whose metadata is stored in the light-blue nodes on the right. Notice the division of the Neo4j nodes into separate "Schema" (large green box at the top) and "Data" (yellow) sections: this is conveniently managed by the Schema Layer library (as explained in part 5).

One document ("test1.txt"), which uses "Indexer" node 26, contains the exciting text:
"hello to the world !!! ? Welcome to learning how she cooks with potatoes..."
The indexed words appear as magenta circles attached to the brown circle labeled "26"; notice that many common words aren't indexed at all.

The "Indexer" Nodes

The "Indexer" Neo4j nodes (light-brown circles in the above diagram) may look unnecessary... and indeed they are a design choice, not an absolute necessity.

You may wonder, why not directly link the "content metadata" nodes (light blue) to the "word" nodes (magenta)? Sure, we could do that - but then there would be lots of "occurs" relationships intruding into the "CONTENT" module (light blue box on the right.)

With this design, by contrast, all the indexing is managed within its own "INDEXING" module (large magenta box on the left) - and the only contact between the "CONTENT" and "INDEXING" sections consists of single "has_index" relationships: very modular and clean! The "content metadata" nodes (light-blue circles) only need to know their corresponding "Indexer" nodes - and nothing else; all the indexing remains secluded away from content.
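
For readers who like to see a design in code, the layout just described can be mimicked with a small in-memory stand-in. This is only a sketch under stated assumptions: the class TinyIndex and its methods are hypothetical names invented here, plain Python dictionaries take the place of Neo4j nodes and relationships, and none of this is the Brain Annex API.

```python
from collections import defaultdict


class TinyIndex:
    """In-memory stand-in for the graph layout (hypothetical; NOT the Brain Annex API)"""

    def __init__(self):
        self.has_index = {}               # content id -> indexer id (the single "has_index" link)
        self.occurs = defaultdict(set)    # word -> set of indexer ids (the "occurs" links)
        self._next_indexer_id = 1

    def add_document(self, content_id, words):
        """Create an "Indexer" for the given content item, and attach its words to it"""
        indexer_id = self._next_indexer_id
        self._next_indexer_id += 1
        self.has_index[content_id] = indexer_id
        for w in words:
            self.occurs[w].add(indexer_id)
        return indexer_id

    def search(self, word):
        """Follow word -> indexers -> content items; return the sorted content ids"""
        indexer_ids = self.occurs.get(word, set())
        return sorted(c for c, i in self.has_index.items() if i in indexer_ids)


index = TinyIndex()
index.add_document("test1.txt", ["hello", "world", "welcome"])
index.add_document("test2.txt", ["world"])
print(index.search("world"))
```

Notice how the content side only ever stores one indexer id per item: everything about words lives in the "occurs" mapping, which mirrors the separation between the "CONTENT" and "INDEXING" modules.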

Understanding It Better

A first tutorial is currently available as a Jupyter notebook.

It will guide you through creating a structure like the one shown in the diagram in the previous section - and then performing full-text searches for words.

Be aware that it clears the database; so, make sure to run it on a test database, or comment out the line db.empty_dbase()

The easiest way to run it is probably to install the whole Brain Annex technology stack on your computer or VM (instructions), and then use an IDE such as PyCharm to start up JupyterLab (you can use the convenient batch file "quick.bat" at the top level)... but bear in mind that the only dependencies are the NeoAccess and NeoSchema libraries; the rest of the stack isn't needed.

Improving on It?

While the extraction of text from HTML is already part of the functionality ("formatted text notes" that use HTML have long been a staple of Brain Annex), the parsing of PDF files, Word documents, etc., remains on the to-do list.

Without grammatical analysis, related words such as "cooks" and "cooking", or "learn" and "learning", remain separate (see the yellow dashed lines added to the diagram earlier in this article.) Is this ugly internally for the database engineers to look at? Sure, it's ugly - and they might demand a pay raise for the emotional pain! BUT does it really matter to the users? I'd say, not really, as long as users are advised that it's best to search using word STEMS: for example, to search for "learn" rather than "learning" or "learns" - to catch all 3.

Can your users handle such directions? If yes, voilà, you have a simple design that, while it won't win awards for cutting-edge innovation, is nonetheless infinitely better than not having full-text search!
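
Why would searching for a stem catch all the variants? One plausible way, sketched below, is to match the search term as a substring of the indexed words. This is an assumption made for illustration, not a description of how the project's search is actually implemented, and the function name search_by_stem is hypothetical.

```python
def search_by_stem(stem: str, indexed_words: set[str]) -> list[str]:
    """Return, in alphabetical order, all the indexed words that contain the given stem"""
    stem = stem.lower()
    return sorted(w for w in indexed_words if stem in w)


words = {"learn", "learning", "learns", "cooks", "potatoes"}
print(search_by_stem("learn", words))
```

A search for "learning", by contrast, would match only that one word, which is exactly why users are better off typing the stem.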

Since this is an open-source project, the code (currently in late Beta stage) is of course yours to take and tweak to your needs; maybe add some stemming or lemmatizing. Or use the general design ideas presented here as a foundation for your own implementation - with or without the "Schema Layer" that comes with it.

May 2024 UPDATE: new releases of the Brain Annex technology stack now include indexing of the contents of PDF and EPUB documents (using the PyMuPDF library for parsing them.)

Performing Searches

With the word indexing in place, it's a relative breeze to carry out searches. For example, the UI of the Brain Annex open-source project (i.e., the web app that is the top layer in its technology stack) allows word searches through all of the "formatted notes" and "plain-text documents" managed by it. [May 2024 UPDATE: the screenshot below is from an older version; much more complex searches are now supported]

and here are the results of that search, which happens to locate some "formatted notes" (HTML documents) and a "plain-text document" (which contains the searched-for word in its body):

IMPORTANT NOTE: multimedia knowledge management is just one use case of the Brain Annex technology stack, and comes packaged into the standard releases, currently in late Beta. You may also opt to use the lower layers of its technology stack for YOUR own use cases - which may well be totally different. The technology stack was discussed in part 6.

May 2024 UPDATE: searches are now much more sophisticated - and they can be limited to particular Categories and their descendant sub-categories (i.e. an "ontology" may be used to guide and restrict searches): details in this newly released video.

Categories are a semantic layer, and have long been a centerpiece of the BrainAnnex project: "Category" is a high-level entity, with associated functions for UI display/edit, that is extensively used in the top layers of the Brain Annex software stack, to represent ordered "Collections" of "Content Items" - akin to the layout of paragraphs, sections and diagrams in a book chapter - to be discussed in future articles.
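
To illustrate what restricting a search to a Category and its descendant sub-categories can mean in practice, here is a rough in-memory sketch. Everything in it is hypothetical: the function names, the dictionary-based representation of the category tree, and the sample data are assumptions for illustration, not the way Brain Annex stores or queries its Categories.

```python
from collections import deque


def descendants(root, children):
    """Return a set with the root category plus all of its descendant sub-categories"""
    found = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:    # Guard against revisiting a category
                found.add(child)
                queue.append(child)
    return found


def restrict_to_category(hits, item_category, root, children):
    """Keep only the search hits filed under the root category or one of its descendants"""
    allowed = descendants(root, children)
    return [h for h in hits if item_category.get(h) in allowed]


# Hypothetical sample data: parent category -> list of sub-categories
children = {"Science": ["Biology", "Physics"], "Biology": ["Genetics"]}
# Hypothetical sample data: content item -> the category it is filed under
item_category = {"note1": "Genetics", "note2": "Cooking", "note3": "Physics"}

print(restrict_to_category(["note1", "note2", "note3"], item_category, "Science", children))
```

In a graph database the same idea would be expressed as a traversal along the category relationships, rather than with Python dictionaries.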

Future Directions

One final thought: once one has a good set of "word" nodes in a graph database, the possibility of adding connections among them beckons!

It might be as straightforward as creating "variant_of" relationships (perhaps an alternative to traditional stemming or lemmatizing)...

or it might be "related_to" relationships (perhaps aided by the import of a thesaurus database)...



Julian's Polymath Explorations
