Full-Text Search with the Neo4j Graph Database

(UPDATED Sep. 2023)  In part 5 (Using Schema in Graph Databases) we discussed the concept of a Schema Layer, and a design and implementation available from the open-source project BrainAnnex.org.

Now that we have such a layer, what shall we build on top of it?   Well, how about Full-Text Search?

This is part 7 of an ongoing series on Graph Databases and Neo4j. 

part 1 : Graph Databases (Neo4j) - a revolution in modeling the real world!

part 2 : Neo4j Sandbox Tutorial : try Neo4j and learn Cypher the free and easy way

part 3 : Neo4j & Cypher Tutorial : Getting Started with a Graph Database and its Query Language

part 4 : Using Neo4j with Python : the Open-Source Library NeoAccess

part 5 : Using Schema in Graph Databases such as Neo4j

part 6 : Putting it All Together - a Technology Stack on Top of a (Neo4j) Graph Database

part 7 : (SPECIAL TOPIC) Full-Text Search with the Neo4j Graph Database

part 8 (upcoming!) : (SPECIAL TOPIC) Document Management

Full-Text Searching/Indexing

Starting with the Version 5, Beta 26.1 release, the Brain Annex open-source project includes a straightforward but working implementation of a design that uses the convenient services of its Schema Layer, to provide indexing of word-based documents using Neo4j.

The Python class FullTextIndexing (source code) provides the necessary methods, and it can parse both plain-text and HTML documents (for example, the ones used in "formatted notes"); parsing of PDF files and other formats will be added at a later date.

No grammatical analysis (stemming or lemmatizing) is done on the text.  However, a long list of common words ("stop words") is used to strip such words away, substantially paring down the text to the words that are useful for searching purposes.

For example:

Mr. Joe&sons
A Long–Term business! Find it at > (http://example.com/home)
Visit Joe's "NOW!"

gets pared down (and stripped of all HTML) into: ['mr', 'joe', 'sons', 'long', 'term', 'business', 'find', 'example', 'home', 'visit']   Several common terms, such as "a", "it" and "at", got dropped from that list.

Of course, you could tweak that list of common terms, to better suit your own use case...
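As a rough illustration of that paring-down step, here is a minimal sketch that uses only the Python standard library.  It is NOT the actual code of the FullTextIndexing class: the function name extract_unique_words, the helper class, the HTML markup of the sample, and the tiny stop-word set are all assumptions made up for this example (the real list of common words is far longer).

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch)
STOP_WORDS = {"a", "an", "the", "it", "at", "to", "now", "s", "http", "https", "com"}


class _TextCollector(HTMLParser):
    """Collect the text found between HTML tags, discarding the tags themselves"""

    def __init__(self):
        super().__init__()      # By default, entities such as &amp; get decoded
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_unique_words(text: str) -> list:
    """
    Strip HTML, lower-case the text, split it into alphabetic words,
    and drop stop words and duplicates (keeping the order of first appearance)
    """
    collector = _TextCollector()
    collector.feed(text)
    collector.close()           # Flush any text still buffered by the parser
    plain = " ".join(collector.chunks)

    words = re.findall(r"[a-z]+", plain.lower())
    unique = []
    for w in words:
        if w not in STOP_WORDS and w not in unique:
            unique.append(w)
    return unique


# Assumed HTML markup for the example text shown earlier
sample = ('<p>Mr. Joe&amp;sons</p>'
          '<p>A Long&ndash;Term business! Find it at &gt; (http://example.com/home)</p>'
          '<p>Visit Joe\'s <b>"NOW!"</b></p>')

print(extract_unique_words(sample))
# ['mr', 'joe', 'sons', 'long', 'term', 'business', 'find', 'example', 'home', 'visit']
```

Adding or removing entries in STOP_WORDS is all it takes to change which terms survive.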

The diagram below shows an example of the Neo4j internal storage of the text indexing for two hypothetical documents, whose metadata is stored in the light-blue nodes on the right.  Notice the division of the Neo4j nodes into separate "Schema" (large green box at the top) and "Data" (yellow) sections: this is conveniently managed by the Schema Layer library (as explained in part 5).

One document ("test1.txt"), which uses "Indexer" node 26, contains the exciting text: 

"hello to the world !!! ?  Welcome to learning how she cooks with potatoes..." 

The indexed words appear as magenta circles attached to the brown circle labeled "26";  notice that many common words aren't indexed at all.

The "Indexer" Nodes

The "Indexer" Neo4j nodes (light-brown circles in the above diagram) may look unnecessary...  and indeed they are a design choice, not an absolute necessity.  

You may wonder, why not directly link the "content metadata" nodes (light blue) to the "word" nodes (magenta)?  Sure, we could do that - but then there would be lots of "occurs" relationships intruding into the "CONTENT" module (light blue box on the right.)  

With this design, by contrast, all the indexing is managed within its own "INDEXING" module (large magenta box on the left) - and the only contact between the "CONTENT" and "INDEXING" sections is through single "has_index" relationships: very modular and clean!  The "content metadata" nodes (light-blue circles) only need to know their corresponding "Indexer" nodes - and nothing else; all the indexing remains secluded away from content.
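To make the structure a bit more concrete, here is a hedged sketch of a Cypher query that could create it.  Only the relationship names "has_index" and "occurs" come from this article; the node labels (Indexer, Word), the property names (item_id, name), the relationship directions and the function name build_indexing_query are assumptions for illustration.  The sketch also bypasses the Schema Layer entirely, which the actual implementation relies on.

```python
def build_indexing_query(content_id: int, words: list) -> tuple:
    """
    Return a (Cypher query, parameters) pair that attaches a single Indexer node
    to a content node, and links each unique word to that Indexer node.
    Labels, property names and relationship directions are assumed for this sketch.
    """
    query = """
    MATCH (c) WHERE c.item_id = $content_id
    MERGE (c)-[:has_index]->(i:Indexer)
    WITH i
    UNWIND $words AS word_name
    MERGE (w:Word {name: word_name})
    MERGE (w)-[:occurs]->(i)
    """
    params = {"content_id": content_id, "words": sorted(set(words))}
    return query, params


query, params = build_indexing_query(100, ["hello", "world", "hello"])
print(params)       # {'content_id': 100, 'words': ['hello', 'world']}
```

The pair could then be handed to any Neo4j client, for example session.run(query, params) with the official Python driver.  Notice that the content node is touched by exactly one relationship, no matter how many words get indexed.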

Understanding It Better

A first tutorial is currently available as a Jupyter notebook.

It will guide you through creating a structure like the one shown in the diagram in the previous section - and then performing full-text searches for words.

Be aware that it clears the database; so, make sure to run it on a test database, or comment out the line  db.empty_dbase()

The easiest way to run it is probably to install the whole Brain Annex technology stack on your computer or VM (instructions), and then use an IDE such as PyCharm to start up JupyterLab (you can use the convenient batch file "quick.bat" at the top level)... but bear in mind that the only dependencies are the NeoAccess and NeoSchema libraries; the rest of the stack isn't needed.

Improving on It?

While the extraction of text from HTML is already part of the functionality ("formatted text notes" that use HTML have long been a staple of Brain Annex), the parsing of PDF files, Word documents, etc., remains on the to-do list.

Without grammatical analysis, related words such as "cooks" and "cooking", or "learn" and "learning", remain separate (yellow dashed lines added to the diagram earlier in this article.)  Is this ugly internally for the database engineers to look at?  Sure, it's ugly - and they might demand a pay raise for the emotional pain!  BUT does it really matter to the users?   I'd say, not really, as long as users are advised that it's best to search using word STEMS: for example, to search for "learn" rather than "learning" or "learns" - to catch all 3.
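Why would searching by stem catch all the variants?  Here is a tiny in-memory sketch, under the assumption that a search matches any indexed word CONTAINING the search term.  The function search_word, the dictionary layout and the file names "test2.txt" and "test3.txt" are hypothetical, just for illustration.

```python
def search_word(index: dict, fragment: str) -> set:
    """
    Return the set of documents containing any indexed word that includes
    the given fragment (case-insensitive substring match)
    """
    fragment = fragment.lower()
    found = set()
    for word, docs in index.items():
        if fragment in word:
            found |= docs       # Add all the documents where this word occurs
    return found


# Toy index: each word maps to the documents in which it occurs
index = {
    "learning": {"test1.txt"},
    "learns": {"test2.txt"},
    "learn": {"test3.txt"},
    "cooks": {"test1.txt"},
}

print(sorted(search_word(index, "learn")))      # ['test1.txt', 'test2.txt', 'test3.txt']
print(sorted(search_word(index, "learning")))   # ['test1.txt']
```

Searching for the stem finds all 3 documents, while searching for the longer form "learning" misses two of them.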

Can your users handle such directions?  If yes, voilà, you have a simple design that, while it won't win awards for cutting-edge innovation, is nonetheless infinitely better than not having full-text search!

Being an open-source project, this code (currently in late Beta stage) is something that you can of course just take and tweak to your needs; maybe add some stemming or lemmatizing.  Or use the general design ideas presented here as a foundation for your own implementation - with or without the "Schema Layer" that comes with it.

Performing Searches

With the word indexing in place, it's a relative breeze to carry out searches.  For example, the current UI of the Brain Annex open-source project (i.e., the web app that is the top layer in its technology stack) recently added 1-word searches through all of the "formatted notes" and "plain-text documents" managed by it.

The results of one such search happen to include some "formatted notes" (HTML documents) and a "plain-text document" (which contains the searched-for word in its body).
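For the curious, a 1-word search of this kind might boil down to a single Cypher query.  The sketch below is an assumption-laden illustration, not the query actually used by Brain Annex: the labels Word and Indexer, the property "name", the relationship directions, the use of CONTAINS and the function name build_search_query are all hypothetical; only the relationship names "occurs" and "has_index" come from this article.

```python
def build_search_query(word: str) -> tuple:
    """
    Return a (Cypher query, parameters) pair that locates all the content nodes
    whose Indexer node is linked to a word containing the given search term.
    Labels, property names and relationship directions are assumed for this sketch.
    """
    query = """
    MATCH (w:Word)-[:occurs]->(:Indexer)<-[:has_index]-(c)
    WHERE w.name CONTAINS $fragment
    RETURN DISTINCT c
    """
    # Words are assumed to be stored in lower case, so the search term is normalized
    return query, {"fragment": word.strip().lower()}


query, params = build_search_query("  Learn ")
print(params)       # {'fragment': 'learn'}
```

The traversal goes from the matching words, through their "Indexer" nodes, to the content metadata; DISTINCT avoids returning the same document twice when several of its words match.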

IMPORTANT NOTE:  multimedia knowledge management is just one use case of the Brain Annex technology stack, and comes packaged into the standard releases, currently in late Beta.  You may also opt to use the lower layers of its technology stack for YOUR own use cases - which may well be totally different.  The technology stack was discussed in part 6.

Future Directions

One final thought: once one has a good set of "word" nodes in a graph database, the possibility of adding connections among them beckons!  

It might be as straightforward as creating "variant_of" relationships (perhaps an alternative to traditional stemming or lemmatizing)...  

or it might be "related_to" relationships (perhaps aided by the import of a thesaurus database)... 

or it might be semantic layers, such as connections between individual words and high-level entities, for example "Categories".  ("Category" is a high-level entity, with associated functions for UI display/edit, that is extensively used in the top layers of the Brain Annex software stack, to represent ordered "Collections" of "Content Items" - akin to the layout of paragraphs, sections and diagrams in a book chapter - to be discussed in future articles.)
