Skip to main content

Associative Memory and Attention Function in Machine Learning - Key, Value & Query

The fascinating but terse 2017 paper on Machine Translation Attention Is All You Need, by Vaswani et al, has generated a lot of interest – and plenty of head-scratching to digest it!  As far as I can tell, the Attention Function, with its Key, Value and Query inputs, is one of the obstacles in wrapping one's head around the Deep-Learning "Transformer" architecture presented in that paper.

In brief: whereas an RNN (Recursive Neural Network) processes inputs sequentially, and uses each one to update an internal state, the Transformer architecture takes in all inputs at once, and makes extensive use of “Key/Value/Query memory spaces” (called “Multi-Head Attention” in the paper.)  Advantages include faster speed, and overcoming the problem of “remembering early inputs” that affects RNN’s.

That Attention Function, and related concepts, is what I will address in this blog entry.  I will NOT discuss anything else about the Transformer architecture... but at the very end I'm including a few links to references and tutorials that I found helpful.

The “Attention Function” in Machine Learning is often presented as just an indexing scheme into a memory of values, as exemplified by the following annotated sketch, adapted from a talk on “Attention Is All You Need” (link).

(Source)


However,
I see something deeper in it: namely, a generalization from computer memory to multi-dimensional associative memory. 
So, I’ll be digging deeper, trying to lay out a more mathematical foundation – and attempting to get more insight into the process, with a side foray into human cognitive processes.

A Series of Generalizations: from computer memory to multi-dimensional associative memory

A traditional computer memory can be thought of as the representation of a function from the domain of  Natural numbers (the memory addresses) mapped to the set of  Natural numbers (the values stored in the memory.)
Generalizing one notch, let the range of the function now be the set of Real numbers:



Let’s now further generalize by changing the domain of the function, from the set of Natural numbers to the set of Reals:



Our limited knowledge of the function is in the form of a set of values at some points in the domain. 
If we want to estimate the function value at some other point, we can interpolate; for example, using two nearby points (local interpolation), or using all available points (global interpolation.)


One final generalization: expand the domain of the function from R to R2.



Given a new point Q in our domain, how do we estimate the function value h(Q) ?   Again, we could do local or global interpolations.


Episodic Memory

Think of all the (Ki, vi) pairs as "episodic memories".

Example: hot stove.  Each point represents a childhood memory.  The components of K are clues to heat (such as burner color and smell); v is pain level upon touching it.

Loosely speaking, regarding the individual points as episodic memories, a person with “a lot of experience” would have many such points.  A person with “wide-ranging experiences” would have points covering more of the domain (as opposed to being clustered in a small sub-region.)  A person with “good intuition” would retrieve many points and make a good global interpolation.


Interpolation - Key, Value & Query

Let’s refer to:
  •     the Ki vectors (for i = 1, …, n)   as “Keys
  •     the vi scalars  (for i = 1, …, n)  as “Values
  •     the Q vector as “Query vector” (of the same dimension as the Key vectors)
  •     the space spanned by all the Ki vectors as “Memory Space
If we are given all the Key vectors, all the scalar Values and the Query vector, how to define a function that usefully interpolates the given set of mappings  { Ki  |–>  Vi }  ?



One approach to a global interpolation function is to use a weighed average of the Values: if a particular Key is “similar” to the Query vector, use a large weight – otherwise, a small one.

Using Distance

For example, we could use some metric on the underlying Memory Space, such as the Euclidean distance: if dist(Q, Ki ) = 0 then use a weight of 1 for the corresponding Vi ; conversely, as the distance approaches infinity, the weight goes to 0.
One function that does just that, is:   weight = 1 / (1 + dist)
(note that:  dist = 0 –> weight = 1  ; dist = ∞  –> weight = 0 )

The weighed average then becomes:
  [Formula 1]

Note that if  Q = Kj for some j, then dist(Q , Kj) = 0, and therefore the contribution of the j-th term in the sum is the full vj.  To be noted that the other Values still contribute (“leak in”) even when Q is equal to one of the Keys, unless the other Key vectors are infinitely far from Q.

The influence of distant points could be reduced by replacing the distance function with, for example, a square-distance function (which would also eliminates the need to compute square roots.)
More generally, an arbitrary exponential could be given to the distance function, and this exponential might be treated as a hyper-parameter of the interpolation function.

Using Dot Products

An alternate approach is to employ the dot product of Q with the various Ki’s.
Since  Q * Ki =  || Q ||  || Ki ||  cos(θ)  , where θ is the angle between the vectors, then
(Q * Ki ) / || Ki || 2    could be used as a measure of similarity between Q and Ki.  Note that if Q = Ki , then the expression evaluates to 1.
Using this approach, the weighed-average interpolating function becomes:
   [Formula 2]

Note the if all the Ki’s are orthogonal and Q = Kj for some j, then h(Q) = vj , with no “leakage” from the other Values , since all the other dot products are zero.

Using SoftMax

The 2017 paper “Attention Is All You Need”, by Vaswani et al, uses the following formula:

the Values are vectors rather than scalars.
Q is a set of queries simultaneously, packed together into a matrix; the keys and the (vector) values are also packed together into matrices K and V, respectively.
dk is the dimension of the Query and Key vectors.



EXAMPLE 1:  All the Kj 's and Q are unit vectors.  Use the dot-product interpolator (formula 2), repeated below:
    [Formula 2, repeated]
Being unit vectors,  Q * Ki = cos(α) , where α is the angle between them.  Note that, even though we’re placing the vectors on the x-y plane, this is really a 1-dimentional case along the unit circle.
n = 2
K1 = (1, 0)   ,   v1 = 15
K2 = (0, 1)    ,   v2 = -10
Q unit vector.


Using formula 2 and taking into account that all vectors are of size 1, h(Q) can be parametrized in terms of θ, where θ is the angle from the x-axis to the Q vector  :
h(θ) = [Q * (1, 0)] * 15 + [Q * (0, 1)] * (-10)
= 15 cos(θ) - 10 cos(90º - θ) = 15 cos(θ) - 10 sin(θ)


Notice how, thanks to the orthogonality of the Key vectors, the original Values are attained exactly when Q is equals to a Key vector.   θ = 0 corresponds to the first Key vector (1, 0), and  θ = π/2  to the second one (0, 1).
Same plot in 3D – the sides of the cylinder show the interpolated values along the vertical axis:




EXAMPLE 2:  let’s add a third mapping to the function of the previous example.
n = 3
K1 = (1, 0)    ,   V1 = 15
K2 = (0, 1)    ,   V2 = -10
K3 = (-3, -3)  ,   V3 = -50
Note that we are no longer restricting ourselves to unit vectors, nor to orthogonal Key vectors.
Then:
h(x, y) = V1*{x,y}.K1/(K1.K1) + V2*{x,y}.K2/(K2.K2) + V3*{x,y}.K3/(K3.K3)
that simplifies to:
h(x, y) = 70x/3 - 5y/3
That’s a plane in 3D space, with a big positive slant along the x-axis and a small negative slant along the y-axis:


Mathematica code:

K1 = {1, 0}
K2 = {0, 1}
K3 = {-3, -3}
V1 = 15
V2 = -10
V3 = -50
hInterpolate[x_,y_]:=
N[V1*{x,y}.K1/(K1.K1)+ V2*{x,y}.K2/(K2.K2)+ V3*{x,y}.K3/(K3.K3)]

Lacking orthogonality between the Key vectors, we no longer obtain the original values when the Query Vector equals one of the Key vectors:

hInterpolate(1, 0) = 23.3333         (vs. the original 15)
hInterpolate (0, 1) =  -1.66667      (vs. the original -10)
hInterpolate(-3, -3) = -65             (vs. the original -50)


EXAMPLE 3:  Same data as for the previous example #2, but now using Formula 1, the one based on distances instead of dot products.

In Mathematica:
hNEWinterpolate[x_,y_] := N[V1/(1+Norm[{x,y}-K1] ) + V2/(1+Norm[{x,y}-K2] ) + V3/(1+Norm[{x,y}-K3] )]

It’s quite a different plot now, reminiscent of spacetime distortion in General Relativity.  The baseline interpolated values are (asymptotically) zero, and the individual Values at the Key-vector points “distort” that baseline:


Looking at the values of this new interpolating function at the Key Vectors:

hNEWinterpolate(0, 1) = 2.52453         (vs. the original 15)
hNEWinterpolate(1, 0) = -12.1201         (vs. the original -10)
hNEWinterpolate(-3, -3) = -49.1667     (vs. the original -50)

The interpolated value at the third point (K3) isn’t much affected from its original value, due to the large distance from the other points.


Generalizations: different Domains and Ranges of the Memory Function

In the previous examples, the memory function was  h: ℜ2 –> ℜ
The range of the function could be  ℜn instead of ℜ.  Then, the interpolation function we’ve been looking at would create linear combinations of vectors, instead of scalars.
If  h:  ℜn –> ℜn , i.e. the domain and the range are the same set, then the Value vectors might be interpreted as points in the Domain – and the function h could be regarded as associating points in ℜn

Booleans (B) or other sets could take the place of ℜ.  For example, the classic Pavlovian reflex could be regarded as an interpolation of  h:  B2 -> B   (or perhaps B2 -> ℜ)





Helpful Resources on the Deep-Learning "Transformer" Architecture

As promised in the opening sentences, below is a listing of resources (mainly blogs and videos) that I found helpful when I first encountered the Deep-Learning "Transformer" Architecture that was introduced in the 2017 paper “Attention Is All You Need”.

I won't discuss the actual architecture in this blog entry, except for presenting this high-level diagram, which I annotated in red to emphasize the part about Keys, Values and Queries occurring as inputs to an Attention function:

In this blog entry, I attempted to give a deeper understanding of the Attention function and the meaning of Keys, Values and Queries.  The resources below go a long way to make the above diagram more intelligible:

URLtitlecomments
http://jalammar.github.io/illustrated-transformer/The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a timeExtremely helpful!  Blog on which the helpful video https://www.youtube.com/watch?v=S0KakHcj_rs (next entry) is based
https://www.youtube.com/watch?v=S0KakHcj_rs[Transformer] Attention Is All You Need | AISC Foundational - YouTubeVery helpful and well-presented. Based on the blog http://jalammar.github.io/illustrated-transformer/ (previous entry)
https://distill.pub/2016/augmented-rnns/Attention and Augmented Recurrent Neural Networksvery readable and helpful
https://www.youtube.com/watch?v=iDulhoQ2proAttention Is All You Need - YouTubequite helpful - but it glosses over the later part. It feels like a "part 1" in need of a "part 2"
https://www.youtube.com/watch?v=W2rWgXJBZhUAttention in Neural Networks - YouTubehelpful
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a timeHelpful animations for RNN translation, without and with Attention. However, it glosses over a few things
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/Paper Dissected: "Attention is All You Need" Explained | Machine Learning Explainedclear language. Helpful
https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129A Brief Overview of Attention Mechanism – SyncedReview – Mediuminsightful but with tortured English, errors and omissions
https://medium.com/@adityathiruvengadam/transformer-architecture-attention-is-all-you-need-aeccd9f50d09Transformer Architecture: Attention Is All You Needsome clarifications - though also plenty of rehashing - of the terse "Attention is All You Need" paper. Nearly unreadable. One intriguing (but unclear) animation
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04What is a Transformer? – Inside Machine learning – MediumSome clarifications - but mostly rehashing - of the terse "Attention is All You Need" paper
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.htmlTransformer: A Novel Neural Network Architecture for Language Understanding2017. Brief. One intriguing (but unclear) animation
https://www.youtube.com/watch?v=rBCqOTEfxvgAttention is all you need attentional neural network models – Łukasz Kaiser - YouTubeDirectly from a Google guy... but on the confusing side
https://www.youtube.com/watch?v=SysgYptB198C5W3L07 Attention Model Intuition
(Part 7 of a series)
helpful
https://www.youtube.com/watch?v=quoGRI-1l0AC5W3L08 Attention Model
(Part 8 of a series)
helpful - details from part 7

Comments

Popular posts from this blog

Discussing Neuroscience with ChatGPT

UPDATED Apr. 2023 - I'm excited by ChatGPT 's possibilities in terms of facilitating advanced learning .  For example, I got enlightening answers to questions that I had confronted when I first studied neuroscience.  The examples below are taken from a very recent session I had with ChatGPT (mid Jan. 2023.) Source: https://neurosciencestuff.tumblr.com In case you're not familiar with ChatGPT, it's a very sophisticated "chatbot" - though, if you call it that way, it'll correct you!  'I am not a "chatbot", I am a language model, a sophisticated type of AI algorithm trained on vast amounts of text data to generate human-like text'. For a high-level explanation of how ChatGPT actually works - which also gives immense insight into its weaknesses, there's an excellent late Jan. 2023 talk by Stephen Wolfram, the brilliant author of the Mathematica software and of Wolfram Alpha , a product that could be combined with ChatGPT to imp...

Graph Databases (Neo4j) - a revolution in modeling the real world!

UPDATED Oct. 2023 - I was "married" to Relational Databases for many years... and it was a good "relationship" full of love and productivity - but SOMETHING WAS MISSING! Let me backtrack.   In college, I got a hint of the "pre-relational database" days...  Mercifully, that was largely before my time, but  - primarily through a class - I got a taste of what the world was like before relational databases.  It's an understatement to say: YUCK! Gratitude for the power and convenience of Relational Databases and SQL - and relief at having narrowly averted life before it! - made me an instant mega-fan of that technology.  And for many years I held various jobs that, directly or indirectly, made use of MySQL and other relational databases - whether as a Database Administrator, Full-Stack Developer, Data Scientist, CTO or various other roles. UPDATE: This article is now part 1 of a growing, ongoing series on Graph Databases and Neo4j But ther...

Using Schema in Graph Databases such as Neo4j

UPDATED Feb. 2024 - Graph databases have an easygoing laissez-faire attitude: "express yourself (almost) however you want"... By contrast, relational databases come across with an attitude like a micro-manager:  "my way or the highway"... Is there a way to take the best of both worlds and distance oneself from their respective excesses, as best suited for one's needs?  A way to marry the flexibility of Graph Databases and the discipline of Relational Databases? This article is part 5 of a growing,  ongoing  series  on Graph Databases and Neo4j Let's Get Concrete Consider a simple scenario with scientific data such as the Sample, Experiment, Study, Run Result , where Samples are used in Experiments, and where Experiments are part of Studies and produce Run Results.  That’s all very easy and intuitive to represent and store in a Labeled Graph Database such as Neo4j .   For example, a rough draft might go like this:   The “labels” (b...

What are Graph Databases - and Why Should I Care?? : "Graph Databases for Poets"

  This is a very gentle introduction to the subject.  The subtitle is inspired by university courses such as "Physics for Poets"!  (if you're technically inclined, there's an alternate article for you.) It has been said that "The language of physics (or of God) is math".  On a similar note, it could be said that: The language of the biological world - or of any subject or endeavor involving complexity - is networks ('meshes') What is a network?  Think of  it as the familiar 'friends of friends' diagram from social media. Everywhere one turns in biology, there's a network – at the cellular level, tissue level, organ level, ecosystem level.  The weather and other earth systems are networks.  Human societal organization is a network.  Electrical circuits, the Internet, our own brains...  Networks are everywhere! What can we do with networks, to better understand the world around us, or to create something that we need? Broadly ...

Anti-Aging Research: Science, not Hype

Last updated May 2023 Q: "How is aging a disease?" A: It's a dynamic system that veers away from its homeostasis (normal equilibrium point): hence a form of slow-progressing illness. Labeling it as 'natural' is a surrender to our traditional state of ignorance and powerlessness, which fortunately is beginning to be changed! Aging is "normal" only from the point of view of the "selfish gene", for whom the body is a disposable carrier. Individuals organisms - for whom self-preservation has a different meaning than for genes - have received scant help from evolution... with rare exceptions such as the T. dohrnii jellyfish (which I discuss here )... but now the time has finally arrived for our rational design to remedy some of the cellular flaws that evolution never bothered to correct!   The above is my standard answer to an oft-asked question. The science of aging is by all evidence very misunderstood by the general public.  Hype,...

Using Neo4j with Python : the Open-Source Library "NeoAccess"

So, you want to build a python app or Jupyter notebook to utilize Neo4j, but aren't too keen on coding a lot of string manipulation to programmatic create ad-hoc Cypher queries?   You're in the right place: the NeoAccess library can do take care of all that, sparing you from lengthy, error-prone development that requires substantial graph-database and software-development expertise! This article is part 4 of a growing,  ongoing  series  on Graph Databases and Neo4j   "NeoAccess" is the bottom layer of the technology stack provided by the BrainAnnex open-source project .  All layers are very modular, and the NeoAccess library may also be used by itself , entirely separately from the rest of the technology stack.  (A diagram of the full stack is shown later in this article.) NeoAccess interacts with the Neo4j Python driver , which is provided by the Neo4j company, to access the database from Python; the API to access that driver is very p...

Visualization of Graph Databases Using Cytoscape.js

(UPDATED APR. 2024)   I have ample evidence from multiple sources that there are strong unmet needs in the area of visualization of graph databases. And whenever there's a vacuum, vendors circle like vultures - with incomplete, non-customizable, and at times ridiculously expensive, closed-box proprietary solutions.   Fortunately, coming to the rescue is the awesome open-source cytoscape.js library ,  an offshoot of the "Cytoscape" project of the  Institute for Systems Biology , a project with a long history that goes back to 2002. One can do amazing custom solutions, relatively easily, when one combines this Cytoscape library with:   1) a front-end framework such as Vue.js   2) backend libraries (for example in python) to prepare and serve the data   For example, a while back I created a visualizer for networks of chemical reactions, for another open-source project I lead ( life123.science )   This visualizer will look and feel generally ...

Neo4j Sandbox Tutorial : try Neo4j and learn Cypher - free and easy!

So, you have an itch to test-drive Neo4j and its Cypher query language.  Maybe you want to learn it, or evaluate it, or introduce colleagues/clients to it.  And you wish for: fast, simple and free! Well, good news: the Neo4j company kindly provides a free, short-term hosted solution called "the Neo4j sandbox" .  Extremely easy to set up and use! This article is part 2 of a growing, ongoing series on Graph Databases and Neo4j Register (free) for the Neo4j "Sandbox" Go to sandbox.neo4j.com , and register with a working email and a password.  That's it! Note that this same email/password will also let you into the Neo4j Community Forums and Support ; the same login for all: very convenient! Launch your instance - blank or pre-populated After registering, go to  sandbox.neo4j.com  , and follow the steps in the diagram below (the choices might differ, but the "Blank Sandbox" should always be there): Too good to be true?  Is there...

Graph Database Series - Beginners to Advanced, Full-Stack (Neo4j focus)

A growing, ongoing series on Graph Databases and Neo4j.  The series has grown so large that I'm "factoring out" their Table of Contents as a separate entry, here! A series on Graph Databases and Neo4j part 0    (alternate version of part 1 for a general, non-technical audience)   :             What are Graph Databases - and Why Should I Care?? : "Graph Databases for Poets" part 1 : Intro to Graph Databases (Neo4j) - a revolution in modeling the real world! part 2 : Neo4j Sandbox Tutorial : try Neo4j and learn Cypher the free and easy way part 3 : Neo4j & Cypher Tutorial : Getting Started with a Graph Database and its Query Language  part 4 : Using Neo4j with Python : the Open-Source Library NeoAccess part 5 : Using Schema in Graph Databases such as Neo4j part 6  : Putting it All Together - a Technology Stack on Top of a Graph Database SPECIAL TOPICS *...

Full-Text Search with the Neo4j Graph Database

(UPDATED May 2024)   Now that we have discussed a full technology stack based on Neo4j (or other graph databases), and that we a design and implementation available from the open-source project BrainAnnex.org  , what next?  What shall we build on top? Well, how about  Full-Text Search ?  This article is part of a growing, ongoing series on Graph Databases and Neo4j Full-Text Searching/Indexing The Brain Annex open-source project includes an implementation of a design that uses the convenient services of its Schema Layer , to provide indexing of word-based documents using Neo4j. The python class FullTextIndexing ( source code ) provides the necessary methods, and it can parse both plain-text and HTML documents (for example, used in "formatted notes"); parsing of PDF files and other formats will be added at a later date. No grammatical analysis ( stemming or lemmatizing ) is done on the text.  However, a long list of common word ("stop ...