
Motivating Self-Attention

We don’t need to completely replace the value of v_Riley with v_dog, so let’s say we take a linear combination of v_Riley and v_dog as the new value for v_Riley:

v_Riley = get_value('Riley')
v_dog = get_value('dog')

ratio = 0.75
v_Riley = ratio * v_Riley + (1 - ratio) * v_dog

This seems to work alright: we’ve embedded a little bit of the meaning of the word “dog” into the word “Riley”.
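To make the blend concrete, here’s a runnable sketch with made-up 4-dimensional value vectors (the numbers and the ratio are arbitrary, purely for illustration):

```python
import numpy as np

# hypothetical 4-dimensional value vectors (arbitrary numbers, for illustration only)
v_riley = np.array([1.0, 0.0, 2.0, 0.0])
v_dog = np.array([0.0, 1.0, 0.0, 2.0])

ratio = 0.75
# keep 75% of Riley's original value, mix in 25% of dog's
v_riley = ratio * v_riley + (1 - ratio) * v_dog
# v_riley is now [0.75, 0.25, 1.5, 0.5]
```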

Now we’d like to try applying this kind of attention to the entire sentence, updating the vector representation of each word using the vector representations of every other word.

What goes wrong here?

The core problem is that we don’t know which words should take on the meanings of other words. We’d also like some measure of how much the value of each word should contribute to each other word.

Part 2

Alright. So we need to know how much two words should be related.

Time for attempt number 2.

I’ve redesigned our vector database so that each word now has two associated vectors. The first is the same value vector that we had before, still denoted by v. In addition, we now have unit vectors denoted by k that store some notion of word relations. Specifically, if two k vectors are close together, it means that the values associated with those words are likely to influence each other’s meanings.

With our new k and v vectors, how can we modify our previous scheme to update v_Riley’s value with v_dog in a way that respects how much the two words are related?

Let’s continue with the same linear combination business as before, but only if the k vectors of the two words are close in embedding space. Even better, we can use the dot product of the two k vectors (which, since they’re unit vectors, ranges from -1 to 1) to tell us how much we should update v_Riley with v_dog.

v_Riley, v_dog = get_value('Riley'), get_value('dog')
k_Riley, k_dog = get_key('Riley'), get_key('dog')

relevance = k_Riley · k_dog # dot product

v_Riley = (1 - relevance) * v_Riley + relevance * v_dog

This is a little strange, since if relevance is 1, v_Riley gets completely replaced by v_dog, but let’s ignore that for a minute.

Instead, I’d like to think about what happens when we apply this sort of idea to the whole sequence. The word “Riley” will have a relevance value with each other word via the dot product of their ks. So maybe we can update the value of each word in proportion to that dot product. For simplicity, let’s also include its dot product with itself as a way to preserve its own value.

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()

# obtain a list of values
values = get_values(words)

# oh yeah, this is what k stands for, by the way
keys = get_keys(words)

# get riley's relevance key
riley_index = words.index('Riley')
riley_key = keys[riley_index]

# generate the relevance of "Riley" to each other word
relevances = [riley_key · key for key in keys]  # still pretending python has ·

# normalize relevances to sum to 1
relevances /= sum(relevances)

# takes a linear combination of values, weighted by relevances
v_Riley = relevances · values

Okay, that’s good enough for now.

But once again, I claim there’s something wrong with this approach. It’s not that any of our ideas have been implemented incorrectly, but rather that there’s something fundamentally different between this approach and how we actually think about relationships between words.

If there’s any point in this article where I really, really think you should stop and think, it’s here, even those of you who think you fully understand attention. What’s wrong with our approach?

A hint

Relationships between words are inherently asymmetric! The way that “Riley” attends to “dog” is different from the way that “dog” attends to “Riley”. It’s a much bigger deal that “Riley” refers to a dog, not a human, than what the dog’s name happens to be.

In contrast, the dot product is a symmetric operation, which means that in our current setup, if a attends to b, then b attends equally strongly to a! Actually, this is slightly untrue because we normalize the relevance scores, but the point is that the words should have the option of attending asymmetrically, even when the other tokens are held constant.
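To see the symmetry concretely: for any two vectors, the dot product gives the same number in both directions, so raw key-key scores leave no room for a one-way relationship. A tiny check with arbitrary stand-in unit vectors:

```python
import numpy as np

# arbitrary 2-dimensional unit vectors standing in for the two keys (illustrative only)
k_riley = np.array([0.6, 0.8])
k_dog = np.array([0.8, 0.6])

# the dot product is symmetric: the score is identical in both directions
forward = np.dot(k_riley, k_dog)
backward = np.dot(k_dog, k_riley)
assert forward == backward  # 0.96 both ways, no asymmetry possible
```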

Part 3

We’re almost there! Finally, the question becomes:

How can we most naturally extend our current setup to allow for asymmetric relationships?

Well, what can we do with one more vector type? We still have our value vectors v and our relation vectors k. Now we add one more vector q for each token.

How can we modify our setup and use q to achieve the asymmetric relationship that we want?

Those of you who are familiar with how self-attention works will hopefully be smirking at this point.

Instead of computing relevance as k_Riley · k_dog when “Riley” attends to “dog”, we can query q_Riley against the key k_dog by taking their dot product. Computing the other way around, when “dog” attends to “Riley”, we’ll have q_dog · k_Riley instead: asymmetric relevance!

Here’s the entire thing put together, computing the update for every value at once!

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
seq_len = len(words)

# obtain arrays of queries, keys, and values, each of shape (seq_len, n)
Q = array(get_queries(words))
K = array(get_keys(words))
V = array(get_values(words))

relevances = Q @ K.T
normalized_relevances = relevances / relevances.sum(axis=1, keepdims=True)

new_V = normalized_relevances @ V
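Since `get_queries`, `get_keys`, and `get_values` are hypothetical lookups, here’s a self-contained version of the same computation with random matrices standing in for them. One deliberate change: it normalizes each row with a softmax rather than dividing by the row sum, since raw dot products of random vectors can be negative or sum to zero (a softmax is also what real self-attention uses):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n = 10, 8  # 10 tokens, 8-dimensional vectors

# random stand-ins for the query, key, and value lookups
Q = rng.normal(size=(seq_len, n))
K = rng.normal(size=(seq_len, n))
V = rng.normal(size=(seq_len, n))

# every query scored against every key: shape (seq_len, seq_len)
relevances = Q @ K.T

# softmax over each row so the weights are positive and sum to 1
weights = np.exp(relevances - relevances.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# each new value is a weighted combination of all the values
new_V = weights @ V
print(new_V.shape)  # (10, 8)
```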
