Metadata
- Author: Art of the Problem
- Category: video
- URL: https://www.youtube.com/watch?v=OFS90-FX6pg&list=PLMTyXBPQmKICil_O1E-_AXSRVeBfXucg_&index=172
Highlights
units, and then the state units were connected to the middle of the network, and finally also connected to themselves. This resulted in a state of mind which depended on the past and could affect the future, what he called a recurrent neural network
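A minimal sketch of that recurrent update in NumPy (sizes, weights, and names here are illustrative, not from the video): the hidden "state" units feed back into themselves, so each step depends on the past.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input, hidden ("state"), and output dimensions.
n_in, n_hidden, n_out = 10, 16, 10

# Weights: input-to-hidden, hidden-to-hidden (the self-connection that
# carries the "state of mind"), and hidden-to-output.
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))

def step(x, h_prev):
    """One Elman-style update: the new state depends on the current
    input and on the previous state, so the past can affect the future."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = W_hy @ h
    return y, h

h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # a short input sequence
    y, h = step(x, h)
```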
after being trained on the pattern more, he noticed the learned sequences were not just memorized, they were generalized
what he noticed was that the network learned word boundaries on its own
a plot where, at the onset of a new word, the chance of error or uncertainty is high, and as more of the word is received the error rate declines, since the sequence is increasingly predictable
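To make that measurement concrete, here is a toy version: estimate next-letter distributions from a corpus and compute the Shannon entropy after each letter. This bigram model only sees the previous letter, so it is much cruder than Elman's network, but it shows the mechanics of measuring next-symbol uncertainty (the corpus is invented).

```python
from collections import Counter, defaultdict
import math

# Toy corpus with no spaces, echoing Elman's letter-prediction setup.
corpus = "manyyearslater" * 3

# Estimate next-letter distributions from bigram counts.
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def entropy(ch):
    """Shannon entropy (bits) of the next-letter distribution after ch."""
    total = sum(counts[ch].values())
    return -sum(c / total * math.log2(c / total) for c in counts[ch].values())

# Uncertainty per position as a word unfolds.
for i, ch in enumerate("manyyears"):
    print(i, ch, round(entropy(ch), 3))
```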
when a network learns to perform a sequence, it essentially learns to follow a trajectory through state space, and these learned trajectories tend to be attractors
reflects what we saw in information theory, where an intelligent signal contains decreasing entropy over sequence length
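In standard information-theoretic notation (my formalization of the claim, not the video's): conditioning the next symbol on a longer prefix can only lower its entropy, since conditioning never increases entropy:

$$H(X_{n+1} \mid X_1, \dots, X_n) \;\le\; H(X_{n+1} \mid X_2, \dots, X_n)$$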
he viewed an attractor as the generalized pattern learned by the network, which was represented in the connection weights in the inner layers
the network would spatially cluster words based on meaning; for example, it separated nouns into inanimate and animate, and within these groups he saw subcategorization: animate objects were broken down into human and non-human clusters, and inanimate objects were broken down into breakable and edible clusters, and so he emphasizes that the network was learning these hierarchical interpretations
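A sketch of how such hierarchy can be read out of learned vectors, using agglomerative clustering from SciPy. The words and vectors below are fabricated to mirror Elman's animate/inanimate example, not his actual data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical hidden-layer vectors, made up so that animate/inanimate
# groups and their subgroups sit near each other in the space.
words = ["man", "woman", "dog", "cat", "plate", "glass", "bread", "cookie"]
vecs = np.array([
    [1.0, 0.9, 0.0, 0.1],  # man     (animate, human)
    [1.0, 1.0, 0.1, 0.0],  # woman   (animate, human)
    [0.9, 0.1, 0.0, 0.2],  # dog     (animate, non-human)
    [0.8, 0.2, 0.1, 0.1],  # cat     (animate, non-human)
    [0.0, 0.1, 1.0, 0.1],  # plate   (inanimate, breakable)
    [0.1, 0.0, 0.9, 0.0],  # glass   (inanimate, breakable)
    [0.0, 0.2, 0.1, 1.0],  # bread   (inanimate, edible)
    [0.1, 0.1, 0.0, 0.9],  # cookie  (inanimate, edible)
])

# Agglomerative clustering recovers the nested structure Elman plotted:
# 2 clusters give animate vs. inanimate, 4 give the subcategories.
Z = linkage(vecs, method="average")
for k in (2, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters:", dict(zip(words, labels)))
```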
everything could be learned from patterns in the language
since we can represent words as points in high-dimensional space, sequences of words, or sentences, can be thought of as a pathway, and similar sentences seem to follow similar pathways
if we think of intelligence as the ability to learn, this views learning as the compression of experience into a predictive model of the world
hitting the capacity of the network to maintain coherent context over long sequences
he noticed how it learned in phases
they reported the discovery of a sentiment neuron, which was a single neuron within the network that directly corresponded to the sentiment of the text, how positive or negative it sounded
the sentiment neuron emerged out of the process of learning to predict the next word
open question: why our model recovers the concept of sentiment in such a precise, disentangled, interpretable, and manipulable way
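One way to see how such a neuron could be found, as a toy probe: the data below is synthetic, with a "sentiment unit" planted at index 7, whereas the real result probed the hidden states of a model trained on review text.

```python
import numpy as np

# Hypothetical setup: hidden states (n_texts, n_units) from a trained
# next-token model, plus a sentiment label for each text.
rng = np.random.default_rng(2)
hidden = rng.normal(size=(100, 32))
sentiment = rng.integers(0, 2, size=100)
hidden[:, 7] += 2.0 * sentiment   # plant a "sentiment neuron" at unit 7

# Find the single unit most correlated with sentiment (simplified probe).
corr = [abs(np.corrcoef(hidden[:, i], sentiment)[0, 1]) for i in range(32)]
print("best unit:", int(np.argmax(corr)))   # expected: 7
```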
key problem with recurrent neural networks: because they process data serially, all the context had to be squeezed into a fixed internal memory, and this was a bottleneck limiting the ability of the network to handle context over long sequences of text, and so meaning gets squeezed out
a solution to this memory constraint: attention. The key insight behind their approach was to create a network with a new kind of dynamic layer, which could adapt some of its connection weights based on the context of the input, known as a self-attention layer
attention layers work by allowing every word in the input to look at and compare itself to every other word, and absorb the meaning from the most relevant words, to better capture the context of its intended use in that sentence
this is done by simply measuring the distance between all the word pairs in concept space; similar concepts will be closer in this space, leading to a higher connection weighting
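A minimal sketch of that mechanism in NumPy. Real Transformer layers compare learned query/key projections of each word rather than the raw vectors, so this is a simplification of the idea, not the full architecture.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention without learned projections: every
    word compares itself to every other word, and absorbs meaning from
    the most relevant ones via a weighted average."""
    d = X.shape[-1]
    # Pairwise similarity in concept space: closer concepts score higher.
    scores = X @ X.T / np.sqrt(d)
    # Softmax turns scores into per-word connection weightings.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word's new vector is a context-dependent mix of all words.
    return weights @ X

# Illustrative input: 4 "words" as 8-dimensional concept vectors.
X = np.random.default_rng(1).normal(size=(4, 8))
print(self_attention(X).shape)   # (4, 8)
```

Because every pair is compared at once, nothing has to be squeezed through a serial fixed memory, which is what lets the layer look at the whole sequence in parallel.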
that’s why we call them Transformers
this is a network architecture that can look at everything everywhere all at once
it showed some capability in answering general questions, and these questions did not need to be present in the training data; this is known as zero-shot learning
highlighted the potential of language models to generalize from their training data and apply it to arbitrary tasks
amazingly, it could translate languages as well as systems trained only on translation, without any translation-specific training
one capability really jumped out: once training was complete, you could still teach the network new things, known as in-context learning
gave the definition of a made-up word, "gigaro", and then asked it to use that word in a sentence, which it did perfectly
the key point is we could change the behavior of the network without changing the network weights; that is, a frozen network can learn new tricks
works because it's leveraging the internal model of the individual concepts, which it can combine or compose arbitrarily, so you can think of two layers of learning: a core in-weight learning which happens during training, and then a layer of in-context learning which happens during use, or inference
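In practice, in-context learning is just prompt construction: the weights stay frozen and only the text changes. The definition below and the `llm.generate` call are hypothetical stand-ins; the video only says that a definition of "gigaro" was given.

```python
# The model's weights are frozen; only the prompt changes its behavior.
# Hypothetical definition for the made-up word from the video.
definition = 'A "gigaro" is a small stringed instrument played at dawn.'
prompt = definition + '\n\nNow use the word "gigaro" in a sentence:'

# response = llm.generate(prompt)  # hypothetical inference call;
#                                  # no gradient updates, no new weights
print(prompt)
```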
its ability to talk to itself and think out loud: a paper was highly shared showing that simply adding the phrase "think step by step" at the end of your prompt dramatically improved the performance of ChatGPT
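The trick amounts to one extra line of prompt; a sketch, with `llm.generate` again a hypothetical stand-in and the comments describing the effect the paper reported:

```python
question = "A pack holds 3 pens. I buy 4 packs and lose 2 pens. How many are left?"

plain_prompt = question
cot_prompt = question + "\nthink step by step"   # the phrase from the paper

# llm.generate(plain_prompt)  -> tends to answer in one shot
# llm.generate(cot_prompt)    -> tends to write out intermediate
#                                reasoning first, improving accuracy
```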
the kernel process of an emerging operating system: you have an equivalent of random access memory, or RAM, which in this case for an LLM would be the context window, and you can imagine this LLM trying to page relevant information in and out of its context window to perform your task
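A toy rendering of that paging idea: keep only the chunks most relevant to the task inside a fixed-size context window. The word-overlap scoring is a naive stand-in for real retrieval, and nothing in this sketch comes from the video itself.

```python
def score(chunk, task):
    """Naive relevance score: count of shared words with the task."""
    return len(set(chunk.lower().split()) & set(task.lower().split()))

def build_context(chunks, task, window_limit=50):
    """Page the highest-scoring chunks in until the window is full;
    everything else stays 'paged out', like data outside RAM."""
    context, used = [], 0
    for chunk in sorted(chunks, key=lambda c: score(c, task), reverse=True):
        n = len(chunk.split())
        if used + n > window_limit:
            continue   # this chunk stays paged out
        context.append(chunk)
        used += n
    return "\n".join(context)

notes = ["The meeting is on Tuesday at noon.",
         "RNNs squeeze context into a fixed memory.",
         "Attention lets every word look at every other word."]
print(build_context(notes, "How does attention handle context?"))
```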