ChatGPT: A 30 Year History | How Neural Networks Learned to Talk

Metadata

  • Author: Art of the Problem
  • Category: video
  • URL: https://www.youtube.com/watch?v=OFS90-FX6pg&list=PLMTyXBPQmKICil_O1E-_AXSRVeBfXucg_&index=172

Highlights

The state units were connected to the middle of the network and also connected to themselves. This resulted in a state of mind which depended on the past and could affect the future, what he called a recurrent neural network.
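A minimal sketch of that recurrent loop in NumPy (sizes and weights are illustrative, not from the video): the hidden layer receives the current input plus a copy of its own previous state, so past inputs shape future activations.

```python
import numpy as np

# Minimal Elman-style recurrent step (illustrative sizes and random weights).
# The hidden layer receives the current input AND its own previous state,
# so its activation depends on everything seen so far.

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8

W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))       # input -> hidden
W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # previous state -> hidden (the loop)

def step(x, h_prev):
    # The new state mixes the current input with the remembered previous state.
    return np.tanh(W_in @ x + W_rec @ h_prev)

h = np.zeros(n_hidden)                # the "state of mind" starts empty
for x in rng.normal(size=(5, n_in)):  # feed a short sequence
    h = step(x, h)                    # the past affects the future through h
print(h)
```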

After being trained on the patterns more, he noticed the learned sequences were not just memorized; they were generalized.

He noticed that the network learned word boundaries on its own.

A plot showed that at the onset of a new word the chance of error, or uncertainty, is high, and as more of the word is received the error rate declines, since the sequence is increasingly predictable.
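A toy way to reproduce that curve (an illustrative stand-in, not Elman's actual experiment): count, for every within-word prefix in a small corpus, the distribution over next characters, and watch the entropy fall as the prefix grows.

```python
import numpy as np
from collections import defaultdict, Counter

# For every within-word prefix, tally the next-character distribution.
# Entropy is high at word onset (any word could start) and drops as the
# prefix grows, because the continuation becomes increasingly predictable.

words = "many years ago in a galaxy many dogs and many boys played".split()
nxt = defaultdict(Counter)
for w in words:
    w = w + " "                    # end-of-word marker
    for i in range(len(w)):
        nxt[w[:i]][w[i]] += 1      # prefix -> next character

def entropy(prefix):
    p = np.array(list(nxt[prefix].values()), float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

for i in range(len("many") + 1):
    prefix = "many"[:i]
    print(f"prefix {prefix!r}: {entropy(prefix):.2f} bits")  # falls toward 0
```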

When a network learns to perform a sequence, it essentially learns to follow a trajectory through state space, and these learned trajectories tend to be attractors.

This reflects what we saw in information theory, where an intelligent signal contains decreasing entropy over sequence length.

He viewed an attractor as the generalized pattern learned by the network, which was represented in the connection weights of the inner layers.

The network would spatially cluster words based on meaning. For example, it separated nouns into inanimate and animate, and within these groups he saw subcategorization: animate objects were broken down into human and non-human clusters, and inanimate objects were broken down into breakable and edible clusters. He emphasizes that the network was learning these hierarchical interpretations.
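A rough sketch of that clustering analysis (the vectors below are hand-made toys, not learned hidden-layer activations): agglomerative clustering over word representations recovers the animate/inanimate split first, then the finer subgroups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hand-made stand-ins for hidden-layer word representations; in the real
# analysis these would be activations from the trained network.
words = ["man", "woman", "dog", "cat", "glass", "plate", "bread", "cookie"]
vecs = np.array([
    [1.0, 1.0, 0.0, 0.0],   # animate, human
    [1.0, 0.9, 0.0, 0.1],
    [1.0, 0.0, 0.1, 0.0],   # animate, non-human
    [1.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],   # inanimate, breakable
    [0.0, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.1, 1.0],   # inanimate, edible
    [0.1, 0.0, 0.0, 1.0],
])

# Agglomerative clustering merges the within-subgroup pairs first and the
# animate/inanimate super-clusters last, mirroring the hierarchy Elman saw.
Z = linkage(vecs, method="average")
print(Z)  # merge order; visualize with scipy.cluster.hierarchy.dendrogram(Z, labels=words)
```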

Everything could be learned from patterns in the language.

Since we can represent words as points in high-dimensional space, sequences of words, or sentences, can be thought of as a pathway, and similar sentences seem to follow similar pathways.
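A toy illustration of the pathway idea (hand-made word vectors, not learned ones): treat each sentence as a path of word vectors and score how closely two paths track each other, step by step.

```python
import numpy as np

# Each sentence is a sequence of word vectors, i.e. a path through the space.
vec = {
    "dog":   np.array([1.0, 0.0, 0.1]),
    "cat":   np.array([0.9, 0.1, 0.1]),
    "eats":  np.array([0.1, 1.0, 0.0]),
    "bread": np.array([0.0, 0.2, 1.0]),
}

def path(sentence):
    return np.stack([vec[w] for w in sentence.split()])

def path_similarity(a, b):
    # Mean cosine similarity between corresponding steps of the two paths.
    A, B = path(a), path(b)
    cos = (A * B).sum(1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    return cos.mean()

print(path_similarity("dog eats bread", "cat eats bread"))  # similar sentences, close paths
```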

If we think of intelligence as the ability to learn, this views learning as the compression of experience into a predictive model of the world.

Hitting the capacity of the network to maintain coherent context over long sequences.

He noticed how it learned in phases.

They reported the discovery of a sentiment neuron: a single neuron within the network that directly corresponded to the sentiment of the text, how positive or negative it sounded.

The sentiment neuron emerged out of the process of learning to predict the next word.

An open question: why our model recovers the concept of sentiment in such a precise, disentangled, interpretable, and manipulable way.
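One way such a neuron can be located, sketched here on synthetic data (not the actual OpenAI model): correlate each hidden unit's activation with sentiment labels and look for the single unit that stands out.

```python
import numpy as np

# Synthetic demonstration of the probing idea: we plant a "sentiment neuron"
# in fake hidden states, then recover it by correlation with the labels.

rng = np.random.default_rng(1)
n_texts, n_units = 200, 64
sentiment = rng.choice([-1.0, 1.0], size=n_texts)           # pretend labels

hidden = rng.normal(size=(n_texts, n_units))                # pretend activations
hidden[:, 42] = sentiment + 0.1 * rng.normal(size=n_texts)  # the planted unit

# Correlate every unit with sentiment; the planted one stands out sharply.
corr = [abs(np.corrcoef(hidden[:, j], sentiment)[0, 1]) for j in range(n_units)]
print("most sentiment-aligned unit:", int(np.argmax(corr)))  # -> 42
```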

A key problem with recurrent neural networks: because they process data serially, all the context had to be squeezed into a fixed internal memory, and this was a bottleneck limiting the ability of the network to handle context over long sequences of text, and so meaning gets squeezed out.

A solution to this memory constraint: attention. The key insight behind their approach was to create a network with a new kind of dynamic layer which could adapt some of its connection weights based on the context of the input, known as a self-attention layer.

Attention layers work by allowing every word in the input to look at and compare itself to every other word, and absorb the meaning from the most relevant words, to better capture the context of its intended use in that sentence.

This is done by simply measuring the distance between all the word pairs in concept space; similar concepts will be closer in this space, leading to a higher connection weighting.
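A minimal single-head self-attention layer in NumPy (untrained, illustrative weights; real Transformers add multiple heads, masking, and learned projections): dot products between word vectors play the role of distance in concept space, and a softmax turns them into connection weightings.

```python
import numpy as np

# Single-head scaled dot-product self-attention over one short "sentence".
rng = np.random.default_rng(0)
seq_len, d = 5, 16                      # 5 words, 16-dim embeddings
X = rng.normal(size=(seq_len, d))       # word vectors for one sentence

W_q, W_k, W_v = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every word compared to every other word: larger dot product means the
# pair is "closer" in concept space and earns a stronger connection.
scores = Q @ K.T / np.sqrt(d)

# Softmax turns scores into connection weightings that sum to 1 per word...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# ...and each word's new representation is a context-dependent blend of the
# values of the most relevant words.
out = weights @ V
print(weights.round(2))  # seq_len x seq_len attention pattern
print(out.shape)         # (5, 16)
```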

That’s why we call them Transformers.

This is a network architecture that can look at everything, everywhere, all at once.

It showed some capability in answering general questions, and these questions did not need to be present in the training data; this is known as zero-shot learning.

This highlighted the potential of language models to generalize from their training data and apply it to arbitrary tasks.

Amazingly, it could translate languages as well as systems trained only on translation, without any translation-specific training.

One capability really jumped out once training was complete: you could still teach the network new things, known as in-context learning.

He gave the definition of a made-up word, "gigaro", and then asked it to use that word in a sentence, which it did perfectly.

The key point is we could change the behavior of the network without changing the network weights; that is, a frozen network can learn new tricks.

This works because it's leveraging the internal model of the individual concepts, which it can combine or compose arbitrarily. So you can think of two layers of learning: a core in-weight learning, which happens during training, and then a layer of in-context learning, which happens during use, or inference.
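A sketch of those two layers in code, with `generate` as a hypothetical placeholder for any text-completion call (not a specific library's API): the weights stay frozen, and the new knowledge lives entirely in the prompt.

```python
# In-context learning: the "learning" lives entirely in the prompt, so the
# model's weights never change. The definition below mirrors the video's
# made-up-word demonstration; the wording is illustrative.

prompt = (
    'A "gigaro" is a small musical instrument played with two hands.\n'
    "Use the word in a sentence:\n"
)

def generate(prompt: str) -> str:
    # Hypothetical placeholder for a real model call; a trained LLM would
    # complete the prompt using the definition above, with no weight update.
    raise NotImplementedError("plug in your model here")

# response = generate(prompt)  # in-weight knowledge + in-context instruction
```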

Its ability to talk to itself and think out loud: a paper was highly shared showing that simply adding the phrase "think step by step" at the end of your prompt dramatically improved the performance of ChatGPT.
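The trick itself is just prompt construction; a minimal sketch (the question is a stock example, and the exact cue wording varies across papers):

```python
# Chain-of-thought prompting: appending a short cue elicits intermediate
# reasoning steps before the final answer, with no change to the model.

question = "A bat and a ball cost $1.10 total; the bat costs $1 more. Ball price?"
prompt = question + "\nLet's think step by step."
# Sent to the same frozen model as above; the cue alone improves accuracy.
```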

The kernel process of an emerging operating system: you have an equivalent of random access memory, or RAM, which in this case for an LLM would be the context window, and you can imagine this LLM trying to page relevant information in and out of its context window to perform your task.
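A sketch of that paging idea, assuming a crude word split stands in for real token counting: keep the newest turns that fit a fixed budget and drop ("page out") the rest.

```python
# "Context window as RAM": a fixed token budget, with older turns paged out
# so the most relevant (here, most recent) information fits.

def page_into_context(turns, budget=2048):
    kept, used = [], 0
    for turn in reversed(turns):      # newest information first
        cost = len(turn.split())      # crude stand-in for a real tokenizer
        if used + cost > budget:
            break                     # older turns are "paged out"
        kept.append(turn)
        used += cost
    return list(reversed(kept))

history = [f"turn {i}: ..." for i in range(10_000)]
context = page_into_context(history)  # what the LLM actually sees
```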

