Is GPT-4 just a blockhead?
This suggests that GPT-4’s responses could be generated by approximately – and, in some cases, exactly – reproducing samples from its training data.
Zhang et al. (2021)
In the 1950s, two lines of thought competed in the field of linguistics:
Structuralists insisted that every language is unique in its inner workings, each requiring its own model of understanding.
The opposing school, led by Zellig Harris and Noam Chomsky, argued that all languages share a certain universal grammar.
Based on the idea of a universal grammar is the distributional hypothesis (Harris 1954):
You shall know a word by the company it keeps
(Firth 1957)
The implications of the distributional hypothesis:
(Jurafsky and Martin 2024)
The problem with a mere vector representation:
It becomes huge: one dimension per vocabulary word, with almost all entries zero.
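A minimal sketch of why such vectors become huge, using a hypothetical six-word toy vocabulary: a one-hot vector needs one dimension per vocabulary word, so a realistic vocabulary yields enormous, almost entirely zero vectors.

```python
# One-hot encoding over a toy vocabulary: one dimension per word.
vocab = ["it", "was", "the", "best", "of", "times"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word` over `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("best", vocab))       # [0, 0, 0, 1, 0, 0]
print(len(one_hot("best", vocab)))  # 6 dimensions for 6 words
# With a 100,000-word vocabulary, every word becomes a
# 100,000-dimensional vector that is zero almost everywhere.
```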
But you can train a neural network!
It predicts a word’s context given the word itself. Or vice versa.
Don’t you need training data?
The Internet has plenty of data!
It was the best of times, it was the worst of times, it was the age of wisdom.
So, create your own! (Mikolov et al. 2013)
It was the best of ____
You focus on a target word and its surrounding context words:
It was the best of times, it was the worst…
Now, let that window slide:
[It was the best of times,] it was the worst…
It [was the best of times, it] was the worst…
It was [the best of times, it was] the worst…
Using the skip-gram model, we learn to predict one context word (output) using one target word (input).
(Weng 2017)
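The sliding window above can be sketched as a simple pair-extraction loop. This is an illustration of how skip-gram training pairs are built, not the actual word2vec implementation: for each target word, every word within `window` positions becomes one (target, context) training example.

```python
# Extract skip-gram (target, context) training pairs from a
# token list using a symmetric context window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "it was the best of times".split()
for target, context in skipgram_pairs(tokens, window=2):
    print(target, "->", context)
# First pairs: it -> was, it -> the, was -> it, was -> the, ...
```

Each pair is one input/output example for the network: the one-hot target word goes in, and the network is trained to assign high probability to the observed context word.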
the New York Times, a fantastic newspaper
the good old times are long gone
during these times you can call
To predict a target word from multiple source context words, we can use the continuous bag-of-words (CBOW) model:
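CBOW mirrors skip-gram: instead of predicting context words from the target, the surrounding context words jointly predict the target in the middle. A sketch of how such training examples look (again an illustration, not the word2vec implementation itself):

```python
# Build CBOW training examples: (context words, target word).
def cbow_examples(tokens, window=2):
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "it was the best of times".split()
for context, target in cbow_examples(tokens):
    print(context, "->", target)
# e.g. ['was', 'the', 'of', 'times'] -> best
```

In training, the context vectors are averaged into a single input, and the network learns to predict the target word from that average.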
\(\overrightarrow{king} - \overrightarrow{man} + \overrightarrow{woman} = \overrightarrow{queen}\)
(Millière and Buckner 2024)
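The famous analogy can be reproduced with hand-made toy vectors. These two dimensions (roughly "royalty" and "gender") are an invented illustration, not real word2vec embeddings, which have hundreds of dimensions learned from data:

```python
import math

# Toy 2-dimensional embeddings; the dimensions are hand-chosen
# to make the analogy work, purely for illustration.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component-wise
result = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Which word's vector is closest to the result?
best = max(emb, key=lambda word: cosine(emb[word], result))
print(best)  # queen
```

With real embeddings the equality only holds approximately, which is why the nearest neighbor is found by cosine similarity rather than exact comparison.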
What are the limitations of ChatGPT?
Try prompting it to spell words backwards. Can it play Tic Tac Toe? What other things does it fail at?
The aforementioned models were still just shallow neural networks…
Current Large Language Models are based on the Transformer architecture (Vaswani et al. 2017).
GPT-3, GPT-4, and ChatGPT are autoregressive Transformer models.
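"Autoregressive" means the model repeatedly predicts the next token and feeds it back in as input. A minimal sketch of that loop, where a toy bigram count table stands in for the Transformer's learned probability distribution:

```python
from collections import Counter

# Toy "training data" and a bigram table standing in for a
# learned next-token distribution.
text = "it was the best of times it was the worst of times"
tokens = text.split()
bigrams = Counter(zip(tokens, tokens[1:]))

def next_word(word):
    """Return the most frequent follower of `word`, or None."""
    candidates = {b: c for (a, b), c in bigrams.items() if a == word}
    return max(candidates, key=candidates.get) if candidates else None

# Autoregressive loop: each prediction becomes the next input.
sequence = ["it"]
for _ in range(5):
    nxt = next_word(sequence[-1])
    if nxt is None:
        break
    sequence.append(nxt)
print(" ".join(sequence))  # it was the best of times
```

A real LLM does the same thing at scale: the Transformer scores every token in its vocabulary given the whole sequence so far, and one token is chosen (here greedily; in practice usually by sampling).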
For more explanations regarding the Transformer architecture, please watch this and this excellent explanation video on YouTube.
Is it really just predicting the next word?
What other methods besides deep neural networks are applied in ChatGPT? Find out about:
The article by Millière and Buckner (2024, 8f) contains more information
From a recent article by Gimpel et al. (2023, 18ff):
Which of these points apply in your context? Which don’t? And why?