Is GPT-4 just a blockhead?
This suggests that GPT-4’s responses could be generated by approximately – and, in some cases, exactly – reproducing samples from its training data.
Zhang et al. (2021)
In the 1950s, there were two lines of thought competing in the field of linguistics:
Structuralists insisted that all languages are unique in their inner workings, each requiring a different model for understanding.
The opposing school, led by Zellig Harris and Noam Chomsky, argued that all languages share a certain universal grammar.
Based on the idea of a universal grammar is the distributional hypothesis (Harris 1954):
“You shall know a word by the company it keeps” (Firth 1957)
The implications of the distributional hypothesis: words that occur in similar contexts tend to have similar meanings, so a word can be represented by the contexts it appears in (Jurafsky and Martin 2024).
The problem with a mere vector representation:
It becomes huge: one dimension per word in the vocabulary, with almost all entries zero.
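A toy sketch of the problem (not from the slides; the mini corpus, the window size of 2, and the helper function are made up for illustration): each word's raw co-occurrence vector has one dimension per vocabulary word, so it grows with the vocabulary and stays mostly zeros.

```python
# Toy illustration: raw co-occurrence count vectors over a tiny corpus.
from collections import Counter

corpus = "it was the best of times it was the worst of times".split()
vocab = sorted(set(corpus))          # one dimension per word type
window = 2                           # context = 2 words on each side (assumption)

def cooccurrence_vector(target: str) -> list[int]:
    counts = Counter()
    for i, w in enumerate(corpus):
        if w != target:
            continue
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        counts.update(corpus[lo:i] + corpus[i + 1:hi])
    return [counts[v] for v in vocab]  # dense vector over the whole vocabulary

print(vocab)
print("best ->", cooccurrence_vector("best"))
# With a realistic vocabulary of ~100,000 words, every vector would have
# ~100,000 dimensions, almost all of them zero.
```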
But you can train a neural network!
It predicts a word’s context given the word itself. Or vice versa.
Don’t you need training data?
The Internet has plenty of data!
It was the best of times, it was the worst of times, it was the age of wisdom.
So, create your own training examples from it! (Mikolov et al. 2013)
It was the best of ____
You focus on a target word and its surrounding context words:
It was the best of times, it was the worst…
Now, let that window slide:
It was the [best] of times, it was the worst…
It was the best [of] times, it was the worst…
It was the best of [times], it was the worst…
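The sketch below shows how such a sliding window could turn raw text into self-supervised (target, context) pairs; the window of two context words on each side is an assumption for illustration, not a prescribed setting.

```python
# Slide a window over the sentence and emit one (target, context) pair
# for every context word around every target word.
sentence = "it was the best of times it was the worst of times".split()
window = 2  # context words on each side (assumption)

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))  # one pair per context word

for target, context in pairs[:6]:
    print(f"target={target!r} -> context={context!r}")
# ('it', 'was'), ('it', 'the'), ('was', 'it'), ('was', 'the'), ...
```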
Using the skip-gram model, we learn to predict one context word (output) using one target word (input).
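Below is a from-scratch sketch of that idea in numpy, not the original word2vec implementation: the tiny corpus, the window of 2, the embedding size of 8, the full softmax, and plain gradient descent are all simplifying assumptions. The rows of the input weight matrix end up as the learned word vectors.

```python
# Minimal skip-gram sketch: a single hidden layer whose input weights
# become the word embeddings. Toy settings throughout (assumptions).
import numpy as np

corpus = "it was the best of times it was the worst of times".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window = len(vocab), 8, 2      # vocab size, embedding dim, window

# (target, context) index pairs from the sliding window
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - window), min(len(corpus), i + window + 1))
         if j != i]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # target-word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output (context) weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.05
for epoch in range(200):
    for t, c in pairs:
        h = W_in[t]                  # hidden layer = embedding of the target word
        p = softmax(h @ W_out)       # predicted distribution over context words
        err = p.copy()
        err[c] -= 1.0                # gradient of the cross-entropy loss
        grad_h = W_out @ err         # gradient w.r.t. the hidden layer
        W_out -= lr * np.outer(h, err)
        W_in[t] -= lr * grad_h

# After training, the rows of W_in are the learned word vectors.
print("embedding for 'best':", np.round(W_in[idx["best"]], 2))
```

In practice, word2vec avoids the full softmax over the vocabulary and uses tricks such as negative sampling or the hierarchical softmax to make training on web-scale text feasible.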