Is GPT-4 just a blockhead?
This suggests that GPT-4’s responses could be generated by approximately – and, in some cases, exactly – reproducing samples from its training data.
Zhang et al. (2021)
In the 1950s, two lines of thought competed in the field of linguistics:
Structuralists insisted that every language is unique in its inner workings, each requiring its own model of understanding.
The opposing school, led by Zellig Harris and Noam Chomsky, argued that all languages share a certain universal grammar.
Based on the idea of a universal grammar is the distributional hypothesis (Harris 1954):
You shall know a word by the company it keeps
(Firth 1957)
The implications of the distributional hypothesis:
(Jurafsky and Martin 2024)
The problem with a mere vector representation:
It becomes huge: one dimension per vocabulary word, with almost all entries zero.
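A minimal sketch of why such vectors become huge, using a hypothetical six-word toy vocabulary: a one-hot vector needs one dimension per vocabulary word, so a realistic vocabulary yields enormous, almost entirely zero vectors.

```python
# One-hot encoding over a toy vocabulary: one dimension per word.
vocab = ["it", "was", "the", "best", "of", "times"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word` over `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("best", vocab))       # [0, 0, 0, 1, 0, 0]
print(len(one_hot("best", vocab)))  # 6 dimensions for 6 words
# With a 100,000-word vocabulary, every word becomes a
# 100,000-dimensional vector that is zero almost everywhere.
```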
But you can train a neural network!
It predicts a word’s context given the word itself. Or vice versa.
Don’t you need training data?
The Internet has plenty of data!
It was the best of times, it was the worst of times, it was the age of wisdom.
So, create your own! (Mikolov et al. 2013)
It was the best of ____
You focus on a target word and its surrounding context words:
It was the best of times, it was the worst…
Now, let that window slide:
[It was the best of times,] it was the worst…
It [was the best of times, it] was the worst…
It was [the best of times, it was] the worst…
Using the skip-gram model, we learn to predict one context word (output) using one target word (input).
(Weng 2017)
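The sliding window above can be sketched as a simple pair-extraction loop. This is an illustration of how skip-gram training pairs are built, not the actual word2vec implementation: for each target word, every word within `window` positions becomes one (target, context) training example.

```python
# Extract skip-gram (target, context) training pairs from a
# token list using a symmetric context window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "it was the best of times".split()
for target, context in skipgram_pairs(tokens, window=2):
    print(target, "->", context)
# First pairs: it -> was, it -> the, was -> it, was -> the, ...
```

Each pair is one input/output example for the network: the one-hot target word goes in, and the network is trained to assign high probability to the observed context word.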
the New York Times, a fantastic newspaper
the good old times are long gone
during these times you can call
To predict a target word from multiple source context words, we can use the continuous bag-of-words (CBOW) model:
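CBOW mirrors skip-gram: instead of predicting context words from the target, the surrounding context words jointly predict the target in the middle. A sketch of how such training examples look (again an illustration, not the word2vec implementation itself):

```python
# Build CBOW training examples: (context words, target word).
def cbow_examples(tokens, window=2):
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "it was the best of times".split()
for context, target in cbow_examples(tokens):
    print(context, "->", target)
# e.g. ['was', 'the', 'of', 'times'] -> best
```

In training, the context vectors are averaged into a single input, and the network learns to predict the target word from that average.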
\(\overrightarrow{king} - \overrightarrow{man} + \overrightarrow{woman} = \overrightarrow{queen}\)
(Millière and Buckner 2024)
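The famous analogy can be reproduced with hand-made toy vectors. These two dimensions (roughly "royalty" and "gender") are an invented illustration, not real word2vec embeddings, which have hundreds of dimensions learned from data:

```python
import math

# Toy 2-dimensional embeddings; the dimensions are hand-chosen
# to make the analogy work, purely for illustration.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component-wise
result = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Which word's vector is closest to the result?
best = max(emb, key=lambda word: cosine(emb[word], result))
print(best)  # queen
```

With real embeddings the equality only holds approximately, which is why the nearest neighbor is found by cosine similarity rather than exact comparison.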
What are the limitations of ChatGPT?
Try prompting it to spell words backwards. Can it play Tic Tac Toe? What other things does it fail at?
The aforementioned models were still just shallow neural networks…
Current Large Language Models are based on the Transformer architecture (Vaswani et al. 2017).
GPT-3, GPT-4, and ChatGPT are autoregressive Transformer models.
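"Autoregressive" means the model repeatedly predicts the next token and feeds it back in as input. A minimal sketch of that loop, where a toy bigram count table stands in for the Transformer's learned probability distribution:

```python
from collections import Counter

# Toy "training data" and a bigram table standing in for a
# learned next-token distribution.
text = "it was the best of times it was the worst of times"
tokens = text.split()
bigrams = Counter(zip(tokens, tokens[1:]))

def next_word(word):
    """Return the most frequent follower of `word`, or None."""
    candidates = {b: c for (a, b), c in bigrams.items() if a == word}
    return max(candidates, key=candidates.get) if candidates else None

# Autoregressive loop: each prediction becomes the next input.
sequence = ["it"]
for _ in range(5):
    nxt = next_word(sequence[-1])
    if nxt is None:
        break
    sequence.append(nxt)
print(" ".join(sequence))  # it was the best of times
```

A real LLM does the same thing at scale: the Transformer scores every token in its vocabulary given the whole sequence so far, and one token is chosen (here greedily; in practice usually by sampling).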
For more explanations regarding the Transformer architecture, please watch this and this excellent explanation video on YouTube.
Is it really just predicting the next word?
What other methods besides deep neural networks are applied in ChatGPT? Find out about:
The article by Millière and Buckner (2024, 8f) contains more information
From a recent article by Gimpel et al. (2023, 18ff):
Which of these points apply in your context? Which don’t? And why?