Is GPT-4 just a Blockhead?

Prof. Dr. Mirco Schoenfeld

Is GPT-4 just a blockhead?

This suggests that GPT-4’s responses could be generated by approximately – and, in some cases, exactly – reproducing samples from its training data.

Zhang et al. (2021)

two schools of NLP

In the 1950s, two lines of thought were
competing in the field of linguistics:

  1. model-based approaches
  2. stochastic paradigms

structuralists

Structuralists insisted that every language is unique in its inner workings, each requiring its own model to be understood.

statistical approach

The opposing school, associated with Zellig Harris and Noam Chomsky, argued that all languages share a certain universal grammar.

distributional hypothesis

The distributional hypothesis (Harris 1954)
builds on the idea of a universal grammar:

You shall know a word by the company it keeps
(Firth 1957)

implications of distributional hypothesis

The implications of the distributional hypothesis:

  • words that are used in the same
    contexts tend to carry similar meaning
  • the more semantically similar two words are,
    the more distributionally similar they will be
  • semantically similar words appear in similar linguistic contexts

vector spaces

(Jurafsky and Martin 2024)
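
A minimal sketch of such a vector space: count how often words co-occur within a small window over a toy corpus (both the corpus and the window size here are illustrative assumptions), then compare the resulting count vectors with cosine similarity.

from collections import Counter
import numpy as np

corpus = [
    "it was the best of times",
    "it was the worst of times",
]
window = 2  # toy context window size, chosen only for illustration

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# count how often each pair of words co-occurs within the window
counts = Counter()
for sent in tokens:
    for i, target in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(target, sent[j])] += 1

# each word becomes a vector of co-occurrence counts,
# with one dimension per vocabulary word
M = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    M[idx[w], idx[c]] = n

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# distributionally similar words end up with similar vectors;
# "best" and "worst" share identical contexts in this toy corpus, so this prints 1.0
print(cosine(M[idx["best"]], M[idx["worst"]]))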

from vectors to embeddings

The problem with a mere vector representation:

It becomes huge: one dimension per vocabulary word, most of them zero.

from vectors to embeddings

But you can train a neural network!

It predicts a word’s context given the word itself. Or vice versa.
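
A minimal sketch of the idea: instead of one huge sparse vector per word, the network keeps a trainable matrix with one short, dense row per word, its embedding. The sizes below are illustrative assumptions, not values from the slides.

import numpy as np

vocab_size = 10_000   # |V|: one dimension per word in a plain vector space
embed_dim = 100       # d: length of the learned dense embedding

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # trainable embedding matrix

word_id = 42              # index of some word in the vocabulary
embedding = E[word_id]    # dense representation: just a row lookup
print(embedding.shape)    # (100,) instead of (10000,)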

training data

Don’t you need training data?

learning language

The Internet has plenty of data!

It was the best of times, it was the worst of times, it was the age of wisdom.

So, create your own training examples! (Mikolov et al. 2013)

It was the best of ____

learning language

You focus on a target word and its surrounding context words:

It was [the best of times], it was the worst…

learning language

Now, let that window slide:

It was the [best of times, it] was the worst…

It was the best [of times, it was] the worst…

It was the best of [times, it was the] worst…
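
A minimal sketch of that sliding window turned into code, assuming a symmetric window of two words on each side (the actual window size is a hyperparameter):

text = "it was the best of times it was the worst of times"
tokens = text.split()
window = 2  # illustrative window size

# collect (target word, context word) training pairs
pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs[:5])
# [('it', 'was'), ('it', 'the'), ('was', 'it'), ('was', 'the'), ('was', 'best')]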

learning language

Using the skip-gram model, we learn to predict one
context word (output) using one target word (input).
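
A minimal sketch of this prediction step, assuming two small embedding matrices and a softmax over the vocabulary; the sizes are toy values and the weights are untrained, and real word2vec training additionally uses tricks such as negative sampling.

import numpy as np

vocab = ["it", "was", "the", "best", "of", "times", "worst"]
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8  # toy vocabulary size and embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))    # target-word (input) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word (output) embeddings

def predict_context(target_word):
    v = W_in[idx[target_word]]     # look up the target word's embedding
    scores = W_out @ v             # one score per candidate context word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()     # softmax over the whole vocabulary

# training would adjust W_in and W_out so that observed pairs such as
# ("best", "of") receive high probability; here the weights are still random
print(predict_context("best")[idx["of"]])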