From Languages to Regular Expressions

Prof. Dr. Mirco Schoenfeld

Motivation

Motivation

Motivation

Lorem ipsum dolor sit amet, consetetur sadipscing elitr (Mueller 2020), sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua (Smith 1999). At vero eos et accusam et justo duo dolores et ea rebum (West 2022).

ERRORS IN YOUR CITATIONS

Motivation

Finding words in text is easy.

But what if we aren’t looking for a specific word?

Motivation

(Mueller 2020) (Smith 1999) (West 2022)

We are looking to fix all references - these are rather patterns

A word, followed by a space, followed by four digits in parentheses

What is a regular expression

Describing patterns in text can be done using regular expression

https://regexr.com/

Why regular expressions

We use regular expressions to

  • match patterns in text efficiently
  • validate if text adheres to a standard
  • ease repetitive tasks

Disclaimer

What we will not learn today:

Algorithms that evaluate regular expressions.

Disclaimer

Huh?

Algorithms exist that are blazingly fast.

You’ll need half of a lecture “theoretical computer science” to understand why

Towards regexp

Regular expressions are used to match patterns in text.

So, we first need to understand what texts consists of.

Alphabet

What are words?


Koster (2024)

Alphabet

To describe words, we need

  • the set of natural numbers \(\mathbb{N}\),
  • an alphabet \(\Sigma\), which is a non-empty, finite set,
  • characters or symbols, which are elements of an alphabet.

Alphabet

The set of natural numbers includes 0:

\(\mathbb{N} = \{0,1,2,3,...\}\)

Alphabet

Convention

Capital greek letters describe alphabets:

\(\Sigma, \Delta, \Gamma\)

Alphabet

Convention

Lower-case letters describe symbols:

\(\sigma, \alpha, a, a_{i}\) with \(i \in\mathbb{N}\)

Words

A word w in an alphabet \(\Sigma\) is
a finite sequence \(a_{1}\ldots a_{n}\) of symbols in \(\Sigma\).

Words

Words \(w=a_{1}\ldots a_{n}\) have a length \(|w|\) which is
the number \(n\) of symbols in \(w\).

Words

Words can be empty.

That is expressed using the empty string \(\epsilon\). And \(|\epsilon|=0\).

Languages

A set of words from an alphabet \(\Sigma\) is called a language \(L\) over \(\Sigma\).

Languages

Such mathematically defined languages \(L\) are usually neither meaningful, interesting nor suitable for interpersonal communication.

Regular Expressions

Regular Expressions

A regular expression allows us to write down a language compactly.

Compactly means, we omit writing down all words of the language.

Regular Expressions

\(L(ab[cd]) = \{ abc, abd\}\)

Regular Expressions

What can we describe?

  • atomic expressions
  • alternatives
  • concatenations
  • repetitions

Regular Expressions

Regular expressions for atomic expressions:

Given an alphabet \(\Sigma\) that doesn’t contain certain special characters.

Every word \(w\) from this alphabet \(\Sigma\) is a regular expression \(R\).

The language \(L(R)\) defined by \(R\) is simply \(\{w\}\).

This boils down to finding a specific word inside a text.

Regular Expressions

Ut perspiciatis unde omnis iste natus error…

\(R\) : error

Ut perspiciatis unde omnis iste natus error

Regular Expressions

Regular expressions for alternatives:

Given regular expressions \(R_1\) and \(R_2\),
then \(R_1 | R_2\) is also a regular expression.

\(L(R_1|R_2) = L(R_1)\cup L(R_2)\)

The ‘|’ can be used to search for alternatives.

The ‘|’ is one of the above-mentioned special characters.
Searching for this character, requires special instructions.

Regular Expressions

Ut perspiciatis unde omnis iste natus error…

\(R\) : error|unde

Ut perspiciatis unde omnis iste natus error

Regular Expressions

Regular expressions for concatenations:

Given regular expressions \(R_1\) and \(R_2\),
then \(R_1R_2\) is also a regular expression.

\(L(R_1R_2) = \{uv | u\in L(R_1), v\in L(R_2)\}\)

\(L(R_1R_2)\) contains all words that can be split into distinct parts.

Use parentheses to combine concatenations and alternatives.
Parentheses are also special characters.

Regular Expressions

Ut perspiciatis unde omnis iste natus error…

\(R\) : (und|ist)e

Ut perspiciatis unde omnis iste natus error…

Regular Expressions

Regular expressions for repetitions:

Given regular expression \(R\),
then \(R^\ast\) is also a regular expression.

\(L(R^\ast) = \{u_1 \ldots u_n | u_i \in L(R) \}\)

\(L(R^\ast)\) contains all words that are
arbitrary sequences of words in \(L(R)\).

\(L(R^\ast)\) is infinite.

Regular Expressions

Ut perspiciatis unde omnis iste natus error…

\(R\) : er*

Ut perspiciatis unde omnis iste natus error

Two issues

Regular Expressions (regexp) face two issues:

  1. Notation can be awkward using ASCII and it is not standardized
  2. Incompatibilities in notations between programs

Syntactic Sugar

Syntactic Sugar

Many programs allow following syntactic sugar:

Instead of a|b|c|d|e|x|y|z

we can use [abcdexyz]

or even [a-exyz]

Syntactic Sugar

More syntactic sugar:

  • the dot . matches arbitrary characters
  • A? instead of (|A) (no or one A)
  • A+ instead of AA* (at least one A)
  • A{n} matches “A exactly n times”
  • A{n,m} matches “A at least n and at most m times”
  • to match special characters prepend a backslash

Test

What does this regexp match?

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}

(“Email Address Regular Expression That 99.99% Works.” n.d.)

Resources

Recommendable resources for self-learning:

Back to the beginning

https://regexr.com/

Task

Until next week, what does this regex match?

[.?!][]\"’)]*\\($\\| $\\|\t\\| \\)[ \t\n]*

😶‍🌫️

Thanks

https://xkcd.com/208/

Acknowledgement

Acknowledgement:

This script is inspired by Tantau (2016).

References

“Email Address Regular Expression That 99.99% Works.” n.d. https://emailregex.com/.
Koster, Rob. 2024. “File:llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch Stationbord.JPG.” https://commons.wikimedia.org/w/index.php?title=File:Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch_stationbord.JPG&oldid=899898470.
Back to Lecture Website