From Languages to Regular Expressions

Motivation

ERRORS IN YOUR CITATIONS

Motivation

Finding words in text is easy.

But what if we aren’t looking for a specific word?

Motivation

We are looking to fix all references - these are rather patterns

A word, followed by a space, followed by four digits in parentheses

What is a regular expression

Describing patterns in text can be done using regular expression

https://regexr.com/

Why regular expressions

We use regular expressions to

match patterns in text efficiently
validate if text adheres to a standard
ease repetitive tasks

Disclaimer

What we will not learn today:

Algorithms that evaluate regular expressions.

Disclaimer

Huh?

Algorithms exist that are blazingly fast.

You’ll need half of a lecture “theoretical computer science” to understand why

Towards regexp

Regular expressions are used to match patterns in text.

So, we first need to understand what texts consists of.

Alphabet

What are words?

Koster (2024)

Alphabet

To describe words, we need

the set of natural numbers $\mathbb{N}$,
an alphabet $\Sigma$, which is a non-empty, finite set,
characters or symbols, which are elements of an alphabet.

Alphabet

The set of natural numbers includes 0:

$\mathbb{N} = \{0,1,2,3,...\}$

Alphabet

Convention

Capital greek letters describe alphabets:

$\Sigma, \Delta, \Gamma$

Alphabet

Convention

Lower-case letters describe symbols:

$\sigma, \alpha, a, a_{i}$ with $i \in\mathbb{N}$

Words

A word w in an alphabet $\Sigma$ is
a finite sequence $a_{1}\ldots a_{n}$ of symbols in $\Sigma$.

Words

Words $w=a_{1}\ldots a_{n}$ have a length $|w|$ which is
the number $n$ of symbols in $w$.

Words

Words can be empty.

That is expressed using the empty string $\epsilon$. And $|\epsilon|=0$.

Languages

A set of words from an alphabet $\Sigma$ is called a language $L$ over $\Sigma$.

Languages

Such mathematically defined languages $L$ are usually neither meaningful, interesting nor suitable for interpersonal communication.

Regular Expressions

A regular expression allows us to write down a language compactly.

Compactly means, we omit writing down all words of the language.

Regular Expressions

$L(ab[cd]) = \{ abc, abd\}$

Regular Expressions

What can we describe?

atomic expressions
alternatives
concatenations
repetitions

Regular Expressions

Regular expressions for atomic expressions:

Given an alphabet $\Sigma$ that doesn’t contain certain special characters.

Every word $w$ from this alphabet $\Sigma$ is a regular expression $R$.

The language $L(R)$ defined by $R$ is simply $\{w\}$.

This boils down to finding a specific word inside a text.

Regular Expressions

$R$ : error

Regular Expressions

Regular expressions for alternatives:

Given regular expressions $R_1$ and $R_2$,
then $R_1 | R_2$ is also a regular expression.

$L(R_1|R_2) = L(R_1)\cup L(R_2)$

The ‘|’ can be used to search for alternatives.

The ‘|’ is one of the above-mentioned special characters.
Searching for this character, requires special instructions.

Regular Expressions

$R$ : error|unde

Regular Expressions

Regular expressions for concatenations:

Given regular expressions $R_1$ and $R_2$,
then $R_1R_2$ is also a regular expression.

$L(R_1R_2) = \{uv | u\in L(R_1), v\in L(R_2)\}$

$L(R_1R_2)$ contains all words that can be split into distinct parts.

Use parentheses to combine concatenations and alternatives.
Parentheses are also special characters.

Regular Expressions

$R$ : (und|ist)e

Regular Expressions

Regular expressions for repetitions:

Given regular expression $R$,
then $R^\ast$ is also a regular expression.

$L(R^\ast) = \{u_1 \ldots u_n | u_i \in L(R) \}$

$L(R^\ast)$ contains all words that are
arbitrary sequences of words in $L(R)$.

$L(R^\ast)$ is infinite.

Regular Expressions

$R$ : er*

Two issues

Regular Expressions (regexp) face two issues:

Notation can be awkward using ASCII and it is not standardized
Incompatibilities in notations between programs

Syntactic Sugar

Many programs allow following syntactic sugar:

Instead of a|b|c|d|e|x|y|z

we can use [abcdexyz]

or even [a-exyz]

Syntactic Sugar

More syntactic sugar:

the dot . matches arbitrary characters
A? instead of (|A) (no or one A)
A+ instead of AA* (at least one A)
A{n} matches “A exactly n times”
A{n,m} matches “A at least n and at most m times”
to match special characters prepend a backslash

Test

What does this regexp match?

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}

(“Email Address Regular Expression That 99.99% Works.” n.d.)

Resources

Recommendable resources for self-learning:

Back to the beginning

https://regexr.com/

Learn it!

I strongly recommend learning regular expressions!

Beware of the rabbit hole, though…

https://xkcd.com/1319/

Task

Until next week, what does this regex match?

[.?!][]\"’)]*\$$\\| $\\|\t\\| \$[ \t\n]*

😶‍🌫️

Thanks

mirco.schoenfeld@uni-bayreuth.de

https://xkcd.com/208/

Acknowledgement

Acknowledgement:

This script is inspired by Tantau (2016).

References

“Email Address Regular Expression That 99.99% Works.” n.d. https://emailregex.com/.

Koster, Rob. 2024. “File:llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch Stationbord.JPG.” https://commons.wikimedia.org/w/index.php?title=File:Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch_stationbord.JPG&oldid=899898470.

From Languages to Regular Expressions

Prof. Dr. Mirco Schoenfeld

Motivation

Motivation

Motivation

Motivation

Motivation

What is a regular expression

Why regular expressions

Disclaimer

Disclaimer

Towards regexp

Alphabet

Alphabet

Alphabet

Alphabet

Alphabet

Words

Words

Words

Languages

Languages

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Two issues

Syntactic Sugar

Syntactic Sugar

Syntactic Sugar

Test

Resources

Back to the beginning

Learn it!

Task

Thanks

Acknowledgement

References