Regular Expressions

Prof. Dr. Mirco Schoenfeld

Motivation

Lorem ipsum dolor sit amet, consetetur sadipscing elitr (Mueller 2020), sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua (Smith 1999). At vero eos et accusam et justo duo dolores et ea rebum (West 2022).

ERRORS IN YOUR CITATIONS

Motivation

Finding words in text is easy.

But what if we aren’t looking for a specific word?

Motivation

(Mueller 2020) (Smith 1999) (West 2022)

We want to fix all references. They are not specific words but patterns:

A word, followed by a space, followed by four digits, all enclosed in parentheses
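As a minimal sketch, the verbal description above can be written as a pattern for Python's `re` module. The variable names and the shortened sample text are assumptions made for illustration, and "a word" is read here as "one or more letters".

```python
import re

# Shortened version of the sample text from the motivation slide.
text = (
    "Lorem ipsum dolor sit amet, consetetur sadipscing elitr (Mueller 2020), "
    "sed diam nonumy eirmod tempor invidunt ut labore (Smith 1999). "
    "At vero eos et accusam (West 2022)."
)

# \( and \)   literal parentheses (escaped, because parentheses are special)
# [A-Za-z]+   a word: one or more letters
# \d{4}       exactly four digits
# The unescaped parentheses capture the name and the year separately.
citation = re.compile(r"\(([A-Za-z]+) (\d{4})\)")

found = citation.findall(text)
print(found)  # [('Mueller', '2020'), ('Smith', '1999'), ('West', '2022')]
```

Because name and year are captured separately, the references could be rewritten into another citation style in a single pass.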

What is a regular expression

Patterns in text can be described using regular expressions.

https://regexr.com/

Why regular expressions

We use regular expressions to

  • match patterns in text efficiently
  • validate if text adheres to a standard
  • ease repetitive tasks

Why you should care

Towards regexp

Regular expressions are used to match patterns in text.

So, we first need to understand what text consists of.

Alphabet

What are words?


Koster (2024)

Words and Alphabets

In a mathematical sense, words are lists of characters.

Words and Alphabets

To describe words, we need

  • the set of natural numbers to describe positions in a list,
  • an alphabet, which is a non-empty, finite set, and
  • characters or symbols, which are elements of an alphabet.

From Alphabets to Languages

A set of words over an alphabet is called a language \(L\).
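The three notions (alphabet, word, language) map directly onto basic data structures. The following is only an illustrative sketch: the alphabet, the helper `is_word_over`, and the example language are made up for this purpose.

```python
# Hypothetical alphabet: a non-empty, finite set of characters.
alphabet = {"a", "b", "c", "d"}


def is_word_over(word, sigma):
    """Return True if every character of the word is taken from sigma."""
    return all(ch in sigma for ch in word)


# A word is a finite list of characters; natural numbers name the positions.
positions = list(enumerate("abd", start=1))
print(positions)  # [(1, 'a'), (2, 'b'), (3, 'd')]

# A language is simply a set of words over the alphabet.
language = {"abc", "abd"}
print(all(is_word_over(w, alphabet) for w in language))  # True
print(is_word_over("abx", alphabet))  # False, 'x' is not in the alphabet
```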

Languages

Such mathematically defined languages \(L\) are usually neither meaningful, interesting nor suitable for interpersonal communication.

Regular Expressions

A regular expression allows us to write down a language compactly.

Compactly means that we do not have to write down every word of the language.

Regular Expressions

\(L(ab[cd]) = \{ abc, abd\}\)
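One way to check such a statement is to enumerate all short words over a small alphabet and keep those the expression accepts. This is a brute-force sketch for illustration only. The alphabet `abcd` and the length limit of 4 are assumptions, and `re.fullmatch` is used so that the whole word has to match.

```python
import re
from itertools import product

alphabet = "abcd"  # assumed alphabet for this illustration
pattern = re.compile(r"ab[cd]")

# Try every word of length 0 to 4 over the alphabet and keep the accepted ones.
language = {
    "".join(chars)
    for length in range(5)
    for chars in product(alphabet, repeat=length)
    if pattern.fullmatch("".join(chars))
}

print(sorted(language))  # ['abc', 'abd']
```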

Regular Expressions

What can we describe?

  • atomic expressions
  • alternatives
  • concatenations
  • repetitions

Regular Expressions

Regular expressions for atomic expressions:

Ut perspiciatis unde omnis iste natus error…

\(R\) : error

Ut perspiciatis unde omnis iste natus error

This boils down to finding a specific word inside a text.
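A small sketch with Python's `re` module shows that, for an atomic expression, a regexp search and an ordinary substring search agree. The variable names are chosen for illustration.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# An atomic expression consists only of literal characters.
match = re.search(r"error", text)

print(match.group(0))  # error
# The match starts exactly where a plain substring search finds the word.
print(match.start() == text.find("error"))  # True
```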

Regular Expressions

Regular expressions for alternatives:

Ut perspiciatis unde omnis iste natus error…

\(R\) : error|unde

Ut perspiciatis unde omnis iste natus error

The ‘|’ can be used to search for alternatives.
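The alternative from this slide can be tried out as follows. This is a sketch using Python's `re.findall`, which returns the matches in the order in which they occur in the text.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# 'error|unde' matches either the word 'error' or the word 'unde'.
found = re.findall(r"error|unde", text)

print(found)  # ['unde', 'error']
```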

Regular Expressions

Regular expressions for concatenations:

Ut perspiciatis unde omnis iste natus error…

\(R\) : (und|ist)e

Ut perspiciatis unde omnis iste natus error…

Use parentheses to combine concatenations and alternatives.
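A sketch of the grouped expression in Python follows. One detail is specific to Python's `re` module and not to regular expressions in general: parentheses also capture, so `re.findall` would return only the group contents. The sketch therefore reads the full match via `group(0)`.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# (und|ist)e : either 'und' or 'ist', followed by the letter 'e'.
pattern = re.compile(r"(und|ist)e")

full_matches = [m.group(0) for m in pattern.finditer(text)]
print(full_matches)  # ['unde', 'iste']

# findall returns the captured group instead of the whole match.
print(pattern.findall(text))  # ['und', 'ist']
```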

Regular Expressions

Regular expressions for repetitions:

Ut perspiciatis unde omnis iste natus error…

\(R\) : er*

Ut perspiciatis unde omnis iste natus error…

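The star allows arbitrarily many repetitions, so \(L(R^\ast)\) is infinite whenever \(R\) matches at least one non-empty word.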
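The following sketch illustrates both points in Python: what `er*` finds in the sample text, and why a starred expression describes an unbounded set of words. The second pattern, `(ab)*`, is an assumed example that does not appear on the slides.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# er* : an 'e' followed by zero or more 'r'.
found = re.findall(r"er*", text)
print(found)  # ['er', 'e', 'e', 'err']

# (ab)* accepts '', 'ab', 'abab', ... for any number of repetitions,
# so its language cannot be written down as a finite list.
accepted = [bool(re.fullmatch(r"(ab)*", "ab" * n)) for n in range(5)]
print(accepted)  # [True, True, True, True, True]
```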

Two issues

Regular expressions (regexp) face two issues:

  1. The notation can be awkward when written in ASCII, and it is not standardized.
  2. Notations are incompatible between programs.

Syntactic Sugar

Many programs allow the following syntactic sugar:

Instead of a|b|c|d|e|x|y|z

we can use [abcdexyz]

or even [a-exyz]
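That the three notations describe the same language can be checked by testing every lowercase letter against each of them. This is a sketch in Python, and restricting the check to `string.ascii_lowercase` is an assumption made to keep it short.

```python
import re
import string

patterns = [r"a|b|c|d|e|x|y|z", r"[abcdexyz]", r"[a-exyz]"]

# For each notation, collect the single letters it accepts.
accepted = [
    {ch for ch in string.ascii_lowercase if re.fullmatch(p, ch)}
    for p in patterns
]

print(sorted(accepted[0]))  # ['a', 'b', 'c', 'd', 'e', 'x', 'y', 'z']
print(accepted[0] == accepted[1] == accepted[2])  # True
```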

Syntactic Sugar

More syntactic sugar:

  • the dot . matches any single character
  • A? instead of (|A) (zero or one A)
  • A+ instead of AA* (at least one A)
  • A{n} matches “A exactly n times”
  • A{n,m} matches “A at least n and at most m times”
  • to match a special character literally, prepend a backslash
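Each of these shortcuts can be tried out in isolation. The sketch below uses Python's `re` module, and the helper `matches` and all example words are made up for illustration. Note that in Python the dot does not match a newline unless the `re.DOTALL` flag is set.

```python
import re


def matches(pattern, word):
    """Return True if the whole word is matched by the pattern."""
    return re.fullmatch(pattern, word) is not None


print(matches(r"a.c", "abc"), matches(r"a.c", "ac"))            # True False
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))  # True True
print(matches(r"a+", "aaa"), matches(r"a+", ""))                # True False
print(matches(r"\d{4}", "2020"), matches(r"\d{4}", "202"))      # True False
print(matches(r"a{2,3}", "aaa"), matches(r"a{2,3}", "aaaa"))    # True False

# Unescaped, the dot stands for any character; escaped, only for a literal dot.
print(matches(r"3.14", "3x14"), matches(r"3\.14", "3x14"))      # True False
```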

Test

What does this regexp match?

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}

(“Email Address Regular Expression That 99.99% Works.” n.d.)
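A quick way to explore such an expression is to run it against a few candidate strings. The following Python sketch does that. The candidate addresses are invented for illustration, and `fullmatch` is an assumption about how the pattern is meant to be applied (to the whole string).

```python
import re

email = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}")

# Invented test strings, not taken from the cited source.
candidates = [
    "jane.doe@example.org",    # typical address
    "no-at-sign.example.org",  # no '@' at all
    "user@host",               # no dot-separated ending after the host
    "a@b.technology",          # ending longer than six letters
]

results = {c: email.fullmatch(c) is not None for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first candidate is accepted. The last one is rejected because the final part is limited to six letters, which hints at why the source claims 99.99% rather than 100%.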

Resources

Recommended resources for self-learning:

Back to the beginning

https://regexr.com/

Learn it!

I strongly recommend learning regular expressions!

Beware of the rabbit hole, though…

Thanks

https://xkcd.com/208/

Acknowledgement

This script is inspired by Tantau (2016).

References

“Email Address Regular Expression That 99.99% Works.” n.d. https://emailregex.com/.
Koster, Rob. 2024. “File:llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch Stationbord.JPG.” https://commons.wikimedia.org/w/index.php?title=File:Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch_stationbord.JPG&oldid=899898470.