Lorem ipsum dolor sit amet, consetetur sadipscing elitr (Mueller 2020), sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua (Smith 1999). At vero eos et accusam et justo duo dolores et ea rebum (West 2022).
ERRORS IN YOUR CITATIONS
Finding words in text is easy.
But what if we aren’t looking for a specific word?
(Mueller 2020) (Smith 1999) (West 2022)
We are looking to fix all references - these are rather patterns
A word, followed by a space, followed by four digits in parentheses
Describing patterns in text can be done using regular expression
We use regular expressions to
What we will not learn today:
Algorithms that evaluate regular expressions.
Huh?
Algorithms exist that are blazingly fast.
You’ll need half of a lecture “theoretical computer science” to understand why
Regular expressions are used to match patterns in text.
So, we first need to understand what texts consists of.
What are words?
Koster
(2024)
To describe words, we need
The set of natural numbers includes 0:
\(\mathbb{N} = \{0,1,2,3,...\}\)
Convention
Capital greek letters describe alphabets:
\(\Sigma, \Delta, \Gamma\)
Convention
Lower-case letters describe symbols:
\(\sigma, \alpha, a, a_{i}\) with \(i \in\mathbb{N}\)
A word w in an alphabet \(\Sigma\) is
a finite sequence \(a_{1}\ldots a_{n}\) of symbols in \(\Sigma\).
Words \(w=a_{1}\ldots a_{n}\) have a
length \(|w|\) which is
the number \(n\) of symbols in \(w\).
Words can be empty.
That is expressed using the empty string \(\epsilon\). And \(|\epsilon|=0\).
A set of words from an alphabet \(\Sigma\) is called a language \(L\) over \(\Sigma\).
Such mathematically defined languages \(L\) are usually neither meaningful, interesting nor suitable for interpersonal communication.
A regular expression allows us to write down a language compactly.
Compactly means, we omit writing down all words of the language.
\(L(ab[cd]) = \{ abc, abd\}\)
What can we describe?
Regular expressions for atomic expressions:
Given an alphabet \(\Sigma\) that doesn’t contain certain special characters.
Every word \(w\) from this alphabet \(\Sigma\) is a regular expression \(R\).
The language \(L(R)\) defined by \(R\) is simply \(\{w\}\).
This boils down to finding a specific word inside a text.
Ut perspiciatis unde omnis iste natus error…
\(R\) : error
Ut perspiciatis unde omnis iste natus error…
Regular expressions for alternatives:
Given regular expressions \(R_1\)
and \(R_2\),
then \(R_1 | R_2\) is also a regular
expression.
\(L(R_1|R_2) = L(R_1)\cup L(R_2)\)
The ‘|’ can be used to search for alternatives.
The ‘|’ is one of the above-mentioned special characters.
Searching for this character, requires special instructions.
Ut perspiciatis unde omnis iste natus error…
\(R\) : error|unde
Ut perspiciatis unde omnis iste natus error…
Regular expressions for concatenations:
Given regular expressions \(R_1\)
and \(R_2\),
then \(R_1R_2\) is also a regular expression.
\(L(R_1R_2) = \{uv | u\in L(R_1), v\in L(R_2)\}\)
\(L(R_1R_2)\) contains all words that can be split into distinct parts.
Use parentheses to combine concatenations and
alternatives.
Parentheses are also special characters.
Ut perspiciatis unde omnis iste natus error…
\(R\) : (und|ist)e
Ut perspiciatis unde omnis iste natus error…
Regular expressions for repetitions:
Given regular expression \(R\),
then \(R^\ast\) is also a regular expression.
\(L(R^\ast) = \{u_1 \ldots u_n | u_i \in L(R) \}\)
\(L(R^\ast)\) contains all words
that are
arbitrary sequences of words in \(L(R)\).
\(L(R^\ast)\) is infinite.
Ut perspiciatis unde omnis iste natus error…
\(R\) : er*
Ut perspiciatis unde omnis iste natus error…
Regular Expressions (regexp) face two issues:
Many programs allow following syntactic sugar:
Instead of a|b|c|d|e|x|y|z
we can use [abcdexyz]
or even [a-exyz]
More syntactic sugar:
.
matches arbitrary
charactersA?
instead of (|A)
(no or
one A)A+
instead of AA*
(at
least one A)A{n}
matches “A exactly n times”A{n,m}
matches “A at least n and at
most m times”What does this regexp match?
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}
(“Email Address Regular Expression That 99.99% Works.” n.d.)
Recommendable resources for self-learning:
Until next week, what does this regex match?
[.?!][]\"’)]*\\($\\| $\\|\t\\| \\)[ \t\n]*
😶🌫️
Acknowledgement:
This script is inspired by Tantau (2016).