Lorem ipsum dolor sit amet, consetetur sadipscing elitr (Mueller 2020), sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua (Smith 1999). At vero eos et accusam et justo duo dolores et ea rebum (West 2022).
ERRORS IN YOUR CITATIONS
Finding words in text is easy.
But what if we aren’t looking for a specific word?
(Mueller 2020) (Smith 1999) (West 2022)
We want to fix all references. These are not specific words but patterns:
An opening parenthesis, followed by a word, a space, four digits, and a closing parenthesis
Describing patterns in text can be done using regular expressions.
We use regular expressions to
Regular expressions are used to match patterns in text.
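As a minimal sketch of the citation pattern described above, the following Python snippet uses the standard `re` module. The sample text is shortened from the placeholder paragraph, and the variable names as well as the choice of `[A-Za-z]+` for "a word" are assumptions made for illustration; names containing umlauts or hyphens would need a wider character class.

```python
import re

# Shortened sample text with three citations (illustrative only).
text = (
    "Lorem ipsum dolor sit amet, consetetur sadipscing elitr (Mueller 2020), "
    "sed diam voluptua (Smith 1999). At vero eos et accusam (West 2022)."
)

# "(" + a word + a space + four digits + ")".
# Parentheses are escaped because they have a special meaning in regexps.
citation_pattern = re.compile(r"\([A-Za-z]+ [0-9]{4}\)")

citations = citation_pattern.findall(text)
print(citations)  # ['(Mueller 2020)', '(Smith 1999)', '(West 2022)']
```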
So, we first need to understand what text consists of.
What are words?
Koster (2024)
In a mathematical sense, words are lists of characters.
To describe words, we need
A set of words over an alphabet is called a language \(L\).
Such mathematically defined languages \(L\) are usually neither meaningful, interesting nor suitable for interpersonal communication.
A regular expression allows us to write down a language compactly.
Compactly means that we do not have to write down every word of the language.
\(L(ab[cd]) = \{ abc, abd\}\)
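To make the idea of a language concrete, the sketch below enumerates \(L(ab[cd])\) by brute force in Python. The alphabet `abcd` is an assumption chosen for illustration; since the expression only matches words of exactly three characters, it is enough to test all words of that length.

```python
import re
from itertools import product

alphabet = "abcd"  # assumed alphabet for this illustration
pattern = re.compile(r"ab[cd]")

# Build every word of length 3 over the alphabet and keep
# those that the regular expression matches completely.
language = sorted(
    "".join(chars)
    for chars in product(alphabet, repeat=3)
    if pattern.fullmatch("".join(chars))
)
print(language)  # ['abc', 'abd']
```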
What can we describe?
Regular expressions for atomic expressions:
Ut perspiciatis unde omnis iste natus error…
\(R\) : error
Ut perspiciatis unde omnis iste natus error…
This boils down to finding a specific word inside a text.
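A minimal Python sketch of the atomic case, using the sample sentence from above without the trailing ellipsis. The variable names are illustrative.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# An atomic expression is just the literal word we are looking for.
match = re.search(r"error", text)

# The match object tells us where in the text the word was found.
found = text[match.start():match.end()]
print(found)  # error
```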
Regular expressions for alternatives:
Ut perspiciatis unde omnis iste natus error…
\(R\) : error|unde
Ut perspiciatis unde omnis iste natus error…
The ‘|’ can be used to search for alternatives.
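The alternative `error|unde` can be tried out with a short Python sketch on the same sample sentence (without the ellipsis). Matches are reported in the order in which they appear in the text, not in the order of the alternatives.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# Either "error" or "unde" counts as a match.
matches = re.findall(r"error|unde", text)
print(matches)  # ['unde', 'error']
```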
Regular expressions for concatenations:
Ut perspiciatis unde omnis iste natus error…
\(R\) : (und|ist)e
Ut perspiciatis unde omnis iste natus error…
Use parentheses to combine concatenations and alternatives.
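The following Python sketch applies `(und|ist)e` to the sample sentence. One caveat that is specific to Python's `re` module and not part of the theory: parentheses also create a capture group, so `re.findall` would return only the group contents. The sketch therefore reads the full match from each match object.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# "und" or "ist", each followed by "e".
pattern = re.compile(r"(und|ist)e")

# group(0) is the whole match, group(1) only the parenthesised part.
full_matches = [m.group(0) for m in pattern.finditer(text)]
groups_only = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)  # ['unde', 'iste']
print(groups_only)   # ['und', 'ist']
```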
Regular expressions for repetitions:
Ut perspiciatis unde omnis iste natus error…
\(R\) : er*
Ut perspiciatis unde omnis iste natus error…
\(L(R^\ast)\) is infinite.
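As a hedged illustration of repetition, the Python sketch below applies `er*` to the sample sentence and then checks a handful of words from the language of `er*`. The upper bound of six words is arbitrary; the point is that the list could be extended indefinitely.

```python
import re

text = "Ut perspiciatis unde omnis iste natus error"

# "e" followed by zero or more "r".
matches = re.findall(r"er*", text)
print(matches)  # ['er', 'e', 'e', 'err']

# The language of er* has no longest word: e, er, err, errr, ...
words = ["e" + "r" * n for n in range(6)]
all_match = all(re.fullmatch(r"er*", w) for w in words)
print(all_match)  # True
```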
Regular Expressions (regexp) face two issues:
Many programs allow the following syntactic sugar:
Instead of a|b|c|d|e|x|y|z
we can use [abcdexyz]
or even [a-exyz]
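That the three spellings above describe the same language can be checked with a small Python sketch. It tests every lowercase ASCII letter against each pattern; restricting the check to lowercase ASCII is an assumption made to keep the example short.

```python
import re
import string

patterns = [r"a|b|c|d|e|x|y|z", r"[abcdexyz]", r"[a-exyz]"]

# For each pattern, collect the single letters it accepts.
accepted = [
    {ch for ch in string.ascii_lowercase if re.fullmatch(p, ch)}
    for p in patterns
]

same_language = accepted[0] == accepted[1] == accepted[2]
print(same_language)                 # True
print("".join(sorted(accepted[0])))  # abcdexyz
```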
More syntactic sugar:
. matches an arbitrary character
A? instead of (|A) (zero or one A)
A+ instead of AA* (at least one A)
A{n} matches “A exactly n times”
A{n,m} matches “A at least n and at most m times”
What does this regexp match?
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}
(“Email Address Regular Expression That 99.99% Works.” n.d.)
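One way to explore the question is to try the expression on a few strings. The Python sketch below does this with `re.fullmatch`; apart from the contact address that appears in this script, the candidate strings are made up for illustration. It only shows what the pattern accepts, not whether an address actually exists.

```python
import re

email_pattern = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}")

candidates = [
    "mirco.schoenfeld@uni-bayreuth.de",  # local part, "@", domain, ".", 2-6 letters
    "user@example",                      # hypothetical: no ".xx" ending
    "no-at-sign.example.org",            # hypothetical: no "@"
    "a@b.c",                             # hypothetical: ending shorter than 2 letters
]

# Map each candidate to True/False depending on a complete match.
results = {c: bool(email_pattern.fullmatch(c)) for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first candidate is accepted; the other three each violate one part of the pattern.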
Recommended resources for self-learning:
I strongly recommend learning regular expressions!
Beware of the rabbit hole, though…
mirco.schoenfeld@uni-bayreuth.de
Acknowledgement:
This script is inspired by Tantau (2016).