Prof. Dr. Mirco Schoenfeld

From unstructured data to useful information

Give unstructured data a structure!

Types of data

From Digits to Knowledge

Automation is key.

Intermediate Steps Vary

Processing unstructured data builds on various tools.

Example: Structuring a bunch of Excel sheets

Automatically collect information from a bunch of Excel sheets

What is needed to process Excel sheets?

At least, it has some structure already.

Process Excel sheets

There are several ways to extract data from Excel sheets:

Processing Excel: Pitfalls

Beware of complicated (i.e. nested) table structures!

Example: Structuring emails

From a pile of emails to structured information

What is needed to process emails?

Here, the task becomes trickier.

What do you want to achieve?

Technology depends on what you aim to achieve:

  • identification of key words?
  • identification of characteristic words?
  • topics?
  • sentiment analysis?
  • stance detection?

Robust NLP techniques

Robust Natural Language Processing (NLP) techniques include

  • frequency analysis of text corpora
  • TF-IDF measures
  • topic analysis
  • labeling of emotions / sentiments

Semantic NLP techniques

Processing semantics of text:

  • embedding models
  • word-sense disambiguation
  • entity recognition

Pattern Recognition

Pattern Recognition using Regular Expressions.

You should learn Regular Expressions!
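
As a small taste, the following sketch pulls email addresses and ISO-formatted dates out of a message with Python's `re` module. Both patterns are deliberately simple assumptions; real-world addresses and date formats are messier. The sample text is made up.

```python
import re

# Deliberately simple patterns, for illustration only
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ISO_DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_fields(text):
    """Collect all email addresses and ISO dates found in the text."""
    return {
        "addresses": EMAIL_PATTERN.findall(text),
        "dates": ISO_DATE_PATTERN.findall(text),
    }

# Hypothetical email snippet
sample = (
    "From: alice@example.org\n"
    "To: bob@example.com\n"
    "The workshop takes place on 2024-05-17."
)
fields = extract_fields(sample)
```

The result is a small dictionary with two lists, i.e. a first piece of structure recovered from free text.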

Example: Structuring Paper Documents

From a pile of documents to structured information

What is needed to extract information from paper documents?

Optical Character Recognition

Convert the scan of a paper to machine-readable text using OCR.

OCR stands for Optical Character Recognition.

Current OCR builds on trained ML models to identify glyphs.

OCR is usually error-prone and depends on
proper pre-processing of the images:

  • Segmentation
  • Binarization
  • Skew correction
  • Noise removal
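
To illustrate one of these steps, here is a minimal sketch of binarization with a global threshold. It assumes the scan is already available as rows of grayscale values between 0 and 255; the tiny `scan` matrix is made up. Taking the mean as threshold is a simplification, and practical pipelines often use adaptive methods such as Otsu's.

```python
def mean_threshold(image):
    """Use the average gray value as a (very simple) global threshold."""
    pixels = [pixel for row in image for pixel in row]
    return sum(pixels) / len(pixels)

def binarize(image, threshold=128):
    """Map every grayscale pixel to 0 (ink) or 255 (paper)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in image]

# Hypothetical 3x4 grayscale scan: dark values are ink, bright values are paper
scan = [
    [250, 240, 30, 245],
    [235, 20, 25, 230],
    [245, 238, 40, 251],
]
binary = binarize(scan, threshold=mean_threshold(scan))
```

After this step, each pixel is either ink or paper, which is easier for glyph recognition to work with than uneven shades of gray.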

Skew correction

Noise removal

Error rates of OCR can be reduced by specific post-processing.

  1. Based on dictionaries and, e.g. edit distances:
Medical Mÿstory
Medical Mystery
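
The dictionary-based idea can be sketched with a plain Levenshtein edit distance. The three-word dictionary is a made-up toy example, and the function returns every entry at minimal distance instead of picking one.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def closest_words(word, dictionary):
    """Return all dictionary entries with the smallest edit distance to word."""
    distances = {entry: edit_distance(word.lower(), entry) for entry in dictionary}
    best = min(distances.values())
    return sorted(entry for entry, dist in distances.items() if dist == best)

DICTIONARY = ["medical", "mystery", "history"]  # hypothetical toy dictionary
candidates = closest_words("Mÿstory", DICTIONARY)
```

With this toy dictionary, "mystery" and "history" are both two edits away from the OCR output, so the edit distance alone cannot decide between them. A real system needs a tie-breaking rule or additional evidence.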

  2. Based on context and statistical language modeling:
Medical Mÿstory
Medical History

Accuracy of OCR relies on

  • quality of input
  • proper preprocessing steps
  • amount of post-processing

An OCR accuracy of around 99% is achievable
with a good model and near-perfect input.

Not enough for negative search, i.e. for reliably concluding that a term does not occur…

Improving OCR accuracy beyond
99% requires heavy post-processing.

Double-keying with human involvement…

…or a combination with a generative AI model

Commonalities of examples

Common to all previous examples:

Eliminate manual intervention as much as possible.

Human inspection is still required!

Data Validation

Mixed approaches to ensure the human-in-the-loop:

  • pre-defined processing of data
  • automated checks
  • data that doesn’t meet predefined rules
    is marked for manual inspection
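
A minimal sketch of such a mixed approach: automated rule checks accept clean records and route everything else to a human. The field names, the rules, and the sample records are hypothetical.

```python
import re

# Hypothetical rules for records extracted from invoices
RULES = {
    "invoice_id": lambda v: isinstance(v, str) and re.fullmatch(r"INV-\d{4}", v) is not None,
    "amount": lambda v: isinstance(v, (int, float)) and 0 < v < 100_000,
    "date": lambda v: isinstance(v, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

def validate(record):
    """Return the names of all fields that violate a rule."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

def triage(records):
    """Split records into accepted ones and ones flagged for manual inspection."""
    accepted, flagged = [], []
    for record in records:
        violations = validate(record)
        if violations:
            flagged.append((record, violations))
        else:
            accepted.append(record)
    return accepted, flagged

records = [
    {"invoice_id": "INV-0042", "amount": 129.90, "date": "2024-03-01"},
    {"invoice_id": "INV-OO43", "amount": 75.00, "date": "2024-03-02"},  # letter O instead of zero
    {"invoice_id": "INV-0044", "amount": -5, "date": "2024-03-03"},     # implausible amount
]
accepted, flagged = triage(records)
```

Only the flagged records, together with the names of the violated rules, need to be looked at by a person.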


Retrieving structured information from unstructured data:

Unstructured Data
Data Validation
Data Integration


Information Extraction can contain multiple steps:

Error Correction
Key Word Identification
Topic Classification
Sentiment Analysis


Pre- and post-processing depend heavily on the desired outcome.

  • OCR requires careful image preprocessing
  • Error correction using a dictionary?
  • Topic classification can benefit from stopword removal
  • Stopword removal can mess up sentiment analysis
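
The last point can be demonstrated with a toy example. The stopword list, the sentiment lexicon, and the negation rule below are all simplified assumptions. Some real stopword lists do contain "not", which is exactly the problem.

```python
# Hypothetical toy resources
STOPWORDS = {"the", "was", "not", "a", "is"}
POSITIVE = {"good", "helpful"}
NEGATIVE = {"bad", "slow"}
NEGATIONS = {"not", "never", "no"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_sentiment(tokens):
    """Sum +1/-1 per lexicon word; a directly preceding negation flips the sign."""
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

tokens = "the service was not good".split()
with_stopwords = naive_sentiment(tokens)                       # negation is seen
without_stopwords = naive_sentiment(remove_stopwords(tokens))  # negation is lost
```

With the stopwords in place, the sentence is scored as negative. After stopword removal, "not" is gone and the same sentence is scored as positive.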

Complexity increases

The more your data lacks structure
the more complex your pipeline will be.

Divide and Conquer

A good strategy to handle complexity
and allow testing individual steps:

Isolating parts of your pipeline allows for
better testing and less interference.
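
A minimal sketch of this idea: every step is a small function with one job, and the pipeline is just the sequence of steps. The three steps and the sample text are illustrative assumptions, not a prescribed set.

```python
def normalize(text):
    """Lowercase and collapse repeated whitespace."""
    return " ".join(text.lower().split())

def remove_stopwords(text, stopwords=frozenset({"the", "a", "of", "is"})):
    """Drop words from a (hypothetical, tiny) stopword list."""
    return " ".join(word for word in text.split() if word not in stopwords)

def count_words(text):
    """Count how often each remaining word occurs."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def run_pipeline(data, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data

result = run_pipeline(
    "The  history of the  Patient",
    [normalize, remove_stopwords, count_words],
)
```

Because each step is isolated, it can be tested on its own, swapped out, or skipped without touching the others.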

Text from OCR pipeline

Stopword Removal
Topic Classification
Keyword Identification
Psycho-linguistic Classification
Sentiment Analysis
Embedding Model
Semantic Labelling

D&C is also a good strategy for your personal projects…

Python script to gather data
AWK-script for data wrangling
R-script for visualization

Service Architecture

This divided setup is also widely used in service architecture.

Transferring data from a server to a client over a network.

The server offers a REST-API.

The client contacts such a REST-URL.

And the server then responds with data.

Key principles of REST (Representational State Transfer):

  1. At its heart, the server provides
    access to resources
    accessible at URLs.
  2. REST relies on HTTP methods (GET, POST, PUT, DELETE)
  3. Resources are represented in formats like JSON
  4. The server is stateless. A request must contain all relevant information. Each request must be independent.
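
These principles can be tried out locally with the Python standard library. The sketch below starts a tiny server on a free local port and queries it. The `/movies/<id>` URL scheme and the in-memory `MOVIES` data are hypothetical, and a production service would use a proper web framework.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory resources
MOVIES = {"1": {"title": "Example Movie", "year": 2000}}

class MovieHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Each resource lives at its own URL: /movies/<id>
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "movies" and parts[1] in MOVIES:
            status, payload = 200, MOVIES[parts[1]]
        else:
            status, payload = 404, {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")  # JSON representation
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def fetch(host, port, path):
    """Send one self-contained GET request and decode the JSON response."""
    connection = HTTPConnection(host, port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return response.status, json.loads(response.read().decode("utf-8"))
    finally:
        connection.close()

def run_demo():
    server = HTTPServer(("127.0.0.1", 0), MovieHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        return fetch(host, port, "/movies/1"), fetch(host, port, "/movies/99")
    finally:
        server.shutdown()
        server.server_close()

found, missing = run_demo()
```

The handler keeps no memory of earlier requests: everything it needs is in the URL of the current request, which is the stateless behavior described in point 4.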

Accessing REST APIs is pretty simple:

  • R packages httr and jsonlite
  • Python packages requests and json

An example API access in R:

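# Loading packages
library(httr)
library(jsonlite)

# Initializing API Call
# (the URL is left empty here; insert the endpoint of your API)
call <- ""

# Getting details in API
get_movie_details <- GET(url = call)

# Getting status of HTTP Call
# 200 means, everything is ok
status_code(get_movie_details)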

An example API access:

# Converting content to text
get_movie_text <- content(get_movie_details,
                          "text", encoding = "UTF-8")

# Parsing data in JSON
get_movie_json <- fromJSON(get_movie_text,
                           flatten = TRUE)

# Converting into dataframe
get_movie_dataframe <- as.data.frame(get_movie_json)

An example response:

This is a JSON representation (/ˈdʒeɪsən/)

Some possibilities with this notation:

This converts into a dataframe wonderfully!
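
The same idea in Python, as a minimal sketch: nested JSON objects are flattened into one row per record with dotted column names, similar in spirit to what `flatten = TRUE` does in `fromJSON`. The response text is a made-up example, not the output of a real API.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

# Hypothetical API response
response_text = """
[
  {"title": "Movie A", "year": 2001, "rating": {"source": "critics", "value": 7.5}},
  {"title": "Movie B", "year": 2010, "rating": {"source": "users", "value": 8.1}}
]
"""
rows = [flatten(record) for record in json.loads(response_text)]
columns = sorted(rows[0])
```

Each element of `rows` now has the same flat set of keys, so the list maps directly onto the rows and columns of a table.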

JSON is an “open standard”:

[…] there is no single definition, and interpretations vary with usage.

In plain terms: JSON is probably not the safest format to use.

Finishing up

Parts form pipelines

Intelligent Data Processes are a clever combination of specific parts.

You have heard about many of those parts up to now.

Take this seminar series as an invitation to dig deeper.

