From unstructured data to useful information
Give unstructured data a structure!
https://derwirtschaftsinformatiker.de/2012/09/12/it-management/wissenspyramide-wiki/
Automation is key.
Processing unstructured data builds on various tools.
Automatically collect information from a bunch of Excel sheets
What is needed to process Excel sheets?
At least, Excel data has some structure already.
Several ways are possible to extract data from Excel sheets:
Process the xlsx files directly.
Convert xlsx to csv automatically, using batch files, LibreOffice, or tools like in2csv.
Beware of complicated (i.e. nested) table structures!
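As a minimal sketch of the first option, an xlsx sheet can be read and re-exported as csv in R (assuming the readxl package and a hypothetical input file data.xlsx):
# Minimal sketch: read an xlsx sheet and write it out as csv
# (assumes the readxl package; "data.xlsx" is a hypothetical input file)
library(readxl)
sheet <- read_excel("data.xlsx", sheet = 1)       # read the first sheet into a data frame
write.csv(sheet, "data.csv", row.names = FALSE)   # export as plain csv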
From a pile of emails to structured information
What is needed to process emails?
Here, the task becomes trickier.
Technology depends on what you aim to achieve:
Robust Natural Language Processing (NLP) techniques include:
Syntax-based (pre-)processing of text
Processing the semantics of text
Pattern recognition using Regular Expressions
https://towardsdatascience.com/easiest-way-to-remember-regular-expressions-regex-178ba518bebd
You should learn Regular Expressions!
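As a small illustration, a regular expression can pull email addresses out of free text in R (a minimal sketch with made-up example text):
# Minimal sketch: extract email addresses from free text with a regular expression
# (the text below is made-up example data)
text <- "Please contact alice@example.org or bob@example.com for details."
pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
matches <- regmatches(text, gregexpr(pattern, text))
unlist(matches)   # "alice@example.org" "bob@example.com"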
From a pile of documents to structured information
What is needed to extract information from paper documents?
Convert the scan of a paper to machine-readable text using OCR.
OCR stands for Optical Character Recognition.
Current OCR builds on trained ML-models to identify glyphs.
OCR is usually error-prone and depends on proper pre-processing of images:
Segmentation
Binarization
Skew correction
Noise removal
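As a rough sketch in R, pre-processing and recognition could look like this (assuming the magick and tesseract packages; scan.png is a hypothetical input file):
# Rough sketch: image pre-processing followed by OCR
# (assumes the magick and tesseract packages; "scan.png" is a hypothetical input)
library(magick)
library(tesseract)
img <- image_read("scan.png")
img <- image_convert(img, type = "Grayscale")   # simplify colours before binarization
img <- image_deskew(img)                        # skew correction
image_write(img, "scan_clean.png")
text <- ocr("scan_clean.png", engine = tesseract("eng"))   # recognize glyphs with a trained model
cat(text)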
Error rates of OCR can be reduced by specific post-processing.
Accuracy of OCR relies on model and input quality: around 99% is achievable with a good model and near-perfect input.
Not enough for negative search, i.e. reliably showing that a term does not occur…
Improving OCR-accuracy beyond 99% requires heavy post-processing:
Double-keying with human involvement…
…or a combination with a generative AI model
Common to all previous examples:
Eliminate manual intervention as much as possible.
Human inspection is still required!
Mixed approaches ensure the human-in-the-loop.
Retrieving structured information from unstructured data:
Information Extraction can contain multiple steps:
Pre- and post-processing highly dependent on desired outcome.
The more your data lacks structure, the more complex your pipeline will be.
A good strategy to handle complexity and allow testing of individual steps:
Divide and Conquer
Isolating parts of your pipeline allows for better testing and less interference.
D&C is also a good strategy for your personal projects…
This divided setup is also widely used in service architecture.
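As a minimal sketch of such a divided setup in R, each step lives in its own small function so every part can be tested in isolation (the step functions below are illustrative assumptions, not a fixed recipe):
# Minimal sketch: a pipeline split into small, individually testable steps
# (the step functions are illustrative assumptions)
extract_text <- function(path) readLines(path, warn = FALSE)   # step 1: raw extraction
clean_text   <- function(lines) trimws(tolower(lines))         # step 2: normalization
parse_fields <- function(lines) grepl("@", lines)              # step 3: simple pattern check
run_pipeline <- function(path) {
  parse_fields(clean_text(extract_text(path)))
}
# Each step can be tested on its own, e.g.:
stopifnot(identical(clean_text("  Hello "), "hello"))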
Transferring data from a server to a client over a network.
The server offers a REST API.
The client sends a request to a REST URL.
The server then responds with data.
Key principles of REST (Representational State Transfer): stateless requests, resources addressed by URLs, and standard HTTP methods (GET, POST, PUT, DELETE).
Accessing REST APIs is pretty simple:
In R: httr and jsonlite
In Python: requests and json
An example API access in R:
# Loading packages
library(httr)
library(jsonlite)
# Initializing the API call
call <- "http://www.omdbapi.com/?i=tt0208092&apikey=948d3551&plot=short&r=json"
# Sending the GET request to the API
get_movie_details <- GET(url = call)
# Checking the status of the HTTP call
# 200 means everything is OK
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
status_code(get_movie_details)
# Parsing the JSON body of the response into an R object
movie_details <- fromJSON(content(get_movie_details, as = "text", encoding = "UTF-8"))
An example response:
This is a JSON representation (/ˈdʒeɪsən/)
Some possibilities with this notation:
This converts into a dataframe wonderfully!
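For instance, a minimal sketch with made-up JSON data (not taken from the API above):
# Minimal sketch: converting a JSON array of objects into a data frame
# (the JSON string below is made-up example data)
library(jsonlite)
ratings_json <- '[{"Source":"IMDb","Value":"8.0"},{"Source":"Metacritic","Value":"72"}]'
ratings_df <- fromJSON(ratings_json)   # simplifies to a data.frame by default
str(ratings_df)                        # two columns: Source and Value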
JSON is an “open standard”:
[…] there is no single definition, and interpretations vary with usage.
https://en.wikipedia.org/wiki/Open_standard
In plain terms: JSON is probably not the safest format to use.
Finishing up…
Intelligent Data Processes are a clever combination of specific parts.
You have heard about many of those parts up to now.
Take this seminar series as an invitation to dig deeper.