Data Intelligence

Prof. Dr. Mirco Schoenfeld

From unstructured data to useful information

Goal

Give unstructured data a structure!

Types of data

https://mycloudwiki.com/san/data-and-information-basics/

From Digits to Knowledge

(Figure: the knowledge pyramid, from characters via data and information up to knowledge)

https://derwirtschaftsinformatiker.de/2012/09/12/it-management/wissenspyramide-wiki/

Automation

Automation is key.

Intermediate Steps Vary

Processing unstructured data builds on various tools.

Example: Structuring a bunch of Excel sheets

Automatically collect information from a bunch of Excel sheets

Example: Structuring a bunch of Excel sheets

What is needed to process Excel sheets?

At least, Excel data has some structure already.

Process Excel sheets

Several ways are possible to extract data from Excel sheets, for instance:
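One option is the readxl package in R (pandas or openpyxl are Python alternatives). A minimal sketch, assuming a folder of workbooks that all share the same column layout; the paths and column names are made up:

# Minimal sketch: collect rows from many Excel files into one data frame.
# Assumes all files share the same column layout.
library(readxl)

files <- list.files("sheets/", pattern = "\\.xlsx$", full.names = TRUE)

all_data <- do.call(rbind, lapply(files, function(f) {
  df <- read_excel(f, sheet = 1)   # first sheet of each workbook
  df$source_file <- basename(f)    # keep provenance for validation later
  df
}))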

Processing Excel: Pitfalls

Beware of complicated (i.e. nested) table structures!
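One hedged workaround with readxl (file name, cell range, and column names are made up): point the parser at the actual data block and assign clean names yourself.

# Sketch: skip decorative/nested header rows via an explicit cell range
library(readxl)

raw <- read_excel("report.xlsx", range = "A4:D20", col_names = FALSE)
names(raw) <- c("region", "product", "units", "revenue")  # manual, clean names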

Example: Structuring eMails

From a pile of emails to structured information

Example: Structuring eMails

What is needed to process eMails?

Here, the task becomes trickier.

What do you want to achieve?

Technology depends on what you aim to achieve:

  • identification of key words?
  • identification of characteristic words?
  • topics?
  • sentiment analysis?
  • stance detection?

Robust NLP techniques

Robust Natural Language Processing (NLP) techniques include

  • frequency analysis of text corpora
  • TF-IDF measures
  • topic analysis
  • labeling of emotions / sentiments
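As a small illustration of TF-IDF, a hand-rolled sketch on a toy corpus (the documents are made up): a term's weight is its frequency in a document times log(N / document frequency), so ubiquitous terms score low.

# Toy TF-IDF: term frequency weighted by inverse document frequency
docs <- list(
  c("data", "intelligence", "data"),
  c("structured", "data"),
  c("intelligence", "matters")
)
terms <- unique(unlist(docs))

# Term frequencies per document (terms x documents matrix)
tf <- sapply(docs, function(d) table(factor(d, levels = terms)) / length(d))

df_t <- rowSums(tf > 0)           # document frequency of each term
idf  <- log(length(docs) / df_t)  # widespread terms like "data" get a low weight

round(tf * idf, 3)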

Semantical NLP techniques

Processing semantics of text:

  • embedding models
  • word-sense disambiguation
  • entity recognition
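To give an intuition for embedding models: words become vectors, and semantic closeness becomes vector similarity. A toy sketch with made-up three-dimensional vectors (real models use hundreds of dimensions):

# Cosine similarity between (made-up) word vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

emb <- list(
  king  = c(0.8, 0.3, 0.1),
  queen = c(0.7, 0.4, 0.2),
  apple = c(0.1, 0.9, 0.8)
)

cosine(emb$king, emb$queen)  # high: semantically close
cosine(emb$king, emb$apple)  # lower: unrelated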

Pattern Recognition

Pattern Recognition using Regular Expressions.

https://towardsdatascience.com/easiest-way-to-remember-regular-expressions-regex-178ba518bebd

Pattern Recognition

You should learn Regular Expressions!
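For example, extracting email addresses from free text in base R (the pattern is a simplified sketch, not a full RFC-compliant address grammar):

# Pull email addresses out of free text with a regular expression
text <- c("Contact mirco@example.org for details",
          "no address in this line",
          "two hits: a@b.de and c@d.com")

pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
regmatches(text, gregexpr(pattern, text))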

Example: Structuring Paper Documents

From a pile of documents to structured information

Example: Structuring Paper Documents

What is needed to extract information from paper documents?

Optical Character Recognition

Convert the scan of a paper to machine-readable text using OCR.

Optical Character Recognition

OCR stands for Optical Character Recognition.

Modern OCR builds on trained ML models to identify glyphs.
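In R, one common option is the tesseract package; a minimal sketch (the file name is made up, and the English model data must be installed):

# Minimal OCR sketch with the tesseract package
library(tesseract)

eng  <- tesseract("eng")               # load the English model
text <- ocr("scan.png", engine = eng)  # returns the recognized text
cat(text)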

Optical Character Recognition

OCR is usually error-prone and depends on
proper pre-processing of the images (see the sketch after this list):

  • Segmentation
  • Binarization
  • Skew correction
  • Noise removal
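A hedged sketch of such clean-up using the magick package (the file name is made up; real documents need per-case tuning):

# Image clean-up before OCR with the magick package
library(magick)

img <- image_read("scan.png")
img <- image_convert(img, colorspace = "gray")  # drop color information
img <- image_deskew(img)                        # skew correction
img <- image_despeckle(img)                     # simple noise removal
# binarization could be added here, e.g. via image_threshold()
image_write(img, "scan_clean.png")              # feed this file to the OCR step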

Optical Character Recognition

(Example images: segmentation, binarization, skew correction, noise removal)

Optical Character Recognition

Error rates of OCR can be reduced by specific post-processing:

  1. Based on dictionaries and, e.g., edit distances:
     Medical Mÿstory → Medical Mystery
  2. Based on context and statistical language modeling:
     Medical Mÿstory → Medical History
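The dictionary approach can be sketched in base R with adist(), which computes Levenshtein edit distances; the dictionary here is made up. Ties between candidates are exactly where the context-based approach from item 2 has to step in.

# Dictionary-based correction via minimal edit distance
dict <- c("mystery", "history", "memory")
ocr_word <- "mÿstory"

d <- adist(ocr_word, dict)  # edit distance to every dictionary word
dict[which.min(d)]          # nearest entry; ties need context to resolve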

Optical Character Recognition

Accuracy of OCR relies on

  • quality of input
  • proper preprocessing steps
  • amount of post-processing

Optical Character Recognition

An OCR accuracy of around 99% is achievable
with a good model and near-perfect input.

Not enough for a negative search, i.e. to reliably confirm that a term does not appear…

Optical Character Recognition

Improving OCR accuracy beyond
99% requires heavy post-processing.

Double-keying (the text is typed independently by two humans and compared)…

…or a combination with a generative AI model

Commonalities of examples

Common to all previous examples:

Eliminate manual intervention as much as possible.

Commonalities of examples

Human inspection is still required!

Data Validation

Mixed approaches ensure a human-in-the-loop:

  • pre-defined processing of data
  • automated checks
  • data that doesn’t meet the predefined rules
    is flagged for manual inspection
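A hedged sketch of such a rule check in R (column names and rules are made up):

# Flag records that violate predefined rules for manual inspection
validate <- function(df) {
  ok <- !is.na(df$date) &
        !is.na(df$amount) & df$amount >= 0 &
        grepl("@", df$email, fixed = TRUE)
  df$needs_review <- !ok   # a human inspects only the flagged rows
  df
}

# Usage: flagged <- validate(records); subset(flagged, needs_review)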

Pipelines

Retrieving structured information from unstructured data:

Unstructured Data → Pre-processing → Information Extraction → Data Validation → Data Integration
Pipeline

Information Extraction can contain multiple steps:

OCR → Error Correction → Key Word Identification → Topic Classification → Sentiment Analysis

Pipeline

Pre- and post-processing are highly dependent on the desired outcome.

  • OCR requires careful image preprocessing
  • Error correction using a dictionary?
  • Topic classification can benefit from stopword removal
  • Stopword removal can mess up sentiment analysis

Complexity increases

The more your data lacks structure,
the more complex your pipeline will be.

Divide and Conquer

A good strategy to handle complexity
and allow testing individual steps: Divide and Conquer.

Isolating parts of your pipeline allows for
better testing and less interference.

Divide and Conquer

(Diagram: text from the OCR pipeline fans out into separate branches: Stopword Removal, Lemmatization, Filtering, Topic Classification, Keyword Identification, Psycho-linguistic Classification, Sentiment Analysis, Embedding Model, Semantic Labelling)

Divide and Conquer

D&C is also a good strategy for your personal projects…

python script to gather data → AWK script for data wrangling → R script for visualization

Service Architecture

This divided setup is also widely used in service architectures:
transferring data from a server to a client over a network.

The server offers a REST-API. The client contacts such a REST-URL,
and the server responds with data.

https://medium.com/@MakeComputerScienceGreatAgain/understanding-rest-api-a-comprehensive-guide-52fc10f6c9ed

Service Architecture

Key principles of REST (Representational State Transfer):

  1. At its heart, the server provides access to resources accessible at URLs.
  2. REST relies on HTTP methods (GET, POST, PUT, DELETE).
  3. Resources are represented in formats like JSON.
  4. The server is stateless: each request must be independent and contain all relevant information.

Accessing REST APIs

Accessing REST APIs is pretty simple:

  • R packages httr and jsonlite
  • python packages requests and json

Accessing REST APIs

An example API access in R:

# Loading packages
library(httr)
library(jsonlite)

# Initializing API Call
call <- "http://www.omdbapi.com/?i=tt0208092&apikey=948d3551&plot=short&r=json"

# Sending the GET request to the API
get_movie_details <- GET(url = call)

# Checking the status of the HTTP call
# 200 means everything is OK
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
status_code(get_movie_details)

Accessing REST APIs

An example API access in R (continued):

# Converting content to text
get_movie_text <- content(get_movie_details,
                          "text", encoding = "UTF-8")

# Parsing data in JSON
get_movie_json <- fromJSON(get_movie_text,
                           flatten = TRUE)

# Converting into dataframe
get_movie_dataframe <- as.data.frame(get_movie_json)

Accessing REST APIs

An example response:
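The response body is JSON; abridged and reconstructed from memory here for imdbID tt0208092, so treat the exact values as illustrative:

# Illustrative, abridged OMDb-style response body (values not fetched live)
example_json <- '{
  "Title": "Snatch",
  "Year": "2000",
  "Genre": "Comedy, Crime",
  "imdbID": "tt0208092",
  "Response": "True"
}'
jsonlite::fromJSON(example_json)  # parses into a named list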

Accessing REST APIs

This is a JSON representation (/ˈdʒeɪsən/)

Accessing REST APIs

Some possibilities with this notation: nested objects, arrays, and typed values (strings, numbers, booleans, null).

Accessing REST APIs

This converts into a dataframe wonderfully!

Accessing REST APIs

JSON is an “open standard”:

[…] there is no single definition, and interpretations vary with usage.

https://en.wikipedia.org/wiki/Open_standard

In plain terms: JSON is probably not the safest format to rely on.

Finishing up…

Parts form pipelines

Intelligent Data Processes are a clever combination of specific parts.

Parts form pipelines

You have heard about many of these parts so far.

Take this seminar series as an invitation to dig deeper.
