From unstructured data to useful information
Give unstructured data a structure!
https://derwirtschaftsinformatiker.de/2012/09/12/it-management/wissenspyramide-wiki/
Automation is key.
Processing unstructured data builds on various tools.
Automatically collect information from a bunch of Excel sheets
What is needed to process Excel sheets?
At least, Excel data has some structure already.
Several ways are possible to extract data from Excel sheets:
Process the xlsx files directly.
Convert xlsx to csv automatically, using batch files, LibreOffice, or tools like in2csv.
Beware of complicated (i.e. nested) table structures!
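As a minimal sketch of the first option, an xlsx sheet can be read and re-exported as csv in R (assuming the readxl package and a hypothetical input file data.xlsx):
# Minimal sketch: read an xlsx sheet and write it out as csv
# (assumes the readxl package; "data.xlsx" is a hypothetical input file)
library(readxl)
sheet <- read_excel("data.xlsx", sheet = 1)       # read the first sheet into a data frame
write.csv(sheet, "data.csv", row.names = FALSE)   # export as plain csv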
From a pile of emails to structured information
What is needed to process emails?
Here, the task becomes trickier.
Technology depends on what you aim to achieve:
Robust Natural Language Processing (NLP) techniques include:
Syntax-based (pre-)processing of text
Processing the semantics of text
Pattern recognition using Regular Expressions
https://towardsdatascience.com/easiest-way-to-remember-regular-expressions-regex-178ba518bebd
You should learn Regular Expressions!
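As a small illustration, a regular expression can pull email addresses out of free text in R (a minimal sketch with made-up example text):
# Minimal sketch: extract email addresses from free text with a regular expression
# (the text below is made-up example data)
text <- "Please contact alice@example.org or bob@example.com for details."
pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
matches <- regmatches(text, gregexpr(pattern, text))
unlist(matches)   # "alice@example.org" "bob@example.com"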
From a pile of documents to structured information
What is needed to extract information from paper documents?
Convert the scan of a paper to machine-readable text using OCR.
OCR stands for Optical Character Recognition.
Current OCR builds on trained ML-models to identify glyphs.
OCR is usually error-prone and depends on proper pre-processing of images:
Segmentation
Binarization
Skew correction
Noise removal
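As a rough sketch in R, pre-processing and recognition could look like this (assuming the magick and tesseract packages; scan.png is a hypothetical input file):
# Rough sketch: image pre-processing followed by OCR
# (assumes the magick and tesseract packages; "scan.png" is a hypothetical input)
library(magick)
library(tesseract)
img <- image_read("scan.png")
img <- image_convert(img, type = "Grayscale")   # simplify colours before binarization
img <- image_deskew(img)                        # skew correction
image_write(img, "scan_clean.png")
text <- ocr("scan_clean.png", engine = tesseract("eng"))   # recognize glyphs with a trained model
cat(text)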
Error rates of OCR can be reduced by specific post-processing.
Accuracy of OCR relies on model and input quality: around 99% is achievable with a good model and near-perfect input.
Not enough for negative search, i.e. reliably showing that a term does not occur…
Improving OCR-accuracy beyond 99% requires heavy post-processing:
Double-keying with human involvement…
…or a combination with a generative AI model
Common to all previous examples:
Eliminate manual intervention as much as possible.
Human inspection is still required!
Mixed approaches ensure the human-in-the-loop.
Retrieving structured information from unstructured data:
Information Extraction can contain multiple steps:
Pre- and post-processing highly dependent on desired outcome.
The more your data lacks structure, the more complex your pipeline will be.
A good strategy to handle complexity and allow testing of individual steps:
Divide and Conquer
Isolating parts of your pipeline allows for better testing and less interference.
D&C is also a good strategy for your personal projects…
This divided setup is also widely used in service architecture.
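As a minimal sketch of such a divided setup in R, each step lives in its own small function so every part can be tested in isolation (the step functions below are illustrative assumptions, not a fixed recipe):
# Minimal sketch: a pipeline split into small, individually testable steps
# (the step functions are illustrative assumptions)
extract_text <- function(path) readLines(path, warn = FALSE)   # step 1: raw extraction
clean_text   <- function(lines) trimws(tolower(lines))         # step 2: normalization
parse_fields <- function(lines) grepl("@", lines)              # step 3: simple pattern check
run_pipeline <- function(path) {
  parse_fields(clean_text(extract_text(path)))
}
# Each step can be tested on its own, e.g.:
stopifnot(identical(clean_text("  Hello "), "hello"))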
Transferring data from a server to a client over a network.
The server offers a REST API.
The client sends a request to a REST URL.
The server then responds with data.
Key principles of REST (Representational State Transfer): stateless requests, resources addressed by URLs, and standard HTTP methods (GET, POST, PUT, DELETE).
Accessing REST APIs is pretty simple:
In R: httr and jsonlite
In Python: requests and json
An example API access in R:
# Loading packages
library(httr)
library(jsonlite)
# Initializing the API call
call <- "http://www.omdbapi.com/?i=tt0208092&apikey=948d3551&plot=short&r=json"
# Sending the GET request to the API
get_movie_details <- GET(url = call)
# Checking the status of the HTTP call
# 200 means everything is OK
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
status_code(get_movie_details)
# Parsing the JSON body of the response into an R object
movie_details <- fromJSON(content(get_movie_details, as = "text", encoding = "UTF-8"))
An example response:
This is a JSON representation (/ˈdʒeɪsən/)
Some possibilities with this notation:
This converts into a dataframe wonderfully!
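For instance, a minimal sketch with made-up JSON data (not taken from the API above):
# Minimal sketch: converting a JSON array of objects into a data frame
# (the JSON string below is made-up example data)
library(jsonlite)
ratings_json <- '[{"Source":"IMDb","Value":"8.0"},{"Source":"Metacritic","Value":"72"}]'
ratings_df <- fromJSON(ratings_json)   # simplifies to a data.frame by default
str(ratings_df)                        # two columns: Source and Value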
JSON is an “open standard”:
[…] there is no single definition, and interpretations vary with usage.
https://en.wikipedia.org/wiki/Open_standard
In plain terms: JSON is probably not the safest format to use.
Finishing up…
Intelligent Data Processes are a clever combination of specific parts.
You have heard about many of those parts up to now.
Take this seminar series as an invitation to dig deeper.