Data Modeling
From Theory to Application with CRISP-DM

Prof. Dr. Mirco Schoenfeld

Remember Data Modeling

Remember Data Modeling?

(Nguyen 2009)

Becoming better at modeling

How to become better at data modeling?

“We’ve seen large organizations hire 30+ PhDs without clear business alignment upfront. They then emerge from a six week research hole only to realize they had misunderstood the target variable, rendering the analysis irrelevant.”

(Hotz 2024b)

Introducing CRISP

Introducing:

Cross-industry standard process for data mining

CRISP-DM

CRISP-DM

CRISP-DM is

  • an open standard process model
  • describing data mining projects
  • the most widely used analytics model

CRISP-DM

CRISP-DM is not

  • dealing with long-term applications
  • dealing with deployment of ML models
  • new

Be aware

CRISP-DM can easily be combined with management frameworks.

And it is also easily adaptable to AI-related projects. (Saltz 2024)

Major phases

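CRISP-DM identifies six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.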

Major phases

Each of the six phases is further broken down into a list of tasks.

Leaper (2009)
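
As a minimal sketch, the breakdown of phases into tasks can be written down as a plain Python mapping. The task names follow the generic tasks listed in the CRISP-DM guide (Chapman et al. 1999); the names `CRISP_DM_TASKS` and `checklist` are illustrative assumptions and not part of the standard.

```python
# Phases and generic tasks of CRISP-DM, in process order.
# CRISP_DM_TASKS and checklist are illustrative names, not part of the standard.
CRISP_DM_TASKS = {
    "Business Understanding": [
        "Determine business objectives",
        "Assess situation",
        "Determine data mining goals",
        "Produce project plan",
    ],
    "Data Understanding": [
        "Collect initial data",
        "Describe data",
        "Explore data",
        "Verify data quality",
    ],
    "Data Preparation": [
        "Select data",
        "Clean data",
        "Construct data",
        "Integrate data",
        "Format data",
    ],
    "Modeling": [
        "Select modeling technique",
        "Generate test design",
        "Build model",
        "Assess model",
    ],
    "Evaluation": [
        "Evaluate results",
        "Review process",
        "Determine next steps",
    ],
    "Deployment": [
        "Plan deployment",
        "Plan monitoring and maintenance",
        "Produce final report",
        "Review project",
    ],
}


def checklist(phase):
    """Return the tasks of one phase as numbered checklist lines."""
    return [f"{i}. {task}" for i, task in enumerate(CRISP_DM_TASKS[phase], start=1)]


if __name__ == "__main__":
    for phase in CRISP_DM_TASKS:
        print(phase)
        for line in checklist(phase):
            print("  " + line)
```

Such a mapping can serve as a simple project checklist, for example to track which tasks of a phase are still open.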

Some important highlights

We will step through some key points here.

For further reading, please refer to online material
(Chapman et al. 1999; Martínez-Plumed et al. 2021; Hotz 2024c).

Some important highlights

  1. Business Understanding

You should first “thoroughly understand, from a business perspective, what the customer really wants to accomplish.”

Chapman et al. (1999)

Some important highlights

Business Understanding leads to the definition of business success criteria.

Some important highlights

  2. Data Understanding

Focus shifts to the identification, collection, and analysis of data sets.
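
A first look at a data set often starts with a per-column profile. The following is a minimal sketch using only the standard library; the function `profile` and the toy `records` are hypothetical and only illustrate the idea of describing data and checking its quality.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: missing count, distinct count, most common value."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and the empty string as missing (an assumption of this sketch).
        present = [v for v in values if v not in (None, "")]
        most_common = Counter(present).most_common(1)
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": most_common[0][0] if most_common else None,
        }
    return summary


# Hypothetical toy records, invented for illustration.
records = [
    {"customer": "A", "age": 34, "churned": "no"},
    {"customer": "B", "age": None, "churned": "yes"},
    {"customer": "C", "age": 51, "churned": "no"},
    {"customer": "D", "age": 34, "churned": ""},
]

report = profile(records)
for column, stats in report.items():
    print(column, stats)
```

Even this small report surfaces typical data quality questions: which columns have gaps, and which values dominate.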

Some important highlights

  3. Data Preparation

https://xkcd.com/1319/

Some important highlights

Data Preparation is often estimated to take up to 80% of the time spent on a project.

Some important highlights

  4. Data Modeling

What is meant by CRISP-DM’s modeling phase?

This does not mean data modeling as outlined in the presentation about data modeling!
In CRISP-DM, modeling means selecting a modeling technique, building models, and assessing them.

Some important highlights

  5. Evaluation

Do the models meet business success criteria?
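
One way to make this question concrete is to translate business success criteria into minimum values for model metrics and check them explicitly. This is a minimal sketch under stated assumptions: the labels, predictions, and thresholds in `criteria` are hypothetical, and `meets_criteria` is an illustrative helper, not part of CRISP-DM.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def meets_criteria(metrics, criteria):
    """Map each criterion to True if the metric reaches the required minimum."""
    return {name: metrics[name] >= minimum for name, minimum in criteria.items()}


# Hypothetical evaluation data and thresholds, invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
criteria = {"precision": 0.7, "recall": 0.8}

metrics = precision_recall_f1(y_true, y_pred)
verdict = meets_criteria(metrics, criteria)
print(metrics)
print(verdict)
```

In this toy case the model reaches the precision threshold but misses the recall threshold, so it would not yet meet the stated criteria.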

Common Data Science Metrics

https://bee-z.com/which-metrics-will-help-you-improve-your-innovation-led-growth/

Common Data Science Metrics

Common Data Science KPI Groups

Category | Question | Example
Traditional metrics | How are we performing relative to plan? | Time, budget, and scope variance to plan
Agile metrics | How frequently are we providing value? | Velocity metrics, cycle times
Lean metrics | What percent of our time is value-add? | Efficiency
Financial metrics | Are we creating organizational financial value? | Revenue and cost metrics, payback period, ROI, NPV
Organizational goals | Is my project impacting organizational goals? | Varies widely
Artifact creation | Are we creating reusable artifacts? | Number / value of artifacts created
Competencies gained | Are team members gaining valuable skillsets? | Number / value of competencies gained
Stakeholder satisfaction | Are my project stakeholders satisfied? | Net promoter score; “gut feel” assessment
Software metrics | What is the quality of the overall system being developed? | Defect count, defect resolution rate, latency, test coverage
Model performance | How are the models performing? | RMSE, F1, recall, precision, ROC, p-value

(Hotz 2024a)
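
The financial metrics named in the table can be sketched in a few lines. The formulas below are the standard textbook definitions of ROI, payback period, and NPV; the project figures and the 10% discount rate are hypothetical values chosen only for illustration.

```python
def roi(total_benefit, total_cost):
    """Return on investment as a fraction of the cost."""
    return (total_benefit - total_cost) / total_cost


def payback_period(initial_cost, yearly_cash_flows):
    """First year in which cumulative cash flow covers the initial cost, else None."""
    cumulative = 0.0
    for year, cash in enumerate(yearly_cash_flows, start=1):
        cumulative += cash
        if cumulative >= initial_cost:
            return year
    return None


def npv(rate, initial_cost, yearly_cash_flows):
    """Net present value: discounted future cash flows minus the initial cost."""
    discounted = sum(
        cash / (1 + rate) ** year
        for year, cash in enumerate(yearly_cash_flows, start=1)
    )
    return discounted - initial_cost


# Hypothetical project figures, invented for illustration.
initial_cost = 100_000
cash_flows = [40_000, 50_000, 60_000]

project_roi = roi(sum(cash_flows), initial_cost)
project_payback = payback_period(initial_cost, cash_flows)
project_npv = npv(0.10, initial_cost, cash_flows)
print(project_roi, project_payback, round(project_npv, 2))
```

Note that ROI ignores when the money arrives, while NPV discounts later cash flows, so the two can rank projects differently.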

Common Data Science Metrics

Anything missing on this list?

Common Pitfall

A common pitfall in all data-related projects:

Data does not fit the question.

Data Modeling is easy now?

Data and trained models are not good
at predicting things that haven’t happened before!
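
A minimal sketch of this limitation, assuming a deliberately naive frequency-based model: it estimates the probability of an event from how often the event occurred in the training data. The weather events are hypothetical and only serve as an illustration.

```python
from collections import Counter


def fit_frequencies(events):
    """Estimate event probabilities from relative frequencies in the history."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def predict_probability(model, event):
    """Return the estimated probability; unseen events get 0.0."""
    return model.get(event, 0.0)


# Hypothetical history: it has never snowed in the observed period.
history = ["sunny"] * 70 + ["rain"] * 30
model = fit_frequencies(history)

p_rain = predict_probability(model, "rain")
p_snow = predict_probability(model, "snow")
print(p_rain, p_snow)
```

The model treats snow as impossible simply because it never appeared in the data, which is exactly the kind of blind spot described above.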

References

Chapman, Pete, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth. 1999. “The CRISP-DM User Guide.” In 4th CRISP-DM SIG Workshop, Brussels, March 1999.
Hotz, Nick. 2024a. “10 Data Science Project Metrics.” https://www.datascience-pm.com/10-data-science-metrics/.
———. 2024b. “10 Questions to Ask Before Starting a Data Science Project.” https://www.datascience-pm.com/10-questions-to-ask-before-starting-a-data-science-project/.
———. 2024c. “What Is CRISP DM?” https://www.datascience-pm.com/crisp-dm-2/.
Leaper, Nicole. 2009. “Visual Guide to CRISP-DM Methodology.” https://exde.wordpress.com/wp-content/uploads/2009/03/crisp_visualguide.pdf.
Martínez-Plumed, Fernando, Lidia Contreras-Ochando, Cèsar Ferri, José Hernández-Orallo, Meelis Kull, Nicolas Lachiche, María José Ramírez-Quintana, and Peter Flach. 2021. “CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories.” IEEE Transactions on Knowledge and Data Engineering 33 (8): 3048–61. https://doi.org/10.1109/TKDE.2019.2962680.
Nguyen, Marie-Lan. 2009. “File:double Herm Chiaramonti Inv1395.jpg — Wikimedia Commons, the Free Media Repository.” https://commons.wikimedia.org/w/index.php?title=File:Double_herm_Chiaramonti_Inv1395.jpg&oldid=776623637.
Saltz, Jeff. 2024. “Managing Generative AI Projects.” https://www.datascience-pm.com/managing-generative-ai-projects/.