Data Modeling Pitfalls

Prof. Dr. Mirco Schoenfeld

Measuring temperature on earth

https://novalynx.com/store/pc/110-WS-25-Modular-Weather-Stations-18p1073.htm

Secret life of data

https://community.wmo.int/en/observation-components-global-observing-system

(Bates, Lin, and Goodale 2016)
https://doi.org/10.1177/2053951716654502

Measuring temperature on earth

http://www.surfacestations.com/odd_sites.htm

What’s the point here?

Know your data

All data are based on assumptions!

Data have reliability and validity problems.

Data Modeling Pitfalls

Data might have been gathered for different purposes.

Data might not describe the problem at hand.

How to overcome data modeling pitfalls (Le 2023):

  • cross-check that the data confirm the business logic
  • adhere to the business’s naming conventions
  • cross-check model results against the data sources
  • account for all stakeholders
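
One way to cross-check model results against a data source is to compare aggregated totals per group. The following is a minimal sketch, not a method taken from Le (2023). The function names, the column names `region` and `revenue`, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def totals_by_key(rows, key, value):
    """Sum `value` per `key` over an iterable of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def cross_check(source_rows, model_rows, key, value, tolerance=1e-9):
    """Return the keys whose totals differ between source and model output.

    Each mismatch maps to a (source_total, model_total) pair; a key that is
    missing on one side shows up with None on that side.
    """
    source = totals_by_key(source_rows, key, value)
    model = totals_by_key(model_rows, key, value)
    mismatches = {}
    for k in sorted(set(source) | set(model)):
        s, m = source.get(k), model.get(k)
        if s is None or m is None or abs(s - m) > tolerance:
            mismatches[k] = (s, m)
    return mismatches


# Hypothetical raw source rows and hypothetical model/report output.
source_rows = [
    {"region": "north", "revenue": 100.0},
    {"region": "north", "revenue": 50.0},
    {"region": "south", "revenue": 70.0},
]
model_rows = [
    {"region": "north", "revenue": 150.0},
    {"region": "south", "revenue": 75.0},
]

mismatches = cross_check(source_rows, model_rows, "region", "revenue")
print(mismatches)  # {'south': (70.0, 75.0)}
```

An empty result only says that the totals agree. It does not show that the metric itself is defined the way the stakeholders expect.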

Naming conventions

…about naming conventions:

Each business domain uses its own jargon and naming conventions.

This is reflected in column names of databases/data sets.

Changing names makes communication and mutual understanding much harder.

Involving Stakeholders

…about involving stakeholders:

Identify key stakeholders early on in your data process.

They will probably be the future users of your report/dashboard/tool.

Engage them in discussions, especially about metric definitions.

Biggest Pitfall of them all

What’s the biggest pitfall of them all?

Example: Excel and Dates.

Dates are always tricky anyway.

With Excel, this becomes even more difficult.

The reason is Excel’s automatic conversion feature.

Example: SEPT2.

SEPT2 is the symbol of the gene Septin 2, which plays an important role in the function of cells.

Excel reads it as a date: September 2.
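
To spot identifiers at risk before they ever reach a spreadsheet, one could screen them with a simple pattern. The following is a rough sketch: the list of month prefixes and the function name `looks_like_date` are assumptions made for illustration, and the pattern is a heuristic, not a model of Excel's actual parser.

```python
import re

# Month names and abbreviations assumed to trigger a date conversion.
# This is a heuristic for illustration, not Excel's real parsing logic.
MONTH_PREFIXES = (
    "JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN",
    "JUL", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
)

# A month prefix, an optional separator, then one or two digits.
DATE_LIKE = re.compile(
    r"^(?:%s)[-/ ]?\d{1,2}$" % "|".join(MONTH_PREFIXES),
    re.IGNORECASE,
)


def looks_like_date(identifier):
    """Return True if the identifier could be mistaken for a date."""
    return bool(DATE_LIKE.match(identifier.strip()))


symbols = ["SEPT2", "MARCH1", "TP53", "BRCA1", "DEC1"]
at_risk = [s for s in symbols if looks_like_date(s)]
print(at_risk)  # ['SEPT2', 'MARCH1', 'DEC1']
```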

Not an issue, you say?

Roughly 20% of genomics articles in high-ranking journals that came with supplementary Excel gene lists contained gene name errors in those published data sets.

(Zeeberg et al. 2004; Ziemann, Eren, and El-Osta 2016)
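
Errors of this kind can be searched for after the fact. The sketch below scans one column of a CSV export for values such as "2-Sep", which is how a converted symbol typically appears. It is an illustration under assumptions and not the screening procedure used in the cited studies: the pattern, the function name `find_converted_symbols`, the column name `gene`, and the sample data are all hypothetical.

```python
import csv
import io
import re

# Matches values such as "2-Sep" or "1-Mar". The day-month form is an
# assumption about how converted symbols show up in an exported file.
CONVERTED = re.compile(
    r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_converted_symbols(csv_text, column):
    """Return (row_number, value) pairs whose `column` looks like a date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    # Row 1 is the header, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if CONVERTED.match(value):
            hits.append((row_number, value))
    return hits


# Hypothetical export in which two gene symbols were already converted.
sample = "gene,expression\nTP53,1.2\n2-Sep,0.7\nBRCA1,3.4\n1-Mar,0.1\n"
hits = find_converted_symbols(sample, "gene")
print(hits)  # [(3, '2-Sep'), (5, '1-Mar')]
```

A hit shows that a value was damaged, but not which symbol it was originally. The original symbol has to be restored from the upstream source.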

The genomics community decided to rename human genes to avoid conflicts with Excel; SEPT2, for example, is now SEPTIN2. (Shaikly 2020)

Friends don’t let friends

Friends don’t let friends handle data processed through Excel. (Hatt 2018)

References

Bates, Jo, Yu-Wei Lin, and Paula Goodale. 2016. “Data Journeys: Capturing the Socio-Material Constitution of Data Objects and Flows.” Big Data & Society 3 (2): 2053951716654502. https://doi.org/10.1177/2053951716654502.
Hatt, Bertil. 2018. “What Does Bad Data Look Like?” https://medium.com/@bertil_hatt/what-does-bad-data-look-like-91dc2a7bcb7a.
Le, Hanna. 2023. “Data Modeling Pitfalls and Where to Find Them.” https://medium.com/refined-and-refactored/data-modeling-pitfalls-and-where-to-find-them-cf146a112ff4.
Shaikly, Valerie. 2020. “Human Genes Renamed as Microsoft Excel Reads Them as Dates.” https://www.progress.org.uk/human-genes-renamed-as-microsoft-excel-reads-them-as-dates/.
Zeeberg, Barry R, Joseph Riss, David W Kane, Kimberly J Bussey, Edward Uchio, W Marston Linehan, J Carl Barrett, and John N Weinstein. 2004. “Mistaken Identifiers: Gene Name Errors Can Be Introduced Inadvertently When Using Excel in Bioinformatics.” BMC Bioinformatics 5 (1). https://doi.org/10.1186/1471-2105-5-80.
Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1). https://doi.org/10.1186/s13059-016-1044-7.