(Bates, Lin, and Goodale 2016)
https://doi.org/10.1177/2053951716654502
What’s the point here?
All data are based on assumptions!
Data have reliability and validity problems.
Data might have been gathered for different purposes.
Data might not describe the problem at hand.
Things to do to overcome data modeling pitfalls (Le 2023)
…about naming conventions:
Each business domain uses its own jargon or naming convention.
This is reflected in column names of databases/data sets.
Changing names makes communication and mutual understanding extra hard.
…about involving stakeholders:
Identify key stakeholders early on in your data process.
They will probably be the future users of your report/dashboard/tool.
Engage them in discussions, especially about metric definitions.
What’s the biggest pitfall of them all?
Example: Excel and Dates.
Dates are always tricky anyway.
With Excel, this becomes even more difficult.
The reason is Excel’s auto conversion feature.
Example: SEPT2.
Septin 2 is a gene that plays an important role in the function of cells.
For Excel, this is September 2.
Not an issue, you say?
Roughly 20% of genomics articles in high-ranking journals that came with supplementary Excel gene lists contained errors in the gene names of those published data sets.
(Zeeberg et al. 2004; Ziemann, Eren, and El-Osta 2016)
The genomics community decided to rename human genes to avoid conflicts with Excel. (Shaikly 2020)
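A minimal sketch of how one might flag gene symbols at risk before they ever touch a spreadsheet. The month prefixes and the "prefix plus day number" rule are assumptions for illustration, not Excel's documented conversion logic, and `looks_like_date` is a hypothetical helper name.

```python
import re

# Prefixes assumed to trigger date auto-conversion when followed by a day
# number. This list is an illustrative assumption, not Excel's actual rules.
MONTH_PREFIXES = ("JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")

# Month prefix, optional hyphen, then one or two digits (e.g. SEPT2, DEC-1).
_DATE_LIKE = re.compile(
    r"^(?:%s)-?(\d{1,2})$" % "|".join(MONTH_PREFIXES), re.IGNORECASE
)


def looks_like_date(symbol: str) -> bool:
    """Return True if the symbol could plausibly be read as 'month day'."""
    match = _DATE_LIKE.match(symbol.strip())
    return bool(match) and 1 <= int(match.group(1)) <= 31


symbols = ["SEPT2", "MARCH1", "DEC1", "TP53", "BRCA1"]
risky = [s for s in symbols if looks_like_date(s)]
print(risky)
```

Symbols flagged this way are candidates for extra care, for example by keeping the column as text in every tool that reads or writes the file.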
Friends don’t let friends handle data processed through Excel. (Hatt 2018)