Personal Research Data Management
From an organised hard drive to an automated life

Prof. Dr. Mirco Schoenfeld

25 October 2022

Today's Requirements

Plain Text Tools

A typical data report

  • Cites other papers / journal articles / textbooks / data reports
  • Presents experimental results / calculations
  • Illustrates scenarios using drawings
  • References items in the text
  • Contains graphs / shows data

(Munroe 2021)

Two models of work

  1. Office Model

  2. Engineering Model

(Healy 2019)

The difference between those models

  1. Office Model

  2. Engineering Model

The Office Model

A Word file is probably the center of your project

  • Changes are tracked inside that file
  • Citation and reference managers plug into Word
  • Output of data analysis is pasted into the file
  • Often, the Word file is the output

The Engineering Model

The project is organized around a repository

  • Changes are tracked outside of files
  • Data analysis is kept in code that produces output in a known and reproducible manner
  • Citations and references are kept in separate files
  • Final output is assembled from a bunch of input files

Software Requirements

Windows users: INSTALL R FIRST!

R-Markdown Requirements

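# Install the packages needed to write and render R Markdown documents
install.packages("rmarkdown")
install.packages("knitr")
library(knitr)  # load knitr into the current session
install.packages("tinytex")
# Install the TinyTeX LaTeX distribution, which is needed for PDF output
tinytex::install_tinytex()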

Version Control

What is Version Control?

  • Class of systems responsible for managing changes to documents or other collections of information
  • Changes are usually identified by revision levels or “revisions”
  • Each revision is associated with a timestamp and the person making the change
  • Revisions can be compared, restored, and, depending on the file type, merged.

(Munroe 2013)

Version Control Example

VC: List Revisions

VC: Compare Revisions
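
Listing and comparing revisions can be sketched with git, the tool introduced on the following slides. This is only a minimal illustration, not the example originally shown here: the file name notes.md, its contents, and the author identity are made up, and everything happens in a throwaway temporary directory.

```shell
#!/bin/sh
set -e

# Hypothetical throwaway repository; nothing here touches your real files.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"        # illustrative identity
git config user.email "author@example.org"

# Revision 1
printf 'Research question: TBD\n' > notes.md
git add notes.md
git commit -q -m "Add first draft of notes"

# Revision 2
printf 'Research question: How do teams organise files?\n' > notes.md
git commit -q -am "State the research question"

# List revisions: short ID, date, author, and message of each commit.
git --no-pager log --format='%h %ad %an: %s' --date=short

# Compare revisions: what changed between the previous and the latest commit?
git --no-pager diff HEAD~1 HEAD -- notes.md

# Restore: bring back the file as it was one revision ago.
git checkout HEAD~1 -- notes.md
```

The log shows one line per revision, the diff marks removed lines with "-" and added lines with "+", and the final command puts the older version of notes.md back into the working directory.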

git

(Healy 2019)

git details

Commit
A ‘commit’ is a modification that is applied to the repository.

Tag
A tag assigns a label to a revision (which may span many files) so that you can jump directly to that revision. Often used to mark a specific version of a piece of software.

Branching
Duplication of an object under version control. The copies can then be modified separately and in parallel so that they diverge. Bringing them back together afterwards may require a merge operation.

(Commons 2009)
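
The three terms above map onto a handful of git commands. The following is a minimal sketch under assumptions: the file paper.md, the tag name submitted-v1, the branch name reviewer-comments, and the author identity are all hypothetical, and the repository lives in a temporary directory.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)          # hypothetical throwaway repository
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.org"

# Commit: record a modification in the repository.
printf 'Introduction\n' > paper.md
git add paper.md
git commit -q -m "Start the paper"
# Remember the name of the main line ("main" or "master", depending on the git setup).
main_branch=$(git rev-parse --abbrev-ref HEAD)

# Tag: give this revision a memorable label.
git tag submitted-v1

# Branch: work on a copy in parallel.
git checkout -q -b reviewer-comments
printf 'Methods\n' >> paper.md
git commit -q -am "Add methods section"

# Merge: bring the branch back into the main line.
git checkout -q "$main_branch"
git merge -q reviewer-comments

# Look at the file exactly as it was at the tagged revision.
git --no-pager show submitted-v1:paper.md
```

After the merge, paper.md on the main line contains both sections, while the tag still points at the revision that only had the introduction.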

git started

https://git-scm.com/

https://git-scm.com/book/en/v2

(Chacon and Straub 2014)

Task

  1. Access the git repository https://github.com/TwlyY29/dmkg_intro_to_git
  2. Familiarize yourself with the tags: What changed between them?
  3. If you’re fast: What happened in the branches?

Personal Data Management

Get organised

Thousands of emails. Hundreds of files. […] Duplicated content, lost content. We thought search would save us from this nightmare, but we were wrong.

(Noble 2019)

Dewey Decimal Classification

  • Library classification scheme (yet another)
  • Scheme is based on a hierarchy of numbers
  • Books are shelved next to other books on similar subjects

Example:

   500 Natural sciences and mathematics
      510 Mathematics
         516 Geometry
            516.3 Analytic geometries
               516.37 Metric differential geometries
                  516.375 Finsler geometry

Johnny Decimals

Create your own classification scheme to organise your file system.

(Noble 2019)

Johnny Decimals: How to (1)

  1. Divide everything into ten areas.
  2. Divide each area into ten categories.

Johnny Decimals: How to (2)

The ten categories of the first area are numbered 10-19.

   11 Tax Returns
   12 Payroll
   13 ...

This could be the Finance area: everything in it belongs together.

Johnny Decimals: How to (3)

A Johnny.Decimal number looks like this:

    42.18
    12.03
    63.17
    ...
    AC.ID

The Area Code (AC) specifies the category; its first digit tells you which area the category belongs to.

The Identifier (ID) references a specific element in the category.
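
How a number splits into its parts can be sketched in a few lines of shell. This is an illustration only: the function names jd_is_valid, jd_category, and jd_area are made up, and the sketch assumes that both AC and ID always have exactly two digits, as in the examples above.

```shell
#!/bin/sh

# Succeeds if "$1" has the AC.ID shape: two digits, a dot, two digits.
# Assumption: categories run from 10 to 99, so AC never starts with 0.
jd_is_valid() {
    printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]\.[0-9]{2}$'
}

# Prints the category part (AC) of a number.
jd_category() {
    printf '%s\n' "${1%%.*}"
}

# Prints the area a number belongs to, derived from the first digit of AC.
jd_area() {
    category=$(jd_category "$1")
    tens=${category%?}            # first digit of the category
    printf '%s0-%s9\n' "$tens" "$tens"
}

for number in 42.18 12.03 63.17; do
    if jd_is_valid "$number"; then
        printf '%s: area %s, category %s, ID %s\n' \
            "$number" "$(jd_area "$number")" "$(jd_category "$number")" "${number#*.}"
    fi
done
```

For 12.03 the loop reports area 10-19, category 12, and ID 03, which matches the Finance example: category 12 (Payroll) sits in the area that holds the numbers 10-19.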

Johnny Decimals: Saving files the right way

Johnny Decimals: Don’t do this

You may not store things anywhere other than in a folder with a full Johnny.Decimal number.

Johnny Decimals: Don’t do this

Documents must relate to something!

Johnny Decimals: Don’t do this

You must not create a folder inside a Johnny.Decimal folder.
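
Two of the rules above (no files outside a numbered folder, no folders inside one) can be checked mechanically. The sketch below assumes a three-level layout of root/area/category/AC.ID folder, which the slides do not spell out; the function name jd_check and all folder and file names except "11 Tax Returns" are hypothetical.

```shell
#!/bin/sh

# Lists everything below the root "$1" that breaks one of the two folder rules.
# Assumed layout: <root>/<area>/<category>/<AC.ID folder>/<files>
jd_check() {
    # Rule 1: files may only live inside an AC.ID folder (three levels below the root).
    find "$1" -mindepth 1 -maxdepth 3 -type f | sed 's/^/misplaced file: /'
    # Rule 2: no folders inside an AC.ID folder.
    find "$1" -mindepth 4 -type d | sed 's/^/nested folder: /'
}

# Build a small example tree in a temporary directory.
root=$(mktemp -d)
id_folder="$root/10-19 Finance/11 Tax Returns/11.01 Returns 2021"
mkdir -p "$id_folder"
touch "$id_folder/return.pdf"                                  # fine
touch "$root/10-19 Finance/11 Tax Returns/loose-receipt.pdf"   # breaks rule 1
mkdir "$id_folder/drafts"                                      # breaks rule 2

jd_check "$root"
```

The check reports loose-receipt.pdf as a misplaced file and drafts as a nested folder, and stays silent about return.pdf.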

Why should you care?

  • Organize your folder structure
  • Files have one place where they belong
  • It helps you find things:
    • Put a Johnny Decimal in an email subject
    • Display a Johnny Decimal on a printed copy
    • Put it on a sticky note
  • Use it as a shortcut to open files and folders

Task

  1. Choose your latest project (e.g. research project)
  2. COPY IT TO A NEW LOCATION
  3. Develop a Johnny.Decimal classification scheme for the contents
  4. Reorganize your project folder accordingly
  5. If you’re really fast: develop a system covering all your projects

An Automated Life?

Automate your life

(Healy 2019)

Automate your life using a Makefile

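## Recipe lines (the indented commands) must start with a tab, not spaces.

## Read as "mypaper.pdf depends on mypaper.md and fig1.pdf"
mypaper.pdf: mypaper.md fig1.pdf
	pandoc mypaper.md -o mypaper.pdf

## Read as "fig1.pdf depends on fig1.r"; fig1.r itself must write fig1.pdf
fig1.pdf: fig1.r
	R CMD BATCH fig1.r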

(Healy 2019)
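
The "depends on" reading of the Makefile rests on file timestamps: a target is rebuilt when it is missing or when one of its dependencies is newer. The following sketch imitates that decision in plain shell. It is a simplification of what make does, the function name needs_rebuild is made up, and the timestamps are set artificially for the demonstration.

```shell
#!/bin/sh

# Succeeds (exit status 0) if the target "$1" has to be rebuilt:
# it is missing, or at least one of the remaining arguments is newer than it.
needs_rebuild() {
    target=$1
    shift
    [ -e "$target" ] || return 0
    for dependency in "$@"; do
        if [ "$dependency" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

workdir=$(mktemp -d)
touch -t 202201010900 "$workdir/mypaper.md"    # source written at 09:00
touch -t 202201011000 "$workdir/mypaper.pdf"   # PDF built at 10:00
needs_rebuild "$workdir/mypaper.pdf" "$workdir/mypaper.md" \
    || echo "mypaper.pdf is up to date"

touch -t 202201011100 "$workdir/mypaper.md"    # source edited at 11:00
needs_rebuild "$workdir/mypaper.pdf" "$workdir/mypaper.md" \
    && echo "mypaper.pdf must be rebuilt"
```

This is why editing fig1.r and then running make is enough to regenerate both the figure and the paper: each outdated target is rebuilt in turn, and nothing else is touched.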

Should you really?

  1. These tools won’t make you more productive
  2. It is about the principle of workflow management

Basic principles of organizing your work

  1. Leave a coherent record of your actions
    Sustain reproducibility! Your future self will thank you at some point!
  2. Files and folders need to tell you what they are
    Increase findability and sustain reproducibility.
  3. Automate repetitive and error-prone processes where possible
    Checking for mistakes will be easier because there is only one place to correct them.

Where to learn more?

Data Modeling & Knowledge Generation

All materials online

Data Literacy: Supplementary program

Supplementary course of studies.

References

Chacon, Scott, and Ben Straub. 2014. Pro Git (Second Edition). Apress. https://git-scm.com/book/en/v2.
Commons, Wikimedia. 2009. “Revision Controlled Project Visualization.” https://commons.wikimedia.org/wiki/File:Revision_controlled_project_visualization.svg.
Healy, Kieran. 2019. The Plain Person’s Guide to Plain Text Social Science. The Internet. https://plain-text.co/.
Munroe, Randall. 2013. “Git Commit.” https://xkcd.com/1296/.
———. 2021. “Types of Scientific Papers.” https://xkcd.com/2456/.
Noble, John. 2019. Johnny Decimal. The Internet. https://johnnydecimal.com/.