R Markdown

Prof. Dr. Mirco Schoenfeld

data reports

Data reports. What’s that again?


(Munroe 2021)

  • Cites other papers / journal articles / textbooks / data reports
  • Presents experimental results / calculations
  • Illustrates scenarios using drawings
  • References items in the text
  • Contains graphs / shows data

Two models of work

  1. Office Model

  2. Engineering Model

The difference between those models

  1. Office Model

  2. Engineering Model

The Office Model

Center of your project: Word file

  • Changes are tracked inside that file
  • Citation & Reference managers plug into Word
  • Output of data analysis is pasted into the file
  • Often, the Word file itself is the final output

The Engineering Model

Center of your project: repository

  • Changes are tracked outside of files
  • Data analysis is kept in code
  • Code produces output reproducibly
  • Citation & references in separate files
  • Final output is assembled

draw the owl

R Markdown

Getting Started.

https://rmarkdown.rstudio.com/articles_intro.html (Grolemund 2014)

https://daringfireball.net/projects/markdown/basics (Gruber 2004)

Basic Document

Create a basic document with the following content

---
output:
  pdf_document: default
  html_document: default
---

# Say Hello to markdown

Markdown is an **easy to use** format for writing reports.
It resembles what you naturally write every time you compose an email. In fact, you may have already used markdown *without realizing it*. These websites all rely on markdown formatting

## Subheading

Fantastic!

* [GitHub](https://www.github.com)
* [StackOverflow](https://www.stackoverflow.com)
* [Reddit](https://www.reddit.com)
* and surely others do as well!

# Chapter Two Starts Here

Let me cite some clever person:

> Dorothy followed her through many of the beautiful rooms in her castle.

Document Metadata

Now, adjust the header of the file:

---
title: "Hello World in Markdown"
author: "Mirco"
date: "24. April 2024"
---

# Say Hello to markdown

Markdown is an **easy to use** format for writing reports.
[...]

Combined data analyses

Now, for the best part, add this to the end of your document:

## What is cool about R Markdown

It is easy to show what you did by including code and its output. The maximum speed of a car in our dataset is `r max(cars$speed)` mph. We have `r nrow(cars)` cars in our dataset.

```{r}
summary(cars)
```

You can also embed plots, for example:

```{r, echo=FALSE}
plot(cars)
```

With references

To use references, adjust the header and end of your document:

---
title: "Hello World in Markdown"
author: "Mirco"
date: "27. October 2022"
output:
  pdf_document: default
  html_document:
    df_print: paged
csl: apa.csl
bibliography: references.bib
---

[...]

# References

…create a references.bib file…

@book{2020-healy-plain_text,
  author    = {Healy, Kieran},
  title     = {{The Plain Person’s Guide to Plain Text Social Science}},
  url       = {https://plain-text.co/},
  publisher = {The Internet},
  year      = {2019}
}

…and download the apa.csl file (or look for other citation styles).

It’s your turn!

Task

Work with a data report!

Yo, programming!

  1. Download the dataset to the folder of your R Markdown file
    Don’t forget to set the working directory
  2. Within the R Markdown file, load the dataset using read.csv
    The file is tab-separated
  3. Visualize the data using the plot-function
    temperature on x-axis, pressure on y-axis

Come again, please?

  1. Download the data report to your hard drive
  2. Open the data_report.Rmd in RStudio
  3. Compile the report to a PDF file
  4. Manipulate the pressure.tsv using Excel or a proper text editor
  5. Re-compile the report to a PDF file

 

Time: 30 minutes

Task Summary

What have you seen?

Do you think you could make use of this in your daily work?

Further Reading

Advanced Usage:

https://plain-text.co/pull-it-together.html#pull-it-together (Healy 2019)

Why the engineering model

Why else should you care about the engineering model?

This model can make your life so much easier!

…automation, remember?

Great Scripting Languages

For the purpose of automation, consider scripting languages

  • AWK
  • Perl
  • Python
  • sed

AWK Example

Why AWK is so great… Consider the amazon review dataset:

http://jmcauley.ucsd.edu/data/amazon/index_2014.html

or an updated version

https://nijianmo.github.io/amazon/index.html

The total number of reviews is 233.1 million (142.8 million in 2014).

An excerpt from a book review

marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date
US 22480053 R28HBXXO1UEVJT 0843952016 34858117 The Rising Books 5 0 0 N N Great Twist on Zombie Mythos I've known about this one for a long time, but just finally got around to reading it for the first time. I enjoyed it a lot!  What I liked the most was how it took a tired premise and breathed new life into it by creating an entirely new twist on the zombie mythos. A definite must read! 2012-05-03

Data Columns of Amazon Review dataset (excerpt):

01  marketplace       - 2 letter country code [...]
02  customer_id       - Random customer identifier [...]
03  review_id         - The unique ID of the review.
04  product_id        - The unique Product ID the review pertains to.
05  product_parent    - Random product identifier [...]
06  product_title     - Title of the product.
07  product_category  - Broad product category [...]
08  star_rating       - The 1-5 star rating of the review.
09  helpful_votes     - Number of helpful votes.
10  total_votes       - Number of total votes the review received.
11  vine              - Review was written as part of the Vine program.
12  verified_purchase - The review is on a verified purchase.
13  review_headline   - The title of the review.
14  review_body       - The review text.
15  review_date       - The date the review was written.
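
The numbers in this listing are the field numbers awk uses, because awk counts fields from 1. A minimal sketch in Python (used here only for illustration; the helper name `awk_field` is made up for this example) that looks up the awk field reference for a column name:

```python
# Column names copied from the listing above, in file order.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id",
    "product_parent", "product_title", "product_category", "star_rating",
    "helpful_votes", "total_votes", "vine", "verified_purchase",
    "review_headline", "review_body", "review_date",
]


def awk_field(column_name):
    """Return the awk-style field reference (1-based) for a column name.

    awk_field is a hypothetical helper for this example, not part of any library.
    """
    # Python lists are 0-based, awk fields are 1-based, hence the + 1.
    return f"${COLUMNS.index(column_name) + 1}"


print(awk_field("star_rating"))  # $8
print(awk_field("review_date"))  # $15
```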

Obtain all star ratings:

awk -F '\t' 'NR > 1 { print $8 }' reviews.tsv

Obtain the average star rating over the entire dataset:

awk -F '\t' '
   NR > 1 { total += $8 }   # skip the header row
   END { print "Average book review is", total/(NR-1), "stars" }
' reviews.tsv
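
For comparison, here is a rough sketch of the same average in Python, another language from the list of scripting languages. It assumes that reviews.tsv is tab-separated, starts with a header row, and has a `star_rating` column as in the excerpt; the function name `average_star_rating` is made up for this example.

```python
import csv


def average_star_rating(path):
    """Average the star_rating column of a tab-separated file with a header row."""
    total = 0
    count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quotation marks inside review texts as plain characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            total += int(row["star_rating"])
            count += 1
    # Avoid a division by zero if the file holds only the header row.
    return total / count if count else float("nan")
```

Calling `average_star_rating("reviews.tsv")` would return the average as a floating-point number. The awk version is shorter, and brevity for this kind of one-off question is what makes awk attractive.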

Is AWK fast enough?

Is AWK even fast enough? An example with ~1.7GB of data.

Adam Drake reported a 47x speedup compared to a Hadoop cluster using standard awk.

(Drake, n.d.)

Using an optimized version of awk gained a 235x speedup compared to Hadoop on ~1.7GB of data.

(12 seconds vs. 26 minutes)

On a Laptop.

(Drake, n.d.)

Why you should care

Try it yourself

Scripting languages can help you with your daily routines or repetitive tasks.

rename.py:

# Rename every file in a folder to "Hostel <n>.jpg"
import os

folder = "xyz"
for count, filename in enumerate(os.listdir(folder)):
    # Build full paths so the script also works when rename.py is outside the folder
    src = os.path.join(folder, filename)
    dst = os.path.join(folder, f"Hostel {count}.jpg")

    os.rename(src, dst)

> python3 rename.py

(Joshi 2022)
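
Renaming files in bulk cannot be undone easily, so it can help to look at the planned changes before applying them. The following is a cautious variation on the script above, not part of the cited source. It assumes you want to keep each file's original extension and skip subfolders; the function names `plan_renames` and `apply_renames` are made up for this example.

```python
import os


def plan_renames(folder, prefix="Hostel"):
    """Return (source, destination) pairs without touching any file."""
    # Only regular files, sorted so the numbering is the same on every run.
    names = sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
    plan = []
    for count, filename in enumerate(names):
        extension = os.path.splitext(filename)[1]  # keep ".jpg", ".png", ...
        src = os.path.join(folder, filename)
        dst = os.path.join(folder, f"{prefix} {count}{extension}")
        plan.append((src, dst))
    return plan


def apply_renames(plan):
    """Carry out the planned renames, refusing to overwrite existing files."""
    for src, dst in plan:
        if src == dst:
            continue  # the file already has the target name
        if os.path.exists(dst):
            raise FileExistsError(f"{dst} already exists")
        os.rename(src, dst)
```

A typical use would be to print the result of `plan_renames("xyz")`, check the pairs, and only then pass the plan to `apply_renames`.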

Conclusion

You should learn at least one scripting language.

https://programming-motherfucker.com/

References

Drake, Adam. n.d. “Command-Line Tools Can Be 235x Faster Than Your Hadoop Cluster.” https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html.
Grolemund, Garrett. 2014. Introduction to R Markdown. R Studio. https://rmarkdown.rstudio.com/articles_intro.html.
Gruber, John. 2004. Markdown. The Internet. https://daringfireball.net/projects/markdown/basics.
Healy, Kieran. 2019. The Plain Person’s Guide to Plain Text Social Science. The Internet. https://plain-text.co/.
Joshi, Vineet. 2022. “Rename Multiple Files Using Python.” https://www.geeksforgeeks.org/rename-multiple-files-using-python/.
Munroe, Randall. 2021. “Types of Scientific Papers.” https://xkcd.com/2456/.