Data reports. What’s that again?
(Munroe 2021)
Two models of work.
Center of your project: Word file
Center of your project: repository
Getting Started.
https://rmarkdown.rstudio.com/articles_intro.html (Grolemund 2014)
https://daringfireball.net/projects/markdown/basics (Gruber 2004)
Create a basic document with the following content:
---
output:
  pdf_document: default
  html_document: default
---
# Say Hello to markdown
Markdown is an **easy to use** format for writing reports.
It resembles what you naturally write every time you compose an email. In fact, you may have already used markdown *without realizing it*. These websites all rely on markdown formatting:
* [Github](www.github.com)
* [StackOverflow](www.stackoverflow.com)
* [Reddit](www.reddit.com)
* and surely others do as well!
## Subheading
Fantastic!
# Chapter Two Starts Here
Let me cite some clever person:
> Dorothy followed her through many of the beautiful rooms in her castle.
Now, adjust the header of the file:
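A plausible adjustment (the title, author, and date values below are placeholders) is to add document metadata to the YAML header:
---
title: "My First Data Report"
author: "Your Name"
date: "2024-01-01"
output:
  pdf_document: default
  html_document: default
---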
Now, for the best part, add this to the end of your document:
## What is cool about R Markdown
It is easy to show what you did by including code and the output of that code. The maximum speed of a car in our dataset is `r max(cars$speed)` mph. We have `r nrow(cars)` cars in our dataset.
```{r}
summary(cars)
```
You can also embed plots, for example:
```{r, echo=FALSE}
plot(cars)
```
To use references, adjust the header and the end of your document, as sketched below:
…create a references.bib file…
@book{2020-healy-plain_text,
  author    = {Healy, Kieran},
  title     = {{The Plain Person’s Guide to Plain Text Social Science}},
  url       = {https://plain-text.co/},
  publisher = {The Internet},
  year      = {2019}
}
…and download the apa.csl here (or look for more).
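A sketch of how these pieces fit together, assuming references.bib and apa.csl sit next to the .Rmd file: point the header at both files, cite with the BibTeX key in the text, and close the document with a heading for the bibliography.
---
output:
  pdf_document: default
  html_document: default
bibliography: references.bib
csl: apa.csl
---
Healy makes the case for plain-text workflows [@2020-healy-plain_text].
# References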
Work with a data report!
Yo, programming!
* read.csv, tab-separated
* plot function: temperature on the x-axis, pressure on the y-axis
Come again, please?
* open data_report.Rmd in RStudio
* open pressure.tsv using Excel or a proper text editor
Time: 30 minutes
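A minimal sketch of the code chunk that data_report.Rmd could contain, assuming pressure.tsv has a header row with the columns temperature and pressure:
```{r}
# read the tab-separated file
pressure_data <- read.csv("pressure.tsv", sep = "\t", header = TRUE)

# temperature on the x-axis, pressure on the y-axis
plot(pressure_data$temperature, pressure_data$pressure,
     xlab = "temperature", ylab = "pressure")
```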
What have you seen?
Do you think you could make use of this in your daily work?
Advanced Usage:
https://plain-text.co/pull-it-together.html#pull-it-together (Healy 2019)
Why else should you care about the engineering model?
This model can make your life so much easier!
…automation, remember?
For the purpose of automation, consider scripting languages
Why AWK is so great… Consider the Amazon review dataset:
http://jmcauley.ucsd.edu/data/amazon/index_2014.html
or an updated version
https://nijianmo.github.io/amazon/index.html
The total number of reviews is 233.1 million (142.8 million in 2014).
An excerpt from a book review
marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date
US 22480053 R28HBXXO1UEVJT 0843952016 34858117 The Rising Books 5 0 0 N N Great Twist on Zombie Mythos I've known about this one for a long time, but just finally got around to reading it for the first time. I enjoyed it a lot! What I liked the most was how it took a tired premise and breathed new life into it by creating an entirely new twist on the zombie mythos. A definite must read! 2012-05-03
Data Columns of Amazon Review dataset (excerpt):
01 marketplace - 2 letter country code [...]
02 customer_id - Random customer identifier [...]
03 review_id - The unique ID of the review.
04 product_id - The unique Product ID the review pertains to.
05 product_parent - Random product identifier [...]
06 product_title - Title of the product.
07 product_category - Broad product category [...]
08 star_rating - The 1-5 star rating of the review.
09 helpful_votes - Number of helpful votes.
10 total_votes - Number of total votes the review received.
11 vine - Review was written as part of the Vine program.
12 verified_purchase - The review is on a verified purchase.
13 review_headline - The title of the review.
14 review_body - The review text.
15 review_date - The date the review was written.
Obtain all star ratings:
awk -F '\t' '{ print $8 }' reviews.tsv
Obtain the average star rating over the entire dataset:
awk -F '\t' '
{ total = total + $8 }
END { print "Average book review is", total/NR, "stars" }
' reviews.tsv
Is AWK even fast enough? An example with ~1.7GB of data.
Adam Drake reported a 47x speedup compared to a Hadoop cluster using standard awk.
(Drake, n.d.)
Using an optimized version of awk yielded a 235x speedup over Hadoop on ~1.7GB of data
(12 seconds vs. 26 minutes).
On a laptop.
(Drake, n.d.)
Scripting languages can help you with your daily routines or repetitive tasks.
rename.py:
# rename every file in the folder "xyz" to "Hostel 0.jpg", "Hostel 1.jpg", ...
import os

folder = "xyz"
for count, filename in enumerate(os.listdir(folder)):
    src = f"{folder}/{filename}"          # foldername/filename, if the .py file is outside the folder
    dst = f"{folder}/Hostel {count}.jpg"  # new name with a running counter
    os.rename(src, dst)
> python3 rename.py
(Joshi 2022)
You should learn at least one scripting language.