This is the main course website for the seminar Introduction to Computer-based Text Analysis, held in the summer term 2023 at the University of Bayreuth.
To view the results of past participants' projects, go here: https://mircoschoenfeld.de/seminar-introduction-to-computer-based-text-analysis.html
News¶
The schedule lists all tutorials available for this course. The dates below indicate which videos will be discussed in the lecture on that day.
- 07. July 2023: July 7 will be the day of your preliminary poster presentations!
- 30. June 2023: The last session in June is devoted to Information Visualization and Poster Preparation. Make sure to check out good examples of posters from previous semesters.
- 23. June 2023: On 23rd June, your blog post is due. In the session, we will peer review your ideas.
- 16. June 2023: Next topic: Word Embedding
- 09. June 2023: On 9th June, we will discuss a very important issue: preprocessing or, to be more precise, stemming and lemmatization.
- 02. June 2023: On 2nd June, we will discuss clustering of documents and topic modeling.
- 26. May 2023: Check out the tutorials on how to investigate the framing of concepts: manual classification and feature co-occurrences.
- 19. May 2023: On 19th May, we will discuss both videos of this section covering the context of words
- 12. May 2023: Tokenization, Document Feature Matrices, and first metrics! These videos will be discussed in the lecture of 12th May!
- 05. May 2023: Building a corpus and assigning document variables.
- 28. April 2023: Today, we will have an online meeting to help you set up R, RStudio, and the rest of your development environment. During the week, get familiar with working with a corpus!
- 21. April 2023: First tutorials are online! Jump to the schedule!
Recordings of the lecture are available online. Please see the schedule for a selection of relevant videos.
Syllabus¶
A central challenge of our time is the processing of a constantly growing volume of text. Every day, collections are created that a single person can hardly work through in a reasonable amount of time: be it newspaper articles, statements, minutes, communiqués, blog articles or posts in social media. To help us understand large amounts of text, we turn to computational methods. In this course, we will explore such methods: methods for the quantitative analysis of text collections, methods for extracting information, and statistical methods for analyzing large corpora. We will apply these methods hands-on in R and evaluate the results together. A critical look at the results of automated analyses is also an important part of the seminar.
Based on their new methodological and theoretical insights, participants develop their own research questions in the seminar and work on them throughout the semester in groups of two.
In this course, students learn the main theoretical and methodological principles of computer-assisted text analysis and how to apply these methods to their own research projects. After successfully completing this seminar, students will be able to translate a scientific research question into methods of computer-based text analysis, as demonstrated in a project of their own.
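To give a rough impression of the workflow described above, here is a minimal sketch in R using the quanteda package listed in the references. The two example sentences, the `year` document variable, and all object names are invented for illustration; the tutorials may structure these steps differently.

```r
# Minimal sketch of a text-analysis pipeline with quanteda.
# The texts and the "year" variable are made up for illustration.
library(quanteda)

texts <- c(
  doc1 = "The council debated the resolution.",
  doc2 = "Members voted on the resolution after the debate."
)

# Build a corpus and attach a document-level variable
corp <- corpus(texts)
docvars(corp, "year") <- c(2019, 2020)

# Tokenize, lowercase, and drop English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Count features per document in a document-feature matrix
dfmat <- dfm(toks)

# Most frequent features across the whole corpus
print(topfeatures(dfmat, 5))
```

Each step produces an object that can be inspected on its own, which makes it easier to check critically what the preprocessing has done to the texts before interpreting any counts.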
R-Basics¶
In case you want to (re-)build basic R skills, please feel free to check out my other tutorials on R. Students of the University of Bayreuth can also enroll in an e-learning course that offers exercises with automated evaluation.
Schedule¶
In this section, you will find a list of tutorial videos helping you to get started with analyzing text data in R.
Getting Started¶
Title | Video | Source-Code | Material |
---|---|---|---|
Working with RStudio | | | |
Creating a corpus | | | |
Working with corpora¶
Title | Video | Source-Code |
---|---|---|
Assigning document variables | | |
Saving Time | | |
Setting the Basis¶
Title | Video | Source-Code |
---|---|---|
Tokenization | | |
Tokenization and Preprocessing | | |
Document Feature Matrices | | |
Metrics and Statistics¶
Title | Video | Source-Code |
---|---|---|
Simple Text Statistics | | |
Obtaining Metrics | | |
Multi-word expressions | | |
Concepts & Context¶
Title | Video | Source-Code |
---|---|---|
Keywords in context | | |
Differentiating context and the rest of the document | | |
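As a hedged illustration of the two topics in this table, the sketch below uses quanteda's `kwic()` to show a keyword in its context and `tokens_select()` / `tokens_remove()` with a `window` to separate that context from the rest of each document. The example sentences, the keyword "peace", and the window size of 3 are assumptions chosen for illustration, not taken from the videos.

```r
# Sketch: keyword in context, and splitting context from the rest.
# Texts, keyword, and window size are invented for illustration.
library(quanteda)

texts <- c(
  d1 = "The council adopted the resolution on peace and security.",
  d2 = "Peace talks resumed after the council meeting."
)
toks <- tokens(texts, remove_punct = TRUE)

# Keyword in context: each match with up to 3 tokens on either side
kw <- kwic(toks, pattern = "peace", window = 3)
print(kw)

# Keep only the keyword and its surrounding window
context <- tokens_select(toks, pattern = "peace", window = 3)

# Keep everything except the keyword and its window
rest <- tokens_remove(toks, pattern = "peace", window = 3)

print(context)
print(rest)
```

Comparing `context` and `rest` (for example, by building one document-feature matrix from each) is one way to ask which words are typical of the surroundings of a concept rather than of the documents as a whole.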
Manual Classification¶
Title | Video | Source-Code |
---|---|---|
Using a dictionary for manual classification | | |
More about context¶
Title | Video | Source-Code |
---|---|---|
Feature Co-Occurrences | | |
Clustering¶
Title | Video | Source-Code |
---|---|---|
Clustering Documents | | |
Topic Modeling¶
Title | Video | Source-Code |
---|---|---|
Modeling topics | | |
Identify parameter k | | |
Seeded topic models | | |
Stemming & Lemmatization¶
Title | Video | Source-Code | Material |
---|---|---|---|
Stemming | | | |
Lemmatization | | | |
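The difference between the two techniques in this table can be sketched as follows. Stemming uses quanteda's `tokens_wordstem()`, which relies on the SnowballC package. For lemmatization, the sketch assumes a tiny hand-made lookup table (`lemma_table`, hypothetical) applied with `tokens_replace()`; in practice a full lemma dictionary or a linguistic tagger would take its place, and the tutorials may use a different approach.

```r
# Sketch: stemming vs. dictionary-based lemmatization with quanteda.
# The sentence and the lemma table are invented for illustration.
library(quanteda)

toks <- tokens(
  "The councils were debating while members debated earlier debates",
  remove_punct = TRUE
)

# Stemming: strip suffixes by rule; the result need not be a real word
stems <- tokens_wordstem(toks, language = "english")
print(stems)

# Lemmatization via a lookup table (hypothetical, hand-made):
# each inflected form is replaced by its dictionary form
lemma_table <- data.frame(
  token = c("councils", "were", "debating", "debated", "debates", "members"),
  lemma = c("council", "be", "debate", "debate", "debate", "member"),
  stringsAsFactors = FALSE
)
lemmas <- tokens_replace(
  toks,
  pattern = lemma_table$token,
  replacement = lemma_table$lemma,
  valuetype = "fixed"
)
print(lemmas)
```

Both steps merge different surface forms into one feature; the lookup table can also handle irregular forms such as "were", which rule-based suffix stripping cannot map to "be".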
Computing with Semantics¶
Title | Video |
---|---|
Word Embedding | |
Visualization¶
Title | Video | Slides | |
---|---|---|---|
Information Visualization | | | |
Poster Preparation | | | |
Legend¶
Find the video here
Find code material here
Find external material here
References¶
- Mirco Schoenfeld, Steffen Eckhard, Ronny Patz, Hilde van Meegdenburg, and Antonio Pires. The UN Security Council debates 1995-2020. 2021. doi:10.7910/DVN/KGVSYH.
- Ken Benoit. Text as data: an overview. In: The SAGE Handbook of Research Methods in Political Science and International Relations. SAGE Publications Ltd, 55 City Road, London, Apr 2020. doi:10.4135/9781526486387.
- Daniel Jurafsky and James H. Martin. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson/Prentice Hall, Upper Saddle River, 3rd edition, 2020. https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.
- Henry E. Brady. The challenge of big data and data science. Annual Review of Political Science, 22(1), 2019. doi:10.1146/annurev-polisci-090216-023229.
- Kenneth Benoit and Adam Obeng. readtext: import and handling for plain and formatted text files. 2018. https://readtext.quanteda.io/.
- Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. quanteda: an R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30):774, 2018. https://quanteda.io, doi:10.21105/joss.00774.
- Jenny Bryan. Happy Git and GitHub for the useR. 2018. https://happygitwithr.com/.
- Kieran Healy. Data visualization: a practical introduction. Princeton University Press, 2018. http://socviz.co/.
- Yihui Xie, Joseph J Allaire, and Garrett Grolemund. R markdown: The definitive guide. CRC Press, 2018. https://bookdown.org/yihui/rmarkdown/.
- David Lazer and Jason Radford. Data ex machina: introduction to big data. Annual Review of Sociology, 43(1):19–39, 2017. doi:10.1146/annurev-soc-060116-053457.
- John Wilkerson and Andreu Casas. Large-scale computerized text analysis in political science: opportunities and challenges. Annual Review of Political Science, 20(1):529–544, 2017. doi:10.1146/annurev-polisci-052615-025542.
- Paul DiMaggio. Adapting computational text analysis to social science (and vice versa). Big Data & Society, 2(2):2053951715602908, 2015. doi:10.1177/2053951715602908.
- Justin Grimmer and Brandon M. Stewart. Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3):267–297, 2013. doi:10.1093/pan/mps028.