This is the main course website for the seminar Introduction to Computer-based Text Analysis, held in the summer term 2024 at the University of Bayreuth.
To view the results of past participants' projects, go here: https://mircoschoenfeld.de/seminar-introduction-to-computer-based-text-analysis.html
News
The schedule lists all tutorials available for this course. The dates given indicate which videos will be discussed in the session on that date.
- 15. July 2024: The final poster presentation will take place on 15. July 2024! Looking forward to great contributions!
- 01. July 2024: July 1 will be the day of your preliminary poster presentations!
- 24. June 2024: The last session in June is devoted to Information Visualization and Poster Preparation. Make sure to check out good examples of posters from previous semesters.
- 17. June 2024: In this session, we will conduct a peer review of your project ideas.
- 10. June 2024: Today, we have our methodology forum. Also, you can find the slides for Large Language Models and Word Embeddings here. For next week, prepare a written description of your project ideas. In the session, we will peer review your ideas.
- 03. June 2024: No class on 3. June. Instead, you are asked to contribute to our methodology forum for next week: create a small working tutorial of a method that has not been presented in the course yet. Presentation and discussion of contributions will take place on 10. June.
- 27. May 2024: Next topic: Word Embedding
- 13. May 2024: On 27. May, we will discuss clustering of documents, topic modeling, as well as important details around preprocessing: stemming and lemmatization.
- 06. May 2024: On 13th May, we will discuss the context of words and manual classifications.
- 29. April 2024: These videos on metrics and statistics will be discussed in the session of 6th May!
- 22. April 2024: For the next session, please prepare the topics on “Setting the basis”!
- 15. April 2024: Welcome to the Summer Term 2024! Today, we start with a gentle introduction. Next week, we will discuss how to build a corpus!
Recordings of the lecture are available online. Please see the schedule for a selection of relevant videos.
Syllabus
A central challenge of our time is processing a constantly growing volume of text. Every day, collections are created that a single person can hardly work through in a reasonable amount of time: newspaper articles, statements, minutes, communiqués, blog articles, or social media posts. To help us understand large amounts of text, we turn to computational methods. In this course, we will explore such methods: methods for the quantitative analysis of text collections, methods for extracting information, and statistical methods for analyzing large corpora. These methods will also be demonstrated in practice using R and evaluated together. A critical look at the results of automated analyses is also an important part of the seminar.
Based on the newly learned methods, participants develop their own scientific questions and work on them in small groups during the semester.
In this course, students learn the main theoretical and methodological principles of computer-assisted text analysis and how to apply these methods to their own research projects. After successfully completing the seminar, students will be able to connect a scientific research question with suitable methods of computer-based text analysis in a project of their own.
Check out results of research projects from previous iterations of this course.
R-Basics
In case you want to (re-)build basic R skills, please feel free to check out my other tutorials on R. Students of the University of Bayreuth can also enroll in an e-learning course that offers exercises with automated evaluation.
Schedule
In this section, you will find a list of tutorial videos helping you to get started with analyzing text data in R.
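To give a rough idea of the workflow the tutorials build towards, here is a minimal sketch. It assumes that the quanteda package listed in the references is installed. The two example texts and all object names (`texts`, `corp`, `toks`, `dfmat`) are made up for illustration and are not taken from the course material.

```r
# Minimal sketch of a text analysis workflow with quanteda.
# Assumption: install.packages("quanteda") has been run beforehand.
library(quanteda)

# Two made-up example documents (hypothetical, not course data)
texts <- c(
  doc1 = "Text analysis helps us read large collections of text.",
  doc2 = "Computational methods support the analysis of large corpora."
)

# Build a corpus from the named character vector
corp <- corpus(texts)

# Tokenize, drop punctuation, lowercase, and remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Count features per document in a document-feature matrix
dfmat <- dfm(toks)
print(dfmat)

# Most frequent features across both documents
print(topfeatures(dfmat, 5))

# Keywords in context: two tokens on either side of "analysis"
print(kwic(toks, pattern = "analysis", window = 2))
```

Each step corresponds loosely to a block in the schedule below (creating a corpus, tokenization and preprocessing, document-feature matrices, keywords in context); the tutorials themselves may use different data and settings.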
Getting Started
Title | Video | Source-Code | Material |
---|---|---|---|
Working with RStudio | | | |
Creating a corpus | | | |
Working with corpora
Title | Video | Source-Code |
---|---|---|
Assigning document variables | | |
Saving Time | | |
Setting the Basis
Title | Video | Source-Code |
---|---|---|
Tokenization | | |
Tokenization and Preprocessing | | |
Document Feature Matrices | | |
Metrics and Statistics
Title | Video | Source-Code |
---|---|---|
Simple Text Statistics | | |
Obtaining Metrics | | |
Multi-word expressions | | |
Concepts & Context
Title | Video | Source-Code |
---|---|---|
Keywords in context | | |
Differentiating context and the rest of the document | | |
Manual Classification
Title | Video | Source-Code |
---|---|---|
Using a dictionary for manual classification | | |
More about context
Title | Video |
---|---|
Feature Co-Occurrences | |
Clustering
Title | Video |
---|---|
Clustering Documents | |
Topic Modeling
Title | Video | Source-Code |
---|---|---|
Modeling topics | | |
Identify parameter k | | |
Seeded topic models | | |
Stemming & Lemmatization
Title | Video | Source-Code | Material |
---|---|---|---|
Stemming | | | |
Lemmatization | | | |
Computing with Semantics
Title | Video | Slides | |
---|---|---|---|
Word Embedding | | | |
Visualization
Title | Video | Slides | |
---|---|---|---|
Information Visualization | | | |
Poster Preparation | | | |
Legend
Find the video here | |
Find code material here | |
Find external material here |
References
- Mirco Schoenfeld, Steffen Eckhard, Ronny Patz, Hilde van Meegdenburg, and Antonio Pires. The UN Security Council debates 1995-2020. 2021. doi:10.7910/DVN/KGVSYH.
- Ken Benoit. Text as data: an overview. In: The SAGE Handbook of Research Methods in Political Science and International Relations. SAGE Publications Ltd, 55 City Road, London, Apr 2020. doi:10.4135/9781526486387.
- James H Martin and Daniel Jurafsky. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson/Prentice Hall, Upper Saddle River, 3rd edition, 2020. https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.
- Henry E. Brady. The challenge of big data and data science. Annual Review of Political Science, 22(1), 2019. doi:10.1146/annurev-polisci-090216-023229.
- Kenneth Benoit and Adam Obeng. Readtext: import and handling for plain and formatted text files. 2018. https://readtext.quanteda.io/.
- Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. Quanteda: an R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30):774, 2018. https://quanteda.io, doi:10.21105/joss.00774.
- Jenny Bryan. Happy Git and GitHub for the useR. 2018. https://happygitwithr.com/.
- Kieran Healy. Data visualization: a practical introduction. Princeton University Press, 2018. http://socviz.co/.
- Yihui Xie, Joseph J Allaire, and Garrett Grolemund. R markdown: The definitive guide. CRC Press, 2018. https://bookdown.org/yihui/rmarkdown/.
- David Lazer and Jason Radford. Data ex machina: introduction to big data. Annual Review of Sociology, 43(1):19–39, 2017. doi:10.1146/annurev-soc-060116-053457.
- John Wilkerson and Andreu Casas. Large-scale computerized text analysis in political science: opportunities and challenges. Annual Review of Political Science, 20(1):529–544, 2017. doi:10.1146/annurev-polisci-052615-025542.
- Paul DiMaggio. Adapting computational text analysis to social science (and vice versa). Big Data & Society, 2(2):2053951715602908, 2015. doi:10.1177/2053951715602908.
- Justin Grimmer and Brandon M. Stewart. Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3):267–297, 2013. doi:10.1093/pan/mps028.