SCIS 2130 - Text As Data - computational linguistics for social sciences

Course Description

Over the past decade, there has been a remarkable proliferation of textual content and interaction data generated by a variety of online platforms: social media networks, online forums, online news comments, etc. Meanwhile, Natural Language Processing (NLP) tools, powered by AI-based architectures, have greatly expanded our capacity to process these corpora automatically, turning digital text repositories into troves of data for the social sciences.

This course introduces students to recent computational methods for processing and analyzing text as data, including named entity extraction, dependency parsing, topic modeling, and supervised learning. More importantly, we will read the computational social science literature that applies these methods to text data in order to investigate socio-political dynamics such as political mobilization, cultural dynamics, and the empirical modeling of polarization processes.

The course is organized as a workshop. Each day is split into two parts: three hours of reading and lecture in the morning, and three hours of hands-on text analysis practice in small groups in the afternoon. Students are expected to work on collective projects over the week. For this iteration of the course, we will provide prepared datasets to investigate. Students may also bring their own dataset and associated research questions.

General References:

Salganik MJ. Bit by bit: Social research in the digital age. Princeton University Press; 2019 Aug 6.

Gentzkow M, Kelly B, Taddy M. Text as data. Journal of Economic Literature. 2019 Sep;57(3):535-74.

Day 1: Lexicometry, words as data

During day 1, we will cover the fundamentals of information extraction from texts: tokenization, part-of-speech (POS) tagging, keyword extraction, and named entity recognition. These basic methods lay the groundwork for the more holistic view of text analysis we will develop during the week. We will also teach you the programming skills needed for the rest of the course. This includes the fundamentals of script writing in Python, especially the use of standard data science libraries such as pandas and spaCy.

Reading: Moretti F, Pestre D. Bankspeak: the language of World Bank reports. New Left Review. 2015 Mar;92(2):75-99.

Day 2: Topic Modeling, words and documents as data

On day 2, we will show how the structure of co-occurrence between words can shed light on the topical structure of texts. We will introduce the notions of term-document matrix and semantic networks, and learn how to model the topic distribution of a corpus using algorithms such as Latent Dirichlet Allocation (LDA) or the Structural Topic Model (STM).

Reading: Ylä-Anttila T, Eranti V, Kukkonen A. Topic modeling for frame analysis: A study of media debates on climate change in India and USA. Global Media and Communication. 2022 Apr;18(1):91-112.

Day 3: Word Embedding, contexts as data

On day 3, we will introduce word and sentence embeddings, which are continuous representations of meaning. They offer a convenient geometric framework for modeling semantic similarities (as well as their dynamics) and complex relationships between words, such as analogies. We will also use document embeddings to teach document clustering methods, which offer an alternative to traditional topic models (BERTopic, Top2Vec).

Reading: Kozlowski AC, Taddy M, Evans JA. The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review. 2019 Oct;84(5):905-49.

Day 4: Coding with the machine, expert annotations as data

In the final teaching session, we will cover (semi-)supervised learning methods for text analysis. We will discuss various applications, from sentiment analysis to the automatic recognition of specific discursive strategies.

Reading: Do S, Ollion É, Shen R. The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy. Sociological Methods & Research. 2022 Dec 4:00491241221134526.

Day 5: Project Presentation

The morning will be devoted to a general discussion of the field of text-as-data analysis. We will invite NLP and computational social science experts to attend your project presentations in the afternoon.

Reading: Hofman JM, Watts DJ, Athey S, Garip F, Griffiths TL, Kleinberg J, Margetts H, Mullainathan S, Salganik MJ, Vazire S, Vespignani A. Integrating explanation and prediction in computational social science. Nature. 2021 Jul 8;595(7866):181-8.

Instructors

Jean-Philippe COINTET

Type

Seminar

Language of tuition

English

Pre-requisite

None. Students may nevertheless prepare by learning the basics of Python programming beforehand.

Semester

Spring 2024-2025