SCIS 2130 - Text As Data - computational linguistics for social sciences
Over the past decade, there has been a remarkable proliferation of
textual content and interaction data generated by a variety of online
platforms: social media networks, online forums, online news
comments, etc. Meanwhile, Natural Language Processing (NLP) tools,
powered by AI-based architectures, have greatly expanded our capacity
to automatically process these corpora, turning digital text repositories
into troves of data for the social sciences. This course introduces
students to recent computational methods for processing and analyzing
text as data, including named entity extraction, dependency parsing,
topic modeling, and supervised learning. More importantly, we will read
the computational social science literature that uses these methods to
investigate socio-political dynamics through text data, such as
political mobilization, cultural dynamics, and the empirical modeling of
polarization processes.
The course is organized as a workshop. Each day is split into two parts:
three hours of reading and lecture in the morning, and three hours of
hands-on practice of text analysis in small groups in the afternoon.
Students are expected to work on collective projects over the week. For
this iteration of the course, we will provide prepared datasets to
investigate. Students may also bring their own dataset and associated
research questions.
General References: Salganik MJ. Bit by bit: Social research in the
digital age. Princeton University Press; 2019 Aug 6.
Gentzkow M, Kelly B, Taddy M. Text as data. Journal of Economic
Literature. 2019 Sep;57(3):535-74.
Day 1: Lexicometry, words as data
During day 1, we will cover the fundamentals of information extraction
from texts: tokenization, POS tagging, keyword extraction, and named
entity recognition. These basic methods lay the groundwork for the more
holistic view of text analysis we will develop during the week.
We will also teach you the programming skills needed for the rest of the
course. This includes the fundamentals of scripting in Python,
especially the use of standard data science libraries such as pandas
and spaCy.
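As a rough illustration of the Day 1 vocabulary, the sketch below tokenizes a toy corpus and counts candidate keywords using only the Python standard library. It is not the course material: the two sentences, the tiny stop-word list, and the function names `tokenize` and `keyword_counts` are assumptions made up for this example, and a regular expression stands in for the spaCy pipeline used in the course.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "The bank supports poverty reduction and growth.",
    "Growth and governance remain priorities for the bank.",
]

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "and", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_counts(documents):
    """Count every token that is not a stop word, over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts


counts = keyword_counts(docs)
print(counts.most_common(3))
```

Raw frequency is the crudest possible keyword score. POS tagging and named entity recognition need a trained model and are not attempted here.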
Reading: Moretti F, Pestre D. Bankspeak: the language of World Bank
reports. New Left Review. 2015 Mar;92(2):75-99.
Day 2: Topic Modeling, words and documents as data
On day 2, we will show how the structure of co-occurrence between words
can shed light on the topical structure of texts. We will introduce the
notions of term-document matrix and semantic networks, and learn how to
model the topic distribution of a corpus using algorithms such as LDA
(Latent Dirichlet Allocation) or STM (Structural Topic Model).
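The term-document matrix and the word co-occurrences mentioned above can be sketched in a few lines. This is a minimal example under stated assumptions: the three short documents and the helper names `term_document_matrix` and `cooccurrence` are invented here, and no topic model is fitted. The matrix is only the kind of input that algorithms such as LDA start from.

```python
import re

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "climate change policy debate",
    "climate policy and energy",
    "election debate and campaign",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for toks in tokenized for t in toks})
    matrix = [[toks.count(term) for toks in tokenized] for term in vocab]
    return vocab, matrix


def cooccurrence(vocab, matrix, a, b):
    """Number of documents in which terms a and b both appear."""
    row_a = matrix[vocab.index(a)]
    row_b = matrix[vocab.index(b)]
    return sum(1 for x, y in zip(row_a, row_b) if x > 0 and y > 0)


vocab, tdm = term_document_matrix(docs)
print(vocab)
print(cooccurrence(vocab, tdm, "climate", "policy"))
```

Linking every pair of terms by its co-occurrence count gives a simple semantic network in which terms are nodes and counts are edge weights.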
Reading: Ylä-Anttila T, Eranti V, Kukkonen A. Topic modeling for
frame analysis: A study of media debates on climate change in India
and USA. Global Media and Communication. 2022 Apr;18(1):91-112.
Day 3: Word Embedding, contexts as data
On day 3, we will introduce word and sentence embeddings, which are
continuous representations of meaning. They offer a convenient geometric
framework for modeling semantic similarities (as well as their dynamics)
and complex relationships between words, such as analogies. We will also
use document embeddings to teach document clustering methods that offer
an alternative to traditional topic models (BERTopic, Top2Vec).
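The geometric view of meaning can be made concrete with cosine similarity and vector arithmetic. The sketch below assumes hand-written three-dimensional vectors chosen purely for illustration. Real embeddings are learned from large corpora and have hundreds of dimensions, and the names `cosine` and `analogy` are hypothetical helpers, not part of any library.

```python
import math

# Hypothetical 3-dimensional vectors, hand-written for illustration only.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vecs):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [y - x + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vecs[w], target))


print(cosine(vectors["king"], vectors["queen"]))
print(analogy("man", "king", "woman", vectors))
```

With these toy vectors, the nearest remaining word to `king - man + woman` is `queen`. The result follows from how the vectors were chosen and is not an empirical finding.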
Reading: Kozlowski AC, Taddy M, Evans JA. The geometry of culture:
Analyzing the meanings of class through word embeddings. American
Sociological Review. 2019 Oct;84(5):905-49.
Day 4: Coding with the machine, expert annotations as data
In this final methods session, we will cover (semi-)supervised learning methods
for text analysis. We will discuss various applications, from sentiment
analysis to the automatic recognition of specific discursive strategies.
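To make the idea of learning from expert annotations concrete, here is a minimal supervised classifier: a multinomial Naive Bayes with add-one smoothing, written with the standard library only. Everything in it is an assumption for illustration: the four labelled sentences, the `pos`/`neg` labels, and the `fit` and `predict` helpers. It is a deliberately simple stand-in and does not reproduce the transfer learning approach of the day's reading.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-labelled examples standing in for expert annotations.
train = [
    ("we welcome this great reform", "pos"),
    ("a great and hopeful speech", "pos"),
    ("this reform is a terrible mistake", "neg"),
    ("a terrible and dishonest speech", "neg"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit(examples):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        class_docs[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def predict(text, model):
    """Return the label with the highest smoothed log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                count = word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict("a great speech", model))
print(predict("a terrible mistake", model))
```

In practice the annotated sample would be split into training and held-out sets so that the classifier's accuracy can be measured before it is applied to the full corpus.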
Reading: Do S, Ollion É, Shen R. The Augmented Social Scientist:
Using Sequential Transfer Learning to Annotate Millions of Texts with
Human-Level Accuracy. Sociological Methods & Research. 2022 Dec
4:00491241221134526.
Day 5: Project Presentation
The morning will be devoted to a general discussion of the field of
text-as-data analysis. We will invite NLP and computational social
sciences experts to attend your project presentations in the afternoon.
Reading: Hofman JM, Watts DJ, Athey S, Garip F, Griffiths TL,
Kleinberg J, Margetts H, Mullainathan S, Salganik MJ, Vazire S,
Vespignani A. Integrating explanation and prediction in computational
social science. Nature. 2021 Jul 8;595(7866):181-8.
Jean-Philippe COINTET
Seminar
English
None; students may nevertheless
prepare by learning the basics of
Python programming beforehand.