available data such as authorship (Rosen-Zvi et al. 2010), relationships among authors (Krafft et al. 2012), and other document metadata (Gerrish and Blei 2012; Zhao et al. 2017). By incorporating side information researchers believe is relevant, these models can provide the starting point for further, alternative “readings” of large text corpora.

9.5 Further Reading

Mixture models and mixed-membership models tend not to feature in introductory texts in statistics and machine learning. Readers interested in learning more about them will find their task easier if they are already well-versed in Bayesian statistics (see, for example, Hoff (2009) or Lee (2012)). Bishop (2007) covers mixture models in chapter 9. Murphy (2012) addresses mixture models in chapter 11 and discusses LDA specifically in chapter 27. Those interested in digging into the details of LDA should consult two remarkably complete technical reports: Heinrich (2009) and Carpenter (2010). Research articles in the humanities and interpretive social sciences which feature the use of topic models include Block and Newman (2011), Mimno (2012a), Riddell (2014), and Schöch (2017). Chaney and Blei (2012) discuss strategies for visualizing topic models. Schofield, Magnusson, and Mimno (2017) consider how stemming and stop word removal influence the final fitted model.

Exercises

The Proceedings of the Old Bailey, 1674–1913, include almost 200,000 transcriptions of criminal trials that have taken place in London’s central court. The Old Bailey Corpus 2.0 is a balanced subset of the Proceedings, which was compiled and formatted by the University of Saarland (Magnus Huber 2016; see Huber 2007 for more information). It consists of 637 proceedings (files) in TEI-compliant XML, and amounts to approximately 24 million words. In the following exercises, we will explore (a subset of) the corpus using topic models. A simplified CSV version of the corpus can be found under data/oldbailey.csv.gz. This CSV file contains the following five columns: (i) the id of each trial, (ii) the transcription of the trial (text), (iii) the category of the offence, (iv) the verdict, and (v) the date of the trial.

Easy

First, load the corpus using Pandas, and ensure that you parse the date column as Python datetime objects. Then, select a subset of the corpus consisting of trial dates between 1800 and 1900. Before running a topic model, it is important to first get a better understanding of the structure of the collection under scrutiny. Answer the following four questions:

(1) How many documents are there in the subset?
(2) How many trials resulted in a “guilty” verdict?
(3) What is the most frequent offence category?
(4) In which month(s) of the year did most court cases take place?
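A minimal sketch of this first exercise follows. The column names used below (id, text, category, verdict, date) are an assumption based on the five-column description above, as is the lowercase “guilty” label; adjust them to match the actual file:

    import pandas as pd

    # Load the corpus; pandas reads gzip-compressed CSV files transparently.
    # Column names are assumed from the description above.
    df = pd.read_csv("data/oldbailey.csv.gz", parse_dates=["date"])

    # Select trials dated between 1800 and 1900
    subset = df[(df["date"].dt.year >= 1800) & (df["date"].dt.year < 1900)]

    print(len(subset))                                     # (1) documents in the subset
    print((subset["verdict"] == "guilty").sum())           # (2) "guilty" verdicts
    print(subset["category"].value_counts().idxmax())      # (3) most frequent offence
    print(subset["date"].dt.month.value_counts().head())   # (4) busiest months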
We now continue with training a mixed-membership model for the collection.

(1) First, construct a document-term matrix for the collection using scikit-learn’s CountVectorizer. Pay close attention to the different parameters of the vectorizer, and motivate your choices.
(2) Use scikit-learn’s LatentDirichletAllocation class to estimate the parameters of a mixed-membership model. Think about the number of components (i.e., topics) you consider necessary. After initializing the class, fit the model and transform the document-term matrix into a document-topic distribution matrix.
(3) Create two Pandas DataFrame objects, one holding the topic-word distributions (with the topics as index and the vocabulary as columns), and the other holding the document-topic distributions (with the topics as columns and the index of the corpus as index). Verify that the shapes of both data frames are correct. (A code sketch covering these three steps appears after the exercises below.)

Moderate

Look up the topic distribution of trial “t18680406-385.” Which topics feature prominently in this transcription? To which words do these topics assign relatively high probability? Do the topics provide a good summary of the actual contents of the transcription?

To further assess the quality of the mixed-membership model, create a “rank abundance” curve plot for the latent topic distributions of eight randomly chosen trials in the data (cf. section 9.3.1). Describe the shape of the document-specific mixing weights. Why aren’t the weights distributed uniformly?

Most trials provide information about the offence. In this exercise, we will investigate the relation between the topic distributions and the offence. Compute the average topic distribution for each unique offence label. Use Matplotlib’s imshow to plot the resulting matrix. Add appropriate labels (i.e., offence labels and topic labels) to the axes. Comment on your results. (Sketches for these exercises likewise follow below.)

Challenging

Topic models are sometimes used as a form of dimensionality reduction, in a manner resembling the way Principal Component Analysis (PCA) is used. Recall from an earlier chapter that a PCA analysis of a corpus with N documents using the first K principal components produces a decomposition of a document-term matrix of counts which superficially resembles a topic model’s decomposition of the same matrix: both take a sparse matrix of counts and produce, among other outputs, a dense matrix which describes each of the N documents using K values.
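A sketch of the three training steps, continuing from the subset data frame above. The vectorizer settings and the number of topics are illustrative choices rather than prescriptions, and the document-topic frame is indexed here by trial id (an assumption that eases the lookup in the first “Moderate” exercise):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # (1) Document-term matrix; max_df, min_df, and stop_words are
    # illustrative settings that you should motivate for yourself
    vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
    dtm = vectorizer.fit_transform(subset["text"])
    vocabulary = vectorizer.get_feature_names_out()

    # (2) Fit LDA; fit_transform yields the document-topic distributions
    n_topics = 25  # an arbitrary starting point; experiment with this value
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=1)
    doc_topic = lda.fit_transform(dtm)

    # (3) components_ holds unnormalized pseudo-counts, so normalize each
    # row to obtain proper topic-word probability distributions
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

    topics = [f"topic_{k}" for k in range(n_topics)]
    topic_word_df = pd.DataFrame(topic_word, index=topics, columns=vocabulary)
    doc_topic_df = pd.DataFrame(doc_topic, index=subset["id"], columns=topics)

    assert topic_word_df.shape == (n_topics, len(vocabulary))
    assert doc_topic_df.shape == (len(subset), n_topics)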
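For the first two “Moderate” exercises, a sketch that assumes the topic_word_df and doc_topic_df frames built above:

    import matplotlib.pyplot as plt

    # Topic distribution of one trial, and the top words of its top topics
    dist = doc_topic_df.loc["t18680406-385"]
    for topic in dist.sort_values(ascending=False).head(3).index:
        top_words = topic_word_df.loc[topic].sort_values(ascending=False)
        print(topic, dist[topic], top_words.head(10).index.tolist())

    # Rank abundance curves for eight randomly chosen trials: sort each
    # trial's mixing weights in descending order and plot weight by rank
    sample = doc_topic_df.sample(8, random_state=1)
    fig, ax = plt.subplots()
    for trial_id, weights in sample.iterrows():
        ranked = weights.sort_values(ascending=False).to_numpy()
        ax.plot(range(1, len(ranked) + 1), ranked, label=trial_id)
    ax.set(xlabel="topic rank", ylabel="mixing weight")
    ax.legend(fontsize="small")
    plt.show()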
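For the third “Moderate” exercise, one way to average the document-topic distributions per offence label and render the result with imshow; grouping by the raw category values assumes subset and doc_topic_df share the same row order:

    import matplotlib.pyplot as plt

    # Average topic distribution per unique offence label
    mean_by_offence = doc_topic_df.groupby(subset["category"].to_numpy()).mean()

    fig, ax = plt.subplots(figsize=(10, 6))
    im = ax.imshow(mean_by_offence.to_numpy(), aspect="auto")
    ax.set_xticks(range(mean_by_offence.shape[1]))
    ax.set_xticklabels(mean_by_offence.columns, rotation=90, fontsize="small")
    ax.set_yticks(range(mean_by_offence.shape[0]))
    ax.set_yticklabels(mean_by_offence.index)
    fig.colorbar(im, ax=ax, label="mean topic probability")
    plt.tight_layout()
    plt.show()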
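As a starting point for the “Challenging” exercise, one route (among others) to the PCA-style decomposition is TruncatedSVD, which, unlike scikit-learn’s PCA class, accepts the sparse count matrix directly; using PCA itself would require densifying and centering the matrix first:

    from sklearn.decomposition import TruncatedSVD

    K = 25  # match the number of topics for a side-by-side comparison
    svd = TruncatedSVD(n_components=K, random_state=1)
    doc_component = svd.fit_transform(dtm)  # dense matrix of shape (N, K)

    # Superficial resemblance: doc_component and doc_topic both describe
    # each document with K values. A key difference: rows of doc_topic are
    # probability distributions (non-negative, summing to one), whereas
    # rows of doc_component may contain negative values and are unscaled.
    print(doc_component.shape, doc_topic.shape)
    print(doc_component.min() < 0)   # typically True
    print(doc_topic.min() >= 0)      # always True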