Humanities data analysis

“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:05 — page 313 — #29
Topic Model of US Supreme Court Opinions, 1900–2000

    import itertools

    # Split the opinion into consecutive, non-overlapping three-word windows.
    opinion_text = df.loc[opinion_of_interest, 'text'][0]
    window_width, num_words = 3, len(opinion_text.split())
    words = iter(opinion_text.split())
    windows = [
        ' '.join(itertools.islice(words, 0, window_width))
        for _ in range(num_words // window_width)
    ]
    # Show only the windows that mention "minor".
    print([window for window in windows if 'minor' in window])

    [
        'a minor-party candidate',
        'candidates from minor',
        'amendments, a minor-party',
        'minor-party candidate secured',
        'that a minor-party',
        'a minor-party or',
        'and minor-party candidates,',
        'candidates of minor',
        'minor parties who',
        'number of minor',
        'minor parties having',
        'virtually every minor-party',
        'of 12 minor-party',
        'minor-party ballot access.',
        'that minor parties',
        'about how minor',
        "minor party's qualification",
        'primary," minor-party candidates',
        'which minor-party candidates',
        'a minor-party candidate',
    ]

Having superficially checked the suitability of the mixed-membership model as a model of our corpus and having reviewed the capacity of topic models to capture, at least to some extent, differences in word senses, we will now put the model to work in modeling trends visible in the Supreme Court corpus.

9.3.4 Exploring trends over time in the Supreme Court

The opinions in the corpus were written over a considerable time frame. In addition to being associated with mixing weights, each opinion is associated with a year, the year in which the opinion was published. Having the year of publication associated with each opinion allows us to gather together all the opinions published in a given year and calculate how frequently words associated with each latent topic distribution appear. As we will see, the rise and fall of the prevalence of specific latent distributions appears to track the prominence of legal issues
associated with words linked to the distributions. We began this chapter with one example of a trend (an increasing rate of cases related to race and education) and we will offer an additional example here. Figure 9.1 at the beginning of this chapter showed, for each year, the proportion of all words in all opinions related to a “topic” characterized, in a sense which is now clear, by words such as school, race, voting, education, and minority. That is, these words are among the most probable words under the latent topic distribution.

In this section we will consider a different trend which the model appears able to capture. This trend tracks the rise of laws regulating labor union activity since the 1930s and the associated challenges to these laws which yield Supreme Court opinions. Prior to the 1930s, the self-organization of employees into labor unions for the purpose of, say, protesting dangerous or deadly working conditions faced considerable and sometimes insurmountable obstacles. In this period, capitalist firms were often able to enlist the judiciary and the government to prevent workers from organizing. Legislation passed in the 1930s created a legal framework for worker organizations out of which modern labor law emerged.

Likely because labor law is anchored in laws passed during a short stretch of time (the 1930s), it is a particularly well-defined body of law, and our mixed-membership model is able to identify our two desired items: a cluster of semantically related words linked to labor law, and document-specific proportions of words associated with this cluster. In our model, the latent distribution is Topic 24, and it is clear from inspecting the 10 words most strongly associated with the latent distribution that it does indeed pick out a set of semantically related words connected to workers’ organizations:

    labor_topic = 'Topic 24'
    topic_word_distributions.loc[labor_topic].sort_values(
        ascending=False).head(10)

    board        23,593.34
    union        22,111.38
    labor        16,687.69
    employees     9,523.08
    bargaining    7,999.96
    act           6,221.95
    employer      5,935.99
    collective    5,838.04
    agreement     4,569.44
    relations     3,802.37

As it will be useful in a moment to refer to this constellation of top words by a human-readable label rather than an opaque number (e.g., Topic 24), we will concatenate the top words into a string (topic_top_words_joined) which we can use as an improvised label for the latent distribution.

    topic_top_words = topic_word_distributions.loc[labor_topic].sort_values(
        ascending=False).head(10).index
    topic_top_words_joined = ', '.join(topic_top_words)
    print(topic_top_words_joined)

    board, union, labor, employees, bargaining, act, employer, collective, agreement, relations

Before we can plot the prevalence of words associated with this semantic constellation, we need to decide on what we mean by “prevalence.” This is a conceptual question which has little to do with mixed-membership models as such and everything to do with measuring the presence of a continuous feature associated with elements in a population. Our mixed-membership model gives us measurements of the mixing weights (technically speaking, point estimates of parameters), interpretable as the estimated proportions of words associated with latent distributions in a single opinion. Taken by themselves, these proportions do not capture information about the length of an opinion. And we might reasonably expect to distinguish between a 14,000-word opinion in which 50% of the words are associated with a topic and an opinion which is 500 words long and has the same proportion associated with the topic. Our recommended solution here is to take account of document length by plotting, for each year, the proportion of all words in all opinions associated with a given topic. We can, in effect, calculate the total number of words in all opinions published in a given year associated with a
topic by multiplying opinion lengths by the estimated topic shares. Finally, to make years with different numbers of opinions comparable, we can divide this number by the total number of words in opinions from that year.[21]

    # convert `dtm` (matrix) into an array:
    opinion_word_counts = np.array(dtm.sum(axis=1)).ravel()
    word_counts_by_year = pd.Series(opinion_word_counts).groupby(
        df.year.values).sum()
    topic_word_counts = document_topic_distributions.multiply(
        opinion_word_counts, axis='index')
    topic_word_counts_by_year = topic_word_counts.groupby(df.year.values).sum()
    topic_proportion_by_year = topic_word_counts_by_year.divide(
        word_counts_by_year, axis='index')
    topic_proportion_by_year.head()

          Topic  Topic  Topic  Topic  Topic 96  Topic 97  \
    1794   0.00   0.02   0.00   0.00      0.00      0.00
    1795   0.00   0.01   0.00   0.00      0.00      0.00
    1796   0.00   0.04   0.00   0.00      0.00      0.00
    1797   0.00   0.02   0.00   0.00      0.00      0.00
    1798   0.00   0.01   0.00   0.00      0.00      0.00

          Topic 98  Topic 99
    1794      0.00      0.00
    1795      0.00      0.00
    1796      0.00      0.00
    1797      0.00      0.00
    1798      0.00      0.00

[21] There are some alternative statistics which one might reasonably want to consider. For example, the maximum proportion of words associated with a given topic would potentially measure the “peak attention” any judge gave to the topic in an opinion.

Figure 9.7. Prevalence of Topic 23 (3 year rolling average)

As a final step, we will take the three-year moving average of this measure to account for the fact that cases are heard by the Supreme Court irregularly. Because the Court only hears a limited number of cases each year, the absence of a case related to a given area of law in one or two years can happen by chance; an absence over more than three years is, by contrast, more likely to be meaningful. A three-year moving average of our statistic allows us to smooth over aleatory absences. (Moving averages are discussed in chapter 4.)
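The length weighting and the moving average described above can be sketched on a small invented table. Everything in this example is an assumption made for illustration: the column names (year, n_words, topic_share) and all of the values are hypothetical and do not come from the Supreme Court corpus.

```python
import pandas as pd

# Hypothetical toy data: one row per opinion, with its year, its length in
# words, and the estimated share of its words associated with one topic.
opinions = pd.DataFrame({
    "year": [1935, 1935, 1936, 1937, 1938],
    "n_words": [9000, 1000, 2000, 1000, 4000],
    "topic_share": [0.50, 0.00, 0.00, 0.10, 0.25],
})

# Estimated number of topic words per opinion: length times share.
opinions["topic_words"] = opinions["n_words"] * opinions["topic_share"]

# Length-weighted proportion per year: topic words divided by all words
# published that year.
by_year = opinions.groupby("year")[["topic_words", "n_words"]].sum()
proportion = by_year["topic_words"] / by_year["n_words"]

# For comparison only: the unweighted mean of the shares treats a short
# opinion and a long opinion as equally informative.
unweighted = opinions.groupby("year")["topic_share"].mean()

# Three-year moving average; the first two years have no full window and
# are therefore NaN.
rolling = proportion.rolling(window=3).mean()

print(pd.DataFrame({
    "weighted": proportion,
    "unweighted": unweighted,
    "rolling": rolling,
}))
```

In this toy table the two 1935 opinions have very different lengths, so the weighted proportion for that year is pulled toward the long opinion while the unweighted mean is not. The year 1936 contains no topic words at all, and the moving average spreads that single-year absence across its neighbors.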
Finally, we restrict our attention to the period after 1900, as the practices of the early Supreme Court tended to be considerably more variable than they are today (see figure 9.7).

    import matplotlib.pyplot as plt

    window = 3  # three-year moving average
    topic_proportion_rolling = topic_proportion_by_year.loc[
        1900:, labor_topic].rolling(window=window).mean()
    topic_proportion_rolling.plot()
    plt.title(f'Prevalence of {labor_topic} ({window} year rolling average)'
              f'\n{topic_top_words_joined}')

Figure 9.7 shows the rise of decisions about the regulation of labor union activity. As mentioned earlier, prior to the 1930s, the self-organization of employees into labor unions for the purpose of, say, protesting dangerous working conditions faced considerable obstacles. In this period, employers were typically able to recruit the judiciary and the government into criminalizing workers’ efforts to organize themselves into unions. One well-known example of this is from May 1894, when the railroad corporations enlisted the government to dispatch the military to stop workers associated with the American Railway Union (ARU) from striking. The pretext for deploying the army in ...

Date posted: 20/11/2022, 11:32