8.4.3 Loadings

What do the components yielded by a PCA look like? Recall that, in scikit-learn, we can easily inspect these components after a PCA object has been fitted:

pca = sklearn.decomposition.PCA(n_components=2).fit(v_documents)
print(pca.components_)

array([[-0.01261718, -0.17502663,  0.19371289, ..., -0.05284575,
         0.15797111, -0.14855212],
       [ 0.25962781, -0.06927065,  0.0071369 , ...,  0.05184334,
        -0.05141676,  0.04312771]])

In fact, applying the fitted pca object to new data (i.e., the transform() step) comes down to a fairly straightforward multiplication of the original data matrix with the component matrix. The only step which scikit-learn adds is the subtraction of the column-wise mean of the original matrix, to center the values around a mean of zero. (Don't mind the transpose() method for now; we will explain it shortly.)

X_centered = v_documents - np.mean(v_documents, axis=0)
X_bar1 = np.dot(X_centered, pca.components_.transpose())
X_bar2 = pca.transform(v_documents)

The result is, as you might have expected, a matrix of shape 36 × 2, i.e., the coordinate pairs which we already used above. The numpy.dot() function which we use in this code block refers to the so-called dot product, a specific type of matrix multiplication (cf. chapter 3). Such a matrix multiplication is also called a linear transformation, in which each new component assigns a specific weight to each of the original feature scores. These weights, i.e., the numbers contained in the components matrix, are often also called loadings, because they reflect how important each word is to each PC.

A great advantage of PCA is that it allows us to inspect and visualize these weights or loadings in a very intuitive way, which allows us to interpret the visualization in an even more concrete way: we can plot the word loadings in a scatter plot too, since we can align the component scores with the original words in our vectorizer's vocabulary. For our own convenience, we first transpose the component matrix, meaning that we flip the matrix and turn the row vectors into column vectors:

print(pca.components_.shape)
comps = pca.components_.transpose()
print(comps.shape)

(2, 65)
(65, 2)

This is also why we needed the transpose() method a couple of code blocks ago: we had to make sure that the dimensions of X and the components matrix matched. We can now more easily "zip" this transposed matrix with our vectorizer's vocabulary and sort the resulting (word, loading) pairs:

vocab = vectorizer.get_feature_names()
vocab_weights = sorted(zip(vocab, comps[:, 0]))

We can now inspect the top and bottom of this sorted list, which shows, for each of these words, its loading (positive or negative) on PC1:

print('Positive loadings:')
print('\n'.join(f'{w} -> {s}' for w, s in vocab_weights[:5]))

Positive loadings:
a -> -0.10807762935669124
ac -> 0.1687221258690923
ad -> 0.09937937586060344
adhuc -> -0.14897266503028866
ante -> -0.006326890556843035

print('Negative loadings:')
print('\n'.join(f'{w} -> {s}' for w, s in vocab_weights[-5:]))

Negative loadings:
uidelicet -> -0.052845746442774184
unde -> 0.17621949750358742
usque -> -0.03736204189807938
ut -> -0.1511522714405677
xque -> 0.013536731659158457
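Note that sorted() orders these (word, loading) pairs alphabetically by word, because Python compares tuples element by element; the lists above therefore simply show the alphabetically first and last items of the vocabulary, together with their loadings. If we want to rank the vocabulary by the loading values themselves, we can pass an explicit key to sorted(). A minimal sketch, reusing the vocab and comps objects defined above, could look as follows:

# Rank the (word, loading) pairs by their loading on PC1,
# rather than alphabetically by word:
ranked = sorted(zip(vocab, comps[:, 0]), key=lambda pair: pair[1])
print('Strongest negative loadings on PC1:')
print('\n'.join(f'{w} -> {s}' for w, s in ranked[:5]))
print('Strongest positive loadings on PC1:')
print('\n'.join(f'{w} -> {s}' for w, s in ranked[-5:]))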
Now that we understand how each word has a specific "weight" or importance for each component, it becomes clear that, instead of the texts, we should also be able to plot the words in the two-dimensional space defined by the component matrix. The visualization is shown in figure 8.9; the underlying code runs entirely parallel to our previous scatter plot code:

l1, l2 = comps[:, 0], comps[:, 1]
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(l1, l2, facecolors='none')
for x, y, l in zip(l1, l2, vocab):
    ax.text(x, y, l, ha='center', va='center',
            color='darkgrey', fontsize=12)

Figure 8.9. Word loadings for a PC analysis (first two dimensions) on texts by three authors (texts not displayed here).

It becomes truly interesting if we first plot our texts and then overlay this plot with the loadings plot. We can do this by plotting the loadings on a so-called twin axis, opposite the axes on which we first plot our texts. A full example, which adds some additional information, reads as follows; the resulting visualization is shown in figure 8.10.

import mpl_axes_aligner.align


def plot_pca(documents_proj, loadings, var_exp, labels):
    # first the texts:
    fig, text_ax = plt.subplots(figsize=(10, 10))
    x1, x2 = documents_proj[:, 0], documents_proj[:, 1]
    text_ax.scatter(x1, x2, facecolors='none')
    for p1, p2, author in zip(x1, x2, labels):
        color = 'red' if author not in ('H', 'G', 'B') else 'black'
        text_ax.text(p1, p2, author, ha='center',
                     color=color, va='center', fontsize=12)

    # add variance information to the axis labels:
    text_ax.set_xlabel(f'PC1 ({var_exp[0] * 100:.2f}%)')
    text_ax.set_ylabel(f'PC2 ({var_exp[1] * 100:.2f}%)')

    # now the loadings:
    loadings_ax = text_ax.twinx().twiny()
    l1, l2 = loadings[:, 0], loadings[:, 1]
    loadings_ax.scatter(l1, l2, facecolors='none')
    for x, y, loading in zip(l1, l2, vectorizer.get_feature_names()):
        loadings_ax.text(x, y, loading, ha='center',
                         va='center', color='darkgrey', fontsize=12)

    mpl_axes_aligner.align.yaxes(text_ax, 0, loadings_ax, 0)
    mpl_axes_aligner.align.xaxes(text_ax, 0, loadings_ax, 0)

    # add lines through origins:
    plt.axvline(0, ls='dashed', c='lightgrey', zorder=0)
    plt.axhline(0, ls='dashed', c='lightgrey', zorder=0)


# fit the pca:
pca = sklearn.decomposition.PCA(n_components=2)
documents_proj = pca.fit_transform(v_documents)
loadings = pca.components_.transpose()
var_exp = pca.explained_variance_ratio_

plot_pca(documents_proj, loadings, var_exp, authors)

Figure 8.10. Word loadings for a PC analysis (first two dimensions) on texts by three authors. Both axes (PCs and loadings) have been properly aligned.

Such plots are great visualizations because they show the main stylistic structure in a dataset, together with an indication of how reliable each component is. Additionally, the loadings make clear which words have played an important role in determining the relationships between the texts. Loadings which can be found to the far left of the plot can be said to be typical of the texts plotted in the same area. As you can see in this analysis, there are a number of very common lemmas which are used in rather different ways by the three authors: Hildegard is a frequent user of in (probably because she always describes things she witnessed in visions), while the elevated use of et reveals the use of long, paratactic sentences in Guibert's prose. Bernard of Clairvaux uses non rather often, probably as a result of his love for antithetical reasoning.
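To make such readings a little less impressionistic, we can also look up the loading values of individual words directly. A minimal sketch, which assumes that in, et, and non are indeed present in the vectorizer's vocabulary and which reuses the vocab and loadings objects from the code above:

# Look up the PC1 and PC2 loadings of a few of the words discussed above
# (assumes these words occur in the function-word vocabulary):
for word in ('in', 'et', 'non'):
    idx = vocab.index(word)
    print(f'{word}: PC1 = {loadings[idx, 0]:.3f}, PC2 = {loadings[idx, 1]:.3f}')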
Metaphorically, the loadings can be interpreted as little "magnets": when creating the scatter plot, you can imagine that the loadings are plotted first. Next, the texts are dropped in the middle of the plot and then, according to their word frequencies, they are attracted by the word magnets, which will eventually determine their position in the diagram. Therefore, loading plots are a great tool for the interpretation of the results of a PCA. A cluster tree acts much more like a black box in this respect, but dendrograms can be used to visualize larger datasets. In theory, a PCA visualization that is restricted to just two or three dimensions is not meant to visualize large datasets that include more than ca. three to four oeuvres, because two dimensions can only visualize so much information (Binongo and Smith 1999). One final advantage, from a theoretical perspective, is that PCA explicitly tries to model the correlations which we know exist between word variables. Distance metrics, such as the Manhattan distance used in Delta or in cluster analyses, are much more naive in this respect, because they do not explicitly model such subtle correlations.

We are now ready to include the other texts of disputed authorship in this analysis (these are displayed in red in figure 8.11), but we exclude the previously analyzed test text by Bernard of Clairvaux. We have arrived at a stage in our analysis where the result should look reasonably similar to the graph shown at the beginning of the chapter, because we have followed the original implementation as closely as possible.

all_documents = preprocessing.scale(np.vstack((v_documents, v_test_docs)))

pca = sklearn.decomposition.PCA(n_components=2)
documents_proj = pca.fit_transform(all_documents)
loadings = pca.components_.transpose()
var_exp = pca.explained_variance_ratio_

plot_pca(documents_proj, loadings, var_exp,
         list(authors) + test_titles[1:])

Figure 8.11. Word loadings for a PC analysis (first two dimensions) on texts by three authors. Red texts indicate disputed authorship.

The texts of doubtful provenance are clearly drawn to the area of the space which is dominated by Guibert's samples (indicated with the prefix G_): the scatter plot therefore reveals that the disputed documents are much more similar, in terms of function word frequencies, to the oeuvre of Guibert of Gembloux, Hildegard's last secretary, than to the works of the mystic herself. The very least we can conclude from this analysis is that these writing samples cannot be considered typical of Hildegard's writing style, which, in combination with other historical evidence, should fuel doubts about their authenticity.
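Since the reliability of such a two-dimensional view depends on how much of the total variance the first two components capture, it can be useful to report the explained variance ratios alongside the plot. A minimal sketch, reusing the var_exp array from the block above:

# Report how much variance PC1 and PC2 explain, individually and combined:
print(f'PC1: {var_exp[0] * 100:.2f}%')
print(f'PC2: {var_exp[1] * 100:.2f}%')
print(f'PC1 + PC2: {var_exp.sum() * 100:.2f}%')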