(Master's thesis) Tagset evaluation and automatic error verification in POS-tagged corpus

Characteristics of Vietnamese language

Vietnamese, like every language, possesses characteristics that distinguish it from others. In this section, we explore key features of the Vietnamese language and compare it with languages such as Chinese and English.

Vietnamese, a native language of Vietnam, belongs to the Austroasiatic language family and is closely related to the Muong language. It is classified as an isolating language with three key characteristics. Firstly, syllables are the basic units for forming words and sentences, whether as standalone words or as components of complex, compound, or reduplicative words. Secondly, Vietnamese words do not inflect; for instance, nouns are not marked for singular or plural, as seen in "một cuốn sách" (one book) and "hai cuốn sách" (two books). Lastly, grammatical meaning is conveyed mainly through word order and function words (called expletives in this thesis), such as "sẽ," "đã," and "không." For example, adding one of these words to the sentence "Tôi ra ngoài" yields three different meanings: "Tôi sẽ ra ngoài" (I will go out), "Tôi đã ra ngoài" (I have gone out), and "Tôi không ra ngoài" (I am not going out).

Figure 1. Typological features of Vietnamese:

• The syllable is the basic unit for forming words and sentences.

• Vietnamese words are not inflected.

• Grammatical meaning is expressed mainly through word order and expletives.

Isolating languages like Chinese and Thai differ significantly from inflectional languages such as English, French, and Russian. Comparing equivalent Vietnamese, Chinese, and English sentences highlights these differences in structure and grammar.

Table 1. The expression of grammatical meaning in Vietnamese

Method | Vietnamese | Chinese | English
Word order | Tôi yêu anh ấy | |
Expletive | Tôi không yêu anh ấy | Wo bu ai ta | I do not love him

Unlike in Vietnamese and Chinese, when the word order of the English sentence changes, the object pronoun turns into the subject pronoun (him → he).

Vietnamese parts of speech

In European languages, the notion of part of speech (POS) is tied to morphological categories such as gender, number, and mood. For Vietnamese, there are two schools of thought:

• Firstly, POS does not exist in Vietnamese because Vietnamese words undergo no morphological modification (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung).

• Secondly, Vietnamese has POS just as European languages do, but classifying words into tags, or defining the POS of a word, must be based on certain criteria.

Vietnamese linguists have largely agreed on specific criteria for parts of speech (POS), as outlined by Diep Quang Ban and Hoang Van Thung (2010). Firstly, the general meaning of a POS is the collective meaning of a group of words, obtained by generalizing over vocabulary into common grammatical categories. For example, words like "nhà" (house), "bàn" (table), and "chim" (bird) are classified as nouns since their meanings generalize to objects. Secondly, combination ability concerns which words a word can combine with to form meaningful phrases; words that can replace one another in the same context belong to the same class. Lastly, syntactic function concerns the roles words play in a sentence: the positions they can occupy and how they relate to other sentence elements.

1.2.2 Ways to build a tagset

Two kinds of POS tagsets have been developed so far, of which the first has received much more attention from linguistic researchers.

The first kind is based on eight fundamental tags commonly found in dictionaries and linguistic resources: noun, verb, adjective, pronoun, adverb, conjunction, interjection, and emotive word. From these basic tags, researchers develop more specific tagsets according to various criteria, as detailed in section 1.2.1. For instance, Tran Thi Oanh's VnPos comprises 14 tags, the VietTreeBank tagset has 17 tags, and VnQtag expands to 59 tags, as outlined in the appendix.

The second kind is built by mapping a tagset from another language to Vietnamese, based on the association between words of the two languages (Dinh Dien and Hoang Kiem).

Corpora

Annotated corpora are large collections of text with linguistically informative markup, and they are essential for advances in computational linguistics. Significant effort has gone into developing such corpora in many countries. Notable examples include the British National Corpus, the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997), and the Lancaster Corpus of Mandarin Chinese (McEnery & Xiao, 2005). In Vietnam, corpora such as VnQtag, VnPos, and VTB have also been developed.

To build a corpus, some obligatory criteria need to be ensured (McEnery and Wilson):

• Sampling and representativeness: the elements of a corpus must be general, diverse, and plentiful. A sample is representative if what we find for the sample also holds for the general population.

• Finite size: the larger a corpus is, the more highly it is valued, but its size remains finite.

Building a large linguistic corpus manually is time-consuming and requires extensive knowledge. In addition, the quality of a manually constructed corpus can be inconsistent. Our thesis aims to identify these issues and improve the overall quality of the corpus.

The two corpora used in our experiments are VietTreeBank and VnQtag. We now discuss in more detail how each was built.

VietTreeBank (VTB) is a product of the national VLSP project, developed by the VTB group, which includes Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen, and the annotators. The corpus consists of 142 documents on political and social topics from the Youth newspaper, with 10,000 Vietnamese sentences annotated for word segmentation, POS tagging, and syntactic structure. Using maximum entropy models (MEMs) and conditional random fields (CRFs), the group achieved a POS tagging precision exceeding 93%. VTB aims to support tools for word segmentation, POS tagging, and syntactic parsing. The group classified POS according to two criteria, combinability and syntactic function: for example, nouns serve as subjects or objects and can combine with numerals (e.g., three, four) and attributes (e.g., each, every).

A POS tag can carry information about the basic class of a word (noun, verb, adjective, and so on), as well as morphological details such as countability and subcategories that define grammatical relationships. The VTB group developed a tagset that covers only the basic word classes, omitting additional information such as morphological aspects and subcategories. For a detailed view of the tagset, please refer to the appendix.

In addition to POS information, the group defines syntactic elements such as phrases and clauses. Syntax tags are the basic components of the syntax tree and form its core structure. See A7 and A8 in the appendix, which list the phrase and clause tagsets, respectively.

The function tag of a syntactic element indicates its role within a higher-level syntactic structure. These tags are assigned to the key components of a sentence, including the subject, predicate, and object, and they help us recognize fundamental grammatical relationships.

The tagging process for each sentence in the corpus consists of three steps: word segmentation, POS tagging, and syntactic parsing.

The VnQtag tagset, developed as part of the KC01 national project, was created by a team consisting of Nguyen Thi Minh Huyen, Vu Xuan Luong, and Le Hong Phuong. Using the Vietnamese dictionary of the Institute of Linguistics (2000), the group first segmented sentences into words with both a syllable automaton and a lexical automaton. They then used the Qtag tagger to assign one of 59 POS labels to each Vietnamese word, drawing on semantic information to place words in distinct classes. For instance, verbs are identified by their process meaning, which covers actions as well as states of an object in time and space. The automatic tagger was tested on seven documents, as detailed in Table 2. This annotated corpus is valuable for natural language processing (NLP), serving as a high-quality linguistic database that adheres to international standards.

The corpus is formatted with one lexical unit and its part of speech (POS) per line, using spaces to separate syllables and a tab to separate each word from its POS label. Punctuation and other symbols are treated as lexical units with corresponding punctuation labels. The corpus consists of seven documents from various genres, including stories, novels, scientific texts, and press articles. It covers words commonly used in everyday language and journalism, as well as terms frequently found in literary works and scientific literature.
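The file layout described above (one lexical unit per line, a tab before the POS label, spaces between syllables) can be read with a few lines of code. The following is a minimal sketch, not the project's own tooling; the function name `parse_tagged_lines` and the sample lines with their tag assignments are hypothetical and serve only as illustration.

```python
from typing import List, Tuple


def parse_tagged_lines(text: str) -> List[Tuple[str, str]]:
    """Parse 'word<TAB>tag' lines into (word, tag) pairs.

    Syllables inside a word stay separated by single spaces.
    Blank lines and lines without a tab are skipped (an assumption).
    """
    pairs = []
    for line in text.splitlines():
        if "\t" not in line:
            continue
        word, tag = line.rsplit("\t", 1)
        pairs.append((word.strip(), tag.strip()))
    return pairs


# Hypothetical sample in the described layout; tags are illustrative only.
sample = "Tôi\tPp\nđi\tVitm\nxe đạp\tNa\n"
pairs = parse_tagged_lines(sample)

# Number of syllables per lexical unit, recovered from the spaces.
syllable_counts = [len(word.split(" ")) for word, _ in pairs]
```

Keeping the word as a single string with internal spaces preserves the segmentation decision ("xe đạp" as one unit) that the corpus encodes.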

Table 2. Corpus with VnQtag tagset annotation

No. | Document | Genre | Number of lexical units | Number of processing units (including punctuation)
2 | Chuyen tinh ke truoc luc rang dong, part I | Novel | 14277 | 16787
3 | Chuyen tinh ke truoc luc rang dong, part II | Novel | 12499 | 14698
4 | Luoc su thoi gian | Science | 10598 | 11626
6 | Nhung bai hoc nong thon | Story | 6682 | 8244
7 | Cong nghe va he thong phong thu quoc gia | Press | 1028 | 1162

Motivation

In this section, we will explore the specific problems my thesis aims to address and the rationale behind my choices in tackling these issues.

Linguistic theories initially focused on Indo-European languages, which led to significant advances over time. In Vietnam, work on Natural Language Processing (NLP) began in 1990, but progress has been limited. The responsibility for Vietnamese language processing lies primarily with local researchers, as foreign scholars may not prioritize this area (Ho Tu Bao, 2001). This thesis aims to contribute to Vietnamese language processing by improving tagsets and the detection of tagging errors.

Natural language processing is carried out in five stages:

• Morphological and lexical analysis: the lexicon of a language is its vocabulary, including its words and expressions. Morphology identifies, analyzes, and describes the structure of words, which are the smallest units of syntax. Lexical analysis divides the text into paragraphs, sentences, and words; it cannot be performed in isolation from morphological and syntactic analysis.

• Syntactic analysis: this stage examines the grammatical structure of sentences by analyzing the relationships between words. It transforms sequences of words into structures that show how the words relate to one another, and it rejects sequences that violate the language's rules for combining words.

• Semantic analysis: this stage derives the literal meaning of a sentence and determines its possible meanings in context.

• Discourse integration: the meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.

• Pragmatic analysis: this stage draws on external commonsense knowledge and focuses on the intended use of language in context, covering the aspects of language that require world knowledge. For instance, the phrase "Do you know what time it is?" should be interpreted as a request rather than a mere question.

Our thesis concentrates on the first stage of natural language processing (i.e., morphological analysis). It is a very important preprocessing step for the following stages, such as syntactic analysis and semantic analysis.

Our thesis identifies two major challenges and two minor issues related to tagset evaluation and error detection. The primary concerns are the evaluation of the tagset and the automatic detection of tagging errors, while the secondary issues are assessing the convertibility of the tagset and automatically identifying segmentation errors.

In the previous section, we discussed several tagsets, including VietTreeBank with 17 tags, VnPos with 15 tags, and VnQtag with 59 tags. This variety raises important questions about how effective the different tagsets are, which methods are available for evaluating them, and how to select an appropriate set of POS tags for a specific application. The first part of this thesis aims to address these questions.

The convertibility of tagsets is closely tied to the complexity of the part-of-speech (POS) tagging task. A larger tagset complicates the tagging process, while a smaller tagset may not meet specific requirements. Thus, it is crucial to strike a balance between two goals when selecting a tagset:

• Information quality: the tagset should be as informative as possible (i.e., classify words into more parts of speech based on concrete meaning).

• Possibility of tagging: the tagset should be as easy to apply as possible (i.e., the number of POS tags should be as small as possible).

To address these issues, we conducted experiments on both the source tagset (ST) and the target tagset (TT) to find a balance. This involved counting the ambiguous words that arise during the conversion process, from which we drew our conclusions. In addition, we focused on detecting errors in POS tagging and word segmentation.

POS tagging is challenging because a single word can belong to multiple labels. This ambiguity makes it difficult to create a comprehensive dictionary that assigns each word to its correct label, and correcting tagging errors by hand is time-consuming and costly. Our goal is therefore to develop an automated method for detecting errors in POS tagging, which reduces both time and cost.

Vietnamese word segmentation is also a complex problem, as a single sentence can be segmented in multiple ways. For instance, the phrase "chiếc xe đạp nặng quá" can be divided as either "chiếc/ xe/ đạp/ nặng/ quá" or "chiếc/ xe đạp/ nặng/ quá." Both segmentations are valid, and each conveys a distinct, coherent meaning.
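To make the segmentation ambiguity concrete, the sketch below enumerates every way of grouping the syllables of the example phrase into words. It is only an illustration under stated assumptions: single syllables are always accepted as words, longer groups must appear in a lexicon, and the one-entry lexicon `{"xe đạp"}` is hypothetical. It is not the segmentation method used in the thesis.

```python
from typing import List, Set


def segmentations(syllables: List[str], lexicon: Set[str], max_len: int = 3) -> List[List[str]]:
    """Enumerate all groupings of syllables into words.

    Assumption: a single syllable is always a valid word; a group of two or
    more syllables is valid only if it is listed in the lexicon.
    """
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        candidate = " ".join(syllables[:n])
        if n == 1 or candidate in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([candidate] + rest)
    return results


lexicon = {"xe đạp"}  # hypothetical one-entry lexicon
options = segmentations("chiếc xe đạp nặng quá".split(), lexicon)
for words in options:
    print("/ ".join(words))
```

With this lexicon the function returns exactly the two segmentations quoted in the text; a larger lexicon would generally produce more candidates, which is why segmentation errors need to be detected automatically.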

The last problem addressed in our thesis is word segmentation. Some of the reasons Vietnamese differs from English in this respect are listed in the following table:

Table 3. Principal differences between Vietnamese and English

Feature | Vietnamese | English
Prefix or suffix | No | Yes
Part of speech | No agreement | Defined clearly
Boundary of word | Context-dependent meaningful combination of syllables | Blank or delimiters

All of the reasons above motivated me to look for answers to these problems.

Organization of the thesis

The thesis is organized into four main chapters, with the following content:

Chapter 1 provides a general picture of Vietnamese, including features of the language and its parts of speech. It also discusses the reasons I chose the topic of the thesis.

Chapter 2: Evaluating distributional properties and conversion possibility of tagsets in Vietnamese

In Chapter 2, we look more closely at tagsets, for instance how to build a tagset and how to merge labels, and we introduce the basic notions needed to evaluate the properties of tagsets.

Chapter 3: Automatic error verification of POS-tagged corpus

In this chapter, we introduce the notions related to the error detection method, then present the algorithm and discuss how to classify a variation as an error or an ambiguity.

Chapter 4 addresses three key issues: the contributions of the thesis to theory, the experimental findings, and potential new directions for future research. It summarizes what we have achieved while also highlighting areas that require further exploration.

Tagset evaluation

Evaluating tagsets has been a significant focus for NLP researchers for over two decades. Such evaluation makes it possible to test how modifications to a tagset affect results by applying different versions of the same tagset to the same texts, as in Martin Volk and Gerold Schneider (1998).

Saso Dzeroski, Tomaz Erjavec, and Jakub Zavrel (2000) analyzed the accuracy of designed tagsets by reducing their cardinality, either by omitting specific attributes or by retaining only a few. They computed accuracies using a black-box combiner, as outlined by Halteren and Dzeroski. In the same year, Hervé Déjean introduced two evaluation methods for tagsets: a global evaluation of the initial grammar generated by ALLiS, and a local evaluation based on the reliability of an element, which is determined by its frequency in the structure compared to its total frequency in the corpus. In addition, Madhav Gopal, Diwakar Mishra, and Devi Priyanka Singh (2010) discussed the evaluation of several tagsets for Indian languages, including the ILMT tagset, the JNU-Sanskrit tagset, the LDCIL tagset, and the Sanskrit consortium tagset.

Vietnamese is an isolating language in which word order plays a crucial role in conveying syntactic information. This chapter presents a straightforward method for evaluating Vietnamese tagsets using both internal and external criteria. Internal criteria, such as frequent frames and purity, assess how accurately tags are assigned, while external criteria concern reducing the cardinality of the tagset while retaining the quality of its information. Evaluations have indicated that many tagging errors arise from overly fine distinctions within major categories, as noted by Eugenie Giesbrecht.

A POS is a set of words with some grammatical characteristic(s) in common, and each POS differs in grammatical characteristics from every other POS. For example, nouns have different properties from verbs, which have different properties from adjectives, and so on.

A tagset is a set of POS tags built according to the criteria (see 1.2). Tagsets therefore usually vary in their number of tags and are used in various applications.

Properties of a tagset: a tagset needs to guarantee the following properties: it retains linguistic features, reflects syntactic structure, can be assigned accurately, and reduces the number of ambiguous words during tagging.

2.1.3 A method for evaluating distributional properties of tagsets

The effectiveness of a tagset depends heavily on how accurately its tags are assigned within a corpus, which we assess with an internal criterion. This criterion is built on two notions: frames and purity. A frame defines the relevant local context and indicates which tags are likely to appear within it. The purity formula measures how strongly the tags in that local context converge on a single tag.

Purity is originally an external evaluation criterion for cluster quality, as presented in Stanford's information retrieval text. It is a simple and transparent measure: each cluster is assigned to the class that occurs most frequently in it, and the purity score is the number of correctly assigned documents divided by the total number of documents (N).

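\[
\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert \tag{1}
\]

where \(\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}\) is the set of clusters and \(C = \{c_1, c_2, \ldots, c_J\}\) is the set of classes. In equation (1), we interpret \(\omega_k\) as the set of documents in \(\omega_k\) and \(c_j\) as the set of documents in \(c_j\).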

High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster.
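The purity measure and the remark about singleton clusters can be checked with a short sketch. The function below follows the description of purity given above (majority-class counts summed over clusters, divided by N). The toy clusters are an assumption: their majority counts (5, 4, 3) match those reported for the three-cluster example, but the minority members and the label "d" standing in for the third class are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def purity(clusters: Dict[str, List[str]]) -> float:
    """Sum of majority-class counts over all clusters, divided by N."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        raise ValueError("purity is undefined for an empty clustering")
    majority_sum = sum(
        Counter(members).most_common(1)[0][1]
        for members in clusters.values()
        if members
    )
    return majority_sum / total


# Toy data (assumed): majority counts 5, 4, 3 over 17 items in total.
toy = {
    "cluster 1": ["x"] * 5 + ["o"],
    "cluster 2": ["o"] * 4 + ["x", "d"],
    "cluster 3": ["d"] * 3 + ["x"] * 2,
}

# One cluster per item: purity is trivially 1.
singletons = {f"c{i}": [label] for i, label in enumerate("xxood")}

print(round(purity(toy), 2), purity(singletons))
```

The second call shows why purity alone cannot be used to compare clusterings (or tagsets) of very different sizes: splitting finer always raises it.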

Figure 2. Purity as an external evaluation criterion for cluster quality. The majority class and the number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is the sum of these counts (5 + 4 + 3 = 12) divided by the total number of points N.

The concept of a "frame" was introduced by Mintz (2006) and later redefined by Dickinson and Jochim (2010): a frame consists of three words, in which the two words surrounding a target word help to categorize it. We use frames to evaluate the quality of distributional mappings. For example, the English frame "you_it" predicts a verbal category for the target word, such as "hit," "beat," "eat," or "kiss." Similarly, the Vietnamese frame "mẹ_là" leads to target words in the pronoun category, including "tôi," "anh," "chị," and "bác." To obtain more accurate results, we use the notion of frequent frames, which has been shown to provide category information in child language corpora. Frames are not equally significant within a corpus: the more often a frame occurs, the more linguistic information it yields. We set the frequency threshold at approximately 0.03% of the total number of frames. For instance, in a corpus with 10,000 frames the threshold is 3 (10000 × 0.03%), so a frame that appears more than 3 times is considered a frequent frame.

The purity formula is used to assess how tags are distributed within a single frame, since the share of each tag varies. To determine the purity value, we consider only the highest tag frequency within the frame.

A higher purity value indicates more accurate word tagging. For example, consider two frames: "Tôi_ở" and "mẹ_bảo." The first frame appears four times in a corpus, with the target word tagged Vits once and Vitn three times. The second frame appears eight times, with the target word tagged Np seven times and Pp once. The purity value of each frame is then calculated from its most frequent tag.
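As a worked illustration of the counts just given, and assuming that the purity of a single frame is the share of its most frequent tag (this reading follows the description above; the exact formula used in the thesis is not shown here), the two frames give:

\[
\mathrm{purity}(\text{Tôi\_ở}) = \frac{\max(1, 3)}{4} = 0.75, \qquad
\mathrm{purity}(\text{mẹ\_bảo}) = \frac{\max(7, 1)}{8} = 0.875
\]

If the two frames are pooled in the manner of equation (1), with N taken as the total number of frame occurrences, the combined value is

\[
\frac{3 + 7}{4 + 8} = \frac{10}{12} \approx 0.83
\]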

Linguists typically evaluate a tagset by mapping it to a reduced version, which makes it possible to assess which linguistic features are retained. The reduced tagset is created by merging tags, but determining the best way to merge them is a difficult question.

Hervé Déjean (Universität Tübingen, 2000) explored the concept of a theoretically minimal tagset, emphasizing that the quality of a tagset is not determined by the number of tags it contains. He developed a minimal tagset needed for parsing sentences across various domains, starting from a structure with a single tag for each component (NP-VP). His findings suggest that a tagset of approximately 20 tags is sufficient for parsing sentences into phrase and clause structures.

There are several ways to merge labels, which is why tagsets with different numbers of tags coexist. In English, an inflecting language, mergeable cases are easy to identify, such as conflating base form verbs (VB) with non-third-person singular present tense verbs (VBP); Vietnamese presents more challenges in this regard. In our thesis, we use two types of tagsets.

Firstly, we used tagsets built by preceding NLP researchers, for instance VnQtag and VietTreeBank.

Secondly, we merged labels ourselves based on characteristics of Vietnamese. Because VnQtag has the largest number of tags, we use it as the source from which the other tagsets are created.

To make the theory above concrete, we introduce an algorithm of five steps applied to a tagged corpus:

1 Identify all the words and their POS tags in the corpus, and store them with their positions.

2 Count the frames in the corpus, then calculate the frequency threshold from the total number of frames.

3 Find the frequent frames and compute their purity value.

4 Map the original tagset to the new reduced tagsets.

5 Finally, calculate the new purity value under the new tagsets and count the ambiguous words that are lost.
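The five steps can be sketched in code as follows. This is a minimal illustration rather than the thesis implementation, and it rests on several assumptions that are labeled in the comments: the threshold is applied to the total number of frame occurrences, a frame must occur strictly more often than the threshold, and purity is pooled over all frequent frames. The toy corpus, the placeholder word "w", the context tag "X", and the mapping of Vits and Vitn to a merged tag "V" are all hypothetical. Counting the lost ambiguous words (the second half of step 5) is left out.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str]   # (word, POS tag)
Frame = Tuple[str, str]   # (left word, right word)


def collect_frames(sentences: List[List[Token]]) -> Dict[Frame, Counter]:
    """Steps 1-2: for every frame, count the tags seen on the middle word."""
    frames: Dict[Frame, Counter] = defaultdict(Counter)
    for sent in sentences:
        for i in range(1, len(sent) - 1):
            frames[(sent[i - 1][0], sent[i + 1][0])][sent[i][1]] += 1
    return frames


def frequent_frames(frames: Dict[Frame, Counter], ratio: float = 0.0003) -> Dict[Frame, Counter]:
    """Step 3a: keep frames occurring more often than ratio * total occurrences.

    Assumption: the total counts frame occurrences, and the comparison is strict.
    """
    total = sum(sum(tags.values()) for tags in frames.values())
    threshold = total * ratio
    return {f: tags for f, tags in frames.items() if sum(tags.values()) > threshold}


def purity(frames: Dict[Frame, Counter]) -> float:
    """Step 3b: majority-tag counts summed over frames, divided by all occurrences."""
    total = sum(sum(tags.values()) for tags in frames.values())
    return sum(max(tags.values()) for tags in frames.values()) / total


def map_tags(frames: Dict[Frame, Counter], mapping: Dict[str, str]) -> Dict[Frame, Counter]:
    """Step 4: merge tag counts according to a source-to-target tag mapping."""
    mapped: Dict[Frame, Counter] = {}
    for frame, tags in frames.items():
        merged: Counter = Counter()
        for tag, count in tags.items():
            merged[mapping.get(tag, tag)] += count
        mapped[frame] = merged
    return mapped


def sent(left: str, tag: str, right: str) -> List[Token]:
    """Build a three-word toy sentence; 'w' and 'X' are placeholders."""
    return [(left, "X"), ("w", tag), (right, "X")]


# Toy corpus (assumed) with the tag counts of the two example frames.
corpus = (
    [sent("Tôi", "Vits", "ở")] + [sent("Tôi", "Vitn", "ở")] * 3
    + [sent("mẹ", "Np", "bảo")] * 7 + [sent("mẹ", "Pp", "bảo")]
)
frames = frequent_frames(collect_frames(corpus))
before = purity(frames)
after = purity(map_tags(frames, {"Vits": "V", "Vitn": "V"}))  # step 5
print(round(before, 3), round(after, 3))
```

Merging tags can only keep purity the same or raise it, so the interesting question in practice is how much it rises relative to the information given up by the reduced tagset.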

We applied this method to the corpus annotated with the VnQtag tagset.

The experiments were conducted using the VnQtag corpus, which consists of four annotated documents. We merged several tags from VnQtag to create new tagsets, resulting in VietTreeBank with 18 tags, basic tagset 2 with 8 tags, tagset 3 with 25 tags, and tagset 4 with 40 tags, as detailed in the appendix. Our merging process was guided by the book "Ngữ pháp tiếng Việt" by Diệp Quang Ban, which organizes the Vietnamese POS system into two distinct groups.

• Group 2: adjuncts (determiners, adverbs) and conjunctions

Nouns are divided into two primary types: proper nouns and common nouns. Common nouns are further divided into synthetic and non-synthetic nouns, which are then classified as countable or uncountable. This detailed classification helps in understanding the nuances of noun usage.

To obtain the 25-POS and 30-POS tagsets, we combined various noun and verb tags, since nouns and verbs are the basic categories with the highest word counts in Vietnamese. The VnQtag tagset divides nouns into eight specific tags and verbs into ten detailed tags. Our analysis used four documents annotated with VnQtag and four tagsets, leading to the results presented in Tables 4 and 5.

Table 4. Some frames found in the corpus

Frame (frequency) | POS (frequency)
mẹ_là (4) | Pp (4)
Tôi_ở (4) | Vits (1)
Tôi_nông dân (3) | Vla (3)
nhà_ở (3) | Np (1), No (2)
Còn_sinh (3) | Pp (3)
ba_Phúc (2) | Nh (2)
với_đứa (2) | Nn (2)
sinh_nông thôn (3) | Cm (3)
Con_nhỏ (2) | No (2)
dăm_trẻ (2) | Nu (2)
đứa_dâng (2) | Nh (2)
trẻ_đào (2) | Vta (2)
có_người (3) | Aa (2)
 | Np (7)
đây_lần (2) | Vla (2)
là_đầu (2) |

Possibility of tagset convertibility

The existence of several tagsets for the same language gives linguists a choice of tagging options. In English, notable tagsets include the Brown tagset (1967) with 87 tags, the Susanne tagset (1987) with 353 word tags, the Penn Treebank tagset (1991) with 36 tags, and the IBM Lancaster tagset (1993) with 132 tags. Researchers have explored the relationships between these tagsets to inform their specific applications.

For Vietnamese, three primary tagsets are in use: VnQtag with 59 tags, VnPos with 15 tags, and VietTreeBank with 18 tags. Some researchers advocate a minimal tagset, pointing out that a smaller tagset makes tagging easier and cheaper. This study explores the conversion from a large tagset to a smaller one. Some words lose tag distinctions in this transition, but the impact can be mitigated by incorporating contextual or syntactic information.

Interset, with its tagset drivers, was used in 2008 to convert a source tagset into a target one. Bartosz Zaborowski and Adam Przepiórkowski (2012) used a set of rules for converting particular tags.

In our thesis, we examine how well a large tagset can be converted into a smaller one while keeping the number of ambiguous words low. Ambiguous words here are those that carried several distinct tags in the source tagset and lose that distinction when the tags are merged into a single coarser tag in the target tagset. Our research shows that this conversion is both feasible and efficient. The method consists of the following steps:

• Identify the tagsets we want to check.

• Identify the annotated corpus as well as the tagger.

• Calculate the number of words belonging to each POS tag of the tagset.

• Calculate the number of ambiguous tokens in the corpus that arise when the large tagset is converted into the smaller one, that is, when several tags of the larger set merge into one tag of the smaller set.

• Calculate the number of ambiguous word types in the corpus.

• Compute the percentage of ambiguous tokens and word types.
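The counting steps above can be sketched as follows. This is a hedged illustration, not the thesis code: it assumes that a word type is ambiguous when it occurs with two or more source tags that map to the same target tag, and that an ambiguous token is any occurrence of such a word under one of those merged tags. The function name `ambiguity_stats`, the toy token list, and the tag mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def ambiguity_stats(tokens: List[Tuple[str, str]], mapping: Dict[str, str]) -> Dict[str, float]:
    """Count ambiguous word types and tokens under a source-to-target tag mapping."""
    # (word, target tag) -> set of source tags observed for that word
    merged = defaultdict(set)
    for word, tag in tokens:
        merged[(word, mapping.get(tag, tag))].add(tag)

    # Assumed definition: several source tags collapsing into one target tag.
    ambiguous_keys = {key for key, tags in merged.items() if len(tags) > 1}
    ambiguous_types = {word for word, _ in ambiguous_keys}
    all_types = {word for word, _ in tokens}
    ambiguous_tokens = sum(
        1 for word, tag in tokens if (word, mapping.get(tag, tag)) in ambiguous_keys
    )
    return {
        "ambiguous_types": len(ambiguous_types),
        "type_pct": 100 * len(ambiguous_types) / len(all_types),
        "ambiguous_tokens": ambiguous_tokens,
        "token_pct": 100 * ambiguous_tokens / len(tokens),
    }


# Hypothetical mapping from fine-grained tags to coarse tags.
mapping = {"Vitm": "V", "Vtm": "V", "Vto": "V", "Aa": "A", "An": "A", "Na": "N"}

# Toy token list (assumed): "ra" and "bé" each carry several source tags.
tokens = [
    ("ra", "Vitm"), ("ra", "Vtm"), ("ra", "Vto"),
    ("bé", "Aa"), ("bé", "An"),
    ("sách", "Na"), ("sách", "Na"),
]
stats = ambiguity_stats(tokens, mapping)
print(stats)
```

Reporting both word types and tokens matters because a handful of very frequent ambiguous words can account for a large share of tokens even when few types are affected.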

This method takes two tagsets, VnQtag and VietTreeBank, as input. We used Qtag (a probabilistic tagger) and Vn Tagger to tag a folder of seven documents, "Hoàng tử bé," "Chuyện tình," "Lược sử thời gian," "Những bài học nông thôn," "Chiến tranh cục bộ," "Muối của rừng," and "An Dương Vương," with each tagset in turn.

We then compared the outputs to reach our final conclusion.

Table 6. Some properties of the tagset convertibility method on Hoangtube

In the table above, the word count refers to word types rather than tokens, meaning that each unique word is counted only once. The experiment was conducted on a single document (hoangtube), which results in a relatively small percentage of ambiguous words. Although the total number of ambiguous words can be significant, the table includes only a selection of examples rather than an exhaustive list.

Table 7. Statistics of ambiguous word types in the VnQtag corpus

POS Ambiguity Total of word types Percentage

The V3 tag is merged from the following tags: Vo, Vs, Vb, Va, Vc, V la, Vm, Vim, Vla,

V1 tag: Vitb, Vits, Vitc, Vitm, Vitim

V2 tag: Vta, Vtb, Vtc, Vtv, Vtim, Vto, Vts, Vtm, Vtv

Table 8. Statistics of ambiguous tokens in the VnQtag corpus

POS Ambiguity Total of tokens Percentage

Table 9. Detailed statistics of ambiguous word types in the VnQtag corpus

Number of ambiguous word types

A Aa/An 12 bé, cao, con, cái, gần, hun hút, lớn, nhè nhẹ, nhỏ, sâu, ít, đầy

Na/Np 6 Nguyên Đán, chúa, elip, thuyết tương đối, thuỷ, đường

Na/Nt 15 ban, cuộc đời, công nguyên, khoảnh khắc, một khi, phép, sớm, thuở, thế kỷ, thời bình, thời gian, thời kỳ, tuổi thơ, tương lai, tết

Nt/Nu 13 buổi, bông, bọn, bữa, canh, con, cuộc, cái, kỳ, lát, lần, mồng, phiên

[source tags lost] 33 [word list garbled in extraction; recoverable items include bài học, chư hầu, chứng cứ, công việc, cảnh vật, gia đình, giải phóng, huyện đội, quân đội, đường nét, màu sắc, dân chúng, hội nghị, luận chứng, nhà nước, lỗi lầm, lời nói]

Nl/Nu 3 bên, khu, nơi

Na/Nu 41 bước, bộ, chừng, câu, cõi, cú, cấp, dòng, gợn, hàng, hình, hương, khối, loại, màu, món, mảng, mối, niềm, nước, nền, nụ, phần, thằng, tiếng, tiền, trận, tên, tụi, vì, vòng, vòng tròn, vị, vở, vụ, điều, điệu, đàn, đám, đơn vị, đạo

[source tags and count lost] [word list garbled in extraction; recoverable items include gần đây, buổi chiều, hôm nay, năm ngoái, trước hết, quá khứ]

Na/Nl 19 [word list garbled in extraction; recoverable items include bề mặt, chân trời, căn cứ, quân khu, phạm vi, nguồn gốc, phương hướng, trời đất, vương quốc]

Na/Nl/Nu 3 chân, khoảng cách, nguồn

Ng/Nu 6 chốc lát, cậu, cỗ, em, tập hợp, đội

Na/Nn 4 con số, hai, số, tí

Nn/Nu 6 các, lũ, ngôi, rưỡi, từng, độ

Ng/Nn 4 cặp, dăm ba, năm tháng, toàn bộ

Ng/Nt/Nu 1 giây phút

Na/Ng/Nt/Nu 1 giấc

Nl/Nt/Nu 4 giờ, hồi, khoảng, lúc

Na/Ng/Nu 8 kích thước, loài, lớp, sự, thứ, việc, đoàn, đại đội

Na/Nt/Nu 3 lượt, lứa, nỗi

Na/Ng/Nl 1 thế giới

Na/Nl/Nt 2 trước tiên, điểm

Pi/Pp 2 ai, những ai

Pd/Pi 2 bao giờ, đó

Jr/Jt 2 bỗng dưng, bỗng nhiên

Jd/Jr 3 còn, cứ, luôn

Vitm/Vtm 16 bắn, co, leo, luồn, lùi, mỉm, ngược, rúc, rẽ, thót, vòng, văng, vật, xuyên, đuổi, động

Vits/Vts 4 bắt đầu, cách, lả, phí

Vitm/Vits 18 chan hoà, chơi, cuồn cuộn, dậy, dồn, hé, loà xoà, mở, ngồi, nằm, quỳ, rời, sập, thấm, vượt, úp, đổ, đứng

Vta/Vtv 3 chấp nhận, chịu, phải

Vitm/Vta 3 du nhập, dâng, dẫn

Vitm/Vto 4 lên, lại, qua, vào

Vitm/Vtm/Vto 6 ra, sang, về, xuống, đi, đến

V1 Vitm/Vits 18 chan hoà, chơi, cuồn cuộn, dậy, dồn, hé, loà xoà, mở, ngồi, nằm, quỳ, rời, sập, thấm, vượt, úp, đổ, đứng

Vta/Vtv 3 chấp nhận, chịu, phải

Vtm/Vto 6 ra, sang, về, xuống, đi, đến

Table 8 reveals a significant level of ambiguity, with many percentages exceeding 15%. This indicates the challenges in converting the original tagset into the subsequent tagsets V, V1, V2, and V3. However, a deeper analysis uncovers distinct patterns within the parts of speech (POS). For instance, in the verb category (V), the (Vitm/Vtm) group contains 16 ambiguous words and the (Vitm/Vits) group contains 18, in stark contrast to the (Vits/Vtc) and (Vitm/Vtc) groups, which each have only one ambiguous word. Additionally, Table 9 illustrates that certain noun groups, such as (Na/Ng/Nu) with 8 ambiguous words, contain more ambiguous words than others, such as (Nn/Nu/Nt) with just 4.

Concepts related to the variation n-gram method

The method relies on three key notions: n-gram, variation, and variation n-gram.

N-gram: An n-gram is a contiguous sequence of n items from a given sequence of text or speech. In computational linguistics, the items in question are typically words. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on. For example, the sentence "I have just gone out" yields the following n-grams:

Unigram: I; have; just; gone; out

Bigram: I have; have just; just gone; gone out

Trigram: I have just; have just gone; just gone out

Four-gram: I have just gone; have just gone out

Five-gram: I have just gone out
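As a minimal sketch of the definition above, the n-grams of the example sentence can be enumerated with a few lines of Python. The helper name `ngrams` and the whitespace tokenization are our own assumptions for illustration; a real corpus would use its own word segmentation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples of words."""
    # There are len(tokens) - n + 1 start positions; none if n is too large.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: simple whitespace tokenization of the example sentence.
sentence = "I have just gone out".split()

for n in range(1, len(sentence) + 1):
    print(n, [" ".join(gram) for gram in ngrams(sentence, n)])
```

A sentence of five words has five unigrams, four bigrams, three trigrams, two four-grams, and one five-gram.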

Variation: A word that occurs more than once in a corpus can be assigned different tags in its different occurrences. We refer to this as variation (Markus Dickinson, 2005).

In the file Hoangtube.txt, for example, the word "sau" shows variation within an 11-gram local context: it is tagged both as a locative noun (Nl) and as a classifier noun (Nt), as in "sau (Nl) một lát im lặng, em lại nói" and "sau (Nt) một lát im lặng, em lại nói".

Markus Dickinson identifies two primary causes of variation in corpus annotation: ambiguity and error. Ambiguity arises when a word can legitimately take several lexical tags, so that different instances of the word are annotated with different tags. An error, in contrast, is present when the tagging of a word is inconsistent across comparable occurrences in the corpus.

Variation n-grams are central to our thesis, since they are the key device for detecting errors in corpus annotation. When an n-gram occurs more than once in the corpus and contains a word that is annotated differently in different occurrences, we call it a variation n-gram. The word responsible for the variation is called the variation nucleus.

For example: sau (Nl) một lát im lặng, em lại nói: -

Types of Vietnamese tagging error

An error is simply a case in which the POS label of a word is inconsistent with the context in which the word occurs.

To enhance the performance of algorithms, it is crucial to identify existing errors within the results. By determining the number of mistakes present, we can work towards correcting them effectively. Kübler and Wagner (2000) categorized tagging errors into four distinct types.

Ambiguity: The word can take more than one tag, which leads to confusion in its classification. For instance, the word "light" can function as a noun when referring to something that emits brightness, and as an adjective when denoting something that is not heavy. Such errors arise in the tagging process and require careful analysis to identify and rectify.

No ambiguity: The wrong tag is not a possible tag for the word.

In Vietnamese grammar, there are eight primary categories of parts of speech: noun, adjective, verb, adjunct, pronoun, conjunction, introductory word, and emotive word. Four of them are major parts of speech (noun, adjective, adverb, and verb), and each can be further divided into more specific subcategories. For instance, nouns can be subdivided into eight types: proper nouns, countable nouns, collective nouns, classifier nouns, concrete nouns, abstract nouns, numerals, and locative nouns. Errors often arise when the broader part-of-speech category is identified correctly but the finer distinction is missed. For example, "tờ" is a noun, but more specifically it is a classifier noun.

The "No major category" error arises when a word is misclassified in its primary part-of-speech category. For instance, in the phrase "Phạm Huỳnh Tam lang-ký ức một thời vang bóng Kỳ 10" from the VietTreeBank corpus, the word "vang" is incorrectly labeled both as a noun (N) and as a verb (V).

Using the variation n-gram algorithm, we try to find errors that belong to one of these four error types.

An algorithm for detecting errors

Having introduced the fundamental concepts, we now describe the algorithm for error detection. It builds on the observation that every variation n-gram must contain a variation (n-1)-gram. Consequently, we start with n=1 and proceed to the longest n, as outlined by Dickinson (2003).

Step 1: Compute all variation unigrams and store them together with their positions.

Step 2: Based on the positions of the variation n-grams last stored, extend each n-gram by one word to either side (unless the corpus ends there). For each resulting (n+1)-gram, test whether it has another instance in the corpus. If it does, and the different occurrences of the (n+1)-gram are tagged differently, store it as a variation (n+1)-gram. Repeat this step until no further variation n-grams are found.
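The two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the corpus is assumed to be a flat list of (word, tag) pairs, the function name `variation_ngrams` and the toy sentences are ours, and an n-gram counts as a variation n-gram as soon as its occurrences carry more than one distinct tag sequence.

```python
from collections import defaultdict


def variation_ngrams(corpus):
    """Map each n to its variation n-grams.

    corpus: list of (word, tag) pairs (assumed input format).
    Returns {n: {word_ngram: {tag_ngram: [start positions]}}}.
    """
    words = [w for w, _ in corpus]
    tags = [t for _, t in corpus]

    def keep_variation(starts, n):
        # Group the candidate n-grams by word sequence, then by tag sequence.
        groups = defaultdict(lambda: defaultdict(list))
        for s in sorted(starts):
            groups[tuple(words[s:s + n])][tuple(tags[s:s + n])].append(s)
        # A variation n-gram has more than one distinct tag sequence.
        return {w: dict(t) for w, t in groups.items() if len(t) > 1}

    results = {}
    n = 1
    current = keep_variation(range(len(words)), 1)   # Step 1: variation unigrams
    while current:
        results[n] = current
        starts = set()
        for by_tags in current.values():             # Step 2: extend stored n-grams
            for positions in by_tags.values():
                for s in positions:
                    if s > 0:
                        starts.add(s - 1)            # one word to the left
                    if s + n < len(words):
                        starts.add(s)                # one word to the right
        n += 1
        current = keep_variation(starts, n)
    return results


# Toy corpus (hypothetical tags): "sau" is tagged Nl once and Nt once.
corpus = [
    ("sau", "Nl"), ("một", "Nn"), ("lát", "Nt"), (",", ","),
    ("em", "Pp"), ("nói", "Vt"), (".", "."),
    ("sau", "Nt"), ("một", "Nn"), ("lát", "Nt"), (",", ","),
    ("anh", "Pp"), ("nói", "Vt"), (".", "."),
]
result = variation_ngrams(corpus)
print(max(result), sorted(result[max(result)]))
```

In the toy corpus the only variation unigram is "sau", and the longest variation n-gram is the four-gram "sau một lát ,", because the fifth word differs between the two sentences. Only (n+1)-grams that extend a stored variation n-gram are examined, which is what keeps the search tractable on a large corpus.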

Classifying variations

After applying the algorithm to identify all variation n-grams and variation nuclei in the corpus, we must distinguish errors from ambiguities. To do so, we rely on the following observations.

The first observation is that the longer the identical context surrounding a variation, the more likely the variation is an error. Vietnamese is an isolating language in which grammatical meaning is conveyed predominantly through word order, so context is crucial for effective error detection. The following example illustrates this.

The example comes from a passage that recurs in the corpus and contains the headlines "Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ?" and "Đêm trắng" ("White Night").

In this passage, “về” is tagged once as an adverb (R) and twice as a preposition (E).

Moreover, the identical context spans 8 words to the left of the variation nucleus and 9 words to the right (including punctuation). Such a variation turns out to be an error.

A pertinent question is how long a variation n-gram must be to warrant review. We begin our analysis with n=5, though this figure is not fixed and can be adjusted. Nevertheless, the results obtained with this setting are quite reasonable.

Another observation concerns structural boundaries: if a variation nucleus occurs within a complete sentence, it is likely to be an error.

We also look more closely at the relationship between the context and the position of the variation nucleus. When the variation nucleus lies on the fringe of the context, that is, at the beginning or the end of the n-gram, the variation is more likely to be a genuine ambiguity than an error.
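The context-length and fringe observations above can be combined into a simple filter. The sketch below is only an illustration: the function names, the input format (the tag sequences observed for one word n-gram), and the default threshold `min_n=5` are our assumptions, and the sentence-boundary observation is not modelled.

```python
def nucleus_positions(tag_sequences):
    """Indices at which the tag sequences of one word n-gram disagree."""
    return [i for i, column in enumerate(zip(*tag_sequences))
            if len(set(column)) > 1]


def likely_errors(tag_sequences, min_n=5):
    """Nucleus indices that sit in a long enough context and not on its fringe.

    min_n is an assumed, adjustable threshold on the n-gram length.
    """
    n = len(tag_sequences[0])
    if n < min_n:
        return []                       # context too short to trust
    return [i for i in nucleus_positions(tag_sequences) if 0 < i < n - 1]


# Hypothetical taggings of "trách_nhiệm thuộc về ai ?": "về" is R once, E once.
inner = [("N", "V", "R", "P", "?"), ("N", "V", "E", "P", "?")]
# Hypothetical taggings where the first word of the context varies (fringe).
fringe = [("Nl", "Nn", "Nt", ",", "Pp"), ("Nt", "Nn", "Nt", ",", "Pp")]

print(likely_errors(inner))    # nucleus inside the context
print(likely_errors(fringe))   # nucleus on the fringe is filtered out
```

Raising `min_n` trades recall for precision: fewer variation nuclei are flagged, but those that remain are surrounded by more identical context.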

Result of detecting errors in POS tagging

In our analysis of variation nuclei, we focus solely on the longest context derived from each variation. That is, for each variation nucleus we count only its longest context and disregard the shorter contexts that stem from the same nucleus.

Figure 3 N-gram and variation nuclei in VTB corpus with n up to 29

By applying the variation n-gram method to the VTB corpus, we obtained significant results, illustrated in Figure 3 and detailed in Table 12. The longest n reached 29, and a total of 11,428 variation nuclei were identified. Notably, the number of unigram variation nuclei (n=1) was significantly lower than that of bigrams, with counts of 1,691 and 5,515, respectively. This discrepancy arises from the fact that both words in a bigram are considered variations.

Table 10 Error statistics in the corpus

N-gram Variation nuclei Errors N-gram Variation nuclei Errors

We think highly of the idea of using variation n-grams to detect errors. As Table 10 shows, we identified a total of 67 errors, accounting for 0.107%. According to the theory, variation nuclei that appear on the fringe of the context are not classified as errors, so we exclude such instances from our results. Some rows show zero errors because their nuclei are already included in longer contexts. We counted errors starting from n=6, but a lower value of n may also be applicable, so the total number of errors could vary.

Table 11 Detailed n-grams in the tagged corpus

N- gram Context Nuclei Label Line File

, một lượt vé máy_bay Lượt NC 42 105055.seg.pos

Sau ba năm làm_việc ở Ở E 42 109898.seg.pos

Bộ Kế_hoạch - đầu_tư cấp phép Kế_hoạch NP 28 86375.seg.pos

Bộ Lao_động - thương_binh & xã_hội Lao_động

! - Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? Về

- Kỳ 1 : Kỳ 2 : Lênh_đênh chìm_nổi đời Lênh_đênh A 79 104884.seg.pos

Chào_mừng Đại_hội thi_đua Đại_hội Np 2 82711.seg.pos

- Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? - Về

- ký_ức một thời vang bóng Kỳ 10 : “ vang

Bữa nào khao nặng phải mất hai cục ( Cục N 21 104395.seg.pos

1 : Kỳ 2 : Lênh_đênh chìm_nổi đời người Kỳ Lênh_đênh A 79 104884.seg.pos

20 triệu hả ? ” Tuấn nhả khói thuốc hả I 19 104395.seg.pos

: Gặp lại “ kỳ_quan bóng_bàn thế_giới ”

6 : Danh_thủ Trương_Tấn_Nghĩa : tài_năng và đào_hoa ! Kỳ

Cái kết buồn của VĐV VN xuất_sắc nhất thế_kỷ 20 Kết

Ngã rẽ của ông Weigang và số_phận chiếc cúp vô_địch vô_địch

Phạm_Huỳnh_Tam_Lang - ký_ức một thời vang bóng Kỳ 10 : vang

_Văn_Hòa và chữ_ký giải nợ Chà_và Kỳ

3 : Gặp lại “ kỳ_quan bóng_bàn thế_giới Gặp

Phạm_Văn_Rạng Kỳ 6 : Danh_thủ

Trương_Tấn_Nghĩa : tài_năng và đào_hoa ! Kỳ 7 : Danh_thủ

: “ Có đêm ông tiêu hết 20 triệu hả ? ”

Tuấn nhả khói thuốc lạnh_lùng bảo Hả

Weigang và số_phận chiếc cúp vô_địch

Kỳ 9 : Phạm_Huỳnh_Tam_Lang - ký_ức một thời vang bóng Kỳ 10 : “ Nữ_hoàng

” không ngai môn bóng nhựa Kỳ 11 vang

V 57 108804.seg.pos bắt , lắc vẫn cứ lắc ! - Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? - - Đêm trắng theo “ thiếu_gia ” đi thác_loạn -

E 66 84143.seg.pos thế_giới ” Kỳ 5 : “ Lưỡng_thủ vạn_năng

” Phạm_Văn_Rạng Kỳ 6 : Danh_thủ

Trương_Tấn_Nghĩa : tài_năng và đào_hoa ! Kỳ 7 : Danh_thủ Thể_Công làm HLV trên đất

Mai_Văn_Hòa và chữ_ký giải nợ

Chà_và Kỳ 3 : Gặp lại “ kỳ_quan bóng_bàn thế_giới ” Kỳ 4 : Gặp lại “ kỳ_quan bóng_bàn

The Vietnamese treebank tagset

The tagset contains 59 part-of-speech tags, which are distributed over 7 classes, and 10 tags for punctuation and symbols.

Id POS English Vietnamese Id POS English Vietnamese

Countable noun Danh từ đơn thể 2 Vitc Comparative intransitive verb Động từ nội động so sánh

3 Np Proper noun Danh từ riêng 4 Vitm Moving intransitive verb Động từ nội động chuyển động

5 Ng Collective noun Danh từ tổng thể 6 Pd Time and space pronoun Đại từ không gian, thời gian

7 Nt Classifier noun Danh từ loại thể 8 Pn Quantity pronoun Đại từ số lượng

9 Nu Concrete noun Danh từ đơn vị 10 Pi Interrogative pronoun Đại từ nghi vấn

11 Na Abstract Noun Danh từ trừu tượng 12 Pp Personal pronoun Đại từ xưng hô

13 Nn Numeral Danh từ số lượng 14 Pa Quality pronoun Đại từ hoạt động, tính chất

15 Nl Locative noun Danh từ vị trí 16 An Quantity adjective Tính từ hàm lượng

17 Vt Transitive verb Động từ ngoại động 18 Aa Quality adjective Tính từ hàm chất

19 Vit Intransitive verb Động từ nội động 20 Jt Time adverb Phụ từ thời gian

21 Vim Impression verb Động từ cảm nghĩ 22 Jd Degree adverb Phụ từ mức độ

23 Vo Orientation verb Động từ phương hướng 24 Jr Comparative adverb Phụ từ so sánh

25 Vs State verb Động từ tồn tại 26 Ja Negation or acceptation adverb

27 Vb Transformation verb Động từ biến hoá 28 Ji Imperative adverb Phụ từ mệnh lệnh

29 Va Acceptation verb Động từ tiếp thụ 30 Cm Major/minor conjunction Liên từ chính phụ

31 Vc Comparative verb Động từ so sánh 32 Cc Combination conjunction Liên từ liên hợp

33 Vla Verb 'là' Động từ 'là' 34 I Introductory word Trợ từ

35 Vm Moving verb Động từ chuyển động 36 E Emotivity word Cảm từ

37 Vv Volitive verb Động từ ý chí 38 X Unknown/Uncertain Không xác định

39 Vtim Impression transitive verb Động từ ngoại động cảm nghĩ 40 # Pound sign Dấu thăng

41 Vta Acceptation transitive verb Động từ ngoại động tiếp thụ 42 $ Dollar sign Dấu đô-la

43 Vtb Transformation transitive verb Động từ ngoại động biến hóa 44 Sentence-final punctuation Dấu chấm hết câu

45 Vtc Comparative transitive verb Động từ ngoại động so sánh 46 , Comma Dấu phẩy

47 Vto Orientation transitive verb Động từ ngoại động chỉ hướng 48 : Colon Dấu hai chấm

49 Vts State transitive verb Động từ ngoại động tồn tại 50 ; Semi-colon Dấu chấm phảy

51 Vtm Moving transitive verb Động từ ngoại động chuyển động 52 ( Left bracket character Dấu mở ngoặc đơn trái

53 Vtv Volitive transitive verb Động từ ngoại động chỉ ý chí 54 ) Right bracket character Dấu đóng ngoặc đơn phải

55 Vitim Impression intransitive verb Động từ nội động cảm nghĩ 56 ' Single quote Dấu nháy đơn

57 Vitb Transformation intransitive verb Động từ nội động biến hóa 58 " Double quote Dấu nháy kép

59 Vits State intransitive verb Động từ nội động tồn tại 60.

Vietnamese Tagset (VietTreeBank)

1 Np Proper noun Danh từ riêng

2 Nc Classifier Danh từ chỉ loại

3 Nu Unit noun Danh từ đơn vị

4 N Noun other Danh từ khác

8 L Determiner (e.g., một, những, các) Định từ

11 E Preposition Giới từ (Liên kết chính phụ)

12 C Conjunction Liên kết từ (Liên kết đẳng lập)

14 T Particle Trợ từ, tình thái từ (tiểu từ )

15 U Bound morpheme Từ tiếng nước ngoài

17 X Unknown Các từ không phân loại được

18 Symbol Symbol Các ký hiệu đặc biệt khác (? / # $)

Tagset 3 (25tags)

Countable noun Abstract noun Collective noun

Danh từ đơn thể Danh từ tổng thể Danh từ trừu tượng

3 Np Proper noun Danh từ riêng 4 Aa Quality adjective Tính từ hàm chất

5 Nt Classifier noun Danh từ loại thể 6 Jt Time adverb Phụ từ thời gian

Danh từ đơn vị Danh từ số lượng

8 Jd Degree adverb Phụ từ mức độ

9 Nl Locative noun Danh từ vị trí 10 Jr Comparative adverb Phụ từ so sánh

Vt/Vt im/V ta/Vt b/Vtc

Vt Động từ nội động 12 Ja

Vit Động từ nội động 14 Ji Imperative adverb Phụ từ mệnh lệnh

Vr Động từ còn lại

16 Cm Major/minor conjunction Liên từ chính phụ

17 Cc Combination conjunction Liên từ liên hợp

18 Pd Time and space pronoun Đại từ không gian, thời gian

20 Pn Quantity pronoun Đại từ số lượng 21 E Emotivity word Cảm từ

22 Pi Interrogative pronoun Đại từ nghi vấn 23 X Unknown/Uncertain Không xác định

24 Pp Personal pronoun Đại từ xưng hô 25 Pa Quality pronoun Đại từ hoạt động, tính chất

Countable noun Danh từ đơn thể 2 Vitb Transformation intransitive verb Động từ nội động biến hóa

Pronoun noun Numeral Classifier noun

Danh từ loại thể Danh từ số lượng

4 Vits State intransitive verb Động từ nội động tồn tại

5 Ng Collective noun Danh từ tổng thể 6 Vitc Comparative intransitive verb Động từ nội động so sánh

7 Nu Concrete noun Danh từ đơn vị 8 Vitm Moving intransitive verb Động từ nội động chuyển động

9 Na Abstract Noun Danh từ trừu tượng 10 Pd Time and space pronoun Đại từ không gian, thời gian

11 Nl Locative noun Danh từ vị trí 12 Pi Interrogative pronoun Đại từ nghi vấn

Intransitive verb/Comparative verb/Verb 'là'/Volitive verb/

Acceptation transitive verb Động từ ngoại động /Động từ nội động /Động từ so sánh /Động từ 'là' /Động từ ý chí /Động từ ngoại động tiếp thụ

14 Pp Personal pronoun Đại từ xưng hô

15 Vim Impression verb Động từ cảm nghĩ 16 Pa Quality pronoun Đại từ hoạt động, tính chất

17 Vo Orientation verb Động từ phương hướng 18 An Quantity adjective Tính từ hàm lượng

19 Vs State verb Động từ tồn tại 20 Aa Quality adjective Tính từ hàm chất

21 Vb Transformation verb Động từ biến hoá 22 Jt Time adverb Phụ từ thời gian

23 Va Acceptation verb Động từ tiếp thụ 24 Jd Degree adverb Phụ từ mức độ

25 Vm Moving verb Động từ chuyển động 26 Jr Comparative adverb Phụ từ so sánh

27 Vtim Impression transitive verb Động từ ngoại động cảm nghĩ 28 Ja Negation or acceptation adverb

29 Vtb Transformation transitive verb Động từ ngoại động biến hóa 30 Ji Imperative adverb Phụ từ mệnh lệnh

31 Vtc Comparative transitive verb Động từ ngoại động so sánh 32 Cm Major/minor conjunction Liên từ chính phụ

33 Vto Orientation transitive verb Động từ ngoại động chỉ hướng 34 Cc Combination conjunction Liên từ liên hợp

35 Vts State transitive verb Động từ ngoại động tồn tại 36 I Introductory word Trợ từ

37 Vtm Moving transitive verb Động từ ngoại động chuyển động 38 E Emotivity word Cảm từ

39 Vtv Volitive transitive verb Động từ ngoại động chỉ ý chí 40 Vitim Impression intransitive verb Động từ nội động cảm nghĩ

A5 Syntax function tags in VTB

1 H The head element of phrase

3 DOB Direct object function label

4 IOB Indirect object function label

6 PRD Predicate function label not verb phrase

7 LGS Logic subject function label of passive voice sentence

8 EXT Complement function label that expresses the range or frequency of an action

9 VOC Complain component function label

A6 Adverbial classification tag of verb in VTB

1 TMP Adverbial function label expresses time

2 LOC Adverbial function label expresses location

3 DIR Adverbial function label expresses direction

4 MNR Adverbial function label expresses manner

5 PRP Adverbial function label expresses purpose or reason

6 CND Adverbial function label expresses condition

7 CNC Adverbial function label expresses concession

8 ADV Adverb function label (the rest of the situations)

Title	Tagset Evaluation and Automatical Error Verification in Pos Tagged Corpus
Author	Thi-Thanh-Tam Do
Supervisor	Dr. Nguyen Phuong Thai
Institution	Vietnam National University Hanoi University of Engineering and Technology
Major	Computer Science
Type	Master Thesis
Year of publication	2012
City	Ha Noi
