
Developing a validity argument for the English placement test at BTEC International College, Danang Campus


Structure

  • CHAPTER 1 INTRODUCTION
    • 1.1. INTRODUCTION TO TEST VALIDITY
    • 1.2. THE STUDY
    • 1.3. SIGNIFICANCE OF THE STUDY
  • CHAPTER 2 LITERATURE REVIEW
    • 2.1. STUDIES ON VALIDITY DISCUSSION
      • 2.1.1. The conception of validity in language testing and assessment
      • 2.1.2. Using interpretative argument in examining validity in language testing
      • 2.1.3. The argument-based validation approach in practice so far
      • 2.1.4. English placement test (EPT) in language testing and assessment
      • 2.1.5. Validation of an EPT
      • 2.1.6. Testing and assessment of writing in a second language
    • 2.2. GENERALIZABILITY THEORY (G-THEORY)
      • 2.2.1. Generalizability and Multifaceted Measurement Error
      • 2.2.2. Sources of variability in a one-facet design
    • 2.3. SUMMARY
  • CHAPTER 3 METHODOLOGY
    • 3.1. RESEARCH DESIGN
    • 3.2. PARTICIPANTS
      • 3.2.1. Test takers
      • 3.2.2. Raters
    • 3.3. MATERIALS
      • 3.3.1. The English Placement Writing Test (EPT W) and the Task Types
      • 3.3.2. Rating scales
    • 3.4. PROCEDURES
      • 3.4.1. Rater training
      • 3.4.2. Rating
      • 3.4.3. Data analysis
    • 3.5. DATA ANALYSIS
      • 3.5.1. To what extent is test score variance attributed to variability in the following: a. task?; b. rater?
      • 3.5.2. How many raters and tasks are needed to obtain the test score dependability of at least 0.85?
      • 3.5.3. What are vocabulary distributions across proficiency levels of academic writing?
  • CHAPTER 4 RESULTS
    • 4.1. RESULTS FOR RESEARCH QUESTION 1
    • 4.2. RESULTS FOR RESEARCH QUESTION 2
    • 4.3. RESULTS FOR RESEARCH QUESTION 3
  • CHAPTER 5 DISCUSSION AND CONCLUSIONS
    • 5.1. GENERALIZATION INFERENCE
    • 5.2. EXPLANATION INFERENCE
    • 5.3. SUMMARY AND IMPLICATIONS OF THE STUDY
    • 5.4. LIMITATIONS OF THE STUDY AND SUGGESTIONS FOR FUTURE RESEARCH

Content

INTRODUCTION

INTRODUCTION TO TEST VALIDITY

Language tests are essential for assessing students' English proficiency in college settings, with entrance or placement tests being among the most common. At BTEC International College, the placement test serves as a vital tool for determining students' readiness for collegiate English courses, thereby influencing the academic journey of students, administrators, and instructors alike. Test scores enable students to gauge their preparedness, assist administrators in appropriately placing students in language courses, and provide instructors with valuable insights for lesson planning. Furthermore, students recognize the significance of their language skills for academic success, motivating them to focus on improving their language proficiency.

This study emphasizes the significance of test validity in entrance examinations. Test validity refers to how accurately a test measures what it is intended to measure, and it encompasses the interpretations of test scores based on their proposed uses, supported by both evidence and theoretical frameworks (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999).

Validation is a crucial process in which test developers and users collect evidence to establish a solid scientific foundation for interpreting test scores, as highlighted by the National Council on Measurement in Education (1999).

Validity researchers prioritize the quality of evidence over quantity to substantiate validity interpretations. This evidence can be classified into four categories: content-based evidence from tests, evidence derived from response processes, evidence relating to other variables, and evidence concerning the consequences of testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999).

To provide comprehensive evidence across the four categories, various research methodologies must be employed. The Achieve alignment method offers evidence grounded in test content (Rothman et al., 2002), while the cognitive interview method is utilized to gather evidence based on response processes (Willis, 2005; Miller et al., 2013). Additionally, the predictive method is essential for establishing evidence related to other variables, and argument-based approaches support evidence concerning the consequences of testing through a test's interpretative and validity arguments (Kane, 2006).

THE STUDY

BTEC International College – FPT University conducts a placement test each semester for new students to assess their English proficiency for university studies, focusing on four skills: reading, listening, speaking, and writing. This study concentrates solely on the writing component of the test.

This study presents a validity argument for the English Placement Writing test (EPT W) at BTEC International College – FPT University, first administered in Summer 2019. The EPT W is designed to assess the writing skills essential for academic success, so establishing its validity is crucial, and understanding the implications of this assessment is beneficial for educators and researchers alike. The primary objectives of the study are to examine: 1) the extent to which task and rater variability contribute to score differences; 2) how many tasks and raters need to be involved in the assessment to obtain a test score dependability of at least 0.85; and 3) the extent to which vocabulary distributions differ across proficiency levels of academic writing.

Table 1.1 The structure of the EPT W

Total test time: 30 minutes

Task 1 content: Write a paragraph using one tense on any familiar topic.
Example: Write a paragraph (100-120 words) to describe an event you attended recently.

Task 2 content: Write a paragraph using more than one tense on a topic that relates to publicity.
Example: Write a paragraph (100-120 words) to describe a vacation trip from your childhood, using these clues: Where did you go? When did you go? Who did you go with? What did you do? What is the most memorable thing? Etc.

The EPT W evaluates test takers' performance using a comprehensive rating rubric. This rubric considers various criteria, including task achievement, grammatical range and accuracy, lexical resource, and coherence and cohesion, to determine the appropriateness of responses (refer to Appendix A).

SIGNIFICANCE OF THE STUDY

This study aims to enhance the theoretical framework of language assessment by offering evidence that supports inferences drawn from the EPT W test scores, thereby contributing to the discourse on test validity within academic writing contexts.

The findings of this study aim to enhance the assessment of writing ability by analyzing the impact of task quantity and rater variability on test scores. By understanding how different components influence these scores and the language produced, this research will guide educators in selecting suitable tasks for evaluating academic writing effectively.

LITERATURE REVIEW

STUDIES ON VALIDITY DISCUSSION

2.1.1 The conception of validity in language testing and assessment

The definition of validity in language testing and assessment could be given in three main time periods

Validity, as defined by Messick (1989), is an evaluative judgment regarding the extent to which empirical evidence and theoretical frameworks support the appropriateness of interpretations and actions derived from test scores or other assessment methods. This perspective on validity has been officially recognized by prominent organizations such as the AERA, APA, and NCME (1985), further affirming its significance in the field of assessment.

Test validation involves gathering evidence to ensure that the inferences drawn from test scores are appropriate, meaningful, and useful. It encompasses various types of inferences that can be made from a single test's scores, with multiple methods available to support each inference. Despite these variations, validity remains a unified concept.

Bachman (1990) provides a comprehensive definition of validity, emphasizing that it pertains to the validation of inferences drawn from test scores rather than the tests themselves, aligning with Messick's perspective. Validity encompasses various dimensions, including content validity, construct validity, concurrent validity, and the consequences of test usage. As reiterated by AERA et al. (1999), validity is defined as the extent to which evidence and theory substantiate the interpretations of test scores associated with their intended applications.

Kane (2001) identifies four key aspects of validity based on an analysis of various authoritative statements (AERA et al., 1999; APA, 1985; Bachman, 1990; Messick, 1989). Firstly, validity is defined as the assessment of the overall plausibility of interpretations or applications of test scores. Secondly, it aligns with the principles of construct validity, emphasizing the need for a comprehensive understanding of how test scores relate to the constructs they are intended to measure.

Thirdly, the proposed interpretations require a thorough analysis of inferences and assumptions, including a rationale for the interpretation and consideration of competing views. This evaluative judgment assesses the adequacy and appropriateness of the interpretation, as well as the strength of the supporting evidence. Finally, validity represents a unified evaluation of the interpretation rather than merely a collection of techniques.

The complexity of validity is crucial in test evaluation, as highlighted by Bachman (1990, 2004) and Brown (1996). They identified three primary aspects of validity: content validity, which encompasses content relevance and coverage; criterion validity, which relates to criterion relatedness; and construct validity, focusing on the meaningfulness of the construct. Additionally, Brown (1996) emphasized the importance of examining standard setting, specifically the appropriateness of cut-points, as a significant aspect of validity.

2.1.2 Using interpretative argument in examining validity in language testing and assessment

The argument-based validation approach in language testing and assessment conceptualizes validity as an argument formed through the analysis of both theoretical and empirical evidence, rather than merely a collection of quantitative or qualitative data. A prominent framework within this approach is the interpretative argument, which is well articulated by Kane. This method emphasizes the integration of various forms of evidence to support claims about the validity of assessments, ensuring a more comprehensive understanding of their effectiveness.

The argument-based approach to validation utilizes an interpretative framework to gather and present evidence of validity, aiming to substantiate its inferences and assumptions, particularly those that are most contentious.

The interpretative argument involves several key inferences, including domain description, evaluation, generalization, explanation, extrapolation, and utilization. These elements are visually represented by arrows that connect the foundational observations of the target domain at the bottom, the intermediate claims such as observed and expected scores in the middle, and the final conclusions regarding the target score and test usage at the top. Figure 2.1 illustrates these inferences within the interpretative argument framework.

Figure 2.1 An illustration of inferences in the interpretative argument (adapted from Chapelle et al., 2008)
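To make the chain of inferences concrete, the sketch below (an illustration added for clarity, not part of the thesis) lists the six inferences as links from grounds to claims, following the labels of the adapted Chapelle et al. (2008) figure; the exact wording of the endpoints is an assumption.

```python
# A minimal, illustrative representation of the inference chain in an
# interpretative argument; the pairing of endpoints follows the figure
# described above and is an assumption, not a quotation from the thesis.
INFERENCE_CHAIN = [
    ("domain description", "target domain performance", "observation of test performance"),
    ("evaluation",         "observation of test performance", "observed score"),
    ("generalization",     "observed score", "expected score (universe score)"),
    ("explanation",        "expected score (universe score)", "construct of academic language proficiency"),
    ("extrapolation",      "construct of academic language proficiency", "target score (real-world performance)"),
    ("utilization",        "target score (real-world performance)", "test use (placement decision)"),
]

for name, grounds, claim in INFERENCE_CHAIN:
    print(f"{name:18s}: {grounds}  ->  {claim}")
```

Each link takes the claim of the previous inference as its grounds, which is what makes the interpretative argument a chain rather than a set of isolated checks.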

In an article discussing the practical application of the argument-based approach, Kane (2002) encapsulated the consensus among testing researchers, including Crooks, Kane, and Cohen (1996) as well as Shepard (1993), regarding the definition of an interpretative argument. He defined it as "a network of inferences and supporting assumptions leading from scores to conclusions and decisions" (Kane, 2002, p. 231).

Structure of an interpretative argument

Kane (1992) posited that various forms of inferences link observations to conclusions. This concept of interconnected inferences aligns with the insights of Toulmin, Rieke, and Janik (1984), who emphasized the significance of implications in this inferential chain.

Arguments often lead to further discussions, creating a chain reaction of interconnected debates. Each argument can serve as a foundation for subsequent arguments, resulting in a continuous cycle of dialogue.

Kane et al. (1999) presented an interpretive argument that forms the basis of performance assessment, which includes six types of inferential bridges. These bridges facilitate the interpretation of test performance as indicative of broader performance contexts beyond the test itself. Figure 2.2 visually represents the inferences involved in this interpretive argument.

Figure 2.2 Bridges that represent inferences linking components in performance assessment (adapted from Kane et al., 1999)

Kane et al. (1999) present an interpretive argument composed of seven interconnected parts, similar to the framework established by Chapelle et al. (2008). The initial component, domain description, establishes a connection between performances in the target domain and observations from the test domain, which is defined by the test's objectives. This observation of test performance highlights the essential knowledge, skills, and abilities relevant to scenarios that mirror those found in the target domain.

Evaluation involves inferring a score from observed performance, relying on assumptions regarding the appropriateness and consistency of the scoring methods and the conditions under which the performance is assessed. According to Kane et al. (1999), effective evaluation requires that the criteria used for scoring are suitable and have been implemented correctly, as well as ensuring that the performance evaluation occurs under conditions that align with the intended assessment framework.

GENERALIZABILITY THEORY (G-THEORY)

What is Generalizability theory (G-theory)?

Generalizability (G) theory is a statistical theory about the dependability of behavioral measurements. Cronbach, Gleser, Nanda, and Rajaratnam (1972) sketched the notion of dependability as follows:

The score used for decision-making is just one of many potential scores that could fulfill the same role. Decision-makers typically do not focus on the specific responses to individual stimuli, the particular tester, or the exact timing of the assessment. Moreover, certain measurement conditions can be modified without compromising the score's acceptability to the decision-maker.

G-theory's key advantage lies in its ability to separately estimate multiple sources of measurement error within a single analysis. This approach allows decision-makers to ascertain the necessary number of occasions, test forms, and administrations required to achieve reliable scores. Additionally, G-theory offers a summary coefficient that indicates the level of dependability, akin to the reliability coefficient found in classical test theory.

In language assessment, Bachman (1990) highlighted the application of generalizability theory (G-theory) to evaluate errors in generalizing task scores across various language tests and participants. This concept of generalization serves as a framework for understanding reliability, which has been extensively utilized in validation studies of language assessments. The assumptions that underpin the generalization inference are supported by reliability estimates, while additional research, including studies on test administration conditions (Kane et al., 1999) and score equating (Kane, 2004), further validates this inference.

2.2.1 Generalizability and Multifaceted Measurement Error

In the context of G-theory, a measurement represents a sample from a universe of admissible observations deemed interchangeable by decision makers for decision-making purposes (Shavelson & Webb, 1991). A one-facet universe is characterized by a single source of measurement error, with items serving as a facet of the measurement; the item universe encompasses all admissible items. When decision makers seek to generalize performance from one occasion to a broader set of occasions, these occasions form another facet, with the occasions universe defined by all admissible occasions. It is important to note that error is inevitably present when generalizing from a measurement to behavior within the universe.

Relying on a single test score to assess an individual's ability is often inaccurate due to various sources of error, including occasions, items, and raters. Generalizability theory (G-theory) allows for the estimation of variance from multiple sources, enabling a clearer understanding of both the construct of interest and the associated errors.

2.2.2 Sources of variability in a one-facet design

Definitions of some sources of variability

Person (object of measurement): variability due to differences in ability on the construct of interest.

Item facet: the extent to which items vary in difficulty.

Person-by-item interaction: the extent to which items are differently difficult for different persons (e.g., the background knowledge of test takers can affect scores on a reading test).
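As a minimal illustration of these sources of variability (added here; the thesis itself ran its analyses in SPSS), the sketch below estimates the person, item, and residual variance components of a one-facet person-by-item design from the expected mean squares of a two-way ANOVA without replication. The score matrix is invented for illustration.

```python
import numpy as np

# Hypothetical score matrix: rows = persons, columns = items (one-facet p x i design).
scores = np.array([
    [5.0, 4.0, 6.0],
    [7.0, 6.5, 7.0],
    [3.0, 4.5, 4.0],
    [6.0, 5.0, 6.5],
    [4.5, 4.0, 5.0],
])
n_p, n_i = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
item_means = scores.mean(axis=0)

# Sums of squares for a two-way crossed design without replication.
ss_p = n_i * np.sum((person_means - grand) ** 2)
ss_i = n_p * np.sum((item_means - grand) ** 2)
ss_res = np.sum((scores - person_means[:, None] - item_means[None, :] + grand) ** 2)

ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_res = ss_res / ((n_p - 1) * (n_i - 1))

# Expected-mean-square solutions for the variance components.
var_res = ms_res                   # person x item interaction, confounded with error
var_p = (ms_p - ms_res) / n_i      # person (object of measurement)
var_i = (ms_i - ms_res) / n_p      # item facet (difficulty differences)

total = var_p + var_i + var_res
for name, v in [("person", var_p), ("item", var_i), ("p x i, error", var_res)]:
    print(f"{name:12s} {v:6.3f}  ({100 * v / total:4.1f}% of total)")
```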

SUMMARY

This study aims to investigate the validity of the English Placement Writing Test (EPT W) at BTEC International College – Da Nang Campus, specifically for non-native English speakers. Utilizing the interpretative argument framework for the TOEFL iBT test developed by Chapelle et al. (2008), the research focuses on two key inferences: generalization and explanation. To this end, three research questions were formulated; the first two address the evidence for the evaluation and generalization inferences, while the third analyzes the linguistic features of the hand-typed writing records of the 21 passing tests, providing support for the explanation inference.

1. To what extent is test score variance attributed to variability in the following: a. task? b. rater?

2. How many raters and tasks are needed to obtain a test score dependability of at least 0.85?

3. What are vocabulary distributions across proficiency levels of academic writing?

METHODOLOGY

RESEARCH DESIGN

This study utilized a descriptive design in which qualitative and quantitative data were collected and analyzed in parallel to address the research questions. The qualitative data comprised 21 typescripts of written exams from students who passed the entrance placement test (79 of the 100 test takers were placed into English class Level 0). The quantitative data included 400 writing scores from the two writing tasks completed by the 100 test takers (each task was scored by two raters).

First, the analyses of the 400 writing scores were used to answer the first two research questions:

1. To what extent is test score variance attributed to variability in the following: a. task? b. rater?

2. How many raters and tasks are needed to obtain a test score dependability of at least 0.85?

Second, the linguistic feature analyses from 21 passing written tests were used to answer the following research question:

3 What are vocabulary distributions across proficiency levels of academic writing?

PARTICIPANTS

There were two categories of participants: 100 test takers and two raters

First, the 100 test takers completed the two writing tasks, which were graded for entrance placement; anonymized records of these scores were used for this research.

The second group of participants consists of the two raters who evaluated the two writing tasks, providing four sets of scores based on the rating scales. Figure 3.1 illustrates the two categories of participants, with further details about each category provided in the following subsections.

One hundred high school students enrolled in a program at BTEC International College – FPT University were required to pass five English levels before pursuing their international majors. Students with equivalent English proficiency certificates, such as IELTS, TOEFL, or TOEIC, were exempt from these levels per the college's academic regulations. Following an entrance test that assessed reading, grammar, listening, and writing skills, the students were placed into suitable classes. This study focused solely on the results from the writing test.

The 400 scores from the 100 test takers were used to answer the first and second research questions.

The 21 written descriptions from the 21 test takers who passed the test were used to answer the third research question.

In this study, the raters were two female lecturers from BTEC International College, both holding a Bachelor's degree in English Pedagogy from the University of Foreign Language Studies, the University of Da Nang. Both are currently pursuing a Master of Arts degree in English Language at the same university.

MATERIALS

The material used in this study included the English Placement Test, the writing task types, and the rating scale (rating rubric)

3.3.1 The English Placement Writing Test (EPT W) and the Task Types

The EPT W is an evaluation designed to assess the academic writing skills essential for success in an academic environment. The test evaluates candidates on four key criteria: task achievement, grammatical range and accuracy, lexical resource, and coherence and cohesion.

There are two tasks in the EPT W An example of each task is presented below:

Task 1: Write a paragraph (100-120 words) to describe an event you attended recently

Task 2: Write a paragraph (100-120 words) to describe a vacation trip from your childhood, using these clues:

Where did you go? When did you go? Who did you go with? What did you do? What is the most memorable thing? Etc.

A total of 30 minutes was given for the two tasks, and taking notes was allowed.

The written examinations were evaluated on a scale from 0 to over 7, with detailed criteria and descriptors for each band available in Appendix A. Each score level corresponds to a specific class that offers learners tailored learning materials, lectures, and progress tests. The band information is outlined below, and a minimal illustrative mapping is sketched after the list.

- 0 – 4.5: level 1 class, material: Top Notch Fundamental

- 4.5 – 5.0: level 2 class, material: Top Notch 1

- 5.0 – 5.5: level 3 class, material: Top Notch 2

- 5.5 – 6.5: level 4 class, material: Top Notch 3
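The sketch below is a hypothetical helper, not part of the thesis, that mirrors the band-to-class placement listed above; the level 5 band (6.5-7) and the exemption above 7 follow the descriptions given later in this chapter, and the handling of exact boundary scores is an assumption.

```python
# Hypothetical band-to-level mapping mirroring the placement bands listed above.
# Exact boundary handling (e.g., a score of exactly 4.5) is an assumption; the
# rubric in Appendix A governs the actual decision.
BANDS = [
    (4.5, "Level 1 - Top Notch Fundamental"),
    (5.0, "Level 2 - Top Notch 1"),
    (5.5, "Level 3 - Top Notch 2"),
    (6.5, "Level 4 - Top Notch 3"),
    (7.0, "Level 5"),
]

def place(score: float) -> str:
    """Return the class placement for an EPT W band score."""
    if score > 7.0:
        return "Exempt from English courses"
    for upper_bound, label in BANDS:
        if score <= upper_bound:
            return label
    return "Unplaced"

print(place(5.2))   # -> Level 3 - Top Notch 2
print(place(7.5))   # -> Exempt from English courses
```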

PROCEDURES

The raters were given training before rating the writings. The training took about one hour and comprised two steps.

In the initial step, the researcher worked through the rating scale outlined in Appendix A with the other rater. A response rated as a high pass (greater than 7) is expected to almost always meet the task's requirements.

"Almost always" indicates that the test taker demonstrates a diverse range of sentence structures, with most sentences being error-free and only occasional minor mistakes. They exhibit a broad vocabulary and show sophisticated control over lexical features, with rare minor errors. The writing is well organized, utilizing effective cohesion that remains unobtrusive, and the test taker skillfully manages paragraphing without any spelling or punctuation errors.

A score of 0 is assigned if the test taker submits no response or fails to attend the test. Test takers scoring between 0 and 4.5 are classified as level 1, where their answers are only minimally related to the task. This includes the occasional use of correct grammatical structures or tenses, but relies predominantly on a few isolated words. Their writing lacks cohesive devices and fails to establish connections between sentences, with numerous spelling and punctuation errors that significantly impede comprehension.

Level 2 is assigned to test takers in band 4.5-5.0 whose answers sometimes present limited related ideas. "Sometimes" means that they can use correct simple sentences (the first subject always) and correct forms of verbs.

Test takers scoring between 5.0 and 5.5 are categorized at level 3. While they attempt to address both tasks, their responses often overlook key points and display limited structural variety, rarely incorporating subordinate clauses. Frequent grammatical errors and basic vocabulary, sometimes used repetitively or inappropriately, characterize their writing. Their control over word formation is limited, and although they present information and ideas, these lack coherence and clear progression. Basic cohesive devices are used, but they may be inaccurate or repetitive, and spelling and punctuation errors can hinder overall comprehension.

Level 4 was reached by test takers who attempt to address the task and usually cover all key points; usually use only a range of structures; attempt complex sentences, though these tend to be less accurate than simple sentences; have frequent grammatical errors; use a limited range of vocabulary, but one that is minimally adequate for the task; have good control of word formation; present information with some organisation, although there may be a lack of overall progression; use cohesive devices effectively, although cohesion within and/or between sentences may be faulty or mechanical; may be repetitive because of a lack of referencing and substitution; and have some spelling and punctuation errors, which nevertheless do not hinder comprehension.

Test takers scoring between band 6.5 and 7 are assigned to level 5, the most challenging tier of the BTEC language program. They demonstrate the ability to meet most task requirements, utilizing a combination of simple and complex sentences while generally producing error-free writing. Their grammatical control is strong, with only minor errors, and they employ a sufficient range of vocabulary, occasionally attempting less common words despite some inaccuracies. Additionally, they effectively organize information and ideas, ensuring clear progression, and use various cohesive devices, albeit with occasional inconsistencies. While there may be a few spelling and punctuation mistakes, these do not impede overall comprehension.

Test takers achieving a score above 7 are exempt from attending English courses at BTEC, as their performance consistently meets task requirements. They demonstrate a diverse range of sentence structures, with the majority being error-free. Their vocabulary is extensive and used with natural sophistication, exhibiting only rare minor errors. Additionally, they present well-organized paragraphs that effectively utilize cohesion, with no spelling or punctuation mistakes.

In the second step, the two raters engaged in a one-hour face-to-face discussion to align their ratings and ensure a clear understanding of each criterion. The subsequent rating process was thus grounded in this norming training.

After the training, the two raters rated the 200 written examinations (100 test takers, each completing two tasks). Out of the 100 test takers, 79 wrote no responses, so only 42 examinations contained written responses. The raters thoroughly evaluated each performance, providing both analytical and holistic ratings for each task in accordance with the established rating rubric.

Data analysis for the first two research questions was conducted using SPSS version 22, while the third question, which focused on linguistic features, was analyzed using the Vocabulary Profiler tool from Compleat Lexical Tutor software (www.lextutor.ca/vp/eng/).

DATA ANALYSIS

3.5.1 To what extent is test score variance attributed to variability in the following: a. task?; b. rater?

To investigate this research question, SPSS version 22 was used to analyze the 400 writing scores from the 100 test takers, which were awarded by the two raters. The analysis was grounded in Generalizability theory (G-theory), which allows the estimation of variance from multiple error sources, recognizing that a single test score may not accurately reflect an individual's true ability. The aim was to determine how much of the variance in test scores was due to the task and how much to the rater.
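The code below is a minimal sketch of the kind of person-by-task-by-rater G-study that such an analysis performs; it is not the thesis' SPSS run, and the simulated scores and effect sizes are assumptions chosen only to make the decomposition visible.

```python
import numpy as np

# Hypothetical data: scores[p, t, r] for persons x tasks x raters, fully crossed.
rng = np.random.default_rng(0)
n_p, n_t, n_r = 20, 2, 2
true_ability = rng.normal(5.0, 1.0, size=(n_p, 1, 1))
task_effect = np.array([0.0, -0.6]).reshape(1, n_t, 1)     # one task harder
rater_effect = np.array([0.05, -0.05]).reshape(1, 1, n_r)  # raters nearly equal
scores = true_ability + task_effect + rater_effect + rng.normal(0, 0.7, size=(n_p, n_t, n_r))

grand = scores.mean()
m_p = scores.mean(axis=(1, 2)); m_t = scores.mean(axis=(0, 2)); m_r = scores.mean(axis=(0, 1))
m_pt = scores.mean(axis=2); m_pr = scores.mean(axis=1); m_tr = scores.mean(axis=0)

# Mean squares for a fully crossed p x t x r design (one observation per cell).
ms_p = n_t * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
ms_t = n_p * n_r * np.sum((m_t - grand) ** 2) / (n_t - 1)
ms_r = n_p * n_t * np.sum((m_r - grand) ** 2) / (n_r - 1)
ms_pt = n_r * np.sum((m_pt - m_p[:, None] - m_t[None, :] + grand) ** 2) / ((n_p - 1) * (n_t - 1))
ms_pr = n_t * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
ms_tr = n_p * np.sum((m_tr - m_t[:, None] - m_r[None, :] + grand) ** 2) / ((n_t - 1) * (n_r - 1))
resid = (scores - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
         + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - grand)
ms_ptr = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_r - 1))

# Expected-mean-square solutions for the variance components (negatives set to 0).
v = {}
v["ptr,e"] = ms_ptr
v["pt"] = max((ms_pt - ms_ptr) / n_r, 0.0)
v["pr"] = max((ms_pr - ms_ptr) / n_t, 0.0)
v["tr"] = max((ms_tr - ms_ptr) / n_p, 0.0)
v["p"] = max((ms_p - ms_pt - ms_pr + ms_ptr) / (n_t * n_r), 0.0)
v["t"] = max((ms_t - ms_pt - ms_tr + ms_ptr) / (n_p * n_r), 0.0)
v["r"] = max((ms_r - ms_pr - ms_tr + ms_ptr) / (n_p * n_t), 0.0)

total = sum(v.values())
for k, val in v.items():
    print(f"sigma^2_{k:6s} = {val:6.3f}  ({100 * val / total:4.1f}%)")
```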

Cronbach's alpha is a key metric for assessing internal consistency in SPSS, serving as an indicator of scale reliability. In this study, the calculated Cronbach's alpha was 0.981, indicating a high level of internal consistency among the sets of ratings.
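For reference, Cronbach's alpha can be reproduced outside SPSS from the persons-by-ratings score matrix. The sketch below is purely illustrative; the small rating matrix in it is invented.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical matrix: each row is a test taker, each column one of the
# four task-by-rater score sets (2 tasks x 2 raters); values are invented.
ratings = np.array([
    [6.0, 5.5, 6.0, 6.5],
    [4.0, 4.5, 4.0, 4.0],
    [7.0, 7.0, 6.5, 7.0],
    [5.0, 5.5, 5.0, 4.5],
    [3.5, 3.0, 4.0, 3.5],
])
print(round(cronbach_alpha(ratings), 3))
```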

3.5.2 How many raters and tasks are needed to obtain the test score dependability of at least 0.85?

The generalizability coefficient was used to estimate the reliability of relative decisions and, through D-studies, to project how many raters and tasks would be needed to reach the target level of dependability.

Dependability refers to the reliability of inferring a person's true ability from their observed test scores, reflecting what their average performance would be across various conditions. This concept assumes that the individual's knowledge, skills, or attributes remain stable over time, so that variations in scores from different assessments are attributed to measurement error rather than to genuine changes in the person's capabilities due to growth or learning.

3.5.3 What are vocabulary distributions across proficiency levels of academic writing?

This study utilized writing examination papers from non-native English students at BTEC International College, collected during the EPT administration in 2019. In a 30-minute writing test, the EFL undergraduate students were tasked with composing two paragraphs in response to specific prompts. EFL instructors evaluated and classified each paragraph into two levels based on the rating rubric (refer to Appendix A). Table 3.1 provides details on the number of texts and word counts for each of the two sub-corpora levels.

Table 3.1 Texts and word counts in the two levels of the EPT sub-corpora

Sub-corpora Texts Word Counts

The Vocabulary Profiler in Compleat Lexical Tutor was utilized for various vocabulary analyses, including word types, token counts, frequency, lexical diversity, density, and sophistication. Lexical diversity was assessed using the type-token ratio (TTR), which compares the number of unique words to the total word count, indicating the text's lexical richness. Lexical density was determined by the ratio of content words (nouns, verbs, adjectives, and adverbs) to the total number of words, highlighting the text's informational richness. Lexical sophistication was evaluated based on the proportions of high- and low-frequency words, categorized into K1 tokens (the most frequent 1,000 English words), K2 tokens (the next 1,000), and AWL (Academic Word List) tokens.
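A minimal sketch of these word-based measures is given below. It is not the Vocabulary Profiler itself: the K1, K2, AWL, and content-word lists are assumed to be supplied as sets (in the study they come from Compleat Lexical Tutor), and the tiny lists in the usage example are invented.

```python
import re

def lexical_profile(text, k1_words=frozenset(), k2_words=frozenset(),
                    awl_words=frozenset(), content_words=frozenset()):
    """Token/type counts, type-token ratio, lexical density, and band proportions.

    The word lists are assumptions here: any comparable frequency lists work.
    """
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    types = set(tokens)
    n = len(tokens)
    return {
        "tokens": n,
        "types": len(types),
        "ttr": len(types) / n if n else 0.0,
        "lexical_density": sum(t in content_words for t in tokens) / n if n else 0.0,
        "k1_prop": sum(t in k1_words for t in tokens) / n if n else 0.0,
        "k2_prop": sum(t in k2_words for t in tokens) / n if n else 0.0,
        "awl_prop": sum(t in awl_words for t in tokens) / n if n else 0.0,
    }

# Toy illustration with tiny, invented word lists.
sample = "Last summer I went to the beach with my family and we analysed the data together."
print(lexical_profile(
    sample,
    k1_words=frozenset({"last", "summer", "i", "went", "to", "the", "beach", "with",
                        "my", "family", "and", "we", "together"}),
    k2_words=frozenset({"data"}),
    awl_words=frozenset({"analysed", "data"}),
    content_words=frozenset({"summer", "went", "beach", "family", "analysed", "data", "together"}),
))
```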

RESULTS

RESULTS FOR RESEARCH QUESTION 1

To what extent is test score variance attributed to variability in the following: a. Task?; b. Rater?

Table 4.1 presents the sources of variation and variance component estimates, comprising three main effects (persons, i.e., students; tasks; and raters), three interaction effects, and residual error. Among the main effects, persons account for the largest share at 29%, followed by tasks at 11% and raters at a minimal 0.2%. Among the interactions, the person-task interaction accounts for the highest percentage at 35%, while the person-rater interaction is at 20%, and the rater-task interaction is notably low at 0.11%. Additionally, measurement error contributes 8% of the variability in test scores. Detailed explanations of each variance source follow in subsequent paragraphs.

Table 4.1 Variance components attributed to test scores

Source of Variance Estimate Percentage

Person: The variance component for persons accounts for 29% of the total variance, with an estimate of 1.063, making it the second largest contributor to the overall variance. This indicates that individuals differed in their abilities as measured by the test, when averaged across raters and tasks.

Task: The variance estimate for task (0.299) accounts for 11% of the total variance. This suggests that one task was more difficult than the other when averaging over all persons.

Rater: The variance estimate for raters is 0.009, representing just 0.2% of the total variance. This indicates that there was minimal difference between the raters in their ratings when averaged across all individuals.

Person*Task: The interaction between person and task has a variance estimate of 1.264, representing 35% of the total variance. This indicates substantial variability in individuals' relative performance across the two tasks when averaged across raters.

Person*Rater: The person-by-rater interaction, the third largest component (20%), suggests that the relative standing of persons on the performance assessment differed across raters, averaging over tasks.

Rater*Task: The small variance component for the interaction between rater and task (0.004, or 0.11%) suggests that raters' average ratings of persons were consistent from one task to the other.

The residual error component (0.277, or 8%) is appreciable relative to the total variance and reflects both systematic and unsystematic variation. Systematic variation arises from factors not controlled in the model, including occasion, task ordering, method of task delivery, and the rating scale. Unsystematic variance stems from uncontrollable sources such as room temperature during testing and the interaction of unmeasured errors.

In summary, approximately 11% of score variability was linked to differences in task difficulty, while only about 0.2% was attributed to differences between raters. This indicates that the tasks differed noticeably in difficulty, and that, although the raters were largely consistent, their ratings still showed some variation.

RESULTS FOR RESEARCH QUESTION 2

How many raters and tasks are needed to obtain the test score dependability of at least 0.85?

To determine how a test score dependability of at least 0.85 could be achieved, we calculated dependability indices, including relative dependability (the generalizability coefficient) for relative decisions and absolute dependability (the phi coefficient) for absolute decisions. Initial calculations were performed for one rater and one task, and for two raters and two tasks. Subsequently, D-studies were conducted to identify the number of raters and tasks needed to meet the desired dependability threshold. Detailed calculations of the dependability indices are provided below.

The dependability (relative and absolute) for 1 task and 1 rater

The generalizability coefficient based on one rater and one task was 0.32, indicating low dependability of the test scores in representing test takers' true scores. This value suggests that only 32% of the observed score variance reflects true score variance, while 68% is attributable to measurement error, highlighting the need for improved reliability in the testing process.

The phi coefficient for a single rater and a single task is 0.29, a low dependability index for absolute decisions. This value suggests that only 29% of the total variance in test scores can be attributed to the true scores of test takers, while 71% is due to measurement error. Consequently, the test's dependability is limited, raising concerns about the accuracy of the test scores in representing the true abilities of the individuals assessed.

In this context, the variance components for raters and tasks, as well as their interaction, did not affect the relative ranking of test takers in the language course placement test. Regardless of whether raters were harsh or lenient, the test takers' relative standings remained unchanged. Additionally, since all test takers completed both tasks, task difficulty did not influence their rankings; performance varied with task difficulty, but relative standings were consistent. Consequently, relative error variance is deemed more suitable for this assessment context.

The dependability (relative and absolute) for 2 tasks and 2 raters

The test, which consists of two tasks evaluated by two raters, achieves a generalizability coefficient of 0.5. This indicates that about 50% of the observed score variance for each test taker reflects true score variance, while the remaining 50% is attributable to measurement error.

The phi coefficient for two raters and two tasks is 0.33, indicating that the test is about 33% dependable for absolute decisions: roughly one-third of the observed score variance reflects true score variance, while approximately 67% is attributable to measurement error, highlighting the need for improved reliability in the assessment.

Overall, the generalizability coefficient is usually greater than the phi coefficient, since the relative error includes fewer variance components than the absolute error.

If we want a dependability of 0.85, then we would need 14 raters and 10 tasks, or 10 raters and 12 tasks (see Table 4.2 below).
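The dependability figures reported above can be approximately reproduced from the variance components in Table 4.1. The sketch below is not the thesis' SPSS output; in particular, the person-by-rater component (about 0.73) is backed out of its reported 20% share and is therefore an approximation.

```python
# Variance components taken from Table 4.1; the person-by-rater value (~0.73)
# is inferred from its reported 20% share of the total variance.
var_p, var_t, var_r = 1.063, 0.299, 0.009
var_pt, var_pr, var_tr, var_res = 1.264, 0.73, 0.004, 0.277

def g_coefficient(n_t, n_r):
    """Generalizability coefficient (relative decisions) for n_t tasks and n_r raters."""
    rel_error = var_pt / n_t + var_pr / n_r + var_res / (n_t * n_r)
    return var_p / (var_p + rel_error)

def phi_coefficient(n_t, n_r):
    """Phi coefficient (absolute decisions): the error term also includes the
    task, rater, and task-by-rater components."""
    abs_error = (var_t / n_t + var_r / n_r + var_pt / n_t + var_pr / n_r
                 + var_tr / (n_t * n_r) + var_res / (n_t * n_r))
    return var_p / (var_p + abs_error)

print(round(g_coefficient(1, 1), 2), round(phi_coefficient(1, 1), 2))    # ~0.32, ~0.29
print(round(g_coefficient(2, 2), 2))                                     # ~0.50
print(round(g_coefficient(10, 14), 2), round(g_coefficient(12, 10), 2))  # ~0.85, ~0.85
```

Evaluating g_coefficient over a grid of task and rater numbers is the D-study search that yields the combinations reported above.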

RESULTS FOR RESEARCH QUESTION 3

What are vocabulary distributions across proficiency levels of academic writing?

Table 4.3 Distribution of vocabulary across proficiency levels (the reported measures include types, tokens, lexical diversity (type-token ratio), lexical density, K1, K2 and AWL proportions, and the total number of word families)

Note: TTR is the mean type-token ratio per text; lexical density is the ratio of content words to the total word count; K1 and K2 denote the most frequent 1,000 and the second 1,000 words of English, respectively; AWL denotes the Academic Word List.

The analysis in Table 4.3 reveals a clear progression in the number of types and tokens across the levels of writing proficiency. Low-level written responses (EPT L1) exhibited only 450 types and 1,260 tokens, whereas higher-level responses (EPT L2) showed an increase to 524 types and 1,686 tokens. This indicates that more advanced learners produce longer and more diverse written output, reflecting their greater linguistic knowledge. These findings align with Shaw and Weir's (2007) research, which demonstrated that as learners' proficiency improves, they generate a broader vocabulary range in terms of both tokens and types.

Lexical diversity was assessed using the type-token ratio (TTR) for each text, revealing slight differences between the two groups. The TTR for EPT L1 was 0.36, while EPT L2 had a TTR of 0.31, indicating somewhat different levels of lexical diversity in their responses.

The analysis of lexical sophistication showed that the use of the most frequent 1,000 English words (K1 tokens) decreased as proficiency increased, with EPT L1 at 0.36 and EPT L2 at 0.31. Conversely, the use of K2 tokens and Academic Word List (AWL) items increased across proficiency levels, with EPT L1 responses containing 0.52 K2 tokens and 0.05 AWL, while EPT L2 responses included 0.53 K2 tokens and 0.15 AWL. Despite these trends, there was no statistically significant difference in lexical diversity, density, or sophistication between the two proficiency levels.

The study identified 372 word families in the EPT L1 group and 379 in the EPT L2 group, revealing a statistically significant difference in the proportion of word families between the two learner groups. This indicates that higher-level learners utilized a greater variety of word families in their written discourse, reflecting a more advanced linguistic and cognitive capability compared to lower-level learners.

While the differences in lexical density, diversity, and sophistication among EPT levels are minimal, higher-level learners demonstrate greater complexity in their written discourse when considering additional measures like types, tokens, and word families.

DISCUSSION AND CONCLUSIONS

GENERALIZATION INFERENCE


I would like to express my sincere appreciation to my supervisor, Dr. Vo Thanh Son Ca, for her unwavering support and guidance throughout my research project. Her ability to help me narrow down my initially broad topic and to combine theory with practice effectively has been invaluable, and her insightful feedback and thorough observations at every stage of my work led to significant improvements in my research on language testing and assessment. I am also deeply thankful for the encouragement and assistance of my family and friends, whose support was crucial to the completion of this dissertation. Without these remarkable individuals, this journey would not have been possible.

Writing in a foreign or second language is a crucial skill in language education, and universities rely on writing assessment scores to place students in language support courses. To ensure the validity of these assessments, it is essential to develop a validity argument for tests such as the English Placement Writing test (EPT W) at BTEC International College Danang Campus. This study focused on two key inferences, generalization and evaluation, exploring how task variability and rater differences contribute to test scores. It aimed to determine the number of tasks and raters required to achieve a test score dependability of at least 0.85, while also examining vocabulary distribution differences across proficiency levels of academic writing. The analysis utilized test score data from the 21 students who completed the two writing tasks, applying Generalizability theory and Decision studies (D-studies) to ascertain the necessary number of tasks and raters for reliable assessment outcomes.

An analysis of the 42 written responses from the 21 students revealed how vocabulary distributions differ across proficiency levels. The findings indicated that the variability in test scores was influenced more by the tasks than by the raters, who had a more limited impact on score variance. To achieve a dependability of 0.85, the test would need either 14 raters and 10 tasks or 10 raters and 12 tasks. Furthermore, the study showed that lower-level students used a less varied vocabulary than their higher-level peers, suggesting that more proficient learners tend to produce a broader range of word families.


EXPLANATION INFERENCE

Figure 5.2 The explanation inference in the validity argument for the EPT W test, with one assumption and backing

The underlying assumption of the warrant of the explanation inference is that the linguistic knowledge, processes, and strategies needed to complete tasks differ according to theoretical expectations. Discourse analysis of test takers' written responses revealed variations in vocabulary frequency across language proficiency levels. Specifically, the analysis of the two EPT sub-corpora, comprising 42 texts and 3,920 tokens, indicated that single word-based measures, such as the number of types and tokens, increased with higher proficiency levels. Additionally, higher proficiency learners demonstrated a broader range of word families than their lower proficiency counterparts.

CONCLUSION/CLAIM: Expected scores on the EPT W reflect test takers' academic writing proficiency.

GROUNDS/DATA: Test takers' written discourse.

WARRANT: Expected scores are attributed to a construct of academic language proficiency.

ASSUMPTION 1: The linguistic knowledge, processes, and strategies required to successfully complete tasks vary in keeping with theoretical expectations.

BACKING 1: Vocabulary distributions were different across proficiency levels.

SUMMARY AND IMPLICATIONS OF THE STUDY

The findings moderately supported the two assumptions of the generalization inference, indicating that additional evidence is necessary for the EPT W test. In contrast, the assumption of the explanation inference was only partially supported by qualitative evidence, which is limited when compared with quantitative analysis.

This study has several important implications. Firstly, it shows that the two-task, two-rater test achieves a dependability estimate of only 50%, indicating that expanding the number of tasks and raters is essential for improving score dependability; decision-makers at colleges can use these findings to determine an appropriate number of tasks and raters. Secondly, the study emphasizes the methodological value of employing a mixed methods approach to strengthen the validity of inferences. Lastly, it identifies raters and tasks as notable sources of test score variability, underscoring the necessity of rater training in performance assessments, particularly in writing and speaking evaluations.

LIMITATIONS OF THE STUDY AND SUGGESTIONS FOR FUTURE RESEARCH

As an initial investigation into the validity of the EPT Writing test at BTEC, this study has several limitations.

First, this study focused on only two inferences: generalization and explanation. A robust validity argument should be supported by evidence for all six inferences, including domain description, evaluation, extrapolation, and utilization. Each inference is underpinned by various assumptions that reinforce the warrant supporting it.

Second, the study examined only one or two assumptions to support its claims, particularly focusing on the first assumption related to the validity of the performance test (Chapelle et al., 2008). The initial research questions were based on three key assumptions: 1) a sufficient number of tasks must be included in the test to provide stable estimates of performance; 2) the configuration of tasks should align with the intended interpretation; and 3) appropriate scaling and equating procedures must be applied to test scores. This narrow focus suggests the need for future research to explore a broader range of assumptions to strengthen the validity argument.

Third, the study explored vocabulary distributions across proficiency levels by analyzing the texts and tokens of the two sub-corpora. To broaden the linguistic evidence, future research should also investigate additional language aspects, including grammar, semantics, and pragmatics, thereby extending the analysis of linguistic features.

1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). The standards for educational and psychological testing. American Educational Research Association.
2. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). The standards for educational and psychological testing. American Educational Research Association.
3. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
4. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
5. Borsboom, D., & Mellenbergh, G. J. (2004). The concept of validity. Psychological Review, 111(4), 1061-1071.
6. Brown, C. R., Moore, J. L., Silkstone, B. E., & Botton, C. (1996). The construct validity and context dependency of teacher assessment of practical skills in some pre-university level science examinations, 3(3), 377-392.
7. Brown, J. D. (1989). Improving ESL placement tests using two perspectives.
9. Brown, J. D. (1996). Testing in language programs. New Jersey: Prentice Hall.
10. Chapelle, C. A., Jamieson, J., & Hegelheimer, V. (2003). Validation of a web-based ESL test. Language Testing, 20(4), 409-439.
11. Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the Test of English as a Foreign Language. New York: Routledge.
12. Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13.
13. Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to the valid use of assessments. Assessment in Education: Principles, Policy & Practice, 3(3), 265-286.
14. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.
15. Douglas, D. (Ed.) (2003). English language testing in U.S. colleges and universities (2nd ed.). Washington, D.C.: Association of International Educators.
16. Douglas, D. (2009). Understanding language assessment. London: Hodder.
17. Fulcher, G. (1997). An English language placement test: Issues in reliability and validity. Language Testing, 14(2), 113-139.
18. Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.
19. Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319-342.
20. Kane, M. (2002). Validating high stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31-41.
21. Kane, M. (2004). The analysis of interpretive arguments: Some observations inspired by the comments. Measurement: Interdisciplinary Research and Perspectives, 2(3), 192-200.
22. Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: American Council on Education and Praeger.
23. Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17.
24. Lee, Y. J., & Greene, J. (2007). The predictive validity of an ESL placement test: A mixed methods approach. Journal of Mixed Methods Research, 1(4).
25. Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
26. Mislevy, R. J. (2003). Argument substance and argument structure in educational assessment (CSE Technical Report 605). Los Angeles: Center for the Study of Evaluation.
27. Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Sage.
28. Raimes (1994). Testing writing in EFL exams: The learners' viewpoint as valuable feedback for improvement. Procedia - Social and Behavioral Sciences, 199(2015), 30-37.
29. Lines (2004). Guiding the reader (or not) to re-create coherence: Observations on postgraduate student writing in an academic argumentative writing task. Journal of English for Academic Purposes, 16(2014), 14-22.
31. Toulmin, S., Rieke, R., & Janik, A. (1984). An introduction to reasoning. New York: Macmillan.
32. Usaha, S. (2000). Effectiveness of Suranaree University's English placement test. Suranaree University of Technology. Retrieved on 12th September 2010 from http://hdl.handle.net/123456789/2213
33. Wall, D., Clapham, C., & Alderson, J. C. (1994). Evaluating a placement test.

Analytic rating scale descriptors (each criterion is described from the lowest band to the highest band shown, 5.5-6.5):

Task response
- Writes nothing or writes no English words.
- Answer is barely related to the task.
- Answer sometimes presents related ideas.
- Attempts to address the task, but the answer often does not address all key points.
- Attempts to address the task and usually covers all key points.

Grammatical range and accuracy
- Writes nothing or writes no English words.
- Barely uses correct grammatical structures and tenses.
- Sometimes uses correct simple sentences (the first subject always) and correct forms of verbs.
- Attempts to use a variety of structures, but with only rare use of subordinate clauses.
- Usually uses only a limited range of structures; attempts complex sentences, but these tend to be less accurate than simple sentences.

Lexical resource
- Writes nothing or writes no English words.
- Only uses a few isolated words.
- Uses a limited range of words and expressions with no control of word formation and/or spelling.
- Uses basic vocabulary which may be used repetitively or which may be inappropriate for the task; has limited control of word formation.
- Uses a limited range of vocabulary, but this is minimally adequate for the task; has good control of word formation.

Coherence and cohesion
- Writes nothing or writes no English words.
- Rarely writes any message due to the lack of cohesive devices and of close connection among sentences; a lot of spelling and punctuation errors that hinder comprehension.
- Has very little control of organisational features such as cohesive devices; presents information and ideas, but these are not arranged coherently and there is no clear progression in the response; spelling and punctuation errors hinder comprehension.
- Presents information with some organisation, but there may be a lack of overall progression; uses cohesive devices, but these may be inaccurate or repetitive; sometimes has spelling and punctuation errors that may hinder comprehension.
- Uses cohesive devices effectively, but cohesion within and/or between sentences may be faulty or mechanical; may be repetitive because of lack of referencing and substitution; has some spelling and punctuation errors, but these do not hinder comprehension.
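Read as data, the scale above is simply a mapping from criteria to ordered band descriptors, and a rater's four analytic scores for a script must be combined into a single band. The following Python sketch illustrates one way this could be done; the criterion keys, the equal weighting of the four criteria, and the rounding to the nearest half band are illustrative assumptions made here, not the scoring rule prescribed for the EPT W.

    # Illustrative sketch only: criterion keys, equal weighting, and
    # half-band rounding are assumptions for this example; they are not
    # taken from the EPT W scoring guidelines.

    ANALYTIC_CRITERIA = (
        "task_response",
        "grammatical_range_and_accuracy",
        "lexical_resource",
        "coherence_and_cohesion",
    )

    def overall_band(ratings):
        """Combine one rater's four analytic scores into a single band.

        `ratings` maps each criterion key to a numeric band score.
        The mean of the four scores is rounded to the nearest half band.
        """
        missing = [c for c in ANALYTIC_CRITERIA if c not in ratings]
        if missing:
            raise ValueError("Missing criterion scores: %s" % ", ".join(missing))
        mean = sum(ratings[c] for c in ANALYTIC_CRITERIA) / len(ANALYTIC_CRITERIA)
        return round(mean * 2) / 2  # e.g. a mean of 4.875 becomes 5.0

    # Example: one rater's analytic scores for a single script.
    print(overall_band({
        "task_response": 5.0,
        "grammatical_range_and_accuracy": 4.5,
        "lexical_resource": 5.5,
        "coherence_and_cohesion": 5.0,
    }))  # prints 5.0

In practice the combination rule would come from the test's scoring guidelines; the point here is only that the tidied descriptor table maps directly onto such a structure.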
