INTRODUCTION
Rationale
Currently, the education system places significant emphasis on enhancing students' English communication skills. Caban (2003) highlights that oral interviews provide valuable insights into students' second-language abilities, insights that traditional paper-and-pencil exams cannot capture. Furthermore, these interviews contribute to improving the effectiveness of teaching and assessing communicative language ability. Bachman and Palmer (1996) also demonstrate the high content and face validity of such assessment methods.
The outcomes of oral interviews are significantly influenced by various factors, particularly the complexity of the topics presented during the speaking exam. Research has shown that topic-related issues, such as interviewees' preferences, opinions, and prior knowledge, can create unfair advantages or disadvantages in scoring (Jennings et al., 1999). Bachman (1990) highlighted that background knowledge is a crucial element of the testing environment that affects language test performance. Additionally, Nguyen & Tran (2015) found that background knowledge plays a vital role in influencing test takers' performance, as evidenced by their study involving two hundred and three students and ten English teachers.
According to Papajohn (2002), both the performance of examinees and the interpretation of raters can significantly influence the final scores. Myford and Wolfe (2015) highlight that rater effects are an important variable in the evaluation process. This is particularly relevant in oral interviews, where examiners are responsible for grading.
The scoring process involves a significant degree of subjectivity, as highlighted by Cronbach (1990), who characterized it as a "complex and error-prone cognitive process." This complexity means that there is no assurance that different raters will assign identical grades to the same examinees.
In Vietnam, many students excel in English written tests but struggle significantly with speaking assessments, particularly in secondary schools in Hanoi, where average speaking scores are notably lower than reading, listening, and writing scores. This challenge affects a substantial number of students who find oral presentations particularly difficult. While numerous studies have explored this issue globally, there is a lack of research specifically focusing on Vietnamese students, with no studies addressing the challenges faced by secondary school students in this context.
This research, titled “The Impact of Background Knowledge and Raters on the English Speaking Assessment Results of Gifted Students: A Case Study of Grade 9 Students at a Secondary School in Hanoi,” explores how background knowledge and evaluator influence affect the speaking performance of gifted students.
This study analyzes the existing literature to develop its research questionnaires, concentrating on two primary factors: the background knowledge of test-takers and the influence of raters. The objective is to investigate whether these elements affect the variability of scores among secondary school students.
Statement of research problem & questions
This research investigates the score variations of grade 9 students at Khuong Mai Secondary School during speaking tasks with diverse topics and raters. The study aims to determine the influence of students' background knowledge and raters on speaking test outcomes, thereby enhancing teachers' understanding and guiding them in selecting appropriate topics to improve test reliability. Additionally, the findings encourage students to cultivate their background knowledge alongside language skills, ensuring they are well-prepared for speaking tests despite topic changes.
In brief, the study purported to address the following questions:
(1) To what extent do the students' scores vary with different topics?
(2) To what extent do topic familiarity and topic interest affect students' performance?
(3) To what extent do the students' scores vary with different raters?
Scope of the research
This study primarily focused on the variability of student scores across different topics and raters. Eckes (2011) suggests that while numerous factors impact student scores, background knowledge and the characteristics of raters are the two most significant elements affecting the reliability of assessments.
The study's samples were limited to gifted ninth-grade students at Khuong Mai Secondary School, highlighting the need for caution when generalizing findings to other student populations.
Significance
This study enhances the understanding of factors influencing scoring in the Vietnamese educational context. It highlights the significance of teachers recognizing the impact of background knowledge and raters when assessing students' speaking performances. Consequently, this awareness can lead to more thoughtful topic selection for speaking tests, ultimately improving students' final results.
This research aims to be a valuable reference for both teachers and students at Khuong Mai, while also establishing a foundation for future studies on the same topic.
Organization of the study
The study is composed of 3 main parts:
This first chapter presents basic information such as the statement of the problem, rationale, scope, aims and objectives, as well as the organization of the study.
LITERATURE REVIEW
Communicative language competence
1.1 Definition of communicative language competence
Understanding the current concept of communicative language competence is essential for this study. According to Bachman and Palmer (1996), communicative language competence is defined as the ability to create and interpret discourse. Additionally, Canale and Swain (1980) and Canale (1983) describe communicative competence as a synthesis of the knowledge and skills necessary for effective communication.
In their concept of communicative competence, knowledge refers to an individual's (conscious or unconscious) knowledge about language and about other aspects of language use.
Figure 1. Components of communicative language ability in communicative language use (Bachman, 1990)
In the model of communicative language use proposed by Bachman (1990), there are three components of communicative language ability: language competence, strategic competence, and psychophysiological mechanisms.
As can be seen from Figure 1, strategic competence is placed in the centre of the model. According to Bachman (1990, p. 84), strategic competence refers to the “capacity for implementing the components of language competence in contextualised communicative language use”.
Background knowledge significantly influences language competence, as noted by Bachman and Palmer (1996). This foundational knowledge enables test-takers to communicate effectively about the world around them.
Assessing speaking skills
This section examines speaking skills in particular, covering four key areas: spoken language, the oral assessment process, oral assessment tasks, and the criteria for evaluating oral competence.
Speaking skills are essential in language learning, serving as a primary means of communication. While the definition of speaking is widely understood, scholars offer varying perspectives. Bygate (1987) emphasized that oral language conveys the speaker's intended messages, including ideas, intentions, opinions, and emotions. Fulcher (2003) further elaborated on the nature of speaking, highlighting its complexity and significance in effective communication.
Effective verbal communication is essential for interacting with others, as highlighted by Hedge (2000, cited in Mazouzi, 2013), who emphasizes that speaking skills play a significant role in forming first impressions and evaluating individuals. This underscores the importance of developing speaking abilities in both native and foreign languages.
Byrne (1987) also agreed with Hedge that speaking is a two-way process involving not only speakers and listeners but also the use of both productive (speaking) and receptive (listening) skills, highlighting the interactive nature of dialogue. Thornbury (2005) emphasized that face-to-face communication often leads to direct interaction, whether in monologues or conversations. He identified two critical elements for managing conversations: paralinguistics and turn-taking. Turn-taking refers to the established norm that allows speakers to hold the “floor” and ensures that no two individuals speak simultaneously for extended periods. Understanding these dynamics is essential for effective communication.
Paralinguistics plays a crucial role in communication, encompassing the use of eye gaze and gestures to enhance interaction. Effective communication involves not only verbal language but also non-verbal cues such as eye contact, facial expressions, body language, pauses, tempo, and pitch variation, which together convey emotions and ideas. Thus, speaking is a dynamic, multi-sensory experience that engages both speakers and listeners.
Recent studies in language evaluation have led to the development of various process models that assess second language and speaking performance. These models illustrate the relationships among the different factors influencing examinees' final scores. This research specifically examines the models proposed by Milanovic and Saville (1996) and Fulcher (2003).
2.2.1 Milanovic and Saville's model of oral test performance (1996)
Milanovic and Saville (1996) introduced a model for performance testing that was later recognized by O'Sullivan, Weir, and Saville (2002) as one of the earliest and most comprehensive frameworks for understanding the interactions of elements within these tests.
Figure 2. A conceptual framework for performance testing (Milanovic & Saville, 1996)
In their analysis of this testing model, O'Sullivan et al. (2002) identified five key elements that influence the reliability and validity of assessments: the candidate, the examiner, the assessment criteria, the task, and the interactions among these components. Additionally, Zhao (2013) emphasized three critical processes involved in performance testing, as outlined in Milanovic and Saville's framework.
The first process is the design of a test by a group of examination developers. This reflects the heavy responsibility they must take on during test development to ensure the reliability and validity of the test.
According to Zhao (2013), the second phase is the administration of the test.
The performance of candidates is evaluated by raters during the exam, highlighting that their knowledge and abilities are not the only determinants of their final scores: the testing conditions, tasks, and evaluation criteria also significantly influence candidates' success. O'Sullivan et al. (2002) compiled studies that explore variations in candidate performance based on differing assessment factors; nine methods were employed across these studies, emphasizing the importance of this aspect. Notably, factors like self-esteem, age, and gender were excluded from the model, despite studies such as Romero's (2006) identifying age as a negative influence on interview speaking ability.
The final stage of the assessment process is the rating procedure, in which examiners evaluate test takers' performance on specific tasks against established evaluation criteria. Milanovic and Saville (1996) highlighted the significant impact of these tasks, the evaluation criteria, and the training provided to raters on the overall rating process. Raters' choices may also be influenced by their own experience and skill. In other words, when designing a test, it is important to take all of the above variables into consideration.
2.2.2 Fulcher's model of oral assessment (Fulcher, 2003)
Figure 3. An expanded model of speaking test performance (Fulcher, 2003)
In 2003, Fulcher proposed an oral test performance model to investigate how construct, task, and scale affect candidate scores (Zhao, 2013). He emphasized that rater preparation and characteristics significantly influence the grading outcomes for students. Consequently, the success of candidates is closely tied to the effectiveness of raters in evaluating their performance.
The rating scale and band descriptors also significantly influence the final scores and the assumptions raters make about them, as reflected in the processes of orientation and ranking.
Fulcher (2003) expanded on the model proposed by Milanovic and Saville (1996) by illustrating the interplay between local conditions, task requirements, and candidate performance. He identified several key factors that can mediate a task, including task orientation, interactional relationships, goals, interlocutors, topics, situations, and the specific characteristics of the task in its context.
The model examined various factors that influence examinees during testing, highlighting the importance of task-related skills and knowledge, as noted by Milanovic and Saville (1996). Additionally, Fulcher (2003) expanded on this by incorporating internal factors such as candidates' real-time processing abilities and individual differences, including personality traits, which were overlooked in the initial model.
Background knowledge in oral assessment
Background knowledge, often referred to as prior knowledge, encompasses the accumulated experiences that shape an individual's understanding, described as "abstracted residue" (Schallert, 2002, p. 557). Related concepts such as "topic familiarity" and "topical knowledge" are frequently mentioned, as they are essential for effectively discussing various subjects in speaking assessments. For the purposes of this study, the term "background knowledge" will be used consistently.
3.2 The role of background knowledge in oral assessment
Bachman emphasized the significance of background knowledge in language learning, as noted in section 1.2. Furthermore, in their model of language use and language test performance, Bachman and Palmer (1996) identified strategic competence as a key component, underscoring its central role in effective communication.
Figure 5. Components of language use and language test performance (Bachman & Palmer, 1996)
As can be seen from Figure 5, strategic competence is affected by the other elements in the smaller circle, which represents the characteristics of the test-takers.
Topical knowledge plays a crucial role in language proficiency, as different subjects necessitate varying levels of background understanding. Bachman and Palmer (1996) argue that familiarity with a topic can ease the retrieval of relevant information, while Skehan (1998) highlights that greater background knowledge reduces cognitive demands, facilitating speech production. Consequently, the ease with which examinees engage with specific topics often hinges on their familiarity with those subjects. Fulcher (2003) supports this view, noting that variations in discourse among examinees can be linked to the topics at hand. Therefore, investigating learners' familiarity with diverse topics is essential for understanding their performance.
Among the studies on the effects of topics, several aspects of the topics have been investigated, including topic familiarity and topic interest.
Topic familiarity, as defined by Bui (2014), refers to the prior knowledge individuals possess about a specific subject area. Research by Pulido (2007) demonstrates that this familiarity enhances learners' comprehension of reading materials. Similarly, Othman and Vanathas (2004) affirm that topic familiarity plays a significant role in improving listening comprehension skills.
Research by Ellis (2003) highlights that learners' familiarity with a topic significantly impacts their ability to negotiate meaning. Language learners leverage their existing knowledge to enhance both comprehension and text production. Furthermore, Ellis notes that a learner's inclination to negotiate meaning is closely tied to their familiarity with the subjects of the specific task at hand. Language users and language learners alike use their world knowledge to help them create and comprehend texts.
Learners' interest significantly impacts second language acquisition (SLA), as highlighted by Dörnyei (1994). This interest is shaped by individual personality traits, leading to varied outcomes across different contexts. Ellis and Barkhuizen (2005) emphasize that interest is crucial in determining how information is selected and processed. Additionally, topic interest, which refers to the heightened attention given to specific themes, has been explored by several researchers: Hidi and McLaren (1991) identified topic interest as a form of situational interest, while others, such as Schiefele (1990), have classified it as a type of personal interest.
3.4 Related studies on topic influence on oral assessment
Several studies have examined the impact of topic familiarity on student performance.
Research indicates that background knowledge significantly enhances second language speaking performance. Chang (1999) demonstrated that fluency improves when examinees engage with familiar topics during monologic tasks. Similarly, Kazemi and Zarei (2015) found that Iranian EFL learners benefited from topic familiarity, which positively influenced their oral presentations.
In contrast with Kazemi and Zarei's result, Khabbazbashi (2017), in a study entitled “Topic and background knowledge effects on performance in speaking assessment”, examined how topics and background knowledge influenced spoken performance during oral tests. The findings revealed that the topics presented posed significantly different levels of challenge for two of the three task types assessed. However, these minor variations in topic difficulty did not result in substantial differences in performance outcomes.
A study by Huang, Hung, and Plakans (2016) titled "Topical Knowledge in L2 Speaking Assessment: Comparing Independent and Integrated Speaking Test Tasks" found that topical knowledge significantly impacts candidates' scores in integrated speaking tests, such as those derived from the TOEFL iBT. This highlights the importance of subject familiarity in influencing performance across different speaking assessment formats.
While previous studies have explored the influence of various topics on candidate performance, they often examined multiple task types or primarily emphasized topic effects without delving deeply into a single task type. Consequently, the relationship between different topics within one task type and their effect on candidates' final scores remains underexplored. This study investigates how different topics within a single task type impact student performance.
Raters in oral assessment
The models of oral assessment presented above emphasize the essential role of raters in the evaluation process. Unlike multiple-choice questions, which have a single correct answer, speaking tests introduce complexity because they lack fixed answers, which can introduce subjectivity into scoring. This subjectivity can significantly impact inter-rater reliability and overall test outcomes. This section clarifies the concept of inter-rater reliability and explores the various factors that may influence the scoring process.
4.1 Definition of rater reliability in oral language assessment
Given the differences between raters in assessment, inter-rater reliability should be taken into consideration and is worth investigating.
Many researchers have put forth the same definition of inter-rater reliability.
In "The Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters," Gwet (2014) emphasizes that inter-rater reliability experiments involve two raters independently assessing the same set of objects. A high level of agreement between the raters indicates that they can be considered interchangeable and that the categorization is minimally affected by individual rater biases. In other words, it is the extent to which the categorizations stay the same across raters that defines inter-rater reliability.
Shohamy (1993) questioned the consistency of grading among different raters for the same examinees, raising the important issue of inter-rater reliability, which concerns whether raters assign identical scores to the same candidates. This study adopts Gwet's (2014) definition of inter-rater reliability, as it comprehensively encompasses the various definitions put forth by other researchers.
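To make the notion concrete, the sketch below quantifies agreement between two raters in Python using Cohen's kappa, a common chance-corrected agreement statistic. This is purely illustrative and not part of the present study's procedure; the scores are invented, and scikit-learn is assumed to be available.

```python
# Illustrative only: hypothetical band scores from two raters for the
# same ten candidates (not data from this study).
from sklearn.metrics import cohen_kappa_score

rater1 = [6.0, 5.5, 6.5, 6.0, 5.0, 7.0, 6.0, 5.5, 6.5, 6.0]
rater2 = [6.0, 5.5, 6.0, 6.0, 5.0, 7.0, 6.5, 5.5, 6.5, 6.0]

# Cohen's kappa treats each band as a category and corrects the raw
# percentage agreement for agreement expected by chance.
kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1 suggest near-interchangeable raters
```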
4.2 Factors that affect rating operation
This study follows the framework established by Popham (1990, as cited in Myford & Wolfe, 2015), which identifies several factors influencing raters during the rating process. The primary sources of error in this operation are the rating scales, the procedures employed, and the raters themselves.
The interpretation of rating scales significantly influences raters' decisions during the scoring process. McNamara (1996) highlights the varying interpretations of "discrete rating categories," particularly with a 1-6 rating scale. An illustrative example from his book compares two raters using the same 1-6 scale and the same criteria, whose interpretations of the category boundaries nevertheless differ. Determining which score a candidate should receive when, for example, their performance falls between bands 2 and 3 could then result in low inter-rater reliability (McNamara, 1996).
If rating scales lack clarity and detail, raters may experience confusion due to ambiguous criteria, making it challenging for them to differentiate between the various rating categories (Myford & Wolfe, 2015). As a result, different raters may interpret the rating scales differently, significantly lowering inter-rater reliability.
The rating procedure itself can also introduce rater-related issues. Myford & Wolfe (2015) noted that raters tasked with evaluating numerous performances in a limited timeframe may experience fatigue and boredom, and that grading candidates' performances over multiple days can introduce further complications. McNamara (1996) emphasizes that the time of day during testing can affect raters' alertness and accuracy, with morning sessions potentially yielding more attentive evaluations. Such variations can adversely affect inter-rater reliability.
The final factor influencing scoring is the characteristics of the raters themselves, which can lead to variability in evaluation. According to Cronbach (1990), each rater engages in a "complex and error-prone cognitive process" during scoring. Additionally, McNamara (1996) notes that raters may employ their own systematic scoring tendencies, contributing to what are known as rater effects. This aspect is explored in greater detail later in this study.
4.3 Related studies on inter-rater reliability and rater effects in assessment
Several studies have examined rater effects in language assessment. Engelhard (1994), for example, conducted a study entitled "Examining Rater Errors in the Assessment of Written Composition With a Many-Faceted Rasch Model," which highlights significant disparities in rater severity during writing evaluations. This research emphasizes the impact of rater variability on the assessment process, revealing how differing levels of strictness among evaluators can affect the outcomes of written composition evaluations.
In their 1999 meta-analysis, "Magnitude and moderators of bias in observer ratings," Hoyt and Kerns examined 79 generalizability studies, revealing that rater effects and the interaction between raters and candidates contributed to 37% of the variance in final scores. Similarly, Eckes (2005) explored rater main effects and their relationships with various factors, including examinees, rating criteria, and tasks, in his study "Examining Rater Effects in TestDaF Writing and Speaking Performance Assessments: A Many-Facet Rasch Analysis." The study found that raters displayed internal consistency in their evaluations, yet significant differences in leniency and severity were evident among them. Notably, while previous research examined rater effects on candidate scores, it focused primarily on contexts outside Vietnam, highlighting a gap in the literature regarding this specific context.
METHODOLOGY
Research design
This study is conducted to answer the three following questions:
Question 1: To what extent do the students' scores vary with different topics?
Question 2: To what extent do topic familiarity and topic interest affect students' performance?
Question 3: To what extent do the students' scores vary with different raters?
The research employed both quantitative and qualitative methodologies, utilizing quantitative analysis for data interpretation and qualitative approaches for document analysis during instrument development and discussion. Specifically, the test design required thorough discussions between the two raters prior to selecting the two topics, initially narrowing down from 20 common topics presented in the First Certificate in English (FCE) vocabulary list. Following a successful pilot study, the topics "schooling" and "travelling" were selected, as detailed in section 3.1. The questionnaire design likewise involved extensive discussions, resulting in an initial draft that addressed topic familiarity, topic interest, and students' self-evaluation. However, the two raters noted that grade 9 students lack training in self-evaluation, which could potentially skew the results; consequently, the questionnaire focused on only the first two aspects, as outlined in section 4. The data analysis included comparisons with other researchers' findings, which are discussed in section two of chapter four, "Findings and Discussion."
Research participants
This research focused on twenty gifted grade 9 students from Khuong Mai Secondary School, selected on the basis of their high scores in the annual gifted examination. The choice of grade 9 participants is significant, as these students have completed three years of English education at the school, providing more stable performance than their younger peers in grades 6, 7, and 8 and thereby minimizing the influence of factors other than background knowledge and raters on their scores. Thus, the research specifically targets grade 9 students at Khuong Mai Secondary School.
Purposive sampling was employed to select the two raters for this research, focusing on individuals with specific characteristics beneficial to the study (Etikan, Musa, and Alkassim, 2016). The selection criteria included an undergraduate degree, a TESOL qualification, familiarity with the IELTS criteria, and a minimum of three years of teaching experience. Given that the speaking test followed the IELTS format, it was essential for the raters to meet these criteria to ensure the reliability of the results. The chosen teachers taught English directly at Khuong Mai Secondary School and fulfilled all of the outlined requirements.
Instruments
The speaking test was designed by the English teachers at Khuong Mai Secondary School for the annual examination for gifted students.
Because the speaking test in this study followed the same format as part 3 of the IELTS speaking test, the IELTS criteria were used to assess the students' performance. Because the candidates had spent a year training and practicing with their teachers, all of them were expected to be familiar with the test format.
Part 3 of the IELTS speaking test was selected for this study due to its focus on abstract ideas and topics, distinguishing it from the first part, which centers on personal information. Additionally, the limited two-minute duration of the second part does not adequately evaluate students' speaking abilities, making part 3 the more suitable format for comprehensive assessment.
The purposive sampling method was employed to select the topics for this study from the First Certificate in English (FCE) vocabulary list, focusing on common themes. In a randomized selection process, 20 common topics were placed into a hat, and three topics, anticipated to be among the most prevalent, were drawn as candidate subjects for this research following the significant results of the pilot study.
After consulting with the two participating raters, the topic of “hospital” was removed, as it was determined that many students infrequently visit hospitals, making the task less authentic. Consequently, the two remaining topics, “schooling” and “travelling”, were chosen.
Table 1: The speaking test's questions

Topic 1: Schooling
1. Are there any differences in schools now and in the past?
2. What are the advantages of studying in a good school?
3. What is an ideal school?

Topic 2: Travelling
1. Are there any differences in travelling now and in the past?
2. What are the advantages of travelling?
3. What is an ideal trip?

*The underlined words were changed depending on the topic.
Each topic included three questions, and students answered each topic in about three to four minutes. The questions across the topics were modified to be identical in question type, making the topic word the only part that differs (see Table 1). The full version of the speaking test questions can be found in Appendix 1.
The IELTS speaking band descriptors are widely recognized for their validity and popularity, as noted by Read (2005). They cover four key criteria: fluency and coherence, grammatical range and accuracy, lexical resource, and pronunciation. Each criterion is scored on a scale from 0 to 9, and the overall score is calculated as the average of these four individual scores.
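As a minimal illustration of the scoring scheme just described, the snippet below computes an overall band as the simple average of the four criterion scores. The criterion values are hypothetical, and the unrounded average follows the description above rather than any official IELTS rounding rule.

```python
# Hypothetical criterion scores for one candidate on one topic.
criteria = {
    "Fluency and Coherence": 6.0,
    "Lexical Resource": 5.5,
    "Grammatical Range and Accuracy": 6.0,
    "Pronunciation": 6.5,
}

# Overall score = average of the four criterion scores, as described above.
overall = sum(criteria.values()) / len(criteria)
print(f"Overall band: {overall}")  # 6.0
```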
The questionnaire
Questionnaires were utilized to gather extensive data through a standardized process, ensuring all participants answered the same set of questions (McDonough, 2014). For students, the questionnaires featured four multiple-choice questions (see Appendix 3), with participants rating their responses on a scale from 1 to 5 to indicate their level of agreement.
Questions 1 and 3 determined the familiarity of the two topics to the student participants.
Questions 2 and 4 looked into the participants' interest in each topic.
The questionnaire results were later used to answer research question 2, examining the correlation between students' scores and topic familiarity as well as topic interest.
Data collection procedure
The speaking test and questionnaire were developed and then piloted to evaluate their effectiveness. The raters provided feedback on any ambiguities within the questions and offered suggestions for improvement. The teachers' insights played a crucial role in clarifying and refining the questions.
The speaking test took place in a classroom at Khuong Mai Secondary School and lasted about 3 hours in the morning, from 9 a.m. to 12 p.m. Answering all the questions took each candidate about 12 to 15 minutes. Participants signed a consent form allowing their performances to be recorded exclusively for research purposes, and they were assigned numbers from 1 to 20 based on their order in the study. At the conclusion of the speaking test, each student received two separate scores for each topic, which were then compiled for analysis. Following the speaking test, the students completed the questionnaire.
Prior to the grading process, the raters received an email containing the IELTS marking rubric and scoring instructions, allowing them to familiarize themselves with the study. All relevant issues were addressed and clarified to ensure a smooth evaluation.
After the test, the raters also sat for the questionnaire.
This research followed the steps below to collect the data:
(1) Designing the questions for the speaking test
(3) Designing the questionnaires to interview students
(5) Delivering the questions to the teacher, then administering the test and collecting the data
(6) Collecting the data from the students
(8) Interpreting the results of the study
Data analysis procedure
In this step, the SPSS statistics software was utilized to analyse the collected data. The statistical techniques involved were:
- Descriptive statistics: Frequencies command
- T-Test: Paired-samples T-Test
- Bivariate Correlations: Kendall's tau-b correlation
- Bivariate Correlations: Pearson correlation

Table 2: Techniques of data analysis

Research question | Technique | Purpose
- | Descriptive statistics: Frequencies command | Checking the suitability of the test for the students' level
1 | T-Test: Paired-samples T-Test | Comparing the students' scores with different topics
2 | Bivariate Correlations: Kendall's tau-b correlation | Examining the correlation between students' scores and topic familiarity as well as topic interest
3 | Bivariate Correlations: Pearson correlation | Checking whether the two raters could agree on how to grade the students' performances
3 | T-Test: Paired-samples T-Test | Comparing the students' scores with different raters
In the descriptive statistics technique, the Frequencies command was selected to determine the mean, median, mode, range, standard deviation, variance, and minimum and maximum scores.
This research employed descriptive statistics to evaluate the speaking competence of 20 candidates across two distinct topics, with assessments conducted by two raters.
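For readers without SPSS, the same descriptive statistics can be reproduced in a few lines of Python; the sketch below uses pandas on placeholder scores, not the study's data.

```python
import pandas as pd

# Placeholder overall scores for 20 candidates on one topic from one rater.
scores = pd.Series([6.0, 5.5, 6.5, 6.0, 5.5, 6.0, 6.5, 5.5, 6.0, 6.5,
                    6.0, 5.5, 7.0, 6.0, 6.5, 5.5, 6.0, 6.5, 6.0, 6.0])

# The same figures SPSS's Frequencies command reports:
print("mean:    ", scores.mean())
print("median:  ", scores.median())
print("mode:    ", scores.mode().tolist())
print("range:   ", scores.max() - scores.min())
print("std dev: ", scores.std())   # sample standard deviation
print("variance:", scores.var())   # sample variance
print("min/max: ", scores.min(), scores.max())
```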
Figure 6. IELTS and the CEFR ("Common European Framework", n.d.)
The validity of the speaking test was assessed by comparing the average scores to the CEFR band scores, ensuring the test aligns with the students' proficiency levels Given that students are expected to reach B2 proficiency by the end of the academic year, average scores between 5.5 and 6 suggest that the test is appropriately challenging, indicating that candidates can effectively respond to the test without it being overly difficult or easy.
The study utilized a paired-samples T-Test to analyze the data, as the same students responded to both topics and were evaluated by both raters, meaning that the variables were not independent. This method was specifically chosen to address research questions 1 and 3.
To answer research questions 1 and 3, four paired-samples T-Tests were conducted for four pairs of data in this paper.
Research question 1: To what extent do the students' scores vary with different topics?
Pair 1: “Topic1Rater1” and “Topic2Rater1”: two topics which were graded by the first rater
Pair 2: “Topic1Rater2” and “Topic2Rater2”: two topics which were graded by the second rater
Research question 3: To what extent do the students' scores vary with different raters?
Pair 3: “Topic1Rater1” and “Topic1Rater2”: Topic 1 which was graded by both raters
Pair 4: “Topic2Rater1” and “Topic2Rater2”: Topic 2 which was graded by both raters
Three important statistics reported in the results of a paired-samples T-Test are:
• “t”: the T-test statistic, the ratio of the mean of the paired differences to the standard error of those differences, i.e. t = d̄ / (s_d / √n). This statistic compares the sample mean difference against the null hypothesis: as the differences increase, the absolute value of t also rises, indicating a greater deviation from the null hypothesis.
• “df”: the degrees of freedom for the test (the number of pairs minus one).
• “Sig. (2-tailed)”: the p-value. In hypothesis testing, the null hypothesis posits that the true mean difference between the two samples is zero, suggesting that any observed differences are due to random variation. The paired-samples T-Test uses the p-value to assess statistical significance: a p-value below 0.05 indicates that the null hypothesis can be rejected, implying that the observed differences are unlikely to be due solely to chance. Rejecting the null hypothesis leads to the acceptance of the alternative hypothesis, which asserts that the true mean difference between the samples is not zero (Rouder et al., 2009).
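To illustrate how these three statistics arise in practice, here is a minimal sketch of a paired-samples T-Test in Python using scipy; the two score lists are placeholders standing in for one rater's scores on the two topics.

```python
from scipy import stats

# Placeholder scores for the same 20 students on two topics, one rater.
topic1 = [6.0, 6.5, 5.5, 6.0, 6.5, 6.0, 5.5, 6.5, 6.0, 6.0,
          6.5, 6.0, 5.5, 6.0, 6.5, 6.0, 6.0, 6.5, 5.5, 6.0]
topic2 = [5.5, 6.0, 5.5, 6.0, 6.0, 5.5, 5.5, 6.0, 6.0, 5.5,
          6.0, 6.0, 5.5, 5.5, 6.5, 6.0, 5.5, 6.0, 5.5, 6.0]

# ttest_rel performs the paired-samples T-Test on the per-student
# differences, so the pairing between the two lists is preserved.
t, p = stats.ttest_rel(topic1, topic2)
df = len(topic1) - 1  # degrees of freedom: number of pairs minus one

print(f"t({df}) = {t:.3f}, two-tailed p = {p:.3f}")
# p < 0.05 rejects the null hypothesis that the mean difference is zero.
```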
6.3 Bivariate Correlations: Kendall's tau-b Correlation and Pearson Correlation
The correlation coefficient, denoted "r", measures the strength of the relationship between two variables, with values ranging from -1 to +1 (Haining, 2010). A value of ±1 signifies a perfect association, while values closer to 0 indicate a weaker relationship. Additionally, the sign of "r" reveals the direction of the correlation: a positive value means both variables move in the same direction, while a negative value indicates that as one variable increases, the other decreases.
The detailed interpretation of this value by Zou, Tuncali & Silverman (2003) can be found below.

Table 3: Interpretation of Correlation Coefficient: direction and strength of correlation (Zou, Tuncali, & Silverman, 2003)
In this research, the degree of the relationship between two variables was measured by Kendall's tau-b correlation to examine the correlation between students' scores and topic familiarity as well as topic interest. Kendall's tau-b (τb) correlation coefficient is a nonparametric statistic used to assess the strength and direction of the association between two variables measured on at least an ordinal scale, making it particularly suitable for relating scores to familiarity and interest ratings.
This study employed the Pearson correlation to assess the strength of the relationship between two variables, specifically to determine the level of agreement between the two raters in grading student performances. Consistent results would validate the scores for subsequent analysis, ensuring their reliability.
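Both correlation analyses can be sketched in a few lines with scipy (placeholder data, not the study's). Note that scipy's kendalltau defaults to the tau-b variant, which corrects for ties in ordinal data such as 1-5 Likert ratings.

```python
from scipy import stats

# Placeholder data for ten students: overall scores, 1-5 Likert
# familiarity ratings, and the two raters' scores for the same students.
scores      = [6.0, 5.5, 6.5, 6.0, 5.0, 7.0, 6.0, 5.5, 6.5, 6.0]
familiarity = [4,   3,   5,   4,   2,   5,   3,   3,   4,   4]
rater1      = [6.0, 5.5, 6.5, 6.0, 5.0, 7.0, 6.0, 5.5, 6.5, 6.0]
rater2      = [6.0, 5.5, 6.0, 6.0, 5.0, 7.0, 6.5, 5.5, 6.5, 6.0]

# Kendall's tau-b: association between scores and ordinal familiarity.
tau, p_tau = stats.kendalltau(scores, familiarity)
print(f"Kendall's tau-b = {tau:.2f}, p = {p_tau:.3f}")

# Pearson correlation: agreement between the two raters' scores.
r, p_r = stats.pearsonr(rater1, rater2)
print(f"Pearson r = {r:.2f}, p = {p_r:.3f}")
```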
Following the assessment, qualitative data were collected from both students and raters to evaluate their perceptions of student performance across the topics. These data were analyzed in detail and compared to address the first and third research questions.
Chapter summary
This chapter has outlined the three research questions and the criteria for selecting participants, along with the primary data collection instrument, the questionnaire. It detailed the steps taken to gather student scores and rater opinions, and described the data analysis procedures, which include Pearson correlation, descriptive statistics, and paired-samples T-Tests. The next chapter presents the study's findings and discussion.
FINDINGS AND DISCUSSION
Findings
To assess the speaking competence of 20 candidates in this study, descriptive statistics were conducted, and the findings are presented below to enhance the validity of the speaking test.
Table 4: Descriptive statistics for the scores of the two topics by rater 1
The table presents the scores for the two topics assessed by rater 1 across five criteria: Fluency and Coherence, Lexical Resource, Grammatical Range and Accuracy, Pronunciation, and Overall. Notably, both topics received an identical average pronunciation score of 6.13, the highest among the criteria. Topic 1's lowest score is in Grammatical Range and Accuracy at 6.05, while topic 2's lowest score is in Fluency and Coherence at 5.73. Overall, topic 1 outperformed topic 2 in Fluency and Coherence, Lexical Resource, and Grammatical Range and Accuracy, resulting in overall scores of 6.18 for topic 1 and 5.95 for topic 2.
Table 5: Descriptive statistics for the scores of the two topics by rater 2
The table presents scores across the same five criteria, Fluency and Coherence, Lexical Resource, Grammatical Range and Accuracy, Pronunciation, and Overall, as evaluated by rater 2 for the two topics. Notably, Pronunciation again received the highest scores for both topics, at 6.13 for topic 1 and 6.1 for topic 2. Conversely, the lowest score for topic 1 was in Lexical Resource at 5.98, while topic 2's lowest score was in Fluency and Coherence at 5.7. Unlike rater 1, rater 2 assigned higher scores to topic 1 across all criteria, resulting in an overall score of 6.13 for topic 1 compared to 5.95 for topic 2.
Table 6: Descriptive statistics for students' overall speaking competence
The table provides a comprehensive overview of student performance across the two topics as evaluated by the two raters. Notably, both raters assigned a median score of 6 for both topics. The first rater's mean scores were 6.175 and 5.95, while the second rater's means were 6.1225 and 5.95. Additionally, the score ranges for both raters on the same topic were identical, highlighting the consistency of their evaluations.
For topic 1, the standard deviation of scores was slightly lower for rater 1, at 0.4375, compared to rater 2's 0.4552. Conversely, for topic 2, rater 1 exhibited the higher standard deviation at 0.3591, while rater 2's was lower at 0.3204.
In summary, the average score of approximately 6 indicates that the speaking test was appropriately aligned with the candidates' abilities. Consequently, the test demonstrates content validity and can serve as a solid foundation for answering the research questions.
1.2 Research question 1: The variation of the students' scores under the influence of different topics
To analyze the differences in student scores across topics, two sets of scores evaluated by the same rater were compared. The findings for the first pair, labeled “Topic1Rater1” and “Topic2Rater1”, are illustrated in Tables 7 and 8.
Table 7: Paired Samples Statistics for the two topics by Rater 1
The table presents the descriptive statistics for the scores assessed by the first rater, indicating a higher mean score for topic 1 (6.175) than for topic 2 (5.95). It also highlights the differing dispersion of scores: topic 1's standard deviation of 0.4375 suggests greater variability than topic 2's 0.3591. The number of examinees for both topics was identical.
Table 8: Paired Samples Test Result for the two topics by Rater 1
Table 8 shows that the absolute value of the t-statistic was 3.943 with 19 degrees of freedom, and the corresponding two-tailed p-value for pair 1 was 0.001.
The findings indicate a significant difference in candidate performance due to the topic change, with a p-value of 0.001, well below the 0.05 threshold. Such a low p-value means that a difference this large would be very unlikely if the true means were equal, leading to the rejection of the null hypothesis. Specifically, for rater 1's assessment of the candidates, t(19) = 3.943, supporting the conclusion that the change of topic significantly influenced performance.
The sign of the t-value reflects the direction of the mean difference, whose absolute value was 0.225 for pair 1. This indicates that, on average, topic 1 scored 0.225 points higher than topic 2 when both topics were evaluated by rater 1.
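The reported p-value can be checked directly from the t-statistic and degrees of freedom; a quick sketch using the t-distribution's survival function in scipy:

```python
from scipy import stats

# Two-tailed p-value for pair 1: |t| = 3.943 with df = 19.
p = 2 * stats.t.sf(3.943, df=19)
print(f"p = {p:.4f}")  # about 0.0009, consistent with the reported 0.001
```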
Next, another paired-samples T-Test was conducted to compare the scores of the test-takers answering the two topics as graded by the second rater. The result of the second pair calculation, with the two variables “Topic1Rater2” and “Topic2Rater2”, is described in Tables 9 and 10.
Table 9: Paired Samples Statistics for the two topics by Rater 2
Table 9 presents the descriptive statistics for the scores assigned by the second rater, revealing that topic 1 received a higher mean score of 6.125 compared to topic 2's mean of 5.95. The dispersion of scores also differed, with topic 1 exhibiting a standard deviation of 0.4552 and topic 2 a standard deviation of 0.3204. Notably, there were no missing data for either topic.
Table 10: Paired Samples Test Result for the two topics by Rater 2
Table 10 presents the t-statistic for pair 2, showing a value of t = 2.333 with 19 degrees of freedom and a significance (2-tailed) value of p = 0.031. Since the p-value is below 0.05, there is a statistically significant difference between the two variables.
The null hypothesis is therefore rejected, as the p-value of 0.031 is below the 0.05 threshold. This indicates that the change in topic significantly impacted the performance of the candidates as evaluated by the second rater.
The sign of the t-value signifies the direction of the mean difference, whose absolute value for pair 2 was 0.175. This indicates that, on average, topic 1 received a score 0.175 higher than topic 2, based on evaluations by rater 2.
The analysis of pair 1 and pair 2 revealed significant differences between the variables, as demonstrated by the paired-samples T-Tests. This indicates that changes in topic influenced the scores assigned to candidates by both raters. Consequently, it can be concluded that students' scores are impacted by variations in the topics presented.
1.2.1 Research question 2: The correlation between students' scores and topic familiarity as well as topic interest
Discussion
2.1 The impact of background knowledge on the oral output of students
The findings indicate that background knowledge significantly influences students' oral output, aligning with Nahal Khabbazbashi's (2017) findings, which revealed that low background knowledge presents challenges for test takers, while high levels enhance performance. Similarly, Mohammad Bagher Shabani (2013) emphasizes that familiarity with a topic boosts speaking ability: the more knowledge students have, the better they can articulate their thoughts. Providing learners with both background and systemic knowledge equips them with the essential information to discuss unfamiliar topics effectively. This study underscores the importance of topic familiarity in enhancing speaking skills, as the background knowledge students possess enables them to integrate new information with their existing understanding, leading to more successful speaking tasks.
Participant 12 suggested that he was relaxed and came up with more ideas when answering the familiar topic, whereas his performance on the unfamiliar topic was disappointing because of insufficient background knowledge. Similarly, participant 19 shared that it was easier to answer the questions related to schooling, so she felt less anxious articulating her ideas; the travelling topic, in contrast, negatively impacted her performance, as she was not well prepared. This concurs well with Ellis's findings (2007), which emphasise the positive impact of topic familiarity on students' self-confidence.
In summary, the results of the study support the importance of background knowledge in the speaking test. Therefore, it is recommended that, along with enhancing language proficiency, students be motivated to expand their background knowledge. This preparation enables them to excel in speaking tests, regardless of topic variations.
2.2 The correlation between students' scores and topic familiarity as well as topic interest
2.2.1 The correlation between students' scores and topic familiarity
The research indicates that topic familiarity significantly influences students' oral output, aligning with Skehan (1998), who argues that familiarity allows easier information retrieval, reducing cognitive demands and enhancing speech production. Additionally, factors such as foreign language anxiety can negatively impact the oral performance of English as a second language speakers (Woodrow, 2006); familiarity with the speaking topics may alleviate this anxiety, potentially improving students' grades. Moreover, rater comments reported by Khabbazbashi (2015) reveal that students often struggle when faced with unfamiliar topics, further underscoring the importance of topic familiarity in effective communication.
Candidates often feel unprepared, confused, or even intimidated when faced with unfamiliar topics during oral language assessments, highlighting the importance of topic familiarity. Additionally, their inability to deploy complex grammar or vocabulary while expressing a lack of knowledge underscores how strongly the subject matter can affect the evaluation of their speaking skills.
2.2.2 The correlation between students' scores and topic interest
The study found a significant relationship between students' topic interest and their oral output, aligning with Nana Nurjanah's (2011) research, which indicated that students' interest correlates with their speaking scores. Nurjanah emphasized that a lack of interest in learning speaking skills leads to difficulties in lesson comprehension. Conversely, students with high interest are more engaged in learning, attend classes more frequently, and ultimately achieve better speaking scores. Anita (2020) also stated that there is a correlation between students' interest and their speaking scores.
2.3 The impact of raters on the oral output of students
The findings highlighted that no significant effect of raters on the students' performances was detected. This result differs from some previous findings.
In their 1999 meta-analysis, "Magnitude and moderators of bias in observer ratings," Hoyt and Kerns examined 79 generalizability studies and found that rater effects, along with the interaction between raters and candidates, contributed to 37% of the variance in final scores. In the present study, by contrast, the high inter-rater reliability suggests that the two raters can be considered interchangeable, indicating that the categorization was minimally influenced by rater factors (Gwet, 2014). Ultimately, inter-rater reliability is defined by the consistency of categorizations across different raters.
Chapter summary
This chapter has addressed all three research questions through the calculation and analysis of the collected data. The descriptive statistics highlighted the speaking competence of the test-takers, confirming the validity of the speaking test. Notably, the change in topics significantly impacted student scores, answering research question 1. Additionally, Kendall's tau-b revealed that both topic familiarity and topic interest influence final scores. A Pearson correlation demonstrated a significant correlation between the scores given by the two raters, and the subsequent T-Tests confirmed that different raters had no influence on student scores. The concluding chapter summarizes the overall findings and provides recommendations for future research.