TOEIC® scores are consistent and reliable.
Evidence: The research in this section demonstrates how TOEIC Program Research helps to ensure that scores are not improperly influenced by aspects of the testing procedure that are unrelated to language ability. When examining score consistency or reliability, there are multiple aspects of the testing procedure that are considered, including:
- test items (internal consistency)
- test forms (equivalence)
- test occasions or administrations (stability)
- raters (inter- and intra-rater reliability)
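The first of these facets, internal consistency, is commonly summarized with Cronbach's alpha, which compares the variance of individual items to the variance of total scores. The sketch below computes it for a small invented item-score matrix; the data are hypothetical and purely illustrative, not TOEIC data.

```python
from statistics import pvariance

# Hypothetical item-level scores: rows = test takers, columns = items.
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

def cronbach_alpha(matrix):
    """Internal consistency: alpha = k/(k-1) * (1 - sum(item vars) / total var)."""
    k = len(matrix[0])
    item_vars = [pvariance([row[i] for row in matrix]) for i in range(k)]
    total_var = pvariance([sum(row) for row in matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

print(round(cronbach_alpha(scores), 3))  # 0.973 for this toy data set
```

Values near 1 indicate that the items behave consistently, i.e., test takers who score high on one item tend to score high on the others.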
How ETS Scores the TOEIC® Speaking and Writing Test Responses
Typically, human raters score Speaking and Writing tests because they can evaluate a broader range of language performance than automated systems. This paper describes how ETS ensures the reliability and consistency of scores assigned by human raters for the TOEIC® Speaking and Writing tests through training, certification, and systematic administrative and statistical monitoring procedures.
To qualify, raters must meet specific requirements, complete a systematic training course, and pass a certification test. Before every rating session, raters must also pass a calibration test that assesses their readiness to score on that day. During the scoring process itself, raters are monitored by scoring leaders, who provide real-time feedback and support. The rating process uses ETS's patented Online Scoring Network (OSN) system, which also helps ensure that each test taker receives multiple (up to nine) independent and anonymous ratings. This feature of the TOEIC scoring process minimizes the halo effect (bias) and human error. As a whole, the TOEIC scoring process helps generate more accurate and consistent test scores by minimizing potential sources of error and bias.
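The statistical benefit of pooling multiple independent ratings can be illustrated with a small simulation. All numbers below are invented for illustration (they are not actual TOEIC rubric values or rater-error parameters): each rating is modeled as the test taker's true score plus independent rater error, and averaging several such ratings shrinks the error roughly by the square root of the number of raters.

```python
import random
from statistics import pstdev

random.seed(42)

TRUE_SCORE = 3.0       # hypothetical "true" proficiency on a 0-5 rubric
RATER_NOISE_SD = 0.5   # hypothetical spread of individual rater error
N_RESPONSES = 2000

def one_rating(true_score):
    """One independent, anonymous rating = true score + rater error."""
    return true_score + random.gauss(0, RATER_NOISE_SD)

# Score each response once vs. averaging five independent ratings.
single   = [one_rating(TRUE_SCORE) for _ in range(N_RESPONSES)]
averaged = [sum(one_rating(TRUE_SCORE) for _ in range(5)) / 5
            for _ in range(N_RESPONSES)]

print(round(pstdev(single), 2))    # close to RATER_NOISE_SD
print(round(pstdev(averaged), 2))  # close to RATER_NOISE_SD / sqrt(5)
```

The averaged scores cluster much more tightly around the true score, which is the intuition behind collecting multiple independent ratings per test taker.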
Monitoring TOEIC® Listening and Reading Test Performance Across Administrations Using Examinees' Background Information
The scoring process for the TOEIC® Listening and Reading test includes monitoring procedures that help ensure that scores are consistent across different test forms and test administrations, and that skill interpretations are fair. This study explores the possibility of using information about test takers' backgrounds to enhance several types of monitoring procedures. Results of the analyses suggested that some background variables may facilitate the monitoring of test performance across administrations, thereby strengthening quality control procedures for the TOEIC Listening and Reading test as well as the evidence of score consistency.
Monitoring Individual Rater Performance for the TOEIC® Speaking and Writing Tests
This paper describes procedures implemented on the TOEIC Speaking and Writing tests for monitoring individual rater performance and enhancing overall scoring quality. These multifaceted, carefully developed procedures help ensure that the potential for human error is kept to a minimum, thereby contributing to the TOEIC tests' scoring consistency and reliability.
Alternate Forms Test-Retest Reliability and Test Score Changes for the TOEIC® Speaking and Writing Tests
The reliability or consistency of scores can be examined in a variety of ways, including the degree to which scores for the same test taker are consistent across different test forms ("equivalent-forms reliability") and across different occasions of testing ("test-retest reliability"). This study examined the consistency of TOEIC Speaking and Writing scores across different test forms at different time intervals (e.g., 1–30 days, 31–60 days) and found that test scores had reasonably high alternate-forms test-retest reliability. This indicates that test takers' scores are consistent across different test forms and on different occasions, providing additional evidence of the consistency and reliability of TOEIC test scores.
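Test-retest and equivalent-forms reliability are typically quantified as the correlation between the two sets of scores. The sketch below computes a Pearson correlation for a hypothetical pair of administrations; the scores are invented for illustration, not taken from the study above.

```python
from math import sqrt

# Hypothetical scaled scores for five test takers on two alternate
# forms taken a few weeks apart (values invented for illustration).
form_a = [120, 150, 180, 90, 160]
form_b = [125, 145, 175, 100, 155]

def pearson_r(x, y):
    """Pearson correlation: the usual index of test-retest reliability."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

print(round(pearson_r(form_a, form_b), 3))  # 0.997 for this toy data
```

A correlation near 1 means test takers keep nearly the same rank order and score level across forms and occasions, which is what "reasonably high" reliability refers to.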
Statistical Analyses for the TOEIC® Speaking and Writing Pilot Study
This paper reports the results of a pilot study that contributed to TOEIC Speaking and Writing test development. The analysis of the reliability of test scores found evidence of several types of score consistency, including inter-rater reliability (agreement among several raters on a score) and internal consistency (a measure based on the correlations between items on the same test). The correlational analysis found evidence that each test section measured three distinct claims about Speaking or Writing ability, as intended. These results informed the final specifications for the TOEIC Speaking and Writing tests, in addition to providing evidence of score reliability and the validity of score interpretations.
Field Study Results for the Redesigned TOEIC® Listening and Reading Test
This paper describes the results of a field study for the 2006 redesigned TOEIC Listening and Reading test, including analyses of item and test difficulty, reliability, and correlations between sections of the redesigned and classic TOEIC Listening and Reading tests. Results are consistent with a separate comparability study (Liao, Hatrak, & Yu, 2010), which found evidence of the reliability of the redesigned test and suggested that scores on the redesigned test could be interpreted and used in similar ways to classic TOEIC Listening and Reading test scores. This provides evidence for the reliability and consistency of TOEIC Listening and Reading test scores, as well as for the validity and fairness of score interpretations comparable to those of the previous version.
Comparison of Content, Item Statistics, and Test Taker Performance on the Redesigned and Classic TOEIC® Listening and Reading Test
This paper compares the content, reliability, and difficulty of the classic and 2006 redesigned TOEIC Listening and Reading tests. Although the redesigned tests included slightly different item types to better reflect current models of English-language proficiency, the tests were judged to be similar across versions. The results provide evidence for the reliability and consistency of the redesigned TOEIC Listening and Reading tests, and suggest that scores on the redesigned tests can be meaningfully interpreted and used to make decisions in the same ways as classic test scores.