Yes, statistical analysis is the single most powerful tool available for detecting cheating on tests. You can learn more about this process in our ultimate guide to data forensics, but in short, trained psychometricians know how to use data forensic analyses to detect and pinpoint who may have cheated on an exam.
Despite the powerful capabilities of data forensics, many testing programs are still uncomfortable using “cheating statistics” to make inferences about whether dishonesty has occurred on their exams. In this article, I will discuss a few of the reasons you should feel confident drawing such inferences from your testing data.
As a statistician, I admit to having particular ideas about data and test scores. Some of these ideas are not generally accepted, and they may not be popular. However, using statistics to detect problems with a test administration seems natural and reasonable to me. It would be inconsistent to accept test scores as valid and reliable while refusing to use test result data to make inferences about the quality of the test administration. Here's why: the very act of administering a test and obtaining a test score is a statistical procedure carried out with the intent of making a statistical inference.
When we give tests, we are not interested in the test taker’s performance on the particular questions that happened to be presented. Instead, we want to infer, or estimate, the test taker’s knowledge or competence in the tested domain. Making such an inference implicitly acknowledges that the test score is a statistical measure that is subject to uncertainty: had a different set of questions been presented, the test scores would no doubt have been different.
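To make that uncertainty concrete, here is a minimal sketch in Python of the standard error of measurement from classical test theory and the confidence band it implies around an observed score. The numbers are illustrative assumptions, not figures from any particular program:

```python
import math

# Illustrative assumptions: a test whose scores have a standard
# deviation of 10 points and a reliability of 0.90.
score_sd = 10.0
reliability = 0.90

# Classical test theory: SEM = SD * sqrt(1 - reliability)
sem = score_sd * math.sqrt(1 - reliability)

observed_score = 75
# An approximate 95% confidence band for the test taker's true score
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% band: {low:.1f} to {high:.1f}")
```

A parallel form built from different questions could easily yield a score anywhere in that band, which is exactly the uncertainty described above.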
If you do not accept that perspective, you may not accept the corollary I present next. Even so, I will stipulate that the best and most reliable record of the testing session is the actual set of recorded responses (along with any other measurements that can be obtained, such as answer changes, answer similarities, and response times). These data are more reliable than proctoring observations, video recordings, or any other externally derived measure of the testing session. If you trust the recorded responses enough to calculate a test score and make decisions about a test taker’s future, you should be equally comfortable using those same responses to make inferences about the quality of the testing session and whether testing irregularities may have occurred.
Because many statistical techniques can appear arcane or even “mystical,” the statistician must be careful to select techniques that are grounded in sound statistical principles. A statistic is most easily defended if it is derived from a probability model that describes the behavior being observed and if it provides objective probability statements about the extremeness of any observation. These criteria are rather stringent, and they naturally exclude many of the techniques that researchers have investigated.
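As a toy illustration of what an objective probability statement looks like, suppose two test takers give the same wrong answer on 12 of 20 items, and suppose (unrealistically, for simplicity) that under independent responding each item carries a fixed 0.10 chance of producing a matching wrong answer. The extremeness of the observation is then an exact binomial tail probability. This is my simplified example, not a production model:

```python
from math import comb

def binomial_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy null model: 20 items, each with an assumed 0.10 chance of a
# matching wrong answer under independent responding.
print(binomial_tail(20, 12, 0.10))  # roughly 6e-8: objectively extreme
```

The toy model itself is not defensible; the point is that the statistic arrives with an exact probability attached, which is what the criteria above demand.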
For example, person-fit statistics are ideal for describing whether a test taker’s response pattern is consistent with normal test-taking behavior (here at Caveon, we use the word “aberrant” to describe inconsistent response patterns). However, even though there is considerable literature on person-fit statistics, no researcher has yet published a way to make objective probability statements about aberrant test taking. Without statistically sound inferential models, the practitioner must devise ad hoc methods that are empirically derived from the data at hand. There are two problems with that: (1) the judgment of what constitutes an extreme observation is subjective and may vary from situation to situation, and (2) the modeling technique itself is not easily defended or replicated. I believe these two problems are among the fundamental reasons why test administrators have been uncomfortable using statistics to make inferences about cheating in the past.
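To show what a person-fit statistic looks like in practice, here is a minimal sketch of the classic standardized log-likelihood statistic lz (Drasgow, Levine, and Williams, 1985), offered as an illustration rather than as Caveon's algorithm. Notably, the null distribution of lz is only approximately standard normal, which is precisely the kind of gap in objective probability statements described above:

```python
import math

def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit statistic (lz).

    responses: list of 0/1 scored item responses
    probs:     model-implied probabilities of a correct response,
               e.g. from a fitted Rasch or 2PL model (assumed given)
    """
    l0 = sum(u * math.log(p) + (1 - u) * math.log(1 - p)
             for u, p in zip(responses, probs))
    mean = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    var = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - mean) / math.sqrt(var)

# Hypothetical pattern: the two easiest items are missed while the
# hardest items are answered correctly, an aberrant-looking record.
probs = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20]
responses = [0, 0, 1, 1, 1, 1, 1, 1]
print(lz_statistic(responses, probs))  # about -4.4: strongly aberrant
```

Large negative values of lz signal misfit, but translating a value like -4.4 into a defensible probability is the hard part the paragraph above describes.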
At Caveon, our Data Forensics℠ team has worked hard to create algorithms capable of computing probabilities for the statistics we use in our analyses. Part of that work involves understanding the probability models involved and the assumptions that underlie them.
For example, “answer-copying” statistics based on the idea of similarity and excess similarity should be derived from probability models. One such example is the class of answer-copying statistics presented by van der Linden and Sotaridona (2006), “Detecting answer copying when the regular response process follows a known response model,” Journal of Educational and Behavioral Statistics, 31(3), 283–304. In deriving the probability model for the number of identical responses (the statistic of interest), the authors assume that the tests are taken independently.
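In that spirit, here is a minimal sketch of how such a probability can be computed once a response model supplies, for each item, the probability that two test takers would independently give identical answers. The per-item probabilities below are made-up placeholders; in the paper they are derived from the assumed response model:

```python
def match_tail_probability(match_probs, observed_matches):
    """Exact P(M >= observed_matches), where M is the number of items
    on which two test takers give identical answers: a sum of
    independent Bernoulli variables with item-specific probabilities
    (a compound binomial), computed by dynamic programming."""
    dist = [1.0]  # dist[k] = P(exactly k matches over items seen so far)
    for p in match_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1 - p)  # this item does not match
            nxt[k + 1] += q * p    # this item matches
        dist = nxt
    return sum(dist[observed_matches:])

# Hypothetical model-implied per-item match probabilities, 10-item test.
match_probs = [0.42, 0.35, 0.51, 0.28, 0.44, 0.39, 0.31, 0.47, 0.36, 0.40]
print(match_tail_probability(match_probs, 10))  # all 10 identical: ~8e-5
```

The answer-copying statistics we actually use are more elaborate, but they share this structure: a null model of independent responding, and an exact tail probability for the observed similarity.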
We have implemented person-fit statistics (for detecting aberrance), similarity statistics (for detecting collusion, test coaching, answer copying, and proxy test taking), erasure statistics (for detecting test tampering), gain-score statistics (for detecting unusual learning patterns), and response latency statistics (for detecting content exposure), and we continue to explore other statistics as well. You can learn more about the questions your data can answer with the right fraud detection capabilities here.
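To give a flavor of one more of these, here is a minimal sketch of a gain-score screen: flag retakers whose score gain is extreme relative to the population of gains. This is a deliberate simplification (it assumes roughly normal gains and keeps the outlier in the baseline), not the method behind our gain-score statistics:

```python
from statistics import mean, stdev

def flag_extreme_gains(first_scores, second_scores, z_cut=2.5):
    """Return indices of retakers whose score gain is an outlier.

    first_scores / second_scores: parallel lists of scores for test
    takers who tested twice. In practice the baseline should exclude
    the case under test or use robust estimates.
    """
    gains = [b - a for a, b in zip(first_scores, second_scores)]
    m, s = mean(gains), stdev(gains)
    return [i for i, g in enumerate(gains) if (g - m) / s >= z_cut]

# Hypothetical retake data: most gains are modest; one jumps 40 points.
first = [52, 60, 48, 55, 63, 50, 58, 61, 47, 54]
second = [58, 64, 55, 60, 66, 90, 62, 65, 52, 59]
print(flag_extreme_gains(first, second))  # -> [5]
```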