Posted by David Foster, Ph.D.
Simply put, computerized adaptive testing (CAT) is a computer-based exam that uses special algorithms to tailor test question difficulty to each individual test taker. A computer adaptive test means the exam adapts in real time to the test taker’s ability level and provides test questions accordingly. It is one form of secure exam design you can use to protect your test content from being exposed and prevent test takers from cheating. CAT allows tests to be administered more quickly, with fewer items, and with increased security. (You can learn more about how computerized adaptive tests work in this section.) To better understand CAT, let’s look at the origins of adaptive testing.
An adaptive test, as the name suggests, adapts or tailors exam questions in real time to the ability of each test taker. This eventually results in a different set of test questions for each person. The test adapts based on how well the test taker answers earlier questions. When a test taker answers most of the questions presented correctly, more difficult questions are chosen and given. When the test taker cannot answer previous questions correctly, easier questions are presented. After a relatively small number of questions, which might be different for each person, the test is able to stop and provide a score. The score is not based on the number of questions answered correctly, but on the “level” of difficulty of questions the test taker reached. Because of variable starting and stopping points, an adaptive test is very efficient, requiring the test taker to answer fewer items compared to a traditional test. One of the earliest known adaptive tests was the Stanford-Binet Intelligence Scale given at the beginning of the 20th century (you can learn more about that test in this section).
In contrast to adaptive testing, computer adaptive testing (CAT) means that the adaptive test is a computerized exam instead of a paper-and-pencil exam. Today, most tests are given on computers rather than on paper. Computerizing exams has been a significant development for the testing industry, allowing for faster scoring, greater accessibility, increased fairness, easier administration, and increased security, among other benefits.
Computerized adaptive tests access an organized pool of items during the exam. These items range from easy to difficult, based on a difficulty value computed from data collected on the items. A good item pool has many items at each difficulty level. As questions are answered, the CAT algorithm pulls an item from the pool that closely matches the most recent estimate of the test taker’s ability, and this continues until the test ends. In short, every time a test taker answers an item, the computer re-estimates the test taker’s ability and selects a new question from the item bank that the examinee should have about a 50% chance of answering correctly. It does this to provide a more accurate measurement of the test taker's ability on a common scale.
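To make the 50% idea concrete, here is a minimal sketch in Python. It assumes a Rasch (one-parameter logistic) model, which this article does not name; under that model, a test taker has exactly a 50% chance on an item whose difficulty equals their ability. The function names `p_correct` and `pick_item`, the three-item pool, and its difficulty values are hypothetical and chosen only for illustration.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Assumed Rasch model: probability of answering an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def pick_item(pool: dict[str, float], ability: float, used: set[str]) -> str:
    """Return the unused item whose difficulty is closest to the ability estimate."""
    candidates = [name for name in pool if name not in used]
    return min(candidates, key=lambda name: abs(pool[name] - ability))


# Hypothetical pool: item name -> difficulty on the same scale as ability.
pool = {"easy": -2.0, "medium": 0.0, "hard": 2.0}

item = pick_item(pool, ability=0.3, used=set())
print(item, round(p_correct(0.3, pool[item]), 2))  # medium 0.57
print(p_correct(1.0, 1.0))                         # 0.5 when difficulty equals ability
```

With only three items, the closest match still leaves the chance of success at 57% rather than 50%, which is one reason a pool with many items at each difficulty level matters.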
In general, during the test, if a person’s estimated ability is high (that is, they have answered the more difficult questions well), the CAT will select and present an item from a “difficult range” of items in the pool. The process is the same for every estimated ability level from low to high, and whether the pool is divided into a few broad difficulty levels or many narrow ones.
When enough questions have been asked and answered—generally not as many as an equivalent traditional test—a reliable score for the test taker is calculated. The score is not based on the number of questions answered correctly, but on the level of difficulty of items the person is able to answer correctly. While the details change slightly, this is functionally similar to how a high jumper’s score is obtained in track and field (jump to this section to learn more).
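The select, answer, re-estimate, and stop cycle can be sketched as a short Python loop. This is only one possible arrangement, not a description of any particular CAT engine. It assumes a Rasch model, a standard normal prior over ability, and a stopping rule based on the standard error of the estimate; the threshold of 0.45, the cap of 30 items, the 61-item pool, and the toy test taker who answers correctly whenever an item's difficulty is at or below 1.0 are all hypothetical choices made for illustration.

```python
import math

GRID = [g / 10 for g in range(-40, 41)]  # candidate ability levels, -4.0 to 4.0


def p_correct(ability, difficulty):
    """Assumed Rasch model: chance of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate(responses):
    """Return (ability estimate, standard error) from (difficulty, correct) pairs.

    Uses the mean and spread of a posterior over GRID with a standard
    normal prior, which is one common choice among several.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # prior weight for this ability level
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(pool, answer, max_se=0.45, max_items=30):
    """Give items until the score is precise enough or the length cap is hit."""
    remaining, responses = list(pool), []
    ability, se = estimate(responses)
    while remaining and len(responses) < max_items and se > max_se:
        # Pick the unused item closest to the current ability estimate.
        difficulty = min(remaining, key=lambda d: abs(d - ability))
        remaining.remove(difficulty)
        responses.append((difficulty, answer(difficulty)))
        ability, se = estimate(responses)  # re-estimate after every answer
    return ability, se, len(responses)


pool = [d / 10 for d in range(-30, 31)]        # 61 hypothetical item difficulties
answer = lambda difficulty: difficulty <= 1.0  # toy test taker, for illustration only
ability, se, n = run_cat(pool, answer)
print(f"estimated ability {ability:.2f}, standard error {se:.2f}, items used {n}")
```

Note that the reported score is the ability estimate on the difficulty scale, not a count of correct answers, and that the test ends well before the pool is exhausted.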
I learned about computerized adaptive testing almost 40 years ago from a legend in our field, Ron Hambleton, as I was beginning a career in testing. He taught me the logic of adaptive testing and the technical and procedural steps needed to make one. Under his direction, my colleagues and I were able to create a large number of adaptive tests and administer them to K-12 students. Around eight years later, in 1990, I used the same adaptive design for the tests for a worldwide information technology certification program, eventually administering more than a million adaptive tests. As far as I know, that was the first ever large-scale global use of computerized adaptive testing. I’ve been a fan of them ever since, and I recommend them to many of Caveon’s customers for their efficiency and security properties.
One of the earlier adaptive tests in history was the Stanford-Binet Intelligence Scale, first used around 1916. It was a test I practiced giving to children years ago during one of my graduate school courses. It is considered to be an adaptive test because it doesn’t begin and end at the same place for all children. Instead, the questions were ranked by difficulty, and younger children started with some of the easier questions (but not the easiest). If the young child could not answer three of the easy questions in a row, the examiner then re-adjusted the starting point backing up to even easier ones. Assuming the child could correctly answer three of the questions in a row, the test would proceed until the child incorrectly answered three in a row. The examiner would then stop the test. There were several sub-tests, and each one was given in this general manner. The test would be scored based on the difficulty level the child was able to attain on each sub-test.
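The start, back up, and stop rules described above can be sketched in Python. This is a simplified reading of the procedure, not the official Stanford-Binet administration rules: it assumes items are indexed from easiest (0) to hardest, that the examiner backs up three items whenever the child fails to answer three in a row correctly at the starting point, and that the reported level is simply the hardest item answered correctly. The function name `administer` and its parameters are hypothetical.

```python
from typing import Callable, Optional


def administer(num_items: int, start: int,
               answers_correctly: Callable[[int], bool],
               run: int = 3) -> tuple[Optional[int], int]:
    """Return (index of hardest item answered correctly, number of items given)."""
    asked: dict[int, bool] = {}

    def ask(i: int) -> bool:
        if i not in asked:  # an item is never given twice
            asked[i] = answers_correctly(i)
        return asked[i]

    # Back up until the child answers `run` items in a row correctly.
    basal = start
    while basal > 0 and not all(ask(i) for i in range(basal, min(basal + run, num_items))):
        basal = max(0, basal - run)

    # Move up in difficulty until the child misses `run` items in a row.
    misses = 0
    for i in range(basal, num_items):
        misses = 0 if ask(i) else misses + 1
        if misses == run:
            break

    highest = max((i for i, ok in asked.items() if ok), default=None)
    return highest, len(asked)


# Toy child who can answer items 0 through 17 of a 40-item sub-test.
child = lambda i: i < 18
print(administer(40, start=10, answers_correctly=child))  # (17, 11)
print(administer(40, start=20, answers_correctly=child))  # (17, 7)
```

In both runs the child reaches the same level while seeing only a fraction of the 40 items, which is the efficiency that variable starting and stopping points provide.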
In the non-testing world, such adaptive measurement is fairly common, and has been around for a much longer time. Take, for example, the sport of high jumping, which began in Scotland in the 19th century. A high jump competition usually begins with a bar height slightly lower than the overall abilities of the competitors. As a result, some of the more capable jumpers may actually skip the first few set heights of the bar. Because of this, as well as the fact that a person is eliminated when they fail to clear a height where others succeed, a high-jump competition is very efficient. With few jumps and different numbers of jumps, the ability of each competitor is quickly determined. The winning competitors are determined by the final bar height each finalist is able to attain. The competitor with the highest bar height wins.
It is instructive to imagine that a high-jump competition is conducted the same way a traditional test is given. This is what it might look like: The high jumper would be required to jump, or try to jump, over all bar heights of 3-inch increments from 3 feet to 10 feet, a total of 29 jumps. The score would be the number of successful jumps out of 29. This would actually be a fairly accurate measure of high jump ability. However, regardless of ability, no jumper would enjoy this type of competition. There are too many unnecessary jumps, causing fatigue. Also, the jumper would be bored by the lower jumps and frustrated by the higher ones. Yet this is exactly the experience required by traditional exams in our field, and it is the type of experience adaptive tests, particularly computerized adaptive tests, can help to avoid.
One of the significant advantages of CAT is that it is efficient. It uses fewer items than a traditional test to determine an equally useful score for a test taker. This efficiency leads to these specific benefits:
These benefits, and others, generally lead to a very positive experience taking a computerized adaptive test. At Novell in 1995, I surveyed 3,000 of our certification candidates who had taken at least one CAT. 43% indicated they preferred taking computerized adaptive tests, 19% said they had no preference, and only 19% said they preferred a traditional test format. The main obstacle to preferring a computerized adaptive test was the doubt that such a short test could measure ability as well as a longer test. However, once the CAT format was explained to them and they were given opportunities to try out the tests, most were convinced that the CATs were better tests. Of course, after years of using CATs, the candidates gained confidence in the computerized adaptive test’s ability to separate those who were knowledgeable and experienced from those who were still learning, and they shared that confidence with others.
I’ve pondered this question hundreds of times over the past three decades, mainly because I was so eager to use the technology in 1990 and found immense success using it. Surely, every other program could do as I have, but the possible answers to the question are many. Therefore, this will be a personal (and partial) list of the many reasons people have given for forgoing CAT.
These objections are valid, particularly for small testing programs. Each time I have used computerized adaptive tests in the past, I have had access to plenty of data, software to calculate the necessary statistics, software to administer the test, and other necessary resources and help as needed. More recently, implementing CAT with a very small certification program, I felt the validity of these counter-arguments much more strongly. Nevertheless, it became important to look at computerized adaptive tests from a broader perspective. So, for testing programs with limited resources, please note that with innovative solutions, it is possible and somewhat simple to overcome these obstacles and others.
It should be clear from the CAT examples I have provided that there is no single correct way to implement a computerized adaptive test. Currently, what might be considered the “standard” way to implement CAT may not be a good fit for most programs (as you can see in the list of “objections” above). Therefore, it is important to know that there are alternatives that can address these objections. For example, just a few years ago, I created a CAT using item difficulty ratings by subject matter experts to represent the difficulty values of items, rather than conducting a large, infeasible empirical pilot of the items. The creative solution I used came from the cognitive science literature. It saved me a significant amount of time and money and resulted in high-quality difficulty values for the CAT items. This way of calibrating items is just one example of how a difficult part of the process can be made easier without sacrificing quality.
As mentioned above, there are many different ways you can get started with computerized adaptive testing. CAT is generally more difficult to implement than a traditional test, so I highly recommend working with a seasoned testing expert to get started. With that in mind, one of the first and most important needs for implementing CAT is having access to a large enough item bank. One way to increase your item pool is through automated item generation (AIG), which is the fastest (and often most cost-effective) way to expand your pool. Some AIG tools, like this one, can easily increase your item pool at the push of a button. You can learn more about AIG in this ultimate guide.
The purpose of this article is to inform readers at a conceptual level about computerized adaptive testing. As you can see, I have avoided technical explanations and descriptions or discussions about test theory and formulas. Instead, this article provides reasonable definitions and examples. The intent is to provide a broad perspective from my own practical experience of over 30 years making and using CATs for K-12 education and certification exams.
In short, a computerized adaptive test is a computer-based exam that uses algorithms to tailor its test question difficulty levels to the individual test taker, depending on that examinee's previous correct and incorrect answers. This means the test is different for every examinee based on their responses. I am hoping that my overview will pique your interest in using computerized adaptive tests, and persuade you to learn more about how to go about implementing them in your program. It’s my hope that, regardless of the size of your program or the number and qualifications of your staff, you will be able to implement CAT almost as easily as you provide any other test.
A psychologist and psychometrician, David has spent 37 years in the measurement industry. During the past decade, amid rising concerns about fairness in testing, David has focused on changing the design of items and tests to eliminate the debilitating consequences of cheating and testwiseness. He graduated from Brigham Young University in 1977 with a Ph.D. in Experimental Psychology, and completed a Biopsychology post-doctoral fellowship at Florida State University. In 2003, David co-founded the industry’s first test security company, Caveon. Under David’s guidance, Caveon has created new security tools, analyses, and services to protect its clients’ exams. He has served on numerous boards and committees, including ATP, ANSI, and ITC. David also founded the Performance Testing Council in order to raise awareness of the principles required for quality skill measurement. He has authored numerous articles for industry publications and journals, and has presented extensively at industry conferences.