The World's Only Test Security Blog
Pull up a chair among Caveon's experts in psychometrics, psychology, data science, test security, law, education, and oh-so-many other fields and join in the conversation about all things test security.
Posted by Chris Foster, Ph.D.
I recently joined Caveon after working in a university research position. I truly care about research, I am a believer in open-source software, and I enjoy tackling psychometric issues. When I was first introduced to SmartItem technology, I came away with the understanding that item writers worked with programmers to develop items whose stem and options could change upon every administration to examinees. Given this information, I had a few important questions that I’m sure will resonate with many other psychometricians out there.
As a psychometrician, I am concerned not only with the development of exams but also with their fairness. The idea that examinees could receive different exams concerned me for two reasons: the forms might not be truly parallel, and examinees might face exams of very different difficulty.
After spending some time thinking about SmartItem test construction, I recalled instances from my own experience in form building. In doing so, I realized that truly parallel forms don’t actually exist—it is extremely hard to balance forms based on item difficulty while also considering content and sub-content areas. In fact, the entire field of equating was developed simply because parallel forms are never truly parallel.
However, the chance that examinees could receive exams of vastly different difficulty continued to trouble me. Thus, I decided to see what would happen if items in an item bank were randomly assigned to forms.
The research question was: Given a large bank of items, how frequently do randomized tests differ from a traditional 40-item fixed-form assessment built from the same item bank? While it would have been possible to develop a bank of items and give random tests to examinees, there was no real need to do so, since examinee performance was not part of the research goal. In fact, we didn’t even need to develop items; all we really needed were item statistics to randomly assign to forms. The cleanest and simplest way to answer this question was to run a basic simulation, a statistical technique that lets researchers test how data behave under a chosen set of distributional assumptions.
Knowing this, we used a standard process for simulating both items and examinees. First, we simulated 10,000 IRT ability estimates by randomly drawing thetas from a normal distribution with a mean of 0 and a standard deviation of 1. We then simulated 5,000 item parameters under the two-parameter logistic (2PL) IRT model: difficulty parameters were sampled from a normal distribution with a mean of 0 and a standard deviation of 1, and discrimination parameters from a normal distribution with a mean of 1 and a standard deviation of 0.1.
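This setup can be sketched in a few lines of Python with NumPy. The seed and variable names here are my own choices for illustration, not the original code:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# 10,000 ability estimates: theta ~ N(0, 1)
theta = rng.normal(loc=0.0, scale=1.0, size=10_000)

# 5,000 item parameter pairs for a 2PL model:
# difficulty b ~ N(0, 1), discrimination a ~ N(1, 0.1)
b = rng.normal(loc=0.0, scale=1.0, size=5_000)
a = rng.normal(loc=1.0, scale=0.1, size=5_000)
```

With 10,000 and 5,000 draws, the sample means and standard deviations land very close to the population values, so the simulated bank behaves like the distributions we specified.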
Now we had an ability parameter for 10,000 simulated examinees and item statistics for 5,000 items. Using this information, we computed the probability that each simulated examinee would answer each simulated item correctly under the 2PL model. The resulting examinee responses were used to calculate the reliability of our 40-item fixed test as well as the p-value for each item. To create a fixed form of the assessment, we simply chose 40 random items from the item bank. The reliability of our 40-item assessment was .87, with an average p-value of .502.
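A minimal sketch of this scoring step, assuming the standard 2PL response function and Cronbach’s alpha as the reliability estimate (the post does not specify which reliability coefficient was used); the bank is scaled down from 10,000 × 5,000 so the sketch runs quickly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, smaller than the post's 10,000 x 5,000 bank
n_examinees, n_items = 2_000, 1_000
theta = rng.normal(0.0, 1.0, n_examinees)
b = rng.normal(0.0, 1.0, n_items)  # difficulty
a = rng.normal(1.0, 0.1, n_items)  # discrimination

# 2PL probability of a correct response: P = 1 / (1 + exp(-a * (theta - b)))
prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

# Dichotomous responses: correct when a uniform draw falls below the probability
responses = (rng.random(prob.shape) < prob).astype(np.int8)

# Classical p-value per item: proportion of examinees answering correctly
p_values = responses.mean(axis=0)

# Build a fixed form from 40 randomly chosen items and estimate its
# reliability with Cronbach's alpha
form = responses[:, rng.choice(n_items, size=40, replace=False)]
k = form.shape[1]
alpha = (k / (k - 1)) * (1 - form.var(axis=0, ddof=1).sum()
                         / form.sum(axis=1).var(ddof=1))
```

At these (reduced) sizes a 40-item 2PL form typically yields an alpha in the high .80s, consistent with the .87 reported in the post.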
Then we had each of our examinees take 60 tests, ranging in length from 1 to 60 items, each composed of items drawn at random from the item bank. An exam was considered easier or harder than the fixed form if the average p-value of its items fell outside the 95% confidence interval of the 40-item fixed test. The table below shows the percentage of tests at each test length that were of equal difficulty to the fixed form.
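One way this comparison might be implemented, assuming the 95% confidence interval is built from the fixed form’s item p-values; the bank sizes, replication count, and test lengths here are illustrative, not the original study’s:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, smaller than the post's 10,000 x 5,000 bank
n_examinees, n_items = 2_000, 1_000
theta = rng.normal(0.0, 1.0, n_examinees)
b = rng.normal(0.0, 1.0, n_items)                       # difficulty
a = rng.normal(1.0, 0.1, n_items)                       # discrimination
prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))  # 2PL
responses = (rng.random(prob.shape) < prob).astype(np.int8)
p_values = responses.mean(axis=0)                       # classical item p-values

# 95% CI for the mean p-value of a randomly drawn 40-item fixed form
fixed_p = p_values[rng.choice(n_items, size=40, replace=False)]
half_width = 1.96 * fixed_p.std(ddof=1) / np.sqrt(fixed_p.size)
lo, hi = fixed_p.mean() - half_width, fixed_p.mean() + half_width

# For each test length, draw random forms and count how often their
# mean p-value lands inside the fixed form's CI ("equal difficulty")
n_reps = 500
pct_equal = {}
for length in (1, 10, 20, 40, 60):
    hits = sum(
        lo <= p_values[rng.choice(n_items, size=length, replace=False)].mean() <= hi
        for _ in range(n_reps)
    )
    pct_equal[length] = 100.0 * hits / n_reps
```

Because the mean p-value of a random form concentrates around the bank average as the form gets longer, the percentage of "equal difficulty" forms rises with test length, which is the pattern the table reports.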
The above table shows two main results:
As test length increased, so did the proportion of equivalent tests. This isn’t a surprising finding; it’s expected given how randomization behaves over larger numbers of items. However, the speed at which tests became equivalent did surprise me: I did not expect 95% of tests to be equivalent by 40 items. While there are more things I would like to do with the simulation, I did find the results comforting.
Thinking about the results of the simulation led to several other thought-provoking questions. Five percent is obviously a large share of test takers, an amount that should not simply be ignored. But if more than 5% of examinees are affected by cheating or other factors that systematically influence test scores, it is likely more appropriate to accept random differences between forms than systematic ones.
As a psychometrician, I was pleased with the results of my initial simulation. I feel my concerns were real and legitimate, and the simple simulation partially answered them. However, like any researcher, I came away from my initial investigation of SmartItems with dozens of other research questions, questions I will eagerly address in future simulations. In the meantime, for additional simulations, please see chapter 10 of the SmartItem ebook.
For more than 18 years, Caveon Test Security has driven the discussion and practice of exam security in the testing industry. Today, as the recognized leader in the field, we have expanded our offerings to encompass innovative solutions and technologies that provide comprehensive protection: Solutions designed to detect, deter, and even prevent test fraud.