The World's Only Test Security Blog

Pull up a chair among Caveon's experts in psychometrics, psychology, data science, test security, law, education, and oh-so-many other fields and join in the conversation about all things test security.

A Newcomer’s Impression of the SmartItem™: Questions and Concerns

Posted by Chris Foster, Ph.D.


A Psychometrician's Concerns About the SmartItem

I recently joined Caveon after working in a university research position. I truly care about research, I am a believer in open-source software, and I enjoy tackling psychometric issues. When I was first introduced to SmartItem technology, I came away with the understanding that item writers worked with programmers to develop items whose stem and options could change upon every administration to examinees. Given this information, I had a few important questions that I’m sure will resonate with many other psychometricians out there.

As a psychometrician, I am concerned not only with the development of exams, but also with their fairness. The idea that examinees could receive different exams worried me for two reasons:

  1. By giving each examinee a random, different version of an item, one relinquishes some control over the exam development process, including control over the difficulty of a given administration or form. A core goal of standardized exams is to control as many parts of the exam as possible in order to standardize forms and experiences across examinees. Administering essentially random items from huge item banks means there is no viable way to pre-test every item, which in turn undermines one's ability to construct parallel forms of an exam.
  2. From the individual test taker's perspective, an examinee would not be able to tell the difference between their exam and anybody else's. Under the hood, however, the experiences across examinees could be vastly different. Because SmartItems sample from a benchmark, the test would be well balanced on the content area. But since item variants are administered randomly, an examinee could receive the hardest version of every SmartItem and end up with a much harder test than another individual. With no quality data at the question level, it would be difficult, if not impossible, to correct this issue or ensure that it never happens.

After spending some time thinking about SmartItem test construction, I recalled instances from my own experience in form building. In doing so, I realized that truly parallel forms don’t actually exist—it is extremely hard to balance forms based on item difficulty while also considering content and sub-content areas. In fact, the entire field of equating was developed simply because parallel forms are never truly parallel.

However, the chance that examinees could receive exams of vastly different difficulty continued to trouble me. Thus, I decided to see what would happen if items in an item bank were randomly assigned to forms.

A Simulation of the SmartItem

The research question was: given a large bank of items, how frequently do randomized tests differ from a traditional 40-item fixed-form assessment built from the same item bank? While it would be possible to develop a bank of items and give random tests to examinees, there was no real need to do so, since examinee performance was not part of the research goal. In fact, we didn't even need to develop items; all we really needed were item statistics to randomly assign to forms. The cleanest and simplest way to answer the question was to run a basic simulation (a statistical technique that lets researchers test how data behave under a given set of distributional assumptions).

Knowing this, we used a standard process for simulating both items and examinees. First, we simulated 10,000 IRT ability estimates by randomly drawing thetas from a normal distribution with a mean of 0 and a standard deviation of 1. We then simulated 5,000 sets of item parameters under the two-parameter logistic (2PL) IRT model: difficulty parameters were drawn from a normal distribution with a mean of 0 and a standard deviation of 1, and discrimination parameters from a normal distribution with a mean of 1 and a standard deviation of 0.1.
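In code, this generation step might look like the following NumPy sketch. The distributions and counts come straight from the description above; the seed and variable names are my own assumptions, added only for reproducibility:

```python
import numpy as np

# Seed is an assumption, chosen only so the sketch is reproducible
rng = np.random.default_rng(42)

N_EXAMINEES, N_ITEMS = 10_000, 5_000

# Ability (theta) parameters: standard normal, as described in the text
thetas = rng.normal(loc=0.0, scale=1.0, size=N_EXAMINEES)

# 2PL item parameters: difficulty b ~ N(0, 1), discrimination a ~ N(1, 0.1)
difficulty = rng.normal(loc=0.0, scale=1.0, size=N_ITEMS)
discrimination = rng.normal(loc=1.0, scale=0.1, size=N_ITEMS)
```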

Now we had an ability parameter for each of the 10,000 simulated examinees and item statistics for each of the 5,000 items. Using this information, we calculated how well our simulated examinees did on each simulated item by computing the probability of a correct response under the 2PL model. Examinee responses were then used to calculate the reliability of our 40-item fixed test as well as the p-value for each item. To create a fixed form of the assessment, we simply chose 40 random items from the item bank. The reliability of our 40-item assessment was .87, with an average p-value of .502.
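This response-and-scoring step could be sketched as follows. The 2PL probability, classical item p-values, and Cronbach's alpha (equivalent to KR-20 for dichotomous items) are standard, but the specific sizes here are scaled down from the article's 10,000 × 5,000 matrix to keep the example light, and the seed is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled down from the article's 10,000 examinees x 5,000 items;
# the computation is identical at full size
n_examinees, n_items = 2_000, 1_000
thetas = rng.normal(0.0, 1.0, n_examinees)
difficulty = rng.normal(0.0, 1.0, n_items)
discrimination = rng.normal(1.0, 0.1, n_items)

# 2PL model: P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))
p_correct = 1.0 / (1.0 + np.exp(
    -discrimination * (thetas[:, None] - difficulty)))

# Simulate dichotomous responses by comparing against uniform draws
responses = (rng.random(p_correct.shape) < p_correct).astype(np.uint8)

# Classical p-value for each item: proportion answering correctly
p_values = responses.mean(axis=0)

# Build a 40-item fixed form from randomly chosen bank items
form_items = rng.choice(n_items, size=40, replace=False)
form = responses[:, form_items]

# Reliability via Cronbach's alpha (KR-20 for dichotomous items)
k = form.shape[1]
alpha = (k / (k - 1)) * (
    1 - form.var(axis=0, ddof=1).sum() / form.sum(axis=1).var(ddof=1))
```

With items and examinees both centered at zero, the average p-value lands near .5 and a 40-item form produces a reliability in the high .80s, in line with the figures reported above.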

Then, each of our simulated examinees took 60 randomized tests, ranging in length from 1 to 60 items, each composed of items drawn at random from the item bank. An exam was considered easier or harder than the fixed form if the average p-value of its items fell outside the 95% confidence interval of the 40-item fixed test. The table below shows, for each test length, the percentage of tests that were of equal difficulty to the fixed form.
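The comparison step could be sketched like this. Here the bank's item p-values are drawn directly from a hypothetical N(0.5, 0.15) distribution rather than computed from simulated responses, and the normal-approximation confidence interval is one plausible reading of the criterion described above; both are assumptions made to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bank of item p-values (an assumption; in the article
# these come from the simulated examinee responses)
bank_p = np.clip(rng.normal(0.5, 0.15, 5_000), 0.01, 0.99)

# 95% confidence interval for the mean p-value of a 40-item fixed form
fixed = rng.choice(bank_p, size=40, replace=False)
half_width = 1.96 * fixed.std(ddof=1) / np.sqrt(fixed.size)
ci = (fixed.mean() - half_width, fixed.mean() + half_width)

# For each length 1..60, draw many random forms and record the share
# whose mean p-value falls inside the fixed form's interval
pct_equivalent = {}
for length in range(1, 61):
    means = np.array([
        rng.choice(bank_p, size=length, replace=False).mean()
        for _ in range(1_000)])
    inside = (means >= ci[0]) & (means <= ci[1])
    pct_equivalent[length] = 100.0 * inside.mean()
```

Because the standard error of a form's mean p-value shrinks as the form gets longer, longer randomized forms cluster more tightly around the bank average and fall inside the fixed form's interval more often, which is the pattern the table reports.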

[Table 1: Percentage of randomized tests of equal difficulty to the 40-item fixed form, by test length]



The above table shows two main results:

  1. As the length of the test increased, the proportion of tests similar in difficulty to the fixed-form test also increased.

  2. When randomized tests were the same length as the fixed form (40 items), approximately 95% were of equivalent difficulty to it. This value only increased with item count; by 60 items, almost all of the randomized tests matched the difficulty of the 40-item exam.

As test length increased, so did the similarity. This isn't a surprising finding; it's expected given how randomization behaves over larger samples. However, the speed at which tests became equivalent did surprise me, as I did not expect 95% equivalence by 40 items. While there is more I would like to do with the simulation, I found the results comforting.

Additional Questions

Thinking about the results of the simulation led to several other thought-provoking questions. Five percent is obviously a large share of test takers, one that should not simply be ignored. But if more than 5% of examinees are affected by cheating or other factors that systematically influence test scores, it is likely preferable to accept random differences rather than systematic ones.

As a psychometrician, I was pleased with the results of my initial simulation. I had some very real and legitimate concerns, which were partially answered by this simple simulation. However, like any researcher, I came away from my initial investigation of SmartItems with dozens of other research questions, questions I will eagerly address in future simulations. In the meantime, for additional simulations, please see chapter 10 of the SmartItem ebook.

Chris Foster, Ph.D.


About Caveon

For more than 18 years, Caveon Test Security has driven the discussion and practice of exam security in the testing industry. Today, as the recognized leader in the field, we have expanded our offerings to encompass innovative solutions and technologies that provide comprehensive protection: Solutions designed to detect, deter, and even prevent test fraud.

Topics from this blog: Exam Development, SmartItem™