The World's Only Test Security Blog
Posted by John Fremer
updated over a week ago
This article is designed for testing programs that are evaluating the need to acquire or develop exams to help with high-stakes decisions and/or are evaluating people or programs. These types of programs are often referred to as “high-stakes testing programs”. This article focuses specifically on the assessment process. It illustrates key issues at major stages in the assessment planning process, discusses why it’s important for testing programs to take note of this cycle, describes where and how security fits in, and shows you how to successfully apply the testing process to your individual program. The following six areas are covered:
By the end of this article, you should be able to properly apply and execute all eight steps of the exam process for your testing program. Let’s get started.
There are many different ways to divide up the assessment process. To provide information about the critical phases of this process, let’s first look broadly at the issues you need to address at three major checkpoints: before, during, and after testing.
We primarily give tests to obtain information—often about an individual’s or a group’s knowledge and skills. We frequently utilize these tests to make decisions about a person or a program. These decisions can include:
Once you identify the purpose of your test, you must then ask if you can use an existing test. Alternatively, do you want to start from scratch and build your own exam? A few questions to consider:
More about these issues in the section below.
If you are trying to decide whether or not to build your own exam, consider the following questions:
I highly recommend using the Security Boot Camp Workbook Part 2: Technology to answer these questions.
Often, professional test builders need around 18 months to develop an exam. Would that be a realistic timeline in your situation? Alternatively, what can you do within three to six months? Can a shorter timeline meet your needs?
Lastly, if you are using a vendor or a vendor’s exam software to build your exam, or if your exam will be hosted online, ensure that your vendor meets industry standards (like these baseline requirements) and uses the latest innovative item designs and test designs.
Once you’ve built or purchased and administered your test, you should evaluate your exam’s performance. A few questions to consider:
These questions will help you evaluate whether your exam is performing well and gathering the valuable information it was designed to collect. You can learn more about evaluating your exams in Security Boot Camp Part 3: Evaluation.
The following are frequently identified as key stages in the assessment process:
In this paper, each of these stages is addressed in terms of an important question that you need to answer as part of your work.
We are interested in testing because we have a purpose in mind. In an educational setting, teachers use tests to help determine the status of students’ learning. In an employment setting, tests are often used to help make hiring decisions. In the certification area, tests are used to establish eligibility to receive special permits, licenses, or credentials.
Thinking carefully about your purpose for testing proves to be a very important step. For example, evaluating thoroughly the purpose for your test will help you decide on the level of security you need to employ in the design, development, and administration of your exam. If important decisions are going to be made—potentially with significant consequences for test takers, test users, and test program managers—then you need to be extra careful to ensure the exam results accurately reflect the skills and knowledge of those being tested. If they don’t, the test results may be compromised and could reflect extraneous factors such as willingness and ability to cheat. As a result, if those who care about your test lose faith in the integrity of your scores, the damage to the individuals and to your testing program can be substantial and hard to recover from.
As part of planning for testing, defining the intended population is key. For what group is this test being developed? Sometimes the answer is straightforward (e.g., it is to test the knowledge and skills of the students in this particular class). On other occasions, a test is broadly available (e.g., tests used in the transition from one educational level to another, such as high school to college or college to graduate school). Knowing which group or groups your test will measure is key in the planning phase of the assessment process.
When considering who you are going to be testing, it is also critical to think about whether the test is appropriate for all of your students or candidates. What if the test taker is from another country and has limited proficiency in English? What if the test taker has little experience with tests of the type being employed (e.g., with computer-delivered exams or with exams using unusual or innovative question formats)? Will you need to provide extensive practice tests so that test takers know what to expect on your exam? Make sure your test takers have the opportunity to prepare for your exam. This way, they can demonstrate their skills and knowledge fairly.
For most testing settings, the task of deciding what to test involves bringing in different kinds of experts to help. In an employment setting, testing professionals often carry out job and task analyses in order to identify the kinds of content that should be included on the test. In an educational setting, the coverage of course content (including widely used instructional materials) is collected and reviewed by content experts.
For test development, an analogy is often drawn to the process of building a house—when you are creating a test, you need to develop detailed specifications. If this is your first time developing a test, don’t try to proceed without getting help from skilled test developers. Just as you would seek expert help or advice from experienced construction designers and workers for your home, reach out for help from test developers; they’ve been through the process and can provide you and your program valuable insights.
Lastly, make sure you allow considerable time for the planning process. Think in terms of three to six months for the stage of planning and developing detailed test specifications. (Yes, people have made tests on quicker schedules, but speak to test makers who have done that and look for the pain in their eyes as they relive their experience.)
Often, test makers are building a test to take its place in an existing program. In these settings, many dimensions of the test are already set (e.g., it is a two-hour test composed of five-option multiple-choice questions). However, even when this is the case, you still need to arrange to have the test questions written and reviewed.
Whenever possible, try out your questions with test takers before they are used operationally (this process is called pretesting, and it is extremely valuable). In most testing settings, an original set of items developed without pretesting is considerably more difficult than it needs to be for the test’s intended purpose.
Even when test makers intend to create questions whose difficulty spans the range of knowledge and skill in the test-taking population, often only the ablest test takers can handle the questions. Test takers with less experience, knowledge, or skill get lost.
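As a rough illustration of what pretest data can tell you, the sketch below computes a classical item difficulty value (the proportion of pretest takers who answered each item correctly) and flags items that look too hard. The response data, the function names, and the 0.30 cutoff are illustrative assumptions, not values taken from this article.

```python
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Return the proportion of correct answers for each item.

    `responses` holds one row per test taker and one column per item,
    with 1 for a correct answer and 0 for an incorrect one.
    """
    if not responses:
        return []
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


def flag_hard_items(difficulties: List[float], cutoff: float = 0.30) -> List[int]:
    """Return the indexes of items answered correctly by fewer than `cutoff` of takers."""
    return [i for i, p in enumerate(difficulties) if p < cutoff]


# Hypothetical pretest data: 5 test takers (rows), 4 items (columns).
pretest = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
]
difficulties = item_difficulty(pretest)
hard_items = flag_hard_items(difficulties)  # candidates for revision or replacement
```

A flagged item is not necessarily a bad item. The flag is only a prompt for a content expert to check whether the difficulty suits the intended population.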
This question will be answered in its own section (below) as it is a vital aspect of the assessment process.
Test security considerations should influence all stages of planning and developing your assessment procedures. You will get much better results by thinking about security at every stage than by trying to “retrofit” security measures after almost all decisions about how you will test have already been made.
If you are going to be utilizing a test to influence important decisions about individuals or programs, it is a virtual certainty that there will be efforts to compromise your results—both by individual test takers, and often by organized and resourceful groups of cheaters or thieves. The more carefully you protect your items and tests, the less attractive your testing program will be to miscreants, and the more confidence others will have in your program—confidence that your exam scores actually mean what they are intended to signify.
When considering your test's security, it's important to utilize the resources around you and learn from others' experiences. See, for example:
So what are the best ways to protect fair, reliable, and useful test scores? The validity of your exam scores depends on your ability to minimize cheating attempts and their effects. Test security measures can be divided into four categories (as discussed in this article): deterrence, prevention, detection, and follow-up.
Each category is quite important. However, test security works best when your measures cover all four of these categories. When your security measures are used collaboratively, each one can complement the other's protective strengths.
Deterrence activities aim to discourage test takers or organized cheating groups from attacking your exams. To deter unwanted behavior, you put procedures in place that remove examinees’ desire to attempt cheating or test theft. Your procedures and statements should convince those considering possible testing misbehaviors that they’d get caught, that what they’re planning won’t work, or that it simply isn’t worth the risk for them to try.
For example, you can publicize the fact that you use test items with content that varies for each test taker on an unpredictable basis. This can take away the appeal of some common types of cheating. It also removes the benefit for a test taker to copy a neighbor’s answers or memorize the answers from a previous test.
When communicating your security measures, be sure that test takers and others are aware of the consequences. For example, for programs that utilize data forensics, you could tell test takers that special analyses are done on an ongoing basis to be sure that tests were administered and taken fairly and appropriately.
Test takers should also be fully informed if there are serious potential consequences for violating the testing rules. This is a situation where repeating the message is recommended. It is also desirable to remind test takers of their responsibilities at the start of testing, and to do this multiple times if testing spans more than one occasion.
No matter which deterrent security measures you use, be sure to have purposeful communication with your examinees. Keeping your deterrent strategies quiet will not work—you must convince would-be cheaters that you have strategies in place that will catch them and penalize them.
Preventative measures aim to make it impossible for a threat to work in a particular situation. For example, a close and informed review of candidate IDs can both deter attempts at proxy test taking (or the use of an imposter) and also disrupt such activities before they can unfold. Even relatively simple (but valuable) testing practices, such as never allowing test takers to choose their own seats in the testing room (check out this example seating chart template) can be quite consequential.
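One simple way to keep test takers from choosing their own seats is to assign seats at random. The sketch below is a minimal illustration under assumptions: the candidate IDs, the seat labels, and the `assign_seats` helper are hypothetical and are not part of the seating chart template mentioned above.

```python
import random
from typing import Dict, List, Optional


def assign_seats(
    candidates: List[str],
    seats: List[str],
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Randomly assign each candidate a distinct seat.

    Leave `seed` as None in real use so the assignment cannot be predicted;
    pass a value only when you need a repeatable result, such as in a test.
    """
    if len(candidates) > len(seats):
        raise ValueError("not enough seats for all candidates")
    rng = random.Random(seed)
    shuffled = seats[:]  # copy so the caller's seat list is left unchanged
    rng.shuffle(shuffled)
    return dict(zip(candidates, shuffled))


# Hypothetical room with six seats and four candidates.
room = ["A1", "A2", "A3", "B1", "B2", "B3"]
roster = ["cand-101", "cand-102", "cand-103", "cand-104"]
seating = assign_seats(roster, room, seed=7)
```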
An example of a preventative measure could include monitoring test administration for in-person or state exams. Another example, one that works for both in-person and remote exams, is utilizing a computerized adaptive test (CAT) or using automated item generation (AIG). With these, only a portion of your exam questions will be displayed to each test taker, preventing them from harvesting and distributing your entire exam to other test takers.
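To make the idea of limited exposure concrete, here is a minimal sketch that gives each test taker a random subset of a larger item pool. It uses plain random sampling, which is far simpler than a real CAT (which selects items based on an ability estimate) or AIG (which generates items from templates). The pool size, form length, and function names are illustrative assumptions.

```python
import random
from typing import List


def draw_form(item_pool: List[str], form_length: int, rng: random.Random) -> List[str]:
    """Draw a random subset of the pool, without repeats, for one test taker."""
    if form_length > len(item_pool):
        raise ValueError("form_length cannot exceed the size of the item pool")
    return rng.sample(item_pool, form_length)


def exposure_rate(forms: List[List[str]], item_id: str) -> float:
    """Return the share of delivered forms that contained the given item."""
    return sum(item_id in form for form in forms) / len(forms)


# Hypothetical pool of 200 items, with 40 items shown to each test taker.
pool = [f"item-{n:03d}" for n in range(200)]
rng = random.Random(2024)  # fixed seed only so the example is repeatable
forms = [draw_form(pool, 40, rng) for _ in range(500)]
rates = [exposure_rate(forms, item) for item in pool]
```

With 40 of 200 items on each form, any single test taker sees only a fifth of the pool, so one person cannot harvest the whole exam in one sitting.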
Especially important to note—it is entirely possible for you to prevent some threats completely (through innovations such as the SmartItem™) and some threats partially (through innovations like the Discrete Option Multiple-Choice™ item). In all, best practices—combined with common sense and creativity—will help you to build and implement the best preventive solutions for your program.
If testing misbehavior has slipped past your deterrent and preventative measures, you still can detect it—many times before test scores are actually reported or finalized. An example of a detection method could be a vigilant proctor who discovers answer-copying between two examinees during the test. Proctoring is one of the most common detection measures testing programs employ. (To learn more about proctoring, view this article on the effectiveness of proctoring, this white paper outlining the standards of online proctoring, and this collection of proctoring articles.) Keep in mind that proctoring, despite its importance, has significant security holes and should never be relied on as your sole security measure.
Another example of detection (and one of the most effective methods high-stakes testing programs use to detect cheating) is the set of psychometric methods known as data forensics. Data forensics are powerful analyses that use patterns in your testing data to identify irregularities and catch instances of cheating that would otherwise go undetected. For a very compelling example of how data forensics were used to find cheating in a high-stakes certification testing program, see the paper “The Value of Data Forensics for Certification Programs.”
Keep in mind that unusual response patterns due to cheating or theft are often apparent only through statistical analysis. When the analysis occurs early enough, testing programs can respond quickly and avoid more extensive damage.
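The following is a deliberately simplified sketch of one kind of pattern such an analysis might look for: pairs of test takers who chose the same wrong option on many items. It is not the method used by any particular data forensics service. Real analyses rely on statistical models that account for chance agreement, and the answer key, response strings, and threshold here are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def shared_wrong_answers(a: str, b: str, key: str) -> int:
    """Count items where both test takers chose the same incorrect option."""
    return sum(1 for x, y, k in zip(a, b, key) if x == y and x != k)


def flag_pairs(
    responses: Dict[str, str], key: str, threshold: int
) -> List[Tuple[str, str, int]]:
    """Return (id_a, id_b, count) for every pair at or above the threshold."""
    flagged = []
    for (id_a, resp_a), (id_b, resp_b) in combinations(sorted(responses.items()), 2):
        count = shared_wrong_answers(resp_a, resp_b, key)
        if count >= threshold:
            flagged.append((id_a, id_b, count))
    return flagged


# Hypothetical 10-item test; each string holds one chosen option per item.
answer_key = "ABCDABCDAB"
responses = {
    "T1": "ACCDBBDDCB",
    "T2": "ACCDBBDDCA",
    "T3": "ABCAABCDBB",
}
suspicious = flag_pairs(responses, answer_key, threshold=3)
```

A flagged pair is a reason to look more closely, not proof of misconduct. Two test takers can share wrong answers simply because a distractor is attractive.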
To derive full value from your deterrence, prevention, and detection test security activities, you need to follow up on the test security threats you uncover. It doesn’t make any sense to detect a breach without having a response to deal with it. Typically, this pre-planned reactive strategy is called a Security Incident Response Plan (SIRP), and it is part of the bigger Test Security Plan. For actionable tips that help you directly apply follow-up measures to your program, including implementing a SIRP, see the Security Boot Camp Workbook Part 1: Preparedness.
An important point to note: over time, the test security threats facing your program will change. If a national defense strategy failed to acknowledge ever-changing threats like cyber hacking, it would be alarming and ineffective. The same is true for testing programs. Be sure to frequently conduct a security risk assessment for your program and fully understand the threats and risks that can undermine your test security. This will enable you to effectively stay on top of changing threats and keep your exam secure.
As computer-delivered tests have become more widely used, some testing programs have provided the immediate reporting of scores. If you have not adopted this practice, do not start. It is important that you build in a quality control step, reporting provisional scores if you must. No matter what, you should always subject the testing process to careful checks before the scores are reported.
With the extra time you have between administering the exam and providing the final test score, you should consider carrying out psychometric analyses to detect possible test taker misbehavior (more on that below). You may find few examples of such misbehavior in your populations. Either way, the review will be much easier and less fraught with emotion if the scores being studied have not yet been reported. With that said, be careful to always focus your work on ensuring test validity rather than detecting misbehavior.
The purpose of administering tests is to obtain fair and valid measurements of human performance. The question of whether a score is fair and valid is subject to psychometric research, but it is also subject to verification in the response data. Data forensics is essential for this verification because it provides a framework for detecting anomalies in test scores that can be found in no other way. You can see all that data forensics uncovers in this infographic. It follows that if a program intends to ensure the integrity of its test scores, then data forensics should be an integral part of its verification process.
Because a testing program has the role and responsibility to verify that test scores are valid, it needs to confirm or disconfirm the validity of individual test scores. While it is true that scores are sometimes canceled because a testing irregularity or security incident occurred, many of these incidents are not observable. Instead, we must rely upon the data to determine the validity of test scores. When, for whatever reason, the testing program has evidence that a test score is not valid, the program has a responsibility to cancel that score. Once a serious and proper review leads us to conclude that the score is not valid, the reason why it is not valid is irrelevant.
Every test administration should be followed by a series of reviews to check the quality of the test that was delivered. This will help ensure your exam is measuring what you need it to measure, and that your scores are reliable and trustworthy. In these reviews, address the following questions:
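One common check on score reliability in a post-administration review is an internal-consistency estimate such as Cronbach's alpha. The sketch below is offered as an assumption about what such a review might include, since this article does not name a specific statistic. The data and the function name are hypothetical.

```python
from statistics import pvariance
from typing import List


def cronbach_alpha(scores: List[List[float]]) -> float:
    """Estimate internal consistency from an item-score matrix.

    `scores` holds one row per test taker and one column per item.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance of total scores)
    """
    if not scores:
        raise ValueError("scores must contain at least one test taker")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance, so alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scored responses: 6 test takers (rows), 4 items (columns).
scored = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]
alpha = cronbach_alpha(scored)
```

Values closer to 1 suggest the items are measuring something in common. What counts as acceptable depends on the stakes of the decisions being made, so treat any fixed cutoff with caution.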
Below are the eight steps of the assessment process. By following each step, your program will set itself up for success at every stage of the assessment cycle.
For more than 18 years, Caveon Test Security has driven the discussion and practice of exam security in the testing industry. Today, as the recognized leader in the field, we have expanded our offerings to encompass innovative solutions and technologies that provide comprehensive protection: Solutions designed to detect, deter, and even prevent test fraud.