If you’re monitoring the web to find your stolen test content online, it is vital you understand how search engine technology works. The parameters and limitations of current search technologies have a direct impact on how your teams can discover online content related to your testing program. For example, a posting of actual test content on a website may not pose a significant risk if that post is not readily retrieved by search technology. Here’s how web scraping is impacted by searching algorithms and technology.
Search engines “index” (in other words, analyze and store) web pages so that they can be searched. Unindexed web pages are not displayed in results from search engines (such as Google, Bing, or Yahoo). Because indexing has a cost, search engines allocate a “crawl budget” to each site. The search engine then crawls the site more or less often, depending on how important it deems the site and on the volume and frequency of changes made there.
Search engines use a ranking system to determine where pages appear in the results for a given search. Even if a webpage is indexed and returned by the search engine, its rank may be low, and it may not show up in the first few pages of results. With over 95% of clicks on search engines happening on the first page, that first page (and especially its first few results) is the most valuable real estate for a website.
An entire ecosystem of Search Engine Optimization (SEO) companies exists to help businesses rise to the top of search results. With so many websites aiming for a first-page spot, lower-ranked or low-authority sites (and pages that search engines deem less relevant to the searcher’s intent for that keyword) are pushed farther down the results. This drives down traffic to potentially relevant sites, further decreases the likelihood that casual browsers will find them, and makes it less likely that your web scraping efforts will find them.
To a search engine, each individual Tweet, Facebook post, or other social media entry is a web page on its own. Search engines index each separate post on every discrete account in each social media channel (Twitter, Facebook, Twitch, Instagram, YouTube, TikTok, etc.) as a stand-alone webpage.
Search engines do not index every social media post. For example, if the posting user’s public profile has no inbound or outbound links, that user’s posts will likely not be indexed by search engines. This means that if a test taker posts live exam content on their Twitter account, but they have no followers, no previous posts, and rarely log in, the search engine is unlikely to index or display the Tweet containing your test content. So, while your content is technically online, it will not be displayed by search engines, and your web scraping efforts will therefore not be able to identify it.
“Social signals” refer to a website’s or social media post’s collective shares, likes, and overall social media activity. Search engines do not use social media platform signals as part of rankings. Instead, the search engine will use “inbound” links to rank social media posts. That is, the more a post is shared outside the platform with links back to the platform, the higher the post will climb in the search engine’s rank. An indexed social media post with no external links pointing to it may not be ranked at all by a search engine. For example, consider this video about Google’s ranking methods on social media, explained by Google themselves.
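The inbound-link idea can be illustrated with a toy count. The following is a minimal sketch, not how any real search engine ranks pages (those algorithms are proprietary and far more involved). The link graph, the host names, and the function `external_inbound_counts` are all hypothetical and exist only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical link graph: each key is a page, each value lists the URLs it links to.
LINK_GRAPH = {
    "https://blog.example.org/review": ["https://social.example.com/post/1"],
    "https://forum.example.net/thread": [
        "https://social.example.com/post/1",
        "https://social.example.com/post/2",
    ],
    # A link from inside the platform itself; it is not counted below.
    "https://social.example.com/post/3": ["https://social.example.com/post/2"],
}


def external_inbound_counts(link_graph, platform_host):
    """Count links that point at platform_host pages from pages on other hosts."""
    counts = Counter()
    for source, targets in link_graph.items():
        if urlparse(source).hostname == platform_host:
            continue  # skip links that originate inside the platform
        for target in set(targets):  # count each source page once per target
            if urlparse(target).hostname == platform_host:
                counts[target] += 1
    return counts


counts = external_inbound_counts(LINK_GRAPH, "social.example.com")
for post, n in counts.most_common():
    print(post, n)
```

In this made-up graph, the first post has two external inbound links, the second has one, and the third has none. Under the reasoning above, the third post would be the least likely to surface in a web search.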
It is important to note that most social media sites mark links to posts as “nofollow” so that search engines don’t associate specific content with the social media site itself. Therefore, a single post on social media is generally unranked by search engines. Similarly, the ranking of indexed, non-social-media websites with dynamic content is affected by inbound and outbound links. So even when the text of a search query matches a website’s content exactly, the result may not rank high enough to be noticed by your web scraping efforts, or by examinees searching for that content.
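A monitoring team can check whether the links on a page carry the “nofollow” hint. The sketch below uses only Python’s standard-library `html.parser`. The sample HTML and the names `LinkRelParser` and `extract_links` are assumptions made for illustration, and real pages may need a more forgiving parser.

```python
from html.parser import HTMLParser

# Hypothetical page fragment with one nofollow link and one ordinary link.
SAMPLE_HTML = (
    '<p><a href="https://example.com/a" rel="nofollow noopener">A</a> '
    '<a href="https://example.com/b">B</a></p>'
)


class LinkRelParser(HTMLParser):
    """Collect (href, is_nofollow) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def extract_links(html):
    """Return a list of (href, is_nofollow) tuples found in the HTML string."""
    parser = LinkRelParser()
    parser.feed(html)
    parser.close()
    return parser.links


for href, is_nofollow in extract_links(SAMPLE_HTML):
    print(href, "nofollow" if is_nofollow else "followed")
```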
Some additional relevant details about social media and search engines:
With these facts in mind, suppose a test taker has posted your live test content. That specific examinee does not have any followers, and this is their first post. Therefore, not only will their post not be searchable, but other examinees will not be able to stumble upon it unless they are specifically aware of it and know how to locate it (for example, through a direct link). And your web scraping efforts, no matter how vigilant, would not pull it up.
In addition to social media posts, many regular non-social-media websites that contain searchable content will not be found by web searches either. If a website has no relevant inbound and outbound links to other sites, search engines may fail to index it unless the website owner specifically asks them to do so. In that case, no record of the site exists in the search engine’s index, and the search engine cannot display the website in search results.
Website administrators can also configure settings that prevent parts of their site from being crawled by search engine bots. They do this through methods such as simple text files in certain directories or hidden links. When this is done, these pages will not be found or displayed by a search engine. Most users, and even advanced users or in-depth web scraping efforts, would not be able to find this type of content unless they were explicitly made aware of it.
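The most common form of such a text file is `robots.txt`, served from the root of a site. As a minimal sketch, Python’s standard-library `urllib.robotparser` can report whether a given set of rules lets a crawler fetch a URL. The rules, URLs, and the helper name `is_crawlable` below are hypothetical examples, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents that block all crawlers from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/private/exam-dump.html"))
print(is_crawlable(ROBOTS_TXT, "https://example.com/blog/post.html"))
```

A page that is disallowed this way can still exist and be reachable by direct link. It is simply skipped by crawlers that honor the file, which is why it may never appear in search results.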
Social media platforms have their own search algorithms. Unlike Google or Bing, these algorithms do not simply match the query text against post content to return results. In fact, one could type a post’s contents word-for-word into a social media platform’s search function and likely not have that same post returned. The individual social media platform adapts its search results to the user’s specific profile, browsing history, social media posts, post security settings, and inferred interests. The platform also weighs the popularity of posts heavily in its search returns. Some details:
It is also important to note that the cookies stored in an individual’s browser affect the search results they see, both on social media platforms and on web search engines as a whole. Because of the customizations and targeted content responses that each user profile generates, two individuals are unlikely to see exactly the same search results, even when entering the same keyword.
For example, when a user interacts with or creates a post on a social media platform, the platform may store that visit in a cookie in the user’s browser. When that same user searches for the contents of that post, the post may be returned. However, someone else searching for the same content may not see that result. So if an examinee posted live test content and then searched for their post, they would likely find it. However, if someone on your web scraping team hears about this post and searches for it specifically, they may not have any luck finding it.
Web scraping is an effective method for determining whether your exam content has been leaked online. It’s vital for testing programs—especially high-stakes testing programs—to monitor the web for stolen test content. This blog exists to help testing programs with their own DIY web monitoring teams understand the ins and outs of search engine technology and its impact on web monitoring efforts. However, you should always work with your team to determine whether your organization can keep up with content monitoring in-house, or if outside help patrolling the web would be a better fit for your program.
It’s extremely common for live exam content to be leaked online (whether on braindump sites or social media platforms), and often, exam content is spread within just days of a test being released. Web scraping efforts help you identify what content is exposed, where it is exposed, who may have stolen or used it, and how deeply it has impacted your program.
The parameters and limitations of current search technologies have a direct impact on how your teams—and your test takers—can find and discover leaked test content online. A posting of actual test content may exist, but it isn’t always searchable or findable. Use these guidelines to determine what content your web scraping efforts can pull up.