Caveon Security Insights Blog

How Search Engine Technology Impacts Web Scraping for Test Content

Written by Brinnlie Knavel | August 3, 2021 at 7:26 PM

If you’re monitoring the web to find your stolen test content online, it is vital that you understand how search engine technology works. The parameters and limitations of current search technologies directly affect how your teams can discover online content related to your testing program. For example, a posting of actual test content on a website may not pose a significant risk if that post is not readily retrieved by search technology. Here’s how web scraping is affected by search algorithms and technology.

The Scoop on Search Engines

1. The Importance of Indexing

Search engines “index” (in other words, analyze and store) web pages to facilitate searching. Unindexed web pages will not appear in results from search engines such as Google, Bing, or Yahoo. Because indexing has a cost, search engines allocate a “crawl budget” to each site: depending on how important the search engine deems the site, and on the volume and frequency of changes made to it, the search engine crawls the site more or less often.
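
To make the idea of a crawl budget concrete, here is a minimal sketch (in Python, using the third-party requests and BeautifulSoup libraries) of a breadth-first crawler that stops after a fixed number of pages per site. The budget value and function name are illustrative; real search engines set and adjust budgets through proprietary heuristics.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

CRAWL_BUDGET = 50  # hypothetical per-site page limit


def crawl_site(start_url: str, budget: int = CRAWL_BUDGET) -> set[str]:
    """Breadth-first crawl of a single site that stops once the budget is spent.

    Pages that sit beyond the budget (deeply or rarely linked) never get
    fetched, which is roughly why low-priority pages never get indexed.
    """
    domain = urlparse(start_url).netloc
    seen, queue, crawled = {start_url}, deque([start_url]), set()

    while queue and len(crawled) < budget:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue
        crawled.add(url)

        # Queue same-site links we have not seen yet.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

    return crawled
```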

2. Search Result Ranking

Search engines use a ranking system to determine where pages appear in their search results. Even if a webpage is indexed and returned by the search engine, it may rank too low to show up in the first few pages of results. With over 95% of search engine clicks happening on the first page, that first page (and especially the first few results) is the most valuable real estate for a website.

An entire ecosystem of Search Engine Optimization (SEO) companies exists to help businesses rise to the top of search results. With so many websites competing for a first-page spot, lower-ranked or low-authority sites, along with pages that search engines deem less relevant to a keyword’s searcher intent, are pushed farther down the results. This drives down traffic to potentially relevant sites, further decreases the likelihood of casual browsers finding them, and makes it less likely that your web scraping efforts will find them.
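
The exact ranking formulas used by commercial search engines are proprietary and blend many signals, but the classic PageRank algorithm illustrates the core mechanic described above: a page that few other pages link to sinks toward the bottom. A toy Python sketch over a hypothetical three-page link graph:

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Classic PageRank over a link graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # a dangling page spreads its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank


# A page nobody links to ends up near the minimum possible score.
graph = {
    "popular-forum": ["news-site"],
    "news-site": ["popular-forum"],
    "leaked-content-page": ["popular-forum"],  # links out, but nothing links in
}
print(pagerank(graph))  # "leaked-content-page" scores far below the other two
```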

3. How Search Engines Interact with Social Media Posts

How Search Engines Classify Social Media Posts

To a search engine, each individual Tweet, Facebook post, or other social media entry is a web page on its own. Search engines index each separate post on every discrete account in each social media channel (Twitter, Facebook, Twitch, Instagram, YouTube, TikTok, etc.) as a stand-alone webpage.

How Social Media User Activity Affects Search Engines

Search engines do not index every social media post. For example, if a posting user’s public profile has no inbound or outbound links, that user’s posts will likely not be indexed. This means that if a test taker posts live exam content on their Twitter account but has no followers, few other posts, and rarely logs in, the search engine will not index or display the Tweet containing your test content. So, while your content is technically online, it will not be surfaced by search engines, and it will therefore not be identifiable by your web scraping efforts.

How Inbound Links Affect Social Media

“Social signals” refer to a website’s or social media post’s collective shares, likes, and overall social media activity. Search engines do not use these platform signals as part of their rankings. Instead, they use “inbound” links to rank social media posts: the more a post is shared outside the platform with links back to it, the higher the post climbs in a search engine’s rankings. An indexed social media post with no external links pointing to it may not be ranked at all. Google itself has explained its approach to ranking social media content in a video on the topic.

It is important to note that most social media sites mark links to posts as “nofollow” so that search engines do not associate specific content with the social media site itself. As a result, a single post on social media is generally unranked by search engines. The ranking of indexed, non-social-media websites with dynamic content is similarly affected by inbound and outbound links. So even when a search query matches a website’s content word for word, the returned result may not rank high enough to be noticed by your web scraping efforts, or by examinees searching for that content.
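
To gauge whether external links to a post would pass any ranking value, a scraper can check those links for the “nofollow” marker. Here is a minimal Python sketch using the requests and BeautifulSoup libraries; the function name is illustrative:

```python
import requests
from bs4 import BeautifulSoup


def count_followable_links(url: str) -> dict[str, int]:
    """Count links on a page, separating 'nofollow' links from followable ones."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    counts = {"follow": 0, "nofollow": 0}
    for a in soup.find_all("a", href=True):
        rel = a.get("rel") or []  # BeautifulSoup returns rel as a list of tokens
        if "nofollow" in rel:
            counts["nofollow"] += 1
        else:
            counts["follow"] += 1
    return counts
```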

Some additional relevant details about social media and search engines:

  • Facebook users create roughly 3 million new posts per minute, and Twitter users generate about 350,000 tweets per minute. Search engines cannot index more than a small fraction of these posts, so most of what people post will never be found through a search engine.

  • Because social media sites are so large, search engines index only those posts with many external links leading to them. These posts usually involve high-profile topics or are created by celebrities, politicians, or major corporations. Such posts may be returned in search engine results.

  • Search engines’ indexing of Twitter is declining. A 2018 study found that, in aggregate, only 5.2% of tweets appear in the Google index.

  • Search engines do not index tweets immediately. The same study found that only 1.6% of tweets were indexed by Google within their first seven days.

  • To maintain the validity and stability of their platforms, Twitter, Facebook, and other social media platforms block most web crawling unless a post (such as a tweet) has been retweeted internally a significant number of times or is tied to a prominent public figure.

A Real-World Example

With these facts in mind, imagine a test taker who has posted your live test content. That specific examinee has no followers, and this is their first post. Not only will the post not be searchable, but other examinees will not be able to stumble upon it unless they are specifically aware of it and know how to locate it (for example, via a direct link). And your web scraping efforts, no matter how vigilant, would not pull it up.

4. Other Reasons Search Engines Don’t Display Certain Webpages

Social media posts are not the only hard-to-find content: many ordinary, non-social-media websites containing searchable content will not be found by web searches either. Unless a website contains relevant inbound and outbound links to other sites, search engines may fail to index it at all unless the website owner specifically submits it for indexing. In that case, no record of the site may exist in the search engine’s index, and the search engine cannot display the website in search results.

Website administrators can also configure settings that prevent parts of their site from being crawled by search engine bots, for example through a robots.txt file in the site’s root directory, “noindex” directives on individual pages, or pages that are deliberately left unlinked. When this is done, these pages will not be found or displayed by a search engine, and neither casual users nor advanced users running in-depth web scraping efforts would be able to find this content unless they were explicitly made aware of it.
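
Before concluding that a page is discoverable through search, a scraping team can check these signals directly. The sketch below relies on the standard robots.txt convention and the widely honored “noindex” directives (in a meta tag or an X-Robots-Tag header); the function name is illustrative, and it only tests eligibility for indexing, not whether a page actually ranks.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup


def is_search_visible(url: str, user_agent: str = "*") -> bool:
    """Rough check of whether a page is even eligible to appear in search results."""
    # 1. Does the site's robots.txt allow crawlers to fetch this page at all?
    robots = RobotFileParser(urljoin(url, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(user_agent, url):
        return False

    # 2. Does the page itself carry a "noindex" robots directive?
    resp = requests.get(url, timeout=10)
    meta = BeautifulSoup(resp.text, "html.parser").find("meta", attrs={"name": "robots"})
    if meta and "noindex" in (meta.get("content") or "").lower():
        return False

    # 3. Search engines also honor an "X-Robots-Tag: noindex" response header.
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return False

    return True
```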

Searching Within Social Media Platforms

Social media platforms have their own search algorithms, and those algorithms do not simply match query text to post content the way Google or Bing would. In fact, you could type a post’s contents word for word into a platform’s search function and likely not have that post returned. Each platform adapts its search results to the user’s specific profile, browsing history, social media posts, post privacy settings, and inferred interests, and it weighs the popularity of posts heavily in what it returns. Some details:

  1. The volume of posts generated on social media platforms, combined with the algorithms used by the platforms’ own search engines, makes searching for specific content on a social media platform very opaque and inconsistent. The user’s location, interests, history, and past interactions are all stored by the social media site and then used to tailor that user’s search results. The text of the search itself is only a small part of what determines which results are returned; social media algorithms decide which content is delivered based largely on a specific user’s behavior.

  2. Because of the difficulty of finding content on social media sites, an entire industry of social media search products has sprung up. These are designed to replace the platforms’ built-in search engines, which are largely ineffective for anyone other than the specific user who created the post. Here are some examples of social media search tools:
    1. https://www.social-searcher.com/
    2. https://www.mentionlytics.com/
    3. https://infotracer.com/

  3. An example: if you are looking for a specific post that you know a previous examinee made containing your test content, using Twitter’s search function to look for the user and content in question may yield no results.

  4. Most social media platforms have settings that allow users to control how much public exposure their posts have. Posts restricted to “invited” or “direct share” audiences will not be found by any outside search. Additionally, posts made in separate, by-invitation-only private groups will not be visible to external searches.

Impact of Browser Cookies

It is also important to note that the cookies stored in an individual’s browser affect the search results they see, both on social media platforms and in general web search engines. Because of the customizations and targeted content that each user profile generates, no two individuals will see the same search results, even when they enter the same keywords.

For example, when a user interacts with or creates a post on a social media platform, the platform may store that visit in a cookie in the user’s browser. When that same user later searches for the contents of the post, the result may be returned; however, no one else searching for that content would see that result. So, if an examinee posted live test content and then searched for their own post, they would find it, but if someone on your web scraping team hears about the post and searches for it specifically, they may have no luck finding it.
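
One practical consequence for a monitoring team: run checks from a clean, cookie-free session so that cookies accumulated from earlier browsing cannot skew what comes back. A minimal Python sketch using the requests library; the User-Agent string is hypothetical.

```python
import requests


def fetch_without_personalization(url: str) -> str:
    """Fetch a page from a brand-new session: no stored cookies, no history."""
    with requests.Session() as session:
        session.cookies.clear()  # start from an empty cookie jar
        session.headers["User-Agent"] = "ExamContentMonitor/1.0"  # hypothetical UA
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
```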

Summary

Web scraping is an effective method for determining whether your exam content has been leaked online. It’s vital for testing programs—especially high-stakes testing programs—to monitor the web for stolen test content. This blog exists to help testing programs with their own DIY web monitoring teams understand the ins and outs of search engine technology and its impact on web monitoring efforts. However, you should always work with your team to determine whether your organization can keep up with content monitoring in-house, or if outside help patrolling the web would be a better fit for your program.

It’s extremely common for live exam content to be leaked online (whether on braindump sites or social media platforms), and often, exam content is spread within just days of a test being released. Web scraping efforts help you identify what content is exposed, where it is exposed, who may have stolen or used it, and how deeply it has impacted your program.

The parameters and limitations of current search technologies directly affect how your teams, and your test takers, can discover leaked test content online. A posting of actual test content may exist, yet not be searchable or findable. Use these guidelines to determine what content your web scraping efforts can realistically pull up.