PerimeterX (aka Human) is a web service that protects websites, apps, and APIs from automation such as scrapers. It uses a combination of web technologies and behavior analysis to determine whether the user is a human or a bot.
It is used by popular websites like Zillow.com, Fiverr.com, and many others, so by understanding how to bypass PerimeterX we can open up web scraping of many popular websites.
Depending on how fresh your data needs to be, one option to bypass PerimeterX is to scrape the data from the Google Cache instead of the actual website.
When Google crawls the web to index web pages, it creates a cache of the data it finds. Most PerimeterX protected websites let Google crawl their websites so you can scrape this cache instead.
Scraping the Google cache can be easier than scraping a PerimeterX protected website, but it is only a viable option if the data on the website you are looking to scrape doesn't change that often.
To scrape the Google cache simply add https://webcache.googleusercontent.com/search?q=cache: to the start of the URL you would like to scrape.
For example, if you would like to scrape https://example.com/ then the URL to scrape the Google cache version would be: https://webcache.googleusercontent.com/search?q=cache:https://example.com/
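As a minimal sketch (assuming the Python requests library and a generic desktop User-Agent header), fetching the cached copy looks like this:

```python
import requests

CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"
target_url = "https://example.com/"

# Fetch Google's cached copy of the page instead of the live, PerimeterX-protected site
response = requests.get(
    CACHE_PREFIX + target_url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
print(response.status_code)
print(response.text[:500])  # first 500 characters of the cached HTML
```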
Some websites (like LinkedIn) tell Google not to cache their web pages, and for others Google's crawl frequency is too low, meaning some pages might not be cached yet. So this method doesn't work with every website.
If you want to scrape the live website, then one option is to do the entire scraping job with a headless browser that has been fortified to look like a real user's browser.
Vanilla headless browsers leak their identity in their JS fingerprints, which anti-bot systems like PerimeterX can easily detect. However, developers have released a number of fortified headless browsers that patch the biggest leaks:
Puppeteer: the stealth plugin for Puppeteer (puppeteer-extra-plugin-stealth).
Playwright: the stealth plugin is coming to Playwright soon. Follow development here and here.
Selenium: undetected-chromedriver, an optimized Selenium ChromeDriver patch (see the sketch below).
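For instance, a minimal undetected-chromedriver sketch (assuming pip install undetected-chromedriver and a local Chrome install) looks like this:

```python
import undetected_chromedriver as uc

# undetected-chromedriver patches ChromeDriver so the usual automation
# markers (e.g. navigator.webdriver) are not exposed to the page
driver = uc.Chrome(headless=True)
try:
    driver.get("https://example.com/")
    print(driver.title)
finally:
    driver.quit()
```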
For example, a commonly known leak present in headless browsers like Puppeteer, Playwright, and Selenium is the value of navigator.webdriver. In normal browsers this is set to false, whereas in unfortified headless browsers it is set to true.
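You can check this leak yourself with a few lines of vanilla Selenium. The --disable-blink-features=AutomationControlled switch below is one real Chrome flag that hides it; treat this as a sketch, not a complete fortification:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
# Without this flag, navigator.webdriver is true and trivially detectable
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")
    print(driver.execute_script("return navigator.webdriver"))  # expect False/None
finally:
    driver.quit()
```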
There are over 200 known headless browser leaks that these stealth plugins attempt to patch, and the true number is believed to be much higher, as browsers are constantly changing and it is in browser developers' and anti-bot companies' interest not to reveal all the leaks they know of.
Headless browser stealth plugins patch the large majority of these browser leaks and can often bypass anti-bot services like PerimeterX, Incapsula, DataDome, and Cloudflare, depending on the security level the website has implemented them at.
However, they don't catch them all. To truly make your headless browser appear like a real browser, you will have to patch the remaining leaks yourself.
Another way to make your headless browsers more undetectable to PerimeterX is to pair them with high-quality residential or mobile proxies. These proxies typically have higher IP address reputation scores than datacenter proxies, and anti-bot services are more reluctant to block them, which makes them more reliable.
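As a sketch of wiring a residential proxy into a headless browser (assuming Playwright for Python; the proxy endpoint and credentials below are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://proxy.example.com:8000",  # placeholder endpoint
            "username": "PROXY_USER",                   # placeholder credentials
            "password": "PROXY_PASS",
        },
    )
    page = browser.new_page()
    page.goto("https://example.com/")
    print(page.title())
    browser.close()
```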
The downside of pairing headless browsers with residential/mobile proxies is that costs can rack up fast.
Residential and mobile proxies are typically charged per GB of bandwidth used, and a page rendered with a headless browser consumes around 2MB on average (versus roughly 250KB without one), so it can get very expensive as you scale.
The following is an example of the cost of using residential proxies from BrightData with a headless browser, assuming 2MB per page:

| Pages | Bandwidth | Cost Per GB | Total Cost |
| --- | --- | --- | --- |
| 25,000 | 50 GB | $13 | $650 |
| 100,000 | 200 GB | $10 | $2,000 |
| 1 Million | 2 TB | $8 | $16,000 |
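To estimate this for your own volumes, here is a quick back-of-the-envelope helper (it uses decimal GB to match the table above; the default rate is illustrative, not BrightData's actual pricing):

```python
def proxy_cost(pages: int, mb_per_page: float = 2.0, usd_per_gb: float = 10.0) -> float:
    """Estimate proxy spend: pages * MB/page -> GB, then GB * $/GB."""
    gb = pages * mb_per_page / 1000  # decimal GB, matching the table above
    return gb * usd_per_gb

print(proxy_cost(100_000))  # 200 GB at $10/GB -> 2000.0
```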
If you want to compare proxy providers you can use this free proxy comparison tool, which can compare residential proxy plans and mobile proxy plans.
The downside of using open-source pre-fortified headless browsers is that anti-bot companies like PerimeterX can see how they bypass their anti-bot protection systems and easily patch the issues they exploit.
As a result, most open-source PerimeterX bypasses only have a couple of months of shelf life before they stop working.
The alternative to using open-source PerimeterX bypasses is to use smart proxies that develop and maintain their own private PerimeterX bypass.
These are typically more reliable, as it is harder for PerimeterX to develop patches for them, and they are developed by proxy companies who are financially motivated to stay one step ahead of PerimeterX and to fix their bypasses the minute they stop working.
Most smart proxy providers (ScraperAPI, ScrapingBee, Oxylabs, Smartproxy) have some form of PerimeterX bypass; these work to varying degrees and vary in cost.
However, one of the best options is to use the ScrapeOps Proxy Aggregator as it integrates over 20 proxy providers into the same proxy API, and finds the best/cheapest proxy provider for your target domains.
You can activate ScrapeOps' PerimeterX Bypass by simply adding bypass=perimeterx to your API request, and the ScrapeOps proxy will use the best & cheapest PerimeterX bypass available for your target domain.
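As a sketch (assuming the Python requests library; the proxy endpoint below follows ScrapeOps' documented API, but verify it against your account):

```python
import requests

response = requests.get(
    "https://proxy.scrapeops.io/v1/",  # ScrapeOps proxy API endpoint (check the docs)
    params={
        "api_key": "YOUR_API_KEY",
        "url": "https://www.zillow.com/",  # a PerimeterX-protected target
        "bypass": "perimeterx",            # activate the PerimeterX bypass
    },
)
print(response.status_code)
print(response.text[:500])
```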
You can get a ScrapeOps API key with 1,000 free API credits by signing up here.
The advantage of taking this approach is that you can use your normal HTTP client and don't have to worry about:
Fortifying headless browsers
Managing numerous headless browser instances & dealing with memory issues
Reverse engineering PerimeterX's anti-bot protection
This is all managed within the ScrapeOps Proxy Aggregator.
The final and most complex way to bypass PerimeterX's anti-bot protection is to actually reverse engineer its anti-bot protection system and develop a bypass that passes all of PerimeterX's anti-bot checks without the need for a full fortified headless browser instance.
This approach works (and is what many smart proxy solutions do), however, it is not for the faint-hearted.
Advantages: The advantage of this approach is that, if you are scraping at large scales, you don't need to run hundreds (if not thousands) of costly full headless browser instances. You can instead develop the most resource-efficient PerimeterX bypass possible: a slimmed-down headless browser that is solely designed to pass the PerimeterX JS, TLS, and IP fingerprint tests, or no headless browser at all (which is very hard).
Disadvantages: The disadvantage of this approach is that you will have to dive deep into an anti-bot system that has been made deliberately hard to understand from the outside, and split test different techniques to trick its verification system. You will then need to maintain this bypass as PerimeterX continues to develop its anti-bot protection.
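As an illustration of just the TLS piece, an open-source library such as curl_cffi can impersonate Chrome's TLS fingerprint without running a browser at all. This is a sketch of one check only; passing TLS alone will not defeat PerimeterX's JS fingerprint tests:

```python
from curl_cffi import requests  # pip install curl_cffi

# Send the request with Chrome's TLS/JA3 fingerprint so the handshake
# looks like a real browser rather than a Python HTTP client
response = requests.get("https://example.com/", impersonate="chrome")
print(response.status_code)
```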
It is possible to bypass PerimeterX like this, but I would only recommend taking this approach if you are:
Genuinely interested in the intellectual challenge of reverse engineering a sophisticated anti-bot system like PerimeterX, or
Confident that the economic returns from a more cost-effective PerimeterX bypass warrant the days or weeks of engineering time that you will have to devote to building and maintaining it.
For companies scraping at very large volumes (500M+ pages per month), or smart proxy solutions whose businesses depend on cost-effective ways to access sites, building your own custom PerimeterX bypass might be a good option.
For most other developers, you are probably better off using one of the other three PerimeterX bypassing methods.
For those of you who do want to take the plunge, the following is a rundown of how PerimeterX's Bot Defender works and how you can approach bypassing it.
In practice, fortifying a headless browser is the most practical and effective approach, as it's much easier to make a real headless browser look like a normal user's browser than to re-implement one from scratch.
However, browser automation tools like Selenium, Playwright, and Puppeteer leave traces of their existence, which need to be patched to achieve high trust scores. For that, see projects like the Puppeteer stealth plugin and similar stealth extensions that patch known leaks.
For sustained web scraping with a PerimeterX bypass in 2023, these browsers should also be rotated through different fingerprint profiles: screen resolution, operating system, and browser type all play an important role in PerimeterX's bot score.
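A sketch of rotating fingerprint profiles with Playwright for Python (the profiles below are illustrative placeholders; in practice each profile should stay internally consistent, e.g. a macOS user agent paired with macOS-typical screen sizes):

```python
import random

from playwright.sync_api import sync_playwright

# Hypothetical example profiles: user agent + viewport kept consistent per profile
PROFILES = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "viewport": {"width": 1920, "height": 1080},
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "viewport": {"width": 1440, "height": 900},
    },
]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    profile = random.choice(PROFILES)  # pick a fresh fingerprint per session
    context = browser.new_context(
        user_agent=profile["user_agent"],
        viewport=profile["viewport"],
    )
    page = context.new_page()
    page.goto("https://example.com/")
    print(page.title())
    browser.close()
```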
Bypass with ScrapFly
While bypassing PerimeterX is possible, maintaining bypass strategies can be very time-consuming. This is where services like ScrapFly come in!
Using the ScrapFly web scraping API, we can hand over all of the web scraping complexity and bypass logic to an API!
ScrapFly is not only a PerimeterX bypasser but also offers many other web scraping features.
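As a sketch using the ScrapFly Python SDK (assuming pip install scrapfly-sdk; the asp flag enables ScrapFly's Anti Scraping Protection bypass per its docs, so verify parameter names against the current SDK):

```python
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
result = client.scrape(
    ScrapeConfig(
        url="https://example.com/",
        asp=True,  # Anti Scraping Protection bypass (covers PerimeterX)
    )
)
print(result.content[:500])  # rendered page HTML
```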
So, let's assume you've set up your anti-detect browser of choice and are ready to scrape. But wait: even with the most advanced tools, you can still run into issues if you don't follow some best practices. Here's a step-by-step guide to ensure your web scraping journey is as smooth as a hot knife through butter.
First and foremost, don't be that guy who bombards a website with a thousand requests per second. It's not just impolite; it's also a surefire way to get your IP address banned.
Implement rate limiting in your scraping code to ensure you're making requests at a frequency that is respectful to the website's server resources.
This is especially crucial when scraping smaller websites that don't have the server capacity to handle a high volume of requests. Being a gentleman in the scraping world pays off, as it reduces the likelihood of getting detected and blocked.
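A minimal sketch of client-side rate limiting (standard library only; the one-request-per-second rate is an arbitrary example):

```python
import time


class RateLimiter:
    """Block until enough time has passed to stay under `rate` requests per second."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.last_request = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()


limiter = RateLimiter(rate=1.0)  # at most one request per second
for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait()
    print(f"fetching {url}")  # replace with your actual request logic
```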
Web scraping bots are often detected due to their machine-like behavior. Introducing random delays between your requests can make your bot's behavior appear more human-like.
This is a simple yet effective way to bypass many anti-bot measures. For instance, instead of making a request every two seconds, randomize the intervals to range between 1.5 and 2.5 seconds. This unpredictability makes it harder for anti-bot algorithms to flag your activities, thereby increasing the longevity and effectiveness of your scraping operations.
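Following the article's own numbers, randomizing the delay to fall between 1.5 and 2.5 seconds looks like this (a sketch with a placeholder fetch):

```python
import random
import time

urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

for url in urls:
    print(f"fetching {url}")  # replace with your actual request logic
    time.sleep(random.uniform(1.5, 2.5))  # human-like, unpredictable pause
```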
Source: https://multilogin.com/blog/anti-detect-browsers-for-web-scraping/