Web scraping software mainly falls into two categories:
1. Those that run from the cloud and are accessed via browsers or browser extensions.
2. Those that are installed and run locally on your computer.
Let's look at the points you should consider when choosing between a desktop (local) and a cloud-based web scraping solution.
1. Cross Platform
Cloud scraping services are usually configured via browser extensions (for example, a Google Chrome extension), while the actual scraping happens on the provider's servers. They can therefore be configured and accessed from any platform (Windows, Linux, Mac, web, mobile) and from any location.
Local web scraping software, on the other hand, is installed and run on your own PC or Mac, which must be powered on for scraping to happen.
There is a workaround for running local web scraping software from the cloud: using a cloud OS instance (for example, an Amazon AWS Windows EC2 instance), you can install and run the software on a remote server.
2. IP blocking
One of the main challenges in web scraping is avoiding getting blocked by web servers. This is more of a problem for cloud-based solutions than for desktop web scraping software. Since each individual user's PC or Mac has its own unique IP address, the chances of it being blocked outright by websites are very low. But since cloud scraping solutions run scraping tasks for many users from the same server (or set of servers), there is a high chance that websites will block their IP (or set of IPs). So they need to continuously cycle through proxies to avoid detection, which can result in slower scraping speeds.
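The proxy cycling mentioned above is often a simple round-robin rotation. Here is a minimal Python sketch; the proxy addresses are placeholders, not real servers:

```python
from itertools import cycle

# Hypothetical proxy pool -- replace with your own proxy addresses.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_pool = cycle(PROXIES)

def next_proxy() -> str:
    """Return the next proxy in round-robin order."""
    return next(proxy_pool)

# Each request can then go out through a different proxy, e.g. with the
# `requests` library: requests.get(url, proxies={"http": next_proxy()})
```

Real services typically layer more on top of this (health checks, per-site rotation policies), but the basic idea is the same.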
3. Control over data, credentials and privacy
With a locally installed web scraping software, the data which you collect stays with you and does not go outside your computer. Also the credentials which you sometimes need to provide to the web scraping software is also kept locally. Whereas in the case of cloud scraping services, the data is first saved in the server and can later be downloaded. Also any sensitive information which you have to provide as part of configuring the scraper will have to be sent to the server. In short, you have more control over your data and privacy while using a local web scraping software.
4. Cost
Web scraping is a relatively resource-intensive operation compared to other everyday computational tasks. To correctly extract data from most modern websites, the scraping software or platform has to run a virtual browser (a full-fledged headless browser, i.e. one without a display) to load pages and perform the extraction. This is expensive in terms of both memory (RAM) and processing power (CPU). Since cloud solutions run scraping tasks for multiple users, they need to provision enough infrastructure to support them all. As server costs increase accordingly, the monthly plans of most cloud-based web scraping services are high compared to the one-time license price of locally installed web scraping software.
5. API access
Most cloud-based data extraction services provide APIs so that developers can write their own code or scripts to scrape data from websites using the platform. This functionality is absent in local web scraping software.
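Such APIs are usually simple HTTP endpoints. The sketch below composes a request URL for a purely hypothetical service; the base URL, authentication scheme, and parameter names are invented for illustration, so consult your provider's API documentation for the real ones:

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- not a real service.
API_BASE = "https://api.example-scraper.com/v1/extract"

def build_extract_url(api_key: str, target_url: str, fmt: str = "json") -> str:
    """Compose a request URL for a (hypothetical) cloud extraction API.

    The parameter names (api_key, url, format) are assumptions; real
    providers define their own.
    """
    query = urlencode({"api_key": api_key, "url": target_url, "format": fmt})
    return f"{API_BASE}?{query}"
```

A developer would then fetch this URL from their own script and parse the returned data, without ever running a browser themselves.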
6. Location specific extraction
The default location used by locally installed scraping software is the same as your computer's location, whereas with a cloud solution it may be different. Websites sometimes display data based on the user's location, so the data you see locally (when you visit the website in your browser) can differ from the data fetched by a cloud solution. Both local and cloud solutions provide functionality to set custom locations via proxies.
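Setting a custom location via proxies usually amounts to routing requests through a proxy whose exit node sits in the target country. A minimal sketch, assuming a hypothetical mapping of country codes to proxy endpoints (the addresses are placeholders):

```python
# Hypothetical country-code -> proxy mapping; the endpoints are
# placeholders, not real servers.
GEO_PROXIES = {
    "us": "http://us-proxy.example.com:8080",
    "de": "http://de-proxy.example.com:8080",
    "jp": "http://jp-proxy.example.com:8080",
}

def proxies_for(country: str) -> dict:
    """Return a requests-style proxies dict for the given country code."""
    endpoint = GEO_PROXIES[country.lower()]
    return {"http": endpoint, "https": endpoint}

# requests.get(url, proxies=proxies_for("de")) would then fetch the page
# as if browsing from Germany (assuming the proxy's exit node is there).
```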
7. Limit on usage
With cloud extraction, there will be a usage limit based on your subscription plan: the amount of data you can scrape or the number of requests you can make may be capped. There may also be limits on the number of extraction tasks that can run at the same time, and on the number of days for which scraped data is stored on the server. Such limitations are non-existent in local web scraping software; you can extract an unlimited amount of data since the network and memory costs are already borne by you. Also, local solutions mostly allow you to keep running older versions of the software for life, even after the license expires.
8. Scalability
The memory and processing resources of your computer are limited, whereas in the cloud they are not: if you are willing to pay more, you can run more scraping tasks in parallel for larger-volume data extraction. With local solutions, the number of parallel scraping tasks is limited by your system resources (RAM, CPU). But if you have multiple computers or laptops and the necessary internet bandwidth, you can run the web scraping software on each of them to scale up data extraction.
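On a single machine, the usual way to respect this resource limit is to cap the number of parallel workers to something proportional to your CPU count. A minimal sketch, where `fetch` stands in for whatever single-URL scraping function you already have:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, fetch, max_workers=None):
    """Fetch many URLs in parallel, capped by available CPU cores.

    `fetch` is any single-URL scraping function; the CPU-count cap is a
    simple heuristic, not a hard rule.
    """
    if max_workers is None:
        max_workers = min(len(urls), os.cpu_count() or 4)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Headless-browser scrapers are often memory-bound rather than CPU-bound, so in practice you may want to cap workers by available RAM instead.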