How to use User Agent strings to prevent blocking while web scraping?

What is a user agent string?

The User-Agent string of a web browser helps servers (websites) identify the browser (Chrome, Edge, Firefox, IE etc.), its version and the operating system (Windows, Mac, Android, iOS etc.) on which it is running. This mainly helps websites serve different pages for various platforms and browser types.
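For example, a user agent string for Chrome running on Windows 10 looks something like the following (version numbers will vary):

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36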

If you go to https://www.whatismybrowser.com/detect/what-is-my-user-agent you can see the user agent string of your browser.

User Agent strings for web scraping

The same detail can be used by websites to block non-standard web browsers and bots. To prevent this, we can configure web scrapers to mimic a standard browser’s user agent.

WebHarvy, our generic visual web scraper, allows you to set any user agent string for its mining browser, so that websites assume the web scraper to be a normal browser and will not block access. To configure this, open WebHarvy Settings and go to the Browser settings tab. Here, you should enable the custom user agent string option and paste the user agent string of a standard browser like Chrome or Edge.
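Outside of WebHarvy, the same idea applies to any HTTP client. Purely as an illustration, here is a minimal Node.js (version 18 or later) sketch which requests a page while presenting a standard Chrome user agent string; the URL and version numbers are examples only.

// Minimal sketch: request a page while presenting a desktop Chrome user agent.
// The URL and the user agent value below are examples only.
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

fetch('https://example.com/', { headers: { 'User-Agent': UA } })
  .then((response) => console.log(response.status))
  .catch((error) => console.error(error));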

This option can be used to make WebHarvy’s browser appear like any specific standard web browser (ex: Microsoft Edge, Mozilla Firefox, Google Chrome or Apple Safari) to websites from which you are trying to extract data.

How to get the user agent strings of various browsers?

You may find user agent strings of various browsers at http://useragentstring.com/pages/useragentstring.php

Scraping images from Instagram using WebHarvy

WebHarvy can be used to scrape text as well as images from websites. In this article we will see how WebHarvy can be used to scrape data from Instagram.

How to automatically download images from Instagram searches?

The following video shows how WebHarvy can be configured to scrape (download) images by searching Instagram for a tag (example: #newyork). As shown, a few additional techniques are used to open the first image and to automatically load subsequent images. The JavaScript code for these steps can be found in the description given below the video.
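Purely as an illustration of the kind of JavaScript involved (the selectors below are hypothetical, not the ones used in the video, and Instagram’s markup changes often):

// Hypothetical sketch: open the first post shown in the search results grid.
document.querySelector('article a').click();

// Hypothetical sketch: scroll to the bottom of the page so more results load.
window.scrollTo(0, document.body.scrollHeight);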

In addition to downloading images, WebHarvy can also scrape textual data from Instagram like post content, followers of a profile etc.

Try WebHarvy

If you are interested, we highly recommend that you download and try using the free evaluation version of WebHarvy available on our website. To get started, please follow the link below.

Getting started with data extraction using WebHarvy

Scraping Twitter using WebHarvy

WebHarvy can be used to scrape data from social media websites like Twitter, LinkedIn, Facebook etc. In the following video you can see how easy it is to scrape tweets from Twitter searches using WebHarvy. A similar technique can be used to scrape tweets from a Twitter profile page.

In this video, pagination via JavaScript code is used to scrape multiple pages of Twitter search results.

The JavaScript code used in the above video is copied below.

// Find the common ancestor element which contains the list of tweets,
// then scroll its last child into view so that the next batch of tweets loads.
groupEl = document.getElementsByTagName('article')[0].parentElement.parentElement.parentElement.parentElement;
groupEl.children[groupEl.childElementCount-1].scrollIntoView();

Normally, pages which load more data as we scroll down can be configured by following the method explained at https://www.webharvy.com/tour3.html#ScrollToLoad. But in the case of Twitter, the page also deletes tweets from the top as we scroll down. Hence, JavaScript has to be used for pagination.

Try WebHarvy

In case you are interested, we recommend that you download and try using the free evaluation version of WebHarvy available on our website. To get started, please follow the link given below.

https://www.webharvy.com/articles/getting-started.html

WebHarvy 6.1 – Internal Proxies, Database/File Update, New Capture window options

The following are the main changes in this version.

Option to leave a blank row when data is unavailable for a keyword/category/URL

In WebHarvy’s Keyword/Category settings page, a new option has been added to leave a blank row (containing only the corresponding keyword/category/URL) when data is unavailable for that item. This option is available only when the ‘Tag with Category/URL/Keyword’ option is enabled.

For mining data using a list of keywords, categories or URLs, enabling this option helps in identifying the items for which WebHarvy failed to fetch data, as shown below.

Proxies are used internally by WebHarvy, not system-wide

In earlier versions, proxies set in WebHarvy Settings were applied system wide during mining. This caused side effects for other applications, especially when proxies required login with a username and password, and when a list of proxies was cycled. Starting from this version, WebHarvy uses proxies internally so that other applications are not affected during mining. You can still apply proxies directly in Windows settings (system wide) and WebHarvy will use them automatically.

Also, the configuration browser now uses the proxies set in WebHarvy Settings. In earlier versions, proxies were used only during mining.

Database, Excel File Export: Update option (Upsert)

While saving/exporting mined data to a database or Excel file which already contains data (from a previous mining session), WebHarvy now allows you to update rows whose first column value matches that of the newly mined data, instead of creating duplicate rows.

For file export this option is currently available only for Excel files.
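Conceptually, the update behaves like an upsert keyed on the first column. The following is only a rough sketch of the idea, not WebHarvy’s actual implementation.

// Rough sketch of the idea only, not WebHarvy's implementation: merge newly
// mined rows into previously exported rows, keyed on the first column value.
function mergeRows(existingRows, newRows) {
  const byKey = new Map(existingRows.map((row) => [row[0], row]));
  for (const row of newRows) {
    byKey.set(row[0], row); // replaces a matching row or appends a new one
  }
  return Array.from(byKey.values());
}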

New Capture window options: Page reload and Go back

Two new Capture window options for page interaction have been added: Reload and Go back. The Reload option is helpful in cases where a page is not loaded correctly the first time a link is followed. The ‘Go back’ option navigates the browser back to the previously loaded page.

Keywords can be added even after starting configuration

Just like URLs, Keywords can also be added after starting configuration. This method is useful in cases where the normal method of Keyword Scraping cannot be applied. The only condition for adding keywords in this method is that the first keyword entered should be present in the Start URL or Post Data of the configuration.

Other minor changes

  1. During configuration, in pages reached by following links from the starting page, links (URLs) selected by applying Regular Expressions on HTML can be followed using the ‘Follow this link’ option. Earlier, only the Click option was available for this scenario.
  2. Automatically handles encoded URLs selected from HTML. Example: URLs including ‘&’ (see the sketch after this list). This works for following links as well as for image URLs.
  3. ‘Enable JavaScript’, ‘Share Location’ and ‘Enable plugins’ options removed from Browser settings.
  4. Fixed a bug related to scraping a list of URLs when one of the URLs fails to load.
  5. While scraping a list of URLs, URLs which do not start with the HTTP scheme (http:// or https://) are now handled.
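Regarding item 2 above, assuming it refers to HTML entity encoding (where ‘&’ appears as ‘&amp;’ in the page source), the idea is roughly the following; the URL is made up for illustration.

// Illustration only (made-up URL): a link copied from HTML source may contain
// '&amp;', which must be decoded to '&' before the URL can be followed.
const rawHref = 'https://example.com/page?category=books&amp;page=2';
const decodedHref = rawHref.replace(/&amp;/g, '&');
console.log(decodedHref); // https://example.com/page?category=books&page=2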

Download the latest version

The latest version of WebHarvy is available here. If you are new to WebHarvy, we recommend viewing our ‘Getting started’ guide.


Scraping Owner Phone Numbers from Zillow FSBO listings

This post explains how WebHarvy can be easily configured to scrape owner phone numbers from Zillow’s FSBO (For Sale By Owner) listings.

WebHarvy is a generic visual web scraper which can be used to scrape data from any website.

Scraping owner phone numbers

Property listings in Zillow display owner phone numbers at various locations within the property details page.

  • If you click the Contact Agent button, you can see the owner phone number at the end of the list in the popup window displayed.

  • You can see the owner phone number listed within the owner/agents list under the ‘Get More Information’ heading.

  • You can also view the owner name and phone number under the ‘Listing provided by owner’ heading.

WebHarvy allows you to capture text displayed after a heading text using the ‘Capture following text’ feature explained in the link given below.

How to scrape text displayed after a heading text?

The video displayed below shows the steps which you need to follow to configure WebHarvy to scrape owner phone numbers from Zillow FSBO listings by clicking the ‘Contact Agent’ button.

The video below shows how owner phone numbers can be selected from the other two locations on the listing details page.

Scraping multiple properties listed across multiple pages is configured as explained here, and each property link is opened using the ‘Follow this link’ feature.

Try WebHarvy

If you are new to WebHarvy, we highly recommend that you download and try using the free evaluation version of WebHarvy available on our website. Please follow the link below to get started.

Getting started with WebHarvy

Need Support?

Feel free to contact our support if you have any questions.

Generate Real Estate Leads using Web Scraping

Web Scraping is the automated process of extracting data from websites using software or an online service. This technique can be used to easily extract property owner or real estate agent contact details from websites like Zillow, Trulia, Realtor etc.

WebHarvy is a point and click, visual web scraper which can be used to extract data from websites.

Getting agent phone numbers

Most real estate websites allow you to search and view details of agents catering to a specific region. The following video shows how WebHarvy can be used to extract agent contact details like name, address, phone number etc. from Zillow.

Getting owner/agent contact details from property listings

Owner or agent contact details can also be extracted from property listings as shown in the following videos.

Scraping agent phone numbers from property listings

Scraping owner phone numbers from property listings

Scraping leads from Realtor

The following video shows how agent contact details can be extracted from the Realtor website.

We have an entire playlist of videos related to real estate data extraction which you may watch at this link. WebHarvy can be used to extract data automatically from any website.

Get Started

We recommend that you download and try using the free evaluation version of WebHarvy to know more. To get started, please follow the link below.

Getting started with web scraping using WebHarvy

How to scrape data from Bing Maps?

WebHarvy is a generic visual web scraping software which can be easily configured to extract data from any website. In this article we will see how WebHarvy can be configured to extract data from Bing Maps.

Details like business name, address, phone number, website address, rating etc. can be easily extracted from Bing Maps listings using WebHarvy. Just like most map interfaces, the details are opened in a popup over the map. The following video shows how WebHarvy can be configured to extract the required details.

As shown in the above video, the Open Popup feature of WebHarvy is used to open each listing’s details and scrape the data displayed. The Capture following text feature is used to correctly select details like address, website, phone number etc. It is recommended to use this method for data selection whenever the data is guaranteed to appear after a heading text.

Sometimes, the Bing Maps interface displays a ‘Website’ button which you can click to visit the website of the listed business. In such cases, the website address itself is not displayed in the listing popup.

1. To extract the website address in such scenarios, during configuration, highlight and click the entire popup area as shown in the following image.

2. From the resulting Capture window displayed, select More Options > Apply Regular Expression. Paste and apply the following RegEx string to get the website address.

role="button"\s*href="(http[^"]*)

3. Click the main ‘Capture HTML’ button to capture it.
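To see what this expression captures, you can test it against a snippet of HTML. The snippet below is made up purely for illustration; the actual markup used by Bing Maps may differ.

// Illustration only: a made-up HTML snippet in the shape the expression expects.
const html = '<a role="button" href="https://www.example-business.com">Website</a>';
const match = html.match(/role="button"\s*href="(http[^"]*)/);
console.log(match ? match[1] : 'no match'); // https://www.example-business.com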

Scraping data from Google Maps

WebHarvy also supports extracting data from Google Maps listings. We have several demonstration videos related to this, which you can watch by following the link below.

Google Maps Data Extraction using WebHarvy

Try WebHarvy

We highly recommend that you download and try using the free evaluation version of WebHarvy. To get started please follow the link given below.

Getting started with data extraction using WebHarvy

How to scrape TripAdvisor reviews and ratings?

WebHarvy can be used to scrape data from the TripAdvisor website. In this article we will see how WebHarvy can be configured to scrape reviews and ratings from multiple listings on the TripAdvisor website.

By default, TripAdvisor does not display the complete review text on its listing pages. You will have to click a ‘Read more’ link at the end of each partially displayed review to view the complete review. This can be automated using WebHarvy as shown in the following video.


Regular expression strings are used to correctly select the review date and the numerical rating value. The rating value is selected from the HTML source of the rating stars displayed by the website. The RegEx strings used are copied below.

wrote a review (.*)

rating bubble_([^"]*)
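As a rough illustration of what these expressions capture, here they are applied to made-up snippets shaped like the markup shown in the video (TripAdvisor’s actual markup may have changed since then).

// Illustration only: made-up snippets shaped like the markup shown in the video.
const reviewHeader = 'JohnD wrote a review Sep 2021';
const dateMatch = reviewHeader.match(/wrote a review (.*)/);
console.log(dateMatch ? dateMatch[1] : 'no match'); // Sep 2021

const ratingHtml = '<span class="ui_bubble_rating bubble_40"></span>';
const ratingMatch = ratingHtml.match(/rating bubble_([^"]*)/);
console.log(ratingMatch ? ratingMatch[1] : 'no match'); // 40, i.e. a 4.0 rating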

We have several videos on our YouTube channel related to TripAdvisor data extraction. You may watch them at the following link.

TripAdvisor Scraping Videos using WebHarvy

Try WebHarvy

We recommend that you download and try the free evaluation version of WebHarvy. To know more please follow the link below.

Getting started with data scraping using WebHarvy

How to scrape data from eBay product listings? (price, images, specifications, seller description etc.)

WebHarvy can be used to easily scrape product data from listings at eCommerce websites like Amazon, eBay etc. We have an entire playlist of demonstration videos related to eCommerce data extraction on our YouTube channel.

eBay Data Scraping

In this article we will see how WebHarvy can be used to extract product data from eBay listings. Details like product name, price, product URL, item specifications (condition, weight, UPC/MPN etc.), seller description etc. can be extracted. WebHarvy can also extract product images (thumbnail as well as high resolution images) from eBay product listings. The following video shows the steps involved.

The JavaScript code used in the above video to open seller description as a separate page is copied below.

// Navigate the browser to the source URL of the seller description iframe.
location.href = document.getElementById('desc_ifr').getAttribute('src');
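Here, desc_ifr is the id of the iframe in which eBay displays the seller’s description; navigating the browser to that iframe’s src URL lets WebHarvy capture the description contents as if they were a regular page.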

More videos related to eBay data extraction

Try WebHarvy

In case you are interested in exploring more, we highly recommend that you download and try using our free evaluation version. To get started, please follow the link given below.

Getting started with web data scraping using WebHarvy