The User-Agent string of a web browser helps servers (websites) to identify the browser (Chrome, Edge, FireFox, IE etc.), its version and also the operating system (Windows, Mac, Android, iOS etc.) on which it is running. This mainly helps the websites to serve different pages for various platforms and browser types.
The same detail can be used by websites to block non-standard web browsers and bots. To prevent this we can configure web scrapers to mimic a standard browser’s user agent.
WebHarvy, our generic visual web scraper, allows you to set any user agent string for its mining browser, so that websites assume the web scraper to be a normal browser and will not block access. To configure this, open WebHarvy Settings and go to the Browser settings tab. Here, you should enable the custom user agent string option and paste the user agent string of a standard browser like Chrome or Edge.
This option can be used to make WebHarvy’s browser appear like any specific standard web browser (ex: Microsoft Edge, Mozilla Firefox, Google Chrome or Apple Safari) to websites from which you are trying to extract data.
How to get user agent string of various browsers ?
WebHarvy can be used to scrape text as well as images from websites. In this article we will see how WebHarvy can be used to scrape data from Instagram.
How to automatically download images from Instagram searches ?
The following video shows how WebHarvy can be configured to scrape images (download images) by searching Instagram for a tag (example: #newyork). As shown, a few additional techniques are used to open the first image and also to automatically load subsequent images. The JavaScript codes for these can be found in the video description given below the video here.
In addition to downloading images, WebHarvy can also scrape textual data from Instagram like post content, followers of a profile etc.
Try WebHarvy
If you are interested we highly recommend that you download and try using the free evaluation version of WebHarvy available in our website. To get started, please follow the link below.
WebHarvy can be used to scrape data from social media websites like Twitter, LinkedIn, Facebook etc. In the following video you can see how easy it is to scrape tweets from Twitter searches using WebHarvy. Similar technique can be used to scrape tweets from a Twitter profile page.
In this video, pagination via JavaScript code is used to scrape multiple pages of Twitter search results.
The JavaScript code used in the above video is copied below.
Normally, pages which load more data as we scroll down can be configured by following the method explained at https://www.webharvy.com/tour3.html#ScrollToLoad. But in the case of Twitter, the page also deletes tweets from the top as we scroll down. Hence, JavaScript has to be used for pagination.
Try WebHarvy
In case you are interested, we recommend that you download and try using the free evaluation version of WebHarvy available in our website. To get started, please follow the link given below.
The following are the main changes in this version.
Option to leave a blank row when data is unavailable for a keyword/category/URL
In WebHarvy’s Keyword/Category settings page a new option has been added to leave a blank row filled with corresponding keyword/category/URL when data is unavailable for that item. This option is available only when ‘Tag with Category/URL/Keyword’ option is enabled.
For mining data using a list of keywords, categories or URLs, enabling this option helps in identifying the items for which WebHarvy failed to fetch data, as shown below.
Proxies are used internally by WebHarvy, not system-wide
In earlier versions, proxies set in WebHarvy Settings were applied system wide during mining. This caused side effects for other applications especially in cases where proxies required login with a user name and password and when a list of proxies were cycled. Starting from this version WebHarvy will use proxies internally so that other applications are not affected during mining. You still can apply proxies directly in Windows settings (system wide) and WebHarvy will use it automatically.
Also, the configuration browser will start using the proxies which are set in WebHarvy settings. In earlier versions, proxies were used only during mining.
While saving/exporting mined data to a database or excel file which already contains data (from a previous mining session), WebHarvy now allows you to update those rows of data which has the same first column value as those in the newly mined data, without creating duplicate rows.
For file export this option is currently available only for Excel files.
New Capture window options : Page reload and Go back
2 new capture window options for page interaction have been added – Reload & Go back. The reload option is helpful in cases where a page is not correctly loaded first time when a link is followed. The ‘Go back’ option navigates the browser back to the previously loaded page.
Keywords can be added even after starting configuration
During configuration, in pages reached by following links from the starting page, links (URLs) selected by applying Regular Expressions on HTML can be followed using the ‘Follow this link’ option. Earlier, only the Click option was available for this scenario.
Automatically handles encoded URLs selected from HTML. Example: URLs including ‘&’. This works for following links as well as for image URLs.
‘Enable JavaScript’, ‘Share Location’ and ‘Enable plugins’ options removed from Browser settings.
Bug related to scraping a list of URLs when one of the URLs fails to load fixed.
While scraping a list of URLs, URLs which do not start with HTTP scheme part (http:// or https://) are handled.
Download the latest version
The latest version of WebHarvy is available here. If you are new to WebHarvy we would recommend you to view our ‘Getting started‘ guide.
The video displayed below shows the steps which you need to follow to configure WebHarvy to scrape owner phone numbers from Zillow FSBO listings by clicking the ‘Contact Agent’ button.
Video below shows how owner phone numbers can be selected from the other 2 locations in listing details page.
Scraping multiple properties listed over multiple pages is configured as explained here and each property link is opened using the ‘Follow this link‘ feature.
Try WebHarvy
If you are new to WebHarvy, we highly recommend that you download and try using the free evaluation version of WebHarvy available in our website. Please follow the link below to get started.
Web Scraping is the automated process of extracting data from websites using software or an online service. This technique can be used to easily extract property owner or real estate agent contact details from websites like Zillow, Trulia, Realtor etc.
WebHarvy is a point and click, visual web scraper which can be used to extract data from websites.
Getting agent phone numbers
Most real estate websites allow you to search and view details of agents catering to a specific region. The following video shows how WebHarvy can be used to extract agent contact details like name, address, phone number etc from Zillow.
Getting owner/agent contact details from property listings
Owner or agent contact details can also be extracted from property listings as shown in the following videos.
Scraping agent phone numbers from property listings
Scraping owner phone numbers from property listings
Scraping leads from Realtor
The following video shows how agent contact details can be extracted from Realtor website.
We have an entire playlist of videos related to real estate data extraction which you may watch at this link. WebHarvy can be used to extract data automatically from any website.
Get Started
We recommend that you download and try using the free evaluation version of WebHarvy to know more. To get started, please follow the link below.
WebHarvy is a generic visual web scraping software which can be easily configured to extract data from any website. In this article we will see how WebHarvy can be configured to extract data from Bing maps.
Details like business name, address, phone number, website address, rating etc. can be easily extracted from Bing maps listings using WebHarvy. Just like most map interfaces the details are opened in a popup over the map. The following video shows how WebHarvy can be configured to extract the required details.
As shown in the above video, the Open Popup feature of WebHarvy is used to open each listing details and scrape the data displayed. The Capture following text feature is used to correctly select details like address, website, phone number etc. It is recommended to use this method for data selection whenever the data is guaranteed to appear after a heading text.
Sometimes, Bing maps interface displays a ‘Website’ button, clicking which you can visit the website of the listed business. In such cases the website address as such will not be displayed in the listings popup.
1. To extract website address in such scenarios, during configuration, highlight and click the entire popup area as shown in the following image.
2. From the resulting Capture window displayed, select More Options > Apply Regular Expression. Paste and apply the following RegEx string to get the website address.
role=”button”\s*href=”(http[^”]*)
3. Click the main ‘Capture HTML‘ button to capture it.
Scraping data from Google Maps
WebHarvy also supports extracting data from Google Maps listings. We have several demonstration videos related to this, which you can watch by following the link below.
WebHarvy can be used to scrape data from TripAdvisor website. In this article we will be see how WebHarvy can be configured to scrape reviews and ratings from multiple listings at TripAdvisor website.
By default, TripAdvisor does not display the complete review text in its listings pages. You will have to click a ‘Read more’ link at the end of each partially displayed review, to view the complete review. This can be automated using WebHarvy as shown in the following video.
Regular expression strings are used to correctly select the date of review, and also the rating numerical value. The rating value is selected from the HTML source of the rating stars displayed by the website. The RegEx strings used are copied below.
wrote a review (.*)
rating bubble_([^”]*)
We have several videos in our YouTube channel related to TripAdvisor data extraction. You may watch them at the following link.
In this article we will see how WebHarvy can be used to extract product data from eBay listings. Details like product name, price, product URL, item specifications (condition, weight, UPC/MPN etc.), seller description etc. can be extracted. WebHarvy can also extract product images (thumbnail as well as high resolution images) from eBay product listings. The following video shows the steps involved.
The JavaScript code used in the above video to open seller description as a separate page is copied below.
In case you are interested in exploring more, we highly recommend that you download and try using our free evaluation version. To get started, please follow the link given below.