How to extract data from Yellow Pages listings?

Yellow Pages Data Extraction

Yellow Pages (YP) websites are the go-to place for business-related data extraction. WebHarvy can be used to easily extract business details such as name, phone number, email, website and geo-coordinates (latitude and longitude) from various YP websites. Demonstration videos explaining the extraction process for various YP websites are available in the following YouTube playlist.

 WebHarvy – Various Demonstration videos related to Yellow Pages Extraction
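Geo-coordinates are often not visible as plain text on a listing page but sit inside the page HTML, for example in a map link. As a rough illustration of the idea (outside WebHarvy, and not its internal method), the Python sketch below pulls a latitude/longitude pair out of a piece of HTML with a regular expression. The sample markup, the `maps.example.com` URL and the `q=` parameter are assumptions made for this example; real YP sites differ and need to be inspected individually.

```python
import re

# Hypothetical listing markup; real YP sites use their own structure.
SAMPLE_HTML = (
    '<a class="map-link" '
    'href="https://maps.example.com/?q=-33.8688,151.2093">View map</a>'
)

# Signed decimal latitude, a comma, then signed decimal longitude.
COORD_PATTERN = re.compile(r"q=(-?\d{1,2}\.\d+),(-?\d{1,3}\.\d+)")


def extract_coordinates(html):
    """Return (latitude, longitude) as floats, or None if no pair is found."""
    match = COORD_PATTERN.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(extract_coordinates(SAMPLE_HTML))
```

The same pattern string (the part inside `r"..."`) is the kind of expression that would be adapted to the actual markup of the site being scraped.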

Steps to follow

The following video in our Web Scraping Workshop series shows the steps involved in configuring WebHarvy to extract business details from the yellowpages.com.au website.

Interested?

If you would like to know more about WebHarvy, our easy-to-use point-and-click visual web data scraping software, please follow the link below.

http://www.webharvy.com/articles/getting-started.html

Amazon scraping explained

We get a lot of queries regarding Amazon data extraction, so we created the following video to show the correct steps for configuring WebHarvy to extract product data from Amazon’s listings. Product details like name, price, specifications, images, description, ASIN, weight, shipping details, ratings and reviews can be extracted.

As shown in the above video, the ‘Capture following text’ option is used for most of the details which appear after a heading text. Regular expressions are used to correctly extract multiple images and the product specification list.

Also, we select all data from the product details page for more accuracy. Only the details page URL is selected from the listings page during configuration.
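To give a feel for what regular expressions do in this kind of extraction, here is a minimal Python sketch that collects every image URL and every label/value pair from a product specification table. The HTML sample, the `img.example.com` URLs and both patterns are made up for illustration; they are not the expressions used in the video, and Amazon's real markup is considerably more complex.

```python
import re

# Simplified, hypothetical product page fragment.
SAMPLE_HTML = """
<div id="gallery">
  <img src="https://img.example.com/p/1.jpg">
  <img src="https://img.example.com/p/2.jpg">
</div>
<table>
  <tr><th>Item Weight</th><td>1.2 pounds</td></tr>
  <tr><th>ASIN</th><td>B000000000</td></tr>
</table>
"""

# Captures the src attribute of every <img> tag.
IMAGE_PATTERN = re.compile(r'<img[^>]+src="([^"]+)"')

# Captures a heading cell and the value cell that follows it.
SPEC_PATTERN = re.compile(r"<th>([^<]+)</th>\s*<td>([^<]+)</td>")


def extract_images(html):
    """Return all image URLs found in the HTML, in document order."""
    return IMAGE_PATTERN.findall(html)


def extract_specs(html):
    """Return the specification table as a {label: value} dictionary."""
    return {
        label.strip(): value.strip()
        for label, value in SPEC_PATTERN.findall(html)
    }


print(extract_images(SAMPLE_HTML))
print(extract_specs(SAMPLE_HTML))
```

The second pattern mirrors the idea behind capturing the text that follows a heading: the label is matched first, and the value right after it is what gets captured.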

How to easily extract data from websites?

If you have a data extraction requirement, you can either outsource it to a freelancer or consulting company, or try to do it yourself. The main advantage of using a tool to perform the extraction yourself is cost. Plus, with the knowledge gained while creating your first extraction project, you can capture data from a variety of websites, make changes easily and have more control.

What if there is a tool with zero learning curve to extract data from websites?

If you are searching for the right data extraction tool to invest in, there are a variety of options: complex ones, ones which run from the cloud and serve enterprise customers, and ones which provide APIs for developers to integrate the data into their applications/code. And there is WebHarvy, which has been designed mainly for ease of use. WebHarvy’s mission is to minimize the number of steps you need to perform before data extraction starts, and to make these steps very intuitive. See for yourself in the following video, where we demonstrate how simple it is to extract data from multiple pages of a website using WebHarvy.

Interested? Learn more about WebHarvy’s features at https://www.webharvy.com/tour.html. We also recommend that you download and try the free evaluation version of the software, available at https://www.webharvy.com/download.html.

Scraping Zillow to extract property details | Real Estate Data Extraction

WebHarvy can be used to easily extract property details from real estate websites like Zillow, Trulia, Realtor etc. In this article, we discuss how WebHarvy can be used to extract property details from Zillow.com listings.

WebHarvy is very easy to configure and use to extract data from most websites. The point-and-click interface of the software can be used by following a very simple method, as shown in the demonstration videos at https://www.webharvy.com/demo.html.

Configuration method for scraping Zillow property listings

However, some websites like Zillow require a special configuration technique which is slightly more complex than the normal method applicable to most websites. This is mainly due to the way in which the Zillow website is designed and implemented.

The following video shows in detail the configuration method to be followed to scrape data from Zillow using WebHarvy.

You can find the regular expression strings and JavaScript code used in the video description.

Update (June 2021): Due to recent changes in the Zillow website, a new technique has to be used to scrape all 40 properties which are displayed on each page. Please watch this video to know more.

WebHarvy’s new user interface

We have significantly updated the user interface of WebHarvy in the latest version available on our website, and the following video explains how the features and options are laid out in the new UI. Existing users of older versions will find this video useful for learning where to look for specific features and options.

https://www.youtube.com/watch?v=R0K3awgRAvQ

WebHarvy 5.2 | UI revamp + Oracle db support

Changes in 5.2 are mainly related to user interface and experience. The most visible change is the introduction of the ribbon menu system for providing easy access to most software features.

In addition to the main interface, other windows such as Scheduler and Export have also been updated. The export functionality (to file or database) is now cancelable: users can cancel an ongoing export.

As with every release, the Chrome browser has been updated as well. Issues related to the URL update (in the address bar) while navigating links on some websites have been fixed with this update.

An important non-UI feature addition in this release is support for exporting data to an Oracle database. The default file export option has been changed from CSV to Excel format.

All main settings are now displayed in snippet format in the browser view’s status bar.

Help (videos, articles) related to the website loaded in the configuration browser is automatically loaded and displayed as a smart tip.

Miner Settings can now be opened and changed directly from the Miner window.

JavaScript can now be typed in multi-line code format.

Browser settings now include a new option to share the user’s location with the loaded page.

In addition to the above, this release also contains minor bug fixes and improvements, as always. You may download and try the latest version from https://www.webharvy.com/download.html

WebHarvy 5.1 released (Includes direct Excel Export)

The following are the changes in 5.1.0.152:
New Features:

  1. Excel export – supports directly saving mined data as an Excel file (details)
  2. Handles page numbers in JavaScript code to load next page data (details)
  3. Updated Chromium engine from V54 to V62
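The idea behind page numbers in JavaScript-based pagination is that the same snippet is run for each page, with only the page number changing. The Python sketch below illustrates that substitution. The `__PAGE__` placeholder token and the `loadResults` site function are hypothetical names chosen for this example; they are not WebHarvy's actual placeholder syntax or any real website's API.

```python
# Hypothetical JavaScript snippet; __PAGE__ marks where the page number goes.
JS_TEMPLATE = "loadResults(__PAGE__);"


def build_page_scripts(template, first_page, last_page):
    """Return one JavaScript snippet per page, with the page number filled in.

    Plain string replacement is used instead of str.format so that braces
    in the JavaScript code do not need escaping.
    """
    return [
        template.replace("__PAGE__", str(page))
        for page in range(first_page, last_page + 1)
    ]


for script in build_page_scripts(JS_TEMPLATE, 2, 4):
    print(script)
```

Each generated snippet corresponds to the code that would be run to load one subsequent page of results.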

Minor changes:

  1. Default values of ‘Enable Plugins’ and ‘Enable Browser Security’ in Browser Settings set to false (details)
  2. Browser address bar can be used for Google search

Bug fixes:

  1. Fixed issues related to handling headers and post data for HTTP requests
  2. Fixed issue in selecting data using mouse when Zoom-level of browser is not equal to 1 (zoomed in or zoomed out)
  3. Text formatting issues (line-breaks, spaces) in Capture window fixed
  4. Fixed issue where order of applying capture-html and capture-more-content was relevant (for applying regex to follow links or to capture images)
  5. Fixed a bug in editing keywords: in the previous version, the first keyword could not be changed.
  6. Minimizes memory usage in mining thread by limiting the number of browser instances created

As always, the latest version may be downloaded and installed from the following page:
https://www.webharvy.com/download.html

WebHarvy based on Google Chrome Released (version 5.0.1.148)

This release comes with few bells and whistles, since we have not added features or changed the look of the software. Still, it is a major upgrade: the change is entirely internal.
WebHarvy has been using Microsoft’s Internet Explorer (IE) as its internal browser since inception. Microsoft stopped active development of IE a few years back when they introduced the Edge browser.
So WebHarvy had to switch to another solution to power its internal browser, and we believe using Google’s Chrome Browser Project is the way forward. This makes WebHarvy more stable, faster and more secure. Switching to Chrome also opens up the possibility of porting the software to other platforms like Mac and Linux.
You may download and install the latest version, which is based on the Chrome browser, from the following link.
http://www.webharvy.com/webharvysetup.exe
As mentioned before, the change from IE to Chrome is internal to the software and does not affect the user interface. The configuration process and user interface of WebHarvy remain the same.

Minor Changes

  1. For scraping data from sites which require login, the steps have been simplified. You no longer need to log in to the website separately from IE. See https://www.webharvy.com/articles/sites-requiring-login.html
  2. The ‘Internet Options’ menu option under the Edit menu has been removed. Instead, a new Browser options tab has been added to the Settings window.

Running configuration files created with the older IE-based version on the new Chrome-based version

Configuration files created using the old version should normally work fine with the new version which is based on Chrome, but there will be exceptions. In such cases we recommend that you create a new configuration using the latest version.
As always, in case you have any questions or need assistance, you may contact our support at https://www.webharvy.com/support.html

WebHarvy 4.1.5.141 released

The main changes in this release are:

  1. Pagination via JavaScript – see https://www.webharvy.com/tour3.html#JS
    This powerful feature is the main highlight of this release. When all other methods of pagination fail, this method can be used: you directly provide JavaScript code which, when run, loads the next page.
  2. Increased size of virtual browser used by miner
    The dimensions of the miner’s virtual browser have been increased. This solves issues related to websites whose layout changes when the browser has a smaller window dimension (mobile layout). This also helps the miner to load more items in a single page and scroll, in case of websites which display data based on the size of the browser window.
  3. Support for ‘Load more content’ & ‘Scroll to load next page’ type pagination even when the real listing page is reached by clicking links/buttons from the start page.
    In earlier versions, if the listing page loaded more data in the same page via a button/link click or scroll, and initial navigation (click, JavaScript etc.) was required in the configuration itself to load the listing page from another start page, then pagination would fail. This release removes this limitation.
  4. More support for extracting data from popups.
    Popups now handle clicks and JavaScript. This can be used to close the popup window, in cases where closing the currently opened popup is required to open the next one.
  5. SQL data export encoding issue related to foreign languages fixed. 
    Encoding issues while exporting text in non-English languages like Chinese fixed.
  6. Other minor bug fixes

As always you may download and install the latest version from https://www.webharvy.com/download.html.