Activation problem? Please update to the latest version (6.2.0.185)

Versions of WebHarvy up to and including 6.2.0.184 will display an ‘Activation failed due to unknown reason’ error message when registered users try to unlock using the license key file. This issue has been fixed in the latest version of WebHarvy, 6.2.0.185, which is available for download at https://www.webharvy.com/download.html.

Please contact us in case you have any questions.

WebHarvy 6.2 (Enhanced Proxy Support, Chromium v86, New Browser Setting options)

The following are the changes in this version.

Enhanced proxy support

In this version we have added support for additional proxy types. Earlier, WebHarvy supported only HTTP proxies. Starting with this version, the following proxy types are supported.

  • HTTP
  • HTTPS
  • SOCKS4
  • SOCKS4a
  • SOCKS5

In the proxy settings window you can select the type of proxies used as shown below.

New Browser Setting Options

The following two new options have been added in Browser settings.

  • Disable opening popups
  • Use separate browser engine for mining links

Normally, WebHarvy opens popups or new browser tabs within the same browser view. Though this is the preferred behavior for most websites, in some cases you might want to ignore popup or new tab pages and stay on the parent page itself. In such cases, the ‘Disable opening popups’ option in Browser settings should be enabled.

When the ‘Use separate browser engine for mining links’ option is enabled, WebHarvy uses a separate browser engine to mine links followed from the starting/listings page. Though this consumes more memory, on some websites it allows longer mining sessions.

Latest Chromium

We have also updated WebHarvy’s internal browser to the more recent Chromium v86. Chromium is the open source project on which Google Chrome is based.

As always, this release also includes minor bug fixes. You may upgrade to this latest version by downloading the latest installer from our website.

Have any questions? Let us know!

Sequentially Scrape Websites : Automation

Often you need to scrape data from multiple websites and might also want to automate the entire process. The following would be your desired workflow.

  1. Configure WebHarvy to scrape data from each website.
  2. Then start scraping data from each website, one after the other, without any manual intervention. In short, a one-click method to start scraping data from multiple websites and also to save the data automatically once mining is completed.

Command line arguments

WebHarvy supports command line arguments, so you can run WebHarvy from a terminal or script while providing details like the configuration file path, the number of pages to mine, the location where mined data is to be saved, etc. For more details please follow the link below.

WebHarvy Command Line Arguments Explained

Windows batch file

Using WebHarvy’s command line argument support, you can write a Windows batch file which runs each configuration, one after the other. You may refer to the following link to learn how to write a Windows batch file. In its simplest form, you can just open Notepad, write the commands to run, one per line, and save the file with a .bat extension.

https://www.windowscentral.com/how-create-and-run-batch-file-windows-10

Now, you can just run this .bat file, or schedule it using Windows Task Scheduler to meet your requirements.

Example

The following is an example of a Windows batch file (saved with .bat extension).

scrape-yp.bat

"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-doctors.xml" -1 "c:\mydata\yp-doctors.csv" overwrite
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-accountants.xml" -1 "c:\mydata\yp-accountants.xlsx" update
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-lawyers.xml" -1 "c:\mydata\yp-lawyers.xml" update

You can see that the above batch file runs 3 different configurations (yp-doctors, yp-accountants and yp-lawyers) one after the other. Also note that complete paths are used for the WebHarvy executable, the configuration files and the output files.
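If you prefer to run the batch file on a schedule instead of manually, Windows Task Scheduler can be used. As an illustration (the task name, schedule and batch file location below are just examples), a daily task which runs the above batch file can be created from a command prompt as follows.

schtasks /Create /TN "WebHarvyScrape" /TR "c:\myconfigs\scrape-yp.bat" /SC DAILY /ST 02:00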

If you have any questions please do not hesitate to contact us.

How to Scrape Google Maps Location Coordinates?

This article explains how the Keyword Scraping feature of WebHarvy can be used to scrape geo location coordinates (latitude and longitude) of a list of addresses from Google Maps.

Given below is a sample list of addresses for which we will scrape geo location coordinates from Google Maps using WebHarvy, as shown in the above video. Note that these addresses do not include special characters like commas, hyphens or semicolons. In case you wish to have commas or other special characters within the address text, each address should be enclosed within quotes (e.g. “6657 PEDEN RD, FT WORTH, TX”).

6657 PEDEN RD FT WORTH TX
17425 DALLAS PKWY DALLAS TX
12121 COIT RD DALLAS TX
9100 WATERFORD CENTRE BLVD AUSTIN TX
13223 CHAMPIONS CENTRE DR HOUSTON TX
1221 N WATSON RD ARLINGTON TX
5313 CARNABY ST IRVING TX

To scrape Google Maps location coordinates of these addresses, load the following URL within WebHarvy’s configuration browser.

https://www.google.com/maps/place/6657 PEDEN RD FT WORTH TX

Note that the first address (6657 PEDEN RD FT WORTH TX) in the list of addresses is used as-is in the above URL. Once this URL is loaded in WebHarvy’s browser view, Start Configuration. Then, edit the Start URL of the configuration and paste the same URL which we loaded (https://www.google.com/maps/place/6657 PEDEN RD FT WORTH TX).

Now we can add keywords to the configuration. Keywords in this case are the list of addresses. It is important to note that the first keyword in the list should be the same as the one used in the Start URL. Since we are selecting only a single row of data from each page, we can disable pattern detection.

The latitude/longitude values are selected from the entire page HTML using regular expressions. To get the entire page HTML, click anywhere on the page and then double click the Capture HTML toolbar button in the Capture window which is displayed.

The regular expression strings used to get latitude and longitude values are given below.

www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40([^%]*)

www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40[^%]*%2C([^%]*)
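As an illustration, the short JavaScript sketch below applies these patterns to a made-up, URL-encoded fragment of page HTML (%2F, %40 and %2C are the URL-encoded forms of '/', '@' and ','; the coordinate values shown are placeholders).

// Hypothetical URL-encoded fragment containing the coordinates after '@' (%40).
var html = "https%3A%2F%2Fwww.google.com%2Fmaps%2Fplace%2F6657%2BPEDEN%2BRD%2F%4032.90%2C-97.43%2C17z";
var lat = html.match(/www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40([^%]*)/)[1];         // "32.90"
var lng = html.match(/www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40[^%]*%2C([^%]*)/)[1]; // "-97.43"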

If you are new to WebHarvy we recommend that you download and try using the free evaluation version available on our website. To get started please follow the link below.

Getting started with web scraping using WebHarvy

In case you have any questions please feel free to contact our technical support team.

How to scrape Google Jobs? | Scraping job details

WebHarvy can be used to scrape job details from job listing websites like Indeed, Google Jobs, etc. WebHarvy can automatically pull job details from multiple pages of listings and save them to a file or database.

The following video shows how WebHarvy can be configured to scrape data from Google Jobs listings. Details like job title, position, application URL, company name, description etc. can be easily extracted.

More jobs are loaded onto the same page when you scroll down the left-hand pane of the Google Jobs listings page. To automate this, the JavaScript method of pagination has to be used. The JavaScript code to be used for this is given below.

// Scroll the last item of the last job listings group (<ul>) into view,
// which makes Google Jobs load more results on the same page.
els = document.getElementsByTagName("ul");
el = els[els.length-1];
el.children[el.childElementCount-1].scrollIntoView();

Before selecting any data, the following JavaScript code needs to be run on the page to collate job listings grouped under various sections into a single group.

// Move the listings from every <ul> group on the page into the first <ul>,
// so that all job listings form a single group.
groups = document.getElementsByTagName("ul");
parent = groups[0];
for (var i = groups.length - 1; i >= 1; i--) {
    var children = groups[i].children;
    for (var j = children.length - 1; j >= 0; j--) {
        parent.appendChild(children[j]);
    }
}

Video : How to scrape Google Jobs using WebHarvy?

Interested? We highly recommend that you download and try using the free evaluation version of WebHarvy available on our website, by following the link given below.

Start web scraping using WebHarvy

In case you have any questions feel free to contact our technical support team.

How to scrape business contact details from Google Maps?

WebHarvy is a visual web scraper which can be easily configured to scrape data from any website. In this article we will see how WebHarvy can easily extract business contact details from Google Maps.

WebHarvy can scrape contact details (name, address, website, phone etc.) as well as reviews of businesses displayed on Google Maps. The following video shows the configuration steps which you need to follow to scrape contact details of businesses listed in Google Maps.

The regular expression strings used in the above video to scrape phone number and website address are given below.

Phone: ([^"]*)

Website: ([^"]*)

Try WebHarvy

To know more we highly recommend that you download and try using the free evaluation version of WebHarvy. To get started please follow the link below.

Getting started with web scraping using WebHarvy

How to build a simple web scraper using Puppeteer?

Table of Contents

  1. What is Puppeteer?
  2. Uses of Puppeteer
  3. How to install?
  4. How to start a browser instance?
  5. How to load a URL?
  6. How to navigate/interact with the page?
  7. How to take screenshots, save page as PDF?
  8. How to select data from page?
  9. Headless browser as a service

What is Puppeteer?

Puppeteer (https://developers.google.com/web/tools/puppeteer) is a Node library which provides an API for developers to control headless Chrome.

Uses of Puppeteer

Puppeteer can be used by developers for browser automation. Developers can create a headless Chrome browser instance with which web pages can be loaded and interacted with, and screenshots or PDFs of loaded pages can be taken. Some of the main uses of Puppeteer are web scraping, browser automation and automated testing.

How to install Puppeteer?

Since Puppeteer is a Node library (requires Node.js installation), it can be installed by running the following command.

$ npm install --save puppeteer

How to start a browser instance?

The following code will start a headless (without user interface, invisible) browser instance.

const puppeteer = require("puppeteer");
// Note: the await calls below must run inside an async function.
var browser = await puppeteer.launch();
var page = await browser.newPage();
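For debugging, you can optionally launch a visible (non-headless) browser instead, for example:

var browser = await puppeteer.launch({headless: false});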

How to load a URL?

To load a URL in the above created browser instance, use the following code.

await page.goto("https://www.webharvy.com");
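If the page loads content dynamically, you can optionally wait until network activity settles before continuing, for example:

await page.goto("https://www.webharvy.com", {waitUntil: "networkidle2"});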

How to select items (elements) from the page?

To select an item/element from the page loaded in Puppeteer, you will first need to find its CSS selector. You can use Chrome Developer Tools to find the CSS selector of any element on the page. For this, after loading the page within Chrome, right click on the required element and select Inspect.

 

In the Developer Tools window which is displayed, the HTML element corresponding to the element you clicked on the page will be selected. Right click on this element, open the Copy submenu in the resulting menu and select the Copy selector option. You now have the CSS selector of the element in the clipboard.

Example:

#description > yt-formatted-string > span:nth-child(1)

How to interact with page elements?

This selector string can be used within Puppeteer to select and interact with elements. For example, to click the above element, assuming it is a link, the following code can be used.

var selector = "#description > yt-formatted-string > span:nth-child(1)";
await page.click(selector);

In addition to click, Puppeteer provides several other page interaction features like keyboard input, typing in input fields, etc. Refer: https://pptr.dev/#?product=Puppeteer&version=v2.0.0&show=api-class-page
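For example, typing text into an input field (the selector below is hypothetical) can be done using page.type.

// Type a query into a (hypothetical) search box.
await page.type("#searchBox", "web scraping");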

The following code shows how you can select and click a button using Puppeteer once the page is loaded.

var buttonSelector = "#DownloadButton";
await page.evaluate(sel => {
    // Find the button in the page and click it.
    var button = document.querySelector(sel);
    button.click();
}, buttonSelector);

How to get text of page elements?

As shown in the above code samples, we run JavaScript code within Puppeteer using the page.evaluate function for page interaction. The same function can be used to get the text of elements from the page.

var reviewSelector = "review > span.cm-title";
var reviewText = await page.evaluate(sel => {
    var reviewText = document.querySelector(sel).innerText;
    return reviewText;
}, reviewSelector);

As shown above, JavaScript code is executed on the page using the page.evaluate method to get text. You may also use the document.querySelectorAll DOM method to get data from multiple page elements, as in the sketch below.
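As a minimal sketch (the selector below is hypothetical and not from the original article), the following code collects the text of every element matching a selector in a single call.

var titleSelector = "div.listing > h2.title";
var titles = await page.evaluate(sel => {
    // Gather the text of all elements matching the selector.
    var nodes = document.querySelectorAll(sel);
    return Array.from(nodes).map(node => node.innerText);
}, titleSelector);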

How to take screenshots of page and save page as PDF?

You can take a screenshot of the currently loaded page by using the following code.

await page.screenshot({path: "./screenshots/page1.png"});

Or save the page as a PDF using the following code.

await page.pdf({path: "./screenshots/page1.pdf"});
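Putting the pieces together, the following is a minimal, self-contained sketch of a Puppeteer script (the URL, selector and file name are placeholders) which loads a page, reads the text of an element and saves a screenshot.

const puppeteer = require("puppeteer");

(async () => {
    // Launch a headless browser and open a new tab.
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Load the target page (placeholder URL).
    await page.goto("https://www.webharvy.com");

    // Read the text of a (placeholder) element.
    const heading = await page.evaluate(() => {
        const el = document.querySelector("h1");
        return el ? el.innerText : null;
    });
    console.log(heading);

    // Save a screenshot and close the browser.
    await page.screenshot({path: "page1.png"});
    await browser.close();
})();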

Headless browser as a service

Running Puppeteer is a resource intensive process. If you need to run several headless browser instances, the memory and processor requirements will be high, and scaling them won’t be easy. To handle this, services like https://www.browserless.io/, which offer headless browsers as a service, can be used.

Announcing an upcoming product : GrabContacts

We are happy to announce our upcoming product launch, which we have been working on for the past year. GrabContacts is an online service which helps you easily extract contact details (email addresses, phone numbers, social media handles) from websites (URLs) or search queries.

Unlike WebHarvy, there is no configuration involved: you just need to specify the website address or a list of website addresses and GrabContacts will do the rest. We have also added a unique feature which lets you specify a search query (example: Accountants in New York, Doctors in Chicago) and GrabContacts will automatically scan and fetch email addresses, phone numbers and social media handles.

In case you are interested, please sign up for an early stage preview of GrabContacts at the following link.

Signup to our waiting list to get early access

AliExpress Scraper – Scraping product data including images from AliExpress

WebHarvy is a visual web scraper which can be easily used to scrape data from any website including eCommerce websites like Amazon, eBay, AliExpress etc.

Scraping AliExpress

The following video shows how WebHarvy can be configured to scrape data from AliExpress product listings. Details of the products like product name, price, minimum orders, shipping details, seller details, product description, images, etc. can be scraped as shown in the video.

To scrape multiple images the following Regular Expression string is used.

src="([^_]*)

The ‘Advanced Miner Options’ values (in WebHarvy Miner Settings) for scraping AliExpress (as shown in the video) are given below.

Watch More WebHarvy Demonstration Videos related to AliExpress Scraping

Try WebHarvy

We recommend that you download and try using the free trial version of WebHarvy available on our website. To get started, please follow the link below.

Getting started with web scraping using WebHarvy

In case you have any questions or need assistance, please contact our technical support team.

How to use User Agent strings to prevent blocking while web scraping?

What is a user agent string?

The User-Agent string of a web browser helps servers (websites) identify the browser (Chrome, Edge, Firefox, IE, etc.), its version and the operating system (Windows, Mac, Android, iOS, etc.) on which it is running. This mainly helps websites serve different pages for various platforms and browser types.

If you go to https://www.whatismybrowser.com/detect/what-is-my-user-agent you can see the user agent string of your browser.

User Agent strings for web scraping

The same detail can be used by websites to block non-standard web browsers and bots. To prevent this, we can configure web scrapers to mimic a standard browser’s user agent.

WebHarvy, our generic visual web scraper, allows you to set any user agent string for its mining browser, so that websites assume the web scraper is a normal browser and will not block access. To configure this, open WebHarvy Settings and go to the Browser settings tab. Here, enable the custom user agent string option and paste the user agent string of a standard browser like Chrome or Edge.

This option can be used to make WebHarvy’s browser appear like any specific standard web browser (e.g. Microsoft Edge, Mozilla Firefox, Google Chrome or Apple Safari) to websites from which you are trying to extract data.
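For example, a typical user agent string for Chrome 86 running on Windows 10 looks like the following (the exact build number will vary).

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36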

How to get user agent strings of various browsers?

You may find user agent strings of various browsers at http://useragentstring.com/pages/useragentstring.php