How to Scrape Amazon Product Data with No Code?

In this article, we will see how you can easily scrape product details like name, price, ratings/reviews, images, description, ASIN, model number, best seller rank, etc. from Amazon product listings.

As with many other eCommerce websites, there is no direct way to download product details from Amazon. You either have to manually copy and paste data into a spreadsheet, or use web scraping software like WebHarvy to automate the process. Of course, you can also code your own small web scraping program to do the job.

The video shown below demonstrates how easy it is to use WebHarvy to scrape data from Amazon. Data selection is done using mouse clicks. The configuration process via the visual interface is very simple. You can start collecting data from thousands of product listings within minutes of installing the software.

As shown in the above video, the data scraping workflow has a configuration phase and a mining phase. In the configuration phase, we teach WebHarvy which data items we need to extract and how to navigate the pages of the website.

During configuration, you can click on any item to Capture it. (More Details)

To scrape product details from multiple pages of product listings, click on the link/button to load the next page and set it as the next page link. (More Details)

To follow the product link to load the product details page, click on the product title link and select ‘Follow this link’ option from the resulting Capture window. (More Details)

Since the location of the data which you need to extract from the product details page can vary from one product to another, it is recommended to use the Capture Following Text method instead of directly clicking on the data.

Once you finish the configuration phase, the configuration can be saved as a file. Click the Start Mine button and WebHarvy will start fetching data. The scraped data can be saved to a file or database.

Try WebHarvy

If you are new to WebHarvy, we highly recommend that you download and try the free evaluation version available on our website. To get started, please follow the link below.

https://www.webharvy.com/articles/getting-started.html

Have questions? Please let us know

Activation problem? Please update to the latest version (6.2.0.185)

Versions of WebHarvy prior to (and including) 6.2.0.184 display an ‘Activation failed due to unknown reason’ error message when registered users try to unlock the software using their license key file. This issue has been fixed in the latest version, WebHarvy 6.2.0.185, which is currently available for download at https://www.webharvy.com/download.html.

Please contact us in case you have any questions.

WebHarvy 6.2 (Enhanced Proxy Support, Chromium v86, New Browser Setting options)

The following are the changes in this version.

Enhanced proxy support

In this version we have added support for various types of proxies. Earlier, WebHarvy supported only HTTP proxies. Starting from this version the following proxy types are supported.

  • HTTP
  • HTTPS
  • SOCKS4
  • SOCKS4a
  • SOCKS5

In the proxy settings window you can select the type of proxies used as shown below.

New Browser Setting Options

The following two new options have been added in Browser settings.

  • Disable opening popups
  • Use separate browser engine for mining links

Normally, WebHarvy opens popups or new browser tabs within the same browser view. Though this is the preferred behavior for most websites, in some cases you might want to ignore the popup or new tab pages and stay on the parent page itself. In such cases, the Disable opening popups option in Browser settings should be enabled.

When the ‘Use separate browser engine for mining links’ option is enabled, WebHarvy uses a separate browser engine to mine links which are followed from the starting/listings page. Though this consumes more memory, on some websites it results in longer mining sessions.

Latest Chromium

We have also updated WebHarvy’s internal browser to the more recent Chromium v86. Chromium is the open source project on which Google Chrome is based.

As always, this release also includes minor bug fixes. You may upgrade to this latest version by downloading the latest installer from our website.

Have any questions? Let us know!

Sequentially Scrape Websites: Automation

Often you need to scrape data from multiple websites and may also want to automate the entire process. The following would be your desired workflow.

  1. Configure WebHarvy to scrape data from each website.
  2. Then start scraping data from each website, one after the other, without any manual intervention. In short, a one-click method to start scraping data from multiple websites and also to save the data automatically once mining is completed.

Command line arguments

WebHarvy supports command line arguments, so that you can run WebHarvy from a terminal or script, providing details like the configuration file path, the number of pages to mine, the location where mined data is to be saved, etc. For more details, please follow the link below.

WebHarvy Command Line Arguments Explained

Windows batch file

Using the command line argument support of WebHarvy, you can write a Windows batch file which runs each configuration, one after the other. You may refer to the following link to learn how to write a Windows batch file. In its simplest form, you can just open Notepad, write the commands to run, one per line, and save the file with a .bat extension.

https://www.windowscentral.com/how-create-and-run-batch-file-windows-10

Now, you can just run this .bat file or schedule it using Windows Task Scheduler to meet your requirement.

Example

The following is an example of a Windows batch file (saved with .bat extension).

scrape-yp.bat

"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-doctors.xml" -1 "c:\mydata\yp-doctors.csv" overwrite
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-accountants.xml" -1 "c:\mydata\yp-accountants.xlsx" update
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-lawyers.xml" -1 "c:\mydata\yp-lawyers.xml" update

You can see that the above batch file runs 3 different configurations (yp-doctors, yp-accountants and yp-lawyers) one after the other. Also note that the complete path is used for the WebHarvy executable, the configuration files and the output files.

If you have any questions please do not hesitate to contact us.

How to scrape location coordinates (latitude, longitude) from Google Maps?

This article explains how the Keyword Scraping feature of WebHarvy can be used to scrape coordinates (latitude and longitude) of a list of addresses from Google Maps.

Suppose the following is the list of addresses which we have. Note that these address strings do not contain commas; it is recommended to keep them this way. In case you wish to have commas within the address strings, each address line should be enclosed within quotes.

6657 PEDEN RD FT WORTH TX
17425 DALLAS PKWY DALLAS TX
12121 COIT RD DALLAS TX
9100 WATERFORD CENTRE BLVD AUSTIN TX
13223 CHAMPIONS CENTRE DR HOUSTON TX
1221 N WATSON RD ARLINGTON TX
5313 CARNABY ST IRVING TX

To scrape the coordinates of these addresses automatically, first load the following URL within WebHarvy’s configuration browser.

https://www.google.com/maps/place/6657 PEDEN RD FT WORTH TX

Note that the first address (6657 PEDEN RD FT WORTH TX) is used as-is in the above URL. Once this URL is loaded in the browser, Start Configuration. Now you will have to edit the Start URL of the configuration: the Start URL recorded in the configuration will be different, since the page will have redirected to the location result page in Google. Set it back to the same Start URL which we initially loaded.

Now we can add keywords to the configuration. Keywords, in this case, are the list of addresses. It is important to note that the first keyword in the list which we add should be the same as the one used in the Start URL. We have already ensured this. Since we are selecting only a single row of data from each page, we can also disable pattern detection.

The latitude/longitude values are selected from the entire page HTML using regular expressions. To get the entire page HTML, double-click the Capture HTML toolbar button.

The regular expression strings used to get latitude and longitude values are given below.

www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40([^%]*)

www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40[^%]*%2C([^%]*)
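
To illustrate how these expressions work, the following is a minimal Node.js sketch which applies them to a hypothetical URL-encoded fragment of the kind found in the result page HTML (the coordinate values below are made up).

// Hypothetical URL-encoded fragment taken from the result page HTML
var html = "https%3A%2F%2Fwww.google.com%2Fmaps%2Fplace%2F" +
           "6657+PEDEN+RD+FT+WORTH+TX%2F%4032.9107284%2C-97.4304968%2C17z";

// Each capture group stops at the next '%', isolating one coordinate value
var lat = html.match(/www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40([^%]*)/)[1];
var lng = html.match(/www\.google\.com%2Fmaps%2F[\s\S]*?%2F%40[^%]*%2C([^%]*)/)[1];

console.log(lat, lng); // 32.9107284 -97.4304968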

If you are new to WebHarvy, we recommend that you download and try the free evaluation version available on our website. To get started, please follow the link below.

Getting started with web scraping using WebHarvy

In case you have any questions please feel free to contact our technical support team.

How to scrape Google Jobs? | Scraping job details

WebHarvy can be used to scrape job details from job listing websites like Indeed, Google Jobs, etc. WebHarvy can automatically pull job details from multiple pages of listings and save them to a file or database.

The following video shows how WebHarvy can be configured to scrape data from Google Jobs listings. Details like job title, position, application URL, company name, description etc. can be easily extracted.

More jobs are loaded onto the same page when you scroll down the left-hand pane of the Google Jobs listings page. To automate this, the JavaScript method of pagination has to be used. The JavaScript code to be used for this is given below.

// Scroll the last job entry into view to trigger loading of more listings
els = document.getElementsByTagName("ul");
el = els[els.length - 1];
el.children[el.childElementCount - 1].scrollIntoView();

Before selecting any data, the following JavaScript code needs to be run on the page to collate job listings grouped under various sections into a single group.

// Move the job entries from every subsequent <ul> group into the first one
var groups = document.getElementsByTagName("ul");
var parent = groups[0];
for (var i = groups.length - 1; i >= 1; i--) {
    var children = groups[i].children;
    for (var j = children.length - 1; j >= 0; j--) {
        parent.appendChild(children[j]);
    }
}

Video: How to scrape Google Jobs using WebHarvy?

Interested? We highly recommend that you download and try the free evaluation version of WebHarvy available on our website, by following the link given below.

Start web scraping using WebHarvy

In case you have any questions feel free to contact our technical support team.

How to scrape business contact details from Google Maps?

WebHarvy is a visual web scraper which can be easily configured to scrape data from any website. In this article we will see how WebHarvy can easily extract business contact details from Google Maps.

WebHarvy can scrape contact details (name, address, website, phone etc.) as well as reviews of businesses displayed on Google Maps. The following video shows the configuration steps which you need to follow to scrape contact details of businesses listed in Google Maps.

The regular expression strings used in the above video to scrape phone number and website address are given below.

Phone: ([^"]*)

Website: ([^"]*)
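
As a minimal illustration of what these expressions capture, assume the page HTML contains quoted strings like the hypothetical fragment below; each expression captures the text following its label, up to the next double quote.

// Hypothetical fragment of page HTML containing the labeled values
var html = '"Phone: (555) 123-4567" ... "Website: example.com"';

console.log(html.match(/Phone: ([^"]*)/)[1]);   // (555) 123-4567
console.log(html.match(/Website: ([^"]*)/)[1]); // example.com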

Try WebHarvy

To know more, we highly recommend that you download and try the free evaluation version of WebHarvy. To get started, please follow the link below.

Getting started with web scraping using WebHarvy

How to build a simple web scraper using Puppeteer?

Table of Contents

  1. What is Puppeteer?
  2. Uses of Puppeteer
  3. How to install Puppeteer?
  4. How to start a browser instance?
  5. How to load a URL?
  6. How to select items (elements) from the page?
  7. How to interact with page elements?
  8. How to get text of page elements?
  9. How to take screenshots of page and save page as PDF?
  10. Headless browser as a service

What is Puppeteer?

Puppeteer (https://developers.google.com/web/tools/puppeteer) is a Node library for developers which provides a high-level API to control headless Chrome.

Uses of Puppeteer

Puppeteer can be used by developers for browser automation. Developers can create a headless Chrome browser instance, use it to load and interact with web pages, and take screenshots or PDFs of loaded pages. The main uses of Puppeteer are web scraping, browser automation and automated testing.

How to install Puppeteer?

Since Puppeteer is a Node library (requires Node.js installation), it can be installed by running the following command.

$ npm install --save puppeteer

How to start a browser instance?

The following code will start a headless (without user interface, invisible) browser instance.

// Note: the await statements below (and in later snippets) must run inside an async function
const puppeteer = require("puppeteer");
var browser = await puppeteer.launch();
var page = await browser.newPage();

How to load a URL?

To load a URL in the above created browser instance, use the following code.

await page.goto("https://www.webharvy.com");
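
If the page loads content dynamically, you may want page.goto to resolve only after network activity has settled. Puppeteer's goto accepts a waitUntil option for this; a brief sketch is given below.

// Resolve once there are no more than 2 network connections for 500 ms
await page.goto("https://www.webharvy.com", { waitUntil: "networkidle2" });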

How to select items (elements) from the page?

To select an item/element from the page loaded in Puppeteer, you will first need to find its CSS selector. You can use Chrome Developer Tools to find the CSS selector of any element on the page. For this, after loading the page within Chrome, right-click on the required element and select Inspect.


In the Developer Tools window displayed, the HTML element corresponding to the element which you clicked on the page will be selected. Right-click on this element, and from the resulting menu open the Copy submenu and select the Copy selector option. You now have the CSS selector of the element in your clipboard.

Example:

#description > yt-formatted-string > span:nth-child(1)

How to interact with page elements?

This selector string can be used within Puppeteer to select and interact with elements. For example, to click the above element, assuming it is a link, the following code can be used.

var selector = "#description > yt-formatted-string > span:nth-child(1)";
await page.click(selector);

In addition to click, Puppeteer provides several other page interaction functions such as keyboard input and typing in input fields. Refer: https://pptr.dev/#?product=Puppeteer&version=v2.0.0&show=api-class-page
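
For instance, typing into a search field and submitting it can be done as follows; the "#searchBox" selector is a hypothetical example.

// Type a query into an input field (hypothetical selector) and press Enter
await page.type("#searchBox", "web scraping");
await page.keyboard.press("Enter");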

The following code shows how you can select and click a button using Puppeteer once the page is loaded.

var buttonSelector = "#DownloadButton";
await page.evaluate(sel => {
    // Find the button within the page context and click it
    var button = document.querySelector(sel);
    button.click();
}, buttonSelector);

How to get text of page elements?

As shown in the above code samples, we run JavaScript code within Puppeteer using the page.evaluate function for page interaction. The same function can be used to get the text of elements from the page.

var reviewSelector = "review > span.cm-title";
var reviewText = await page.evaluate(sel => {
    // Read the text content of the matching element in the page context
    var reviewText = document.querySelector(sel).innerText;
    return reviewText;
}, reviewSelector);

As shown above, JavaScript code is executed on the page using the page.evaluate method to get text. You may also use the document.querySelectorAll DOM method to get data from multiple page elements, as sketched below.
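
The following is a minimal sketch of this approach; the selector is a hypothetical example.

// Collect the text of every element matching the (hypothetical) selector
var reviewTitles = await page.evaluate(sel => {
    var nodes = document.querySelectorAll(sel);
    return Array.from(nodes).map(node => node.innerText);
}, "div.review > span.cm-title");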

How to take screenshots of page and save page as PDF?

You can take a screenshot of the currently loaded page by using the following code.

await page.screenshot({path: "./screenshots/page1.png"});

Or save the page as a PDF using the following code.

await page.pdf({path: "./screenshots/page1.pdf"});

Headless browser as a service

Running Puppeteer is resource intensive. If you need to run several headless browser instances, the memory and processor requirements will be high, and scaling them won't be easy. To address this, you can use services like https://www.browserless.io/, which offer headless browsers as a service.
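
With such a service, instead of launching a browser locally, Puppeteer connects to a remote instance over WebSocket using puppeteer.connect. The sketch below assumes a hypothetical endpoint and token obtained from the service.

// Connect to a remote headless browser instead of launching one locally;
// the endpoint and token below are hypothetical placeholders
const browser = await puppeteer.connect({
    browserWSEndpoint: "wss://chrome.browserless.io?token=YOUR_API_TOKEN"
});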

Announcing an upcoming product : GrabContacts

We are happy to announce our upcoming product launch, which we have been working on for the past year. GrabContacts is an online service which helps you easily extract contact details (email addresses, phone numbers, social media handles) from websites (URLs) or search queries.

Unlike WebHarvy, there is no configuration involved; you just need to specify a website address or a list of website addresses and GrabContacts will do the rest. We have also added a unique feature which lets you specify a search query (for example: Accountants in New York, Doctors in Chicago) and GrabContacts will automatically scan and fetch email addresses, phone numbers and social media handles.

In case you are interested, please sign up for an early stage preview of GrabContacts at the following link.

Signup to our waiting list to get early access

AliExpress Scraper – Scraping product data including images from AliExpress

WebHarvy is a visual web scraper which can be easily used to scrape data from any website including eCommerce websites like Amazon, eBay, AliExpress etc.

Scraping AliExpress

The following video shows how WebHarvy can be configured to scrape data from AliExpress product listings. Details of the products like product name, price, minimum orders, shipping details, seller details, product description, images, etc. can be scraped as shown in the video.

To scrape multiple images, the following Regular Expression string is used.

src="([^_]*)
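
As a quick illustration: the expression captures everything in the src attribute up to the first underscore, which drops the thumbnail size suffix that AliExpress appends to image URLs. The sample src value below is hypothetical.

// Capture the image URL up to the first underscore, stripping the
// "_640x640.jpg" style thumbnail suffix; the src value is a made-up sample
var tag = 'src="https://ae01.alicdn.com/kf/HTB1abc.jpg_640x640.jpg"';
console.log(tag.match(/src="([^_]*)/)[1]);
// -> https://ae01.alicdn.com/kf/HTB1abc.jpg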

The ‘Advanced Miner Options‘ values (in WebHarvy Miner Settings) for scraping AliExpress (as shown in the video) are as given below.

Watch More WebHarvy Demonstration Videos related to AliExpress Scraping

Try WebHarvy

We recommend that you download and try the free trial version of WebHarvy available on our website. To get started, please follow the link below.

Getting started with web scraping using WebHarvy

In case you have any questions or need assistance, please contact our technical support team.