Scraping images : various methods : WebHarvy

WebHarvy lets you scrape images from websites as well as text. During configuration, you can click directly on an image to capture it. The Capture window that appears has a ‘Capture Image’ button, which lets you either download the image file or capture its URL.
An image can also be downloaded from a URL obtained by applying a Regular Expression to its HTML content. This method is shown in the following demonstration video.
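To illustrate the idea outside WebHarvy, here is a minimal Python sketch of pulling an image URL out of an item's HTML with a Regular Expression. The page URL, the HTML snippet and the function name image_url are made up for this example; it is not how WebHarvy works internally.

```python
import re
from urllib.parse import urljoin

# Hypothetical page URL and HTML snippet, used only for illustration.
PAGE_URL = "https://example.com/products/list.html"
HTML = '<div class="item"><img class="thumb" src="/images/item-42.jpg" alt="Item 42"></div>'

# Capture the value of the src attribute of the first <img> tag.
IMG_SRC = re.compile(r'<img[^>]*\ssrc=["\']([^"\']+)["\']', re.IGNORECASE)


def image_url(html, page_url):
    """Return the absolute URL of the first image in html, or None if there is none."""
    match = IMG_SRC.search(html)
    if match is None:
        return None
    # urljoin resolves a relative src against the page URL
    # and leaves an already absolute URL unchanged.
    return urljoin(page_url, match.group(1))


print(image_url(HTML, PAGE_URL))
```

The same function handles both relative and absolute image URLs, because the join against the page URL only has an effect when the captured value is relative.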

Watch more demonstration videos
Download the free trial version

Scraping data from HTML by applying Regular Expressions

WebHarvy can scrape data from the HTML source of a selected area of a web page, or of the whole page, by applying Regular Expressions.
During configuration, after you click on an item, the ‘Capture HTML’ option under ‘More Options’ in the Capture window captures the HTML of the item and displays it in the preview area. Regular Expressions can then be applied (More Options > Apply Regular Expression) to select data from a portion of the displayed HTML.
The following video shows how this feature can be applied to scrape URLs from HTML.
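As a rough illustration of the kind of expression involved, the Python sketch below extracts link URLs from a block of HTML. The HTML is invented for the example and simply stands in for what the preview area might show.

```python
import re

# Hypothetical HTML, standing in for the captured HTML of a selected item.
CAPTURED_HTML = """
<ul>
  <li><a href="https://example.com/item/1">First item</a></li>
  <li><a href="https://example.com/item/2">Second item</a></li>
</ul>
"""

# One capture group: the text between the quotes of each href attribute.
HREF = re.compile(r'href="([^"]+)"')

# With a single group in the pattern, findall returns the captured text of each match.
urls = HREF.findall(CAPTURED_HTML)
for url in urls:
    print(url)
```

The part of the expression inside the parentheses is what gets selected; the surrounding text only pins down where in the HTML to look.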

Download & try the 15-day evaluation version

How to scrape tweets? – Twitter data scraping using WebHarvy

WebHarvy can be used to easily scrape tweets from twitter.com. The following demonstration video shows the steps involved.

As shown, using WebHarvy to scrape tweets is very easy. WebHarvy is a point-and-click visual web scraper: you select the data to be extracted with mouse clicks.
If you need to scrape tweets after logging in with your Twitter account, please follow the steps described at http://www.webharvy.com/articles/sites-requiring-login.html.
To know more, please watch the demonstration videos at http://www.webharvy.com/demo.html
A free 15-day evaluation version of WebHarvy can be downloaded from http://www.webharvy.com/download.html

Scraping Facebook graph search results

The following video shows how WebHarvy can be used to extract data from Facebook graph search results. The extracted data can be saved as a file or to a database.
Video: https://www.youtube.com/watch?v=As5pIsh73Cw
When using WebHarvy to extract data from secure websites (those that require login with a user name and password), please follow the steps described at http://www.webharvy.com/articles/sites-requiring-login.html
Text, images, URLs, email addresses, etc. can be extracted from web pages using WebHarvy with just mouse clicks. Watch these demonstration videos to know more.
Trial version download

How to scrape data from eBay?

WebHarvy can be used to scrape data from ebay.com. The following video shows the process.
Video: https://vimeo.com/68767010
In the above video, the Keyword Scraping feature of WebHarvy is used to scrape product search results for multiple input keywords. In addition, the Category Scraping feature can scrape products listed under various categories at ebay.com.
WebHarvy can extract data (text and images) from websites with just mouse clicks. Being a generic, visual web scraper, WebHarvy can be configured to extract data from any website. Configuration is straightforward: see this demonstration video, which shows how to set up WebHarvy for data extraction.
If you are looking for an easy-to-use web scraping solution to extract data from eBay (and other websites such as Amazon, Yellow Pages, etc.), we recommend that you try the evaluation version of WebHarvy from http://www.webharvy.com/download.html.
Please contact our support team in case you need any assistance.

WebHarvy version 3.3 released!

Version 3.3 of WebHarvy was released on June 16, 2014. The major changes are:

  1. Fixed issues related to URL encoding in Category Scraping
  2. Added an option to disable automatic pattern (data field repetition) detection in the start page
  3. Option to follow links (URLs) obtained by applying a Regular Expression to HTML; handles both absolute and relative URLs
  4. Option to capture images whose URL is obtained by applying a Regular Expression to HTML; handles both absolute and relative URLs, and works even when the image URL does not contain an image file extension
  5. Separate options to download an image and to capture the image URL
  6. Fixed an issue due to which downloaded image files did not have the correct file extension
  7. Added Multiline mode in RegEx processing
  8. Faster mining ‘restart’ from where it previously stopped (aborted); remembers the last mined URL and its PostData
  9. Context menu options (copy/cut/paste) added to the ‘Additional URLs in Configuration’ window
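
For readers unfamiliar with Multiline mode (item 7), the short Python sketch below shows the usual regular-expression convention: with the mode on, ^ and $ match at the start and end of every line rather than only at the start and end of the whole text. It assumes WebHarvy's Multiline mode follows this convention; the sample text is made up.

```python
import re

# Hypothetical captured text spanning three lines.
TEXT = "Name: Widget\nPrice: $19.99\nStock: 12"

# Label at the start of a line, followed by a colon.
LABEL = r"^(\w+):"

# Default mode: ^ matches only at the very start of the text.
single_line = re.findall(LABEL, TEXT)

# Multiline mode: ^ matches at the start of every line.
multi_line = re.findall(LABEL, TEXT, re.MULTILINE)

print(single_line)
print(multi_line)
```

In the default mode only the first label is found; in Multiline mode the label on each of the three lines is found.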

Download the latest version of WebHarvy

WebHarvy version 3.2 released!

We have made several improvements and feature additions to our popular web scraping software WebHarvy. Most of the new features added in this release were recommended by WebHarvy’s existing customers. We would like to thank everyone who helped us test and improve this release while in beta.
The changes are:

  • Supports scraping data from web pages where more data is loaded when the page is scrolled to the end
  • Supports scraping data from web pages where more data is loaded when a ‘load more data’ or ‘show more content’ type button/link is clicked
  • Supports editing URLs associated with a configuration
  • Supports editing keywords associated with a configuration
  • Supports downloading images whose URL is obtained by applying a Regular Expression (RegEx) to the HTML source of selected content
  • Ability to select category links one by one during configuration
  • Refined ‘Capture following text’ option
  • Multiple groups in a single RegEx string are captured
  • Handles different layouts used by Amazon for displaying product details like ASIN
  • Advanced Miner Options
  • Automatically checks for new updates
  • Authentication support for private proxies while scraping data from HTTPS websites
  • Minor bug fixes and several improvements
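
As an illustration of what multiple groups in a single RegEx string means, the Python sketch below uses one expression with two capture groups to pick up a product name and its price in one pass. The HTML and the pattern are made up for the example, and how WebHarvy presents the captured groups is not shown here.

```python
import re

# Hypothetical HTML for one product, used only for illustration.
HTML = '<span class="name">Widget</span> <span class="price">$19.99</span>'

# Two capture groups in one expression: the product name and the numeric price.
PATTERN = re.compile(r'class="name">([^<]+)</span>.*?class="price">\$([\d.]+)<')

match = PATTERN.search(HTML)
if match is not None:
    # groups() returns the text captured by each group, in order.
    name, price = match.groups()
    print(name, price)
```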

The latest version may be downloaded from https://www.webharvy.com/download.html

Use 'Capture Following Text' option to scrape data from details pages

When extracting data from details pages (pages reached by following a link from the start page), it is recommended that the ‘Capture Following Text’ option be used whenever possible to scrape data correctly and consistently.
This is because the layout and the amount of data displayed in details pages may not be consistent. For example, if you are scraping Amazon product listings, the data displayed on the product details page (the page reached by clicking the product link in the search results) may vary slightly from product to product. Here, if you are trying to extract the Shipping Weight under Product Details, instead of clicking on the data (example: ‘1.2 pounds’), click on the heading ‘Shipping Weight’ and apply the ‘Capture following text’ option under the ‘More Options’ button.
Watch the demo:

 
In summary, if the data to be extracted comes under a heading, always click the heading and apply the ‘Capture following text’ option. This ensures that the data is scraped from all similar pages without missing any, even if the page contents vary slightly.
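The reasoning can be sketched in a few lines of Python: looking a value up by its heading still works when the position of that heading changes from page to page. This is only an illustration of the idea, not WebHarvy's implementation; the two page fragments, the ASIN values and the helper following_text are invented for the example.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the non-empty text chunks of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def following_text(html, heading):
    """Return the text chunk that comes right after the given heading, or None."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    # Skip the last chunk: nothing can follow it.
    for i, chunk in enumerate(parser.chunks[:-1]):
        if chunk.rstrip(":") == heading:
            return parser.chunks[i + 1]
    return None


# Two hypothetical details pages listing the same field at different positions.
PAGE_A = "<ul><li><b>Shipping Weight:</b> 1.2 pounds</li><li><b>ASIN:</b> B000000000</li></ul>"
PAGE_B = ("<ul><li><b>ASIN:</b> B000000001</li><li><b>Publisher:</b> Example</li>"
          "<li><b>Shipping Weight:</b> 3 pounds</li></ul>")

print(following_text(PAGE_A, "Shipping Weight"))
print(following_text(PAGE_B, "Shipping Weight"))
```

Selecting by position (first item in the list) would return the weight on the first page and the ASIN on the second; selecting by heading returns the weight on both.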
 

Scrape HTML

WebHarvy allows you to scrape the HTML of page contents in addition to plain text. In the Capture window, click the ‘More Options’ button and select the ‘Capture HTML’ option to scrape the HTML of the selected content.
To capture only a portion of the displayed HTML, select and highlight the required portion before clicking the Capture button.
Regular Expressions are usually applied to the HTML source of the content to extract the data of interest, such as an image URL, or hidden fields such as a phone number.
The following video shows how the ‘Capture HTML’ option is used along with Regular Expressions to correctly extract the product price.
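As a hedged illustration of why working on the HTML helps here, the Python sketch below picks the current price out of markup that also contains an old price. The HTML, the class names and the function extract_price are assumptions made for the example, not taken from any particular website or from WebHarvy.

```python
import re
from decimal import Decimal

# Hypothetical HTML containing both an old price and the current price.
HTML = ('<div class="price-box"><span class="old">$24.99</span>'
        '<span class="price" data-currency="USD">$19.99</span></div>')

# Match only the element whose class is exactly "price", then capture the amount.
PRICE = re.compile(r'class="price"[^>]*>\s*\$([\d,]+\.\d{2})')


def extract_price(html):
    """Return the current price as a Decimal, or None if it is not found."""
    match = PRICE.search(html)
    if match is None:
        return None
    # Remove thousands separators before converting to a number.
    return Decimal(match.group(1).replace(",", ""))


print(extract_price(HTML))
```

Anchoring the expression on the surrounding markup is what separates the current price from the old one, even though both look the same as plain text.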

Try out the free evaluation copy of WebHarvy from https://www.webharvy.com/download.html.

Scraping hidden (click to display) fields using WebHarvy

Certain web pages require you to click on a link or button for the data to be displayed. On many websites, email addresses or phone numbers are only partially displayed and are shown in full only when you click on them.
The ‘Click’ option under the ‘More Options’ button in the Capture window lets you scrape data in such scenarios. (See https://www.webharvy.com/tour1.html#ScrapeHidden).
The following video shows how this option can be used to scrape hidden fields.

Here the phone numbers are partially displayed. Using the Click option, they can be made fully visible and then scraped.
To know more about the features of WebHarvy, see the product feature tour at https://www.webharvy.com/tour.html.