WebHarvy is designed as a ‘point and click’ visual Web Scraper. The design concentrates on ease of use, so that you can start scraping data within a few minutes of downloading the software.
But if you need more control over what is extracted, you can use Regular Expressions (RegEx) with WebHarvy. WebHarvy allows you to extract data by matching RegEx strings against the text content as well as the HTML source of the web page.
If you are new to Regular Expressions, see http://en.wikipedia.org/wiki/Regular_expression.
The following video shows how WebHarvy can be used to scrape the image URL from a web page by applying a Regular Expression.
The ‘Capture More Content’ feature comes in handy here (as shown in the video) to make sure that the selected text contains the data (text or HTML code) of interest, before the RegEx string is applied.
Regular Expressions can also be applied directly on the text content of the page as shown in the following video.
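To illustrate what a RegEx capture does in this context, here is a minimal Python sketch. The HTML fragment and pattern are hypothetical examples, not WebHarvy internals: the idea is simply that a pattern with a capturing group, matched against the selected HTML, extracts the data of interest (here, an image URL).

```python
import re

# Hypothetical HTML fragment of the kind selected via 'Capture More Content'
html = '<div class="product"><img src="https://example.com/images/item42.jpg" alt="Item"></div>'

# A RegEx with a capturing group: the text matched by the group
# (here, the value of the src attribute) is what gets extracted
match = re.search(r'<img[^>]*src="([^"]+)"', html)
image_url = match.group(1) if match else None
print(image_url)  # https://example.com/images/item42.jpg
```

The same principle applies whether the pattern is matched against plain text content or raw HTML source.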
We are happy to announce the release of WebHarvy 3.0. We have added a lot of new features in this major update. The feature/change list for this update is the longest among all product updates we have done to date. Here we go.
Added the following options in the Capture Window (grouped under ‘More Options’)
Capture following text: Improved by using a brute-force search across all elements in the page
Capture HTML: Option to scrape HTML of selected element
Capture Text as File: Option to scrape text and save it as a local file (useful while scraping articles and blog posts)
Click: Ability to scrape hidden (partially displayed) fields in webpages which require a click from the user to be displayed in full. For example phone numbers or email addresses which are displayed completely only if you click them.
Apply Regular Expression: Option to apply Regular Expressions (RegEx) on captured text. RegEx can be applied even after applying ‘Capture following text’, ‘Capture HTML’ & ‘Capture More Content’ options.
Capture More Content: Option to capture more text than the selected text, captures parent element’s text. For example this would capture the entire article if you apply this option after having selected the first paragraph.
Option to individually select categories/links (one by one) for Category Scraping (Mine menu – Scrape a list of similar links)
Export captured data as JSON
Ability to mine data from tables (row-column / grid layout)
Ability to mine pages which have fewer (less than 10) data items
Option to test proxies before using them (Edit menu – Settings – Proxy Settings)
Non-responsive proxies are skipped during mining; mining will not stop because of a bad/non-responsive proxy in the list.
Option to manually add URLs to an existing configuration (Edit menu – Add URLs to configuration)
Option to remove duplicates while mining (Edit menu – Settings – Miner)
Added ‘Hourly’ frequency option in Scheduler (Mine menu – Scheduler)
Added option to export data directly to database for scheduled mining tasks & command line
Added ‘Clear’ option in Edit menu which will clear both the browser and data preview pane
Language encoding now defaults to ‘utf-8’ for file exports (XML, CSV etc.)
CSV/Database export: handles delimiters (commas, quotes etc.) in captured data
Keyword/Category scraping allowed for 2 entries in evaluation version
Rendering issues with the in-built browser fixed – now defaults to IE 9 rendering
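The CSV export improvement above (handling delimiters in captured data) can be illustrated with a short sketch. The sample values are made up; the point is that fields containing commas or quotes must be quoted (and embedded quotes doubled) so the exported file stays well-formed, which is what Python's standard `csv` writer does:

```python
import csv
import io

# Hypothetical captured values containing the delimiter and quote characters
rows = [["Acme Widget, Large", 'Says "best in class"', "19.99"]]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)  # quote only fields that need it
writer.writerows(rows)
print(buf.getvalue().strip())
# "Acme Widget, Large","Says ""best in class""",19.99
```

Without this kind of quoting, a comma inside a captured value would be misread as a column separator by any program opening the file.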
WebHarvy supports command line arguments so that you can run the software directly from the command line. This allows you to run WebHarvy from script or batch files, or to invoke it via code from your own applications.
To know more, read : Running WebHarvy Web Scraper from Command Line
In the latest update of WebHarvy, the Visual Web Scraping Software, the newly introduced ‘capture following text’ option allows you to capture text/block/paragraph following a heading within a webpage.
Often with many websites the data to be scraped may not be located at the same position within all pages, but is guaranteed to be found under a given heading (Example: “Technical Details”, “Product Specification” etc.). Sometimes, the text under a given heading may not be selectable as a single item while configuring. In such scenarios the ‘Capture following text’ option in the capture window will prove helpful.
How to?
While in configuration mode, click on the heading and select the ‘Capture following text’ option in the capture window. Provide a suitable name for the field and hit OK. In the preview pane you will see the text following the heading captured.
Refer http://www.webharvy.com/tour1.html#ScrapeFollowingText for more details.
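Conceptually, the option works like anchoring an extraction to a heading rather than to a position. The following Python sketch (with hypothetical markup) shows the equivalent idea: locate a known heading, then capture whatever block immediately follows it, regardless of where the heading occurs in the page.

```python
import re

# Hypothetical page source: the data always follows a known heading,
# though its position within the page varies from page to page
html = """
<h3>Product Specification</h3>
<p>Weight: 1.2 kg, Colour: black</p>
"""

# Capture the block that immediately follows the heading
match = re.search(r'<h3>Product Specification</h3>\s*<p>(.*?)</p>', html, re.S)
spec = match.group(1) if match else None
print(spec)  # Weight: 1.2 kg, Colour: black
```

In WebHarvy itself no pattern needs to be written; clicking the heading and choosing ‘Capture following text’ achieves the same effect visually.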
The latest version (V126.96.36.199) of WebHarvy Visual Web Scraper is available for download. The changes in this update are :
New option: ‘Capture following text’ added in capture form.
Web Miner has been improved to handle HTML errors in target websites.
Allows exporting scraped data while mining is paused.
For CSV, TSV exports, column names are added as the first row.
Option to input keywords in CSV format.
Option to manually set page load timeout value in application settings.
The ‘Capture following text’ feature helps to scrape text following a given heading within the page. This feature is useful when the data to be scraped does not occur at a fixed position within the page, but is guaranteed to follow a heading text (for example, ‘Product Details:’ or ‘Specification’).
The option to manually set the page load timeout value from settings window helps to scrape data from websites with slow response times or from those which employ AJAX.
We recommend that you download and try the 15-day free evaluation version.
WebHarvy Web Scraper allows you to scrape data from remote websites anonymously with the help of proxy servers. This prevents remote web servers from blocking or blacklisting your computer’s IP address.
WebHarvy provides you the option to specify either a single proxy server address or a list of proxy server addresses through which the remote website will be scraped. If you provide a list of proxy server addresses, WebHarvy will cycle through the list periodically. Please follow this link to know more about this feature. Download WebHarvy Web Scraper FREE Trial!
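The cycling behaviour described above can be sketched in a few lines of Python. The proxy addresses are placeholders, and this is only a conceptual model of rotating through a list, not WebHarvy's actual implementation:

```python
from itertools import cycle

# Hypothetical proxy list; requests are routed through these in turn
proxies = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]
rotation = cycle(proxies)

# Each batch of requests goes out through the next proxy in the list,
# wrapping back to the first address when the end is reached
used = [next(rotation) for _ in range(5)]
print(used)
```

Rotating like this spreads requests across several IP addresses, which is what makes blocking by the remote server less likely.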
In most cases the data to be scraped is the result of performing a search from the main page of the website. Often you need to extract data from the search results for a list of input keywords.
The ‘Keyword Scraping’ feature of WebHarvy allows you to perform this task with ease. You can specify a list of input keywords and WebHarvy will automatically scrape data from the search results corresponding to each keyword in the specified list. Please follow this link to know more about ‘Keyword based Scraping’. Video Demonstration : Keyword based Scraping
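In effect, keyword scraping amounts to substituting each keyword into the site's search request and harvesting each results page in turn. The sketch below models that idea in Python; the search URL template and keywords are hypothetical, and WebHarvy handles the substitution visually rather than through code:

```python
from urllib.parse import urlencode

# Hypothetical search-results URL template and keyword list
BASE = "https://example.com/search"
keywords = ["laptop", "wireless mouse", "usb hub"]

# Build one search-results URL per keyword (spaces are URL-encoded)
urls = [f"{BASE}?{urlencode({'q': kw})}" for kw in keywords]
for u in urls:
    print(u)
```

Each generated URL corresponds to one results page that the miner would then scrape using the same configuration.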
We recommend that you download and try the evaluation version of our Web Scraper to know more about the features.