Scrape with Regular Expressions using WebHarvy

WebHarvy is designed as a ‘point and click’ visual Web Scraper. The design concentrates on ease of use, so that you can start scraping data within a few minutes of downloading the software.
But in case you need more control over what is extracted, you can use Regular Expressions (RegEx) with WebHarvy. WebHarvy allows you to extract data by matching RegEx strings on the text content as well as on the HTML source of the web page.
If you are new to Regular Expressions, see http://en.wikipedia.org/wiki/Regular_expression.
The following video shows how WebHarvy can be used to scrape the image URL from a web page by applying a Regular Expression.

The ‘Capture More Content’ feature comes in handy here (as shown in the video) to make sure that the selected text contains the data (text or HTML code) of interest, before the RegEx string is applied.
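To get a feel for the idea outside WebHarvy, here is a minimal Python sketch of pulling an image URL out of a block of HTML with a Regular Expression. The HTML fragment, the pattern and the function name are made up for illustration, and the sketch assumes the part you want is marked with a capturing group (the parentheses).

```python
import re

# Hypothetical HTML fragment, standing in for the captured HTML of a selected element.
html = '<div class="product"><img src="https://example.com/images/item-42.jpg" alt="Item 42"></div>'

# The parentheses form a capturing group around the value of the src attribute.
IMG_SRC = re.compile(r'<img[^>]*\bsrc="([^"]+)"')

def first_image_url(fragment):
    """Return the src of the first <img> tag in the fragment, or None if there is none."""
    match = IMG_SRC.search(fragment)
    return match.group(1) if match else None

print(first_image_url(html))
```

If the pattern finds nothing, the captured HTML probably does not contain the image tag yet, which is the situation ‘Capture More Content’ is meant to address.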
Regular Expressions can also be applied directly on the text content of the page as shown in the following video.

To explore further download the latest version of WebHarvy from https://www.webharvy.com/download.html.

WebHarvy 3.1 (Minor Update)

The 3.1 update of WebHarvy, which was released yesterday (July 24), has the following changes.

  • Added option to Tag captured data rows with corresponding Keyword/Category. (Applicable only for Keyword/Category based Scraping). See the new Miner Settings Window (Edit menu – Settings)
  • Option to separately set Page Load Timeout and AJAX Load Wait Time in Miner Settings.
  • Option to edit the start URL / Post Data / Headers for the configuration directly from the UI, without editing the XML configuration file. (under Edit menu – Edit Options)
  • Updates related to Category Scraping, Capture Text following a Heading, Mining multiple pages
  • Bug Fixes

Download and install the latest update from https://www.webharvy.com/download.html.

WebHarvy Version 3.0 Released !

We are happy to announce the release of WebHarvy 3.0. We have added a lot of new features in this major update. The list of features and changes for this update is the longest of all the product updates we have done to date. Here we go.

  • Added the following options in the Capture Window (grouped under ‘More Options’)
    • Capture following text: Improved by using brute force search for all elements in the page
    • Capture HTML: Option to scrape HTML of selected element
    • Capture Text as File: Option to scrape text and save it as a local file (useful while scraping articles and blog posts)
    • Click: Ability to scrape hidden (partially displayed) fields in webpages which require a click from the user to be displayed in full. For example phone numbers or email addresses which are displayed completely only if you click them.
    • Apply Regular Expression: Option to apply Regular Expressions (RegEx) on captured text. RegEx can be applied even after applying ‘Capture following text’, ‘Capture HTML’ & ‘Capture More Content’ options.
    • Capture More Content: Option to capture more text than the selected text, captures parent element’s text. For example this would capture the entire article if you apply this option after having selected the first paragraph.
  • Option to individually select categories/links (one by one) for Category Scraping (Mine menu – Scrape a list of similar links)
  • Export captured data as JSON
  • Ability to mine data from tables (row-column / grid layout)
  • Ability to mine pages which have fewer (less than 10) data items
  • Option to test proxies before using them (Edit menu – Settings – Proxy Settings)
  • Non responsive proxies are skipped during mining. Mining would not stop because of a bad/non-responsive proxy in the list.
  • Option to manually add URLs to an existing configuration (Edit menu – Add URLs to configuration)
  • Option to remove duplicates while mining (Edit menu – Settings – Miner)
  • Added ‘Hourly’ frequency option in Scheduler (Mine menu – Scheduler)
  • Added option to export data directly to database for scheduled mining tasks & command line
  • Added ‘Clear’ option in Edit menu which will clear both the browser and data preview pane
  • Language encoding defaulted to ‘utf-8’ for file exports (XML, CSV etc)
  • CSV/Database export : handles delimiters (comma, quotes etc) in captured data
  • Keyword/Category scraping allowed for 2 entries in evaluation version
  • Rendering issues with in-built browser fixed – defaults to IE 9 rendering
  • New Installer built with InstallShield
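
The CSV export item in the list above is about captured values which themselves contain commas or quotes. How WebHarvy does this internally is not described here; the following Python sketch only shows the usual CSV convention (quote such fields, double any embedded quotes) using the standard csv module. The function name and the sample data are hypothetical.

```python
import csv
import io

def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Sample captured data containing a comma and a double quote.
header = ["name", "size"]
rows = [["Desk, oak", '12" deep']]

text = rows_to_csv(header, rows)
print(text)

# Reading the text back gives the original values, so no delimiter was misread.
print(list(csv.reader(io.StringIO(text))))
```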

Download the latest installation of WebHarvy Web Scraper from https://www.webharvy.com/download.html.

Web Scraping from Command Line

WebHarvy supports command line arguments so that you can run the software directly from the command line. This allows you to run WebHarvy from script or batch files, or to invoke it via code from your own applications.
To know more, read: Running WebHarvy Web Scraper from Command Line

Schedule scraping tasks

WebHarvy comes with an in-built scheduler using which you may schedule your scraping tasks. The scheduler window can be opened from the Mine menu.

WebHarvy Scheduler

The scheduler enables you to run scraping tasks periodically – daily, weekly or monthly.
Know More about WebHarvy Scheduler
Download and try the free 15-day evaluation version of WebHarvy Web Data Extraction Software.

WebHarvy v2.0 Released !

The new features in the 2.0 update are :

  • Built-in scheduler for running scraping tasks – (know more)
  • Command Line Options – (know more)
  • MySQL Support for exporting scraped data – (know more)
  • Option to scrape sub text of selected text – (know more)
  • Updated proxy settings – (know more)
    • Supports proxies which require authentication
    • Supports importing proxies from CSV/Text files
  • Option to resume mining from where it stopped/aborted
  • Option to auto-save captured data on regular intervals – (know more)
  • Option to automatically inject pauses while mining (prevents IP blocking) – (know more)
  • Major improvements in mining
  • Minor changes
    • Number of pages & records mined are always displayed in Miner window’s status strip
    • Fixed bug related to capturing images where image text is empty
    • Updated capturing email addresses
    • Record numbers displayed inside captured data grid view in Miner window
    • Option to cancel preview generation for large index page data

You may download the latest version of WebHarvy Web Scraper from http://www.webharvy.com/download.html.
 

How to scrape text following a heading using WebHarvy ?

In the latest update of WebHarvy, the Visual Web Scraping Software, the newly introduced ‘capture following text’ option allows you to capture text/block/paragraph following a heading within a webpage.
Often the data to be scraped is not located at the same position within all pages, but is guaranteed to be found under a given heading (Example: “Technical Details”, “Product Specification” etc). Sometimes, the text under a given heading cannot be selected as a single item during configuration. In such scenarios the ‘Capture following text’ option in the capture window will prove helpful.

How to ?

While in configuration mode, click on the heading and select the ‘Capture following text’ option in the capture window. Provide a suitable name for the field and hit OK. In the preview pane you will be able to see the text following the heading captured.
Refer to http://www.webharvy.com/tour1.html#ScrapeFollowingText for more details.
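
For readers curious about what ‘text following a heading’ means in terms of the page source, here is a rough Python sketch. It is not how WebHarvy implements the feature; it simply assumes the heading is an h1 to h6 tag and that the wanted text is inside the element right after it. The function name and the sample page are hypothetical.

```python
import re
from html import unescape

def text_following_heading(html, heading):
    """Return the text of the element right after the given heading, or None."""
    pattern = re.compile(
        r'<h[1-6][^>]*>\s*' + re.escape(heading) + r'\s*</h[1-6]>'  # the heading itself
        r'\s*<(\w+)[^>]*>(.*?)</\1>',                               # the element after it
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    inner = re.sub(r'<[^>]+>', '', match.group(2))  # drop nested tags such as <b>
    return ' '.join(unescape(inner).split())        # decode entities, tidy whitespace

# Sample page where the details block is not the first element.
page = (
    '<h1>Widget</h1><p>Intro</p>'
    '<h2>Technical Details</h2>\n<p>Weight: <b>2 kg</b>, Colour: red</p>'
)

print(text_following_heading(page, "Technical Details"))
```

The lookup is tied to the heading text and not to a position in the page, which is the point of the option. A regular expression like this will not cope with an element nested inside another element of the same tag, so treat it as an illustration only.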

WebHarvy Web Scraper V1.5.0.26 released

The latest version (V1.5.0.26) of WebHarvy Visual Web Scraper is available for download. The changes in this update are :

  • New option: ‘Capture following text’ added in capture form.
  • Web Miner has been improved to handle even HTML errors of target websites.
  • Allows exporting scraped data while mining is paused.
  • For CSV, TSV exports, column names are added as the first row.
  • Option to input keywords in CSV format.
  • Option to manually set page load timeout value in application settings.

The ‘Capture following text’ feature helps to scrape text following a given heading within the page. This feature is useful when the data to be scraped does not occur at a fixed position within the page, but is guaranteed to follow a heading text (Example: ‘Product Details:’ or ‘Specification’).
The option to manually set the page load timeout value from settings window helps to scrape data from websites with slow response times or from those which employ AJAX.
We recommend that you download and try the 15-day free evaluation version.

How to scrape data anonymously ?

WebHarvy Web Scraper allows you to scrape data from remote websites anonymously with the help of proxy servers. This prevents remote web servers from blocking or blacklisting your computer’s IP address.
WebHarvy lets you specify either a single proxy server address or a list of proxy server addresses through which the remote website will be scraped. If you provide a list of proxy server addresses, WebHarvy will cycle through the list in a periodic manner.
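As an illustration of cycling through a proxy list, here is a small Python sketch. It assumes the simplest possible policy, one proxy per page in round-robin order; the exact schedule WebHarvy follows is not specified here. The function name, URLs and proxy addresses are hypothetical.

```python
from itertools import cycle

def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy, wrapping around at the end of the list."""
    if not proxies:
        raise ValueError("at least one proxy address is required")
    rotation = cycle(proxies)  # endless iterator over the proxy list
    return [(url, next(rotation)) for url in urls]

# Three pages spread over two (made up) proxy addresses.
pages = ["https://example.com/p1", "https://example.com/p2", "https://example.com/p3"]
proxy_list = ["10.0.0.1:8080", "10.0.0.2:8080"]

for url, proxy in assign_proxies(pages, proxy_list):
    print(url, "via", proxy)
```

With a single proxy in the list, every page simply goes through that one address.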
Please follow this link to know more about this feature.
Download WebHarvy Web Scraper FREE Trial !

How to scrape search results data for a list of input keywords ?

In most cases the data to be scraped is the result of performing a search operation from the main page of the website. Often you need to extract data from the search results for a list of input keywords.
The ‘Keyword Scraping’ feature of WebHarvy allows you to perform this task with ease. You can specify a list of input keywords and WebHarvy will automatically scrape data from the search results corresponding to each keyword in the specified list.
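The idea of running one search per keyword can be sketched in a few lines of Python. This is only an illustration and assumes a site whose search results are reachable through a query string; the base URL, the parameter name `q` and the function name are hypothetical, and WebHarvy itself works from the configuration you create in its browser.

```python
from urllib.parse import urlencode

def search_urls(base_url, keywords, param="q"):
    """Build one search URL per keyword, URL-encoding each keyword."""
    return [f"{base_url}?{urlencode({param: keyword})}" for keyword in keywords]

# Hypothetical search page and keyword list.
for url in search_urls("https://example.com/search", ["laptop", "web scraper"]):
    print(url)
```

Each generated URL corresponds to one entry in the keyword list, and the same extraction rules are then applied to every results page.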
Please follow this link to know more about ‘Keyword based Scraping’.
Video Demonstration : Keyword based Scraping
We recommend that you download and try the evaluation version of our Web Scraper to know more about the features.