WebHarvy 5.3 (Parallel mining, Chrome developer tools)

‘How to increase mining speed?’ was one of the questions our users asked most often. In previous versions, the main limitation was that when links had to be followed from the starting page to get the details of each listing, the miner took longer to scrape a page full of listings. This is because WebHarvy used to load the links sequentially, one after the other, to scrape data.

Parallel Mining

Instead of processing the links to be followed one after the other, the latest update of WebHarvy processes them in bulk, in parallel, using multiple mining threads. You can set the maximum number of parallel mining threads that WebHarvy uses in the Advanced Miner Options window.

Setting a higher value for the ‘Maximum number of parallel mining threads’ option in the Advanced Miner Options window will increase mining speed. However, to run more threads in parallel, WebHarvy requires more memory, processing power and internet bandwidth. So we recommend that you increase this setting only as far as your system’s CPU, installed physical memory (RAM) and internet speed allow.
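The difference between sequential and parallel link processing can be illustrated with a small Python sketch. This is not WebHarvy's internal code: the function names (fetch_details, mine_sequential, mine_parallel) and the max_threads parameter are hypothetical, and a short sleep stands in for loading a real page.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_details(link: str) -> str:
    """Hypothetical stand-in for loading one detail page (sleeps, no network)."""
    time.sleep(0.05)
    return f"details for {link}"


def mine_sequential(links):
    # Follow each link one after the other.
    return [fetch_details(link) for link in links]


def mine_parallel(links, max_threads):
    # Follow up to max_threads links at the same time; results keep input order.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(fetch_details, links))


links = [f"listing-{i}" for i in range(8)]

start = time.perf_counter()
sequential = mine_sequential(links)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
parallel = mine_parallel(links, max_threads=4)
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, parallel: {parallel_time:.2f}s")
```

Both runs return the same data, but the parallel run waits on several pages at once. In a real scraper each extra thread also holds a loaded page in memory and competes for bandwidth, which is why the thread count should match what the machine and connection can handle.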

Chrome developer tools

This feature is for power users who are familiar with web page internals like HTML, DOM structure and JavaScript. We use this tool extensively while helping our customers with less straightforward scraping scenarios and complex websites.

Chrome Developer Tools allow you to easily inspect the internal structure of a web page, see how the page is organised, view the HTML and the data hidden in the HTML source, and devise methods to extract that data. You can also find the JavaScript code that runs when buttons or links are clicked and call it directly using these tools.

More accurate automatic sub-text selection

To scrape only a portion of the text displayed in the Capture window, you can highlight the required portion with the mouse. We have improved the accuracy of this method, especially when the selected text lies between delimiter characters such as currency symbols, punctuation or special characters, new lines and spaces.
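Extracting a portion of text that sits between delimiters is commonly expressed as a regular expression with one capture group. The Python sketch below only illustrates that idea; it is not how WebHarvy implements sub-text selection, and the helper name extract_between and the sample text are made up for the example.

```python
import re


def extract_between(text, pattern):
    """Return the first capture group of pattern in text, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical captured text, with the wanted parts between delimiters.
captured = "Price: $1,299.00 (incl. tax)\nIn stock"

# Digits (with optional commas and decimals) that follow a currency symbol.
price = extract_between(captured, r"\$\s*([\d,]+(?:\.\d+)?)")

# Everything after the first new line, up to the end of the text.
availability = extract_between(captured, r"\n(.+)$")

print(price)
print(availability)
```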

Improvements and bug fixes

  1. Improved the select dropdown option. This option now reflects the selection (selected item change) on the page. Earlier, the user had to run separate JavaScript code for the page to reflect a dropdown list selection.
  2. The miner now scrolls the page before clicking on Load More links. This is done to make sure that the ‘load more’ link is visible and loaded before the miner tries to click it.
  3. When text scaling in Windows was not set to 100% (which is the recommended setting on most systems), it was not possible to click and correctly select the required data items during configuration. This issue is fixed in this version. Data selection during configuration now works irrespective of text scaling.
  4. Fixed an issue related to downloading images behind SSL.
  5. Fixed an issue where the miner window was not visible on multi-monitor systems after the monitor configuration changed.
  6. Earlier, the Capture window would become unresponsive for a second or two after applying a Regular Expression to the HTML. This delay has been removed.
  7. Added the browser zoom level and the number of parallel mining threads to the status bar of the configuration browser.
  8. Fixed an issue with loading and displaying the upgrade purchase page in cases where the user’s license has expired.
  9. Disabled ‘Mine all pages/Number of pages to mine’ controls while mining is in progress.
  10. Updated internal browser to a more recent version of Chromium.

Update to the latest version

As always, you can download and install the latest version from http://www.webharvy.com/download.html.

 
