The WebHarvy Settings window can be opened from the Edit menu (Edit menu > Settings). The Miner tab of the Settings window lets you set the following miner options: Automatic Duplicate Removal, Page Load Timeout, AJAX Load Wait Time, Automatic Pause Injection, and Auto Save Mined Data.
Automatic Duplicate Removal
When this option is enabled, WebHarvy will automatically find and remove duplicate entries while mining data.
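The idea behind duplicate removal can be illustrated with a minimal Python sketch. This is not WebHarvy's implementation; the function name `remove_duplicates` and the assumption that a duplicate means "every cell of the row matches an earlier row" are hypothetical, chosen only for illustration.

```python
def remove_duplicates(rows):
    """Return rows with exact duplicates dropped, keeping the first occurrence.

    Assumption: each row is a sequence of hashable cell values, and two rows
    are duplicates only when every cell is equal.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

For example, passing three rows where the third repeats the first returns only the first two, in their original order.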
Page Load Timeout
The 'Page Load Timeout' value specifies the maximum time WebHarvy should wait for a page to load completely. The default value is 30 seconds. It can be set as low as 5 seconds (for high-speed connections and fast-responding websites) to mine pages faster.
AJAX Load Wait Time
The 'AJAX Load Wait Time' value specifies the additional time the WebHarvy miner should wait after the page has loaded, before parsing data. The default value is 5 seconds. Increase it (to 60 seconds or more if required) if you have problems mining data from websites which are slow to respond or which employ AJAX.
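To see how the two timing values combine, here is a rough Python sketch of the wait per page before parsing starts. It assumes that when the page load timeout is reached the miner moves on to the AJAX wait; the text above does not say what happens on timeout, so treat that as an assumption. The function name `wait_before_parsing` is hypothetical.

```python
def wait_before_parsing(load_seconds, page_load_timeout=30.0, ajax_wait=5.0):
    """Estimate seconds spent waiting on one page before parsing begins.

    load_seconds: how long the page actually takes to load.
    Assumption: the load wait is capped at page_load_timeout, and the
    AJAX wait is always added afterwards.
    """
    load_wait = min(load_seconds, page_load_timeout)
    return load_wait + ajax_wait
```

With the default values, a page that loads in 2 seconds is parsed after about 7 seconds, and under this assumption no page waits longer than 35 seconds. Lowering either value shortens the total mining time roughly in proportion to the number of pages.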
Inject pauses during mining
This option allows you to periodically pause the miner while scraping data. This prevents the miner from sending requests to the website continuously over a long period, thereby reducing the chance of the website blocking your IP address. The 'Add random pauses' option adds random wait times after each page load to emulate human behavior.
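The effect of a random pause after each page load can be sketched in Python as follows. This is only an illustration of the technique, not WebHarvy's code; the function name `random_pause` and the 1 to 5 second default range are assumptions.

```python
import random
import time


def random_pause(min_seconds=1.0, max_seconds=5.0, rng=random, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it.

    rng and sleep are injectable so the behavior can be checked without
    actually waiting. The default range is a hypothetical choice.
    """
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

A scraper written by hand would call `random_pause()` after each page load so that requests do not arrive at a fixed, machine-like interval.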
Auto Save Data
This option helps you prevent data loss. When the 'Auto Save Mined Data' option is enabled, the miner periodically saves the scraped data to a predefined file on your computer. Saving periodically is good practice and ensures that data captured over long mining sessions is not lost due to unexpected problems.
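Periodic saving can be pictured with the short Python sketch below. The function names, the CSV output format, and the "save every N rows" trigger are all assumptions made for illustration; the text above does not state how often WebHarvy saves or in which format.

```python
import csv


def save_rows(rows, path):
    """Write all rows collected so far to a CSV file, replacing older content."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def mine_with_autosave(row_source, path, save_every=50):
    """Collect rows from row_source, saving to path every save_every rows.

    Assumption: saving is triggered by row count; a time-based trigger
    would work equally well.
    """
    rows = []
    for row in row_source:
        rows.append(row)
        if len(rows) % save_every == 0:
            save_rows(rows, path)
    save_rows(rows, path)  # final save so the last partial batch is kept
    return rows
```

If the process stops unexpectedly, the file still holds everything up to the most recent save, so at most `save_every - 1` rows are lost.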
Category / Keyword Scraping settings
The 'Tag with Category/keyword' option adds an extra column to the data table when mining configurations which have categories/keywords enabled or which have multiple URLs. The additional column is filled with the category name, keyword, or URL related to the captured data. You can also specify the name of this additional column.
The 'Disable automatically identifying category links' option allows you to select links manually, one by one, for the 'Scrape a list of similar links' feature. Click the links one after the other; when you are done selecting, click any empty space to load the first link and start configuration. When this option is enabled, WebHarvy will not try to automatically parse and find the category links.
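As an illustration of what the tagging option produces, the following Python sketch appends a tag column to a small table. The function name `tag_rows` and the default column name "Category" are hypothetical and do not come from WebHarvy.

```python
def tag_rows(header, rows, tag, column_name="Category"):
    """Return a new header and rows with one extra column holding the tag.

    tag is the category name, keyword, or URL the rows were mined from.
    The inputs are left unmodified.
    """
    new_header = list(header) + [column_name]
    new_rows = [list(row) + [tag] for row in rows]
    return new_header, new_rows
```

Rows mined for different keywords can then be combined into one table while still showing which keyword produced each row.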
If your mining configuration includes scraping images, then by default WebHarvy takes the file name for each saved image from the image URL.
You can also choose to save images using a name obtained from another column/cell of the current record. For example, while scraping product details, this allows you to save product images with the product name/title as the file name. So if the product title is the second column of data, select the second option in the Settings window and provide the value '2' for the column number.
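The two naming choices can be sketched in Python as below. This is a hedged illustration rather than WebHarvy's logic: the function name `image_file_name`, the reuse of the URL's file extension, and the replacement of characters that are invalid in file names are all assumptions.

```python
import posixpath
import re
from urllib.parse import urlparse


def image_file_name(image_url, row=None, column_number=None):
    """Pick a file name for a scraped image.

    With no column_number, use the last path segment of the image URL.
    Otherwise use the cell at column_number (1-based, as in the example
    above) and keep the extension taken from the URL.
    """
    url_name = posixpath.basename(urlparse(image_url).path)
    if column_number is None:
        return url_name
    extension = posixpath.splitext(url_name)[1]
    title = str(row[column_number - 1])
    # Assumption: characters not allowed in Windows file names become "_".
    safe_title = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return safe_title + extension
```

For a record whose second column is "Blue Widget" and whose image URL ends in `p123.jpg`, the default choice gives `p123.jpg`, while passing `column_number=2` gives `Blue Widget.jpg`.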