WebHarvy Configuration File (XML) Format

WebHarvy Configuration File Format Explained

WebHarvy configuration files are saved in XML format. This page describes that format so that advanced users can directly edit an XML configuration file created by WebHarvy.

WebHarvy allows you to change most of the details in the configuration directly from the UI. See Editing Configuration for more details. This lets you easily change the configuration parameters without manually editing the XML file.

This document provides only a high-level description of the configuration file format and is not complete. Please contact our Support team if you need any further information.

Header

The header portion of the configuration file is as follows. This portion is the same for all configuration files.

Version and Registration Info

The version and registration details in the configuration file are optional and are written by WebHarvy versions 6.3.0.189 and above.

Miner Options

From version 6.3.0.189, Advanced Miner Options are saved in the configuration file. If this part is not present, the default values of these options from Settings are used.

Selection Accuracy Values : Strict (-1), Low (0), Medium (1), High (2), Highest (3)
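If you post-process configuration files with a script, the numeric Selection Accuracy values listed above can be turned back into their labels. The following is only a minimal sketch: the class and function names are made up for illustration, and how the value is actually stored inside the file is not covered here.

```python
from enum import IntEnum


class SelectionAccuracy(IntEnum):
    """Selection Accuracy values as listed in this document."""
    STRICT = -1
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    HIGHEST = 3


def describe_accuracy(raw: str) -> str:
    """Turn a stored value such as "2" into its label ("High").

    Raises ValueError for a number that is not in the list above.
    """
    return SelectionAccuracy(int(raw)).name.capitalize()
```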

URL details

The following StartURL tag describes the page from which data scraping starts. The url tag inside StartURL contains the URL of that page. The StartURL tag can optionally contain headers and postdata tags if required.

Editing this portion directly from UI : How to edit Start URL, PostData and Headers
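As a rough illustration of the StartURL structure described above, the sketch below reads the url, headers and postdata tags using Python's standard XML parser. The sample XML is hypothetical: only the tag names MineParams, StartURL, url, headers and postdata come from this document, while the exact nesting, the sample values and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: nesting and values are assumptions for illustration only.
SAMPLE = """\
<MineParams>
  <StartURL>
    <url>https://www.example.com/list</url>
    <postdata>q=shoes</postdata>
  </StartURL>
</MineParams>
"""


def read_start_url(xml_text: str) -> dict:
    """Return the url plus the optional headers/postdata found under StartURL."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so a bare <StartURL> fragment works too.
    start = next(root.iter("StartURL"), None)
    if start is None:
        raise ValueError("no StartURL element found")
    return {
        "url": (start.findtext("url") or "").strip(),
        "headers": start.findtext("headers"),    # None when the tag is absent
        "postdata": start.findtext("postdata"),  # None when the tag is absent
    }


START = read_start_url(SAMPLE)
```

Always work on a copy of the configuration file when experimenting with scripts like this.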

Field List

This section, which follows StartURL, lists the data to be extracted from the start URL. Each Data Field describes a data element to be extracted or a link to be followed.

Data Field

Each Data Field takes the following format:

The type tag defines the type of data. It can take the following values :

Text : Capture element's text
Text_Near_Heading : Capture text next to the heading text
Url : Capture element's URL
Image : Download image
Image_URL : Capture image URL
Image_RegEx : Capture image from URL obtained by applying RegEx on HTML
Image_RegExMulti : Capture multiple images. First image URL obtained by applying RegEx on HTML
HTML : Capture HTML code
File : Capture element's text as file
Link_Follow : Follow link
Link_RegEx : Follow link obtained by applying RegEx on HTML
Click : Click the element
Link_Back : Navigate back (after a link has been followed)
Link_NextPage : Link to load next page (for paginated lists)
Link_LoadNextPageSet : Link to load next set of pages
Link_LoadMoreContent : Link to load more content (display more results)
Auto_Scroll : Load more data by scrolling down the page
Input_Text : Enter string in input text field
Invoke_Script : Run JavaScript on page
Open_Popup : Click to open popup and extract data
Select : Select list/dropdown option
Scroll : Scroll page down slowly to load all contents
Custom : Custom data fields (Page URL, Page Screenshot, Date-Time, Text)

The name tag provides a name for the data element. For Text/URL/Image/HTML/File elements this will be the name of the corresponding data column in the scraped data.

The selector tag provides the CSS selector of the data field. This is used to locate the element (and following patterns) during mining.

In WebHarvy versions before 6.0, the xpath tag is used instead of the selector tag. The xpath tag provides the XPath which denotes the data element. WebHarvy uses a customized XPath format, which is explained as follows:

The path starts with the topmost HTML tag, which is <HTML>. Each tag name is followed by two indices ([ ]). The first one denotes the index position of the current tag relative to its parent tag (the first child is at [0], the next one at [1] and so on). The second one is optional and denotes the class id of the tag (if one exists).
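To make the two-index notation more concrete, here is a small sketch that splits a single path step into tag name, index and optional class id. The exact written syntax is an assumption based on the description above (for example, that the class id sits in a second pair of square brackets right after the index), and the sketch deliberately does not guess how steps are joined into a full path. The names Step and parse_step are illustrative only.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of one step: TAG[index] optionally followed by [class id].
_STEP = re.compile(
    r"^(?P<tag>[A-Za-z][A-Za-z0-9]*)\[(?P<index>\d+)\](?:\[(?P<cls>[^\]]*)\])?$"
)


class Step(NamedTuple):
    tag: str
    index: int                # zero-based position among the parent's children
    class_id: Optional[str]   # None when the second index is absent


def parse_step(step: str) -> Step:
    """Split one step such as "DIV[2][main]" into its parts."""
    m = _STEP.match(step)
    if m is None:
        raise ValueError(f"not a TAG[index][class] step: {step!r}")
    return Step(m.group("tag"), int(m.group("index")), m.group("cls"))
```

For instance, under these assumptions parse_step("HTML[0]") yields the tag HTML at index 0 with no class id.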

The heading tag is optional and, if present, contains the heading text for the Text_Near_Heading type.

The pattern tag can take the values 'true' or 'false'. Use 'true' for repeating data (valid only in the start page) and 'false' otherwise.

The regex tag is optional. If a value (a regular expression) is provided, it is matched against the captured element's text.
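The tags described above can be read from a single Data Field with a few lines of Python. This is only a sketch under stated assumptions: the wrapper element name DataField and all sample values are hypothetical, and the way the regular expression result is chosen (first capture group if there is one, otherwise the whole match) is an assumption rather than documented WebHarvy behaviour.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Hypothetical field: the wrapper name "DataField" and all values are assumptions.
SAMPLE_FIELD = """\
<DataField>
  <type>Text</type>
  <name>Price</name>
  <selector>div.item span.price</selector>
  <pattern>true</pattern>
  <regex>([0-9.]+)</regex>
</DataField>
"""


@dataclass
class Field:
    type: str
    name: str
    selector: Optional[str]
    heading: Optional[str]   # only meaningful for Text_Near_Heading
    pattern: bool            # True for repeating data
    regex: Optional[str]


def read_field(xml_text: str) -> Field:
    """Read the child tags of one Data Field; optional tags become None."""
    el = ET.fromstring(xml_text)
    return Field(
        type=el.findtext("type", default=""),
        name=el.findtext("name", default=""),
        selector=el.findtext("selector"),
        heading=el.findtext("heading"),
        pattern=el.findtext("pattern", default="false").strip().lower() == "true",
        regex=el.findtext("regex") or None,
    )


def apply_regex(field: Field, captured_text: str) -> str:
    """Assumed behaviour: keep the first group (or whole match) when the regex matches."""
    if not field.regex:
        return captured_text
    m = re.search(field.regex, captured_text)
    if m is None:
        return ""
    return m.group(1) if m.groups() else m.group(0)


FIELD = read_field(SAMPLE_FIELD)
```

Testing a regular expression this way before placing it in the regex tag can help catch patterns that never match the captured text.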

Code for 'Add URLs to Configuration' / 'Scrape a list of similar links'

Using the 'Add URLs to Configuration' option in Edit menu > Edit Options, you can directly add URLs to an existing configuration without manually editing the configuration XML file. See also: Scrape a list of similar links.

If you need to add a list of URLs to a configuration file, so that WebHarvy scrapes data from each URL in the list as per the configuration, add the following XML code. This code should be added towards the end of the configuration XML file, before </MineParams>.

To make this work, make sure that the first URL in the list (www.url1.com) is the same as the one provided in the <StartURL> tag (see above) at the start of the configuration file.
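The rule about the first URL can be checked before the list is pasted into the file. The helper below is a hypothetical sketch, not part of WebHarvy: it assumes the URLs are compared as plain strings after trimming whitespace, and whether WebHarvy itself normalizes URLs in any way is not known here.

```python
from typing import List


def check_url_list(start_url: str, urls: List[str]) -> List[str]:
    """Return the cleaned list, or raise if its first entry is not the start URL."""
    cleaned = [u.strip() for u in urls if u.strip()]  # drop blank lines
    if not cleaned:
        raise ValueError("URL list is empty")
    if cleaned[0] != start_url.strip():
        raise ValueError(
            f"first URL {cleaned[0]!r} differs from StartURL {start_url.strip()!r}"
        )
    return cleaned
```

For example, check_url_list("www.url1.com", ["www.url1.com", "www.url2.com"]) passes, while a list that begins with www.url2.com raises an error.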

Code for Keyword Scraping

Using the 'Edit Keywords' option in Edit menu > Edit Options, you can directly edit keywords associated with a configuration without manually editing the configuration XML file.

The following is the format for enabling Keyword based Scraping. The first keyword provided should match the keyword used in the start URL/PostData.
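The requirement that the first keyword matches the start URL/PostData can also be checked with a short script. This is a sketch under assumptions: the function name is made up, and it assumes the keyword appears in the URL or PostData either literally or in one of the two common URL-encoded forms (space as %20 or as +).

```python
from typing import List
from urllib.parse import quote, quote_plus


def first_keyword_in_request(
    keywords: List[str], start_url: str, postdata: str = ""
) -> bool:
    """Check that the first keyword occurs in the start URL or the PostData."""
    if not keywords:
        return False
    first = keywords[0]
    haystack = start_url + "\n" + postdata
    # Literal form plus the two usual URL encodings of the same keyword.
    variants = {first, quote(first), quote_plus(first)}
    return any(v in haystack for v in variants)
```

A False result suggests that either the keyword list or the start URL/PostData needs to be corrected before the configuration is run.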

In case you need any further information please do not hesitate to contact our support team at support@webharvy.com with the necessary details.