

Scraping data from websites which require login


WebHarvy supports scraping data from websites which require authentication (login with user name and password). To scrape data from such websites (for example, www.linkedin.com), follow the steps below.

  1. Open Internet Explorer (IE) and navigate to the website.

  2. Log in with your user name and password. If the website offers an option to remember the password ('keep signed in' / 'remember me' option), use it. If IE offers to remember the password, click Yes. Keep the IE window open.

  3. Open WebHarvy and navigate to the website.

  4. Log in with your user name and password (if you are not already shown as logged in).

  5. Configure WebHarvy to scrape data (or open a previously saved configuration file).

  6. Start Mine.


Scraping data from websites which show login in a popup window


Some websites display a popup window as soon as you load them, in which you must enter a user name and password to authenticate before the page is shown. In such cases, follow the method below.

  1. Open WebHarvy and load the required page by entering its URL in the address bar in the following format. Here, the user name and password are provided in the URL itself.

     http://username:password@webdomain.com/path1/path2/page.php

  2. Start Config.

  3. Select Edit menu > Edit Options > Edit Start URL/PostData.

  4. Paste the same URL entered in Step 1 in the Start URL box. Apply changes.

  5. Now you can proceed by creating the configuration.

This method relies on the browser's support for passing the login user name and password as part of the URL.
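If the user name or password contains characters such as '@', ':' or '/', placing them in the URL unchanged can make the URL ambiguous. The following Python sketch shows one way to build a URL in the format above with such characters percent-encoded. The function name add_credentials is hypothetical and is not part of WebHarvy. It is also an assumption, not confirmed here, that WebHarvy's browser decodes percent-encoded credentials, so test the resulting URL before relying on it.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def add_credentials(url: str, username: str, password: str) -> str:
    """Return `url` with `username:password@` inserted before the host.

    Hypothetical helper for illustration only; not part of WebHarvy.
    """
    parts = urlsplit(url)
    # Percent-encode every reserved character so that '@', ':' or '/'
    # inside the credentials cannot be mistaken for URL delimiters.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Keep only the host (and port) part, dropping any credentials
    # that were already present in the input URL.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{host}"))


print(add_credentials(
    "http://webdomain.com/path1/path2/page.php", "username", "p@ss:word"
))
# prints http://username:p%40ss%3Aword@webdomain.com/path1/path2/page.php
```

The printed URL can then be used both in the address bar (Step 1) and in the Start URL box (Step 4). Note that a URL built this way contains the password in plain text, and so does any configuration file that stores it.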


Scraping data from pages which require CAPTCHA


WebHarvy does not solve CAPTCHAs by itself. As in the case of websites which require login with a user name and password, you will have to load the page which shows the CAPTCHA form in Internet Explorer (IE) as well as within WebHarvy's browser, and solve the CAPTCHA manually in both. Once the CAPTCHA is solved, most websites will not display the CAPTCHA form again during the current session.

  1. Open Internet Explorer (IE) and navigate to the website.

  2. Load the page which shows the CAPTCHA form and enter/solve the CAPTCHA.

  3. Open WebHarvy and navigate to the website.

  4. Load the page which shows the CAPTCHA form and enter/solve the CAPTCHA.

  5. Configure WebHarvy to scrape data (or open a previously saved configuration file).

  6. Start Mine.

Please do not hesitate to contact our support team at support@sysnucleus.com with the necessary details in case you need any assistance.