
Scraping data from websites which require login


WebHarvy can scrape data from websites which require authentication (login with a user name and password). To scrape data from such a website (for example www.linkedin.com), follow one of the methods below.


Method 1 (Recommended)


  1. Open Internet Explorer (IE) and navigate to the website.

  2. Log in with your user name and password. If the website offers an option to remember the password ('keep signed in' / 'remember me' option), use it. If IE offers to remember the password, click Yes. Keep the IE window open.

  3. Open WebHarvy and navigate to the website.

  4. Log in with your user name and password (if you are not already shown as logged in).

  5. Configure WebHarvy to scrape data (or open a previously saved configuration file).

  6. Start Mine.


Method 2


A disadvantage of the first method is that configurations created with it cannot be scheduled without manual intervention. This is because the login is not handled by the configuration: WebHarvy expects that you have already logged in to the website via IE and WebHarvy's browser. Follow the steps below if you would like to include the login process in the configuration, so that no separate login from IE or WebHarvy's browser is needed when the configuration is run. Configurations created using the method below can be scheduled.

  1. Open WebHarvy and navigate to the login page of the website.

  2. Log in with your user name and password.

  3. Once you have successfully logged in, click Start Config.

  4. Select Edit menu > Edit Options > Disable start-page pattern detection.

  5. If required, click on links in the page to navigate to the target page which displays the data you need to extract. After clicking each link, select More Options > Click from the resulting Capture window to follow that link. You can also use WebHarvy's other methods of interacting with and navigating pages.

  6. Once the target data page is reached, select Edit menu > Edit Options > Disable start-page pattern detection again (to turn the option off).

  7. Select the required data and continue the configuration in the normal way.

  8. Stop Config, Save and Start Mine.


Scraping data from websites which show login in a popup window


Some websites display a popup window as soon as you load them, in which you must enter a user name and password to authenticate before the page is shown. In such cases follow the method below.

  1. Open WebHarvy and load the required page by providing its URL in the address bar in the following format. Here, the user name and password are provided in the URL itself.

     http://username:password@webdomain.com/path1/path2/page.php

  2. Start Config.

  3. Select Edit menu > Edit Options > Edit Start URL/PostData.

  4. Paste the same URL entered in Step 1 in the Start URL box. Apply changes.

  5. Proceed with creating the configuration.

This method relies on the browser's support for supplying the login user name and password as part of the URL.
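
If the user name or password contains characters such as '@', ':' or '/', placing them directly in the URL makes it ambiguous, so they are normally percent-encoded first. The Python sketch below shows one way to build a URL in the format used in Step 1. The function name build_login_url and the sample credentials are hypothetical, and the sketch assumes that WebHarvy's browser decodes percent-encoded credentials as ordinary browsers do, which this article does not confirm.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_login_url(page_url: str, username: str, password: str) -> str:
    """Return page_url with percent-encoded credentials placed before the host.

    Hypothetical helper for illustration only; it is not part of WebHarvy.
    """
    parts = urlsplit(page_url)
    # safe="" makes quote() escape reserved characters such as '@', ':' and '/'.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Drop any credentials already present in the URL, keeping host and port.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, f"{userinfo}@{host}", parts.path, parts.query, parts.fragment)
    )


# Sample values only; substitute your own page URL and credentials.
print(build_login_url(
    "http://webdomain.com/path1/path2/page.php",
    "user@example.com",
    "p@ss:word/1",
))
```

The printed URL can then be pasted into the address bar in Step 1 and into the Start URL box in Step 4. Credentials without special characters pass through unchanged, so the result matches the plain format shown above.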


Scraping data from pages which require CAPTCHA


WebHarvy does not solve CAPTCHAs by itself. As with websites which require login with a user name and password, you have to load the page which shows the CAPTCHA form in Internet Explorer (IE) as well as in WebHarvy's browser, and solve the CAPTCHA manually in both. Once the CAPTCHA is solved, most websites do not display the CAPTCHA form again for the current session.

  1. Open Internet Explorer (IE) and navigate to the website.

  2. Load the page which shows the CAPTCHA form and enter/solve the CAPTCHA.

  3. Open WebHarvy and navigate to the website.

  4. Load the page which shows the CAPTCHA form and enter/solve the CAPTCHA.

  5. Configure WebHarvy to scrape data (or open a previously saved configuration file).

  6. Start Mine.

Please do not hesitate to contact our support team at support@sysnucleus.com with the necessary details in case you need any assistance.