Web Scraping Using AI

  • WebHarvy supports AI-assisted web scraping. You can connect WebHarvy to local or cloud AI providers (LLMs) to summarize, analyze and extract data intelligently while mining a website, in addition to using WebHarvy's regular point-and-click data selection.

    1. Supported AI Providers
    2. What can you do with AI in WebHarvy?
    3. Configuring AI Settings
    4. Extracting Data Using AI

  • Supported AI Providers

    WebHarvy can connect to the following AI providers: local LLMs served through Ollama or LM Studio, and the cloud providers OpenAI and Anthropic.

    What can you do with AI in WebHarvy?

    Once an AI provider is configured, it can be used during configuration and mining for tasks such as:

    • Generating summaries from blocks of text
    • Analyzing the sentiment of scraped content
    • Extracting complex insights from unstructured data
    • Transforming or cleaning data before it is captured
    • Extracting data that is difficult to select using regular point-and-click methods

    Configuring AI Settings

    Before you can use AI while selecting data, you need to connect WebHarvy to your preferred AI provider.

    1. Open WebHarvy Settings from the Home menu.
    2. Switch to the AI tab.
       [Screenshot: AI Settings]
    3. Select your AI provider (Ollama (local), OpenAI or Anthropic) and provide the required connection details, such as the local server address for Ollama/LM Studio, or the API key for OpenAI/Anthropic.
    4. Select the model you would like WebHarvy to use, then save the settings.

    Note: Click the 'Test' button to verify that the connection parameters you provided are correct.
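
    If the test fails for a local provider, it can help to check the server independently of WebHarvy. The sketch below is not part of WebHarvy and makes no claim about what the 'Test' button does internally. It assumes an Ollama server at its default address (http://localhost:11434) and uses Ollama's model-listing endpoint (/api/tags). The function names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: Ollama is running locally on its default port.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def tags_url(base_url: str = DEFAULT_OLLAMA_URL) -> str:
    """Build the URL of Ollama's model-listing endpoint."""
    return base_url.rstrip("/") + "/api/tags"


def parse_model_names(payload: bytes) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(payload)
    return [model["name"] for model in data.get("models", [])]


def list_local_models(base_url: str = DEFAULT_OLLAMA_URL,
                      timeout: float = 5.0) -> list[str]:
    """Query a running Ollama server for its installed models.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urlopen(tags_url(base_url), timeout=timeout) as response:
        return parse_model_names(response.read())
```

    Calling list_local_models() with the server running returns the names of the installed models, which should include the model you intend to select in the AI tab. An error here suggests the problem lies with the local server or its address rather than with WebHarvy.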

    Extracting Data Using AI

    Once an AI provider has been configured, you can use it while selecting data during configuration.

    1. Click on the area of the webpage from which you want to extract data using AI.
    2. From the More Options menu, select Extract with AI.
       [Screenshot: Scrape with AI option in WebHarvy]
    3. In the window that appears, specify:
      • The extraction area: either the currently selected region or the entire page.
      • The source to use for extraction: the page's displayed (rendered) text or its underlying HTML code.
    4. Describe what you want the AI to do, for example summarize the selected text, determine its sentiment, or pull out specific values from it. WebHarvy will capture the AI-generated output as a data column, just like any other selected field.
       [Screenshot: Scrape with AI dialog in WebHarvy]
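
    To make the difference between the two sources concrete, the following sketch shows how the same page fragment looks as underlying HTML and as displayed text, and how an instruction could be combined with either one. It is an illustration only: WebHarvy's actual text extraction and prompt format are not documented here, and visible_text, build_prompt and the separator used are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Rough approximation of the text a browser would display."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def build_prompt(instruction: str, content: str) -> str:
    """Hypothetical prompt layout: the instruction, a separator, the content."""
    return f"{instruction.strip()}\n\n---\n{content.strip()}"


html_source = '<div class="price"><b>$19.99</b> <span>In stock</span></div>'
text_source = visible_text(html_source)

prompt_from_text = build_prompt("Return only the price.", text_source)
prompt_from_html = build_prompt("Return only the price.", html_source)
```

    The text version is shorter and usually sufficient for summaries or sentiment. The HTML version keeps tags and attributes such as class="price", which can help when the instruction depends on page structure, at the cost of sending more content to the model.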

    Note: Using cloud-based AI providers (OpenAI, Anthropic) sends the selected page content to the respective provider for processing and may incur usage costs according to that provider's pricing. Local LLMs run via Ollama or LM Studio process data entirely on your own computer and do not incur usage charges.