JavaScript for Web Scraping and Automation

Last updated: November 20, 2025

JavaScript is one of the most popular programming languages used for web scraping and automation, mainly because it is the language natively supported by web browsers. This article will help you master JavaScript techniques useful for web scraping and web automation.

Modern web scraping involves the following stages:

  1. Loading the web page within a browser (commonly a headless browser instance)
  2. Waiting for the page to load completely
  3. Interacting with the page - following links, selecting options, scrolling down the page, submitting forms, etc.
  4. Selecting the required data from the page

JavaScript is used in stages 3 and 4 of web scraping - for page interaction and data selection.

JavaScript Techniques for Web Scraping

In this article you will learn how to perform the following actions using JavaScript:

  1. Various ways to scroll down a page or list of items
  2. How to manipulate DOM elements, tables and lists?
  3. How to click on links, buttons etc.?
  4. How to read text from elements?
  5. How to navigate pages?
  6. How to submit forms?
  7. Various ways to input text
  8. How to wait for dynamic content to load?
  9. How to handle content within a Shadow DOM?
  10. Mouse hovering techniques
  11. How to select options from dropdowns/comboboxes?
  12. How to store data and pass data between pages?

Various ways to scroll down a page or list of items

When scraping websites with dynamic content loading or infinite scroll, controlled scrolling is essential. The code given below performs smooth scrolling through list elements or to the end of the page, with customizable delays.


Smoothly Scroll a List of Items


function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function scrollList() {
    var list = document.getElementsByClassName('data-cards')[0];
    for (var i = 0; i < list.childElementCount; i++) {
        list.children[i].scrollIntoView();
        await sleep(100);
    }
}

scrollList();

This function simulates slow scrolling through a list of items, pausing briefly between each scroll to allow content to load properly. For example, pages which employ lazy loading of images can be fully loaded using this code.


Scroll to Last Item in a List


var list = document.querySelector('selector-string').children;
list[list.length-1].scrollIntoView();

This code scrolls to the last element in a container, which often triggers infinite scroll loading of additional content.


Scroll to the End of Page


window.scrollTo(0, document.body.scrollHeight);

This code scrolls to the end of the web page, which often triggers infinite scroll loading of additional content.
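

Repeatedly Scroll Until All Content Loads

For pages which load more content every time the bottom is reached, a single scroll may not be enough. The sketch below reuses the sleep helper shown earlier and keeps scrolling until the page height stops growing; the 1000 ms delay is an assumption you may need to tune for each site.


async function scrollToEnd() {
    var lastHeight = 0;
    // keep scrolling while the page keeps growing
    while (document.body.scrollHeight > lastHeight) {
        lastHeight = document.body.scrollHeight;
        window.scrollTo(0, document.body.scrollHeight);
        await sleep(1000); // allow newly loaded content to render
    }
}

scrollToEnd();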

How to manipulate DOM elements, tables and lists?

Removing unwanted elements from the DOM can help clean up the page structure and improve scraping accuracy. The code given below performs element deletion and the merging of multiple tables or lists.


Delete a Single Element


element.parentElement.removeChild(element);

Delete All Elements of a Specific Class


var theaders = document.getElementsByClassName('js-tournament');
for (var i=theaders.length-1; i >= 0; i--) {
    theaders[i].parentElement.removeChild(theaders[i]);
}

This code removes all elements with a specific class name. Note the reverse iteration to avoid index shifting issues when removing elements.
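
On modern browsers, the same cleanup can be written more concisely with querySelectorAll and remove(); because querySelectorAll returns a static list, there is no index-shifting concern. A minimal equivalent sketch:


// remove every element with the given class in one pass
document.querySelectorAll('.js-tournament').forEach(el => el.remove());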


Merge Multiple Tables into a Single Table

Merging multiple tables or lists on a page into a single table or list can help consolidate data for easier extraction. This ensures that all items are scraped in one go.


var tables = document.querySelectorAll("tbody");
for (var i = 1; i < tables.length; i++) {
    var rows = tables[i].querySelectorAll("tr");
    for (var j = 0; j < rows.length; j++) {
        tables[0].appendChild(rows[j]);
    }
}

This code combines multiple table bodies into the first table, useful for consolidating data from separate tables on the same page.
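

Read the Merged Table into an Array

After merging, you may want to pull the consolidated rows into a JavaScript array for inspection. A minimal sketch, assuming the merge above has already run:


var mergedRows = document.querySelectorAll("tbody")[0].querySelectorAll("tr");
var data = Array.from(mergedRows).map(function (row) {
    // collect the trimmed text of every cell in the row
    return Array.from(row.cells).map(function (cell) {
        return cell.textContent.trim();
    });
});
console.log(data);

This produces an array of row arrays, one entry per cell, which is convenient for checking that the merge captured everything.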


Merge Various Groups/Lists into the First Group/List


var groups = document.getElementsByClassName('groups-class-name');
var parent = groups[0];
for (var i = groups.length - 1; i >= 1; i--) {
    var group = groups[i];
    for (var j = group.children.length - 1; j >= 0; j--) {
        parent.appendChild(group.children[j]);
    }
}

This merges multiple groups or lists into the first one. Note that this may change the order of items.


Merge by Preserving Order


var groups = document.getElementsByClassName('groups-class-name');
var anchor = null;
var parent = groups[0];
for (var i = groups.length - 1; i >= 1; i--) {
    var group = groups[i];
    for (var j = group.children.length - 1; j >= 0; j--) {
        anchor = parent.insertBefore(group.children[j], anchor);
    }       
}

This version preserves the original order of items when merging multiple groups or lists.

How to click on links, buttons etc.?

Automating clicks on multiple elements is useful for expanding content, opening details, or navigating through interactive elements.


Click All Elements with a Specific Class


var elements = document.getElementsByClassName("expand-item");
for (var i = elements.length - 1; i >= 0; i--) {
    elements[i].click();
}

This code clicks all elements with a specified class name, useful for expanding collapsible content or triggering multiple actions.
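

Click Elements with a Delay Between Clicks

Clicking many elements at once can fire a burst of requests. The variation below reuses the sleep helper shown earlier to pause between clicks; the class name and 300 ms delay are placeholders to adjust per site.


async function clickAllSlowly(className, delayMs) {
    var elements = document.getElementsByClassName(className);
    for (var i = elements.length - 1; i >= 0; i--) {
        elements[i].click();
        await sleep(delayMs); // let the triggered content load
    }
}

clickAllSlowly('expand-item', 300);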

How to read text from elements?

Reading text and HTML content from elements is essential for data extraction during web scraping. Various methods are available depending on the selector type and whether you need text only or complete HTML content.


Read Text by Element ID


var text = document.getElementById('element-id').textContent;

This retrieves the text content from an element using its ID. Use textContent to get only text or innerHTML to get HTML content.


Read Text by Class Name


var elements = document.getElementsByClassName('class-name');
for (var i = 0; i < elements.length; i++) {
    var text = elements[i].textContent;
    console.log(text);
}

This retrieves text from all elements with a specific class name. The loop iterates through all matching elements.


Read Text by CSS Selector


var text = document.querySelector('selector-string').textContent;

Use querySelector to select a single element with a CSS selector, and querySelectorAll to select multiple elements.


Read Text from All Matching CSS Selectors


var elements = document.querySelectorAll('selector-string');
var textArray = Array.from(elements).map(el => el.textContent);

This retrieves text from all elements matching a CSS selector and stores them in an array for further processing.


Read Text by Tag Name


var elements = document.getElementsByTagName('p');
for (var i = 0; i < elements.length; i++) {
    var text = elements[i].textContent;
    console.log(text);
}

This retrieves text from all elements of a specific tag type, such as all paragraphs, links, or headings.


Read Text Using XPath


function getTextByXPath(xpath) {
    var result = document.evaluate(xpath, document, null, 
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    var textArray = [];
    for (var i = 0; i < result.snapshotLength; i++) {
        textArray.push(result.snapshotItem(i).textContent);
    }
    return textArray;
}

// Usage: getTextByXPath('//p[@class="description"]');

This function uses XPath expressions to locate and extract text from elements. XPath provides powerful selection capabilities for complex document structures.


Read HTML Content of Element


var html = document.getElementById('element-id').innerHTML;

This retrieves the complete HTML content inside an element, including all child elements and tags.


Get HTML of Entire Page


var pageHTML = document.documentElement.innerHTML;

This retrieves the HTML inside the html element, covering both the head and body. Use document.documentElement.outerHTML if you also need the enclosing html tag itself.


Read Element Attributes


var element = document.querySelector('selector-string');
var href = element.getAttribute('href');
var title = element.getAttribute('title');
var dataAttr = element.getAttribute('data-custom-attribute');

This retrieves specific attributes from elements, such as links, titles, and custom data attributes. Useful for extracting URLs and metadata.
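

Collect All Link URLs on a Page

A common application of attribute reading is harvesting every link URL on a page; a minimal sketch:


// the href property resolves each link to an absolute URL
var urls = Array.from(document.querySelectorAll('a[href]')).map(a => a.href);
console.log(urls);

Note that the href property returns the absolute URL, while getAttribute('href') returns the raw attribute value as written in the HTML.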

How to navigate pages?

Navigation functions help control page flow and extract information about the current page state.


Get Current Page URL


window.location.href

Navigate Back


window.history.back();

Navigate Forward


window.history.forward();
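
Navigate to a Specific URL

Direct navigation is sketched below; the URL is only a placeholder.


window.location.href = 'https://example.com/page-2';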

These simple commands provide essential navigation control for web scraping workflows.

How to submit forms?

Form submission is often required for accessing protected content or submitting search queries during web scraping.


Submit Form


formEl.submit();

This command programmatically submits a form element. Note that submit() skips the form's submit event handlers and built-in validation.
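

Submit Form and Trigger Its Event Handlers

When those handlers and validation need to run, modern browsers provide requestSubmit(), which behaves like clicking a submit button; a minimal sketch, assuming the form's selector:


var formEl = document.querySelector('form'); // adjust the selector as needed
formEl.requestSubmit(); // fires submit handlers and validation, then submits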

Various ways to input text

Different websites and frameworks require different approaches for programmatic text input. These methods cover various scenarios. The code given below not only fills the input field with the specified text, but also triggers the necessary events to notify the web page of the change.


Standard Text Input Method


const changeValue = (element, value) => {
    const event = new Event('input', { bubbles: true });
    event.simulated = true;
    element.value = value;
    element.dispatchEvent(event);
};

Text Input for React/Vue Frameworks


const input = document.querySelector("selector-string"); 
Object.defineProperty(input, 'value', { value: 'paste-text-here', writable: true}); 
input.dispatchEvent(new Event('input', { bubbles: true }));

Modern frameworks like React and Vue require special handling for programmatic input. The second method works better with these frameworks by properly triggering their change detection mechanisms.
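

Text Input Using the Native Value Setter

If the method above does not register with a particular framework version, a widely used alternative is to invoke the browser's native value setter directly and then dispatch the input event. A sketch, assuming a standard text input element:


const input = document.querySelector("selector-string");
// fetch the native value setter from the input prototype,
// bypassing any setter the framework may have installed
const nativeSetter = Object.getOwnPropertyDescriptor(
    window.HTMLInputElement.prototype, 'value').set;
nativeSetter.call(input, 'paste-text-here');
input.dispatchEvent(new Event('input', { bubbles: true }));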

How to wait for dynamic content to load?

Timing control is crucial in web scraping to allow content to load properly and avoid overwhelming servers.


Simple Sleep Function


function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage example
await sleep(1000); // Wait for 1 second

This sleep function creates delays in your scraping process, allowing time for dynamic content to load or reducing the rate of requests to be respectful to the target server.
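

Wait for an Element to Appear

Fixed delays either waste time or fire too early. A more robust pattern, sketched below, uses a MutationObserver to resolve as soon as an element matching a selector appears; the waitForElement name and 10 second timeout are illustrative choices, not a library API.


function waitForElement(selector, timeoutMs = 10000) {
    return new Promise((resolve, reject) => {
        // resolve immediately if the element already exists
        const existing = document.querySelector(selector);
        if (existing) return resolve(existing);
        const observer = new MutationObserver(() => {
            const el = document.querySelector(selector);
            if (el) {
                observer.disconnect();
                resolve(el);
            }
        });
        observer.observe(document.body, { childList: true, subtree: true });
        // give up after the timeout so the scraper never hangs
        setTimeout(() => {
            observer.disconnect();
            reject(new Error('Timed out waiting for ' + selector));
        }, timeoutMs);
    });
}

// Usage example
var el = await waitForElement('.data-cards');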

How to handle content within a Shadow DOM?

Shadow DOM elements are often used in modern web applications and can be challenging to scrape. These techniques help access content within shadow roots.


Select Data from #shadow-root Element


var container = document.getElementById('shadow-root-parent-id');
document.body.innerHTML = container.shadowRoot.querySelector('div').innerHTML;

This code copies content from inside a shadow-root element into the regular DOM so that it can be selected for scraping. The page design will be lost, but the HTML structure is preserved.


Alternative Shadow DOM Solution


function querySelectorAllShadows(selector, el = document.body) {
  const childShadows = Array.from(el.querySelectorAll('*'))
    .map(el => el.shadowRoot).filter(Boolean);
  const childResults = childShadows.map(child => querySelectorAllShadows(selector, child));
  const result = Array.from(el.querySelectorAll(selector));
  return result.concat(childResults).flat();
}
document.body.innerHTML = querySelectorAllShadows('#app')[1].innerHTML;

This function searches for elements both inside and outside the shadow DOM. The [1] index selects the desired element inside the shadow-root.

Mouse hovering techniques

The following code snippets simulate mouse hover actions on elements, useful for revealing tooltips/popups or triggering hover-based content.


Hover Mouse Over Element (jQuery)


$(".username.mo").mouseover();

This code simulates mouse hover over all elements with the specified class names, using jQuery.


Load jQuery on Page


var script = document.createElement('script');
script.type = "text/javascript";
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(script);

If the page is not already using jQuery, this code loads jQuery onto the page so that jQuery-based commands can be run.
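
Since the script loads asynchronously, jQuery commands run immediately afterwards may fail. One way to sequence things, sketched below, is the script element's onload callback:


var script = document.createElement('script');
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js";
script.onload = function () {
    // jQuery is now available; safe to run jQuery-based commands
    $(".username.mo").mouseover();
};
document.getElementsByTagName('head')[0].appendChild(script);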


Hover Mouse Over Element (Vanilla JavaScript)


var event = new MouseEvent('mouseover', { bubbles: true });
element.dispatchEvent(event);

Disable All Mouseover/Hover Events


$('div').unbind("mouseenter mouseleave hover");

How to select options from dropdowns/comboboxes?


Select Dropdown Option


var element = document.getElementById('dropdown-id');
element.selectedIndex = 5; // index of the option to select
var event = new Event('change');
element.dispatchEvent(event);

This code selects the specified option index from a dropdown element and triggers the change event to notify the page of the selection.
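

Select Dropdown Option by Value or Text

When the option index is not known in advance, you can match on the option's value or visible text instead; a minimal sketch (the dropdown ID and target text are placeholders):


var select = document.getElementById('dropdown-id');
for (var i = 0; i < select.options.length; i++) {
    // match on visible text; compare options[i].value to match on value
    if (select.options[i].text === 'Option Text') {
        select.selectedIndex = i;
        select.dispatchEvent(new Event('change'));
        break;
    }
}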

How to store data and pass data between pages?

These techniques help store and manage data across different pages and sessions during web scraping.


Store Data as Element Attribute


element.setAttribute('attr-name', 'attr-data');

This stores data on an element as a new attribute. The data can be read later using element.getAttribute('attr-name').


Store Variables Across Pages


window.name = location.href;

This stores the current page URL in window.name, which persists across page navigation.
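
Since window.name holds an arbitrary string, it can also carry structured data between pages if you serialize it; a sketch using JSON (the field names are illustrative):


// before navigating away: store structured data
window.name = JSON.stringify({ returnUrl: location.href, page: 3 });

// on the next page: read it back
var state = JSON.parse(window.name);
console.log(state.returnUrl, state.page);

If you store JSON this way, navigate back using the parsed returnUrl field rather than assigning window.name directly.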


Navigate Back Using Stored URL


location.href = window.name;

This navigates back to a previously stored URL. Useful for returning to listing pages after opening popups that don't provide a way to close or navigate back.

Best Practices

When using these JavaScript snippets for web scraping:

  • Always test code snippets on the target website first
  • Be respectful of server resources by implementing appropriate delays
  • Handle errors gracefully with try-catch blocks
  • Consider the website's terms of service and robots.txt
  • Use browser developer tools to inspect and test element selectors

Integration with WebHarvy

These JavaScript snippets can be integrated with WebHarvy using the Run Script feature. This allows you to enhance your scraping configurations with custom logic for handling complex websites and dynamic content.

Download and try the free 15-day evaluation version of WebHarvy

Summary

This collection of JavaScript snippets provides essential building blocks for web scraping applications. Whether you're dealing with dynamic content, form interactions, or complex page navigation, these snippets can help streamline your workflow and handle challenging scraping scenarios effectively.

For more advanced web scraping capabilities and a user-friendly interface, consider using WebHarvy, which provides point-and-click web scraping without requiring extensive programming knowledge.

Related

  1. What is Web Scraping?
  2. Web Scraping using Regular Expressions
  3. Pros and Cons of AI Web Scraping