How to scrape hidden details from the HTML source of a web page?

Web scraping typically refers to extracting data that is visibly displayed on web pages. However, the information you need is not always presented openly like the text or images on the page. Instead, it may be hidden within the HTML source code of the page. Examples of such hidden data include latitude and longitude values on pages displaying maps, UPC/product IDs on eCommerce product detail pages, and data in HTML meta tags.


Scraping data from HTML source

When data is not visually displayed and is available only in the HTML source of the page, the web scraping software has to read the complete HTML source of the loaded page and locate the required data by performing a search.

Regular Expressions to select required data

Regular expressions can be used to match and select the required data from the HTML of the page. The advantage of this technique is that it relies only on the text surrounding the value in the source, so it can be adapted to most websites and data requirements.

Steps to follow

  1. Scrape the entire HTML of the page
  2. Apply a regular expression to select the required data from the HTML
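
Outside of WebHarvy, the two steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not WebHarvy's implementation: the function names fetch_html and select_first, the User-Agent header and the sample markup are all assumptions made for the example. Note that this fetches the raw HTML returned by the server, so values that a page adds later through JavaScript will not be present.

```python
import re
from typing import Optional
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Step 1: download the raw HTML source of a page (hypothetical helper)."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def select_first(html: str, pattern: str) -> Optional[str]:
    """Step 2: return the first capture group of the first match, or None."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


# Offline demonstration on made-up markup, so no network access is needed.
sample_html = '<head><meta name="description" content="Blue widget"></head>'
description_pattern = r'<meta name="description" content="([^"]*)'
print(select_first(sample_html, description_pattern))
```

For a live page, the same selection would be select_first(fetch_html(url), pattern), where url and pattern depend on the site being scraped.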

WebHarvy

WebHarvy is visual web scraping software that can be used to scrape data from any website using a point-and-click user interface.

WebHarvy allows you to capture the HTML source of the whole page or of a selected region using the Capture HTML option in the Capture window. Double-clicking the Capture HTML toolbar icon in the Capture window loads the HTML of the entire page. A regular expression can then be applied to select the required portion.

Video : Scraping hidden UPC codes from HTML

The following video shows how WebHarvy can be used to scrape UPC codes from the HTML source of product details pages. These codes are not visually displayed by the page in the browser.


The regular expression string used in the above video is given below. The RegEx string to use varies from website to website. To learn how to write regular expressions for your own requirements, please refer to this guide.

var\s+upc\s+=\s+'([^']*)
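
To see what the expression above selects, it can be tried against a small piece of page source in Python. The sample script block and the UPC value below are made up for illustration; only the pattern itself comes from this article.

```python
import re

# The pattern from the article: the parentheses capture everything between
# the opening quote and the next single quote.
UPC_PATTERN = re.compile(r"var\s+upc\s+=\s+'([^']*)")

# Hypothetical page source; real pages will differ.
sample_source = """
<script>
  var upc = '012345678905';
</script>
"""

match = UPC_PATTERN.search(sample_source)
upc = match.group(1) if match else None
print(upc)  # prints 012345678905
```

Because the pattern uses \s+ on both sides of the equals sign, it requires at least one whitespace character there. A page that writes var upc='...' without spaces would need \s* instead.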

Video : Scraping latitude/longitude from HTML

The following video shows how geo-coordinates (latitude and longitude) can be scraped from the page source using WebHarvy.


The RegEx strings used in the above video are given below.

Latitude

data-lat="([^"]*)

Longitude

data-lng="([^"]*)
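
As a rough illustration of how the two expressions above behave, the Python sketch below applies them to a made-up map element. The sample markup, the coordinate values and the helper name extract_coordinates are assumptions for the example, not part of WebHarvy.

```python
import re
from typing import Optional, Tuple

# Patterns from the article: capture the attribute value up to the closing quote.
LAT_PATTERN = re.compile(r'data-lat="([^"]*)')
LNG_PATTERN = re.compile(r'data-lng="([^"]*)')


def extract_coordinates(html: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) as floats, or None if either is missing."""
    lat = LAT_PATTERN.search(html)
    lng = LNG_PATTERN.search(html)
    if lat is None or lng is None:
        return None
    try:
        return float(lat.group(1)), float(lng.group(1))
    except ValueError:
        # The attribute was present but did not contain a number.
        return None


# Hypothetical page source; real pages will differ.
sample_source = '<div id="map" data-lat="40.7128" data-lng="-74.0060"></div>'
print(extract_coordinates(sample_source))  # prints (40.7128, -74.006)
```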


Video : Scraping hidden product ID from HTML

The following video is another example, in which WebHarvy is configured to scrape product details along with the product ID, which is available only in the HTML code of the page.


The RegEx string used in the video to select the product ID from HTML is given below.

"productId":"([^"]*)

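The product ID expression above targets a value embedded in JSON inside the page source. The short Python sketch below shows what it captures; the sample JSON and the ID values are invented for illustration.

```python
import re

# Pattern from the article: capture the value that follows "productId":"
PRODUCT_ID_PATTERN = re.compile(r'"productId":"([^"]*)')

# Hypothetical page source with JSON embedded in a script tag.
sample_source = (
    '<script type="application/json">'
    '{"productId":"SKU-48213","name":"Desk lamp"}'
    "</script>"
)

# findall returns the captured group of every match, in document order.
product_ids = PRODUCT_ID_PATTERN.findall(sample_source)
print(product_ids)  # prints ['SKU-48213']
```

The pattern expects no whitespace between the colon and the opening quote. If a site formats its JSON as "productId": "...", inserting \s* after the colon would be one way to handle it.
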
Download and Try

Please follow the link given below to download and try the free evaluation version of WebHarvy.

Getting Started Guide