Regular Expressions For Web Scraping

Regular Expressions are useful in web scraping for selecting the required portions of text or HTML from larger content. Their pattern-matching capabilities let you extract data directly from the full-page text or HTML source without needing to parse the HTML code or DOM structure.

Web Scraping mainly consists of two phases: (1) loading the page in a browser or getting its HTML or text content, and (2) extracting the required data from the loaded content. Both phases have their own challenges. Parsing the HTML or DOM structure of the loaded page content is a complicated process, whereas using Regular Expressions to extract the required data is often simpler and more efficient.

1. Using regular expressions with WebHarvy
2. How to select string following another string ?
3. How to select string following another string till a specific character ?
4. How to select string between 2 other strings ?
5. How to select URLs and email addresses from HTML ?
6. Commonly used regular expressions for web scraping

For a detailed Regular Expression tutorial we highly recommend https://www.regular-expressions.info.

Using Regular Expressions with WebHarvy

WebHarvy allows you to apply Regular Expressions on the selected text or HTML before scraping it. To do so, click the 'More Options' button and then select the Apply Regular Expression option.

How to select string following another string ?

Suppose you want to extract the price in dollars from the text below.

Product Details
Price: 99$
This product comes with absolutely no warranty . .

The RegEx string to be used is:

Price: (.*)

Here the text following the heading Price: till the end of the line is selected. The extracted portion is the part matched within the parentheses (.*). Dot (.) denotes any character except a line break and * denotes zero or more repetitions, which is why the match stops at the end of the line.
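
The same pattern can be tried outside WebHarvy. The sketch below is only an illustration using Python's re module (an assumption made for this example, as are the variable names) on the sample text above.

```python
import re

# Sample text copied from the example above.
text = "Product Details\nPrice: 99$\nThis product comes with absolutely no warranty . .\n"

# '.' does not match a line break by default, so (.*) stops at the end of the line.
match = re.search(r"Price: (.*)", text)
price = match.group(1) if match else None
print(price)  # 99$
```

Group 1 holds only the part inside the parentheses, so the heading Price: itself is not included in the result.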

How to select string following another string till a specific character ?

In the same example, if you need to extract the price without the dollar sign, the RegEx string to be used is:

Price: ([\d]*)

Here the captured portion is the string which follows the heading Price: and contains only digits. \d matches a single digit and [\d]* matches a run of digits, so the match ends at the first character which is not a digit. An alternative RegEx for the same purpose is:

Price: ([^$]*)

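Here the captured portion is the run of characters following the heading Price: up to, but not including, the first dollar sign. [^$] matches any character other than $. Inside square brackets $ is an ordinary character, so it does not need to be escaped with a \ as it would elsewhere in a RegEx.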
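
As a quick check, both patterns can be run against the sample text. This is a minimal sketch using Python's re module, chosen here only for illustration; the variable names are hypothetical.

```python
import re

# Sample text copied from the example above.
text = "Product Details\nPrice: 99$\nThis product comes with absolutely no warranty . .\n"

# [\d]* captures the run of digits that follows the heading.
digits = re.search(r"Price: ([\d]*)", text).group(1)

# [^$]* captures everything up to, but not including, the first dollar sign.
up_to_dollar = re.search(r"Price: ([^$]*)", text).group(1)

print(digits)        # 99
print(up_to_dollar)  # 99
```

The two patterns differ when the text has no dollar sign: [^$] also matches line breaks, so the second pattern would then run on past the end of the line, while the first still stops after the digits.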

How to select string between 2 other strings ?

Suppose you need to extract the string embedded between the tags <address> and </address> in the HTML code given below.

<address>
  356, Street Name, City, Country
</address>

The RegEx string to be used is :

<address>([\s\S]*?)</address>

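The portion ([\s\S]*?) matches all characters between <address> and </address>, including line breaks. The ? after * makes the match stop at the first </address> instead of the last one.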
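
A minimal sketch of this extraction, assuming Python's re module is used to try the pattern (variable names are hypothetical). The captured text includes the line breaks and indentation around the address, so the sketch trims them in a separate step.

```python
import re

# HTML copied from the example above.
html = "<address>\n  356, Street Name, City, Country\n</address>"

# [\s\S] matches any character including line breaks; *? keeps the match as short as possible.
raw = re.search(r"<address>([\s\S]*?)</address>", html).group(1)

# The capture still contains the surrounding white-space, so trim it.
address = raw.strip()
print(address)  # 356, Street Name, City, Country
```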

How to select URLs and email addresses from HTML ?

You can use the Capture HTML option to get the HTML of the selected content in the Capture window.

Suppose you need to extract the URL/website address from the following HTML.

<div class="call-to-action ">
<a title="Website (opens in a new window)" 
class="contact contact-main contact-url " href="http://www.canberraeyelaser.com.au" target="_blank" rel="nofollow">
<span class="glyph icon-website border border-dark-blue with-text"></span><span class="contact-text">Website</span>
</a>
</div>

Use the following RegEx string:

href="([^"]*)

href=" denotes the heading text before the URL and ([^"]*) matches all characters till " in the HTML code.
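
The pattern can be checked against the anchor tag from the HTML above. This is a sketch using Python's re module, which is an assumption made for illustration; only the opening tag is reproduced and the variable names are hypothetical.

```python
import re

# Opening <a> tag copied from the example above.
html = (
    '<a title="Website (opens in a new window)" \n'
    'class="contact contact-main contact-url " '
    'href="http://www.canberraeyelaser.com.au" target="_blank" rel="nofollow">'
)

# [^"]* runs up to, but not including, the closing double quote of the attribute.
url = re.search(r'href="([^"]*)', html).group(1)
print(url)  # http://www.canberraeyelaser.com.au

# findall returns the captured group for every href in the content.
all_urls = re.findall(r'href="([^"]*)', html)
```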

Suppose you need to extract the email address from the following HTML.

<div class="call-to-action ">
<a title="Email" class="contact contact-main contact-email " 
href="mailto:info@canberraeyelaser.com.au?subject=Enquiry%2C%20sent%20from%20yellowpages.com.au&body=%0A%0A%0A%0A%0A------------------------------------------%0AEnquiry%20via%20yellowpages.com.au%0Ahttp%3A%2F%2Fyellowpages.com.au%2Fact%2Fphillip%2Fcanberra-eye-laser-15333167-listing.html%3Fcontext%3DbusinessTypeSearch" 
rel="nofollow" data-email="info@canberraeyelaser.com.au">
<span class="glyph icon-email border border-dark-blue with-text"></span><span class="contact-text">Email</span>
</a>
</div>

Use the following RegEx string :

mailto:([^?]*)

mailto: denotes the heading text before the email address and ([^?]*) matches all characters till ? .

The following RegEx string can also be used to extract email address (second occurrence in HTML) :

data-email="([^"]*)

data-email=" denotes the heading text before the email address and ([^"]*) matches all characters till ".
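
Both patterns can be compared on the same tag. The sketch below uses Python's re module for illustration only; the href value is shortened to its first query parameter to keep the example readable, and the variable names are hypothetical.

```python
import re

# Opening <a> tag from the example above, with the long query string shortened.
html = (
    '<a title="Email" class="contact contact-main contact-email " \n'
    'href="mailto:info@canberraeyelaser.com.au?subject=Enquiry" \n'
    'rel="nofollow" data-email="info@canberraeyelaser.com.au">'
)

# Stop at the ? which starts the query string of the mailto: link.
from_mailto = re.search(r'mailto:([^?]*)', html).group(1)

# Stop at the closing double quote of the data-email attribute.
from_data_attr = re.search(r'data-email="([^"]*)', html).group(1)

print(from_mailto)     # info@canberraeyelaser.com.au
print(from_data_attr)  # info@canberraeyelaser.com.au
```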

Commonly used regular expressions for web scraping

Select only the first line from a block of text or HTML

(.*)

Select the first line, ignoring the leading white-space (spaces, line feeds and carriage returns). [\s]* matches all white-space till the first visible character.

[\s]*(.*)
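
The difference between the two first-line patterns shows up when the content starts with blank lines. A minimal sketch, assuming Python's re module and a made-up block of text:

```python
import re

# Hypothetical content that starts with blank lines and indentation.
text = "\n\n   First line\nSecond line"

# (.*) alone matches at the very start, where the first "line" is empty.
plain = re.search(r"(.*)", text).group(1)

# [\s]* first skips the leading white-space, then (.*) captures up to the line break.
trimmed = re.search(r"[\s]*(.*)", text).group(1)

print(repr(plain))    # ''
print(repr(trimmed))  # 'First line'
```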

Get the href link/URL from HTML. [^"]* matches till the next " character.

href="([^"]*)

Get src link/URL from HTML

src="([^"]*)

Above RegEx can be modified according to requirement as shown below.

zoom-image="([^"]*)
data-large-image="([^"]*)

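Get email address from HTML. [^"]* matches till the closing " of the link, so any ?subject=... part of the mailto: link is included; use [^?]* instead to stop before it.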

mailto:([^"]*)
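
A short sketch of what this pattern captures when the link carries a query string. Python's re module and the shortened sample link are assumptions made for illustration.

```python
import re

# Hypothetical, shortened mailto: link with a query string.
html = '<a href="mailto:info@canberraeyelaser.com.au?subject=Enquiry" rel="nofollow">Email</a>'

# [^"]* runs to the closing quote, so the query string is captured too.
with_query = re.search(r'mailto:([^"]*)', html).group(1)

# Keep only the part before the first ?.
address_only = with_query.split("?", 1)[0]

print(with_query)    # info@canberraeyelaser.com.au?subject=Enquiry
print(address_only)  # info@canberraeyelaser.com.au
```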

Get the string between 'Starting Text' and 'Ending Text'. [\s\S]*? matches everything in between (white-space and non-white-space, so all characters including line breaks). The trailing ? makes it stop at the first occurrence of 'Ending Text'.

Starting Text([\s\S]*?)Ending Text
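
When the starting and ending strings vary, the pattern can be built from them. The helper below is a hypothetical sketch in Python (the name between is not part of any tool); it uses re.escape so that special characters in the two strings are matched literally.

```python
import re

def between(text, start, end):
    """Return the text found between start and end, or None if there is no match."""
    # re.escape makes characters such as $ or . in the delimiters literal.
    pattern = re.escape(start) + r"([\s\S]*?)" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None

sample = "Starting Text hello\nworld Ending Text and more Ending Text"
print(repr(between(sample, "Starting Text", "Ending Text")))  # ' hello\nworld '
```

Because the quantifier is lazy, the result ends at the first 'Ending Text' even though the sample contains a second one.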

Get the text between itemprop="name"> and <div class="line">. [^<]* matches all characters till the next <, so this works only when no other tag appears in between.

itemprop="name">([^<]*)<div class="line">

Same effect as above, except that [\s\S]*? also matches when other tags appear in between.

itemprop="name">([\s\S]*?)<div class="line">
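
The two itemprop patterns can be compared on small samples. The sketch below assumes Python's re module and uses made-up HTML fragments; the product name is hypothetical.

```python
import re

NO_TAGS = r'itemprop="name">([^<]*)<div class="line">'
ANYTHING = r'itemprop="name">([\s\S]*?)<div class="line">'

# Hypothetical HTML where the text is followed directly by the div.
direct = '<h1 itemprop="name">Acme Lamp<div class="line"></div></h1>'

# Hypothetical HTML where a closing tag sits between the text and the div.
nested = '<span itemprop="name">Acme Lamp</span><div class="line"></div>'

print(re.search(NO_TAGS, direct).group(1))   # Acme Lamp
print(re.search(ANYTHING, direct).group(1))  # Acme Lamp
print(re.search(NO_TAGS, nested))            # None
print(re.search(ANYTHING, nested).group(1))  # Acme Lamp</span>
```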

Conditional regular expression. Captures the MAP price if it is available; otherwise captures the List Price. RegEx special characters such as $, . and ^ must be escaped with \ (for example \$ and \.) when they are to be matched literally.

(?=[^M]*MAP)[^M]*MAP: \$(.*)|List Price: \$(.*)
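
The sketch below tries this pattern, written with both capture groups closed, in Python's re module. The sample prices and the helper name extract_price are made up for illustration. The pattern has two capture groups, one per alternative, so the helper returns whichever group took part in the match.

```python
import re

PRICE_RE = re.compile(r"(?=[^M]*MAP)[^M]*MAP: \$(.*)|List Price: \$(.*)")

def extract_price(text):
    """Return the MAP price if present, else the List Price, else None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Group 1 belongs to the MAP alternative, group 2 to the List Price alternative.
    return match.group(1) if match.group(1) is not None else match.group(2)

print(extract_price("List Price: $120.00\nMAP: $99.00"))  # 99.00
print(extract_price("List Price: $120.00\nIn stock"))     # 120.00
```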

Get first image URL

<img src="([^"]*)

Get second image URL. src value of second img tag in HTML.

<img src=[\s\S]*?<img src="([^"]*)
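
A minimal sketch of the first-image and second-image patterns, assuming Python's re module. The HTML and file names are hypothetical. Both patterns expect src to be the first attribute after <img.

```python
import re

# Hypothetical HTML with two images.
html = '<div><img src="first.jpg" alt="a"><img src="second.jpg" alt="b"></div>'

first = re.search(r'<img src="([^"]*)', html).group(1)

# [\s\S]*? skips from the first <img src= to the next one before capturing.
second = re.search(r'<img src=[\s\S]*?<img src="([^"]*)', html).group(1)

print(first)   # first.jpg
print(second)  # second.jpg
```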

Matches and gives value 'In Stock', only if the selected HTML or TEXT has the text 'In Stock'. This can be used to check if the selected HTML or TEXT contains a specific string.

(In Stock)
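
This check can be sketched as a small function. Python's re module, the helper name stock_status and the sample strings are assumptions made for illustration. The match is case-sensitive.

```python
import re

def stock_status(text):
    """Return 'In Stock' if the text contains it, otherwise None."""
    match = re.search(r"(In Stock)", text)
    return match.group(1) if match else None

print(stock_status("Availability: In Stock"))  # In Stock
print(stock_status("Availability: Sold out"))  # None
```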

Matches the string which comes between 2 HTML tags where the starting tag contains the text 'merch_name'. [^>]*> matches till the next >, which closes the starting tag. [^<]* matches till the next <, which opens the following tag.

merch_name[^>]*>([^<]*)
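
A short sketch of this pattern on a made-up tag, assuming Python's re module. The class names and product text are hypothetical.

```python
import re

# Hypothetical HTML where the starting tag mentions merch_name.
html = '<span class="merch_name title">Blue Widget</span>'

# [^>]*> skips the rest of the starting tag; ([^<]*) captures the text up to the next tag.
name = re.search(r"merch_name[^>]*>([^<]*)", html).group(1)
print(name)  # Blue Widget
```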