How to scrape Google Jobs? | Scraping job details

WebHarvy can be used to scrape job details from job listing websites like Indeed, Google Jobs etc. It can automatically pull job details from multiple pages of listings and save them to a file or database.

The following video shows how WebHarvy can be configured to scrape data from Google Jobs listings. Details like job title, position, application URL, company name, description etc. can be easily extracted.

More jobs are loaded onto the same page when you scroll down the left-hand pane on the Google Jobs listings page. To automate this, the JavaScript method of pagination has to be used. The JavaScript code to be used for this is given below.

// find the last <ul> on the page (the jobs list) and scroll its last
// item into view so that more jobs are loaded
els = document.getElementsByTagName("ul");
el = els[els.length-1];
el.children[el.childElementCount-1].scrollIntoView();

Before selecting any data, the following JavaScript code needs to be run on the page to collate job listings grouped under various sections into a single group.

// move the job listings from every other <ul> group into the first <ul>
// so that all listings end up in a single group
groups = document.getElementsByTagName("ul");
parent = groups[0];
for (var i = groups.length - 1; i >= 1; i--) {
    var children = groups[i].children;
    for (var j = children.length - 1; j >= 0; j--) {
        parent.appendChild(children[j]);
    }
}

Video : How to scrape Google Jobs using WebHarvy?

Interested? We highly recommend that you download and try using the free evaluation version of WebHarvy available on our website by following the link given below.

Start web scraping using WebHarvy

In case you have any questions feel free to contact our technical support team.

How to scrape business contact details from Google Maps ?

WebHarvy is a visual web scraper which can be easily configured to scrape data from any website. In this article we will see how WebHarvy can easily extract business contact details from Google Maps.

WebHarvy can scrape contact details (name, address, website, phone etc.) as well as reviews of businesses displayed on Google Maps. The following video shows the configuration steps which you need to follow to scrape contact details of businesses listed in Google Maps.

The regular expression strings used in the above video to scrape phone number and website address are given below.

Phone: ([^"]*)

Website: ([^"]*)

Try WebHarvy

To know more we highly recommend that you download and try using the free evaluation version of WebHarvy. To get started please follow the link below.

Getting started with web scraping using WebHarvy

How to build a simple web scraper using Puppeteer?

Table of Contents

  1. What is Puppeteer?
  2. Uses of Puppeteer
  3. How to install?
  4. How to start a browser instance?
  5. How to load a URL?
  6. How to navigate/interact with the page?
  7. How to take screenshots, save page as PDF?
  8. How to select data from page?
  9. Headless browser as a service

What is Puppeteer?

Puppeteer (https://developers.google.com/web/tools/puppeteer) is a Node library which provides a high-level API for developers to control a headless Chrome browser.

Uses of Puppeteer

Puppeteer can be used by developers for browser automation. Developers can create a headless Chrome browser instance with which web pages can be loaded and interacted with, and screenshots or PDFs of the loaded pages can be taken. Some of the main uses of Puppeteer are web scraping, browser automation and automated testing.

How to install Puppeteer?

Since Puppeteer is a Node library (requires Node.js installation), it can be installed by running the following command.

$ npm install --save puppeteer

How to start a browser instance?

The following code will start a headless (without user interface, invisible) browser instance.

// note: the 'await' calls below must be run inside an async function
const puppeteer = require("puppeteer");
const browser = await puppeteer.launch();
const page = await browser.newPage();

How to load a URL?

To load a URL in the above created browser instance, use the following code.

await page.goto("https://www.webharvy.com");

How to select items (elements) from the page?

To select an item/element from the page loaded in Puppeteer, you will first need to find its CSS selector. You can use Chrome Developer Tools to find the CSS selector of any element on a page. For this, after loading the page within Chrome, right-click on the required element and select Inspect.

In the resulting Developer Tools window, the HTML element corresponding to the element you clicked on the page will be selected. Right-click on this element and, in the menu displayed, open the Copy submenu and select the Copy selector option. You now have the CSS selector of the element in the clipboard.

Example:

#description > yt-formatted-string > span:nth-child(1)

How to interact with page elements?

This selector string can be used within Puppeteer to select/interact with elements. For example, to click the above element, assuming it is a link, the following code can be used.

var selector = "#description > yt-formatted-string > span:nth-child(1)";
await page.click(selector);

In addition to click, Puppeteer provides several other page interaction features like keyboard input, typing in input fields etc. Refer : https://pptr.dev/#?product=Puppeteer&version=v2.0.0&show=api-class-page
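
For example, typing into a search box and submitting it can be done as shown below. This is only a sketch; the '#searchInput' selector is a placeholder and not taken from any particular site.

await page.type("#searchInput", "web scraping");  // type text into an input field
await page.keyboard.press("Enter");               // simulate a key press
await page.waitForNavigation();                   // wait for the resulting page to load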

The following code shows how you can select and click a button using Puppeteer once the page is loaded.

var buttonSelector = "#DownloadButton";
await page.evaluate(sel => {
    var button = document.querySelector(sel);
    button.click();
}, buttonSelector);

How to get text of page elements?

As shown in the above code samples, we are running JavaScript code within Puppeteer using the page.evaluate function for page interaction. The same can be used to get the text of elements from the page.

var reviewSelector = "review > span.cm-title";
var reviewText = await page.evaluate(sel => {
    return document.querySelector(sel).innerText;
}, reviewSelector);

As shown above, JavaScript code is executed on the page using the page.evaluate method to get text. You may also use the document.querySelectorAll JavaScript HTML DOM method to get data from multiple page elements.
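
A sketch of that approach is given below; it reuses the review selector from the previous example and returns the text of every matching element as an array.

var reviewTexts = await page.evaluate(sel => {
    // collect the text of all elements matching the selector
    return Array.from(document.querySelectorAll(sel)).map(el => el.innerText);
}, "review > span.cm-title");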

How to take screenshots of page and save page as PDF?

You can take a screenshot of the currently loaded page by using the following code.

await page.screenshot({path: "./screenshots/page1.png"});

Or save the page as a PDF using the following code.

await page.pdf({path: "./screenshots/page1.pdf"});
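
Putting the snippets above together, a minimal complete script could look like the sketch below. It assumes Puppeteer is installed and that the ./screenshots folder already exists.

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch();            // start a headless browser
    const page = await browser.newPage();
    await page.goto("https://www.webharvy.com");         // load the page
    await page.screenshot({path: "./screenshots/page1.png"});
    await browser.close();                               // always close the browser when done
})();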

Headless browser as a service

Running Puppeteer is a resource-intensive process. If you need to run several headless browser instances, the memory and processor requirements will be high and scaling them won't be easy. To address this, services like https://www.browserless.io/, which offer headless browsers as a service, can be used.
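
With such a service, instead of launching a local browser with puppeteer.launch(), you connect to a remote one. A sketch is given below; the WebSocket endpoint is a placeholder and should be replaced with the URL provided by the service.

const puppeteer = require("puppeteer");
// connect to a remote headless browser instead of launching one locally
const browser = await puppeteer.connect({
    browserWSEndpoint: "wss://chrome.browserless.io?token=YOUR_API_TOKEN"  // placeholder endpoint
});
const page = await browser.newPage();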

Announcing an upcoming product : GrabContacts

We are happy to share news regarding our upcoming product launch, which we have been working on for the past year. GrabContacts is an online service which helps you easily extract contact details (email addresses, phone numbers, social media handles) from websites (URLs) or search queries.

Unlike WebHarvy, there is no configuration involved; you just need to specify a website address or a list of website addresses and GrabContacts will do the rest. We have also added a unique feature which lets you specify a search query (example: Accountants in New York, Doctors in Chicago), and GrabContacts will automatically scan and fetch email addresses, phone numbers and social media handles.

In case you are interested, please sign up for an early-stage preview of GrabContacts at the following link.

Signup to our waiting list to get early access

AliExpress Scraper – Scraping product data including images from AliExpress

WebHarvy is a visual web scraper which can be easily used to scrape data from any website including eCommerce websites like Amazon, eBay, AliExpress etc.

Scraping AliExpress

The following video shows how WebHarvy can be configured to scrape data from AliExpress product listings. Details of the products like product name, price, minimum orders, shipping details, seller details, product description, images, etc. can be scraped as shown in the video.

To scrape multiple images, the following Regular Expression string is used.

src="([^_]*)
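
As a rough illustration of why the capture group stops at the first underscore: AliExpress thumbnail URLs typically append a size suffix after an underscore, so capturing everything before it yields the full-size image URL. The sample URL in the JavaScript sketch below is made up.

var tag = '<img src="https://ae01.alicdn.com/kf/abc123.jpg_640x640.jpg">';  // hypothetical image tag
var fullImage = tag.match(/src="([^_]*)/)[1];  // "https://ae01.alicdn.com/kf/abc123.jpg"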

The 'Advanced Miner Options' values (in WebHarvy Miner Settings) used for scraping AliExpress are shown in the video.

Watch More WebHarvy Demonstration Videos related to AliExpress Scraping

Try WebHarvy

We recommend that you download and try using the free trial version of WebHarvy available on our website. To get started, please follow the link below.

Getting started with web scraping using WebHarvy

In case you have any questions or need assistance, please contact our technical support team.

How to use User Agent strings to prevent blocking while web scraping ?

What is a user agent string ?

The User-Agent string of a web browser helps servers (websites) identify the browser (Chrome, Edge, Firefox, IE etc.), its version and the operating system (Windows, Mac, Android, iOS etc.) on which it is running. This mainly helps websites serve different pages for various platforms and browser types.

If you go to https://www.whatismybrowser.com/detect/what-is-my-user-agent you can see the user agent string of your browser.

User Agent strings for web scraping

The same detail can be used by websites to block non-standard web browsers and bots. To prevent this, we can configure web scrapers to mimic a standard browser's user agent.
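
If you are writing your own scraper with a library like Puppeteer (covered in an earlier post), this takes a single call, as in the sketch below. The user agent string shown is just an example Chrome string; use the current string of the browser you want to mimic.

// make the headless browser identify itself as desktop Chrome (example string only)
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36");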

WebHarvy, our generic visual web scraper, allows you to set any user agent string for its mining browser, so that websites assume the web scraper to be a normal browser and will not block access. To configure this, open WebHarvy Settings and go to the Browser settings tab. Here, you should enable the custom user agent string option and paste the user agent string of a standard browser like Chrome or Edge.

This option can be used to make WebHarvy's browser appear like any specific standard web browser (ex: Microsoft Edge, Mozilla Firefox, Google Chrome or Apple Safari) to websites from which you are trying to extract data.

How to get user agent string of various browsers ?

You may find user agent strings of various browsers at http://useragentstring.com/pages/useragentstring.php

Scraping images from Instagram using WebHarvy

WebHarvy can be used to scrape text as well as images from websites. In this article we will see how WebHarvy can be used to scrape data from Instagram.

How to automatically download images from Instagram searches ?

The following video shows how WebHarvy can be configured to scrape (download) images by searching Instagram for a tag (example: #newyork). As shown, a few additional techniques are used to open the first image and to automatically load subsequent images. The JavaScript code for these steps can be found in the description below the video.

In addition to downloading images, WebHarvy can also scrape textual data from Instagram like post content, followers of a profile etc.

Try WebHarvy

If you are interested, we highly recommend that you download and try using the free evaluation version of WebHarvy available on our website. To get started, please follow the link below.

Getting started with data extraction using WebHarvy

Scraping Twitter using WebHarvy

WebHarvy can be used to scrape data from social media websites like Twitter, LinkedIn, Facebook etc. In the following video you can see how easy it is to scrape tweets from Twitter searches using WebHarvy. A similar technique can be used to scrape tweets from a Twitter profile page.

In this video, pagination via JavaScript code is used to scrape multiple pages of Twitter search results.

The JavaScript code used in the above video is copied below.

// find the container which holds the tweets and scroll its last child
// into view so that more tweets are loaded
groupEl = document.getElementsByTagName('article')[0].parentElement.parentElement.parentElement.parentElement;
groupEl.children[groupEl.childElementCount-1].scrollIntoView();

Normally, pages which load more data as we scroll down can be configured by following the method explained at https://www.webharvy.com/tour3.html#ScrollToLoad. But in the case of Twitter, the page also deletes tweets from the top as we scroll down. Hence, JavaScript has to be used for pagination.

Try WebHarvy

In case you are interested, we recommend that you download and try using the free evaluation version of WebHarvy available on our website. To get started, please follow the link given below.

https://www.webharvy.com/articles/getting-started.html

WebHarvy 6.1 – Internal Proxies, Database/File Update, New Capture window options

The following are the main changes in this version.

Option to leave a blank row when data is unavailable for a keyword/category/URL

In WebHarvy’s Keyword/Category settings page, a new option has been added to leave a row containing only the corresponding keyword/category/URL when data is unavailable for that item. This option is available only when the ‘Tag with Category/URL/Keyword’ option is enabled.

For mining data using a list of keywords, categories or URLs, enabling this option helps in identifying the items for which WebHarvy failed to fetch data, as shown below.

Proxies are used internally by WebHarvy, not system-wide

In earlier versions, proxies set in WebHarvy Settings were applied system wide during mining. This caused side effects for other applications, especially in cases where proxies required login with a user name and password and when a list of proxies was cycled. Starting from this version, WebHarvy will use proxies internally so that other applications are not affected during mining. You can still apply proxies directly in Windows settings (system wide) and WebHarvy will use them automatically.

Also, the configuration browser will start using the proxies which are set in WebHarvy settings. In earlier versions, proxies were used only during mining.

Database, Excel File Export : Update option (Upsert)

While saving/exporting mined data to a database or Excel file which already contains data (from a previous mining session), WebHarvy now allows you to update those rows which have the same first column value as rows in the newly mined data, without creating duplicate rows.

For file export this option is currently available only for Excel files.

New Capture window options : Page reload and Go back

Two new Capture window options for page interaction have been added: Reload and Go back. The Reload option is helpful in cases where a page does not load correctly the first time a link is followed. The ‘Go back’ option navigates the browser back to the previously loaded page.

Keywords can be added even after starting configuration

Just like URLs, Keywords can also be added after starting configuration. This method is useful in cases where the normal method of Keyword Scraping cannot be applied. The only condition for adding keywords in this method is that the first keyword entered should be present in the Start URL or Post Data of the configuration.

Other minor changes

  1. During configuration, in pages reached by following links from the starting page, links (URLs) selected by applying Regular Expressions on HTML can be followed using the ‘Follow this link’ option. Earlier, only the Click option was available for this scenario.
  2. Automatically handles encoded URLs selected from HTML. Example: URLs including ‘&’. This works for following links as well as for image URLs.
  3. ‘Enable JavaScript’, ‘Share Location’ and ‘Enable plugins’ options removed from Browser settings.
  4. Fixed a bug related to scraping a list of URLs when one of the URLs fails to load.
  5. While scraping a list of URLs, URLs which do not start with the HTTP scheme (http:// or https://) are now handled.

Download the latest version

The latest version of WebHarvy is available here. If you are new to WebHarvy, we recommend that you view our ‘Getting started’ guide.

Scraping Owner Phone Numbers from Zillow FSBO listings

This post explains how WebHarvy can be easily configured to scrape owner phone numbers from Zillow’s FSBO (For Sale By Owner) listings.

WebHarvy is a generic visual web scraper which can be used to scrape data from any website.

Scraping owner phone numbers

Property listings in Zillow display owner phone numbers at various locations within the property details page.

  • If you click the Contact Agent button, you can see the owner phone number in the popup window that is displayed (at the end of the list).

  • You can see the owner phone number listed within the owner/agent list under the ‘Get More Information’ heading.

  • You can also view the owner name and phone number under the ‘Listing provided by owner’ heading.

WebHarvy allows you to capture text displayed after a heading text using the ‘Capture following text’ feature explained in the link given below.

How to scrape text displayed after a heading text ? 

The video displayed below shows the steps which you need to follow to configure WebHarvy to scrape owner phone numbers from Zillow FSBO listings by clicking the ‘Contact Agent’ button.

The video below shows how owner phone numbers can be selected from the other two locations on the listing details page.

Scraping multiple properties listed over multiple pages is configured as explained here and each property link is opened using the ‘Follow this link’ feature.

Try WebHarvy

If you are new to WebHarvy, we highly recommend that you download and try using the free evaluation version available on our website. Please follow the link below to get started.

Getting started with WebHarvy

Need Support ?

Feel free to contact our support if you have any questions.