{"id":1026,"date":"2020-11-17T10:20:41","date_gmt":"2020-11-17T10:20:41","guid":{"rendered":"http:\/\/webharvy.com\/whblog\/?p=1026"},"modified":"2021-05-25T09:34:49","modified_gmt":"2021-05-25T09:34:49","slug":"how-to-build-a-simple-web-scraper-using-puppeteer","status":"publish","type":"post","link":"https:\/\/www.webharvy.com\/blog\/how-to-build-a-simple-web-scraper-using-puppeteer\/","title":{"rendered":"How to build a simple web scraper using Puppeteer?"},"content":{"rendered":"<h3>Table of Contents<\/h3>\n<ol>\n<li><a href=\"#what\">What is Puppeteer?<\/a><\/li>\n<li><a href=\"#uses\">Uses of Puppeteer<\/a><\/li>\n<li><a href=\"#install\">How to install?<\/a><\/li>\n<li><a href=\"#start\">How to start a browser instance?<\/a><\/li>\n<li><a href=\"#load\">How to load a URL?<\/a><\/li>\n<li><a href=\"#interact\">How to navigate\/interact with the page?<\/a><\/li>\n<li><a href=\"#screenshot\">How to take screenshots, save page as PDF?<\/a><\/li>\n<li><a href=\"#select\">How to select data from page?<\/a><\/li>\n<li><a href=\"#pas\">Headless browser as a service<\/a><\/li>\n<\/ol>\n<p><a name=\"what\"><\/a><\/p>\n<h2>What is Puppeteer?<\/h2>\n<p>Puppeteer (<a href=\"https:\/\/developers.google.com\/web\/tools\/puppeteer\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/developers.google.com\/web\/tools\/puppeteer<\/a>) is a Node library that gives developers a high-level API for controlling a headless Chrome or Chromium browser.<br \/>\n<a name=\"uses\"><\/a><\/p>\n<h2>Uses of Puppeteer<\/h2>\n<p>Developers use Puppeteer for browser automation. A script can create a headless Chrome browser instance, load web pages in it, interact with them, and save screenshots or PDFs of the loaded pages. 
The main uses of Puppeteer are web scraping, browser automation and automated testing.<br \/>\n<a name=\"install\"><\/a><\/p>\n<h2>How to install Puppeteer?<\/h2>\n<p>Puppeteer is a Node library, so it requires a <a href=\"https:\/\/nodejs.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Node.js<\/a> installation. Install it by running the following command.<\/p>\n<blockquote><p>$ npm install --save puppeteer<\/p><\/blockquote>\n<p><a name=\"start\"><\/a><\/p>\n<h2>How to start a browser instance?<\/h2>\n<p>The following code starts a headless (invisible, without a user interface) browser instance and opens a new page in it. Since the code uses await, run it inside an async function.<\/p>\n<blockquote><p>const puppeteer = require(&quot;puppeteer&quot;);<br \/>\nvar browser = await puppeteer.launch();<br \/>\nvar page = await browser.newPage();<\/p><\/blockquote>\n<p><a name=\"load\"><\/a><\/p>\n<h2>How to load a URL?<\/h2>\n<p>To load a URL in the browser instance created above, use the following code.<\/p>\n<blockquote><p>await page.goto(&quot;https:\/\/www.webharvy.com&quot;);<\/p><\/blockquote>\n<p><a name=\"select\"><\/a><\/p>\n<h2>How to select items (elements) from the page?<\/h2>\n<p>To select an item\/element from the page loaded in Puppeteer, you first need to find its CSS selector. You can use Chrome Developer Tools to find the CSS selector of any element on the page. To do this, load the page in Chrome, right click on the required element and select <b>Inspect<\/b>.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1030\" src=\"http:\/\/webharvy.com\/whblog\/wp-content\/uploads\/2020\/11\/devtools.png\" alt=\"\" width=\"790\" height=\"564\"><\/p>\n<p>In the Developer Tools window that opens, the HTML element corresponding to the element you clicked on the page will be selected. 
Right click on this element, open the <b>Copy<\/b> submenu in the menu that appears, and select the <b>Copy selector<\/b> option. The CSS selector of the element is now in the clipboard.<\/p>\n<p>Example:<\/p>\n<p>#description &gt; yt-formatted-string &gt; span:nth-child(1)<br \/>\n<a name=\"interact\"><\/a><\/p>\n<h2>How to interact with page elements?<\/h2>\n<p>This selector string can be used within Puppeteer to select\/interact with elements. For example, to click the above element, assuming it is a link, the following code can be used.<\/p>\n<blockquote><p>var selector = &quot;#description &gt; yt-formatted-string &gt; span:nth-child(1)&quot;;<br \/>\nawait page.click(selector);<\/p><\/blockquote>\n<p>In addition to click, Puppeteer provides several other page interaction functions, such as keyboard input and typing in input fields. Refer to: <a href=\"https:\/\/pptr.dev\/#?product=Puppeteer&amp;version=v2.0.0&amp;show=api-class-page\">https:\/\/pptr.dev\/#?product=Puppeteer&amp;version=v2.0.0&amp;show=api-class-page<\/a><\/p>\n<p>The following code shows how you can select and click a button using Puppeteer once the page is loaded.<\/p>\n<blockquote><p>var buttonSelector = &quot;#DownloadButton&quot;;<br \/>\nawait page.evaluate(sel =&gt; {<br \/>\nvar button = document.querySelector(sel);<br \/>\nbutton.click();<br \/>\n}, buttonSelector);<\/p><\/blockquote>\n<p><a name=\"select\"><\/a><\/p>\n<h2>How to get text of page elements?<\/h2>\n<p>As shown in the above code sample, JavaScript code can be run in the page from Puppeteer using the <a href=\"https:\/\/pptr.dev\/#?product=Puppeteer&amp;version=v5.4.1&amp;show=api-pageevaluatepagefunction-args\" target=\"_blank\" rel=\"noopener noreferrer\">page.evaluate<\/a> function for page interaction. 
The same function can be used to get the text of elements from the page.<\/p>\n<blockquote><p>var reviewSelector = &quot;review &gt; span.cm-title&quot;;<br \/>\nvar reviewText = await page.evaluate(sel =&gt; {<br \/>\nvar reviewText = document.querySelector(sel).innerText;<br \/>\nreturn reviewText;<br \/>\n}, reviewSelector);<\/p><\/blockquote>\n<p>As shown above, JavaScript code is executed on the page using the <a href=\"https:\/\/pptr.dev\/#?product=Puppeteer&amp;version=v5.4.1&amp;show=api-pageevaluatepagefunction-args\" target=\"_blank\" rel=\"noopener noreferrer\">page.evaluate<\/a> method to get the text. You may also use the <a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/API\/Document\/querySelectorAll\" target=\"_blank\" rel=\"noopener noreferrer\">document.querySelectorAll<\/a> DOM method to get data from multiple page elements.<br \/>\n<a name=\"screenshot\"><\/a><\/p>\n<h2>How to take screenshots of the page and save the page as PDF?<\/h2>\n<p>You can take a screenshot of the currently loaded page by using the following code.<\/p>\n<blockquote><p>await page.screenshot({path: '.\/screenshots\/page1.png'});<\/p><\/blockquote>\n<p>Or save the page as a PDF using the following code.<\/p>\n<blockquote><p>await page.pdf({path: '.\/screenshots\/page1.pdf'});<\/p><\/blockquote>\n<p><a name=\"pas\"><\/a><\/p>\n<h2>Headless browser as a service<\/h2>\n<p>Running Puppeteer is resource intensive. If you need to run several headless browser instances, the memory and processor requirements will be high, and scaling them won&#8217;t be easy. Services such as <a href=\"https:\/\/www.browserless.io\/\">https:\/\/www.browserless.io\/<\/a>, which offer headless browsers as a service, can help with this.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Table of Contents What is Puppeteer? Uses of Puppeteer How to install? How to start a browser instance? How to load a URL? How to navigate\/interact with the page? 
How to take screenshots, save page as PDF? How to select data from page? Headless browser as a service What is Puppeteer? Puppeteer (https:\/\/developers.google.com\/web\/tools\/puppeteer) is a &#8230; <a title=\"How to build a simple web scraper using Puppeteer?\" class=\"read-more\" href=\"https:\/\/www.webharvy.com\/blog\/how-to-build-a-simple-web-scraper-using-puppeteer\/\" aria-label=\"Read more about How to build a simple web scraper using Puppeteer?\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,7],"tags":[93],"class_list":["post-1026","post","type-post","status-publish","format-standard","hentry","category-howto","category-web-scraping-workshop","tag-puppeteer"],"_links":{"self":[{"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/posts\/1026","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/comments?post=1026"}],"version-history":[{"count":1,"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/posts\/1026\/revisions"}],"predecessor-version":[{"id":1119,"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/posts\/1026\/revisions\/1119"}],"wp:attachment":[{"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/media?parent=1026"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/categories?post=1026"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.webharvy.com\/blog\/wp-json\/wp\/v2\/tags?post=1026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templat
ed":true}]}}