Scraping high resolution images from pinterest.com

In this blog post, we look at how to scrape images from www.pinterest.com at their full size. We follow a two-stage extraction process to capture the high-res images.
In the first extraction stage, we capture the image URLs which are present in the listings page. These URLs point to smaller images (236 pixels). Then, using any text editor, we replace /236x/ with /564x/ in all the URLs.
For example, the URL: https://s-media-cache-ak0.pinimg.com/236x/99/….
is modified to: https://s-media-cache-ak0.pinimg.com/564x/99/….
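If you prefer a script to a text editor, the same find-and-replace can be sketched in a few lines of Python. This is only an illustration and not part of WebHarvy: the function names and the sample URLs (including the file names after the size segment) are made up for the example.

```python
def to_full_size(url, small="/236x/", large="/564x/"):
    """Swap the first size segment in the URL; other URLs pass through unchanged."""
    return url.replace(small, large, 1)


def rewrite_urls(urls):
    """Trim whitespace, drop empty lines and rewrite each remaining URL."""
    return [to_full_size(u.strip()) for u in urls if u.strip()]


# Hypothetical input, e.g. the URL column exported from the first stage.
sample = [
    "https://s-media-cache-ak0.pinimg.com/236x/99/example.jpg",
    "",
    "https://example.com/not-a-pin.jpg",
]

for url in rewrite_urls(sample):
    print(url)
```

The rewritten list can then be pasted into WebHarvy for the second stage.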
In the second extraction stage, we use the ‘Add URLs’ method to add the modified URLs and scrape the full-sized images (564 pixels) from each of these URLs using a single WebHarvy configuration.
This method is shown in the following video:
[youtube https://www.youtube.com/watch?v=i6xDFIXJHDU]

Links:-

  1. Know more about WebHarvy, the easy to use visual web scraper
  2. WebHarvy video tutorial series
  3. Various methods of extracting images and image URLs using WebHarvy
  4. Download free WebHarvy trial

Have any questions?

Contact us

WebHarvy 4.0.3.129 (Installer Update Only)

This update addresses problems in installing .NET 4.5 on Windows 7 (and earlier Windows versions where .NET 4.5 is not present) during the installation process. Only the installer has been updated in this release; the WebHarvy application files are unchanged from the previous version. So if you are already running 4.0.3.128, you can ignore this version.
You may download and try the latest version from https://www.webharvy.com/download.html. Let us know in case you have any questions.

Windows Smartscreen warning while installing WebHarvy

All WebHarvy application files and the installation package are digitally signed (Comodo RSA Code Signing CA) and secured. However, if you get the following SmartScreen warning while trying to install the latest version of WebHarvy, please click the ‘More info’ link and then click the ‘Run anyway’ button to proceed with the installation.
[Screenshot: smartscreen1.png]
[Screenshot: smartscreen2.png]
The above popup message is displayed for two reasons. First, we recently changed our .NET dependency from 3.5 to 4.5, which considerably reduces the installation package size. Second, and more importantly, the code signing agency of our digital certificate has changed from GlobalSign to Comodo. So the above warning may appear until the new WebHarvy installer gains enough reputation with Microsoft, which will take a few weeks. In case you have any questions or require assistance, please do not hesitate to contact our support.

WebHarvy 4.0.3.128 (Minor Update)

From this release onwards, WebHarvy targets (depends on) .NET 4.5, which comes pre-installed on the latest Windows editions. This results in a smoother installation process, doing away with the .NET 3.5 download and install which was previously required. Targeting .NET 4.5 also helps WebHarvy improve performance and resource usage, and solves issues related to crashes while trying to extract data from certain websites.
The changes in this release are :-

  1. Depends on .NET 4.5
  2. More support for pages where next page link is implemented in JavaScript
  3. Handles pagination where next page link (next link or ‘show more data’ link) contains a number which varies from page to page
  4. Minor bug fixes related to running JavaScript code on page, opening popup and following links by using regular expressions.

As always you may download and try the latest version from https://www.webharvy.com/download.html. Let us know in case you have any questions.

WebHarvy 4.0.2.125 – Multi-level Category / Multi-list Keyword scraping

In this release we have introduced support for scraping multi-level categories (a tree of main categories and subcategories) and for multiple input keyword lists. The main features are:-

True multi-level Category Scraping

WebHarvy now supports automatically navigating category/subcategory lists of a website to extract data from the final listing pages. Know More
[vimeo 171059540 w=640 h=480]
 

Support for multiple input keywords

Any number of input text fields can be populated with lists of strings/keywords during configuration. WebHarvy will automatically apply all combinations of provided keywords during the mining phase. Know More.
[vimeo 171062404 w=640 h=480]
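To make "all combinations" concrete, here is a small Python sketch of how two keyword lists expand into search inputs. It is an illustration only: the keyword lists are invented, and the order in which WebHarvy itself applies the combinations is not something this post specifies.

```python
from itertools import product


def keyword_combinations(*keyword_lists):
    """Return every combination that takes one keyword from each list."""
    return list(product(*keyword_lists))


# Hypothetical keyword lists, one per input text field.
jobs = ["plumber", "electrician"]
cities = ["Boston", "Denver"]

for job, city in keyword_combinations(jobs, cities):
    print(job, "/", city)
```

Two lists of two keywords each give four searches; adding a third field with three keywords would give twelve, so long lists multiply quickly.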
 

Capture window with new options

Run JavaScript on Page

Run specified JavaScript code on the page. Know more. This option can be used to load elements on a page when this cannot be done using the default navigation options (link-follow, click) provided by WebHarvy.

Input strings to text input fields

Strings to be input to text fields can now be made a part of the configuration. Know More. Earlier, such parameters were automatically taken from the PostData of the configuration. But with some websites the PostData does not contain the submitted input strings, and this option helps to correctly load the page displaying data during the mining phase.

Extract data from Popups

Helps to extract data by clicking each listing link/button and getting data from a popup window, or from a view in the same page which is populated with data. Know More. This is different from the ‘Follow this link’ option because here the data is loaded on the same page (no page navigation), and different from the ‘Click’ option because after clicking each link, data has to be extracted from the page before clicking the next link.

Option to smoothly scroll page during mining to load all contents (lazy loading)

Smoothly scrolls to the end of the page so that elements which load only when they become visible (for example, lazy-loaded images) are loaded. Know More.

Select drop-down/list-box options

Select drop-down/list-box/combo-box options during configuration and mining. Again, this option allows navigation to result pages when the normal configuration is unable to make these selections and load the result page. Know More.

Other Minor Additions Include :-

  1. Improvements in automatic scraping of multiple product images
  2. Support for loading keyword lists directly from file
  3. ‘Capture Image’ option automatically enabled via HTML/RegEx method in applicable cases.
  4. Name downloaded image files by value obtained from a column/cell in miner data table. More.
  5. Allows applying ‘Capture More Content’ after selecting ‘Capture HTML’.
  6. Quick access to items under ‘More Options’ in Capture window via toolbar buttons.
  7. Minor bug fixes.

You may download and try the latest version from https://www.webharvy.com/download.html.

WebHarvy crashes after installing the latest Windows update for Adobe Flash

Microsoft released a new security update for Adobe Flash Player for Internet Explorer (IE) a few days ago (Dec 29, 2015). This update has caused many software titles (including Skype – see Skype Crash) to crash. See http://borncity.com/win/2015/12/30/windows-10-flash-update-kb3132372-issues/ for a list of other software titles affected by this update.
InfoWorld Article : Win10 Flash patch KB 3132372 breaks Skype, HP Solutions Center, Incredimail, games
KB3132372
https://support.microsoft.com/en-us/kb/3132372

Solution ?

The solution to this problem is to uninstall the security update – KB3132372. See How to remove updates.
Meanwhile, we will check whether we can update WebHarvy to overcome this issue. We are also hoping that Microsoft will release another security update which solves this problem, since many software titles, including their own Skype, seem to be affected.
Update! (Jan 5, 2016)
Microsoft has released another update to fix the issues created by KB3132372. See https://support.microsoft.com/en-us/kb/3133431 for details. We are yet to test and confirm whether this completely solves the issue.
We are extremely sorry for the inconvenience this has caused our existing customers and trial users. In case you have any questions or require assistance, please do not hesitate to contact our support.

WebHarvy version 3.4 released !

We’ve just released a new WebHarvy update. The following are the changes in this version.
Major:

  1. Support for pagination where a link/button has to be clicked to load the next set of pages. More Info
  2. URL based pagination – automatically increment a numeral in start page URL to load subsequent pages. More Info
  3. One-click multiple image extraction from details pages (ex: capture multiple images from product details page)
  4. Human emulation mode support for automatic pause injection – see Miner Settings
  5. Online license activation introduced to prevent casual piracy
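As an illustration of URL based pagination (item 2 above), the idea of incrementing a numeral in the start page URL can be sketched in Python as follows. This is not WebHarvy's implementation: the function name, the example URLs and the choice to increment the last number in the URL are all assumptions made for the sketch.

```python
import re


def page_urls(start_url, count, step=1):
    """Build `count` page URLs by incrementing the last number in start_url."""
    matches = list(re.finditer(r"\d+", start_url))
    if not matches:
        raise ValueError("start URL contains no numeral to increment")
    last = matches[-1]
    first_value = int(last.group())
    prefix, suffix = start_url[:last.start()], start_url[last.end():]
    return [prefix + str(first_value + i * step) + suffix for i in range(count)]


# Hypothetical start page URLs.
for url in page_urls("https://example.com/listings?page=1", 3):
    print(url)

# Some sites page by record offset rather than page number.
for url in page_urls("https://example.com/search?start=0", 3, step=10):
    print(url)
```

The `step` parameter covers sites where the number in the URL is a record offset (0, 10, 20, ...) rather than a page index.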

Minor:

  1. ‘Click’ option (Capture window > More Options > Click) can be used to navigate to the start page
  2. Bug Fix : Data alignment issue in miner window data table when some record fields do not have a value (blank columns)
  3. Bug Fix : Keyword based scraping when encoding is required
  4. Scheduler option to overwrite or append the export file in case the file already exists
  5. ‘Follow this link’ option enabled in details pages (pages reached by following links from starting page).
  6. Bug Fix : Images going blank in some cases while mouse hovers over them during configuration
  7. Bug Fix : New lines and tabs escaped in JSON export
  8. HtmlParser updated to parse elements from <HTML> tag, so META tags can be extracted from the full HTML source of the page
  9. Handles commas in keywords (Keyword Scraping)
  10. Starts with a random proxy address from the proxy list while rotating proxies
  11. In-built browser emulates IE 11 by default.

Download the latest version of WebHarvy Web Data Extraction Software.

Web Scraping from Cloud – WebHarvy on Amazon EC2

WebHarvy requires the Windows operating system to run. So if you do not have access to a Windows PC, or if you do not want to run WebHarvy on your local PC, you have the option to run WebHarvy from the cloud. The Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform makes this possible. See the following link.
https://aws.amazon.com/ec2/
Amazon EC2 lets you run a remote Windows instance in the cloud. You can access this cloud-based Windows instance via Remote Desktop.
https://aws.amazon.com/windows/
Charges for EC2 are minimal and, more importantly, there is a free tier available for 12 months. Details are at the following link.
https://aws.amazon.com/free/
Watch the following video which shows how to launch a Windows instance in Amazon EC2.

Detailed AWS EC2 documentation for managing Windows instances may be viewed at the following link.
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/EC2_GetStarted.html
Once you connect to the Windows EC2 instance via Remote Desktop, you can download and install WebHarvy on it, just as you would on a normal Windows desktop or laptop. Please contact us in case you have any questions.

Scraping hidden details using WebHarvy

WebHarvy allows you to scrape hidden fields in websites which are displayed only when you click on a link or button. The ‘Click’ option in the Capture window can be used to display such ‘click to display’ fields. The following video shows the process.

The video below shows how contact details from Craigslist listing pages can be extracted using this feature.

WebHarvy also allows you to scrape data from the HTML of the page. For example, the following video shows how geo location (latitude, longitude) can be extracted from yellow page listings (map details) using the page HTML. This data is not visible in the browser.
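To give a feel for what "extracting from the HTML" means, here is a small Python sketch which pulls a latitude/longitude pair out of page source with a regular expression. The markup, attribute names and pattern are hypothetical and written only for this example; real listing pages embed coordinates in different ways, so the expression has to be adapted to the page source you actually see.

```python
import re

# Hypothetical page fragment; real sites use different markup.
html = '<div class="map" data-lat="40.7128" data-lng="-74.0060"></div>'

# One capture group per coordinate: optional minus sign, digits, optional decimals.
GEO_PATTERN = re.compile(
    r'data-lat="(-?\d+(?:\.\d+)?)"\s+data-lng="(-?\d+(?:\.\d+)?)"'
)


def extract_geo(source):
    """Return (latitude, longitude) as floats, or None if the pattern is absent."""
    match = GEO_PATTERN.search(source)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(extract_geo(html))
```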

 Know More