Scraping Medical/Scientific Articles - Author name, email, content

Web scraping is the process of automatically extracting data from websites using software called web scrapers. WebHarvy is a visual web scraping tool that lets you extract data from any website through an intuitive point-and-click interface.

How to scrape medical/scientific article data?

First, download and install WebHarvy on your computer. WebHarvy is a Windows desktop application. To familiarize yourself with the basic operation of the software, refer to this getting started guide.

To scrape data from any website using WebHarvy, first load the page you want to scrape in WebHarvy’s configuration browser and start configuration. You can then click to select the data you need from the page. WebHarvy also lets you follow links to load article detail pages and scrape additional data. Where article listings span multiple pages, automatic pagination can be configured.
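For readers curious about what following links and pagination involve conceptually, here is a minimal Python sketch. It is not how WebHarvy works internally, and no code is needed to use WebHarvy. The in-memory PAGES dictionary, the "article" and "next" class names, and the function and class names are all hypothetical, chosen only for illustration; nothing is fetched from a real website.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" (URL -> HTML); stands in for real pages.
PAGES = {
    "/list?page=1": (
        '<a class="article" href="/a/1">One</a>'
        '<a class="article" href="/a/2">Two</a>'
        '<a class="next" href="/list?page=2">Next</a>'
    ),
    "/list?page=2": '<a class="article" href="/a/3">Three</a>',
    "/a/1": "<h1>Article One</h1>",
    "/a/2": "<h1>Article Two</h1>",
    "/a/3": "<h1>Article Three</h1>",
}


class ListingParser(HTMLParser):
    """Collects article links and the next-page link from a listing page."""

    def __init__(self):
        super().__init__()
        self.article_links = []
        self.next_link = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if attributes.get("class") == "article":
            self.article_links.append(attributes.get("href"))
        elif attributes.get("class") == "next":
            self.next_link = attributes.get("href")


class TitleParser(HTMLParser):
    """Collects the text inside the first-level heading of a details page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.title += data


def crawl(start_url):
    """Walk listing pages, follow each article link, and return the titles."""
    titles = []
    url = start_url
    while url is not None:
        listing = ListingParser()
        listing.feed(PAGES[url])
        for link in listing.article_links:
            detail = TitleParser()
            detail.feed(PAGES[link])
            titles.append(detail.title)
        url = listing.next_link  # None on the last page, which ends the loop
    return titles
```

Calling crawl("/list?page=1") visits both listing pages and the three article pages they link to, returning the three article titles in order.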

The scraped data can be saved as a spreadsheet file on your computer or to an SQL database. Various file formats and database types are supported.
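To illustrate what these two kinds of output look like, the following Python sketch writes a few records to a CSV file and to an SQLite database. It is only an illustration under stated assumptions: WebHarvy performs its export through its own interface, and the sample rows, the column names (title, author, email), the table name and the function names here are all hypothetical.

```python
import csv
import sqlite3

# Hypothetical sample rows; the field names are illustrative only.
FIELDS = ["title", "author", "email"]
ARTICLES = [
    {"title": "Example article A", "author": "Jane Doe", "email": "jane.doe@example.org"},
    {"title": "Example article B", "author": "John Roe", "email": "john.roe@example.org"},
]


def save_csv(rows, path):
    """Write rows to a CSV file that spreadsheet applications can open."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def save_sqlite(rows, path):
    """Insert rows into an 'articles' table, creating it if needed."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (title TEXT, author TEXT, email TEXT)"
        )
        conn.executemany(
            "INSERT INTO articles (title, author, email) "
            "VALUES (:title, :author, :email)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()
```

For example, save_csv(ARTICLES, "articles.csv") produces a file with one header row and one row per article, and save_sqlite(ARTICLES, "articles.db") stores the same records in a database table.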

Below are demonstrations of article scraping with WebHarvy.

Scraping National Library of Medicine

The following video shows how article data such as title, author name and email can be scraped from the National Library of Medicine (https://www.nlm.nih.gov/) using WebHarvy. The regular expression string used to get the author name can be found in the video description.

Scraping Frontiers Articles (frontiersin.org)

The video below shows how author names and emails of articles at frontiersin.org can be scraped using WebHarvy. The RegEx string and JavaScript code used in the video can be found here.
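As a general illustration of how a regular expression can pick email addresses out of page text, here is a minimal Python sketch. The pattern is a simplified assumption written for this example, not the RegEx string used in the video, and it will not match every valid address. The sample text and the function name are hypothetical.

```python
import re

# Simplified email pattern (an assumption for illustration, not the
# RegEx from the video): local part, "@", domain, dot, top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical text resembling an article's correspondence section.
SAMPLE = (
    "Correspondence: Jane Doe, jane.doe@example.org; "
    "John Roe (john.roe@example.org). Contact jane.doe@example.org."
)


def extract_emails(text):
    """Return unique addresses in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen
```

Calling extract_emails(SAMPLE) returns the two distinct addresses in the sample text, without the surrounding punctuation and without the duplicate.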