Sequentially Scrape Websites : Automation

You often need to scrape data from multiple websites and may also want to automate the entire process. The desired workflow is as follows.

  1. Configure WebHarvy to scrape data from each website.
  2. Start scraping data from each website, one after the other, without any manual intervention. In short, a one-click method to start scraping data from multiple websites and to save the data automatically once mining is completed.

Command line arguments

WebHarvy supports command line arguments, so you can run it from a terminal or script and supply details such as the configuration file path, the number of pages to mine and the location where the mined data is to be saved. For more details, please follow the link below.

WebHarvy Command Line Arguments Explained
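
As a quick orientation, the sketch below annotates a single command using the argument order that appears in the example later in this article. The paths are illustrative, and the reading of -1 as "no page limit" is an assumption; the linked reference is the authority on what each argument means.

```batch
@echo off
REM Argument order as used in this article's example (values are illustrative):
REM   1. path of the saved configuration file
REM   2. number of pages to mine (-1 is assumed here to mean "no page limit")
REM   3. path of the output file
REM   4. how to treat an existing output file: overwrite or update
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-doctors.xml" -1 "c:\mydata\yp-doctors.csv" overwrite
```

Each path is wrapped in straight double quotes so that paths containing spaces are passed as a single argument.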

Windows batch file

Using WebHarvy's command line argument support, you can write a Windows batch file which runs each configuration, one after the other. Refer to the following link to learn how to write a Windows batch file. In its simplest form, you can open Notepad, write the commands to run, one per line, and save the file with a .bat extension.

https://www.windowscentral.com/how-create-and-run-batch-file-windows-10

Now you can run this .bat file directly, or schedule it using Windows Task Scheduler to run at the times you need.
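
Scheduling can also be done from the command line with the built-in schtasks tool instead of the Task Scheduler window. The following is a minimal sketch: the task name, the batch file location (c:\myscripts\scrape-yp.bat) and the 02:00 start time are hypothetical values chosen for illustration.

```batch
REM Create a task that runs the batch file every day at 02:00.
REM The task name and the .bat path below are hypothetical; use your own.
schtasks /Create /TN "WebHarvy nightly scrape" /TR "c:\myscripts\scrape-yp.bat" /SC DAILY /ST 02:00

REM Run the task once right away to confirm that it works.
schtasks /Run /TN "WebHarvy nightly scrape"

REM Remove the task when it is no longer needed (/F skips the confirmation prompt).
schtasks /Delete /TN "WebHarvy nightly scrape" /F
```

Run these commands from a Command Prompt. The same task then appears in the Task Scheduler window, where its settings can be reviewed or changed.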

Example

The following is an example of a Windows batch file (saved with a .bat extension).

scrape-yp.bat

"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-doctors.xml" -1 "c:\mydata\yp-doctors.csv" overwrite
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-accountants.xml" -1 "c:\mydata\yp-accountants.xlsx" update
"c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe" "c:\myconfigs\yp-lawyers.xml" -1 "c:\mydata\yp-lawyers.xml" update

The above batch file runs three different configurations (yp-doctors, yp-accountants and yp-lawyers), one after the other. Note that the complete path is used for the WebHarvy executable, the configuration file and the output file.
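
When the list of configurations grows, repeating the full executable path on every line becomes tedious. The sketch below is one possible variation, not the only way to write it: it stores the paths from the example in variables and loops over the configuration names. The variable names (WEBHARVY, CONFIGS, OUTPUT) are arbitrary, and for simplicity every run writes a CSV file in overwrite mode, unlike the mixed formats and modes in the example above.

```batch
@echo off
setlocal

REM Paths taken from the example in this article; adjust them for your machine.
set "WEBHARVY=c:\users\tim\AppData\Roaming\SysNucleus\WebHarvy\WebHarvy.exe"
set "CONFIGS=c:\myconfigs"
set "OUTPUT=c:\mydata"

REM Run one configuration per iteration. Inside a batch file, cmd waits for
REM each program to exit before it moves on to the next command.
for %%N in (yp-doctors yp-accountants yp-lawyers) do (
    echo Mining %%N ...
    "%WEBHARVY%" "%CONFIGS%\%%N.xml" -1 "%OUTPUT%\%%N.csv" overwrite
)

echo All configurations finished.
endlocal
```

To add another website, save its configuration in the configurations folder and append its file name (without the .xml extension) to the list inside the parentheses.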

If you have any questions please do not hesitate to contact us.