Web Scraping Tutorial Using PHP in Less Than 5 Minutes
“Being a good citizen in a world full of spiders” – Dimitrios Kouzis
There are a few things to be aware of, so let's start this web scraping tutorial with the easiest one. Before developing a spider, check the site's robots.txt file. It tells you which paths crawlers are allowed or disallowed to visit.
Example from the robots.txt file at http://www.google.com/robots.txt:
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Your crawler should skip every path listed under Disallow.
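The check can be sketched in a few lines of PHP. This is a minimal sketch, not a full robots.txt parser: the function names parse_robots_rules and is_path_allowed are hypothetical, User-agent groups and the * and $ wildcards are ignored, and it assumes the longest matching rule wins, with Allow winning a tie.

```php
<?php
// Hypothetical helper: collect the Allow/Disallow rules from a robots.txt body.
// Simplification: User-agent groups and the * and $ wildcards are ignored.
function parse_robots_rules(string $robotsTxt): array
{
    $rules = [];
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*/', '', $line));
        if (preg_match('/^(allow|disallow)\s*:\s*(\S+)/i', $line, $m)) {
            $rules[] = [
                'allow' => strtolower($m[1]) === 'allow',
                'path'  => $m[2],
            ];
        }
    }
    return $rules;
}

// Hypothetical helper: decide whether a path may be crawled.
// Assumption: the longest matching prefix wins; Allow wins a tie.
function is_path_allowed(array $rules, string $path): bool
{
    $allowed = true;  // no matching rule means the path is allowed
    $bestLen = -1;
    foreach ($rules as $rule) {
        $len = strlen($rule['path']);
        if (strncmp($path, $rule['path'], $len) !== 0) {
            continue; // rule is not a prefix of this path
        }
        if ($len > $bestLen || ($len === $bestLen && $rule['allow'])) {
            $bestLen = $len;
            $allowed = $rule['allow'];
        }
    }
    return $allowed;
}

// A few of the rules from the Google example above.
$robots = <<<TXT
Disallow: /search
Allow: /search/about
Disallow: /groups
TXT;

$rules = parse_robots_rules($robots);
var_dump(is_path_allowed($rules, '/search?q=php')); // bool(false)
var_dump(is_path_allowed($rules, '/search/about')); // bool(true)
```

In a real crawler you would download robots.txt once per site, keep the parsed rules, and call the check before every request.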
More similar examples:
http://ebay.com/robots.txt
https://www.amazon.com/robots.txt
We use ‘Simple HTML DOM Parser’ to extract data from the ‘http://www.example.com/’ webpage.
Go to https://sourceforge.net/projects/simplehtmldom/ and download “PHP Simple HTML DOM Parser”.
Unzip and copy ‘simple_html_dom.php’ to your ‘lib’ folder.
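With the library in place, extracting data could look roughly like the sketch below. It assumes simple_html_dom.php sits in a lib folder next to the script. The function name scrape_links and the choice to collect links are illustrative assumptions, not necessarily what the tutorial's own source code does.

```php
<?php
// Hypothetical helper: return the text and href of every link on a page.
// Assumes simple_html_dom.php was copied to ./lib as described above.
function scrape_links(string $url): array
{
    require_once __DIR__ . '/lib/simple_html_dom.php';

    // file_get_html() downloads the page and builds a DOM object.
    $html = file_get_html($url);
    if (!$html) {
        return []; // download or parsing failed
    }

    $links = [];
    // find() takes a CSS-like selector; 'a' matches every anchor tag.
    foreach ($html->find('a') as $a) {
        $links[] = [
            'text' => trim($a->plaintext),
            'href' => $a->href,
        ];
    }

    $html->clear(); // release the memory held by the DOM tree
    return $links;
}

// Usage (needs network access and the library in ./lib):
// print_r(scrape_links('http://www.example.com/'));
```

Swap the 'a' selector for whatever element holds the data you are after, for example 'h1' or 'div.price'.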
Install the latest ‘Firebug’ add-on for Firefox to analyze website contents. It is available here: https://addons.mozilla.org/en-US/firefox/addon/firebug/
The “PHP Simple HTML DOM Parser” manual is available here.
Source code: Web-Scraping