Web Scraping Tutorial (Less Than 5 Minutes!)

Web Scraping Tutorial Using PHP in Less Than 5 Minutes

“Being a good citizen in a world full of spiders”  – Dimitrios Kouzis
There are a few things to be aware of, so let's start this web scraping tutorial with the easiest one. Before developing a spider, check the site's robots.txt file. It shows which paths crawlers are allowed or disallowed to visit.
Example of a robots.txt file, from http://www.google.com/robots.txt:

Disallow: /search 
Allow: /search/about
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=

Your crawler should skip every path listed under Disallow, unless a more specific Allow rule covers it.
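
The check described above can be sketched in a few lines of PHP. This is a minimal illustration, not a complete robots.txt parser: the helper names parse_robots_rules() and is_path_allowed() are hypothetical, the sample rules are assumed to sit under a "User-agent: *" group, the longest matching rule is taken to win, and wildcard patterns (* and $) are not handled.

```php
<?php
// Hypothetical helper: collect the Allow/Disallow rules from groups that
// apply to every crawler ("User-agent: *").
function parse_robots_rules(string $robots_txt): array
{
    $rules = [];
    $applies = false;      // does the current group apply to all crawlers?
    $in_agents = false;    // was the previous meaningful line a User-agent line?
    foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            $applies = ($in_agents && $applies) || $value === '*';
            $in_agents = true;
            continue;
        }
        $in_agents = false;
        if ($applies && ($field === 'allow' || $field === 'disallow') && $value !== '') {
            $rules[] = ['allow' => $field === 'allow', 'path' => $value];
        }
    }
    return $rules;
}

// Hypothetical helper: the longest rule that is a prefix of the path decides;
// on a tie, Allow wins. A path with no matching rule is allowed.
function is_path_allowed(array $rules, string $path): bool
{
    $allowed = true;
    $best_len = -1;
    foreach ($rules as $rule) {
        $len = strlen($rule['path']);
        if (strncmp($path, $rule['path'], $len) !== 0) {
            continue; // this rule is not a prefix of the path
        }
        if ($len > $best_len || ($len === $best_len && $rule['allow'])) {
            $best_len = $len;
            $allowed = $rule['allow'];
        }
    }
    return $allowed;
}

// Sample rules taken from the excerpt above; the User-agent line is an assumption.
$robots_txt = <<<TXT
User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /groups
Disallow: /?
Allow: /?hl=
TXT;

$rules = parse_robots_rules($robots_txt);
var_dump(is_path_allowed($rules, '/search?q=php')); // bool(false)
var_dump(is_path_allowed($rules, '/search/about')); // bool(true)
var_dump(is_path_allowed($rules, '/?hl=en'));       // bool(true)
```

A spider would call is_path_allowed() before each request and skip any URL whose path comes back false.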

This tutorial uses “PHP Simple HTML DOM Parser” to extract data from the web page at http://www.example.com/.
Go to https://sourceforge.net/projects/simplehtmldom/ and download “PHP Simple HTML DOM Parser”.
Unzip the archive and copy ‘simple_html_dom.php’ into your ‘lib’ folder.
Install the ‘Firebug’ add-on for Firefox to inspect the website's contents: https://addons.mozilla.org/en-US/firefox/addon/firebug/ (Firebug has since been discontinued; Firefox's built-in Developer Tools do the same job.)

<?php
require_once 'lib/simple_html_dom.php'; // Load PHP Simple HTML DOM Parser

$source_url = 'http://www.example.com/'; // Source website from which data will be extracted

$html_source = file_get_html($source_url); // Fetch and parse the HTML source of the URL
if (!$html_source) {
    exit('Could not load ' . $source_url);
}

// Take the first <h1> as the title and the first <p> as the information
$title = $html_source->find('h1', 0)->plaintext;
$information = $html_source->find('p', 0)->plaintext;

echo '<br>';
echo 'Title: ' . $title;
echo '<br>';
echo '<br>';
echo 'Information: ' . $information;
echo '<br>';
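
The same first-heading, first-paragraph extraction can also be tried offline. The sketch below is not the method used in this tutorial: it swaps in PHP's bundled DOM extension (DOMDocument) in place of Simple HTML DOM Parser, and it parses a hard-coded HTML string so that no download or network access is needed. The markup and the first_text() helper are assumptions made for illustration.

```php
<?php
// Hypothetical helper: text of the first <$tag> element, or null if there is none.
function first_text(DOMDocument $doc, string $tag): ?string
{
    $node = $doc->getElementsByTagName($tag)->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Stand-in markup (an assumption); a real spider would fetch this from the source URL.
$html = '<html><body><h1>Example Domain</h1>'
      . '<p>This domain is for use in illustrative examples.</p>'
      . '<p>More information...</p></body></html>';

libxml_use_internal_errors(true); // real-world HTML is often malformed; keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$title = first_text($doc, 'h1');
$information = first_text($doc, 'p');

// Fall back to a placeholder instead of failing when an element is missing.
echo 'Title: ' . ($title ?? '(none)') . PHP_EOL;
echo 'Information: ' . ($information ?? '(none)') . PHP_EOL;
```

Returning null for a missing element lets the caller decide what to do when a page has no matching tag, which is worth handling because pages change layout without notice.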

See the “PHP Simple HTML DOM Parser” manual for the rest of the parser's API.
