In my last post, Scraping Web Pages with cURL, I talked about what the cURL library can bring to the table and how we can use this library to create our own web spider class in PHP.
What I want to do in this tutorial is show you how to use the cURL library to download nearly anything from the web. In upcoming tutorials I will show you how to manipulate what you downloaded, extract whatever information you want, and either store that data in a database or save it on your server.
Creating a PHP Class -
Before we start talking about the cURL library, I first want to show you how to create a class in PHP. Classes are very useful in that they can hold a number of properties and functions, and they allow us to easily reuse code for better productivity. Classes are PHP’s way of creating what are called “Objects”.
An Object in PHP is very similar to an Object in the physical world. Everything that we can see and touch is an object. For instance, a Book is an Object. The Book also has a number of properties that describe it, such as an author, title, number of pages, line spacing, publisher, etc. We can use this same concept in our PHP code to represent a book by using the following code:
class Book
{
    public $author = "";
    public $title = "";
    public $nopages = 0;
    public $publisher = "";

    function __construct($author, $title, $pages, $publisher)
    {
        $this->author = $author;
        $this->title = $title;
        $this->nopages = $pages;
        $this->publisher = $publisher;
    }
}
The above code is an example of a class that I built to describe a book.
After declaring the class, I have listed the properties that I want the class to have: author, title, number of pages, and publisher.
Next comes a function called the “constructor”. In PHP 5 and later it is named __construct(); older PHP 4 code gave it the same name as the class, but PHP 8 no longer treats such a function as a constructor. All this function does is set the properties from the values that are passed in when the object is created (I will show you how to do this later).
The $this keyword is used to reference the current instance of the object. So in this case, we are talking about the Book that we are describing, and not all the rest of the books on planet Earth. $this is a very useful keyword that object-oriented code in PHP relies on.
Now comes the good stuff: creating instances of the class we just defined. Let’s say I wanted to create an Object from the Book class (for example, because I am creating an online library for people to look up their favorite romance novel). All we need to do is write the following code:
$firstbook = new Book("John Doe", "In Love", 200, "ACME Publishing");
$secondbook = new Book("John Doe", "In Love 2", 230, "ACME Publishing");
We have now created an instance of the Book in PHP with an author of “John Doe” and a title of “In Love”. John Doe’s second book is created in exactly the same way by declaring a “new Book” and passing different values into the class constructor. As you can see, we can reuse this code as many times as we want very easily.
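To make the idea of separate instances concrete, here is a minimal sketch that repeats the Book class (written with __construct(), the constructor name used by PHP 5 and later) and reads the properties back from two objects. The echo lines and the page-count change are only there for illustration.

```php
<?php
// Minimal sketch: two Book instances hold independent property values.
class Book
{
    public $author = "";
    public $title = "";
    public $nopages = 0;
    public $publisher = "";

    public function __construct($author, $title, $pages, $publisher)
    {
        $this->author = $author;
        $this->title = $title;
        $this->nopages = $pages;
        $this->publisher = $publisher;
    }
}

$firstbook = new Book("John Doe", "In Love", 200, "ACME Publishing");
$secondbook = new Book("John Doe", "In Love 2", 230, "ACME Publishing");

// Properties are read with the -> operator on each instance.
echo $firstbook->title . " has " . $firstbook->nopages . " pages\n";
echo $secondbook->title . " has " . $secondbook->nopages . " pages\n";

// Changing one object does not affect the other.
$secondbook->nopages = 240;
echo $firstbook->nopages . "\n";
```

Each object carries its own copy of the properties, which is what lets us reuse one class for as many books (or spiders) as we like.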
With reusable code structured this way, we can create hundreds of web spiders very quickly and with very little effort. So now let’s create our web spider class…
Creating a Web Spider Class in PHP -
Now let’s use the same thinking to create a web page scraping spider class that we can use to download virtually anything from the web. Let’s start our class by giving it the name “wSpider” and creating the constructor.
class wSpider
{
    public $ch;     // holds our cURL handle
    public $html;   // holds the resulting HTML data
    public $binary; // set to 1 for binary transfers
    public $url;    // holds the URL to be downloaded

    // constructor: PHP 5 and later call __construct() when "new wSpider()" runs
    function __construct()
    {
        $this->html = "";
        $this->binary = 0;
        $this->url = "";
    }
}
In the above code, we create some properties which we are going to need for our class and then in the constructor we initialize all the properties. Right now our class does a whole lot of nothing, so to add functionality we are going to have to add functions.
A function, sometimes called a method, is a list of instructions that the object can carry out. In the book example, I could have a method called Read(), which could cause someone to begin reading that book.
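As a rough illustration of a method, the sketch below gives a cut-down Book a read() method. The $currentPage property and the page-counting behaviour are my own assumptions for the example and are not part of the Book class shown earlier.

```php
<?php
// Hypothetical sketch: a Book with a read() method that acts on its own data.
class Book
{
    public $title = "";
    public $nopages = 0;
    public $currentPage = 0; // illustration only: tracks reading progress

    public function __construct($title, $pages)
    {
        $this->title = $title;
        $this->nopages = $pages;
    }

    // Advance the bookmark by $pages, never past the last page.
    public function read($pages)
    {
        $this->currentPage = min($this->currentPage + $pages, $this->nopages);
        return $this->currentPage;
    }
}

$book = new Book("In Love", 200);
$book->read(150);
echo $book->read(100) . "\n"; // the bookmark stops at the last page
```

The method uses $this to work on the data of the one book it was called on, which is the same pattern fetchPage() follows in the spider class.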
So for our wSpider class, let’s create a function underneath the constructor called fetchPage():
function fetchPage($url)
{
    $this->url = $url;
    if (!empty($this->url)) { // only continue if a URL was actually passed in
        $this->ch = curl_init(); // open a cURL instance
        curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it
        curl_setopt($this->ch, CURLOPT_URL, $this->url); // set the URL to download
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
        curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); // binary flag; has no effect on PHP 5.1.3 and later
        $this->html = curl_exec($this->ch); // pull the web page from the Internet (false on failure)
        curl_close($this->ch); // close the connection
    }
}
The above function does the following:
- Checks to see if the url was passed through the function
- Sets the options for the web page pull (see code for what each option does)
- Pulls the web page from the Internet
- Closes the cURL connection
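The steps above do not check whether the download actually worked. As a hedged sketch, the same steps can be wrapped with some basic error reporting. fetchChecked() is a hypothetical standalone helper, not part of the wSpider class, and the 30-second timeout is an assumed value.

```php
<?php
// Sketch only: fetch a URL with cURL and report failures instead of
// silently storing false. fetchChecked() is a hypothetical helper name.
function fetchChecked($url)
{
    // Step 1: make sure a URL was passed in.
    if (!is_string($url) || $url === "") {
        return array('ok' => false, 'status' => 0, 'error' => 'no URL given', 'body' => '');
    }

    // Step 2: set the options for the pull.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // assumed limit, in seconds

    // Step 3: pull the page; curl_exec() returns false on failure.
    $body = curl_exec($ch);
    $error = ($body === false) ? curl_error($ch) : '';
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Step 4: close the connection.
    curl_close($ch);

    return array(
        'ok' => ($body !== false && $status >= 200 && $status < 300),
        'status' => $status,
        'error' => $error,
        'body' => ($body === false) ? '' : $body,
    );
}

// An empty URL is rejected before any network call is made.
$result = fetchChecked("");
echo $result['ok'] ? "downloaded\n" : "failed: " . $result['error'] . "\n";
```

Calling fetchChecked("http://www.msn.com") would return the page in $result['body'] when the request succeeds, along with the HTTP status code.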
The resultant html from the web spider is held in the $this->html property. Below is the finished code used to download an HTML web page and print it out to the screen.
$mySpider = new wSpider(); //// creates a new instance of the wSpider
$mySpider->fetchPage("http://www.msn.com"); /// fetches the home page of msn.com
echo $mySpider->html; /// prints out the html to the screen
If you wanted to download a picture instead, you would set the $binary property to true (1) before fetching. Pictures and videos are binary files rather than text. On PHP 5.1.3 and later, CURLOPT_RETURNTRANSFER already returns the raw bytes, so the flag only matters on older versions. The following code will download a picture and store the data in the $html property.
$mySpider = new wSpider(); //// creates a new instance of the wSpider
$mySpider->binary = 1; /// turns on the binary transfer mode
$mySpider->fetchPage("http://www.msn.com/pic123.jpg"); /// fetches a picture off of the msn home page
You can then use some PHP code to save the data to a database or to save a picture file to your hard drive. The sky is the limit as to what you can do with this data. This is also the same technique that we can use to bypass Captcha, so make sure that you know how to use it.
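As one possible way to save a downloaded picture to disk, here is a small sketch built on file_put_contents(). The saveDownload() helper and the target file name are hypothetical, and a short string of made-up bytes stands in for $mySpider->html so that the example runs without a network call.

```php
<?php
// Sketch: write downloaded bytes to disk. saveDownload() and the file
// name below are hypothetical.
function saveDownload($data, $path)
{
    if (!is_string($data) || $data === "") {
        return false; // nothing was downloaded (curl_exec() returns false on failure)
    }
    // file_put_contents() is binary-safe and returns the byte count or false.
    return file_put_contents($path, $data) === strlen($data);
}

// Stand-in for $mySpider->html after a binary fetch.
$bytes = "\xFF\xD8\xFF\xE0" . "not a real image";
$target = sys_get_temp_dir() . "/pic123.jpg";

$saved = saveDownload($bytes, $target);
echo $saved ? "saved to " . $target . "\n" : "save failed\n";
```

With the spider class, the call would look like saveDownload($mySpider->html, $target) once fetchPage() has run.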
So we are off to a good start. What you have learned here is crucial to understand if you are planning on doing some large-scale scraping projects. Creating reusable code can sometimes take longer in the beginning, but if you do it right, you can create spiders very quickly and use the code on all your future projects.
In the next tutorial we will talk about how to manipulate the data that you have just pulled, as well as how to extend our class to be more functional. Happy Scraping!!
Other Web Spider Tutorials:
Scraping Web Pages with cURL Tutorial- Part 2
Building a Web Spider – Part 2
Thanks for the great beginning tutorial! Looking forward to the upcoming posts. A question on storing the image: is this saving the same information that gd produces with $image = open_image($file)?
Jason,
I wasn’t sure if the gd library in PHP could handle cross domain server pulls, but I tested out a sample script using the gd2 library and it seems to work fine.
As for the data being the same, I echoed the data using the gd2 library, and it contains some tags, specific to the gd library, that describe what kind of picture file it is. So to answer your question: no, the data is not the same.
If you pull an image file using the gd or gd2 library then you will need to use gd functions to recreate the image and save to your server.
The gd library also allows you to examine each pixel by itself, which may create more overhead than doing a straight cURL download. I will have to test that and get back to you.
Thanks, it is very useful for me.
I have copied and pasted your code into Notepad++ and am unable to get it to work no matter what I try.
Error is:
Call to undefined method wSpider::fetchPage() in /home3/fonetwof/public_html/formidablewar/test/screenscrape.php on line 40
Code is:
$this->html = "";
$this->binary = 0;
$this->url = "";
}
}
function fetchPage($url)
{
$this->url = $url;
if (isset($this->url)) {
$this->ch = curl_init (); /// open a cURL instance
curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); // tell cURL to return the data
curl_setopt ($this->ch, CURLOPT_URL, $this->url); /// set the URL to download
curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); /// Follow any redirects
curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); /// tells cURL if the data is binary data or not
$this->html = curl_exec($this->ch); // pulls the webpage from the internet
curl_close ($this->ch); /// closes the connection
}
}
$mySpider = new wSpider(); //// creates a new instance of the wSpider
$mySpider->fetchPage("http://www.msn.com"); /// fetches the home page of msn.com
echo $mySpider->html; /// prints out the html to the screen
?>
Looks like I had to move a closing bracket to the end and it all works now.