Announcing the Future Launch of the “SpyderSchool”

Coming Soon - SpyderSchool Launch

I want to thank everyone who left a comment, emailed me, or even took the time to track me down and call me about starting this project.  It seems there is a real need for this subject, and to my knowledge there is still no definitive source of web automation information on the net.

I am very happy to announce that I am going forward with the SpyderSchool, an online school on how to Automate the Internet.  The release is tentatively planned for June 1st, 2009.

This school will teach its members how to spider, scrape, mine data, and automate nearly any site or process on the net.  I will teach you how to do this in step-by-step videos, tutorials, and live seminars.

I am limiting the number of students that I will be accepting into the SpyderSchool for the first year to 250.  So if you are interested, please fill out the form linked below as soon as possible.

Put me on the Waiting List for the SpyderSchool

The first 250 people who sign up will be contacted first, and if there are still openings I will continue down the list until all spots are filled.  Only then will I open up general enrollment, so don't wait!  Even if you are only remotely interested, you had better sign up.

In the SpyderSchool you will learn by doing, with hands-on examples, data-mining challenges, and competitions to test your skills and hone your newfound techniques.  As you complete each challenge, you will be building a library of web scraping code that you can use in your future career in web automation, as well as a portfolio to impress your future clients.

Don’t have any experience programming for the web?  No worries, the SpyderSchool is being built for the beginner.

In the SpyderSchool, you will learn:

  1. How the Internet Works
  2. What Types of Scraping Technologies Exist
  3. How to Analyze a Web Site for Scraping
  4. Basic/Advanced Web Spider Programming
  5. What Tools to Use
  6. How to Automate Your Web Hosting Accounts (Linux Servers)
  7. How to Automate Forms
  8. How to Automate AJAX Dynamic Content Sites
  9. How to Beat Captcha
  10. and Much, Much, More.

This course will be taught using only open-source technologies (sorry, Microsoft), so you won't have to pay for one single solitary thing besides your tuition to the course.  There will be no up-sells, down-sells, cross-sells, or any other marketing pressure… guaranteed.  I hate that crap; nothing but learning here.

Keep an eye out for more posts on the progress of the SpyderSchool.  And be sure to sign up for more information at the following location:

Put me on the Waiting List for the SpyderSchool

You can also follow the progress of this school at my Twitter:

http://www.twitter.com/spyderwebtech

Thanks again for the huge amount of interest, and for those who took the time to encourage me to start this project.

–Spyderwebtech

Scraping Web Pages with cURL Tutorial – Part 2

In Scraping Web Pages with cURL Tutorial – Part 1, I demonstrated how to create a web spider class that uses the cURL library to transfer any type of data from the web directly to your server.

In this tutorial we are going to talk about how to parse that data into some sort of usable form by extending our wSpider class functionality.

The key to scraping web pages is to first understand how a web page is laid out and its resulting HTML structure. It is this HTML structure that allows our spider to identify and scrape the part of the web page that you are interested in. So let's take a look at some example HTML and review what kinds of HTML tags we may encounter with our spider.

Below is an example of the HTML that you might see on a typical web page:

<html>
<head>
<title>My Web Page Title</title>
<meta name="keywords" content="key1,key2,key3" >
</head>
<body>
<h1>Header 1</h1>
<div id="mypics">
<div class="picclass">
<img src="mypic.jpg" width="100" height="100">
</div>
<div class="picclass">
<img src="mypic2.jpg" width="100" height="100">
</div>
</div>
<a href="nextpage.php">Goto Next Page</a>
</body>
</html>

As you can see, each part of the web page is enclosed between an opening tag ( <..> ) and a closing tag ( </..> ). Every web page will have these two primary sections:

  1. <head> section – The content between the opening and closing tags provides information about the website/web page, including the title, keywords, doctype, etc.
  2. <body> section – This section contains all of the visual items of a web page, including text, images, links, tables, and a container-type item called a DIV.

Most of our attention will be focused on the <body> section, as this is where a lot of the “juicy” content that we might want to scrape lives. In a later tutorial, I will show you how to create a competition analysis spider that works mostly in the <head> section.

Within the <body> tags of our example website we find the <h1>, <div>, <img>, and <a> tags, which we will pass to our spider to scrape either the content between these tags OR an attribute of these tags. For example, the link tag <a> has the anchor text between its opening and closing tags as well as an HREF attribute, which tells us the URL that will be navigated to when the link is clicked.

So let's extend the wSpider class that we began to build in part 1 of this tutorial by creating a function that will take the HTML stored in our $this->html string and strip out these different tags. I am going to call this function parse_array because I want to take all the occurrences of the supplied tag and store them in an array.

Below is the code for our parse_array() function:


function parse_array($beg_tag, $close_tag)
{
    /// grab every occurrence of $beg_tag ... $close_tag from the downloaded HTML
    preg_match_all("($beg_tag.*$close_tag)siU", $this->html, $matching_data);
    return $matching_data[0]; /// return the array of full matches
}

The above function takes two parameters, the beginning tag ($beg_tag) and the ending tag ($close_tag), and captures everything in between them. This happens each and every time the program finds this particular pattern in the HTML.

The real workhorse of this function is the preg_match_all function that is part of PHP. The preg_match_all function takes a regular expression, searches a string for all of its occurrences, and extracts them into an array for us (in this case an array called $matching_data).

The regular expression that we use is the “($beg_tag.*$close_tag)siU” portion of the function above, which starts by looking for the beginning tag supplied by the user. The next part of the expression is the .*, which is sometimes called the “greedy” part because the . will match anything and the * will grab as much as it can. The last part of the expression is the closing tag, once again supplied by the user, followed by some modifiers: s lets the . match newlines, i ignores case, and U makes the match ungreedy so it stops at the first closing tag it finds. We will talk more about these in a later tutorial on regular expressions.
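To make that concrete, here is a small illustration (separate from the wSpider class) of what the call looks like once PHP fills in the variables, using the <title> tag from the example HTML above:

$html = "<title>My Web Page Title</title>";
/// With $beg_tag = "<title>" and $close_tag = "</title>", the pattern becomes
/// "(<title>.*</title>)siU" – the parentheses act as the pattern delimiters
/// and s, i, U are the modifiers described above.
preg_match_all("(<title>.*</title>)siU", $html, $matching_data);
print_r($matching_data[0]); /// Array ( [0] => <title>My Web Page Title</title> )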

Once all this work is done the function will return the array back to where we called it from.

Let's take a quick look at how we can use our new parse_array() function to scrape all of the links from our example HTML above.

The first thing we need to know is how a link is structured in HTML, so we can send our function the beginning tag and the end tag. From our example, a link's structure looks like this:

<a href="nextpage.php">Goto Next Page</a>

I can see that the link begins with “<a” and ends with “</a>”. If I use these as my start and end tags, the function should return an array with one element, and inside that element will be the entire HTML for the link above.

Putting it all together, below is the code that I would use to download all the links from my example web page. I am also going to put in a foreach loop that can be used to do whatever I want with each link. This loop will be important to us when we create a spider that can crawl multiple web pages and websites without further user interaction.

$myspider = new wSpider(); /// create a new spider instance
$myspider->fetchPage("http://www.examplesite.com/example.html"); /// download the example page
$linkarray = $myspider->parse_array("<a", "</a>"); /// grab every link on the page

foreach ($linkarray as $result) {

/// store each $result in a database or create a new spider to crawl the next page

}

As you can see from the code above, we create a wSpider instance by using the new keyword. Next we get the HTML by calling the fetchPage() function, passing it the target URL as its argument (see part 1 of this tutorial if you don't understand). This function stores the resultant HTML in the $this->html variable within the wSpider class instance.

The next line of code assigns the result of our parse_array() function to another variable called $linkarray, which we use in a foreach loop at the bottom of this code. Remember, what has been returned into the $linkarray array is the entire HTML of each link, including the <a> tags, so depending on which type of information you want, you are going to have to strip it out yourself.

PHP has a lot of great functions that can help us grab the information we want. For instance, if we wanted to grab the anchor text of these links we could use the strip_tags() function in PHP to grab the text between the opening <a> and closing </a> tags (this works for anything that has an opening and closing tag).
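As a rough sketch (the variable names here are just for illustration), the body of the foreach loop above could pull out both pieces of information like this:

foreach ($linkarray as $result) {
    $anchortext = strip_tags($result); /// strips the <a> tags, leaving just the anchor text

    /// grab the URL out of the href attribute with another regular expression
    $url = "";
    if (preg_match('/href="([^"]*)"/i', $result, $href)) {
        $url = $href[1];
    }

    echo $anchortext . " -> " . $url . "\n"; /// e.g. Goto Next Page -> nextpage.php
}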

Once we have the information we want, we can do a number of things with it, such as save it to a database, spawn a new spider to scrape the next page, or record the anchor text of each link (for SEO purposes) as explained above.

In the next tutorial, we will continue to develop our spider class with a number of functions that make repetitive tasks very simple, and we will use these new functions to create a simple spider that can crawl an entire website.

Other Web Spider Tutorials:

Build a Web Spider – Part 1

Building a Web Spider – Part 2

**************************************************************************

* Looking for a comprehensive course on Web Page Scraping?

* Let me know your interest by commenting on the SpyderSchool Post

**************************************************************************

Scraping Web Pages with cURL Tutorial – Part 1

In my last post, Scraping Web Pages with cURL, I talked about what the cURL library can bring to the table and how we can use this library to create our own web spider class in PHP.

What I want to do in this tutorial is show you how to use the cURL library to download nearly anything off of the web. In upcoming tutorials I will show you how to manipulate what you have downloaded, extract whatever information you want, and either store that data in a database or save it on your server.

Creating a PHP Class -

Before we start talking about the cURL library, I first want to show you how to create a class in PHP. Classes are very useful in that they can hold a number of properties and functions and will allow us to easily reuse the code for better productivity. Classes are PHP's way of creating what are called “Objects”.

An Object in PHP is very similar to an Object in the physical world. Everything that we can see and touch is an object. For instance, a Book is an Object. The Book also has a number of properties that describe that book, such as an author, title, number of pages, line spacing, publisher, etc. We can use this same concept in our PHP code to represent a book by using the following code:

class Book
{
    var $author = "";
    var $title = "";
    var $nopages = 0;
    var $publisher = "";

    function Book($author, $title, $pages, $publisher)
    {
        $this->author = $author;
        $this->title = $title;
        $this->nopages = $pages;
        $this->publisher = $publisher;
    }
}

The above code is an example of a class that I built to describe a book.

After declaring the class, I have listed a number of properties that I want the class to have including author, title, number of pages, and publisher.

Next comes a function called the “constructor”, which must have the same name as the class. All this function does is set the properties that are passed into the object (I will show you how to do this later).

The $this keyword is used to reference the current instance of the object. So in this case, we are talking about the Book that we are describing, and not all the rest of the books on planet earth. $this is a very useful keyword that makes object oriented programming possible.

Now comes the good stuff: creating instances of the class we just wrote. Let's say I wanted to create an Object from the Book class (say I am building an online library for people to look up their favorite romance novels). All we need to do is write the following code:


$firstbook = new Book("John Doe", "In Love", 200, "ACME Publishing");

$secondbook = new Book("John Doe", "In Love 2", 230, "ACME Publishing");

We have now created an instance of the Book in PHP with an author of “John Doe” and a title of “In Love”. John Doe’s second book is created in exactly the same way by declaring a “new Book” and passing different values into the class constructor. As you can see, we can reuse this code as many times as we want very easily.

Having reusable code structured this way, we can create hundreds of web spiders very quickly with very little effort. So now let's create our web spider class…

Creating a Web Spider Class in PHP -

Now let's use the same thinking to create a web page scraping spider class that we can use to download virtually anything off of the web. Let's start our class by giving it the name “wSpider” and creating its constructor.

class wSpider
{
    var $ch;     /// going to be used to hold our cURL instance
    var $html;   /// used to hold the resultant html data
    var $binary; /// used for binary transfers
    var $url;    /// used to hold the url to be downloaded

    function wSpider()
    {
        $this->html = "";
        $this->binary = 0;
        $this->url = "";
    }
}

In the above code, we create some properties that we are going to need for our class, and then in the constructor we initialize them all. Right now our class does a whole lot of nothing, so to add functionality we are going to have to add functions.

A function, sometimes called a method, is a list of instructions for what to do with that object. In the book example, I could have a method called Read(), which could cause someone to begin reading that book.
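To make that idea concrete, here is a hedged sketch using a trimmed-down version of the Book class from above, with a Read() method added (the method body is just an example):

class Book
{
    var $author = "";
    var $title = "";

    function Book($author, $title)
    {
        $this->author = $author;
        $this->title = $title;
    }

    /// A method: a list of instructions that act on this particular book
    function Read()
    {
        echo "Now reading " . $this->title . " by " . $this->author;
    }
}

$mybook = new Book("John Doe", "In Love");
$mybook->Read(); /// prints: Now reading In Love by John Doe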

So for our wSpider class, let’s create a function underneath the constructor called fetchPage():

function fetchPage($url)
{
    $this->url = $url;
    if (isset($this->url)) {
        $this->ch = curl_init(); /// open a cURL instance
        curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1); /// tell cURL to return the data
        curl_setopt($this->ch, CURLOPT_URL, $this->url); /// set the URL to download
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); /// follow any redirects
        curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); /// tell cURL whether the data is binary or not
        $this->html = curl_exec($this->ch); /// pull the web page from the internet
        curl_close($this->ch); /// close the connection
    }
}

The above function does the following:

  1. Checks to see if the url was passed through the function
  2. Sets the options for the web page pull (see code for what each option does)
  3. Pulls the web page from the Internet
  4. Closes the cURL connection

The resultant html from the web spider is held in the $this->html property. Below is the finished code used to download an HTML web page and print it out to the screen.


$mySpider = new wSpider(); //// creates a new instance of the wSpider
$mySpider->fetchPage("http://www.msn.com"); /// fetches the home page of msn.com

echo $mySpider->html; /// prints out the html to the screen

If you wanted to download a picture instead, you would have to set $this->binary equal to true (1). Pictures, images, and videos are made up of binary data (1s and 0s). So the following code will download a picture and store the data in the $this->html variable.


$mySpider = new wSpider(); //// creates a new instance of the wSpider
$mySpider->binary = 1; /// turns on the binary transfer mode
$mySpider->fetchPage("http://www.msn.com/pic123.jpg"); /// fetches a picture off of the msn home page

You can then use some PHP code to save it to a database or write it out as a picture file on your hard drive. The sky is the limit as to what you can do with this data. Actually, this is the same technique that we can use to bypass Captchas, so make sure that you know how to use it.
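For example, saving the picture we just downloaded out to a file on the server could look like this (the file name is just a placeholder):

$mySpider = new wSpider(); //// creates a new instance of the wSpider
$mySpider->binary = 1; /// turns on the binary transfer mode
$mySpider->fetchPage("http://www.msn.com/pic123.jpg"); /// fetches the picture

file_put_contents("pic123.jpg", $mySpider->html); /// writes the raw binary data to a file on your server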

So we are off to a good start. What you have learned here is crucial to understand if you are planning on doing some large-scale scraping projects. Creating reusable code can sometimes take longer in the beginning, but if you do it right… you can create spiders very quickly and use the code on all your future projects.

In the next tutorial we will talk about how to manipulate the data that you have just pulled, as well as how to extend our class to be more functional. Happy Scraping!! :)

Other Web Spider Tutorials:

Scraping Web Pages with cURL Tutorial- Part 2

Build a Web Spider – Part 1

Building a Web Spider – Part 2

**************************************************************************

* Looking for a comprehensive course on Web Page Scraping?

* Let me know your interest by commenting on the SpyderSchool Post

**************************************************************************

Scraping Websites With cURL

Web Page Scraping is a hot topic of discussion around the Internet as more and more people are looking to create applications that pull data in from many different data sources and websites.

In my other tutorials, I talked about using PHP's file_get_contents function to pull a web page and download the information into a string variable for later manipulation. This method of pulling data off of the web works very well when you are dealing with only text and HTML.
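As a quick refresher, that approach looks something like this (the URL is just an example):

/// Pull the entire web page into a string for later manipulation
$html = file_get_contents("http://www.examplesite.com/example.html");
echo $html; /// print the HTML to the screen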

But what if you wanted to download pictures, graphics, or video off a number of websites and store them on your server? This is where PHP's file_get_contents starts to fall short.

Introducing cURL!

cURL is a command-line tool and library for transferring files with URL syntax, which means that we can transfer almost any type of file using it. Most, but not all, web servers already have the cURL module installed, so you won't have to do anything to begin using this powerful library.

cURL has the ability to transfer files using an extensive list of protocols, including:

  • FTP
  • FTPS
  • HTTP
  • HTTPS
  • TFTP
  • SCP
  • SFTP
  • Telnet
  • DICT
  • FILE
  • LDAP

As you can see, cURL can not only use the HTTP protocol (which is what PHP's file_get_contents function uses), but also the FTP protocol, which can prove very useful if you want to create a web spider that uploads files to a server automatically or FTPs videos to video sharing sites.
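As a taste of what that looks like, here is a rough sketch of an FTP upload using cURL in PHP (the host, login, and file names are all placeholders, and we will cover the cURL functions properly in the next tutorials):

$localfile = "video.flv"; /// the local file we want to upload
$fp = fopen($localfile, "r"); /// open it for reading

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "ftp://user:password@ftp.examplesite.com/video.flv"); /// the FTP destination
curl_setopt($ch, CURLOPT_UPLOAD, 1); /// tell cURL this is an upload
curl_setopt($ch, CURLOPT_INFILE, $fp); /// the file handle to read the data from
curl_setopt($ch, CURLOPT_INFILESIZE, filesize($localfile)); /// the size of the file in bytes
curl_exec($ch); /// perform the transfer
curl_close($ch);
fclose($fp);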

The good news is that cURL is so powerful that it can do almost everything you will ever need when it comes to web page scraping. The downside is that cURL can be very tricky to deal with, because there are a tremendous number of options to set and pitfalls to sidestep.

What I hope to do in this series of tutorials is show you how to work with cURL and how to create your own web scraping class in PHP so you can reuse the code time and time again. So let's begin…

cURL and Your Web Server

As I mentioned, most of the time cURL is already set up on your web server if you are using a hosted plan. (Sometimes on the “cheaper” plans cURL is disabled, so contact your administrator to see if they will enable it for you.)

I personally do most of my web page scraping using my local web server. That’s right, you don’t even need to pay for a hosted server to scrape web pages. All you need is a computer and a web server like Xampp!

If you are using Xampp, like I recommended in my tutorial Creating a Local Development Environment, you will need to enable the cURL module in PHP.

To do this, go to the php.ini file in your Xampp/php folder and the Xampp/apache/bin folder and uncomment the “php_curl.dll” line by removing the semicolon:

; Windows Extensions
; Note that ODBC support is built in, so no dll is needed for it.
; Note that many DLL files are located in the extensions/ (PHP 4) ext/ (PHP 5)
; extension folders as well as the separate PECL DLL download (PHP 5).
; Be sure to appropriately set the extension_dir directive.

;extension=php_apc.dll
;extension=php_apd.dll
;extension=php_bcompiler.dll
;extension=php_bitset.dll
;extension=php_blenc.dll
;extension=php_bz2.dll
;extension=php_bz2_filter.dll
;extension=php_classkit.dll
;extension=php_cpdf.dll
;extension=php_crack.dll
extension=php_curl.dll
;extension=php_cvsclient.dll
;extension=php_db.dll
;extension=php_dba.dll
;extension=php_dbase.dll
;extension=php_dbx.dll

Save the changes and restart your web server.
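If you want to double-check that the module actually loaded, a quick test script like this will tell you (just drop it into any PHP file and load it in your browser):

/// Prints a message depending on whether the cURL extension is available
if (function_exists('curl_init')) {
    echo "cURL is enabled and ready to go!";
} else {
    echo "cURL is NOT enabled - check your php.ini settings.";
}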

You are now ready to start scraping the web. In the next tutorial, I will show you how you can create your own web scraping class in PHP using cURL.

Next tutorial:

Scraping Web Pages with cURL Tutorial – Part 1

**************************************************************************

* Looking for a comprehensive course on Web Page Scraping?

* Let me know your interest by commenting on the SpyderSchool Post

**************************************************************************
