Scraping Web Pages with cURL Tutorial – Part 2

In Scraping Web Pages with cURL Tutorial – Part 1, I demonstrated how to create a web spider class that uses the cURL library to transfer any type of data from the web directly to your server.

In this tutorial we are going to talk about how to parse that data into some sort of usable form by extending our wSpider class functionality.

The key to scraping web pages is to first understand how a web page is laid out and its resulting HTML structure. It is this HTML structure that allows our spider to identify and scrape the part of the web page that you are interested in. So let’s take a look at some example HTML and review what kind of HTML tags we may encounter with our spider.

Below is an example of the HTML that you might see on a typical web page:

<html>
<head>
<title>My Web Page Title</title>
<meta name="keywords" content="key1,key2,key3" >
</head>
<body>
<h1>Header 1</h1>
<div id="mypics">
<div class="picclass">
<img src="mypic.jpg" width="100" height="100">
</div>
<div class="picclass">
<img src="mypic2.jpg" width="100" height="100">
</div>
</div>
<a href="nextpage.php">Goto Next Page</a>
</body>
</html>

As you can see, most parts of the web page are enclosed between an opening tag ( <..> ) and a closing tag ( </..> ); a few tags, such as <img> and <meta>, stand alone. Every web page will have these two primary sections:

  1. <head> section – The content between the opening and closing tags provides information about the website/web page, including the title, keywords, etc.
  2. <body> section – This section contains all of the visual items of a web page including text, images, links, tables, and a container type item called a DIV.

Most of our attention will be focused on the <body> section, as this is where we find a lot of the “juicy” content that we might want to scrape. In a later tutorial, I will show you how to create a competition analysis spider working mostly in the <head> section.

Within the <body> tags of our example website we find the <h1>, <div>, <img>, and <a> tags, which we will pass to our spider to scrape either the content between these tags or an attribute of these tags. For example, the link tag <a> has text between its opening and closing tags as well as an HREF attribute, which tells us the URL that will be loaded when the link is clicked.

So let’s extend the wSpider class that we began to build in part 1 of this tutorial by creating a function that will take the HTML stored in our $this->html string and pull out these different tags. I am going to call this function parse_array because I want to take all the occurrences of the supplied tag and store them in an array.

Below is the code for our parse_array() function:


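function parse_array($beg_tag, $close_tag)
{
    // s: "." also matches newlines; i: ignore case; U: ".*" stops at the first closing tag
    preg_match_all("($beg_tag.*$close_tag)siU", $this->html, $matching_data);

    // Index 0 holds the full text of every match
    return $matching_data[0];
}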

The above function takes two parameters, the beginning tag ($beg_tag) and the ending tag ($close_tag), and captures everything from the beginning tag through the ending tag. This happens every time the function finds this particular configuration in the HTML.

The real workhorse of this function is the preg_match_all function that is part of PHP. The preg_match_all function takes a regular expression, searches a string for all of its occurrences, and then extracts them into an array for us (in this case an array called $matching_data).

The regular expression that we use is the “($beg_tag.*$close_tag)siU” portion of the function above, which starts by looking for the beginning tag supplied by the user. The next part of the expression is the .* where the . matches any character and the * repeats it. On its own, .* is “greedy” and grabs as much as it can, but the U flag at the end reverses that, so the match stops at the first closing tag it finds. The last part of the expression is the closing tag, once again supplied by the user, followed by the flags: s lets the . match line breaks, i ignores case, and U is the ungreedy switch just described. We will talk more about these in a later tutorial on regular expressions.

Once all this work is done the function will return the array back to where we called it from.
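To see what the U flag changes, here is a minimal sketch that runs the same pattern with and without it. The sample string is made up for illustration and is not part of the example page.

```php
<?php
// Hypothetical sample input: two links on a single line.
$html = '<a href="one.php">One</a> <a href="two.php">Two</a>';

// Without U, .* is greedy and runs to the LAST closing tag: one long match.
preg_match_all('(<a.*</a>)si', $html, $greedy);

// With U, .* is lazy and stops at the FIRST closing tag: one match per link.
preg_match_all('(<a.*</a>)siU', $html, $lazy);

echo count($greedy[0]), "\n"; // 1
echo count($lazy[0]), "\n";   // 2
```

Without the U flag the spider would hand back one giant string containing both links, which is why parse_array() includes it.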

Let’s take a quick look at how we can use our new parse_array() function to scrape all of the links from our example HTML above.

The first thing we need to know is how a link is structured in HTML, so we can send our function the beginning tag and the end tag. A link’s structure looks like this from our example:

<a href="nextpage.php">Goto Next Page</a>

I can see that the link begins with a “<a” and ends with a “</a>”. If I use these as my start and end tags, then the function should return an array with one element, and inside that element will be the entire HTML for the link above.
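If you want to check this without the spider class, the sketch below applies the same regular expression to a trimmed copy of the example page. The function name parse_tags() is a stand-in I made up for this test; it is not part of wSpider.

```php
<?php
// Standalone stand-in for the parse_array() method, so it runs without wSpider.
// parse_tags() is a hypothetical name used only for this sketch.
function parse_tags($html, $beg_tag, $close_tag)
{
    preg_match_all("($beg_tag.*$close_tag)siU", $html, $matching_data);
    return $matching_data[0];
}

// Trimmed copy of the <body> from the example page above.
$html = <<<HTML
<body>
<h1>Header 1</h1>
<div id="mypics">
<div class="picclass">
<img src="mypic.jpg" width="100" height="100">
</div>
</div>
<a href="nextpage.php">Goto Next Page</a>
</body>
HTML;

$links = parse_tags($html, '<a', '</a>');
print_r($links); // one element: the full <a>...</a> HTML
```

One thing to keep in mind: the beginning tag “<a” also matches any other tag whose name starts with an a, such as <abbr>, so on a real page you may want to pass “<a ” (with the trailing space) instead.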

Putting it all together, below is the code that I would use to download all the links from my example web page. I am also going to put in a foreach loop that will be used to do whatever I want with each link. This loop will be important to us when we create a spider that can crawl multiple web pages and websites without further user interaction.

$myspider = new wSpider();
$myspider->fetchPage("http://wwww.examplesite.com/example.html");
$linkarray = $myspider->parse_array("<a", "</a>");

foreach ($linkarray as $result) {
    // store each $result in database or create a new spider to spider next page
}

As you can see from the code above, we create a wSpider instance by using the new keyword. Next we get the HTML by calling the fetchPage() function, passing the target URL as its argument (see part 1 of this tutorial if you don’t understand). This function stores the resulting HTML in the $this->html variable within the wSpider class instance.

The next line of code assigns the result of our parse_array() function to another variable called $linkarray, which we use in a foreach loop at the bottom of this code. Remember, what has been returned into the $linkarray array is the entire HTML of each link, including the <a> tags, so you will still have to pull out the specific piece of information you want.

PHP has a lot of great functions that can help us grab the information we want. For instance, if we wanted to grab the anchor text of these links, we could use the strip_tags() function in PHP to grab the text between the opening <a> and closing </a> tags (this can be used for anything that has an opening and closing tag).
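As a rough sketch of that last step, the code below pulls both the anchor text and the HREF value out of one link. The regular expression for the HREF is my own assumption and only handles a double-quoted value, as in the example HTML.

```php
<?php
// One element as parse_array() would return it for the example page.
$link = '<a href="nextpage.php">Goto Next Page</a>';

// Anchor text: strip_tags() removes the tags and keeps the text between them.
$anchor_text = strip_tags($link);

// HREF attribute: a small regex that assumes a double-quoted value.
// Real pages may use single quotes or no quotes at all.
$href = null;
if (preg_match('/href="([^"]*)"/i', $link, $m)) {
    $href = $m[1];
}

echo $anchor_text, "\n"; // Goto Next Page
echo $href, "\n";        // nextpage.php
```

Inside the foreach loop you would run the same two steps on each $result.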

Once we have the information we want then we can do a number of things such as save the information to a database, spawn a new spider to scrape the next page, or to record what the anchor text is for each link (for SEO purposes) as explained above.

In the next tutorial, we will continue to develop our spider class to include a number of functions that will allow us to make repetitive tasks very simple, and how to use these new functions to create a simple spider that can crawl an entire website.

11 Responses

  1. I think the following needs to be fixed up ..

    $linkarray = $mySpider->parse_array(“<a”, ““);

    You’re missing the capital S and the / for the closing tag

  2. You’re right about the closing tag missing the “/”, however I did not include a capital S in the $myspider variable.

    Good Catch :)

  3. Hey, spyderwebtech, I cant figure a way to contact you directly, So I’d figure Id try this approach. I really dig what you do here at your site, more so, I dig the idea of the school for ’scraping’. Like you I am at worlds end trying to make my milly ($1,000,000) on the net in a few weeks time. I recently have been developing this site which is essentially everything a creative person could ask for, but only the best of it — Tuts, Audio Files, Brushes, Vectors, Flash, etc… Problem is I want quality and lots of it, I really want to donate many hours of my time helping prepare your school and offer it via my site — people will sign up, I work at an ad agency and know a killer copyrighter whom I can ask a favor for as far as writing copy for the documents, I can what utilities to run the learning environment in, and Im even willing to invest in some of the ‘internet’ marketing to extend the brand of the app. Additionally, I would love to post a section on my site about your blog. Or make you an author on my site…..

    Who knows…..
    Just ideas

  4. rhickman,

    email is spyderwebtech.wp [at] gmail.com

  5. nice article mate…. hoping for the next one….please, dont make me to wait.

  7. Please tell me how to build a web crawler to search all internal link in the website..Please tell me, i’am waiting for your answer

  8. I’am waiting for your answer about my question last day..Please.

  9. how come this crap doesn’t work for me… I think is the same guy putting the replies on this page

  10. it would be more efficient and versatile to parse your DOM with XPath rather than preg

    • Yes, DOM parsing with XPath is a strategy that can be used to parse a web page. Whether it is more efficient depends on the target website, since there is a certain amount of overhead in creating the DOM in memory.
