Building A Web Spider – Part 1

Whether you want to gather data for a database-driven web site or find out which links on your site are broken, a web spider is a great tool for getting repetitive tasks done. Scraping web pages is also a great way to gather content and ideas for web sites that you want to build. You can build a web spider to check the current prices of your favorite stocks, or even to get the balance of your bank account.

I use web spiders all the time to build databases for my database sites… I have even created them to get my email for me and respond if certain conditions are met. The secret to understanding how to build a great web spider is to first understand the environment that it must work in.

What do I mean by this? Well, first you must understand that a web spider is simply a computer program. Computer programs are based upon rules (wait, didn’t I hear this in The Matrix?) and they will blindly follow those rules until we tell them to stop. So for our internet spider to complete the task that we ask of it, we must first create the rules that it will follow. And of course, to create the rules we must first understand the environment.

To illustrate this point, let’s say that we want to create a web spider to download our favorite stock quote from Yahoo (substitute MSN, or Etrade if you like). Let’s write some pseudo-code for the way we would do this manually:

  1. Go to the main page of Yahoo Finance
  2. Fill in the form on the front page with our stock symbol – MSFT
  3. Hit Enter
  4. Look to see the stock price which is in large font and red

Now let’s see what kinds of questions may arise when we start creating a web spider from this simple process.

  1. Can I bypass the form by entering a dynamic url (say finance.yahoo.com/stockquote.asp?stocksym=MSFT)?
  2. If I have to use the form, what are the form variables that I need to send and how (GET vs POST)?
  3. Do I need to store cookies or session variables?
  4. How do I get the stock price from the web page after downloading?
  5. The list goes on and on.
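
To make question 2 a little more concrete, here is a rough sketch of how the same form field could be sent either way from PHP. The address example.com/quote and the field name stocksym are placeholders I made up for illustration; a real site will have its own form address and field names, which you find by reading its HTML.

```php
<?php
// Sketch only: "stocksym" and example.com are hypothetical placeholders.

// Build stream-context options that send form fields as a POST request.
function buildPostOptions(array $fields): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
        ],
    ];
}

// GET: the form fields travel in the URL itself.
$getUrl = 'http://example.com/quote?' . http_build_query(['stocksym' => 'MSFT']);

// POST: the same fields travel in the request body instead.
$options = buildPostOptions(['stocksym' => 'MSFT']);
$context = stream_context_create($options);

echo $getUrl, "\n";                      // http://example.com/quote?stocksym=MSFT
echo $options['http']['content'], "\n"; // stocksym=MSFT

// To actually send the POST (not run here):
// $page = file_get_contents('http://example.com/quote', false, $context);
```

With GET you can see the variables right in the address bar, which is why it is the easier case for a spider. With POST you have to read the form's HTML to learn what to send.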

Notice that even this simple example of getting a stock quote raises questions that need to be answered before you can create the rules for your web spider. And the best way to get these answers is by going directly to the source… the web page you are trying to scrape. So let’s get started!

To really figure out how a web site works we are going to need to look at the HTML code as well as the URLs that appear in the address bar of your browser. So that is where we are going to start.

Navigate to Yahoo Finance at finance.yahoo.com

You will notice the base URL is finance.yahoo.com; you can find this at the top of the browser. Next, fill in the form with the stock that you want to look up (MSFT) and hit the Enter key. You will notice now that the URL has changed to:

http://finance.yahoo.com/q?s=msft

Apparently the portion of the URL that selects the stock being viewed is the “q?s=” portion. Therefore, I can look up any stock that I choose by putting a different value after “q?s=”. In this case it is very simple to create a spider that looks up a lot of stocks by feeding different stock symbols into this URL and then downloading the page.
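
As a quick sketch of that idea, the following builds one URL per stock symbol using the “q?s=” pattern from the URL above. The helper name quoteUrl() and the list of symbols are my own picks for illustration, and the site could change its URL layout at any time, so treat this as a starting point rather than a finished spider.

```php
<?php
// Sketch: build one quote URL per symbol from the "q?s=" pattern.
// quoteUrl() and the symbol list are made up for illustration.

function quoteUrl(string $symbol): string
{
    // urlencode() keeps unusual characters in a symbol from breaking the URL
    return 'http://finance.yahoo.com/q?s=' . urlencode(trim($symbol));
}

$symbols = ['MSFT', 'GOOG', 'IBM'];

foreach ($symbols as $symbol) {
    echo quoteUrl($symbol), "\n";
}
```

This only prints the URLs. Downloading each page is a separate step.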

So how do you download a page from the internet? Well PHP has a lot of really cool ways of doing this. I am only going to show you one quick way now. That is to use PHP’s file_get_contents() function.

Let’s say we want to get Microsoft’s stock page. Use the following code:

<?php

$url = "http://finance.yahoo.com/q?s=MSFT"; // This is the URL for Microsoft

// The next line will download the HTML and put it into a variable called $page
$page = file_get_contents($url);

echo $page; // will print the HTML onto your page

?>
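
One thing the code above does not do is check whether the download worked. file_get_contents() returns false when it cannot fetch the page, so a slightly more careful version might look like the sketch below. The helper name downloadPage() is hypothetical, and passing the URL on the command line is just one convenient way to try it.

```php
<?php
// Sketch: check the result of file_get_contents() before using it.
// downloadPage() is a hypothetical helper name.

function downloadPage(string $url): ?string
{
    // @ hides the PHP warning; failure is reported by returning null instead
    $page = @file_get_contents($url);

    return $page === false ? null : $page;
}

// Usage from the command line:
//   php fetch.php "http://finance.yahoo.com/q?s=MSFT"
if (isset($argv[1])) {
    $page = downloadPage($argv[1]);
    echo $page === null ? "Download failed\n" : $page;
}
```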

The file_get_contents() function might not be able to open URLs on certain servers, because that ability depends on PHP’s allow_url_fopen setting, which some hosts turn off. HostGator.com is where I have my hosting accounts and I know that it works there. I used to use GoDaddy.com and I know that their cheaper shared hosting won’t allow using file_get_contents() this way.
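
If you are not sure what your host allows, a sketch like the following can check the allow_url_fopen setting and fall back to the cURL extension when it is off. The helper names urlFopenEnabled() and fetchUrl() are my own, and whether cURL is installed depends on your hosting package.

```php
<?php
// Sketch: pick a download method based on what the host allows.
// urlFopenEnabled() and fetchUrl() are hypothetical helper names.

function urlFopenEnabled(): bool
{
    // ini_get() returns a string such as "1", "On" or ""; normalise it to a boolean
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

function fetchUrl(string $url): ?string
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled
    if (urlFopenEnabled()) {
        $page = @file_get_contents($url);
        return $page === false ? null : $page;
    }

    // Otherwise fall back to the cURL extension, if it is installed
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        $page = curl_exec($ch);
        return $page === false ? null : $page;
    }

    return null; // no way to download on this server
}
```

Calling fetchUrl() with the Microsoft URL from above would then give you either the page HTML or null, whichever method the server ends up using.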

In my next post, I am going to show you how to extract the values that you want from the page that you just downloaded. So until then play around with the above code by changing the value of the stock symbol.

**************************************************************************

* Looking for a comprehensive course on Web Page Scraping?

* Let me know your interest by commenting on the SpyderSchool Post

* ************************************************************************


15 Responses

  1. nice post, thanks!

  2. Thanks a lot man. I have been looking for something like this for a hell of a long time.

    THANKS!

  3. No worries.. Happy New Year!

  4. Great Post. I have been looking all over for a tutorial like this. I have been trying to learn how to scrape pages and I have had a rough time till now.

  5. u the man dont stop

  6. For someone that doesn’t know PHP from a hole in the wall, how does this actually work? I tried using this within a WordPress template and got rebuked. :(

  7. You won’t be able to use this script in wordpress. WordPress doesn’t like PHP code in the posts, comments, or anywhere else. Templates usually can take PHP but it depends on the web hosting that you use.

    A pre-req for this program to work is that you must have the permissions to use the file_get_contents function which is not always permitted on some cheaper web hosting packages. So do a quick test to figure out if this will work for you. Example:

    If you are looking at the MSN page then you have permissions to use this function, if not then upgrade to a better package.

  8. Nice post. I am interested in learning more but don’t see a Building a web spider Part II.

  9. Thx for the tutorial !

  10. If you cannot use file get contents you have a couple options.

    1. If you have access to php.ini, enable it. Either through a web manager or over ssh, modify the file and set allow_url_fopen to On.

    2. Use curl.

  11. Nice. I will try it.

  12. but i can’t get the php to work, because of the two forward quotes following the http: it gives me an error because it thinks that it is a php comment

  13. Just replace the “ and ” with " or ' and it’ll work fine.

  14. I have the same experience when I take a few days off from training. I get it back quickly, and the awkwardness does go away too. I hear some folks can come back stronger even after taking 2 weeks off. I guess it depends on the person.

  15. Hi, thank you for your tutorials. This is my first time with spiders. I am looking to building a search engine. So, I created a database, created tables for keywords, url and another containing keywords and url.
    My questions are:
    1.are spiders supposed to harvest links and save it in my database?
    2. When I write the codes for scrapping how do I execute them?
    3.Do I need to host my site for me to run spiders.

    Thank you so much for impacting the world with your Knowledge.
    God bless you abundantly.
