Scraping Websites With cURL

Web page scraping is a hot topic of discussion around the Internet as more and more people look to create applications that pull data in from many different sources and websites.

In my other tutorials, I talked about using PHP’s file_get_contents function to pull a web page down into a string variable for later manipulation. This method of pulling data off the web works very well when you are dealing only with text and HTML.
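As a quick refresher, the whole technique fits in a few lines. (The URL below is just a placeholder; point it at whatever page you want to grab.)

<?php
// Pull the entire page down into a string for later manipulation.
$html = file_get_contents('http://www.example.com/');

if ($html === false) {
    die('Could not retrieve the page.');
}

echo strlen($html) . " bytes retrieved\n";
?>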

But what if you wanted to download pictures, graphics, or video from a number of websites and store them on your server? This is where PHP’s file_get_contents cannot help us.

Introducing cURL!

cURL is a command-line tool and library for transferring files with URL syntax, which means that we can transfer almost any type of file with it. Most, though not all, web servers already have the cURL module installed, so you won’t have to do anything special to begin using this powerful library.
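As a taste of what that means in PHP, here is a minimal sketch that pulls an image straight down to a file on your server. (The URL and file name are just placeholders.)

<?php
// Open a handle to the remote image and a local file to write it to.
$ch = curl_init('http://www.example.com/images/logo.png');
$fp = fopen('logo.png', 'wb');

// CURLOPT_FILE tells cURL to write the response body to our file
// pointer instead of echoing it to the browser.
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, false);

curl_exec($ch);

curl_close($ch);
fclose($fp);
?>

Notice the pattern: initialize a handle, set options, execute, clean up. Every cURL script in this series will follow it.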

cURL has the ability to transfer files using an extensive list of protocols, including:

  • FTP
  • FTPS
  • HTTP
  • HTTPS
  • TFTP
  • SCP
  • SFTP
  • Telnet
  • DICT
  • FILE
  • LDAP

As you can see, cURL can use not only the HTTP protocol (which is what PHP’s file_get_contents function uses) but also the FTP protocol, which can prove very useful if you want to create a web spider that uploads files to a server automatically or FTPs videos to video sharing sites.
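For instance, a bare-bones FTP upload sketch might look like the following. (The host, credentials, and file names are placeholders you would swap for your own.)

<?php
$localFile = 'video.flv';

// curl_init accepts an FTP URL just like an HTTP one; the user name
// and password can be embedded right in the URL.
$ch = curl_init('ftp://user:password@ftp.example.com/uploads/video.flv');
$fp = fopen($localFile, 'rb');

// CURLOPT_UPLOAD switches cURL into upload mode; CURLOPT_INFILE and
// CURLOPT_INFILESIZE tell it what to send and how big it is.
curl_setopt($ch, CURLOPT_UPLOAD, true);
curl_setopt($ch, CURLOPT_INFILE, $fp);
curl_setopt($ch, CURLOPT_INFILESIZE, filesize($localFile));

curl_exec($ch);

curl_close($ch);
fclose($fp);
?>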

The good news is that cURL is so powerful that it can do most everything you will ever need to do when it comes to web page scraping. The downside is that cURL can be very tricky to deal with, because there are a tremendous number of options to set and pitfalls to sidestep.
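To give you an idea of what those options look like, here is a sketch of a typical scraping request; each curl_setopt call below is one of the many knobs we will be digging into in this series:

<?php
$ch = curl_init('http://www.example.com/');

// Hand the page back as a string instead of printing it out.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Follow any redirects the server sends back.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// Some sites refuse requests that do not send a user agent string.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible)');

// Give up after 30 seconds so a slow site cannot hang the script.
curl_setopt($ch, CURLOPT_TIMEOUT, 30);

$html = curl_exec($ch);
curl_close($ch);
?>

Forget CURLOPT_RETURNTRANSFER and curl_exec dumps the page straight to the output; forget CURLOPT_TIMEOUT and one dead server can stall your whole spider. Those are the kinds of pitfalls I mean.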

What I hope to do in this series of tutorials is show you how to work with cURL and how to create your own web scraping class in PHP so you can reuse the code time and time again. So let’s begin…

cURL and Your Web Server

As I mentioned, most of the time cURL is already set up on your web server if you are using a hosted plan. (Sometimes on the “cheaper” plans cURL is disabled, so contact your administrator to see if they will enable it for you.)

I personally do most of my web page scraping using my local web server. That’s right, you don’t even need to pay for a hosted server to scrape web pages. All you need is a computer and a web server like Xampp!

If you are using Xampp, as I recommended in my tutorial Creating a Local Development Environment, you will need to enable the cURL module in PHP.

To do this, go to the php.ini file in your Xampp/php folder and in your Xampp/apache/bin folder, and uncomment the “extension=php_curl.dll” line in each by removing the semicolon:

; Windows Extensions
; Note that ODBC support is built in, so no dll is needed for it.
; Note that many DLL files are located in the extensions/ (PHP 4) ext/ (PHP 5)
; extension folders as well as the separate PECL DLL download (PHP 5).
; Be sure to appropriately set the extension_dir directive.

;extension=php_apc.dll
;extension=php_apd.dll
;extension=php_bcompiler.dll
;extension=php_bitset.dll
;extension=php_blenc.dll
;extension=php_bz2.dll
;extension=php_bz2_filter.dll
;extension=php_classkit.dll
;extension=php_cpdf.dll
;extension=php_crack.dll
extension=php_curl.dll
;extension=php_cvsclient.dll
;extension=php_db.dll
;extension=php_dba.dll
;extension=php_dbase.dll
;extension=php_dbx.dll

Save the changes and restart your web server.
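A quick way to confirm the change took hold is to check for the extension from a PHP script:

<?php
// If the cURL extension loaded correctly, this prints the version
// of libcurl that PHP is linked against.
if (extension_loaded('curl')) {
    $info = curl_version();
    echo 'cURL is enabled, version ' . $info['version'];
} else {
    echo 'cURL is NOT enabled - check your php.ini settings.';
}
?>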

You are now ready to start scraping the web. In the next tutorial, I will show you how you can create your own web scraping class in PHP using cURL.

Next tutorial:

Scraping Web Pages Using cURL Tutorial – Part 1

**************************************************************************

* Looking for a comprehensive course on Web Page Scraping?

* Let me know your interest by commenting on the SpyderSchool Post

**************************************************************************


