Screen Scraping Tutorials and Info
I’m about to improve and redesign some of my old code that scrapes my customers’ websites for them, so I’m using this post to keep a note of any interesting nuggets of info I find.
Interesting overview of the process and example usage of CURL
http://www.devnewz.com/devnewz-3-20041221UsingPHPCURLLibrarytoScrapetheInternet.html#sort
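For my own reference, here is a minimal sketch of fetching a page with PHP's cURL functions. It is not the code from the article above; fetch_page() and the user agent string are names I made up, and the timeout and redirect limits are arbitrary assumptions.

```php
<?php
/**
 * Fetch a URL with cURL and return the response body as a string.
 * fetch_page() is a hypothetical helper, not taken from the linked article.
 */
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,     // assumption: an arbitrary redirect limit
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'example-scraper/0.1', // made-up user agent
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Usage (needs network access, so it is left commented out):
// $html = fetch_page('http://example.com/');
```

CURLOPT_RETURNTRANSFER is the option that matters most for scraping: without it curl_exec() prints the page instead of returning it.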
Covers usage examples of file() and preg_match_all(), and of putting the info into an RSS feed. [Note: there is a little alarm bell ringing in my head ’cos I remember reading about a security risk of using Perl regular expressions with external content - need to investigate.]
http://www.phpit.net/article/screenscrap-rss/2/
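A rough sketch of the same idea: pull headlines out of some markup with preg_match_all() and wrap them in a bare-bones RSS feed. The sample HTML, the example.com host and the regular expression are all assumptions of mine, not taken from the article.

```php
<?php
// Made-up sample markup; a real run would load it with file() or file_get_contents().
$html = <<<HTML
<div class="news">
  <h2><a href="/story/1">First &amp; foremost</a></h2>
  <h2><a href="/story/2">Second story</a></h2>
</div>
HTML;

// Capture the href and the link text of every <h2><a> pair.
preg_match_all('~<h2><a href="([^"]+)">(.*?)</a></h2>~s', $html, $matches, PREG_SET_ORDER);

$items = '';
foreach ($matches as $m) {
    // Decode the scraped HTML entities first, then escape again for XML output.
    $title = htmlspecialchars(html_entity_decode($m[2], ENT_QUOTES, 'UTF-8'), ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $link  = htmlspecialchars('http://example.com' . $m[1], ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $items .= "    <item><title>{$title}</title><link>{$link}</link></item>\n";
}

$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Scraped headlines</title>
    <link>http://example.com/</link>
    <description>Example feed built from scraped markup</description>
{$items}  </channel>
</rss>
XML;
```

A pattern this strict breaks as soon as the site changes its markup, which is the usual weakness of regex-based scraping.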
Nice example usage of a scraping and caching class that allows for a time-to-live property to ensure pages are not scraped too often. The class looks like it was not written for PHP5, though (it uses the class name for the constructor). It’s probably forward-compatible, but I don’t have time to play around and find out whether there are any PHP5 nuances that I’d have to deal with.
http://www.tgreer.com/class_http_php.html
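To remind myself how the time-to-live idea works, here is a small sketch. It is not the API of the class linked above: PageCache, its get() method and the file naming are hypothetical, and it uses __construct() (the PHP5 constructor form) rather than the class name. The fetcher is a stand-in closure so the example runs without touching the network.

```php
<?php
/**
 * Tiny file cache with a time-to-live. PageCache and its methods are
 * hypothetical names for this sketch, not the API of the linked class.
 */
class PageCache
{
    private $dir;
    private $ttl;

    public function __construct(string $dir, int $ttlSeconds)
    {
        $this->dir = rtrim($dir, '/');
        $this->ttl = $ttlSeconds;
    }

    /** Return the cached copy if it is fresh; otherwise call $fetch and store the result. */
    public function get(string $url, callable $fetch): string
    {
        $file = $this->dir . '/' . md5($url) . '.cache';
        if (is_file($file) && (time() - filemtime($file)) < $this->ttl) {
            return file_get_contents($file);
        }
        $body = $fetch($url);
        file_put_contents($file, $body, LOCK_EX);
        return $body;
    }
}

// Usage with a stand-in fetcher that counts how often it is called.
$calls = 0;
$fetch = function (string $url) use (&$calls): string {
    $calls++;
    return "<html>content of {$url}</html>";
};

$cacheDir = sys_get_temp_dir() . '/scrape-cache-' . uniqid();
mkdir($cacheDir);
$cache  = new PageCache($cacheDir, 3600);
$first  = $cache->get('http://example.com/', $fetch);
$second = $cache->get('http://example.com/', $fetch); // served from the cache file
```

The second call never reaches the fetcher, so the remote site is hit at most once per hour for each URL.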
This one covers using PHP5 and PHP’s DOM functions
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
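A minimal sketch of scraping links with PHP's DOM functions, assuming a made-up sample page rather than the article's own example:

```php
<?php
// Made-up sample page.
$html = '<html><body><a href="/a">A</a> <a href="http://example.org/b">B</a> '
      . '<a name="x">no href</a></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // Skip anchors that have no href attribute.
    if ($a->hasAttribute('href')) {
        $links[] = ['href' => $a->getAttribute('href'), 'text' => trim($a->textContent)];
    }
}
```

Unlike a regular expression, this keeps working when attributes are reordered or the markup is spread over several lines.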
A handy link suggested by merchantos.com: http://devzone.zend.com/node/view/id/1081#Heading4
The merchantos link sent me to this next one too. This is interesting because it covers the use of PHP’s mb_convert_encoding() function, which the author uses to convert UTF-8 into HTML-ENTITIES. I had never heard of this function before. He also mentions a quote from Jamie Zawinski: “You have a problem and you think, I know, I’ll use Regular Expressions… now you have two problems.” I totally agree with it, and it highlights one of the underlying reasons why I want to redesign my scraping system.
http://www.russellbeattie.com/blog/using-php-to-scrape-web-sites-as-feeds
UPDATE - 02 Jan 2008 - WARNING - a liberal/naive use of the mb_convert_encoding() function can cause lots of problems if you do not understand character encodings (I have written about these issues here).
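A sketch of why the entity conversion is useful, assuming the page really is UTF-8: DOMDocument::loadHTML() treats input without a declared charset as ISO-8859-1, so turning non-ASCII characters into entities first keeps them intact. Note that I have swapped in mb_encode_numericentity() here, because passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2; the sample markup is made up.

```php
<?php
// Assumption: the scraped page is UTF-8. "\xC3\xA9" is the UTF-8 encoding of é.
$html = "<p>caf\xC3\xA9</p>";

// Turn every non-ASCII character into a numeric entity.
// Map format: [first code point, last code point, offset, mask].
$map   = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $map, 'UTF-8');

// The markup is now pure ASCII, so the parser cannot misread its encoding.
$doc = new DOMDocument();
$doc->loadHTML($ascii);
$text = $doc->getElementsByTagName('p')->item(0)->textContent; // UTF-8 again
```

This only works if the input encoding is known. Running it on text that is not UTF-8 is exactly the naive use the warning above is about.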
This led me on to discover the world of http://www.dapper.net, which is something I do not have time to study now, but I will put it on my todo list.
Another way of scraping, using Zend’s Zend_Http; part of the “Understanding the Zend Framework” series on IBM’s website
http://www-128.ibm.com/developerworks/library/os-php-zend4/index.html
XPath
The merchantos page, which has proven to be really informative, led me on to the concept of XPath.
I have managed to survive without knowing about it for quite some time, but I have seen whole books on XPath sitting on bookstore shelves and never even been tempted to buy. Looks like I’ll add them to my reading list!
Anyway here’s a nice page to get me started:
XPath in Five Paragraphs
http://www.rpbourret.com/xml/XPathIn5.htm
and the W3C info:
http://www.w3.org/TR/xpath
Good ol’ wikipedia
http://en.wikipedia.org/wiki/XPath
xpath examples cheatsheet
http://doc.ddart.net/xmlsdk/htm/xpath_syntax2_3prn.htm
another xpath tutorial
http://contentwithstyle.co.uk/Articles/8/
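To get a feel for it, here is a small XPath sketch using PHP's DOMXPath. The markup, class names and prices are invented for illustration.

```php
<?php
// Made-up sample page with two product blocks.
$html = <<<HTML
<html><body>
  <div id="menu"><a href="/home">Home</a></div>
  <div class="product"><h3>Widget</h3><span class="price">9.99</span></div>
  <div class="product"><h3>Gadget</h3><span class="price">19.50</span></div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath    = new DOMXPath($doc);
$products = [];
foreach ($xpath->query('//div[@class="product"]') as $div) {
    // The second argument makes the expression relative to the current <div>.
    $name  = $xpath->evaluate('string(h3)', $div);
    $price = $xpath->evaluate('string(span[@class="price"])', $div);
    $products[$name] = (float) $price;
}
```

The expression //div[@class="product"] reads as "any div, anywhere in the document, whose class attribute is exactly product", which is a lot easier to maintain than the equivalent regular expression.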
Tidy
Now that I’ve just found out about curl/xpath etc. and spent a good few hours looking into it, along comes Ian Van Ness and tells me about the Tidy plugin for PHP, in a comment on the merchantos.com site. However, Tidy is not installed in PHP by default.
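Since the extension may be missing, a sketch like the following checks for it first and falls back to the raw markup. tidy_or_raw() is a hypothetical helper name, and the two config options are just the ones I would try first.

```php
<?php
/**
 * Clean up messy markup with Tidy when the extension is loaded;
 * otherwise hand the markup back untouched.
 * tidy_or_raw() is a hypothetical helper for this sketch.
 */
function tidy_or_raw(string $html): string
{
    if (!extension_loaded('tidy')) {
        return $html;
    }
    // Assumption: XHTML output with no line wrapping suits later DOM parsing.
    $config = ['output-xhtml' => true, 'wrap' => 0];
    $clean  = tidy_repair_string($html, $config, 'utf8');
    return $clean === false ? $html : $clean;
}

$messy = '<p>Hello <b>world</p>';   // the <b> tag is never closed
$clean = tidy_or_raw($messy);
```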
Also, in my travels, I stumbled across an HTML parser from WACT
http://wact.sourceforge.net/api/WACT/HTMLParser.html
Handy Tools
Also just a reminder to myself: Firefox has a nice DOM inspector that comes in handy when scraping sites.
Another one dug up:
The relative-to-absolute URL converter is of note - it takes an absolute URL and a relative URL and combines them to find a new absolute URL
http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
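Here is my own simplified sketch of that kind of converter, not the code from the page above. rel_to_abs() is a hypothetical name, and it only handles the common cases: it ignores user info in the base URL and other corner cases of RFC 3986.

```php
<?php
/**
 * Resolve $relative against the absolute URL $base.
 * Simplified sketch; rel_to_abs() is a hypothetical name.
 */
function rel_to_abs(string $base, string $relative): string
{
    // Already absolute (has a scheme): nothing to do.
    if (is_string(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }
    $b = parse_url($base);
    // Protocol-relative URL such as //cdn.example.net/x.js: reuse only the scheme.
    if (strpos($relative, '//') === 0) {
        return $b['scheme'] . ':' . $relative;
    }
    $root     = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Split the relative URL into its path and its ?query/#fragment suffix.
    $cut     = strcspn($relative, '?#');
    $relPath = substr($relative, 0, $cut);
    $suffix  = substr($relative, $cut);

    if ($relPath === '') {
        // Only a query or fragment was given: keep the base path.
        $path = $basePath;
        if (($suffix === '' || $suffix[0] === '#') && isset($b['query'])) {
            $suffix = '?' . $b['query'] . $suffix;
        }
    } elseif ($relPath[0] === '/') {
        $path = $relPath;                       // root-relative
    } else {
        // Drop the file name of the base path and keep its directory.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }

    // Remove "." and ".." segments.
    $segments = explode('/', $path);
    $last     = end($segments);
    $out      = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);                // never pop the leading empty segment
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    if ($last === '.' || $last === '..') {
        $out[] = '';                            // keep the trailing slash
    }
    return $root . implode('/', $out) . $suffix;
}

$abs = rel_to_abs('http://example.com/blog/2007/post.html', '../images/pic.png');
```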
UPDATE: 20 December 2007…
Character set issues
After playing around with this screen scraping malarkey for a while, I am putting some more links below to help solve an issue I’m getting where some of the characters that I’m scraping are coming out weird, e.g. text shows up as café instead of cafe
here are some leads:
http://www.issociate.de/board/post/466818/Might_be_PHP_after_all.html
http://www.joelonsoftware.com/articles/Unicode.html
How to get a web page’s content type
http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type
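One likely cause of this kind of garbling is converting text with the wrong source encoding. The sketch below reads the charset from a Content-Type header value and converts to UTF-8 once; both helper names are hypothetical, and falling back to ISO-8859-1 when no charset is declared is my own assumption.

```php
<?php
/** Pull the charset out of a Content-Type header value, or return null. */
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

/** Convert $body to UTF-8, assuming ISO-8859-1 when no charset was declared. */
function to_utf8(string $body, ?string $charset): string
{
    $charset = $charset ?? 'ISO-8859-1'; // assumption: fallback for undeclared pages
    return $charset === 'UTF-8' ? $body : mb_convert_encoding($body, 'UTF-8', $charset);
}

$latin1 = "caf\xE9";   // "café" encoded as ISO-8859-1
$good   = to_utf8($latin1, charset_from_content_type('text/html; charset=iso-8859-1'));

// The classic mistake: converting bytes that are already UTF-8 a second time.
$garbled = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');
```

Here $good holds the correct UTF-8 bytes, while $garbled is the double-encoded form that shows up on screen as cafÃ©.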
