Parsing Flickr and Blogger Feeds with Perl and XML::Parser

Note:

Unfortunately, WordPress does not allow for clean formatting of code examples.  Therefore, formatting on this page is a bit messed up.   You can go to my website for a clean example of this.

Thanks!

You may have noticed a few places on this site where there is a table that looks something like this:

New on CJU.com

Latest from the CJU.com blog:
-The Wynn Las Vegas: Snack Anti-theft Technologies, Luxurious Suites and Free Drugs During your Stay (added: 2008-MAR-08)
-New York Giants Win; Hoboken Madness Ensues (Update: February 5, 2008) (added: 2008-FEB-05)
-The St. Regis San Francisco: When Your Hotel Room Needs to be Rebooted (added: 2008-JAN-22)
-London New Year (added: 2008-JAN-03)
-2007 Hoboken Accolades and Videos (added: 2007-DEC-27)
-Christmas Day in New York City (added: 2007-DEC-26)

Latest pics from my Flickr Photo Gallery:
-Super Bowl Sunday (added: 2008-02-17)
-iSight is Dead (added: 2008-02-17)
-Lake Tahoe, N. California and Nevada (added: 2008-01-26)
-Lake Tahoe, N. California and Nevada (added: 2008-01-26)
-Lake Tahoe, N. California and Nevada (added: 2008-01-26)
-Lake Tahoe, N. California and Nevada (added: 2008-01-26)

CJU.com Mini-sites:
The Definitive Guide to Purchasing a Snowboard
The Deep Thoughts Server – served over 20 million Jack Handey Deep Thoughts since 1996

The table above shows the last six entries posted to my blog, as well as the last six photos posted to my photostream on Flickr. If you’re reading this, you probably already know what Flickr and Blogger are, so I won’t waste our time going into that. You may not know, however, that the content you post on both of those sites can be syndicated rather easily via RSS and Atom feeds. If you’re not familiar with RSS, take a quick read of the Wikipedia article, which addresses both RSS and Atom.

There are many ways to read syndicated XML feeds: Web browsers such as Safari and Firefox have built-in support; third-party applications like NetNewsWire provide robust interfaces for reading feeds; and portals such as Google’s customized front page put feeds at your fingertips every time you fire up your browser and go to Google’s homepage. Reading RSS feeds is no problem, but what happens if you want to use the contents of these feeds for something?

The good news is that RSS and Atom feeds are both XML-based and both have standard, well-documented formats, although the technical documents can be a bit tricky when you just want to bang out a quick web app without getting into the nitty-gritty of the protocol spec. This article talks about how you can use Perl to retrieve, parse and utilize these types of feeds.

For simplicity’s sake, I’ll be focusing on the Atom format, which is easier to parse, IMO, and is the only format currently available via Blogger. Flickr, thankfully, provides the flexibility to syndicate feeds in just about any XML/RSS/Atom format available today.

Determining the URL of Your Feed

First and foremost, you must figure out the URL of the feed that you are trying to grab.

For Blogger feeds, it’s pretty simple. The format of your Blogger feed is always:

   http://yourblog.blogspot.com/atom.xml

For example, my blog URL is http://chrisjur.blogspot.com, so my Blogger Atom feed is http://chrisjur.blogspot.com/atom.xml.

For Flickr feeds, it’s a bit more complicated. Flickr allows you to specify feeds for many different attributes of the site, which can include feeds for specific users and tags. Flickr publishes a specification for how to configure your feeds. To get you started, however, it’s easiest to use your photostream feed, which represents the last ten photos that you’ve uploaded to Flickr. The feed’s URL uses the following format for Atom feeds:

   http://www.flickr.com/services/feeds/photos_public.gne?id=your_flickr_nsid&format=atom_03

where “your_flickr_nsid” should be substituted with your Flickr NSID, which is a unique identifier for you on the Flickr site. Note that this is NOT your Flickr username. You can have Flickr automatically generate this URL for you by going to your photostream and clicking on the “Feeds for (your username’s) photostream Available as RSS 2.0 and Atom” link at the bottom of the page. Copy and paste this URL and keep it in a safe place. Note that your Flickr NSID will be in the URL after the “id=” token of the query string. My feed URL, for example, looks like this:

   http://www.flickr.com/services/feeds/photos_public.gne?id=64426228@N00&format=atom_03

Required Perl Modules

There are a few ways to attack the parsing of the feed with Perl. Many people still parse XML “manually”, by writing their own parsers that use a lot of regular expressions and pattern matching. This can be a lot of work, and you’re always at risk of slight changes in the feed format throwing your parsing routines off. There are several Perl modules written for RSS and Atom feeds, such as XML::Atom, but I have found that the interfaces to these modules only allow you to pull out certain attributes of the feed, which limits what you can do with it.

Many of the RSS/Atom modules, however, are built on top of a lower-level XML parser such as Perl’s XML::Parser module, which is a generic, event-based parser based on the Expat C XML parser. For flexibility’s sake, I’ll be using XML::Parser to do all of my feed parsing. XML::Parser does not, however, retrieve feeds via HTTP, so you’ll have to do that yourself. The easiest way is to use the libwww-perl package, which provides you with the “LWP::” module namespace. Therefore, to get started, you’ll have to march down to your local CPAN and pick up XML::Parser and LWP. To check whether you have these installed, issue these two commands:

perl -MLWP::Simple -e "print;"
perl -MXML::Parser -e "print;"

They should both return nothing. If you see something like this, however, you don’t have the module installed:

Can't locate Foobar.pm in @INC (@INC contains: /System/Library/Perl/5.8.6/darwin-thread-multi-2level /System/Library/Perl/5.8.6  /Library/Perl/5.8.6/darwin-thread-multi-2level /Library/Perl/5.8.6  /Library/Perl /Network/Library/Perl/5.8.6/darwin-thread-multi-2level /Network/Library/Perl/5.8.6 /Network/Library/Perl /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level /System/Library/Perl/Extras/5.8.6 /Library/Perl/5.8.1 .). BEGIN failed--compilation aborted.
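If you would rather check from inside a script than from the command line, the same test can be sketched in Perl itself. This is only an illustration: module_available() is a hypothetical helper name of my own, and it simply attempts to load each module and reports whether that worked.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
# module_available() is a hypothetical helper, not part of any CPAN module.
sub module_available {
    my ($module) = @_;
    # Only accept plain module names, since the name is passed to a string eval.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    return eval("require $module; 1") ? 1 : 0;
}

# Report on the two modules this article depends on.
for my $module (qw(LWP::UserAgent XML::Parser)) {
    printf "%-15s %s\n", $module,
        module_available($module) ? "installed" : "MISSING";
}
```

A module reported as MISSING is one that would produce the "Can't locate ..." error shown above.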

Parsing the Feeds

We’ll start with the Flickr feed. You may just want to jump to the “Summary” part of this article if you’re not interested in how the code works, but just want to get the code and the details on the supporting files. The code we use to parse the Flickr feed is as follows:

#!/usr/bin/perl
# (c) 2005 Christopher Uriarte
# rssflckr.cgi - parses a flckr atom feed to show the most recent MAXENTRIES, links and date added

use LWP::UserAgent;
use XML::Parser;

##### Static Values ####
$MAXENTRIES = 6;
$url = 'http://www.flickr.com/services/feeds/photos_public.gne?id=64426228@N00&format=atom_03';

# Tag tracker
my $thistag = "";

# Track if we're in an <entry> block
my $entryflag = 0;

# Count the number of entries
my $count = 0;

##### Retrieve XML Data ####
my $data;
my $ua = LWP::UserAgent->new;
$ua->timeout(45);
$ua->env_proxy;

my $response = $ua->get($url);
if ($response->is_success) {
    $data = $response->content;
}
else {
    exit;
}

my $parser = XML::Parser->new(ErrorContext => 2);
$parser->setHandlers(
    Start => \&start_handler,
    End   => \&end_handler,
    Char  => \&char_handler
);
$parser->parse($data);

# We start here when we encounter an XML tag
sub start_handler {
    my $expat   = $_[0];
    my $element = $_[1];

    # If we enter an <entry> tag, we have a new element.
    # Increase the count by 1 and set the entry flag to 1
    if ($element eq "entry") {
        $count++;
        $entryflag = 1;
    }

    if ($element eq "title" && $entryflag == 1) {
        $thistag = "title";
    }

    # Grab the href attribute of the "alternate" link tag and
    # exclude service.edit links. Note that $_[3] and $_[7] depend on
    # the order in which the feed lists the attributes (rel, type, href).
    if ( ($element eq "link") and ($entryflag == 1) and ($_[3] eq "alternate") ) {
        $ENTRIES{$count}->{link} = $_[7];
    }

    if ($element eq "issued" && $entryflag == 1) {
        $thistag = "issued";
    }
}

# This is where we handle the values within the tag
sub char_handler {
    my ($p, $data) = @_;

    # Store the issued date
    if ($thistag eq "issued" && $entryflag == 1) {
        # Get the first 10 chars of the date (YYYY-MM-DD), that's all we care about
        $date = substr($data, 0, 10);
        $ENTRIES{$count}->{date} = $date;
        $thistag = "";
    }

    if ( ($thistag eq "title") and ($entryflag == 1) ) {
        $ENTRIES{$count}->{title} = $data;
        $thistag = "";
    }
    1;
}

sub end_handler {
    my $expat   = shift;
    my $element = shift;

    # If we're at the end of an <entry> block, clear the entry flag
    if ($element eq "entry") {
        $entryflag = 0;
    }
    1;
}

# Determine how many entries to display.
# If our set maximum amount of entries is less than what we encountered,
# we only display up to $MAXENTRIES
if ($MAXENTRIES < $count) {
    $MAX = $MAXENTRIES;
}
# Otherwise, we display what we encountered
else {
    $MAX = $count;
}

# Loop through the %ENTRIES hash from 1 to $MAX to display the links
for ($c = 1; $c <= $MAX; $c++) {
    $title = $ENTRIES{$c}->{title};
    $link  = $ENTRIES{$c}->{link};
    $date  = $ENTRIES{$c}->{date};
    print "-<A href=\"$link\">$title</A> (added: $date)<BR>\n";
}


Here’s a walkthrough of some of the code:

Configurable Values

These are the only two configurable values in the script. The $MAXENTRIES variable indicates the maximum number of entries you want to print out after parsing the feed. If your feed contains 100 entries, you may only wish to print out, say, 5. The $url variable specifies the URL of your feed.

Parser Setup and Timeouts

This block retrieves your XML feed via HTTP, with a timeout of 45 seconds on the retrieval, and creates the XML::Parser object. If the retrieval fails, the script exits.

The XML::Parser Start Handler

This block is the tag start handler for XML::Parser, which is the sub-routine executed when a new XML tag is encountered. For the Atom feeds, we’re really only interested in the tags contained within the <entry> and </entry> XML tags. When we find a new entry, we increment a global counter, which serves as the key into the %ENTRIES hash. We make this hash multi-dimensional by setting $ENTRIES{$count}->{link} to the URL of the photo, which arrives as $_[7] in the handler’s argument list for that entry’s “alternate” <link> tag. If we’ve encountered the title or issued date tag, we set a variable ($thistag) that indicates what we’ve encountered and wait until the next sub-routine for further processing of these tags.
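Picking the URL out by position ($_[3] and $_[7]) only works while the feed lists the link attributes in the same order. XML::Parser hands the Start handler the attributes as a flat list of name/value pairs after the element name, so they can be copied into a hash and looked up by name instead. Here is a minimal sketch of that idea. It calls the handler directly with a simulated argument list rather than running a real parse, and the example.com URL is a placeholder of mine, not a real photo.

```perl
use strict;
use warnings;

my %ENTRIES;          # same shape as the %ENTRIES hash in the script
my $count     = 0;
my $entryflag = 0;

# Start handler sketch: the arguments are the parser object, the element
# name, and then the attributes as a flat name/value list.
sub start_handler {
    my ($expat, $element, %attr) = @_;

    if ($element eq 'entry') {
        $count++;
        $entryflag = 1;
    }
    elsif ($element eq 'link' && $entryflag
           && ($attr{rel} || '') eq 'alternate') {
        # Look the URL up by attribute name instead of by position.
        $ENTRIES{$count}{link} = $attr{href};
    }
    return 1;
}

# Simulated calls with href listed first; undef stands in for the parser object.
start_handler(undef, 'entry');
start_handler(undef, 'link',
    href => 'http://example.com/photo/1',   # placeholder URL
    rel  => 'alternate',
    type => 'text/html');

print "$ENTRIES{1}{link}\n";
```

The same sub could be registered with setHandlers(Start => \&start_handler) in place of the positional version.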

The XML::Parser Char Handler

This block is the tag char handler for XML::Parser, which is the sub-routine executed when we are examining the data contained between a start and end XML tag. As noted earlier, we’ve set flags for when we’ve encountered the issued date and title tags. When we encounter the contents of each of these, we add them to the %ENTRIES hash using the same key. The issued and title tags are added as $ENTRIES{$count}->{date} and $ENTRIES{$count}->{title}, respectively.
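One thing to be aware of: XML::Parser may call the Char handler more than once for a single run of text (for instance around an entity such as &amp;), so storing only the first chunk can cut a title short. A possible variation, sketched below, appends each chunk to a buffer and stops collecting in the End handler. The handlers are called directly here to simulate the parser, %entry is a simplified stand-in for one slot of %ENTRIES, and the title text is made up for the example.

```perl
use strict;
use warnings;

my %entry;          # fields collected for the current <entry>
my $thistag = '';   # name of the element whose text we are collecting

sub start_handler {
    my ($expat, $element) = @_;
    if ($element eq 'title' || $element eq 'issued') {
        $thistag = $element;
        $entry{$element} = '';      # start with an empty buffer
    }
    return 1;
}

sub char_handler {
    my ($expat, $text) = @_;
    # Append rather than assign: one text node may arrive in several pieces.
    $entry{$thistag} .= $text if $thistag ne '';
    return 1;
}

sub end_handler {
    my ($expat, $element) = @_;
    $thistag = '' if $element eq $thistag;   # stop collecting at the close tag
    return 1;
}

# Simulate a title delivered in three chunks.
start_handler(undef, 'title');
char_handler(undef, $_) for ('Lake Tahoe, N. California ', '& ', 'Nevada');
end_handler(undef, 'title');
char_handler(undef, "\n  ");    # text outside a tracked tag is ignored

print "$entry{title}\n";
```

For the issued date, the substr($data, 0, 10) trim from the script would then be applied once the closing tag is seen, rather than on the first chunk.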

The XML::Parser End Handler

This block is the tag end handler for XML::Parser, which is the sub-routine executed when the close of an XML tag is encountered. In this sub-routine, we mainly just do some cleanup of state variables. If we’ve hit the end of an <entry> tag, we clear the entry flag.

Printing the Contents of your Feed

In this block we determine the number of feed elements to print out, based on the number of elements encountered in the feed and what you’ve previously set $MAXENTRIES to. The formatting is done in the final print statement, where each entry is printed out in HTML format. This line can be modified to fit your requirements. All output is made to STDOUT.
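Because XML::Parser delivers text with entities already decoded, a title or link containing characters like & or < would be printed into the HTML as-is. If that matters for your pages, the printing step could escape them first. The following is a rough sketch under that assumption: escape_html() and format_entries() are hypothetical helpers of mine, and the sample entries are invented.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attribute values.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Format at most $max entries in the same "-<A href=...>" style as the script.
sub format_entries {
    my ($max, @entries) = @_;
    $max = @entries if $max > @entries;    # never ask for more than we have
    my @lines;
    for my $e (@entries[0 .. $max - 1]) {
        push @lines, sprintf(qq{-<A href="%s">%s</A> (added: %s)<BR>\n},
            escape_html($e->{link}), escape_html($e->{title}), $e->{date});
    }
    return @lines;
}

# Invented sample data standing in for the parsed %ENTRIES hash.
my @entries = (
    { title => 'Snow & Ice <2008>', link => 'http://example.com/?a=1&b=2', date => '2008-01-26' },
    { title => 'iSight is Dead',    link => 'http://example.com/2',        date => '2008-02-17' },
);
print format_entries(6, @entries);
```

Keeping the formatting in one sub also makes it easier to change the HTML later without touching the parsing code.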

The strategy for parsing the Blogger feed is similar, which you can explore by examining the code itself (see the Summary section below).

Using the Output of the Feeds

Now that you’ve parsed the feeds and have the output, you need to incorporate it into your webpage, email or whatever your delivery mechanism is for this information. As I noted earlier, the script above outputs to STDOUT, so you can easily “catch” the output into a file by using simple re-direction, e.g.:

perl rssflickr.pl > flckrfeed.txt

You can then incorporate the contents of this text file into your website by simply reading the contents of the file. In order to keep the feed up-to-date, however, you will need to run this script automatically, which can be done through a standard UNIX cron job. I have separate cronjob entries for both my Flickr and Blogger feeds, which run every hour, e.g.:

0 * * * * perl ~chrisjur/www/cgi-bin/rssblogspot.pl > ~chrisjur/www/cgi-bin/blogfeed.txt
0 * * * * perl ~chrisjur/www/cgi-bin/rssflickr.pl > ~chrisjur/www/cgi-bin/flckrfeed.txt
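One caveat with plain re-direction: the shell empties the output file before the script starts, so if the HTTP retrieval fails and the script exits, the feed file is left empty until the next run. A possible alternative is to have the script write the file itself, by way of a temporary file that is renamed into place only once the content is complete. The sketch below assumes that approach; write_feed_file() is a hypothetical helper, and the demo writes into a scratch directory instead of the real cgi-bin path.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write $content to $path by way of a temporary file in the same directory,
# then rename it into place. write_feed_file() is a hypothetical helper.
sub write_feed_file {
    my ($path, $content) = @_;
    my $tmp = "$path.tmp.$$";                  # $$ is the current process ID
    open(my $fh, '>', $tmp) or die "Cannot write $tmp: $!";
    print {$fh} $content;
    close($fh) or die "Cannot close $tmp: $!";
    rename($tmp, $path) or die "Cannot rename $tmp to $path: $!";
    return 1;
}

# Demo in a scratch directory that is removed when the script ends.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/flckrfeed.txt";
my $html = qq{-<A href="http://example.com/1">Example entry</A> (added: 2008-01-26)<BR>\n};

# Only replace the file when there is something to write.
write_feed_file($file, $html) if length $html;

open(my $in, '<', $file) or die "Cannot read $file: $!";
print while <$in>;
close($in);
```

With this in place, the cron entry would call the script without the ">" re-direction, and a failed run leaves the previous feed file untouched.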

Since I have several pages that display the same feed information, I like to keep the interface to these feed files consistent. I do this by using a simple ‘include’ Perl routine, which returns the contents of the feed based on a feed “keyword” that is passed to it. This script is the “roadmap” to all my feed files and how I access them. A simple example of this type of file follows:

#!/usr/bin/perl

# dumpfeed.pl - returns HTML code pulled from CJU.com RSS/Atom feed grabbers.
# use:
#
#   require "./dumpfeed.pl";
#   $blog = dumpfeed('blog');
#   print $blog;

sub dumpfeed {

    # Configuration hash uses key => value, where key is a keyword passed
    # to the routine, specifying what feed you want, and value is the file
    # containing the feed contents
    my %feeds = (
        'blog'  => 'blogfeed.txt',
        'flckr' => 'flckrfeed.txt'
    );

    my $requestfeed = shift();
    my $feedfile    = $feeds{$requestfeed};

    # Open the feed file and read its contents
    my $text = "";
    if (open(my $fh, '<', $feedfile)) {
        while (<$fh>) {
            $text = $text . $_;
        }
        close $fh;
    }
    else {
        $text = "Error:  Could not open feed with token $requestfeed using source $feedfile.";
    }

    # Send it back.
    return $text;
}

1;

You can call the subroutine as many times as you want from within your .cgi script or whatever is driving the display of your feed content. These 3 lines assign the contents of the blog feed to the $blog variable, which can be printed out at any point in your .cgi:

require "./dumpfeed.pl";
$blog = dumpfeed('blog');
print $blog;

Summary Information and Files

Required Perl Modules:

  • XML::Parser
  • LWP (libwww-perl)

Files from Examples Above:

  • rssflickr.pl – Parses Flickr Atom feeds and prints summary output to STDOUT. Modify the $url variable to specify your Flickr feed URL. Modify the $MAXENTRIES variable to specify how many entries you want to print.
  • rssblogspot.pl – Parses Blogger Atom feeds and prints summary output to STDOUT. Modify the $url variable to specify your Blogger feed URL. Modify the $MAXENTRIES variable to specify how many entries you want to print.
  • dumpfeed.pl – include subroutine used to access various local feeds that you wish to incorporate into your output.
