Aggregating RSS Feeds

Pull from several RSS feeds on a high-traffic site for long enough and you’ll wonder if there is a better way. Fortunately, there is. Aggregating your RSS feeds solves several problems for both you and the source of the RSS. First, it reduces the bandwidth required from both the source site and your site. Imagine a busy site that pulls from another site via RSS every time a client loads a page. The result is the same data getting pulled over and over again.

RSS aggregation software is available for a variety of operating systems and languages. The problem is that many of these packages have rather large footprints and put extra strain on already busy servers. If you host on Linux, you already have the tools required to do aggregation. Here are the things you will need:

  • Access to crontab
  • wget installed on your server
  • PHP

First, let’s look at the command that makes this all possible and go into a little detail about how it works. RSS feeds are XML documents served over HTTP, just like regular web pages. A sample of RSS can be seen below:

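<item>
  <title>Ice Rink Shiner</title>
  <link>https://www.cpierce.org/2009/01/ice-rink-shiner/</link>
  <comments>https://www.cpierce.org/2009/01/ice-rink-shiner/#comments</comments>
  <pubDate>Mon, 05 Jan 2009 21:20:11 +0000</pubDate>
  <dc:creator>admin</dc:creator>
</item>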

This is an excerpt from my RSS feed here at https://www.cpierce.org/feed. If we simply wanted to pull this one feed to our server, we could use wget as follows:

#!/bin/bash
/usr/bin/wget --tries=2 --dns-timeout=5 --connect-timeout=5 --no-check-certificate "https://www.cpierce.org/feed/" -O /var/www/html/cpierce.org.xml
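One caveat with the command above: wget -O truncates the destination file before the download starts, so a failed fetch can leave an empty file where the last good copy used to be. Here is a minimal sketch of one way around that, downloading to a temporary file and only moving it into place on success. The function name fetch_feed and the WGET override variable are hypothetical names used for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): fetch_feed URL DEST
# Downloads to a temporary file next to DEST and replaces DEST only when
# the download succeeded and produced a non-empty file.
fetch_feed() {
    local url="$1" dest="$2" tmp
    # Create the temp file in the same directory so mv is a simple rename.
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1
    # WGET is an assumed override hook; it defaults to the path used above.
    if "${WGET:-/usr/bin/wget}" --quiet --tries=2 --dns-timeout=5 \
            --connect-timeout=5 --no-check-certificate \
            "$url" -O "$tmp" && [ -s "$tmp" ]; then
        # mktemp creates the file as 0600; make it readable by the web server.
        chmod 644 "$tmp" && mv -f "$tmp" "$dest"
    else
        # Keep the previous copy of DEST and discard the failed download.
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as fetch_feed "https://www.cpierce.org/feed/" /var/www/html/cpierce.org.xml, and if the source host is down the previous copy of the file stays untouched.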

Now that the RSS feed is on our own server, we don’t have to rely on the speed of the source during page loads. We can also still provide user content even if the source host is down. We could simply run the bash script above every time we wanted to pull a new copy of the feed, but we are looking for a more automated way of doing this. Let’s start by upgrading our bash script to PHP so that we can easily pull multiple RSS feeds at once. Here is the example /var/www/html/rss/rss_feed.php code:

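<?php
// Run a shell command ($cmd) and, when $output is true, echo what it prints.
// Returns the exit status of the command, or 126 if it could not be started.
function syscmd($cmd, $output = false) {
    if (!($prun = popen($cmd . ' 2>&1', 'r'))) return 126;
    while (!feof($prun)) {
        $buffer = fgets($prun, 10000);
        // fgets() returns false at end of output, so skip that case
        if ($buffer !== false && $output) print nl2br($buffer);
    }
    return pclose($prun);
}
// we need a place to store the files we are going to be pulling
// (this path must be writable by your httpd user)
$path = '/var/www/html/rss/';
// now we need an array that will hold our file name (the key) and our rss feed url (the value)
$feeds = array('cpierce.org' => 'https://www.cpierce.org/feed',
    'jbcrawford.net' => 'http://www.jbcrawford.net/feed',
    'jstownsley.com' => 'http://www.jstownsley.com/feed');
// now we need to loop through the array $feeds and pull each rss feed to our local $path.
foreach ($feeds as $name => $url) {
    $target = $path . $name . '.xml';
    // escapeshellarg() keeps the url and file name from being interpreted by the shell
    syscmd('/usr/bin/wget --tries=2 --dns-timeout=5 --connect-timeout=5 --no-check-certificate '
        . escapeshellarg($url) . ' -O ' . escapeshellarg($target), true);
}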

We can test this by loading it in our browser at http://www.site.com/rss/rss_feed.php. This is also handy if you need to manually refresh an RSS feed before the scheduled time. Once this is all working, you’ll have XML files in your specified path. Just one thing is left to do: schedule the refresh using ‘crontab -e’:

15 0-23/4 * * * /usr/bin/wget --delete-after http://www.site.com/rss/rss_feed.php >/dev/null 2>&1

This tells cron to request the script every 4 hours, at 15 minutes past the hour: 0:15, 4:15, 8:15 and so on (I do this so everything isn’t scheduled at the top of the hour). If you need to add other RSS feeds, simply add them to the $feeds array. Each feed is then available at a URL such as http://www.site.com/rss/cpierce.org.xml.
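Because the refresh now happens quietly in the background, it can be worth checking now and then that the local copies are actually being updated. Below is a rough sketch of such a check, assuming the feed files live in one directory and are named after the array keys. The function name check_feeds and the age threshold are hypothetical choices for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report feed files that are missing, empty, or stale.
# Usage: check_feeds DIR MAX_AGE_MINUTES NAME...
check_feeds() {
    local dir="$1" max_age="$2" name file status=0
    shift 2
    for name in "$@"; do
        file="$dir/$name.xml"
        if [ ! -s "$file" ]; then
            # -s is false for files that do not exist or have zero length
            echo "MISSING-OR-EMPTY $name"
            status=1
        elif [ -n "$(find "$file" -mmin +"$max_age")" ]; then
            # find prints the path only if it was modified more than
            # max_age minutes ago
            echo "STALE $name"
            status=1
        else
            echo "OK $name"
        fi
    done
    return "$status"
}
```

For the schedule above, something like check_feeds /var/www/html/rss 480 cpierce.org jbcrawford.net jstownsley.com would flag any feed that has missed two consecutive 4 hour refreshes. The 480 minute threshold is just an example value.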
