In a previous entry, I mentioned the awesome ROME library for reading Atom or RSS feeds. One thing to be aware of when reading feeds programmatically on a regular basis is that you can consume a lot of bandwidth on the server providing the feed as you repeatedly re-download the same large XML file. For that reason, it's a good idea not to poll too aggressively. I routinely look out for aggregators that poll the Orablogs feed more than once every 5 minutes and block them because of excessive bandwidth usage.
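One simple way to avoid polling too aggressively is to remember when a feed was last fetched and skip it until a minimum interval has passed. The following is only a sketch of that idea: the class name PollThrottle and the interval you pass in are assumptions for illustration, not anything provided by ROME.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: allows at most one poll per minimum interval. */
final class PollThrottle {
    private final Duration minInterval;
    private Instant lastPoll; // null until the first poll is recorded

    PollThrottle(Duration minInterval) {
        this.minInterval = minInterval;
    }

    /**
     * Returns true (and records the poll) if at least minInterval has
     * passed since the last recorded poll; returns false otherwise.
     */
    synchronized boolean tryPoll(Instant now) {
        if (lastPoll != null
                && Duration.between(lastPoll, now).compareTo(minInterval) < 0) {
            return false; // too soon, skip this round
        }
        lastPoll = now;
        return true;
    }
}
```

An aggregator could keep one PollThrottle per feed, for example `new PollThrottle(Duration.ofMinutes(30))`, and only open a connection when `tryPoll(Instant.now())` returns true. The 30 minutes here is an arbitrary choice.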
An additional way to write a well-behaved aggregator is to send some information to the server with each request, telling it about the last time you retrieved the feed. This gives the server the opportunity to decide whether the feed has actually changed since you last checked, and to send you the full XML file only if it has. This is known as HTTP conditional GET. Charles Miller's excellent blog entry describes the theory behind this in some detail. I'm going to look at one way of implementing this in Java.
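To make the server's half of that exchange concrete before turning to the client, here is a deliberately simplified sketch of the decision a server makes. The class and method names are made up for illustration, and a real server compares dates and handles weak ETags and lists of ETags rather than using plain string equality.

```java
/** Hypothetical, simplified model of a server's conditional GET decision. */
final class ConditionalGetDecision {
    static final int OK = 200;
    static final int NOT_MODIFIED = 304;

    /**
     * Returns 304 if the validators sent by the client still match the
     * current state of the feed, and 200 if the full feed should be sent.
     * Any argument may be null if the corresponding value is absent.
     */
    static int statusFor(String ifNoneMatch, String ifModifiedSince,
                         String currentEtag, String currentLastModified) {
        if (ifNoneMatch != null) {
            // When If-None-Match is present it takes precedence and
            // If-Modified-Since is ignored.
            return ifNoneMatch.equals(currentEtag) ? NOT_MODIFIED : OK;
        }
        if (ifModifiedSince != null) {
            // Simplification: exact string match instead of date comparison.
            return ifModifiedSince.equals(currentLastModified) ? NOT_MODIFIED : OK;
        }
        // No validators at all: an ordinary, unconditional GET.
        return OK;
    }
}
```

The point for the client is that it never has to interpret these values itself. It just echoes back whatever the server sent last time.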
The first thing to note is that, for this to work at all, you will need to store some persistent information locally about the blogs you aggregate. In this trivial implementation, I'm using Java's built-in XML persistence support to store this information, but it could easily be adapted to use a database or a custom file format. First, let's define a basic JavaBean to store persistent information about a feed.
package org.dubh.samples.cget;

import java.io.InputStream;
import java.io.IOException;

public final class FeedSource
{
  private String _url;
  private String _serverLastModified;
  private String _serverEtag;

  public String getURL()
  {
    return _url;
  }

  public void setURL( String url )
  {
    _url = url;
  }

  public String getServerLastModified()
  {
    return _serverLastModified;
  }

  public void setServerLastModified( String serverLastModified )
  {
    _serverLastModified = serverLastModified;
  }

  public String getServerEtag()
  {
    return _serverEtag;
  }

  public void setServerEtag( String serverEtag )
  {
    _serverEtag = serverEtag;
  }

  /**
   * Open an input stream on this feed, using conditional GET.
   *
   * @return an input stream on this feed.
   * @throws java.io.IOException if an error occurs opening the stream
   * @throws FeedNotChangedException if the feed content has not changed since
   *    the last time it was checked. No stream is open if this exception is
   *    thrown.
   */
  public InputStream openStream() throws IOException,
    FeedNotChangedException
  {
    // Todo: implement conditional get.
    // Placeholder so that the class compiles until then.
    throw new UnsupportedOperationException( "Not implemented yet" );
  }
}
We're storing the URL of the feed, plus the two pieces of information retrieved from the server that we need to implement conditional GET. We don't care about the actual values of Last-Modified or ETag (in particular, we don't attempt to convert Last-Modified into a real Java date); we only ever pass them back to the server once we eventually implement openStream.
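The openStream signature refers to FeedNotChangedException, which isn't listed in this entry. A minimal definition might look like the following. This is an assumed sketch rather than the original class: it is a checked exception so that callers have to handle the "not changed" case, and in the real project it would sit in the org.dubh.samples.cget package alongside FeedSource.

```java
/**
 * Assumed definition: thrown when the server reports that a feed has
 * not changed since it was last retrieved.
 */
final class FeedNotChangedException extends Exception {
    FeedNotChangedException() {
        super("The feed has not changed since it was last retrieved");
    }
}
```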
Next, an implementation of a persistent feed list. Again, the implementation of this doesn't matter a bit. You just have to ensure that the last modified and etag are persisted somehow.
package org.dubh.samples.cget;

import java.beans.XMLDecoder;
import java.beans.XMLEncoder;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import java.util.HashSet;
import java.util.Set;

/**
 * A list of feeds
 */
public final class FeedList
{
  private static final String FILENAME = "org_dubh_samples_cget_feeds.xml";

  private Set<FeedSource> _sources = new HashSet<FeedSource>();

  public Set<FeedSource> getSources()
  {
    return _sources;
  }

  public void setSources( Set<FeedSource> sources )
  {
    _sources = sources;
  }

  public static FeedList load()
    throws IOException
  {
    // You can replace this with some custom persistence implementation
    XMLDecoder decoder = null;
    try
    {
      File file = new File( System.getProperty( "user.home" ), FILENAME );
      if ( !file.exists() )
      {
        return new FeedList();
      }
      decoder = new XMLDecoder(
        new BufferedInputStream( new FileInputStream( file ) )
      );
      return (FeedList)decoder.readObject();
    }
    finally
    {
      if ( decoder != null )
      {
        decoder.close();
      }
    }
  }

  public static void save( FeedList list )
    throws IOException
  {
    // You can replace this with some custom persistence implementation
    XMLEncoder encoder = null;
    try
    {
      File file = new File( System.getProperty( "user.home" ), FILENAME );
      encoder = new XMLEncoder(
        new BufferedOutputStream( new FileOutputStream( file ) )
      );
      encoder.writeObject( list );
    }
    finally
    {
      if ( encoder != null )
      {
        encoder.close();
      }
    }
  }
}
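Since the persistence mechanism really doesn't matter, here is a rough sketch of one alternative "custom file format": a java.util.Properties file holding the URL and the two validators for a single feed. The class name FeedValidators and the property keys are invented for this example, and a real aggregator would need one entry (or one file) per feed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Properties;

/** Hypothetical Properties-based persistence for one feed's validators. */
final class FeedValidators {
    /** Writes the URL and any non-null validators in Properties format. */
    static void store(Writer out, String url, String lastModified, String etag)
            throws IOException {
        Properties props = new Properties();
        props.setProperty("url", url);
        if (lastModified != null) {
            props.setProperty("lastModified", lastModified);
        }
        if (etag != null) {
            props.setProperty("etag", etag);
        }
        props.store(out, "feed validators");
    }

    /** Reads back whatever store() wrote. */
    static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    /** In-memory round trip, handy for checking that values survive as-is. */
    static Properties roundTrip(String url, String lastModified, String etag)
            throws IOException {
        StringWriter out = new StringWriter();
        store(out, url, lastModified, etag);
        return load(new StringReader(out.toString()));
    }
}
```

What matters is that the Last-Modified and ETag strings come back exactly as the server sent them, quotes included, because they are passed back to the server verbatim.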
Next, the most important part of the implementation: actually doing the conditional GET. This simply sets the If-Modified-Since and If-None-Match headers on the request, checks the HTTP status code, and stores the Last-Modified and ETag headers from the response:
// Needs these additional imports in FeedSource:
//   java.net.HttpURLConnection, java.net.URL, java.net.URLConnection
public InputStream openStream() throws IOException, FeedNotChangedException
{
  URLConnection conn = new URL(getURL()).openConnection();
  if ( conn instanceof HttpURLConnection )
  {
    HttpURLConnection connection = (HttpURLConnection) conn;
    if ( getServerLastModified() != null )
    {
      connection.setRequestProperty(
        "If-Modified-Since",
        getServerLastModified()
      );
    }
    if ( getServerEtag() != null )
    {
      connection.setRequestProperty(
        "If-None-Match",
        getServerEtag()
      );
    }
    connection.connect();
    int responseCode = connection.getResponseCode();
    // The rss feed has not been modified.
    if ( responseCode == HttpURLConnection.HTTP_NOT_MODIFIED )
    {
      connection.disconnect();
      throw new FeedNotChangedException();
    }
    else if ( responseCode == HttpURLConnection.HTTP_OK )
    {
      // Remember the validators so they can be sent with the next request.
      String lastModified = connection.getHeaderField( "Last-Modified" );
      String etag = connection.getHeaderField( "ETag" );
      setServerLastModified( lastModified );
      setServerEtag( etag );
      return connection.getInputStream();
    }
    else
    {
      // Attempt to get an input stream anyway. This will possibly fail
      // with an IOException depending on the response code.
      return connection.getInputStream();
    }
  }
  else
  {
    // The URL was not an http: URL. We only support conditional get
    // for HTTP, so we just unconditionally get the stream for other URL
    // schemes.
    conn.connect();
    return conn.getInputStream();
  }
}
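If you want to watch the 200-then-304 exchange without hitting a real blog, one option is a throwaway local server. The sketch below uses the JDK's bundled com.sun.net.httpserver.HttpServer on a loopback port. The class name, the /feed.xml path and the "v1" ETag are all made up for the demo, and it only exercises If-None-Match, not If-Modified-Since.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical loopback demo of a conditional GET round trip. */
final class ConditionalGetDemo {
    // Made-up validator for the demo feed (ETags are quoted strings).
    static final String ETAG = "\"v1\"";

    /** Starts a local server that answers 304 when the ETag matches. */
    static HttpServer startServer() throws IOException {
        HttpServer server =
            HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/feed.xml", exchange -> {
            String sent = exchange.getRequestHeaders().getFirst("If-None-Match");
            if (ETAG.equals(sent)) {
                exchange.sendResponseHeaders(304, -1); // -1: no response body
            } else {
                byte[] body = "<rss/>".getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("ETag", ETAG);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Performs one GET (conditional if etag is non-null); returns the status. */
    static int fetch(URL url, String etag) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        if (etag != null) {
            connection.setRequestProperty("If-None-Match", etag);
        }
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    /** Fetches the demo feed twice: unconditionally, then with the ETag. */
    static int[] run() throws IOException {
        HttpServer server = startServer();
        try {
            URL url = URI.create("http://127.0.0.1:"
                + server.getAddress().getPort() + "/feed.xml").toURL();
            return new int[] { fetch(url, null), fetch(url, ETAG) };
        } finally {
            server.stop(0);
        }
    }
}
```

Calling `ConditionalGetDemo.run()` should give a 200 for the first request and a 304 for the second, which is the same pattern the FeedSource code relies on.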
Finally, here's some test code that demonstrates that it all works. The first time you run this, it should open a stream on duffblog's RSS feed. Any subsequent runs will result in a message that the feed has not changed (unless it actually has :) )
package org.dubh.samples.cget;

import java.io.InputStream;
import java.io.IOException;

public class FeedTest
{
  public static void main( String[] args )
    throws Exception
  {
    FeedList fl = FeedList.load();
    try
    {
      if ( fl.getSources().isEmpty() )
      {
        // OK, first time we've run... add duffblog to the set of feeds.
        FeedSource duffblog = new FeedSource();
        duffblog.setURL( "http://www.orablogs.com/duffblog/index.xml" );
        fl.getSources().add( duffblog );
      }
      for ( FeedSource source : fl.getSources() )
      {
        InputStream is = null;
        try
        {
          is = source.openStream();
          System.out.println( "Opened stream on "+source.getURL() );
        }
        catch ( IOException ioe )
        {
          ioe.printStackTrace();
        }
        catch ( FeedNotChangedException fnce )
        {
          System.out.println( "The feed "+source.getURL()+" has not changed." );
        }
        finally
        {
          if ( is != null )
          {
            try { is.close(); }
            catch ( IOException ioe ) { ioe.printStackTrace(); }
          }
        }
      }
    }
    finally
    {
      FeedList.save( fl );
    }
  }
}
As JR Boyens pointed out on my previous entry, you can also take advantage of ROME/Fetcher, which does all the hard work of HTTP conditional GET for you...