Playing around with Apache HttpComponents

The other week while I was "stuck" at home watching our little baby boy, I spent a little time with ASF's HttpComponents. I have written several HTTP utilities, but so far I have just been using the somewhat limited java.net.* classes.

As a goal of the investigation, I decided to write a multi-threaded web crawler, to some level of completeness.

Here's an overview of the heart of the crawler.

After looking around at the HttpComponents examples, it's pretty clear that the main entry point is the org.apache.http.client.HttpClient interface. This interface represents an HTTP client, analogous to the heart of a browser like Firefox, and it mainly provides methods that allow you to execute HTTP requests. Various implementations exist, the main one of interest being DefaultHttpClient, which has methods for setting up all the typical HTTP goodies like cookie stores, authentication methods, and connection managers.

The simplest instantiation, taking all the defaults, is something like

HttpClient httpClient = new DefaultHttpClient();

But this isn't good enough for me, because it creates a client with only a single-threaded connection manager, which will not work for my goal. A little bit more code fixes that and creates an HttpClient with a multi-thread-safe connection manager.

HttpParams params = new BasicHttpParams();
HttpConnectionManagerParams.setMaxTotalConnections(params, 100);
HttpConnectionParams.setConnectionTimeout(params, 20 * 1000);
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);

// Create and initialize scheme registry 
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80));
schemeRegistry.register(new Scheme("https", SSLSocketFactory.getSocketFactory(), 443));

// Create an HttpClient with the ThreadSafeClientConnManager.
// This connection manager must be used if more than one thread will
// be using the HttpClient.
ClientConnectionManager cm = new ThreadSafeClientConnManager(params, schemeRegistry);

HttpClient httpClient = new DefaultHttpClient(cm, params);

What's the connection manager about? It provides more advanced connection management features, such as connection pooling for things like keep-alive connections.
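Conceptually, a pooling connection manager works like a bounded object pool: connections are leased out to callers, and returning one makes it available for reuse instead of being torn down. Here's a toy sketch of just that idea in plain Java (this is purely an illustration of the concept, not HttpClient's actual implementation; the ToyPool name and String stand-ins are made up):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy illustration of what a pooling connection manager does:
// a bounded set of reusable "connections" handed out and returned.
class ToyPool {
    private final BlockingQueue<String> idle;

    ToyPool(int maxTotal) {
        idle = new ArrayBlockingQueue<String>(maxTotal);
        for (int i = 0; i < maxTotal; i++) {
            idle.add("conn-" + i);   // pre-create reusable connections
        }
    }

    // Borrow a connection; blocks if all are in use,
    // which is the effect of a maxTotalConnections limit.
    String lease() throws InterruptedException {
        return idle.take();
    }

    // Returning a connection makes it available for reuse,
    // analogous to keeping a keep-alive connection open.
    void release(String conn) {
        idle.add(conn);
    }

    public static void main(String[] args) throws InterruptedException {
        ToyPool pool = new ToyPool(2);
        String c = pool.lease();
        System.out.println(c);       // prints conn-0
        pool.release(c);             // back in the pool for the next caller
    }
}
```

The real connection managers add a lot on top of this (per-route limits, stale-connection checking, and so on), but the lease/release cycle is the essential shape.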

Ok, so now that the HttpClient is set up, I can execute HTTP requests in various ways, but one of the easiest is

HttpGet httpget = new HttpGet(url);
HttpResponse response = httpClient.execute(httpget);
HttpEntity entity = response.getEntity();

With the HttpResponse and HttpEntity objects I can interrogate the status of the response, find the Content-Type, and get the content of the response. I put this all in a class I called Downloader; the salient pieces are

public class Downloader implements Runnable {

    HttpClient httpClient;
    URL url;


    public Downloader(HttpClient client, URL url) {
        this.httpClient = client;
        this.url = url;
    }


    public void run() {

        HttpGet httpget = null;
        try {
            httpget = new HttpGet(url.toString());
        } catch (URISyntaxException ex) {
            // record error
            return;
        }

        try {
            HttpResponse response = httpClient.execute(httpget);

            // Get hold of the response entity
            HttpEntity entity = response.getEntity();

            // If the response does not enclose an entity, there is no need
            // to bother about connection release
            if (entity != null) {
                if (entity.getContentType().getValue().startsWith("text/html")) {
                    // This is an HTML file, so process it for links.
                    // Declare the reader outside the try block so the
                    // finally clause below can close it.
                    BufferedReader reader = new BufferedReader(
                        new InputStreamReader(entity.getContent()));
                    try {

                        // [ Read the content from the InputStream ]

                        // [ Code using tagsoup to extract links from the HTML ]

                        // [ Submit links for crawling ]

                    } catch (RuntimeException ex) {

                        // In case of an unexpected exception you may want to abort
                        // the HTTP request in order to shut down the underlying
                        // connection and release it back to the connection manager.
                        httpget.abort();
                        throw ex;

                    } finally {

                        // Closing the input stream will trigger connection release
                        reader.close();
                    }

                } else {
                    // Response is not HTML, read it from the InputStream -
                    // entity.getContent() - and do something with it, like save it.
                }
            }
        } catch (Exception ex) {
            // record error
        }
    }
}

My actual code is structured a little differently because I created the notion of filters to constrain the URLs that are processed - constraints like same host, same domain, exclude paths, etc.

The main thing to note here is that you need to release the underlying connection when you are done with it. This is done by closing the InputStream in the entity - essentially HttpEntity.getContent().close(). You must be careful to do this in every code path, or you will start leaking connections.

The other curious thing you may have noticed is that I had the Downloader class implement the Runnable interface. Why? This has to do with the multi-threading I wanted to do: implementing Runnable lets me schedule Downloaders to execute in a pool of threads.

So how did I manage the thread pool? With the concurrency classes in java.util.concurrent - the ExecutorService specifically. The ExecutorService does all the hard work of managing the thread pool for you; all you really need to do is keep giving it work. To instantiate an ExecutorService I used

ExecutorService workerMgr = java.util.concurrent.Executors.newFixedThreadPool(threads);

And from there, giving more work to the service is as easy as

Downloader downloader = new Downloader(httpClient, url);
workerMgr.execute(downloader);

After I'm done adding items, I wait for all the work to finish. One subtlety: shutdown() only tells the service to stop accepting new tasks; it's awaitTermination() that actually blocks until the queued work completes:

workerMgr.shutdown();
workerMgr.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
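To make this submit/shutdown/wait lifecycle concrete, here's a minimal, self-contained sketch using plain java.util.concurrent, with dummy tasks standing in for the Downloaders (the PoolDemo class and crawl method names are made up for the illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolDemo {

    // Submit 'tasks' dummy jobs to a fixed-size pool and wait for all
    // of them to complete, mirroring the crawler's lifecycle.
    static int crawl(int threads, int tasks) throws InterruptedException {
        final AtomicInteger completed = new AtomicInteger();
        ExecutorService workerMgr = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            workerMgr.execute(new Runnable() {
                public void run() {
                    completed.incrementAndGet();  // stand-in for a Downloader
                }
            });
        }
        workerMgr.shutdown();                             // stop accepting new work
        workerMgr.awaitTermination(1, TimeUnit.MINUTES);  // block until done
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(crawl(4, 100)); // prints 100
    }
}
```

In the real crawler the tasks feed new URLs back into the queue as they discover links, so deciding when the crawl is truly "done" takes a bit more bookkeeping than this sketch shows.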
Comments:

Hi Joe,

Thanks for posting your experience with HttpClient. I have been fooling around with it during some downtime and was wondering if you would mind posting the complete source code of what you worked on?

Thanks!

Jamie

Posted by Jamie Swain on April 25, 2008 at 04:40 AM PDT #

Funny you should ask. I am working on getting approvals to do just that. I have no idea how long the process will take though. Stay tuned.

Posted by Joseph Mocker on April 25, 2008 at 05:42 AM PDT #

I believe ThreadSafeClientConnManager is still in httpclient 4.0 beta. Any particular reason why the MultiThreadedHttpConnectionManager available in 3.0 is not as good? Especially the connection pool management?

Posted by Rose Smith on September 04, 2008 at 08:31 AM PDT #
