Find Broken Links In Your Roller Blog

I've been wanting to track down and fix broken links in my blog posts for some time, and I now have a solution. It's the anal-retentive part of me: most people would just move on to other things, but I think of these broken links as bugs that need to be fixed.

Recently on an internal mailing list, Calum gave an approach (thanks) which I've used. His suggestion was to use:

wget --spider --force-html -i bookmarks.html

Now as I've previously mentioned, you can use Dave Johnson's Grabber application to save a local copy of all of your Roller blog posts.

I modified Calum's one-liner to create a simple script (findlinks.sh) that iterates over each of those files:

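#!/bin/bash
# Check the links in every saved blog post (the file names start with the year).
for i in 2*.html
do
    echo "File: $i"
    # wget logs to stderr; merge it into stdout so a plain ">" redirect captures it.
    wget --spider --tries=2 --force-html -i "$i" 2>&1
done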

I added in the --tries=2 to make the script a little more responsive. It's definitely not the fastest thing in the world, but my philosophy for computer programming is:

  1. Get it working.
  2. Make it better (more features).
  3. Make it faster and smaller.

We are still on stage 1 here.

Now you can run the script (in the directory where you saved your blog posts) with:

% ./findlinks.sh > blog-links.txt

I left this running overnight (yes, it's that slow, or rather, it had a lot of files to process), and in the morning I had a 3.7MB file of interesting information. This now needed to be processed.

Another small Python script to the rescue. This script takes the output previously generated by findlinks.sh as input and processes each line, extracting the name of each blog post and reporting links that didn't generate a "200 OK" result. It also writes out some simple statistics at the end of the run. Results are written to standard output.
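The script itself isn't shown here, but a minimal sketch of the parsing step might look like the following. It assumes the wget log ends up in the same file as the "File:" lines, that each request appears as a line beginning with "--" and ending in the URL, and that each reply appears on a line containing "awaiting response..." followed by the status. wget's log format varies between versions, so the two patterns (and the names parse_log, URL_LINE and RESPONSE_LINE) are assumptions to adjust, not the original code.

```python
import re

# Assumed wget log shapes; check them against your own wget version.
URL_LINE = re.compile(r"^--.*?--\s+(\S+)")
RESPONSE_LINE = re.compile(r"awaiting response\.\.\.\s+(\d{3})\s*(.*)")


def parse_log(lines):
    """Yield (file, url, status, reason) for each response in the log."""
    current_file = None
    current_url = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("File: "):
            # Marker written by findlinks.sh before each blog post.
            current_file = line[len("File: "):]
            continue
        match = URL_LINE.match(line)
        if match:
            current_url = match.group(1)
            continue
        match = RESPONSE_LINE.search(line)
        if match and current_url is not None:
            yield (current_file, current_url,
                   int(match.group(1)), match.group(2).strip())


if __name__ == "__main__":
    with open("blog-links.txt") as log:
        for record in parse_log(log):
            print(record)
```

Keeping the parser as a generator means the whole log never has to sit in memory, which is handy for a file of several megabytes.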

It actually does a bit more than that. If a link generated a "301 Moved" response, then it's ignored. The blogs.sun.com team adjusted all the blog URLs a little while ago. Links of the form:

http://blogs.sun.com/roller/resources/richb/blog-richb.jpg

are now of the form:

http://blogs.sun.com/richb/resource/blog-richb.jpg

It's a pity there isn't an easy way to automatically update all such links in my old blog posts.
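For the saved local copies at least, the rewrite itself could be sketched with a regular expression. This is only a guess at the rule, inferred from the single pair of URLs above (the user name moves to the front and "resources" becomes "resource"); other kinds of blogs.sun.com URLs may have changed differently, and the function name is made up for illustration. It doesn't touch the published posts, which would still have to be edited in Roller.

```python
import re

# Old form: http://blogs.sun.com/roller/resources/<user>/<file>
# New form: http://blogs.sun.com/<user>/resource/<file>
# Rule inferred from one example pair; treat it as an assumption.
OLD_RESOURCE = re.compile(
    r"http://blogs\.sun\.com/roller/resources/([^/\s\"']+)/"
)


def rewrite_resource_urls(text):
    """Return text with old-style resource URLs converted to the new form."""
    return OLD_RESOURCE.sub(r"http://blogs.sun.com/\1/resource/", text)


if __name__ == "__main__":
    old = "http://blogs.sun.com/roller/resources/richb/blog-richb.jpg"
    print(rewrite_resource_urls(old))
```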

It also seems that various Amazon links I have don't like it when the wget command touches them. They all generate a "405 MethodNotAllowed" response. I've ignored those too, as they seem to work just fine in a browser.
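The rules described so far boil down to a small classification of status codes. Here is a rough sketch, assuming the status has already been pulled out of the wget log as an integer. The category names, the classify and tally functions, and the choice to count every other 4xx code as broken are my assumptions for illustration, not necessarily what the real script does.

```python
from collections import Counter


def classify(status):
    """Map an HTTP status code to a report category."""
    if status == 200:
        return "ok"
    if status == 301:
        return "moved"                # ignored: the URL scheme changed
    if status == 405:
        return "method not allowed"   # ignored: works fine in a browser
    if 400 <= status < 500:
        return "broken"
    return "other"                    # e.g. 302, reported but not flagged


def tally(statuses):
    """Count how many links fall into each category."""
    return Counter(classify(status) for status in statuses)


if __name__ == "__main__":
    for category, count in tally([200, 301, 404, 405, 302]).items():
        print(count, category)
```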

The new report generates output for each blog post file that looks something like:

File: 20040614-0724.html
Date: [June 14, 2004 07:24]
    
    Url:  http://www.sun.com/smrc/photos-sun/pphistory.html
    Url:  http://brand.sun.com/
    Response: 302 Moved Temporarily
    
    Url:  https://brand.sun.com/
    Url:  http://au.sun.com/news/onsun/2002-04/sun_20.html
    Url:  http://docs.sun.com/db/doc/806-2901/6jc3a4lqm?a=view
>>> Response: 404 Not found
    
    Url:  http://docs.sun.com/db/doc/806-2901/6jc3a4lqp?a=view
>>> Response: 404 Not found
    
    Url:  http://docs.sun.com/db/doc/806-2901/6jc3a4ltl?a=view
>>> Response: 404 Not found
    
    Url:  http://docs.sun.com/db/doc/806-2901/6jc3a4ltg?a=view
>>> Response: 404 Not found

    Url:  http://docs.sun.com/db/doc/806-2901/6jc3a4lu3?a=view
>>> Response: 404 Not found

    Url:  http://www.objectfarm.org/Activities/Publications/TheMerger/UserInterfaces/OPENSTEP-Desktop.jpg
    Url:  http://jsdt.dev.java.net/
    Url:  https://jsdt.dev.java.net/
    Url:  http://java.sun.com/products/jms/index.jsp
    Url:  http://www.wired.com/news/technology/0,1282,35526,00.html
    Url:  http://www.sun.com/access
    Response: 302 Moved Temporarily

    Url:  http://www.sun.com/access/
    Url:  http://wwws.sun.com/software/solaris/freeware/download.html
    Url:  http://www.sun.com/software/solaris/freeware/download.html
    Url:  http://www.sun.com/software/solaris/freeware/download.xml
    Url:  http://www.java.blogger.com.br/sd1.jpg
    Url:  http://www.solaris-x86.org/
    Url:  http://www.theregister.co.uk/2004/06/02/sun_shows_metropolis/
    Url:  http://calctool.sourceforge.net/Screenshots/gcalctool.png
    Url:  http://www.xwinman.org/screenshots/gnome-anakin.jpg
    Url:  http://web.comlab.ox.ac.uk/oucl/work/richard.brent/pub/pub043.html
    Url:  http://www.technorati.com/tag/Personal

Where a link generated a response that began with a "4", I've prefixed the report line with ">>>" to make it easier to find.
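Producing lines in that layout takes very little code. The sketch below assumes each link arrives as a URL, a numeric status and a reason phrase. The function name and the QUIET set (statuses that get a "Url:" line but no "Response:" line) are illustrative assumptions based on the description above.

```python
# Statuses listed without a "Response:" line: fine, moved, or Amazon's 405.
QUIET = {200, 301, 405}


def report_lines(url, status, reason):
    """Return the report lines for one link, flagging 4xx responses."""
    lines = ["    Url:  " + url]
    if status not in QUIET:
        prefix = ">>> " if 400 <= status < 500 else "    "
        lines.append(f"{prefix}Response: {status} {reason}")
        lines.append("")  # blank line after each reported response
    return lines


if __name__ == "__main__":
    for line in report_lines("http://brand.sun.com/", 302, "Moved Temporarily"):
        print(line)
```

Because the prefix is the same width as the normal indent, the "Response:" text stays aligned and a search for ">>>" jumps straight to the broken links.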

Here are my link statistics:

917 files processed.
13462 links processed.
2372 links moved.
578 'method not allowed' links.
766 broken links found.

Now I just need to go back and edit all of those broken links and fix them up if possible.

There is also no doubt in my mind that this can be improved. Suggestions on how to go to stage 2 are most welcome.
