Find Broken Links In Your Roller Blog
By user12607856 on Jan 22, 2007
Recently on an internal mailing list, Calum suggested an approach (thanks), which I've used. His suggestion was to use:
wget --spider --force-html -i bookmarks.html
I modified Calum's one-liner to create a simple script (findlinks.sh) that iterates over each of those files:
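#!/bin/bash
# Check the links in each saved blog post file.
for i in 2*.html; do
    echo "File: $i"
    # wget writes its log to stderr; send it to stdout so the redirect captures it
    wget --spider --tries=2 --force-html -i "$i" 2>&1
done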
I added --tries=2 to make the script a little more responsive. It's definitely not the fastest thing in the world, but my philosophy for computer programming is:
- Get it working.
- Make it better (more features).
- Make it faster and smaller.
We are still on stage 1 here.
Now you can run the script (in the directory where you saved your blog posts) with:
% ./findlinks.sh > blog-links.txt
I left this running overnight (yes, it's that slow, or rather, it had a lot of files to process), and in the morning I had a 3.7 MB file of interesting information. This now needed to be processed.
Another small Python script to the rescue. This script takes the previously generated output from findlinks.sh as input, processes each line, extracting out the name of each blog post and reporting links that didn't generate a "200 OK" result. It also writes out some simple statistics at the end of the run. Results are written to standard output.
It actually does a bit more than that. If a link generated a "301 Moved" response, then it's ignored. The blogs.sun.com team adjusted all the blog URLs a little while ago. Links of the form:
are now of the form:
It's a pity there isn't an easy way to automatically update all such links in my old blog posts.
It also seems that various Amazon links I have don't like it when the wget command touches them. They all generate a "405 MethodNotAllowed" response. I've ignored those too, as they seem to work just fine in a browser.
The new report generates output for each blog post file that looks something like:
File: 20040614-0724.html
Date: [June 14, 2004 07:24]
Url: http://www.sun.com/smrc/photos-sun/pphistory.html
Url: http://brand.sun.com/
Response: 302 Moved Temporarily
Url: https://brand.sun.com/
Url: http://au.sun.com/news/onsun/2002-04/sun_20.html
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4lqm?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4lqp?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4ltl?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4ltg?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4lu3?a=view
>>> Response: 404 Not found
Url: http://www.objectfarm.org/Activities/Publications/TheMerger/UserInterfaces/OPENSTEP-Desktop.jpg
Url: http://jsdt.dev.java.net/
Url: https://jsdt.dev.java.net/
Url: http://java.sun.com/products/jms/index.jsp
Url: http://www.wired.com/news/technology/0,1282,35526,00.html
Url: http://www.sun.com/access
Response: 302 Moved Temporarily
Url: http://www.sun.com/access/
Url: http://wwws.sun.com/software/solaris/freeware/download.html
Url: http://www.sun.com/software/solaris/freeware/download.html
Url: http://www.sun.com/software/solaris/freeware/download.xml
Url: http://www.java.blogger.com.br/sd1.jpg
Url: http://www.solaris-x86.org/
Url: http://www.theregister.co.uk/2004/06/02/sun_shows_metropolis/
Url: http://calctool.sourceforge.net/Screenshots/gcalctool.png
Url: http://www.xwinman.org/screenshots/gnome-anakin.jpg
Url: http://web.comlab.ox.ac.uk/oucl/work/richard.brent/pub/pub043.html
Url: http://www.technorati.com/tag/Personal
Where a link generated a response code that began with a "4", I've prefixed the report line with ">>>" to make it easier to find.
Here are my link statistics:
917 files processed.
13462 links processed.
2372 links moved.
578 'method not allowed' links.
766 broken links found.
Now I just need to go back and edit all of those broken links and fix them up if possible.
There is also no doubt in my mind that this can be improved. Suggestions on how to go to stage 2 are most welcome.