
Geertjan's Blog

  • July 6, 2007

Parsing HTML for Links and Writing to the Output Window (Part 1)

Geertjan Wielenga
Product Manager
The nice thing about yesterday's blog entry is that I can now check links... in TOC files. That's also the not-so-nice thing, because there's a lot more to a helpset than a TOC file, and the broken links are more likely to be found in all the other HTML files than in the TOC file. So... how to do a link check, programmatically, for links in HTML files, within the IDE? I don't have the whole answer yet, but I'm certainly a long way there.

The first step, I concluded, is to identify the links in the first place. That means parsing the HTML file for A elements that have HREF attributes and then obtaining their values. The NetBeans Platform addition to that process is to write the results to the Output window as hyperlinks, which can then be clicked to jump back into the same HTML file, to the line where the link is defined. That, in short, is the first phase. Encapsulated in a single picture, the result would be this:

The fact that I can show the above screenshot is an indication that this blog entry, after following the obligatory winding road, will come to a happy conclusion. First, some mandatory reading, or, at least, the stepping stones which I skipped across in order to avoid the yawning chasm of despair:

To begin with the code, we need a CookieAction, specifically for text/html files, which we can create with the New Action wizard. Once we have it, fill it out like this:

public final class LinkCheckAction extends CookieAction {

    private HTMLDocument htmlDoc = new HTMLDocument();
    private DataObject dObj;
    private EditorCookie ec;

    //When the menu item, which is within the open HTML document, is selected,
    //we obtain the data object and the document itself, which we send to the parse() method:

    protected void performAction(Node[] activatedNodes) {
        try {
            dObj = activatedNodes[0].getLookup().lookup(org.openide.loaders.DataObject.class);
            ec = activatedNodes[0].getLookup().lookup(org.openide.cookies.EditorCookie.class);
            if (ec != null) {
                javax.swing.JEditorPane[] editorPanes = ec.getOpenedPanes();
                if ((editorPanes != null) && (editorPanes.length > 0)) {
                    parse(ec.getDocument());
                }
            }
        } catch (IOException ex) {
            Exceptions.printStackTrace(ex);
        } catch (SAXException ex) {
            Exceptions.printStackTrace(ex);
        }
    }

    //We use standard HTML parsing code (as explained very well here):

    private void parse(StyledDocument ec) throws IOException, SAXException {
        java.io.File f = FileUtil.toFile(dObj.getPrimaryFile());
        java.io.FileReader r = new java.io.FileReader(f);
        try {
            HTMLEditorKit.Parser parser = new HTMLParse().getParser();
            htmlDoc.setParser(parser);
            parser.parse(r, new HTMLParseLister(ec), true);
        } finally {
            //Close the reader even if parsing fails:
            r.close();
        }
    }

    //Our parse lister class does the work of creating a tab in the Output window
    //and then overrides the handleStartTag method to identify A elements:

    class HTMLParseLister extends HTMLEditorKit.ParserCallback {

        OutputWriter writer;
        StyledDocument ec;

        public HTMLParseLister(StyledDocument ec) {
            this.ec = ec;
            try {
                org.openide.windows.InputOutput io = IOProvider.getDefault().getIO("HTML Parsing", false);
                io.select();
                writer = io.getOut();
                writer.reset();
            } catch (IOException ex) {
                Exceptions.printStackTrace(ex);
            }
        }

        //A lot happens here. We use a fantastic utility method that is the key
        //to everything: NbDocument.findLineNumber(ec, pos), which gets our line number,
        //which, for some reason, we need to subtract 1 from (don't know why):

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.A) {
                java.lang.String value = (java.lang.String) a.getAttribute(HTML.Attribute.HREF);
                if (value == null) {
                    //An A element without an HREF (a named anchor, for example) is not a link:
                    return;
                }
                try {
                    int lineNo = NbDocument.findLineNumber(ec, pos);
                    int realLineNo = lineNo - 1;
                    writer.println("Line " + realLineNo + ": " + value, new HTMLOutputListener(dObj, realLineNo, pos));
                } catch (IOException ex) {
                    Exceptions.printStackTrace(ex);
                }
            }
        }
    }

    //Utility class for getting the parser:

    public class HTMLParse extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    //The remainder is all standard CookieAction code,
    //generated by the New Action wizard:

    protected int mode() {
        return CookieAction.MODE_EXACTLY_ONE;
    }

    public String getName() {
        return NbBundle.getMessage(LinkCheckAction.class, "CTL_LinkCheckAction");
    }

    protected Class[] cookieClasses() {
        return new Class[]{DataObject.class, EditorCookie.class};
    }

    protected String iconResource() {
        return "org/netbeans/modules/javahelpeditor/BR16.png";
    }

    public HelpCtx getHelpCtx() {
        return HelpCtx.DEFAULT_HELP;
    }

    protected boolean asynchronous() {
        return false;
    }
}

The writer.println call in handleStartTag above creates an HTMLOutputListener for each link, so we also need a class of that name. This is it:

class HTMLOutputListener implements OutputListener {

    DataObject dObj;
    int pos;
    int realLineNo;

    HTMLOutputListener(DataObject dObj, int realLineNo, int pos) {
        this.dObj = dObj;
        this.pos = pos;
        this.realLineNo = realLineNo;
    }

    public void outputLineSelected(OutputEvent evt) {
    }

    //Called when the hyperlink in the Output window is clicked:
    public void outputLineAction(OutputEvent evt) {
        LineCookie lc = dObj.getCookie(org.openide.cookies.LineCookie.class);
        if (lc == null) {
            //No line information is available for this file, so there is nowhere to jump to:
            return;
        }
        Line line = lc.getLineSet().getOriginal(realLineNo);
        //The second argument of show() is a column; pos is the offset reported by the parser:
        line.show(Line.SHOW_GOTO, pos);
    }

    public void outputLineCleared(OutputEvent evt) {
    }
}

The above is a standard implementation of the NetBeans OutputListener interface, which is what turns lines in the Output window into hyperlinks. The crucial piece of information is the line number, which is hard to get to unless you know about the NbDocument.findLineNumber method. It is well worth one's while to explore everything that NbDocument provides, and I intend to do that very soon, because if you don't know what this class offers, you are likely to experience a lot of unnecessary headaches.
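For anyone who wants to experiment with the offset-to-line idea outside the IDE, here is a minimal sketch using only plain Swing text classes. It is not the NbDocument implementation; it assumes a PlainDocument, whose default root element has one child element per line. The class and method names (LineOfOffset, lineOf, docOf) are made up for illustration. Note that the index returned here is zero-based.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.Element;
import javax.swing.text.PlainDocument;

public class LineOfOffset {

    // Returns the zero-based line index that contains the given offset.
    // In a PlainDocument, each child of the default root element is one line.
    public static int lineOf(Document doc, int offset) {
        Element root = doc.getDefaultRootElement();
        return root.getElementIndex(offset);
    }

    // Hypothetical helper: wraps a string in a PlainDocument for testing.
    public static Document docOf(String text) {
        PlainDocument doc = new PlainDocument();
        try {
            doc.insertString(0, text, null);
        } catch (BadLocationException ex) {
            throw new IllegalStateException(ex);
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<a href=\"index.html\">Home</a>\n</body>\n</html>";
        Document doc = docOf(html);
        int offset = html.indexOf("<a ");
        // Add 1 to turn the zero-based index into a human-readable line number.
        System.out.println("Offset " + offset + " is on line " + (lineOf(doc, offset) + 1));
    }
}
```

Whether a line number is zero-based or one-based is exactly the kind of detail that leads to the "subtract 1" puzzle, so it is worth checking which convention each API uses.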

So now, after selecting a menu item, I can output all the links in an HTML document to the Output window. They appear there as hyperlinks, which can be clicked. When you click a link, the document opens (if closed) and the cursor lands on the line in which the link is defined. A definite first step towards a link checker, at least I hope so.
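The parsing part can also be tried on its own, without the NetBeans Platform. The sketch below assumes the JDK's ParserDelegator as the parser (rather than the HTMLEditorKit subclass used in the action) and collects the HREF values into a list instead of writing them to the Output window. The class name HrefLister and the method hrefs are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HrefLister {

    // Returns the HREF value of every A element in the given HTML, in document order.
    public static List<String> hrefs(String html) {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    // Skip A elements that have no HREF, such as named anchors.
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"index.html\">Home</a>"
                + "<a name=\"top\">Top</a>"
                + "<a href=\"help/toc.html\">TOC</a>"
                + "</body></html>";
        for (String link : hrefs(html)) {
            System.out.println(link);
        }
    }
}
```

The pos argument of handleStartTag is ignored here; in the action above, it is the value passed on to NbDocument.findLineNumber.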

Postscript: For clarity, my ec should really be called doc, because it is the Document object obtained from the EditorCookie.
