For the Fun of It: Writing Your Own Text Editor, Part 1

Using a layered design and iterative development, a line editor evolves into a text editor in this new article series.

August 20, 2019

Download a PDF of this article

After reading the title of this article, you might be thinking, “A text editor in Java? Why would you want to do that? Isn’t Java just for enterprise web apps?” My first answer is heck no! Java is still alive and well on the desktop. My second answer is that getting away from the web environment—some Java developers’ main “comfort zone”—moves the focus towards design and interesting implementation issues.

In this first article, I focus on building a simple line-mode text editor. I do this through iterative implementation. That is, I start with a simple implementation. Then, as I add new features, I update the design and reimplement it.

In an upcoming article, I will turn it into a graphical desktop editor. In all its variations, it will remain a pure text editor; that is, a program for changing a plain-text file. Plain text—no font or color changes—still makes up the bulk of the world’s computer-facing information: batch/script files, program source files, configuration files, log files, and much more. There’s a big jump from a plain-text editor to a full word processor that provides the ability to select typefaces and type size and color; embed images and spreadsheets; align text left, right, or center; and so much more. It would be difficult to focus on all that in a series of short articles, so for the time being, I’ll focus on plain text.

When you design an editor, the two main things to think about are the user interface (UI) or command layer and buffer management. The UI or command layer defines what the user can do to work on the in-memory buffer. These tasks may be simple commands for a line editor, mouse gestures on a screen editor, voice commands on an interactive system, and so on. Buffer management is how the user works on the text in a buffer. Text editors—along with their larger kin, word processors—generally keep an in-memory copy of the file on which changes are made. This copy is stored in memory in a buffer. The editor will write this modified text in place of the original file on disk when the user exits the editor or when some kind of “save” command is issued. Buffer management, also known as the model code, is concerned with looking after the in-memory contents of the file being edited. Both the UI and buffer management, as well as the all-important interface between them, need to be well designed for the end product to be both useful and maintainable.

The Line Between Layers of Code

Defining a Java interface is a powerful (and common) way of separating logical layers of code. Interfaces should be used between layers of an application and on any class that is likely to have (now or later) multiple implementations. The classes can then depend on the interface, not the particular implementations.

In this case, the BufferPrims interface meets both needs: It is a layer boundary, and it has multiple implementations. This is similar to a database-backed application, where you’d likely have an interface between the middle (business logic) layer and the database code. This would allow switching among JDBC, Java Persistence API (JPA)/Hibernate, and maybe NoSQL databases. Some have argued that the JPA EntityManager or Hibernate’s Session are general-enough interfaces for this purpose; others suggest using an application-specific interface. There’s no one right answer for all applications, but these are the design considerations.

The editor needs an interface between the command layer and the buffer management layer. A very trivial interface between these layers might look like this:


public interface BufferPrims {
    addLineAfter(int afterLine, String newLine);
    deleteLine(int lineNumber);
    replaceText(oldText, newText);
}

This interface tells the command code nothing much about how the buffer management works, and at the same time it tells the buffer management nothing about how the UI works. That’s important, because it’s desirable to be able to replace either part without having any effect on the other. And that is true of layered software in general: Layers should know nothing at all about the layer above them (so they can be invoked from different UIs, say) and know only how to invoke the layer directly below them, nothing more.

To make an actual editor, the interface needs to be a bit more comprehensive. I wound up with something like this version:


public interface BufferPrims {
    final int NO_NUM = 0,
        INF = Integer.MAX_VALUE;
    void addLines(List<String> newLines);
    void addLines(int start, List<String> newLines);
    void deleteLines(int start, int end);
    void clearBuffer();
    void readBuffer(String fileName);
    void writeBuffer(String fileName);
    int getCurrentLineNumber();
    String getCurrentLine();
    int goToLine(int n);
    int size();     // Number of lines, as per old Collections
    /** Retrieve one or more lines */
    String getLine(int ln);
    List<String> getLines(int i, int j);

    /** Replace first/all occurrences of 'old' regex 
     *  with 'new' text, in current line only */
    void replace(String oldRE, String newStr, boolean all)
    
    /** Replace first/all occurrences in each line */
    void replace(String oldRE, String newStr, 
        boolean all,
        int startLine, int endLine);
    boolean isUndoSupported();
    /** Undo the most recent operation; 
     *  optional method */
    default void undo() {
        throw new UnsupportedOperationException();
    }
}

Most of the operations probably seem straightforward. The source code for this article provides three implementations of the BufferPrims interface. The optional undo method is in one of these implementations but not the others; I discussed that implementation previously in this article on the Command design pattern.

Some might argue that read() and write() don’t belong here and that the main program should read the file a line at a time and ingest the lines using one of the add() methods. The read() and write() methods are in this interface for efficiency; there may be versions where you read the entire file with a single read operation.

Buffer Management

Having multiple implementations that differ significantly provides some evidence that the interface is based on a sensible design. But it doesn’t say anything about efficiency—speaking of which, if you didn’t care about efficiency, you could just keep everything in a single String object. Because strings are immutable, this approach would require a lot of reallocation, so a StringBuilder or StringBuffer is a better basis for a naive implementation. In fact, the first implementation of the buffer primitives (BufferPrims) will use BufferPrimsStringBuffer.

Although a StringBuffer has some advantages—such as built-in methods for modifying the contents of the buffer—it is not a natural organization for what is essentially a list of lists of words or, more precisely, a list of lines. Accordingly, the StringBuffer implementation, which I talked myself into writing to show that it could be done, has to do some work just to find where the lines begin and end.

To keep things consistent, I made the assumption that each line would end with a single newline character ('\n'); carriage returns ('\r') would be banned altogether. With this approach, I can find where one line ends and the next begins just by looking for those newline characters. Most of the operations in this class end by calling StringBuffer methods to get or set characters at particular positions in the StringBuffer.

For example, here’s the code to get the current line’s contents:


@Override
public String getCurrentLine() {
    int startOffset = findLineOffset(current);
    int len = findLineLengthAt(startOffset);
    return buffer.substring(startOffset, 
                            startOffset + len);
}

The findLineOffset method uses a regular expression to find the lines, so it’s a bit more than just a for loop looking for newline characters, although that would be a workable naive implementation.

Similarly, the StringBuffer-based line deletion makes use of the same two methods:


@Override
public void deleteLines(int startLine, int endLine) {
    if (startLine > endLine) {
        throw new IllegalArgumentException();
    }
    int startOffset = findLineOffset(startLine),
            endOffset = findLineOffset(endLine);
    buffer.delete(startOffset,
        endOffset + findLineLengthAt(endOffset) + 1);
}

The substitute (s) command is probably the most complicated command; I describe the variations of it in the “UI/Command Structure” section. When all the command options have been parsed, the UI layer has to call one of two methods:


void replace(String oldRE, 
             String newStr, boolean all);

void replace(String oldRE, 
             String newStr, boolean all,
             int startLine, int endLine);

The variable all controls whether all occurrences, or just the first, are to be replaced. Here’s the implementation of the one-line version of replace:


@Override
public void replace(String oldRE, 
                    String newStr, boolean all) {
    int startOffset = findLineOffset(current);
    int length = findLineLengthAt(startOffset);
    String tmp = 
        buffer.substring(startOffset, length);
    tmp = all ? tmp.replaceAll(oldRE, newStr) : 
                tmp.replace(oldRE newStr);
    buffer.replace(startOffset, length, tmp);
}

A second implementation. The second implementation of the buffer code, BufferPrimsNoUndo, stores the buffer data in a List<String>, which is a data structure that’s easier to work with on a line-by-line basis. For example, getting the current line consists of just the following code:


@Override
public String getCurrentLine() {
    return buffer.get(
        lineNumToIndex(current));
}

(Because the line numbers start at 1 but List indices start at 0, there’s a lineNumToIndex() method to convert from a line number to an array index. Doing that looks neater than subtracting 1 every time.)

The “delete lines” operation is also simpler.


@Override
public void deleteLines(int startLnum, 
                        int end) {
    int startIx = lineNumToIndex(startLnum);
    for (int i = startIx; i < end; i++) {
        if (buffer.isEmpty()) {
            System.out.println(
                "?Deleted all lines!");
            break;
        }
        buffer.remove(startIx); // not i!
    }
    current = startLnum;
}

Note that some of this class’s methods are stored in the class AbstractBufferPrims, which both List-based implementations use.

A third version. The final version, BufferPrimsWithUndo, goes one step further and, as its name implies, implements the undo operation. Briefly, each operation that changes the buffer also creates a lambda expression containing code to reverse that operation. For example, the “delete lines” operation now works something like this:


@Override
public void deleteLines(int startLnum, int end) {
    
    List<String> undoLines = new ArrayList<>();
    for (int i = startLnum; i < end; i++) {
        if (buffer.isEmpty()) {
            System.out.println("?Deleted all lines!");
            break;
        }
        undoLines.add(buffer.remove(i - 1));
    }
    current = startLnum;
    if (!undoLines.isEmpty()) {
        pushUndo("delete lines " + startLnum + " to " + end,
            () -> addLines(startLnum, undoLines));
    }
}

In fact, some of the operations are so similar to the BufferPrimsNoUndo that the latter is reused via super() from the AbstractBufferPrims class:


@Override
public void addLine(String newLine) {
    super.addLine(newLine);
    pushUndo("add 1 line", () -> 
        deleteLines(current, current));
}

The method pushUndo() creates a wrapper object around the descriptive string (for use in visual editors) and the lambda expression to be performed to undo the current command. The deleteLines() undo operation has to be able to add back all the deleted lines, so they are collected in a List and attached in the call to the inverse operation, addLines(). See the article cited earlier for more discussion of the undo mechanism and the Command design pattern it implements.

UI/Command Structure

To avoid having to reinvent everything from scratch (usually a bad idea), the UI of the line editor is shamelessly stolen from that of the UNIX ed command, which has been fairly standard for many decades and was the basis for both ex and vi and largely the basis for the stream editor sed. The basic format of every command is


[n,m]C[operands]

where:

  • Square brackets mean the bracketed text is optional.

  • n and m are line numbers (starting at 1; the default is the current line, which is the latest line you did anything to).

  • C is a single-letter command (d for delete, a for append, s for substitute, and so on); a reasonably current list is in the README file for the source project.

  • Operands depend on the command being used.

It’s worth noting that the design of this single lowercase letter command interface was influenced by the very slow keyboard interface between users and the computer when this design was first created. But it was also a conscious choice that was intended to keep the design from growing without limits; there were supposed to be only about 26 possible commands, ever. For better or for worse, later developers have added some uppercase commands; none of these is implemented in my version, but any modern UNIX or Linux ed likely implements them.

Every Java application needs a public static void main(String[] args) method to start it running. This method typically handles command-line options (if you have any), checks and opens files, and then delegates control to some processing method, typically a nonstatic instance method. The main program is in LineEditor.java.

There’s a loop in this method that reads one line from the console, parses it to make sure it looks like an editor command and, if it does, hands it to a letter-command-specific routine. The main code loop looks something like this:


    // The main loop of the editor:
    while ((line = in.readLine())  != null) {
        ParsedCommand pl = 
            LineParser.parse(line, buffPrims);
        if (pl == null) {
            System.out.println("?");
            continue;
        }
        EditCommand c = commands[pl.cmdLetter];
        if (c == null) {
            System.out.println(
                "? Unknown command in " + line);
        } else {
            c.execute(pl);
        }
}

The code that processes the single-letter commands is set up in the static initializer around line 65 of LineEditor.java, where an array of EditCommand object maps from the single-letter characters to lambda expressions for the commands that are implemented. The array’s length is 255 to allow for any ASCII character, but in reality only about 20 are implemented. As a simple example of the command processing, the p command (for “print”) is implemented as follows:


    // p - print lines
    commands['p'] = pl -> {
        buffPrims.getLines(pl.startNum, pl.endNum)
                 .forEach(System.out::println);
};

In this code, the variable pl stands for ParsedCommand (formerly ParsedLine), a class representing the command, the line number range, and the operands. Each time the user types a command, a ParsedCommand representing it is created by the parse() method in the LineParser class.

Some of the command handlers, such as p above, are short and sweet. Some are quite a bit longer. If you’re curious, look at the s command (that is, substitute) implementation, which must not only iterate over a range of lines but also allow any delimiter character, check that it’s present two times or three times, and see if the final delimiter is followed by a g or a p in either order. The g means “global” or “go across the line”; without that, edj replaces only the first occurrence of the old text. A trailing p causes the line to be printed. Here are some examples of valid commands:


s/him/them/     # change first "him" to "them" in current line
s=him=them=     # same as above; just a different delimiter
2,3/him/them/   # change first "him" to "them" in lines 2 through 3
5,s/him/them/   # change first "him" to "them" in lines 5 through end
,s/him/them/    # change first "him" to "them" in every line
s/him/them/g    # change every "him" to "them" in every line
s/him/them/p    # change first "him" to "them" in current line 
                #   and print line after
s/him/them/gp   # combines the two previous examples

It takes time to learn this syntax, and it requires a few lines of code to interpret all the variations correctly (see parseSubstitute()). But the conciseness and power of this command explains why ed (and its offspring, including the vi editor) remain popular with sysadmins and developers around the open source world.

Parameterized Testing

“Test early and often” is a mantra for reliable code: Have many unit tests, and run them when anything changes. The buffer primitives are easy to test, and there are tests for most of that code. To avoid copy-and-pasting tests for the three different implementations, I use a JUnit 4 mechanism called the parameterized test. A method annotated with @Parameters returns a list of constructor arguments of any type, which are used to construct a distinct instance of the test class with fields set by the constructor. Because the arguments can be of any type, I use the class descriptor, for example, the Reflection API class. Here is the relevant part of BufferPrimsTest:


@RunWith(Parameterized.class)
public class BufferPrimsTest {
    protected BufferPrims target;
    public BufferPrimsTest(Class<BufferPrims> clazz) throws Exception {
        target = clazz.newInstance();
    }
    @Parameters(name = "{0}")
    public static Class<BufferPrims>[] params() {
        return (Class<BufferPrims>[]) new Class<?>[] {
            BufferPrimsStringBuffer.class,
            BufferPrimsNoUndo.class,
            BufferPrimsWithUndo.class };
    }

    // @Test-annotated methods...
}

There are also tests for the command and substitute-operands parsing code.

Further Work

As it stands, edj is not something you’d use if you had the full version available or, preferably, a screen-based derivative such as vi or its various derivatives. Some of the commands from ed are not implemented, and those that are tend to be incomplete. The line-range parsing, for example, lacks “searching”: In actual UNIX/Linux implementations of ed, either the starting or ending line number (or both) can be replaced with a search string. The search string can be /pattern/ to search in the forward (downward) direction from the current line, or ?pattern? to search upwards. The overall design of the code will allow anyone to extend it into a full implementation, if there’s a need for such a thing.

Code Reuse

To somewhat validate the overall design, I built a very simple proof-of-concept implementation of the UNIX stream editor, sed; the UNIX sed command works by reading the named files (or standard input, if no files are specified) a line at a time, and applying any and all edit commands that were placed on the command line for each line. You might say the following:


sed -e 's/old/new/g' -e 's=more=less=' file1

This would, as you might imagine, replace all occurrences of “old” with “new” and the first occurrence of “more” on each line with “less” in file1. This is a stream editor rather than a batch editor. It doesn’t change file1; instead, it writes the changed version to the standard output. Making a batch editor—keeping the file safe if there are command errors or if the disk fills up—is out of scope for this proof of concept.

My StreamEditor, unlike the LineEditor, has only one command implemented, but it’s the most complex one: the s (substitute) command. StreamEditor reuses the line parsing code and the substitute-command parsing code. My StreamEditor prototype works for the one command that’s implemented; there is a shell script called stests (stream tests) that demonstrates this proof of concept. I plan to reuse the buffer-handling code in an upcoming installment.

Getting the Code

This version of the editor is known as edj (editor in Java, pronounced like “edge”). A reminder that the code can be downloaded from GitHub if you’d like to see the whole thing in one place.

Conclusion

This article looked at some issues for designing the structure and some of the code required to implement a relatively simple text editor completely in Java. Some of the important challenges are

  • To design a UI that exposes the power of regular expressions with a line-oriented paradigm

  • To design the interface between the UI and the underlying buffer management code

  • To build an efficient but comprehensive implementation for buffer management

The UI design is based on that of UNIX ed. And the BufferPrims interface is based very vaguely on my C-language reimplementation of this editor many years ago, which was based on an even earlier version written in a pedagogical language called RatFor, in the book Software Tools. The version of the interface presented here did not spring fully formed from my mind, but was revised a few times during the evolution of the editor. My next article will show how this interface can be reused to make a usable visual text editor.

Also in This Issue

Know for Sure with Property-Based Testing
Arquillian: Easy Jakarta EE Testing
Unit Test Your Architecture with ArchUnit
The New Java Magazine
Quiz Yourself: Using Collectors (Advanced)
Quiz Yourself: Comparing Loop Constructs (Intermediate)
Quiz Yourself: Threads and Executors (Advanced)
Quiz Yourself: Wrapper Classes (Intermediate)
Book Review: Core Java, 11th Ed. Volumes 1 and 2

Ian Darwin

Ian Darwin (@Ian_Darwin) is a Java Champion who has done all kinds of development, from mainframe applications and desktop publishing applications for UNIX and Windows, to a desktop database application in Java, to healthcare apps in Java for Android. He’s the author of Java Cookbook and Android Cookbook (both from O’Reilly). He has also written a few courses and taught many at Learning Tree International.

Share this Page