The joy of writing command-line utilities, Part 1: Finding duplicate files

Step-by-step creation of a small project that has practical utility

August 14, 2020


Writing command-line utilities for your own use is an agreeable and productive way to explore Java and create side projects that you will actually finish.

Take duplicate files. Like many people, I have lots of music on my laptop, as well as many photos. Managing the folders containing these items can be tedious at times, especially when I suspect—but don’t know for sure—that a given photo or track is a duplicate. To address this problem, it’s handy to be able to run a utility that scans one or more directories for duplicate files.

This article presents a project to create such a utility. It is a good project for intermediate programmers who have a solid command of the Java language and want to expand their skills. I’ll explore some lesser-used Java APIs and examine some unexpected problems in design and testing.

Throughout this article, I lay particular emphasis on the effects of small, seemingly innocuous decisions and look at alternatives. Registering these small decisions as you code is an underappreciated discipline, which I’ve discussed before. As you’ll see, looking at the effects of small decisions paves the way for finding new ways to make the software more useful.

One of the benefits of working on a small utility is that the problem domain is narrowly focused, but that does not prevent design issues from creeping in. Let’s start with those issues.

Design

My goal is to build a simple command-line utility where the user can specify one or more directories and have the utility generate a list of all the duplicate files, grouping them together.

Here’s the command line, so far:


java -jar jar-name directory1 [directory2 ... directoryN]

A very useful additional feature would be to tell the program whether it should search subdirectories. Let’s decide here that the default behavior is to scan subdirectories. That way, I can point the utility at my top-level music directory and let it run through all the tracks below it. So, the command-line switch should disable this feature, rather than enable it. I’ve chosen to go with -nosubdirs.

A final option is -h or -help, which explains the usage in case I forget it in the future. Even in this small specification, I’ve made some implicit choices. For example, I’ve avoided using the GNU-style command options that would have used a double hyphen on the help argument: --help. (This looks like a small thing to note, but if I were building this for wide consumption, I would likely include the double-hyphen spelling as well.)

Next, let’s consider the output. I want a simple presentation of the information, so I’ve settled on writing the data to stdout, which I’ll need to capture into a text file. This choice also eliminates another command-line option, which a more polished version would have, namely, the ability to specify the output file. That capability is frequently expressed with a -o command-line option.

Currently, the sample output when a duplicate is found looks like this:


These files are the same:
	D:\GoogleDrive\misc\NotesOnReed.txt
	D:\GoogleDrive\misc\NotesOnReed2.txt

These files are the same:
	D:\GoogleDrive\misc\Chad-smiling.jpg
	D:\GoogleDrive\misc\Chad-1988-atCamp.jpg

If there are no duplicates, the program should show a one-line statement to that effect, rather than produce no output.

I have sketched out the basic design as a user would experience it. These simple decisions allow me to avoid needing to make feature decisions during coding, and they help me determine my tolerance for YAGNI (“You ain’t gonna need it”—a shorthand for saying don’t build features you might need at some future date but don’t need right away).

Before I begin implementation, I should define the level of robustness I’m aiming for. My two basic criteria for robustness are that the utility should never throw an uncaught exception and that it should always produce some output. While these criteria seem minimal, they add a series of requirements to my list:

  • If a nonexistent directory is specified, an error message should be generated.
  • If a specified directory is empty, a warning message should be generated (however, no warning should appear for empty subdirectories).
  • All I/O exceptions should be trapped (and reported to the user).

These look fairly straightforward, but that’s because I’ve avoided addressing some knotty problems. For example, how should I handle a directory in which files are being added or deleted while the utility is running? That is, should I check that a file still exists before I include it in the duplicate listing? What about hidden files? Or symlinks to other files? For the nonce, I’m choosing to accommodate only stable directories and not to follow symlinks.

By making these decisions up front, I’m free from distractions while I write the code.

Implementation

There are five principal actions in the proposed design: command-line processing, creation of file lists for the specified directories, fingerprinting the files, comparing the fingerprints for duplicates, and generating the output. Let’s consider these individually.

The code that I’ll be showing here is all based on Java 8. This choice has no larger agenda to it—I simply don’t need features that are in later Java releases for this project. The code is available on GitHub as release 1.0 of the FileDedupe project.

Command-line processing. The command line I have discussed is so simple that I don’t need to use one of the innumerable libraries that handle command-line options. For my needs, I go through the passed arguments, which are all treated as directories until I hit an option that begins with a hyphen or I reach the end of the list. The code for this is unremarkable.
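While the code is unremarkable, the loop is worth a quick sketch. What follows is a minimal, hypothetical version (the class and field names are mine, not the project’s) that treats any argument beginning with a hyphen as an option and everything else as a directory:

```java
import java.util.ArrayList;
import java.util.List;

public class ArgsParser {
    // Hypothetical holders for the parsed options; the real project's names may differ.
    static List<String> dirs = new ArrayList<>();
    static boolean skipSubDirs = false;

    static void parse( String[] args ) {
        for( String arg : args ) {
            if( arg.startsWith( "-" )) {          // options begin with a hyphen
                if( arg.equals( "-nosubdirs" ))
                    skipSubDirs = true;
                else if( arg.equals( "-h" ) || arg.equals( "-help" ))
                    System.out.println(
                        "usage: java -jar jar-name dir1 [dir2 ... dirN] [-nosubdirs]" );
            }
            else
                dirs.add( arg );                  // everything else is a directory
        }
    }

    public static void main( String[] args ) {
        parse( new String[]{ "D:\\Music", "-nosubdirs" } );
        System.out.println( "dirs=" + dirs + ", skipSubDirs=" + skipSubDirs );
    }
}
```

A production version would also report unrecognized options rather than silently ignoring them.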

However, this approach highlights an unaddressed problem in lots of command-line processing related to files and directories. For example, should I make sure that . and .. are valid directories? If so, this means that I can specify directories without giving their full path. That’s fine as long as a user does not have a directory whose name begins with a hyphen, which I would treat as an option, rather than a directory. I choose to ignore this problem because I know I don’t have any such directories (at least, I think I don’t). But then, should I make note in the help information that directories that begin with a hyphen should be specified using the entire path? Edge cases are the nemesis of developers who rush into software development because the task appears conceptually simple. Handling edge cases creates confidence in the utility.

File lists for directories. Walking a directory structure to get a list of the files it contains (and, optionally, of its subdirectories) is such a common operation that the Java code for it is almost a fixed recipe. The Files API in the java.nio.file package has a walk() method for crawling a directory. It returns a stream of Path entries. In the following code, I use this API and convert the stream of Path entries into an ArrayList:


    ArrayList<Path> files =
        Files.walk( dir, skipSubDirs ? 1 : Integer.MAX_VALUE )
            .filter( p -> p.toFile().isFile() )
            .peek( System.out::println )
            .collect( Collectors.toCollection( ArrayList::new ));

There are several variants of Files.walk; the one I use allows me to specify as an integer what depth of subdirectories to search. I use .filter to make sure I’m getting only files back, and in this code I’m also printing the filenames to the output. In an enhanced version of the utility, this last capability would be controlled by a -verbose option on the command line.

I catch all exceptions in this walk and if one occurs, I return an empty ArrayList, rather than null. Returning an empty collection rather than null is a good practice to avoid null pointer exceptions, but it has some implications, namely that if an exception occurs in the middle of a large directory walk, the program will incorrectly think that there were no files in the directory. I could have gone in various directions here, but I like the idea of notifying the user of the error, specifying which top-level directory is involved, and not providing a list of any duplicates. I don’t want to later be deleting files in a directory that has I/O problems.

A key limitation of this choice is that, per my previous requirement, I notify the user if a directory contains no files. This message will be triggered by this error due to the empty ArrayList, meaning the user is first told that there is an error and that the directory will be skipped and then is told that there were no files in the directory. A good solution if I later want to improve this is to return an Optional, which allows the calling method to distinguish between an error and a valid empty list.
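That Optional-returning variant might be sketched as follows; the class and method names here are hypothetical, not taken from the project:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirLister {
    // Hypothetical wrapper: Optional.empty() signals an I/O error,
    // while a present-but-empty list means the directory truly has no files.
    static Optional<ArrayList<Path>> filesInDir( Path dir, boolean skipSubDirs ) {
        try( Stream<Path> paths =
                 Files.walk( dir, skipSubDirs ? 1 : Integer.MAX_VALUE )) {
            return Optional.of( paths.filter( p -> p.toFile().isFile() )
                 .collect( Collectors.toCollection( ArrayList::new )));
        } catch( IOException e ) {
            System.err.println( "Error walking directory: " + dir );
            return Optional.empty();
        }
    }
}
```

Note the try-with-resources block: the stream returned by Files.walk holds open directory handles and should be closed when the walk is done.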

Now that I have the list of files, let’s see how to detect duplicates.

Fingerprinting the content of the files

The goal is to come up with a way that uniquely identifies the contents of each file. Cryptographic hashes generate effectively unique values for files, but they are too heavy a solution for my needs, because they include extensive computation intended to fulfill the requirements of secrecy. For example, a requirement of cryptographic hashes is that two files that differ by only a byte or two generate substantially different hashes. I don’t care whether similar files produce similar values, as long as distinct contents are distinguished. Thus, for this utility, none of the cryptographic computation adds any benefit.

All I want is a checksum generator such that if two files generate the same value, I can be confident they have the same contents. (Strictly speaking, any 32-bit checksum can collide; a belt-and-suspenders version would confirm apparent duplicates with a byte-by-byte comparison.) One of the most famous such checksums is CRC-32. Fortunately, the standard Java library has a CRC-32 checksum routine in the java.util.zip package. The utility uses that routine in the following code:


CheckedInputStream check = new CheckedInputStream(
        new FileInputStream( filename ), new CRC32() );
BufferedInputStream in = new BufferedInputStream( check );

try {
    while( in.read() != -1 ) {
        // Read file in completely
    }
    in.close();
} catch( IOException e ) {
    System.err.println( "Error reading file: " + filename );
    throw new IOException( e );
}

return( check.getChecksum().getValue() );

The code for this API is unusual. You first must create a CheckedInputStream for each file and point it to the Java library’s CRC-32 routine, as in the first line in the code above. Then you read through the entire file in an empty loop, and finally you go back to the CheckedInputStream to get the checksum value.

CRC-32 is not the only choice that Java offers. You can also use the Adler-32 algorithm, which is faster but less reliable when computing checksums on small files. The Java API enables you to add other checksum algorithms, but CRC-32 is entirely satisfactory for this project.
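Because CRC32 and Adler32 both implement the java.util.zip.Checksum interface, switching algorithms is a one-line change. Here is a small, self-contained demonstration (the class and method names are mine, not the project’s):

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChecksumDemo {
    // Any implementation of java.util.zip.Checksum can be passed in,
    // so CRC-32 and Adler-32 are interchangeable here.
    static long checksumOf( byte[] data, Checksum algo ) {
        algo.update( data, 0, data.length );
        return algo.getValue();
    }

    public static void main( String[] args ) {
        byte[] content = "the same file content".getBytes();
        System.out.println( "CRC-32:   " + checksumOf( content, new CRC32() ));
        System.out.println( "Adler-32: " + checksumOf( content, new Adler32() ));
    }
}
```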

At this point, I have checksums for all the files, and now I need only to figure out how to find duplicates among them.

Finding duplicates

There are several ways to check for duplicates. One way would be to create a large ArrayList of entries consisting of a filename and the corresponding checksum. Then sort on checksum, go down the sorted list looking for checksums that appear more than once, and print them out. Because of the sort, the duplicates will all appear sequentially.
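For comparison, that sort-and-scan approach might look like the following sketch; FileEntry and the method name are hypothetical, not from the project:

```java
import java.util.ArrayList;
import java.util.Comparator;

public class SortApproach {
    // Hypothetical pairing of filename and checksum; names are illustrative only.
    static class FileEntry {
        final String filename;
        final long checksum;
        FileEntry( String filename, long checksum ) {
            this.filename = filename;
            this.checksum = checksum;
        }
    }

    // Sort by checksum; duplicates then appear on adjacent rows of the list.
    static int countDupes( ArrayList<FileEntry> entries ) {
        entries.sort( Comparator.comparingLong( e -> e.checksum ));
        int dupes = 0;
        for( int i = 1; i < entries.size(); i++ ) {
            if( entries.get( i ).checksum == entries.get( i - 1 ).checksum ) {
                dupes++;
                System.out.println( entries.get( i - 1 ).filename +
                                    " duplicates " + entries.get( i ).filename );
            }
        }
        return dupes;
    }
}
```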

I chose a different approach that skips the sorting: I create a HashMap with a key consisting of the checksum (which is a long), and the corresponding value is an ArrayList of filenames. I add entries to the HashMap by checking whether the checksum is already in the table. If it is, I add the filename to the checksum’s corresponding ArrayList. If it is not, I create a new entry in the table for the checksum and its ArrayList, the latter of which presently contains just the one filename. Here is the code, which is straightforward:


ArrayList<String> tableEntry = dupesTable.get( checksum );
if( tableEntry == null ) {    // not found yet in table
    ArrayList<String> entry = new ArrayList<>();
    entry.add( filename );
    dupesTable.put( checksum, entry );
}
else {
    tableEntry.add( filename ); // checksum already in the table
}

The first line checks the HashMap (called dupesTable) for the entry based on the checksum. If get() returns null, there is no entry for that checksum. In that case, I create a new ArrayList, add the one filename to it, and then add it and the checksum to the HashMap. If the result is not null, the checksum is already in the table with one or more filenames in its ArrayList. The initial get() call returned that ArrayList, to which I then add the new filename.

Once I’ve loaded all the file entries, I go through the HashMap sequentially. I look for checksums whose corresponding ArrayList of filenames has more than one entry and I write all filenames to stdout. When I’ve gone through the entire HashMap, I’m done.
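That final pass over the HashMap might be sketched like this (the class and method names are hypothetical; returning the number of duplicate groups makes the no-duplicates case easy to detect):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class DupeReport {
    // Hypothetical printing pass that mirrors the traversal described above.
    static int printDuplicates( HashMap<Long, ArrayList<String>> dupesTable ) {
        int dupeGroups = 0;
        for( Map.Entry<Long, ArrayList<String>> entry : dupesTable.entrySet() ) {
            ArrayList<String> filenames = entry.getValue();
            if( filenames.size() > 1 ) {   // more than one file shares this checksum
                dupeGroups++;
                System.out.println( "These files are the same:" );
                for( String name : filenames )
                    System.out.println( "\t" + name );
                System.out.println();
            }
        }
        if( dupeGroups == 0 )              // never produce no output
            System.out.println( "No duplicate files found." );
        return dupeGroups;
    }
}
```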

This design and the earlier one of sorting the entries in a big array have one advantage that might be useful: In a future version, I could add an option on the command line that prints only unique files. That capability is handy when you have a backup directory and you’re not sure if the original and the backup are in sync. Any file that is not a duplicate then represents one that was not properly handled in the backup. For this feature, of course, I would print out only the checksums whose corresponding ArrayList contains exactly one filename.

From the design step, this feature is not presently a concern; nonetheless, it’s agreeable to know that my implementation has not precluded it.

Testing

The FileDedupe utility as described here works very well and at reasonable scale; I’ve run it without any difficulty on an entire disk consisting of more than 600,000 files. The problem with such a large run is that you can’t know for sure whether some duplicates might have been overlooked due to a defect, which brings up the question of testing.

Like many developers today, I write unit tests as I code. However, unit tests will not tell you everything you need to know here. You’re going to need to run user-acceptance tests (UATs), in which you create large directories with many files that you seed with known duplicates. Then you’ll need to make sure every duplicate shows up in the output list.

To do this properly, you need to write some more code. I’ll explore this task in Part 2 of this article, which will appear soon. I expect that by the time I’m done, you’ll feel comfortable that FileDedupe can be thoroughly tested.

Conclusion

Performance of FileDedupe is acceptable and certainly fine for my needs. It’s not a utility I expect to run frequently, so I’m not too worried. But I’m not free from concern either. I want to be able to share FileDedupe with my friends and colleagues, and I’m concerned that if they have very large collections of photos or tracks, they’ll be mildly thankful but not wowed with my work if the utility runs slower than they expect.

Because I do have some ego attachment to the software I write, I’d like it to run faster, possibly much faster. I can optimize the software now that it’s known to work and once I have all the tests I just referred to in place.

The slow part in the processing is reading through every byte of every file to compute the CRC-32 checksums. I could use faster algorithms, but in the next article, I’ll take you through an interesting optimization that means on average I’ll need to compute checksums on only a subset of the files. This is important because FileDedupe is I/O-bound, rather than CPU-bound. (However, on older machines or in VMs, it might become CPU-bound.) Therefore, anything I can do to reduce the I/O will translate into immediate improvements in performance. Meanwhile, put on your thinking caps and see if you can figure out how to identify unique file contents without computing checksums on every file.

The code, which consists of a few well-commented files, has no external dependencies. It is available on GitHub, along with a Maven file for easy building. If this kind of practical, intensely hands-on article that focuses on small projects is to your liking, please let me know.

Andrew Binstock

Andrew Binstock (@platypusguy) was formerly the editor in chief of Java Magazine. Previously, he was the editor of Dr. Dobb's Journal. He co-founded the company behind the open-source iText PDF library, which was acquired in 2015. His book on algorithm implementation in C went through 16 printings before joining the long tail. Previously, he was the editor in chief of UNIX Review and, earlier, the founding editor of the C Gazette. He lives in Silicon Valley with his wife. When not coding or editing, he studies piano.
