Thursday Sep 09, 2010

Another detour: short-circuiting cat(1)

What do you think happens when you do this:

        # cat vmcore.4 > /dev/null

If you've used Unix systems before, you might expect this to read vmcore.4 into memory and do nothing with it, since cat(1) reads a file and "> /dev/null" sends the output to the null driver, which accepts data and discards it. This appears pointless, but it can be useful for bringing a file into memory, for example, or for evicting other files from memory (if this file is larger than the total cache size).

But here's a result I found surprising:

        # ls -l vmcore.1
        -rw-r--r--   1 root     root     5083361280 Oct 30  2009 vmcore.1

        # time cat vmcore.1 > /dev/null
        real    0m0.007s
        user    0m0.001s
        sys     0m0.007s

That works out to 726GB/s (5,083,361,280 bytes in 0.007 seconds). That's way too fast, even for reading from main memory. The obvious question: how does cat(1) know that I'm sending to /dev/null, so that it doesn't bother to read the file at all?

Of course, you can answer this by examining the cat source in the ON gate. There's no special case for /dev/null (though one does exist elsewhere). Rather, this behavior is a consequence of an optimization in which cat(1) maps the input file and writes the mapped buffer, instead of using read(2) to fill a buffer and writing that. With truss(1) it's clear exactly what's going on:

        # truss cat vmcore.1 > /dev/null
        execve("/usr/bin/cat", 0x08046DC4, 0x08046DD0)  argc = 2
        [ ... ]
        write(1, ..., 8388608) = 8388608
        mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 8388608) = 0xFE600000
        write(1, ..., 8388608) = 8388608
        mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x01000000) = 0xFE600000
        [ ... ]
        mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E000000) = 0xFE600000
        write(1, ..., 8388608) = 8388608
        mmap64(0xFE600000, 8253440, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E800000) = 0xFE600000
        write(1, ..., 8253440) = 8253440
        llseek(3, 0x000000012EFDF000, SEEK_SET)         = 0x12EFDF000
        munmap(0xFE600000, 8388608)                     = 0
        llseek(3, 0, SEEK_CUR)                          = 0x12EFDF000
        close(3)                                        = 0
        close(1)                                        = 0
        _exit(0)
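To make the pattern in that trace concrete, here is a minimal sketch of a map-and-write copy loop. It is not the actual cat source: the name mmap_copy and the CHUNK constant are made up for illustration, and unlike the trace (which remaps one fixed address with MAP_FIXED) it simply maps and unmaps a fresh window on each pass.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical window size, picked to match the 8388608-byte maps in the trace. */
#define	CHUNK	((off_t)8 * 1024 * 1024)

/*
 * Copy the regular file open on in_fd to out_fd by mapping one window at a
 * time and handing the mapped address straight to write(2). The file's data
 * is only touched if whatever sits behind out_fd reads the buffer.
 * Returns 0 on success, -1 on error (including a non-regular input).
 */
int
mmap_copy(int in_fd, int out_fd)
{
	struct stat st;
	off_t off;

	if (fstat(in_fd, &st) == -1 || !S_ISREG(st.st_mode))
		return (-1);

	/* CHUNK is a multiple of the page size, so every offset is aligned. */
	for (off = 0; off < st.st_size; off += CHUNK) {
		off_t left = st.st_size - off;
		size_t len = (size_t)(left < CHUNK ? left : CHUNK);
		size_t done = 0;
		char *p;

		assert(len > 0);
		p = mmap(NULL, len, PROT_READ, MAP_SHARED, in_fd, off);
		if (p == MAP_FAILED)
			return (-1);

		/* write(2) may be partial (on a pipe, say), so loop. */
		while (done < len) {
			ssize_t n = write(out_fd, p + done, len - done);

			if (n <= 0) {
				(void) munmap(p, len);
				return (-1);
			}
			done += (size_t)n;
		}
		(void) munmap(p, len);
	}
	return (0);
}
```

Called with a descriptor for /dev/null as out_fd, every write() in this loop should succeed without the mapped pages ever being touched, which is the behavior described here.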

cat(1) really is issuing tons of writes from the mapped file, but the /dev/null device just returns immediately without doing anything. Because nothing ever touches the mapped pages, the file's data is never actually read. If you actually want to read the file (for the side effects mentioned above, for example), you can defeat the optimization with an extra pipe:

        # time cat vmcore.1 | cat > /dev/null
        real    0m32.661s
        user    0m0.865s
        sys     0m32.127s

That's more like it: about 155MB/s streaming from a single disk. In this case the second cat invocation can't use this optimization since stdin is actually a pipe, not the input file.
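If you are writing your own tool and want the side effect of reading, the other option is to use read(2) directly rather than relying on a pipe. The following is a rough sketch under that assumption; read_discard is a hypothetical name, and the 64KB buffer size is arbitrary.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read everything available on fd into a scratch buffer and throw it away.
 * Unlike the map-and-write approach, read(2) has to copy the data into our
 * buffer, so the file's contents really are fetched.
 * Returns the number of bytes read, or -1 on error.
 */
long long
read_discard(int fd)
{
	char buf[64 * 1024];	/* arbitrary scratch buffer size */
	long long total = 0;
	ssize_t n;

	assert(fd >= 0);
	while ((n = read(fd, buf, sizeof (buf))) > 0)
		total += n;

	return (n == -1 ? -1 : total);
}
```

For a regular file the return value should equal the file size, which makes for an easy sanity check against ls -l.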

There's another surprising result of the initial example: the file's access time actually gets updated even though it was never read:

        # ls -lu vmcore.2
        -rw-r--r--   1 root     root     6338052096 Nov  3  2009 vmcore.2

        # time cat vmcore.2 > /dev/null
        real    0m0.040s
        user    0m0.001s
        sys     0m0.008s

        # ls -lu vmcore.2
        -rw-r--r--   1 root     root     6338052096 Aug  6 15:55 vmcore.2

This wasn't always the case, but it was fixed back in 1995 under bug 1193522, which is where this comment and code probably came from:

        /*
         * NFS V2 will let root open a file it does not have permission
         * to read. This read() is here to make sure that the access
         * time on the input file will be updated. The VSC tests for
         * cat do this:
         *      cat file > /dev/null
         * In this case the write()/mmap() pair will not read the file
         * and the access time will not be updated.
         */

        if (read(fi_desc, &x, 1) == -1)
                read_error = 1;
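To watch this from a program rather than with ls -lu, you can compare st_atime before and after a one-byte read. This is only a sketch: atime_of and read_one_byte are hypothetical helper names, and whether the timestamp visibly changes depends on the filesystem and its mount options (access time updates may be disabled or deferred), so treat the result as an observation rather than a guarantee.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Store the access time of path in *out. Returns 0 on success, -1 on error. */
int
atime_of(const char *path, time_t *out)
{
	struct stat st;

	assert(out != NULL);
	if (stat(path, &st) == -1)
		return (-1);
	*out = st.st_atime;
	return (0);
}

/*
 * Open path and read a single byte from it, mirroring the one-byte read()
 * in the excerpt above. As in that code, only a -1 from read(2) counts as
 * a failure, so an empty file is fine. Returns 0 on success, -1 on error.
 */
int
read_one_byte(const char *path)
{
	char x;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd == -1)
		return (-1);
	n = read(fd, &x, 1);
	(void) close(fd);
	return (n == -1 ? -1 : 0);
}
```

Calling atime_of(), then read_one_byte(), then atime_of() again on the same path gives the two timestamps to compare.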

I found this all rather surprising because I think of cat(1) as one of the basic primitives that's dead-simple by design.  If you really want something simple to read a file into memory, you might be better off with dd(1M).
