Speeding up SSH data transfer on Niagara machines
By janp on Nov 25, 2007
Defining the problem
We have been hearing complaints that SSH is slow on Niagara boxes. I can't do anything but confirm it. However, there is a background story and a way to speed it up significantly. As you know, the strength of the UltraSPARC T1 and UltraSPARC T2 processors is in parallelism. If you run multithreaded applications, these boxes are just for you. If you run a single-threaded application, only one virtual CPU is doing the job, all the others are idle, and you end up in a situation like this, shown here on the example of an SSH transfer:
What can we do about it? On the UltraSPARC T2 processor, we could make use of its hardware crypto acceleration for symmetric and hash algorithms (the T1 could accelerate RSA for the initial public key authentication, but that wouldn't help us in the actual transfer). There are several CRs filed that could help that happen:
- 6445288 ssh needs to be OpenSSL engine aware
- 6619049 env variable should force OpenSSL to use pkcs#11 engine by default
- 6591009 OpenSSL pkcs#11 engine should be configurable
We are also thinking about replacing the OpenSSL API in SunSSH completely with the PKCS#11 API. Some speed-up might be achievable in software only -- we estimate that these two together could decrease the transfer time by up to 5-10% (both are new features found in OpenSSH 4.7):
I must admit we aren't working on any of those RFEs right now since there is other stuff with higher priority, but there is also another approach to the problem -- harness more virtual CPUs for the SSH transfer. Ideally, the aim here would be to achieve a situation where all virtual CPUs participate, each taking care of transferring one part of the file:
A nice idea, and one might wonder how much we could gain here. To test it, I wrote a shell wrapper around SSH, calling it pscp -- parallel data transfer based on SSH. You can download the script here. I have just one T1000 box here in Prague for testing, so I did some transfers over loopback only. The first impression is that you must not expect miracles -- it's not that the idea is bad, but on a 1Gb network or above you quickly hit the disk I/O bottleneck. I was able to transfer up to 23MB of data per second over localhost on our T1000 with 12 parallel connections. That's a 4x speed-up against single-threaded scp(1) on the same box. However, if the data was not in memory, disk reading slowed it down and I could see only a 2x speed-up. The fact that I copied the data over the loopback, meaning that I was also writing to the same disk, didn't help the speed, of course.
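The splitting itself is simple arithmetic: the file is divided into one contiguous byte range per connection. A minimal sketch in Python of how such a partition might be computed (the function name partition is my illustration, not pscp's actual code):

```python
def partition(size, nconns):
    """Split `size` bytes into nconns contiguous (offset, length)
    ranges, one per parallel connection."""
    base, rem = divmod(size, nconns)
    ranges, off = [], 0
    for i in range(nconns):
        length = base + (1 if i < rem else 0)   # spread the remainder
        ranges.append((off, length))
        off += length
    return ranges
```

Each range is then handed to its own ssh connection; together the ranges cover the file exactly, with no gaps or overlaps.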
If you want to test the transfer speed only, use the -t option, which uses fake data so that no disk I/O is performed. The script also has some limitations -- you can't transfer recursively, for example. Only transfer of a single file is supported for now. You also need public key or GSS-API authentication configured; in other words, you need passwordless authentication. Patches are welcome, of course. The interface looks like this:
pscp - parallel data transfer based on SSH
Version 1.0

Usage: pscp user@host:file1 user@host:[dir | file2]

Options:
  -b    block size in bytes (default: 65536)
  -c    number of parallel connections (default: 6)
  -d    debug mode
  -h    this help
  -q    quiet mode; print only errors
  -s    SHA1 verification of the transfered file (use for debug only)
  -t    test mode; no reads, no writes, transfer only

Notes:
  - doesn't support local-to-local and remote-to-remote transfers
  - you might consider to increase sshd's MaxStartups on server side
I got 11 seconds when transferring 256MB of random data on my box, while the normal scp transfer took 42 seconds:
$ time ~/pscp -c 12 256.data ogma:/export/tmp2
init done
firing up parallel transfer with 12 connection(s) NOW
waiting for all ssh clients to finish
all finished

real    0m11.296s
user    1m32.171s
sys     0m12.963s

$ time scp 256.data ogma:/export/tmp2
256.data 100% |*********************************|   250 MB    00:40

real    0m41.833s
user    0m37.453s
sys     0m4.095s
To use or not to use?
Before you use it, I strongly suggest you run the pscp-test script, which tests various download and upload transfers over localhost. You run it like this:
~/pscp-test /export/pscp/ ~/pscp
The 1st parameter is a temporary directory with at least 0.5GB of free space; the 2nd argument is the full path to the pscp script. You should see ALL TESTS PASSED as the last line of output and no obvious error output. If you don't, do NOT use the pscp script, and please report the problem to me. Note that you might need to increase MaxStartups in the server-side sshd_config to 20. Alternatively, change conns=10 in pscp-test if you experience a problem.
Notes on implementation and portability
Surprisingly, I spent 90% of the time on the last 10% of the functionality. After I got it working on Nevada, I had to rewrite some code so that it also worked on S10 -- no stat(1) in S10, no Python 2.4 (only 2.3), etc. I now think it should work on other systems that have Python, but I tested Nevada and S10 only (and download only on FreeBSD, since no Python was there). Please use the pscp-test script above before you decide to use pscp.
Should this be part of scp(1)? I don't think so. I would rather see it in something like rsync.
One interesting problem was how to transfer all those parts of the file. The obvious solution:
dd if=file bs=xx skip=yy count=zz | \
    ssh host "dd of=file bs=xx seek=yy count=zz"
...doesn't work, since there is no way to make sure that dd reads whole blocks from the SSH pipe. Remember, if dd reads a partial block, it just updates the counter and reports it later; it never reads the rest of the block. One could try a block size of 1024 bytes and hope that no fragmentation ever occurs, but that's not good enough and it would be very slow anyway. The only idea I had was to use Perl, Python or something similar to read from the ssh process, since I don't think we have a command in the base system that would fit here.
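The difference is easy to show: a scripting language can loop until a full block has arrived, which is exactly the step dd refuses to do. A minimal sketch of such a read loop (the name read_exact is mine, not pscp's):

```python
def read_exact(stream, n):
    """Read exactly n bytes from a pipe-like stream, looping over
    partial reads instead of giving up the way dd(1) does."""
    chunks = []
    while n > 0:
        data = stream.read(n)
        if not data:                 # EOF before the block was complete
            raise EOFError("stream ended before full block arrived")
        chunks.append(data)
        n -= len(data)
    return b"".join(chunks)
```

With a helper like this, a short read from the SSH pipe simply means another iteration, not a corrupted block count.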
The final solution was even better -- just lseek(2) in Python and then fork a cat(1) process. Thanks to Martin Horcicka for the Python code, since I'm not familiar with this language.
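The trick is that a forked cat(1) inherits the already-positioned file descriptor, so cat itself never needs to know about offsets. A minimal sketch of the receiving side under that idea (my own illustration, not pscp's actual code):

```python
import os

def write_at(path, offset, data_fd):
    """Copy everything readable from data_fd into `path`, starting at
    byte `offset`: lseek(2) first, then fork a cat(1) child that
    inherits the positioned descriptor as its stdout."""
    out = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.lseek(out, offset, os.SEEK_SET)
    pid = os.fork()
    if pid == 0:                     # child: become cat(1)
        os.dup2(data_fd, 0)          # stdin  <- the incoming data
        os.dup2(out, 1)              # stdout -> the file, at `offset`
        try:
            os.execvp("cat", ["cat"])
        finally:
            os._exit(127)            # only reached if exec failed
    os.close(out)
    os.waitpid(pid, 0)               # parent: wait for cat to finish
```

Each parallel connection runs one such writer with its own offset, so all the parts land in the right places of the same file.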
I believe the speed of parallel transfer can be much better than 4x depending on I/O speed and number of virtual CPUs. If you get better results please share your experience in the comments section. Thanks!