Speeding up SSH data transfer on Niagara machines

Defining the problem

We have been hearing complaints that SSH is slow on Niagara boxes, and I can only confirm it. However, there is a background story and a way to speed it up significantly. As you know, the strength of the UltraSPARC-T1 and UltraSPARC-T2 processors is in parallelism. If you run multithreaded applications, these boxes are just for you. If you run a single-threaded application, only one virtual CPU is doing the job while all the others sit idle, and you end up in a situation like this, showing the example of an SSH transfer:

Possible solutions

What can we do about it? On the UltraSPARC-T2 processor, we could make use of its hardware crypto acceleration for symmetric and hash algorithms (the T1 could accelerate RSA for the initial public key authentication, but that wouldn't help us in the actual transfer). There are several CRs filed that could help make that happen:

  • 6445288 ssh needs to be OpenSSL engine aware
  • 6619049 env variable should force OpenSSL to use pkcs#11 engine by default
  • 6591009 OpenSSL pkcs#11 engine should be configurable

We are also thinking about replacing the OpenSSL API in SunSSH completely with the PKCS#11 API. Some speed-up could be achieved in software alone -- we estimate that these two changes together could decrease the transfer time by 5-10% (both are new features found in OpenSSH 4.7):

  • 6625805 teach SSH about UMAC-64
  • 6616927 preserve MAC contexts between packets

I must admit we are not working on any of those RFEs right now since there is other stuff with higher priority, but there is also another approach to the problem -- harness more virtual CPUs for the SSH transfer. Ideally, the aim here would be a situation where all virtual CPUs participate, each taking care of transferring one part of the file:

Testing

Nice idea, and one might wonder how much we could gain here. To test it, I wrote a shell wrapper around SSH, calling it pscp -- parallel data transfer based on SSH. You can download the script here. I have just one T1000 box here in Prague for testing, so I did some transfers over the loopback only. The first impression is that you must not expect miracles -- it's not that the idea is bad, but on a 1Gb network or faster you quickly hit the disk I/O bottleneck. I was able to transfer up to 23MB of data per second over localhost on our T1000 with 12 parallel connections. That's a 4x speed-up over single-threaded scp(1) on the same box. However, if the data was not in memory, disk reads slowed it down and I saw only a 2x speed-up. The fact that I copied the data over the loopback, meaning I was also writing to the same disk, didn't help the speed, of course.

If you want to test the transfer speed only, use the -t option, which uses fake data so that no disk I/O is performed. The script also has some limitations -- you can't transfer recursively, for example; only single-file transfers are supported for now. You also need public key or GSS-API authentication configured; in other words, you need passwordless authentication. Patches are welcome, of course. The interface looks like this:

pscp - parallel data transfer based on SSH
Version 1.0
Usage: pscp user@host:file1 user@host:[dir | file2]
Options:
        -b      block size in bytes (default: 65536)
        -c      number of parallel connections (default: 6)
        -d      debug mode
        -h      this help
        -q      quiet mode; print only errors
        -s      SHA1 verification of the transferred file (use for debug only)
        -t      test mode; no reads, no writes, transfer only

Notes:
        - doesn't support local-to-local and remote-to-remote transfers
        - you might consider increasing sshd's MaxStartups on the server side
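
The splitting scheme behind those parallel connections can be sketched in a few lines of Python. This is a simplified illustration of the idea, not code taken from pscp, and the function name is mine: divide the file into contiguous ranges and hand each (offset, length) pair to its own SSH connection.

```python
def split_ranges(size, conns):
    """Split `size` bytes into at most `conns` contiguous
    (offset, length) ranges, one per SSH connection."""
    base, rem = divmod(size, conns)
    ranges, offset = [], 0
    for i in range(conns):
        # spread the remainder over the first `rem` ranges
        length = base + (1 if i < rem else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges

# a 100-byte file over the default 6 connections:
print(split_ranges(100, 6))
# [(0, 17), (17, 17), (34, 17), (51, 17), (68, 16), (84, 16)]
```

Each range maps directly onto the block size and connection count options above; a file smaller than the connection count simply uses fewer connections.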

I got 11 seconds when transferring 256MB of random data on my box, while a normal SCP transfer took 42 seconds:

$ time ~/pscp -c 12 256.data ogma:/export/tmp2
init done
firing up parallel transfer with 12 connection(s) NOW
waiting for all ssh clients to finish
all finished


real    0m11.296s
user    1m32.171s
sys     0m12.963s


$ time scp 256.data ogma:/export/tmp2
256.data        100% |*********************************|   250 MB    00:40    

real    0m41.833s
user    0m37.453s
sys     0m4.095s

To use or not to use?

Before you use it, I strongly suggest you run the pscp-test script, which tests various download and upload transfers over localhost. You run it like this:

~/pscp-test /export/pscp/ ~/pscp

The first parameter is a temporary directory with at least 0.5GB of free space; the second argument is the full path to the pscp script. You should see ALL TESTS PASSED as the last line of output and no obvious errors. If you don't, then do NOT use the pscp script, and please report the problem to me. Note that you might need to increase MaxStartups to 20 in the server-side sshd_config. Alternatively, change conns=20 to conns=10 in pscp-test if you experience a problem.
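
For reference, the server-side change is a single line (20 here matches the conns=20 default used by pscp-test):

```
# in the server's sshd_config, then restart or HUP sshd:
MaxStartups 20
```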

Notes on implementation and portability

Surprisingly, I spent 90% of the time on the last 10% of the functionality. After I got it working on Nevada, I had to rewrite some code so that it worked on S10 -- no stat(1) in S10, no Python 2.4 (only 2.3), etc. I think it should now work on other systems that have Python, but I tested only Nevada and S10 (and download only on FreeBSD, since there was no Python there). Please use the pscp-test script above before you decide to use pscp.

Should this be part of scp(1)? I don't think so. I would rather like to see it in something like rsync.

One interesting problem was how to transfer all those parts of the file. The obvious solution:

dd if=file bs=xx skip=yy count=zz | \
    ssh host "dd of=file bs=xx seek=yy count=zz"

...doesn't work, since there is no way to make sure that the receiving dd reads whole blocks from the SSH pipe. Remember, if dd reads a partial block it just updates the counter and reports it later, but it never reads the rest of the block. One could try a block size of 1024 bytes and hope that no fragmentation ever occurs, but that's not good enough and it would be very slow anyway. The only idea I got was to use Perl, Python, or something similar to read from the ssh process, since I don't think we have a command in the base system that would fit here.
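
In a scripting language the short-read problem is trivial to solve: just loop until the full count has arrived. A minimal sketch of what such a reader has to do (my illustration, not code from pscp):

```python
def read_exact(stream, count, bufsize=65536):
    """Read exactly `count` bytes from a stream, looping over
    short reads -- the guarantee that dd's count= cannot give
    when its input is a pipe."""
    chunks = []
    remaining = count
    while remaining > 0:
        data = stream.read(min(bufsize, remaining))
        if not data:
            raise EOFError("stream ended %d bytes short" % remaining)
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```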

The final solution was even better -- just lseek(2) in Python and then fork a cat(1) process. Thanks to Martin Horcicka for the Python code, since I'm not familiar with this language.
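
The receiving side of that trick can be sketched like this (my reconstruction of the idea, not the actual pscp code): the parent opens the target file and lseeks to the part's offset, then cat(1), which copes with short reads just fine, copies the ssh pipe onto the seeked descriptor.

```python
import os

def write_part(path, offset, in_fd):
    """Seek the output file to `offset`, then fork and exec cat(1)
    with stdin = in_fd (the ssh pipe) and stdout = the seeked file
    descriptor. Returns cat's raw exit status."""
    out = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.lseek(out, offset, os.SEEK_SET)   # position once, in the parent
    pid = os.fork()
    if pid == 0:
        os.dup2(in_fd, 0)                # cat reads the ssh pipe
        os.dup2(out, 1)                  # and writes at the offset
        os.execvp("cat", ["cat"])
    os.close(out)
    _, status = os.waitpid(pid, 0)
    return status
```

The point is that cat never has to know the offset at all; the file position it inherits through the descriptor does the work.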

Conclusion

I believe the speed-up from parallel transfer can be much better than 4x, depending on I/O speed and the number of virtual CPUs. If you get better results, please share your experience in the comments section. Thanks!

Comments:

Did you try also using the blowfish cipher? Using nothing else but specifying blowfish with a -c option to scp gave me a pretty decent boost in transfer rate. Perhaps using -c blowfish along with your parallel routine would be even better (even if it isn't hardware accelerated).

Posted by William Hathaway on November 26, 2007 at 05:21 AM CET #

hi William, I tried it now. For the example I showed in the entry, the time dropped to 82% of the previous scp transfer time. However, the difference when running in parallel is not that significant. If we had enough CPUs we could use more connections; unfortunately, the problem I saw looked like disk I/O being unable to keep up the pace.

using blowfish with the -t option is similar - I now get 23 seconds to transfer 700MB of data, while it was 26 with the default 3DES (16 parallel connections in both cases, over localhost on a T1000). I guess I should just add a -o option to the pscp script so that users can pass any SSH options. Of course, the bigger the file the better for pscp, since the script has some overhead before starting up the SSH connections.

Posted by Jan on November 26, 2007 at 08:57 AM CET #

HPN SSH seems to help a bit - http://www.psc.edu/networking/projects/hpn-ssh/

Posted by Mads on December 13, 2007 at 01:31 PM CET #

trying the script pscp-test, but it fails here (adding set +x to debug):
+ grep \^pscp - parallel data transfer based on SSH$ /etc/scripts/pscp-test
+ 1> /dev/null
+ [ 1 -ne 0 ]
+ echo /etc/scripts/pscp-test doesn't look like pscp script

what is the meaning of this line?

thanks,

gerard

Posted by henry on December 29, 2007 at 11:14 PM CET #

hi Gerard, the meaning is to check that you actually use the right pscp for testing. And that's not the case here:

+ echo /etc/scripts/pscp-test doesn't look like pscp script

which is exactly right - you need to give it the pscp script path as an argument, not the pscp-test script.

/etc/scripts/pscp-test /tmp /etc/scripts/pscp

Posted by Jan on December 30, 2007 at 07:57 AM CET #

hi, does anybody know if there are improvements regarding scp using hardware crypto acceleration on T2?
We need to achieve an scp throughput of 80-90 MByte/sec.

thx,Martin

Posted by Martin Steiner on February 01, 2008 at 03:36 PM CET #

About

Jan Pechanec
