File transfer technologies for HPC systems
Posted: 28 May 2013 | 14:32
While a surprisingly high proportion of HPC users are happy to keep their data on a single HPC service, or at most to move it within the hosting institution, sometimes it becomes necessary to move large volumes of data between different sites and institutions. As anyone who has ever tried to support users in this endeavour knows, it can be much harder to get good performance than it should be. This post is an attempt to document the available tools and technologies as well as common problems and bottlenecks.
Storage system performance
Most users assume that network speed is going to be the limiting factor on performance, but in many cases the performance of the local storage system is just as important. A single user copying data on a typical HPC file-system (such as Lustre or GPFS) usually sees 200-400 MB/s. While this is faster than the ubiquitous Gigabit Ethernet (1 Gbit/s = 125 MB/s), it is significantly less than the readily available 10 Gbit/s networks (1250 MB/s). It is therefore important to benchmark local file-system performance to determine if this is going to be the limiting factor for file-transfers. Many sophisticated benchmarks (e.g. Iozone: http://www.iozone.org) are available to perform an in-depth analysis of file-system performance. However, these often concentrate on aggregate performance under high levels of load. A simple view of the typical performance seen by a single user trying to transfer a large file can easily be generated just by using dd to copy a couple of gigabytes of data. Modern versions of dd automatically calculate the achieved bandwidth and provide flags to force output to be flushed to disk. Pure read or write performance can be checked by using /dev/zero as the source or /dev/null as the sink of the copy.
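As a rough illustration of the same kind of single-user write test, here is a minimal Python sketch. It is not a replacement for dd or Iozone; the function name and the default sizes are made up for this example, and the small default is only there to keep the sketch quick to run.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_mb=1):
    """Write size_mb of zeros to a temporary file in `directory`, flush it
    to disk, and return the achieved rate in MB/s (1 MB = 10**6 bytes)."""
    block = bytes(block_mb * 10**6)  # a block of zero bytes, much like /dev/zero
    blocks = size_mb // block_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the test file
    return blocks * block_mb / elapsed
```

Calling `write_throughput_mb_s("/path/on/the/filesystem", size_mb=2048)` with a directory on the file-system under test gives a figure that can be compared directly with the 125 MB/s of Gigabit Ethernet. A couple of gigabytes is a more realistic size than the default, since small writes mostly measure caching.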
SSH and associated tools
The most popular file-transfer tools (and to a first approximation the only ones users have ever heard of) all piggy-back on ssh connections. These include:
These tools are universally available and well documented, and their security considerations are well understood. From the user perspective they also pass through any firewall that allows ssh logins. Unfortunately they are not very fast. The fundamental problem with all these mechanisms is that authentication and data pass through the same encrypted connection. I usually see default ssh-based data transfers run at roughly 30-40 MB/s on current hardware. By selecting a very lightweight cipher such as arcfour, data transfer rates can be increased to ~130 MB/s. However, this may compromise the security of the authentication. In addition to the encryption overhead, the commonly used OpenSSH implementation may be performance-limited by the size of its internal flow control buffers (see http://www.psc.edu/index.php/hpn-ssh). In comparison, between the same two hosts a simple C program that opens an unencrypted socket runs at 300 MB/s. For this reason most high throughput data transfer tools separate authentication/control and data onto different sockets, in a similar way to the classic FTP protocol, even though many of them can still use ssh to provide the control connection. On the other hand, it does seem to be possible to get good throughput by using multiple ssh sessions in parallel. Each connection is encrypted separately, so running multiple parallel transfers makes use of multiple cores on the source and destination hosts, provided they have plenty of free cores available. This also works around the flow control buffer problem.
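To put these rates in perspective, the following small sketch converts a sustained rate into the time needed to move one terabyte. The rates are the ones quoted above, with 35 MB/s taken as the midpoint of the 30-40 MB/s range; the function name is just for illustration.

```python
def transfer_hours(size_gb, rate_mb_s):
    """Hours needed to move size_gb gigabytes at a sustained rate_mb_s MB/s
    (decimal units: 1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_s / 3600


# Rates quoted in the text; 35 MB/s is an assumed midpoint of 30-40 MB/s.
for label, rate in [("default ssh", 35), ("ssh + arcfour", 130), ("plain socket", 300)]:
    print(f"{label:14s} {transfer_hours(1000, rate):5.1f} h per TB")
```

This works out at roughly 7.9, 2.1 and 0.9 hours per terabyte respectively, assuming the rate is sustained for the whole transfer.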
Over long distance, high latency connections the performance of a socket degrades because of the TCP acknowledgement window. This is the amount of unacknowledged data that a socket is prepared to have outstanding before stopping to wait for acknowledgements. For good performance this window needs to be increased as the network speed or end-to-end network latency increase.
Linux hosts now do a good job of auto-tuning this parameter, so it is not necessary to set it manually in programs, but the default maximum size can still be too small, especially for 10G networks and/or long distance transfers. These maximum values can easily be increased by system administrators but not by normal users. See http://www.psc.edu/index.php/networking/641-tcp-tune.
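The window needed to keep a link busy is the bandwidth-delay product: the link speed multiplied by the round-trip time. The sketch below works this out and, in the other direction, gives the best rate a single socket can reach with a fixed window. The function names are hypothetical, and the 4 MiB window used in the comment is only an illustrative value, not a statement about any particular system's default.

```python
def window_bytes(gbit_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to keep a link of
    gbit_per_s Gbit/s busy when the round-trip time is rtt_ms milliseconds.
    Expects integer arguments."""
    bits_in_flight = gbit_per_s * 10**9 * rtt_ms // 1000
    return bits_in_flight // 8


def max_rate_mb_s(window, rtt_ms):
    """Upper bound on single-socket throughput in MB/s for a window in bytes."""
    return window / (rtt_ms / 1000) / 10**6


# A 10 Gbit/s path with a 100 ms round trip needs 125,000,000 bytes in flight.
# With an (illustrative) 4 MiB window the same path tops out near 42 MB/s.
```

The numbers make the problem clear: on a long-distance 10G path the required window is far larger than typical buffer limits, so a single socket sees only a small fraction of the link speed.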
In addition users can work around the problem by using multiple data sockets in parallel: each individual socket has limited performance but in combination performance is increased up to the physical limits on the network. This strategy of using multiple sockets also helps with many other network performance problems so it is a common option in many high performance data transfer tools.
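One simple way to drive several sockets at once is to give each stream its own contiguous byte range of the file. The helper below is a minimal sketch of that bookkeeping only; it is not how any particular tool divides its work, and the actual sending of each range is left out.

```python
def split_ranges(size, streams):
    """Divide `size` bytes into `streams` contiguous (offset, length) pieces
    whose lengths differ by at most one byte."""
    base, extra = divmod(size, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


# split_ranges(10, 4) gives [(0, 3), (3, 3), (6, 2), (8, 2)]
```

Each (offset, length) pair can then be handed to a separate worker that seeks to the offset and sends that many bytes over its own socket.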
Ephemeral port ranges and firewalls
The introduction of separate data connections complicates firewall setup to some extent. On the server side a range of ports needs to be open in the firewall to allow incoming data connections. On the client side outgoing socket connections need to be permitted to the open port range on the corresponding servers. The security impact of opening a port range like this on a server is fairly minimal. Obviously the range should not overlap with any port used by system daemons, but this can easily be checked by running netstat. The location of the open port range is a configuration choice by the server administrator. However, all other things being equal, the default Globus port range of 50000-51000 is as good a choice as any, if only to make it easier to write firewall rules that apply to multiple remote sites. Different file-transfer tools should all be able to share the same open port range, so there is no need to allow multiple ranges. The file-transfer tools only listen on these ports briefly as part of a previously authenticated transaction, and even then only as a process with the permissions of the user doing the file-transfer, so in themselves they are not a big security risk. The biggest risk is that users may utilise the open ports to run additional unauthorised services such as bit-torrent (these obviously run as the user). Some additional monitoring of user activity might therefore be in order.
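As an alternative to reading netstat output by eye, a candidate range can be checked by simply trying to bind each port. This is a sketch of that different technique rather than a netstat equivalent: it only covers IPv4 TCP, and it reports a port as busy for any bind failure (including lack of permission). The function name is made up for this example.

```python
import socket


def busy_ports(first, last, host=""):
    """Return the TCP ports in [first, last] that cannot currently be bound
    on `host` (the empty string means all local IPv4 addresses)."""
    busy = []
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                busy.append(port)  # already in use, or not permitted
    return busy
```

Running `busy_ports(50000, 51000)` on the server before opening the range shows whether anything is already using it. An empty list means every port could be bound at the time of the check.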
Unfortunately as cross-site transfers may involve up to 2 host-firewalls and 2 site-firewalls all operated by different people, it usually takes several attempts to get these open port ranges set up correctly.
In general firewall configuration is the biggest problem when it comes to network performance. Firewall administrators are usually primarily concerned with security and give insufficient thought to how their firewall rules will impact network performance or the ability of users to debug network problems.
The most reliable way of debugging these firewall issues is to send a series of TCP SYN packets to the target port with varying TTL values and examine the returned ICMP time-exceeded packets. This is similar to a normal traceroute but, rather than using a UDP or ICMP probe packet that might be subject to different firewall rules, it uses exactly the same packets that are used to initiate the TCP socket connection we are trying to debug. Modern versions of traceroute and tools like nmap have special flags to provide this functionality, e.g.:
traceroute -T -p 50050 server-hostname.example.com
Unlike a normal traceroute, this technique requires a raw socket so needs to be performed by someone with root privilege on the client system. Even this mechanism is only of limited use if the traffic traverses a firewall that blocks the ICMP time-exceeded packets.
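For users without root access, a much cruder check is still possible: attempt an ordinary TCP connection and look at how it fails. The sketch below is an assumption-laden illustration, not a substitute for the TCP traceroute above. It cannot say where on the path packets are being dropped, only whether the port was reachable from the client, and the function name and result strings are invented for this example.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt a TCP connection to host:port and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # something answered, but nothing is listening
    except socket.timeout:
        return "timeout"   # no reply: often a firewall silently dropping packets
    except OSError as exc:
        return f"error: {exc}"  # e.g. name lookup failure or no route to host
```

A result of "refused" usually means the packets got through but no transfer tool was listening at that moment, whereas "timeout" is the typical signature of a firewall that drops the connection attempt.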
FTPS is the minimal extension to the FTP protocol that encrypts the control channel to protect authentication data. It is defined in RFC-4217. At first glance this seems a good choice for simple password authenticated transfers. However there does not seem to be good support in terms of client side tools.
BBFTP and BBCP are two different but fairly similar tools, differing mainly in user interface. BBFTP provides an FTP-style interface, whereas BBCP looks more like scp. Both tools use ssh to provide the control connection but transfer data over additional unencrypted sockets.
Grid-FTP is the Swiss army knife of file-transfer technologies. It can be configured to operate as an FTPS server and can be initiated via ssh in the same way as BBFTP/BBCP, but in its default configuration it runs as a server listening on port 2811 and using certificate-based authentication. Authentication in this mode is bi-directional, meaning that both the server and the users need certificates. In practice the transfers are authenticated using short-lived proxy certificates to minimise the impact of any credential loss.
Grid-FTP is part of the Globus suite of tools but can be installed independently of the rest of the tools. Globus software is notoriously difficult to build from source code but the RPM support on most versions of Linux is quite good (see http://www.globus.org/toolkit/docs/5.2/5.2.4/admin/install/#install-bininst). You need the globus-gridftp bundle to get both client and server tools. Make sure you set the correct ephemeral port range in /etc/gridftp.conf once the packages are installed. Unfortunately you usually need an ephemeral port range enabled at both ends of the transfer, as data connections can be made in either direction depending on the direction of data movement. Client tools usually get the available port range from the GLOBUS_TCP_PORT_RANGE environment variable.
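When client tools are launched from a script, the port range can be passed in through the environment. The sketch below assumes the usual "min,max" form for GLOBUS_TCP_PORT_RANGE; check the Globus documentation for your version before relying on it. The helper name is hypothetical.

```python
import os


def port_range_env(first, last, base_env=None):
    """Return a copy of the environment with GLOBUS_TCP_PORT_RANGE set.
    Assumes the variable takes the form "min,max"."""
    if not (0 < first <= last <= 65535):
        raise ValueError(f"invalid port range {first}-{last}")
    env = dict(os.environ if base_env is None else base_env)
    env["GLOBUS_TCP_PORT_RANGE"] = f"{first},{last}"
    return env
```

The result can be passed as the `env` argument of `subprocess.run` when starting a client such as globus-url-copy, so the setting applies to that one process without changing the user's shell configuration.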
Grid-FTP servers can perform “third-party” transfers. This means a client can contact 2 independent servers and get them to directly transfer data between them.
Grid-FTP servers can also be “striped” across several head nodes. However, this option is less useful than it was, now that high-performance network interfaces may easily be faster than the file systems.
Because Grid-FTP has a well-defined command protocol (based on FTP), many different client tools can interoperate using Grid-FTP, and systems like iRODS can expose their data via the Grid-FTP protocol.
Grid-FTP over ssh
When run from the command line and initiated via ssh, Grid-FTP provides essentially the same functionality as BBFTP/BBCP. To enable this option you need to run the globus-gridftp-server-enable-sshftp program on the server to create the sshftp script that is started by ssh. You can either run this as root to create /etc/grid-security/sshftp or individual users can run it with the –non-root flag to create $HOME/.globus/sshftp.
One common problem with sshftp arises on HPC systems, which frequently have both internal and external network connections. The server has to work out which IP address it should tell the client to contact it on. Normally it uses the interface the request came in on, but when started via ssh the server can't work this out and has to guess. Depending on how your hostname and /etc/hosts are configured, it can guess wrong and open the data connections on the wrong interface. If this happens you can either hardwire the data connections to a particular interface in /etc/gridftp.conf or add a GLOBUS_HOSTNAME environment variable to the sshftp script to set it for sshftp connections only. A similar problem occurs if the server is behind a firewall performing Network Address Translation.
Grid-FTP as a server
Running a full Grid-FTP server is more effort. You need a host certificate for your server and to set up the /etc/grid-security/grid-mapfile with an entry for each user permitted to use Grid-FTP, specifying their certificate name. You also need to install copies of all the CA certificates you trust to sign user certificates. Luckily there are bundles of generally trusted CA certificates you can install.
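Each grid-mapfile entry is a quoted certificate subject name followed by the local account (or a comma-separated list of accounts) it maps to. As a rough sketch of that layout, the following parser reads such entries into a dictionary. The subject name and account in the comment are invented for illustration, and the parser is deliberately strict: it expects exactly two fields per line.

```python
import shlex


def parse_grid_mapfile(text):
    """Map each certificate subject name to its list of local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex handles the double quotes around the subject name.
        subject, accounts = shlex.split(line)
        mapping[subject] = accounts.split(",")
    return mapping


# Hypothetical entry:
#   "/C=UK/O=Example/OU=HPC/CN=jane doe" jdoe
```

A script like this can be handy for checking that every entry points at an account that actually exists on the system.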
In its default server configuration, Grid-FTP (like the rest of the Globus toolkit) uses proxy certificates for authentication. The Globus toolkit also provides a special server (called myproxy) for managing these. At its simplest myproxy acts as a dropbox for proxy certificates. Users can upload proxies to the server setting a download password. Later on they (or some script) can contact the server and download the proxy by providing the download password. If a user trusts a particular myproxy server they can keep their full certificate in the myproxy server and generate their proxies there.
Finally, myproxy can be used as a way of generating certificates for users. In this case it needs to be linked to some other form of authentication like a local LDAP. Users use their LDAP credentials to access the myproxy server and the myproxy server automatically generates a certificate corresponding to this credential signed by an internal Certificate Authority. As this internal authority is not one of the generally-recognised CAs, this can really only be used as a way of authenticating services run by the same organisation that runs the LDAP/myproxy service.
While Grid-FTP provides an efficient and feature-rich mechanism for data transfer, the certificate-based authentication is a lot of trouble for everyone. Users need to go through a complicated procedure to obtain a certificate in the first place and system administrators need to maintain a mapping between these certificates and user accounts.
Globus Online is an attempt to provide a much more user-friendly interface built on the same underlying technology.
The Globus Online website provides a web interface for users to set up, monitor and control file-transfers. Large numbers of file-transfers can be set up in advance and the Globus Online server will schedule and monitor the transfers, including re-trying failed transfers. It does this by performing third-party Grid-FTP transfers between the source and destination end-points.
In order to do this, Globus Online needs proxy credentials for the user on each site involved in the transfer; it normally obtains these from myproxy services using download passwords provided by the users when the transfers are set up.
Users with their own certificates are able to upload a proxy via any myproxy server and use Globus Online to access any Grid-FTP server where their certificate is valid. However, Globus Online can also be used by users without personal certificates.
To use Globus Online without a personal certificate, a Globus Online end-point has to be installed on any machine they wish to access.
Users can install “Globus Connect” on their laptop or personal workstation. This connects to the Globus Online server using the user's globus-online account credentials. Once connected it then acts as a Grid-FTP server, allowing globus-online to read/write files on the local machine.
System administrators can install “Globus Connect Multi User” (GCMU) to make globus-online available to all their users. This is actually a combination of a Grid-FTP server and a myproxy server tied to the local unix login credentials. When provided with a user's username and password the myproxy server will automatically generate a proxy certificate for the user. This certificate is only valid on the machine that issued it, but this is sufficient as the globus-online server can obtain separate proxies for each end of the data transfer. There is no need to explicitly manage a grid-mapfile or CAs as the GCMU handles this internally.
This does provide a much more user-friendly authentication experience for users and requires little management from system administrators but does require that the Globus Online servers are trusted to some extent. Proxy download credentials flow through the server when creating a proxy and if GCMU is used these are the same as the unix login credentials.
Also whatever mechanism is used to generate the proxy the server will have read/write access to the user's file-space during the lifetime of the proxy, which could be abused if the globus-online server is compromised. However this can be mitigated by using some of the advanced Grid-FTP configuration features to restrict which parts of the file-space the server is allowed to access and to lock down which users are allowed to use the GCMU.
UDT (http://udt.sourceforge.net) is an alternative protocol to TCP that is designed for high-speed data transfers. However, though logically a protocol that could have been layered directly on top of the IP network layer in the same way as TCP or UDP, the designers chose to implement it using UDP packets. This means that UDT can be implemented in a user-space library without needing to install a kernel module, but it makes it harder to recognise in firewall rules. Grid-FTP does provide a mode to use UDT rather than TCP sockets, but this would require UDP packets to be allowed through firewalls. As the normal TCP mechanism is capable of saturating most realistic network/file-system configurations, this is probably more of a research interest than a useful option.