You can use higher network bandwidths, of 100 Gbps or more, to improve the
performance of distributed workloads running on your GPU VMs.
Higher network bandwidths are available for VMs with attached GPUs on
Compute Engine as follows:
- For N1 general-purpose VMs that have T4 and V100 GPUs attached, you can get a
maximum network bandwidth of up to 100 Gbps, based on the combination of
GPU and vCPU count.
- For A2 and G2 accelerator-optimized VMs, you can get a
maximum network bandwidth of up to 100 Gbps, based on the machine type.
- For A3 accelerator-optimized VMs, you can get a maximum network
bandwidth of up to 1,800 Gbps.
To review the configurations or machine types that support these higher
network bandwidth rates, see Network bandwidths and GPUs.
For general network bandwidth information on Compute Engine, see
Network bandwidth.
Overview
To use the higher network bandwidths available to each GPU VM, complete the
following recommended steps:
- Create your GPU VM by using an OS image that supports Google Virtual NIC
  (gVNIC). For A3 VMs, it is recommended that you use a Container-Optimized
  OS image.
- Optional: Install Fast Socket. Fast Socket improves NCCL performance on
  100 Gbps or higher networks by reducing the contention between multiple TCP
  connections. Some Deep Learning VM Images (DLVM) have Fast Socket
  preinstalled.
Use Deep Learning VM Images
You can create your VMs using any GPU-supported image from the
Deep Learning VM Images project. All GPU-supported DLVM images
have the GPU driver, ML software, and gVNIC preinstalled. For a list of DLVM
images, see Choosing an image.
If you want to use Fast Socket, choose a DLVM image such as
tf-latest-gpu-debian-10 or tf-latest-gpu-ubuntu-1804.
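If you want to see which GPU DLVM images are currently available before you
choose one, you can query the public image project with the gcloud CLI. This is
a minimal sketch; the exact family names that are returned change over time:
# List GPU-related image families in the public Deep Learning VM project.
gcloud compute images list \
    --project=deeplearning-platform-release \
    --filter="family~gpu" \
    --format="value(family)" | sort -u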
Create VMs that use higher network bandwidths
For higher network bandwidths, it is recommended that you enable
Google Virtual NIC (gVNIC). For more information, see Using Google Virtual NIC.
For A3 VMs, gVNIC version 1.4.0rc3 or later is required. This driver
version is available on Container-Optimized OS. For all other
operating systems, you need to install gVNIC version 1.4.0rc3 or later.
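To confirm which gVNIC driver version a VM is running, you can inspect the
network interface with ethtool. This is a minimal sketch; the interface name
eth0 is an assumption and might be ens4 or similar on your image:
# Show the NIC driver and version; gVNIC reports the driver name "gve".
ethtool -i eth0 | grep -E 'driver|version'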
To create a VM that has attached GPUs and a higher network bandwidth, complete
the following steps:
- Review the maximum network bandwidth available for each machine type that
  has attached GPUs.
- Create your GPU VM. The following examples show how to create A3 VMs, A2 VMs,
  and N1 VMs with attached V100 GPUs.
  In these examples, VMs are created by using the Google Cloud CLI. However,
  you can also use either the Google Cloud console or the Compute Engine API to
  create these VMs. For more information about creating GPU VMs, see
  Create a VM with attached GPUs.
A3 (H100)
For detailed instructions on how to set up A3 VMs to maximize network
performance, review the following:
A2 (A100)
For example, to create a VM that has a maximum bandwidth of 100 Gbps, has
eight A100 GPUs attached, and uses the tf-latest-gpu DLVM image, run the
following command:
gcloud compute instances create VM_NAME \
--project=PROJECT_ID \
--zone=ZONE \
--machine-type=a2-highgpu-8g \
--maintenance-policy=TERMINATE --restart-on-failure \
--image-family=tf-latest-gpu \
--image-project=deeplearning-platform-release \
--boot-disk-size=200GB \
--network-interface=nic-type=GVNIC \
--metadata="install-nvidia-driver=True,proxy-mode=project_editors" \
--scopes=https://www.googleapis.com/auth/cloud-platform
Replace the following:
- VM_NAME: the name of your VM
- PROJECT_ID: your project ID
- ZONE: the zone for the VM. This zone must support the specified GPU type.
  For more information about zones, see GPU regions and zones availability.
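After the VM is created, you can optionally confirm that it was provisioned
with gVNIC by describing the instance. This is a quick sketch; the --format
expression is just one way to print only the NIC type:
# Print the NIC type of each network interface; the expected value is GVNIC.
gcloud compute instances describe VM_NAME \
    --project=PROJECT_ID \
    --zone=ZONE \
    --format="value(networkInterfaces[].nicType)"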
N1 (V100)
For example, to create a VM that has a maximum bandwidth of 100 Gbps,
has eight V100 GPUs attached, and uses the tf-latest-gpu DLVM image, run the
following command:
gcloud compute instances create VM_NAME \
--project PROJECT_ID \
--custom-cpu 96 \
--custom-memory 624 \
--image-project=deeplearning-platform-release \
--image-family=tf-latest-gpu \
--accelerator type=nvidia-tesla-v100,count=8 \
--maintenance-policy TERMINATE \
--metadata="install-nvidia-driver=True" \
--boot-disk-size 200GB \
--network-interface=nic-type=GVNIC \
--zone=ZONE
- If you are not using GPU-supported Deep Learning VM Images or
  Container-Optimized OS, install GPU drivers. For more information, see
  Installing GPU drivers. A quick check of the installed driver is sketched
  after this list.
- Optional: On the VM, install Fast Socket.
- After you set up the VM, you can verify the network bandwidth.
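As a quick sanity check after the driver installation, you can confirm that the
NVIDIA driver sees all of the attached GPUs. This is a minimal sketch, assuming
the GPU driver is already installed on the VM:
# List each attached GPU and the installed NVIDIA driver version.
nvidia-smi --query-gpu=name,driver_version --format=csv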
Install Fast Socket
NVIDIA Collective Communications Library (NCCL) is used by deep learning
frameworks such as TensorFlow, PyTorch, and Horovod for multi-GPU
and multi-node training.
Fast Socket is a Google proprietary network transport for NCCL. On
Compute Engine, Fast Socket improves NCCL performance on 100 Gbps
networks by reducing the contention between multiple TCP connections.
For more information about working with NCCL, see the NCCL user guide.
Current evaluation shows that Fast Socket improves all-reduce throughput
by 30%–60%, depending on the message size.
To set up a Fast Socket environment, you can use either a
Deep Learning VM Image that has Fast Socket preinstalled, or you can
manually install Fast Socket on a Linux VM. To check whether Fast Socket is
preinstalled, see Verifying that Fast Socket is enabled.
Before you install Fast Socket on a Linux VM, you need to install NCCL.
For detailed instructions, see the NVIDIA NCCL documentation.
CentOS/RHEL
To download and install Fast Socket on a CentOS or RHEL VM, complete the following
steps:
Add the package repository and import public keys.
sudo tee /etc/yum.repos.d/google-fast-socket.repo << EOM
[google-fast-socket]
name=Fast Socket Transport for NCCL
baseurl=https://packages.cloud.google.com/yum/repos/google-fast-socket
enabled=1
gpgcheck=0
repo_gpgcheck=0
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
       https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOM
Install Fast Socket.
sudo yum install google-fast-socket
Verify that Fast Socket is enabled.
SLES
To download and install Fast Socket on an SLES VM, complete the following
steps:
Add the package repository.
sudo zypper addrepo https://packages.cloud.google.com/yum/repos/google-fast-socket google-fast-socket
Add repository keys.
sudo rpm --import https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
Install Fast Socket.
sudo zypper install google-fast-socket
Verify that Fast Socket is enabled.
Debian/Ubuntu
To download and install Fast Socket on a Debian or Ubuntu VM, complete the following
steps:
Add the package repository.
echo "deb https://packages.cloud.google.com/apt google-fast-socket main" | sudo tee /etc/apt/sources.list.d/google-fast-socket.list
Add repository keys.
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
Install Fast Socket.
sudo apt update && sudo apt install google-fast-socket
Verify that Fast Socket is enabled.
Verifying that Fast Socket is enabled
On your VM, complete the following steps:
Locate the NCCL home directory.
sudo ldconfig -p | grep nccl
For example, on a DLVM image, you get the following output:
libnccl.so.2 (libc6,x86-64) => /usr/local/nccl2/lib/libnccl.so.2
libnccl.so (libc6,x86-64) => /usr/local/nccl2/lib/libnccl.so
libnccl-net.so (libc6,x86-64) => /usr/local/nccl2/lib/libnccl-net.so
This shows that the NCCL home directory is /usr/local/nccl2.
Check that NCCL loads the Fast Socket plugin. To check, you need to
download the NCCL test package. To download the test package, run the following
command:
git clone https://github.com/NVIDIA/nccl-tests.git && \
cd nccl-tests && make NCCL_HOME=NCCL_HOME_DIRECTORY
Replace NCCL_HOME_DIRECTORY with the NCCL home directory.
From the nccl-tests directory, run the all_reduce_perf process:
NCCL_DEBUG=INFO build/all_reduce_perf
If Fast Socket is enabled, the FastSocket plugin initialized message displays
in the output log.
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 63324 on fast-socket-gpu device 0 [0x00] Tesla V100-SXM2-16GB
.....
fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket : queue skip: 0
fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket : Using [0]ens12:10.240.0.24
fast-socket-gpu:63324:63324 [0] NCCL INFO NET/FastSocket plugin initialized
......
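If you prefer to script this check instead of scanning the log manually, you
can grep the debug output for the plugin message. This is a sketch; the -b, -e,
and -g flags just keep the test short and limited to a single GPU:
# Run a short single-GPU all_reduce test and confirm the Fast Socket plugin loads.
NCCL_DEBUG=INFO build/all_reduce_perf -b 32M -e 32M -g 1 2>&1 \
  | grep "FastSocket plugin initialized"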
Check network bandwidth
When working with high-bandwidth GPU VMs, you can use a network traffic tool,
such as iperf2, to measure the network bandwidth.
To check bandwidth speeds, you need at least two VMs that have attached
GPUs and can both support the bandwidth speed that you are testing.
Use iPerf to perform the benchmark on Debian-based systems.
Create two VMs that can support the required bandwidth speeds.
Once both VMs are running, use SSH to connect to one of the VMs.
gcloud compute ssh VM_NAME \
    --project=PROJECT_ID
Replace the following:
- VM_NAME: the name of the first VM
- PROJECT_ID: your project ID
On the first VM, complete the following steps:
Install iperf.
sudo apt-get update && sudo apt-get install iperf
Get the internal IP address for this VM. Keep track of it by writing it down.
ip a
Start up the iPerf server.
iperf -s
This starts up a server listening for connections in order to perform the
benchmark. Leave this running for the duration of the test.
From a new client terminal, connect to the second VM using SSH.
gcloud compute ssh VM_NAME \
    --project=PROJECT_ID
Replace the following:
- VM_NAME: the name of the second VM
- PROJECT_ID: your project ID
On the second VM, complete the following steps:
Install iPerf.
sudo apt-get update && sudo apt-get install iperf
Run the iperf test and specify the first VM's IP address as the target.
iperf -t 30 -c internal_ip_of_instance_1 -P 16
This executes a 30-second test and produces a result that resembles
the following output. If iPerf is not able to reach the other VM, you
might need to adjust the network or firewall settings on the VMs, or in the
Google Cloud console.
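If the connection fails, a common cause is that the iperf port is not open
between the two VMs. A firewall rule similar to the following can allow that
traffic; this is a sketch, and the rule name allow-iperf, the default network,
and the internal source range are assumptions that you should adapt to your VPC:
# Allow iperf2's default TCP port (5001) between VMs on the network.
gcloud compute firewall-rules create allow-iperf \
    --project=PROJECT_ID \
    --network=default \
    --allow=tcp:5001 \
    --source-ranges=10.128.0.0/9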
When you use the maximum available bandwidth of 100 Gbps or 1,000 Gbps (A3),
keep the following considerations in mind:
- Due to header overheads for protocols such as Ethernet, IP, and TCP on the
  virtualization stack, the throughput, as measured by netperf, saturates at
  around 90 Gbps or 800 Gbps (A3). This usable throughput is generally known
  as goodput.
- TCP is able to achieve the 100 Gbps or 1,000 Gbps network speed. Other
  protocols, such as UDP, are slower.
- Due to factors such as protocol overhead and network congestion, end-to-end
  performance of data streams might be slightly lower.
- You need to use multiple TCP streams to achieve maximum bandwidth between VM
  instances. Google recommends 4 to 16 streams; at 16 flows, you'll frequently
  maximize the throughput. Depending on your application and software stack,
  you might need to adjust settings for your application or your code to set
  up multiple streams. A sketch of one way to do this for NCCL follows this
  list.
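For NCCL-based workloads, one way to influence the number of TCP flows is
through NCCL's socket environment variables. This is a minimal sketch; the
values shown (4 x 4 = 16 flows) are illustrative rather than a tuned
recommendation:
# Ask NCCL to open more TCP connections for its socket transport.
export NCCL_SOCKET_NTHREADS=4     # CPU helper threads per network connection
export NCCL_NSOCKS_PERTHREAD=4    # sockets opened by each helper thread
# 4 threads x 4 sockets = 16 TCP flows per connection.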
What's next?