One strategy for uploading large files is called parallel composite uploads. In such an upload, a file is divided into up to 32 chunks, the chunks are uploaded in parallel to temporary objects, the final object is recreated using the temporary objects, and the temporary objects are deleted.
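The four steps described above can be illustrated with a small simulation. This is only a sketch: the "bucket" is a plain Python dict rather than Cloud Storage, and the function names, the `tmp/` naming scheme, and the equal-size splitting are assumptions made for illustration, not the behavior of any particular tool.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_COMPONENTS = 32  # upper bound on chunks, as stated in the text


def split_into_chunks(data: bytes, num_chunks: int) -> list[bytes]:
    """Divide data into at most num_chunks contiguous, nearly equal pieces."""
    if not 1 <= num_chunks <= MAX_COMPONENTS:
        raise ValueError(f"num_chunks must be between 1 and {MAX_COMPONENTS}")
    if not data:
        return [b""]
    chunk_size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def parallel_composite_upload(bucket: dict, name: str, data: bytes,
                              num_chunks: int = 4) -> None:
    """Simulate a parallel composite upload against an in-memory bucket."""
    # Step 1: divide the file into chunks.
    chunks = split_into_chunks(data, num_chunks)
    temp_names = [f"tmp/{name}_{i}" for i in range(len(chunks))]

    # Step 2: upload every chunk to its own temporary object, in parallel.
    def upload(pair):
        temp_name, chunk = pair
        bucket[temp_name] = chunk

    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        list(pool.map(upload, zip(temp_names, chunks)))

    # Step 3: recreate the final object from the temporary objects, in order.
    bucket[name] = b"".join(bucket[t] for t in temp_names)

    # Step 4: delete the temporary objects.
    for temp_name in temp_names:
        del bucket[temp_name]
```

After `parallel_composite_upload(bucket, "video.bin", payload)` returns, the dict holds only the final object. In a real upload, each of these steps is a separate network request and any of them can fail, which is why cleanup of temporary objects needs attention.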
Parallel composite uploads can be significantly faster if network and disk speed are not limiting factors; however, the final object stored in your bucket is a composite object, which only has a crc32c hash and not an MD5 hash.
As a result, you must use google-crc32c or crcmod to perform integrity checks when downloading the object with Python applications. You should only perform parallel composite uploads if the following apply:
Any Python user, including gsutil users, who needs to download your objects has either google-crc32c or crcmod installed. For example, if you use Python to upload video assets that are only served by a Java application, parallel composite uploads are a good choice because there are efficient CRC32C implementations available in Java.
You do not need the uploaded objects to have an MD5 hash.
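To make the integrity check concrete, the following sketch computes a CRC32C checksum in pure Python and compares it with a stored value. It assumes the stored checksum is the base64 encoding of the 32-bit value in big-endian byte order; the function names are hypothetical. A table-driven pure-Python loop like this is slow on large objects, which is the reason an optimized library such as google-crc32c or crcmod is called for in practice.

```python
import base64


def _build_table() -> list[int]:
    """Precompute the byte-wise lookup table for CRC32C (Castagnoli)."""
    poly = 0x82F63B78  # reflected form of the Castagnoli polynomial
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data; pass a previous result to continue a stream."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(data: bytes) -> str:
    """Encode the checksum as base64 of its big-endian bytes (assumed format)."""
    return base64.b64encode(crc32c(data).to_bytes(4, "big")).decode("ascii")


def matches_stored_crc32c(data: bytes, stored: str) -> bool:
    """Compare downloaded bytes against a stored base64 CRC32C value."""
    return crc32c_base64(data) == stored
```

Because `crc32c` accepts a running value, a downloader can feed it one chunk at a time instead of holding the whole object in memory.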
How tools and APIs use parallel composite uploads
Depending on how you interact with Cloud Storage, parallel composite uploads might be managed automatically on your behalf. This section describes parallel composite upload behavior for different tools and explains how you can modify that behavior.
Console
The Google Cloud console does not perform parallel composite uploads.
Command line
You can configure how and when gcloud storage cp performs parallel composite uploads by modifying the following properties:
storage/parallel_composite_upload_enabled: Property for enabling parallel composite uploads. If False, disable parallel composite uploads. If True or None, perform parallel composite uploads for objects that meet the criteria defined in the other properties. The default setting is None.
storage/parallel_composite_upload_compatibility_check: Property for toggling safety checks. If True, gcloud storage only performs parallel composite uploads when all of the following conditions are met:
Note that in order to check these conditions, the gcloud CLI retrieves the metadata for the destination bucket as part of the upload command.
If False, gcloud storage does not perform any checks. The default setting is True.
storage/parallel_composite_upload_threshold: The minimum total file size for performing a parallel composite upload. The default setting is 150 MiB.
storage/parallel_composite_upload_component_size: The maximum size for each temporary object. The property is ignored if the total file size is so large that it would require more than 32 chunks at this size.
storage/parallel_composite_upload_component_prefix: The prefix used when naming temporary objects. This property can be set either as an absolute path or as a path relative to the final object. See the property description for more information. The default prefix is the absolute path /gcloud/tmp/parallel_composite_uploads/see_gcloud_storage_cp_help_for_details.
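The way the enabled, threshold, and component size properties interact can be sketched as a small planning function. This is an illustration, not the gcloud CLI's actual logic: the function name is hypothetical, the compatibility check is left out, treating a file exactly at the threshold as eligible is an assumption, and so is the fallback to 32 equal components when the configured component size is ignored.

```python
from typing import Optional

MAX_COMPONENTS = 32
MIB = 1024 * 1024


def plan_components(file_size: int,
                    component_size: int,
                    enabled: Optional[bool] = None,
                    threshold: int = 150 * MIB) -> Optional[tuple[int, int]]:
    """Return (component_count, component_size), or None for a regular upload."""
    if component_size <= 0:
        raise ValueError("component_size must be positive")
    # False disables parallel composite uploads; True and None both allow them.
    if enabled is False:
        return None
    # Files smaller than the threshold are uploaded normally.
    if file_size < threshold:
        return None
    count = -(-file_size // component_size)  # ceiling division
    if count > MAX_COMPONENTS:
        # The configured size would need more than 32 chunks, so it is ignored.
        # Assumption: fall back to 32 components of equal size.
        count = MAX_COMPONENTS
        component_size = -(-file_size // MAX_COMPONENTS)
    return count, component_size
```

For example, under these assumptions a 200 MiB file with a 50 MiB component size is planned as 4 components, while a 3200 MiB file with the same setting is planned as 32 components of 100 MiB each.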
You can modify these properties by creating a named configuration and applying the configuration either on a per-command basis by using the --configuration project-wide flag or for all gcloud CLI commands by using the gcloud config set command.
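The two approaches can be sketched from Python by building the command lines as argument lists. The helper names, the configuration name composite-uploads, the file big.bin, and the bucket my-bucket are all hypothetical; the sketch only constructs the commands and does not run them.

```python
def config_set_command(prop: str, value: str) -> list[str]:
    """Build a `gcloud config set` command for one property."""
    return ["gcloud", "config", "set", prop, value]


def with_named_configuration(command: list[str], configuration: str) -> list[str]:
    """Append the --configuration flag so one command uses a named configuration."""
    return command + [f"--configuration={configuration}"]


# Set the property in the active configuration.
enable_everywhere = config_set_command(
    "storage/parallel_composite_upload_enabled", "True")

# Apply a named configuration (hypothetical name) to a single copy command.
one_off_copy = with_named_configuration(
    ["gcloud", "storage", "cp", "big.bin", "gs://my-bucket/"],
    "composite-uploads")
```

Passing either list to `subprocess.run(..., check=True)` would execute it, assuming the gcloud CLI is installed and authenticated.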
No additional local disk space is required when using the gcloud CLI to perform parallel composite uploads. If a parallel composite upload fails prior to composition, run the gcloud CLI command again to take advantage of resumable uploads for the temporary objects that failed. Any temporary objects that uploaded successfully before the failure do not get re-uploaded when you resume the upload.
Temporary objects are named in the following fashion:
TEMPORARY_PREFIX/RANDOM_VALUE_HEX_DIGEST_COMPONENT_ID
Where:
TEMPORARY_PREFIX is controlled by the storage/parallel_composite_upload_component_prefix property.
RANDOM_VALUE is a random numerical value.
HEX_DIGEST is a hash derived from the name of the source resource.
COMPONENT_ID is the sequential number of the component.
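The naming pattern can be sketched as a pair of helpers that build and parse such names, which can be useful when searching a bucket for leftover temporary objects. The helper names are hypothetical, and the source does not say which hash function or random range is used, so SHA-256 and the range below are placeholders chosen only for illustration.

```python
import hashlib
import random
from typing import Optional


def temporary_object_name(prefix: str, source_name: str, component_id: int,
                          random_value: Optional[int] = None) -> str:
    """Build TEMPORARY_PREFIX/RANDOM_VALUE_HEX_DIGEST_COMPONENT_ID."""
    if random_value is None:
        random_value = random.randint(1, 10**10 - 1)  # placeholder range
    # Placeholder hash: the real digest algorithm is not specified in the text.
    hex_digest = hashlib.sha256(source_name.encode("utf-8")).hexdigest()
    return f"{prefix.rstrip('/')}/{random_value}_{hex_digest}_{component_id}"


def parse_temporary_object_name(name: str) -> tuple[str, int, str, int]:
    """Split a temporary object name back into its four parts."""
    prefix, _, leaf = name.rpartition("/")
    random_value, hex_digest, component_id = leaf.rsplit("_", 2)
    return prefix, int(random_value), hex_digest, int(component_id)
```

Because the digest depends only on the source name, all components of one source file share the same HEX_DIGEST and differ only in COMPONENT_ID within a single upload.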
Generally, temporary objects are deleted at the end of a parallel composite upload. To avoid leaving temporary objects behind, check the exit status of the gcloud CLI command and manually delete any temporary objects that were uploaded as part of an aborted upload.
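Checking the exit status from a script might look like the following sketch. The function names are hypothetical, and the `run` parameter exists only so the logic can be exercised without the gcloud CLI; by default it is `subprocess.run`. Identifying which temporary objects belong to the aborted upload is left to the caller.

```python
import subprocess
from typing import Callable


def upload_succeeded(source: str, destination: str,
                     run: Callable = subprocess.run) -> bool:
    """Run `gcloud storage cp` and report whether it exited with status 0."""
    result = run(["gcloud", "storage", "cp", source, destination])
    return result.returncode == 0


def cleanup_command(temp_object_urls: list[str]) -> list[str]:
    """Build a `gcloud storage rm` command for leftover temporary objects."""
    if not temp_object_urls:
        raise ValueError("no temporary objects to delete")
    return ["gcloud", "storage", "rm", *temp_object_urls]
```

A caller would typically retry the upload first, since a rerun can resume the failed temporary objects, and build a cleanup command only when the upload is being abandoned.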
REST APIs
Both the JSON API and XML API support uploading object chunks in parallel and recombining them into a single object using the compose operation.
Keep the following in mind when designing code for parallel composite
uploads:
When using the compose operation, the source objects are unaffected by the composition process. This means that if they are meant to be temporary, you must explicitly delete them once you've successfully completed the composition, or else the source objects remain in your bucket and are billed accordingly.
To protect against changes to source objects between the upload and compose requests, you should provide an expected generation number for each source.
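Both considerations can be sketched for the JSON API by constructing the compose request and the follow-up delete URLs with the standard library. The helper names, bucket, and object names are hypothetical, the sketch builds requests without sending them, and authentication is omitted. The field names follow the JSON API compose request body as understood here; verify them against the API reference before relying on them.

```python
import json
from urllib.parse import quote

API_ROOT = "https://storage.googleapis.com/storage/v1"
MAX_SOURCES = 32  # a single compose request takes at most 32 source objects


def build_compose_request(bucket: str, destination: str,
                          sources: list[tuple[str, int]],
                          content_type: str = "application/octet-stream"):
    """Return (url, body) for a compose POST that pins each source generation."""
    if not 1 <= len(sources) <= MAX_SOURCES:
        raise ValueError(f"compose needs between 1 and {MAX_SOURCES} sources")
    url = (f"{API_ROOT}/b/{quote(bucket, safe='')}"
           f"/o/{quote(destination, safe='')}/compose")
    body = {
        "sourceObjects": [
            {
                "name": name,
                "generation": str(generation),
                # The request fails if the source changed since it was uploaded.
                "objectPreconditions": {"ifGenerationMatch": str(generation)},
            }
            for name, generation in sources
        ],
        "destination": {"contentType": content_type},
    }
    return url, body


def delete_url(bucket: str, name: str) -> str:
    """URL for the DELETE request that removes one source object afterwards."""
    return f"{API_ROOT}/b/{quote(bucket, safe='')}/o/{quote(name, safe='')}"
```

The body is sent as `json.dumps(body)` in a POST request. Only after that request succeeds should a DELETE be issued for each source, so that a failed composition can be retried with the chunks still in place.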