This document describes how to use manifest files to query data stored in open table formats such as Apache Hudi and Delta Lake.

Some open table formats such as Hudi and Delta Lake export their current state as one or more manifest files. A manifest file contains a list of the data files that make up a table. With manifest support in BigQuery, you can query and load data stored in open table formats.
Before you begin
Required roles
To query BigLake tables based on Hudi and Delta Lake data, ensure that you have the following roles:
- BigQuery Connection User (roles/bigquery.connectionUser)
- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery User (roles/bigquery.user)
You can also query Hudi external tables. However, we recommend that you upgrade the external table to BigLake. To query Hudi external tables, ensure that you have the following roles:
- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery User (roles/bigquery.user)
- Storage Object Viewer (roles/storage.objectViewer)
Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
You might also be able to get these permissions with custom roles or other predefined roles.
Query Hudi workloads
To query Hudi data, follow these steps:
- Create an external table based on Hudi data.
- Upgrade the external table to BigLake.
Create Hudi external tables
When you sync tables by using the sync tool for Hudi and BigQuery, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.

To create a Hudi external table, submit a job to an existing Dataproc cluster, enabling the use-bq-manifest-file flag when you build the Hudi-BigQuery connector:
spark-submit \
   --master yarn \
   --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
   --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
   JAR \
   --project-id PROJECT_ID \
   --dataset-name DATASET \
   --dataset-location LOCATION \
   --table TABLE \
   --source-uri URI \
   --source-uri-prefix URI_PREFIX \
   --base-path BASE_PATH \
   --partitioned-by PARTITION_BY \
   --use-bq-manifest-file
Replace the following:
- JAR: if you are using the Hudi-BigQuery connector, specify hudi-gcp-bundle-0.14.0.jar. If you are using the Hudi component in Dataproc 2.1, specify /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar.
- PROJECT_ID: the project ID in which you want to create the Hudi BigLake table.
- DATASET: the dataset in which you want to create the Hudi BigLake table.
- LOCATION: the location in which you want to create the Hudi BigLake table.
- TABLE: the name of the table that you want to create. If you are transitioning from an earlier version of the Hudi-BigQuery connector (0.13.0 and earlier) that created views on the manifest files, use the same table name so that you can keep your existing downstream pipeline code.
- URI: the Cloud Storage URI that you created to store the Hudi manifest file. This URI points to the first-level partition; make sure to include the partition key. For example, gs://mybucket/hudi/mydataset/EventDate=*.
- URI_PREFIX: the prefix for the Cloud Storage URI path; usually, this is the path to the Hudi tables.
- BASE_PATH: the base path for the Hudi tables. For example, gs://mybucket/hudi/mydataset/.
- PARTITION_BY: the partition value. For example, EventDate.
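For example, a complete invocation might look like the following; the project, dataset, and table names are illustrative, and the paths reuse the samples above:

spark-submit \
   --master yarn \
   --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
   --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
   /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar \
   --project-id my-project \
   --dataset-name mydataset \
   --dataset-location us \
   --table mytable \
   --source-uri gs://mybucket/hudi/mydataset/EventDate=* \
   --source-uri-prefix gs://mybucket/hudi/mydataset/ \
   --base-path gs://mybucket/hudi/mydataset/ \
   --partitioned-by EventDate \
   --use-bq-manifest-file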
For more information about the connector's configuration, see Hudi-BigQuery connector.
To set appropriate fine-grained controls or to accelerate performance by enabling metadata caching, see Upgrade BigLake tables.
Query Delta workloads
Delta tables are now natively supported. We recommend creating Delta BigLake tables for Delta workloads. Delta Lake BigLake tables support more advanced Delta Lake features, including tables with column remapping and deletion vectors. Additionally, Delta BigLake tables directly read the latest snapshot, so updates are instantly available.
To query Delta workloads, follow these steps:
- Generate a manifest file.
- Create a BigLake table based on the manifest file.
- Set appropriate fine-grained controls or accelerate performance by enabling metadata caching. To do this, see Upgrade BigLake tables.
Generate a manifest file
BigQuery supports manifest files in the SymLinkTextInputFormat format, which is a newline-delimited list of URIs. For more information about generating a manifest file, see Set up Presto to Delta Lake integration and query Delta tables.
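For example, a manifest for a hypothetical Delta table might contain entries like the following, one data file URI per line (the bucket and file names are illustrative):

gs://mybucket/delta/mytable/part-00000-f1e2d3c4.snappy.parquet
gs://mybucket/delta/mytable/part-00001-a5b6c7d8.snappy.parquet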
To generate a manifest file, submit a job to an existing Dataproc cluster:
SQL
Using Spark, run the following command on a Delta table at location path-to-delta-table:
GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
Scala
Using Spark, run the following command on a Delta table at location path-to-delta-table:
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
Java
Using Spark, run the following command on a Delta table at location path-to-delta-table:
import io.delta.tables.DeltaTable;

DeltaTable deltaTable = DeltaTable.forPath("<path-to-delta-table>");
deltaTable.generate("symlink_format_manifest");
Python
Using Spark, run the following command on a Delta table at location path-to-delta-table:
from delta.tables import DeltaTable

# "spark" is the active SparkSession, which the Python API requires.
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
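Delta Lake writes the generated manifest under the table path, in a _symlink_format_manifest directory; for partitioned tables, a manifest is generated per partition directory. For an unpartitioned table, the result typically looks like the following (the path is illustrative):

gs://mybucket/delta/mytable/_symlink_format_manifest/manifest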
Create Delta BigLake tables
To create a Delta BigLake table, use the CREATE EXTERNAL TABLE statement with the file_set_spec_type field set to NEW_LINE_DELIMITED_MANIFEST:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE PROJECT_ID.DATASET_NAME.TABLE_NAME
WITH PARTITION COLUMNS (
   PARTITION_COLUMN PARTITION_COLUMN_TYPE,
)
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_NAME`
OPTIONS (
   format = "DATA_FORMAT",
   uris = ["URI"],
   file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
   hive_partition_uri_prefix = "PATH_TO_DELTA_TABLE",
   max_staleness = STALENESS_INTERVAL,
   metadata_cache_mode = 'CACHE_MODE'
);
Replace the following:
- PROJECT_ID: the ID of the project in which you want to create the table.
- DATASET_NAME: the name of the dataset you created.
- TABLE_NAME: the name you want to give to this table.
- REGION: the location where the connection is located (for example, us-east1).
- CONNECTION_NAME: the name of the connection you created.
- DATA_FORMAT: any of the supported formats (such as PARQUET).
- URI: the path to the manifest file (for example, gs://mybucket/path).
- PATH_TO_DELTA_TABLE: a common prefix for all source URIs before the partition key encoding begins.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Delta Lake instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache, as shown in the sketch after the following example.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
Example:
CREATE EXTERNAL TABLE mydataset.mytable
WITH CONNECTION `us-east1.myconnection`
OPTIONS (
    format = "PARQUET",
    uris = ["gs://mybucket/path/partitionpath=*"],
    file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
    hive_partition_uri_prefix = "gs://mybucket/path/",
    max_staleness = INTERVAL 1 DAY,
    metadata_cache_mode = 'AUTOMATIC'
);
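If you set metadata_cache_mode to 'MANUAL' instead, you refresh the cache yourself by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure. A minimal sketch, assuming a hypothetical fully qualified table name:

CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.mytable');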
Upgrade BigLake tables
You can also accelerate the performance of your workloads by taking advantage of metadata caching and materialized views. If you want to use metadata caching, you can specify settings for it at the same time. To get table details such as source format and source URI, see Get table information.
To update an external table to a BigLake table or update an existing BigLake table, select one of the following options:
SQL
Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH CONNECTION `REGION.CONNECTION_ID`
OPTIONS (
    format = "TABLE_FORMAT",
    uris = ['BUCKET_PATH'],
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = 'CACHE_MODE'
);
Replace the following:
- PROJECT_ID: the name of the project that contains the table.
- DATASET: the name of the dataset that contains the table.
- EXTERNAL_TABLE_NAME: the name of the table.
- REGION: the region that contains the connection.
- CONNECTION_ID: the name of the connection to use.
- TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].
  You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.
  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:
  - ['gs://bucket/path1/myfile.csv']
  - ['gs://bucket/path1/*.csv']
  - ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']
  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
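For example, the following statement upgrades an external table to a BigLake table with metadata caching enabled; the project, dataset, table, connection, and bucket names are illustrative:

CREATE OR REPLACE EXTERNAL TABLE `myproject.mydataset.mytable`
WITH CONNECTION `us-east1.myconnection`
OPTIONS (
    format = "PARQUET",
    uris = ['gs://mybucket/path/*'],
    max_staleness = INTERVAL 4 HOUR,
    metadata_cache_mode = 'AUTOMATIC'
);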
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq mkdef and bq update commands to update a table:
Generate an external table definition that describes the aspects of the table to change:
bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
--source_format=TABLE_FORMAT \
--metadata_cache_mode=CACHE_MODE \
"BUCKET_PATH" > /tmp/DEFINITION_FILE
Replace the following:
- PROJECT_ID: the name of the project that contains the connection.
- REGION: the region that contains the connection.
- CONNECTION_ID: the name of the connection to use.
- TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name.
  You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.
  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:
  - gs://bucket/path1/myfile.csv
  - gs://bucket/path1/*.csv
  - gs://bucket/path1/*,gs://bucket/path2/file00*
  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- DEFINITION_FILE: the name of the table definition file that you are creating.
Update the table using the new external table definition:
bq update --max_staleness=STALENESS_INTERVAL \
--external_table_definition=/tmp/DEFINITION_FILE \
PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME
Replace the following:
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- DEFINITION_FILE: the name of the table definition file that you created or updated.
- PROJECT_ID: the name of the project that contains the table.
- DATASET: the name of the dataset that contains the table.
- EXTERNAL_TABLE_NAME: the name of the table.
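For example, the following pair of commands enables automatic metadata caching with a 4-hour staleness interval; the project, connection, bucket, dataset, and table names are illustrative:

bq mkdef --connection_id=my-project.us.my-connection \
--source_format=PARQUET \
--metadata_cache_mode=AUTOMATIC \
"gs://mybucket/path/manifest" > /tmp/mytable_def

bq update --max_staleness="0-0 0 4:0:0" \
--external_table_definition=/tmp/mytable_def \
my-project:mydataset.mytable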
Query BigLake and external tables
After creating a BigLake table, you can query it using GoogleSQL syntax, the same as if it were a standard BigQuery table. For example:
SELECT field1, field2 FROM mydataset.my_cloud_storage_table;
Limitations
- BigQuery only supports querying Delta Lake reader v1 tables.
- The Hudi and BigQuery integration only works for hive-style partitioned copy-on-write tables.
- Using manifests to query data stored in third-party storage is not supported.
What's next