Create Cloud Storage BigLake tables
This document describes how to create a Cloud Storage BigLake table. A BigLake table lets you use access delegation to query structured data in Cloud Storage. Access delegation decouples access to the BigLake table from access to the underlying data store.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the BigQuery Connection API. If you want to read BigLake tables from open source engines such as Apache Spark, then you also need to enable the BigQuery Storage Read API.
- In the Google Cloud console, activate Cloud Shell.
- Ensure that you have a BigQuery dataset.
- Ensure that your version of the Google Cloud SDK is 366.0.0 or later:

  gcloud version

  If needed, update the Google Cloud SDK.
- Optional: For Terraform, terraform-provider-google version 4.25.0 or later is required. terraform-provider-google releases are listed on GitHub. You can download the latest version of Terraform from HashiCorp Terraform downloads.
- Create a Cloud resource connection based on your external data source, and grant that connection access to Cloud Storage. If you don't have the appropriate permissions to create a connection, ask your BigQuery administrator to create a connection and share it with you.
Required roles
To create a BigLake table, you need the following BigQuery Identity and Access Management (IAM) permissions:

- bigquery.tables.create
- bigquery.connections.delegate

The BigQuery Admin (roles/bigquery.admin) predefined Identity and Access Management role includes these permissions. If you are not a principal in this role, ask your administrator to grant you access or to create the BigLake table for you.

For more information on Identity and Access Management roles and permissions in BigQuery, see Predefined roles and permissions.
Location consideration
When you use Cloud Storage to store data files, you can improve performance by using Cloud Storage single-region or dual-region buckets instead of multi-region buckets.
Create BigLake tables on unpartitioned data
If you're familiar with creating tables in BigQuery, the process of creating a BigLake table is similar. Your table can use any file format that BigLake supports. For more information, see Limitations.

Before you create a BigLake table, you need to have a dataset and a Cloud resource connection that can access Cloud Storage.
To create a BigLake table, select one of the following options:
Console
Go to the BigQuery page.

In the Explorer pane, expand your project and select a dataset.

Expand the Actions option and click Create table.

In the Source section, specify the following details:

- For Create table from, select Google Cloud Storage.
- For Select file from GCS bucket or use a URI pattern, browse to select a bucket and file to use, or type the path in the format gs://bucket_name/[folder_name/]file_name. You can't specify multiple URIs in the Google Cloud console, but you can select multiple files by specifying one asterisk (*) wildcard character. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
- For File format, select the format that matches your file.

In the Destination section, specify the following details:

- For Project, choose the project in which to create the table.
- For Dataset, choose the dataset in which to create the table.
- For Table, enter the name of the table you are creating.
- For Table type, select External table.
- Select Create a BigLake table using a Cloud Resource connection.
- For Connection ID, select the connection that you created earlier.

In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.

- To enable schema auto-detection, select the Auto-detect option.
- To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array (see the example after these steps).

To ignore rows with extra column values that do not match the schema, expand the Advanced options section and select Unknown values.

Click Create table.

After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.
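The schema entered in the Edit as text field is a JSON array of column definitions. As an illustrative sketch, a table with the Region, Quarter, and Total_sales columns used in the bq examples later in this document could be described as follows; substitute your own column names and types:

[
  {"name": "Region", "type": "STRING"},
  {"name": "Quarter", "type": "STRING"},
  {"name": "Total_sales", "type": "INTEGER"}
]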
SQL
Use the CREATE EXTERNAL TABLE DDL statement. You can specify the schema explicitly, or use schema auto-detection to infer the schema from the external data.
In the Google Cloud console, go to the BigQuery page.

In the query editor, enter the following statement:
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (
    format = "TABLE_FORMAT",
    uris = ['BUCKET_PATH'[,...]],
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = 'CACHE_MODE'
  );
Replace the following:
- PROJECT_ID: the name of your project in which you want to create the table, for example, myproject.
- DATASET: the name of the BigQuery dataset that you want to create the table in, for example, mydataset.
- EXTERNAL_TABLE_NAME: the name of the table that you want to create, for example, mytable.
- REGION: the region that contains the connection, for example, us.
- CONNECTION_ID: the connection ID, for example, myconnection. When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.
- TABLE_FORMAT: the format of the table that you want to create, for example, PARQUET. For more information about supported formats, see Limitations.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name']. You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.

  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:

  ['gs://bucket/path1/myfile.csv']
  ['gs://bucket/path1/*.csv']
  ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

  To disable metadata caching, specify 0. This is the default.

  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
Click Run.

For more information about how to run queries, see Run an interactive query.
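Example

As a concrete sketch, the following statement assumes a Cloud resource connection named myconnection in the us region and Parquet files in gs://mybucket; adjust the names to match your environment:

-- Illustrative names; substitute your own project, dataset, connection, and bucket.
CREATE EXTERNAL TABLE `myproject.mydataset.mytable`
  WITH CONNECTION `myproject.us.myconnection`
  OPTIONS (
    format = "PARQUET",
    uris = ['gs://mybucket/*.parquet'],
    max_staleness = INTERVAL 8 HOUR,
    metadata_cache_mode = 'MANUAL'
  );

Because this sketch sets metadata_cache_mode to MANUAL, you refresh the metadata cache yourself by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure:

CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.mytable');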
bq
Option 1: Table definition file
Use the bq mkdef command to create a table definition file, and then pass the file path to the bq mk command as follows:

bq mkdef \
    --connection_id=CONNECTION_ID \
    --source_format=SOURCE_FORMAT \
    --metadata_cache_mode=CACHE_MODE \
    BUCKET_PATH > DEFINITION_FILE

bq mk --table \
    --external_table_definition=DEFINITION_FILE \
    --max_staleness=STALENESS_INTERVAL \
    PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME \
    SCHEMA
Replace the following:
- CONNECTION_ID: the connection ID, for example, myconnection. When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.
- SOURCE_FORMAT: the format of the external data source, for example, PARQUET.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You must set CACHE_MODE if you set STALENESS_INTERVAL to a value greater than 0 in the subsequent bq mk command. For more information about metadata caching considerations, see Metadata caching for performance.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the table, in the format gs://bucket_name/[folder_name/]file_pattern. You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more information, see Wildcard support for Cloud Storage URIs.

  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:

  gs://bucket/path1/myfile.csv
  gs://bucket/path1/*.parquet
  gs://bucket/path1/file1*, gs://bucket1/path1/*

  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- DEFINITION_FILE: the path to the table definition file on your local machine.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

  To disable metadata caching, specify 0. This is the default.

  To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- PROJECT_ID: the name of the project in which you want to create the table, for example, myproject.
- DATASET: the name of the BigQuery dataset that you want to create a table in, for example, mydataset.
- EXTERNAL_TABLE_NAME: the name of the table that you want to create, for example, mytable.
- SCHEMA: the schema for the BigLake table.
Example:
bq mkdef \
    --connection_id=myconnection \
    --metadata_cache_mode=AUTOMATIC \
    --source_format=CSV \
    'gs://mybucket/*.csv' > mytable_def

bq mk --table \
    --external_table_definition=mytable_def \
    --max_staleness='0-0 0 4:0:0' \
    myproject:mydataset.mybiglaketable \
    Region:STRING,Quarter:STRING,Total_sales:INTEGER
To use schema auto-detection, set the --autodetect=true flag in the mkdef command and omit the schema:

bq mkdef \
    --connection_id=myconnection \
    --metadata_cache_mode=AUTOMATIC \
    --source_format=CSV \
    --autodetect=true \
    'gs://mybucket/*.csv' > mytable_def

bq mk --table \
    --external_table_definition=mytable_def \
    --max_staleness='0-0 0 4:0:0' \
    myproject:mydataset.myexternaltable
Option 2: Inline table definition
Instead of creating a table definition file, you can pass the table definition directly to the bq mk command. Use the @connection decorator to specify the connection to use at the end of the --external_table_definition flag.

bq mk --table \
    --external_table_definition=@SOURCE_FORMAT=BUCKET_PATH@projects/PROJECT_ID/locations/REGION/connections/CONNECTION_ID \
    DATASET_NAME.TABLE_NAME \
    SCHEMA
Replace the following:
- SOURCE_FORMAT: the format of the external data source, for example, CSV.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the table, in the format gs://bucket_name/[folder_name/]file_pattern. You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more information, see Wildcard support for Cloud Storage URIs.

  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:

  gs://bucket/path1/myfile.csv
  gs://bucket/path1/*.parquet
  gs://bucket/path1/file1*, gs://bucket1/path1/*

  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- PROJECT_ID: the name of your project in which you want to create the table, for example, myproject.
- REGION: the region that contains the connection, for example, us.
- CONNECTION_ID: the connection ID, for example, myconnection. When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.
- DATASET_NAME: the name of the dataset where you want to create the BigLake table.
- TABLE_NAME: the name of the BigLake table.
- SCHEMA: the schema for the BigLake table.
Example:
bq mk --table \
    --external_table_definition=@CSV=gs://mybucket/*.csv@projects/myproject/locations/us/connections/myconnection \
    --max_staleness='0-0 0 4:0:0' \
    myproject:mydataset.myexternaltable \
    Region:STRING,Quarter:STRING,Total_sales:INTEGER
API
Call the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.

Specify the connectionId property to identify the connection to use for connecting to Cloud Storage.

The following example creates a BigLake table on unpartitioned data. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
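As a sketch, the JSON request body that you pass to tables.insert might look like the following. The project, dataset, bucket, and connection names are illustrative assumptions; substitute your own values:

{
  "tableReference": {
    "projectId": "myproject",
    "datasetId": "mydataset",
    "tableId": "mytable"
  },
  "externalDataConfiguration": {
    "sourceFormat": "PARQUET",
    "sourceUris": ["gs://mybucket/*.parquet"],
    "connectionId": "projects/myproject/locations/us/connections/myconnection",
    "autodetect": true
  }
}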
Terraform

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

- Launch Cloud Shell.
- Set the default Google Cloud project where you want to apply your Terraform configurations. You only need to run this command once per project, and you can run it in any directory.

  export GOOGLE_CLOUD_PROJECT=PROJECT_ID

  Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

- In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension, for example, main.tf. In this tutorial, the file is referred to as main.tf.

  mkdir DIRECTORY && cd DIRECTORY && touch main.tf

- If you are following a tutorial, you can copy the sample code in each section or step. Copy the sample code into the newly created main.tf. Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.
- Review and modify the sample parameters to apply to your environment.
- Save your changes.
- Initialize Terraform. You only need to do this once per directory.

  terraform init

  Optionally, to use the latest Google provider version, include the -upgrade option:

  terraform init -upgrade

Apply the changes

- Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:

  terraform plan

  Make corrections to the configuration as necessary.
- Apply the Terraform configuration by running the following command and entering yes at the prompt:

  terraform apply

  Wait until Terraform displays the "Apply complete!" message.
- Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
BigLake supports schema auto-detection. However, if you did not provide a schema and the service account was not granted access in the previous steps, these steps fail with an access denied message if you try to auto-detect the schema.
Create BigLake tables on Hive partitioned data
You can create a BigLake table for Hive partitioned data in
Cloud Storage. After you create an externally partitioned table, you
can't change the partition key. You need to recreate the table to change the
partition key.
To create a BigLake table based on Hive partitioned data in
Cloud Storage, select one of the following options:
Console
Go to the BigQuery page.

In the Explorer pane, expand your project and select a dataset.

Click View actions, and then click Create table. This opens the Create table pane.

In the Source section, specify the following details:

- For Create table from, select Google Cloud Storage.
- Provide the path to the folder, using wildcards. For example, my_bucket/my_files*. The folder must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
- From the File format list, select the file type.
- Select the Source data partitioning checkbox, and then specify the following details:
  - For Select Source URI Prefix, enter the URI prefix. For example, gs://my_bucket/my_files.
  - Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
  - In the Partition inference mode section, select one of the following options:
    - Automatically infer types: set the partition schema detection mode to AUTO.
    - All columns are strings: set the partition schema detection mode to STRINGS.
    - Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Provide a custom partition key schema.

In the Destination section, specify the following details:

- For Project, select the project in which you want to create the table.
- For Dataset, select the dataset in which you want to create the table.
- For Table, enter the name of the table that you want to create.
- For Table type, select External table.
- Select the Create a BigLake table using a Cloud Resource connection checkbox.
- For Connection ID, select the connection that you created earlier.

In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.

- To enable schema auto-detection, select the Auto-detect option.
- To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.

To ignore rows with extra column values that do not match the schema, expand the Advanced options section and select Unknown values.

Click Create table.
SQL
Use the CREATE EXTERNAL TABLE DDL statement:

In the Google Cloud console, go to the BigQuery page.

In the query editor, enter the following statement:

CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
  WITH PARTITION COLUMNS
  (
    PARTITION_COLUMN PARTITION_COLUMN_TYPE,
  )
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (
    hive_partition_uri_prefix = "HIVE_PARTITION_URI_PREFIX",
    uris = ['FILE_PATH'],
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = 'CACHE_MODE',
    format = "TABLE_FORMAT"
  );
Replace the following:
- PROJECT_ID: the name of your project in which you want to create the table, for example, myproject.
- DATASET: the name of the BigQuery dataset that you want to create the table in, for example, mydataset.
- EXTERNAL_TABLE_NAME: the name of the table that you want to create, for example, mytable.
- PARTITION_COLUMN: the name of the partitioning column.
- PARTITION_COLUMN_TYPE: the type of the partitioning column.
- REGION: the region that contains the connection, for example, us.
- CONNECTION_ID: the connection ID, for example, myconnection. When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.
- HIVE_PARTITION_URI_PREFIX: the Hive partitioning URI prefix, for example, gs://mybucket/.
- FILE_PATH: the path to the data source for the external table that you want to create, for example, gs://mybucket/*.parquet.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

  To disable metadata caching, specify 0. This is the default.

  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
- TABLE_FORMAT: the format of the table that you want to create, for example, PARQUET.
Click Run.

For more information about how to run queries, see Run an interactive query.
Examples
The following example creates a BigLake table over
partitioned data where:
- The schema is autodetected.
- The metadata cache staleness interval for the table is 1 day.
- The metadata cache refreshes automatically.
CREATE EXTERNAL TABLE `my_dataset.my_table`
WITH PARTITION COLUMNS
(
sku STRING,
)
WITH CONNECTION `us.my-connection`
OPTIONS(
hive_partition_uri_prefix = "gs://mybucket/products",
uris = ['gs://mybucket/products/*'],
max_staleness = INTERVAL 1 DAY,
metadata_cache_mode = 'AUTOMATIC'
);
The following example creates a BigLake table over
partitioned data where:
- The schema is specified.
- The metadata cache staleness interval for the table is 8 hours.
- The metadata cache must be manually refreshed.
CREATE EXTERNAL TABLE `my_dataset.my_table`
(
ProductId INTEGER,
ProductName STRING,
ProductType STRING
)
WITH PARTITION COLUMNS
(
sku STRING,
)
WITH CONNECTION `us.my-connection`
OPTIONS(
hive_partition_uri_prefix = "gs://mybucket/products",
uris = ['gs://mybucket/products/*'],
max_staleness = INTERVAL 8 HOUR,
metadata_cache_mode = 'MANUAL'
);
bq
First, use the bq mkdef command to create a table definition file:

bq mkdef \
    --source_format=SOURCE_FORMAT \
    --connection_id=REGION.CONNECTION_ID \
    --hive_partitioning_mode=PARTITIONING_MODE \
    --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
    --require_hive_partition_filter=BOOLEAN \
    --metadata_cache_mode=CACHE_MODE \
    GCS_URIS > DEFINITION_FILE
Replace the following:
- SOURCE_FORMAT: the format of the external data source, for example, CSV.
- REGION: the region that contains the connection, for example, us.
- CONNECTION_ID: the connection ID, for example, myconnection. When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.
- PARTITIONING_MODE: the Hive partitioning mode. Use one of the following values:

  - AUTO: Automatically detect the key names and types.
  - STRINGS: Automatically convert the key names to strings.
  - CUSTOM: Encode the key schema in the source URI prefix.
- GCS_URI_SHARED_PREFIX: the source URI prefix.
- BOOLEAN: specifies whether to require a predicate filter at query time. This flag is optional. The default value is false.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You only need to include this flag if you also plan to use the --max_staleness flag in the subsequent bq mk command to enable metadata caching. For more information about metadata caching considerations, see Metadata caching for performance.

  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
- GCS_URIS: the path to the Cloud Storage folder, using wildcard format.
- DEFINITION_FILE: the path to the table definition file on your local machine.

If PARTITIONING_MODE is CUSTOM, include the partition key schema in the source URI prefix, using the following format:

--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...
After you create the table definition file, use the bq mk command to create the BigLake table:

bq mk --external_table_definition=DEFINITION_FILE \
    --max_staleness=STALENESS_INTERVAL \
    DATASET_NAME.TABLE_NAME \
    SCHEMA
Replace the following:
- DEFINITION_FILE: the path to the table definition file.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. If you include this flag, you must have also specified a value for the --metadata_cache_mode flag in the preceding bq mkdef command. For more information about metadata caching considerations, see Metadata caching for performance.

  To disable metadata caching, specify 0. This is the default.

  To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- DATASET_NAME: the name of the dataset that contains the table.
- TABLE_NAME: the name of the table you're creating.
- SCHEMA: specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.
Examples
The following example uses AUTO Hive partitioning mode, and also sets the metadata cache to have a 12 hour staleness interval and to be refreshed automatically:

bq mkdef --source_format=CSV \
    --connection_id=us.my-connection \
    --hive_partitioning_mode=AUTO \
    --hive_partitioning_source_uri_prefix=gs://myBucket/myTable \
    --metadata_cache_mode=AUTOMATIC \
    'gs://myBucket/myTable/*' > mytable_def

bq mk --external_table_definition=mytable_def \
    --max_staleness='0-0 0 12:0:0' \
    mydataset.mytable \
    Region:STRING,Quarter:STRING,Total_sales:INTEGER
The following example uses STRINGS Hive partitioning mode:

bq mkdef --source_format=CSV \
    --connection_id=us.my-connection \
    --hive_partitioning_mode=STRINGS \
    --hive_partitioning_source_uri_prefix=gs://myBucket/myTable \
    'gs://myBucket/myTable/*' > mytable_def

bq mk --external_table_definition=mytable_def \
    mydataset.mytable \
    Region:STRING,Quarter:STRING,Total_sales:INTEGER
The following example uses CUSTOM Hive partitioning mode:

bq mkdef --source_format=CSV \
    --connection_id=us.my-connection \
    --hive_partitioning_mode=CUSTOM \
    --hive_partitioning_source_uri_prefix=gs://myBucket/myTable/{dt:DATE}/{val:STRING} \
    'gs://myBucket/myTable/*' > mytable_def

bq mk --external_table_definition=mytable_def \
    mydataset.mytable \
    Region:STRING,Quarter:STRING,Total_sales:INTEGER
API
To set Hive partitioning using the BigQuery API, include the hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file. To create a BigLake table, you must also specify a value for the connectionId field.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows:

gs://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.

The following example creates a BigLake table on partitioned data. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
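As a sketch, the externalDataConfiguration portion of the Table resource might look like the following, using the illustrative bucket and connection names from the SQL examples above:

{
  "externalDataConfiguration": {
    "sourceFormat": "PARQUET",
    "sourceUris": ["gs://mybucket/products/*"],
    "connectionId": "projects/myproject/locations/us/connections/my-connection",
    "autodetect": true,
    "hivePartitioningOptions": {
      "mode": "AUTO",
      "sourceUriPrefix": "gs://mybucket/products",
      "requirePartitionFilter": false
    }
  }
}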
Terraform

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

- Launch Cloud Shell.
- Set the default Google Cloud project where you want to apply your Terraform configurations. You only need to run this command once per project, and you can run it in any directory.

  export GOOGLE_CLOUD_PROJECT=PROJECT_ID

  Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

- In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension, for example, main.tf. In this tutorial, the file is referred to as main.tf.

  mkdir DIRECTORY && cd DIRECTORY && touch main.tf

- If you are following a tutorial, you can copy the sample code in each section or step. Copy the sample code into the newly created main.tf. Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.
- Review and modify the sample parameters to apply to your environment.
- Save your changes.
- Initialize Terraform. You only need to do this once per directory.

  terraform init

  Optionally, to use the latest Google provider version, include the -upgrade option:

  terraform init -upgrade

Apply the changes

- Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:

  terraform plan

  Make corrections to the configuration as necessary.
- Apply the Terraform configuration by running the following command and entering yes at the prompt:

  terraform apply

  Wait until Terraform displays the "Apply complete!" message.
- Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Set up access control policies
You can use several methods to control access to BigLake tables, such as row-level access policies. For example, let's say you want to limit row access for the table mytable in the dataset mydataset:
+---------+---------+-------+
| country | product | price |
+---------+---------+-------+
| US | phone | 100 |
| JP | tablet | 300 |
| UK | laptop | 200 |
+---------+---------+-------+
You can create a row-level filter for Kim (kim@example.com) that restricts their access to rows where country is equal to US.

CREATE ROW ACCESS POLICY only_us_filter
ON mydataset.mytable
GRANT TO ('user:kim@example.com')
FILTER USING (country = 'US');

Then, Kim runs the following query:

SELECT * FROM projectid.mydataset.mytable;

The output shows only the rows where country is equal to US:
+---------+---------+-------+
| country | product | price |
+---------+---------+-------+
| US | phone | 100 |
+---------+---------+-------+
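If you later need to remove the filter, you can drop the policy. A sketch using the policy and table names from this example:

-- Uses the policy and table names from the example above.
DROP ROW ACCESS POLICY only_us_filter ON mydataset.mytable;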
Query BigLake tables
For more information, see Query Cloud Storage data in BigLake tables.
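After a BigLake table is created, you can query it with GoogleSQL just as you would a native table. For example, using the sample table from the access control example above:

-- Uses the sample mydataset.mytable from the previous section.
SELECT country, SUM(price) AS total_price
FROM mydataset.mytable
GROUP BY country;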
Update BigLake tables
You can update BigLake tables if necessary, for example to change their metadata caching. To get table details such as source format and source URI, see Get table information.

You can also use this same procedure to upgrade Cloud Storage-based external tables to BigLake tables by associating the external table with a connection. For more information, see Upgrade external tables to BigLake tables.

To update a BigLake table, select one of the following options:
SQL
Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:

In the Google Cloud console, go to the BigQuery page.

In the query editor, enter the following statement:

CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
  WITH CONNECTION `REGION.CONNECTION_ID`
  OPTIONS (
    format = "TABLE_FORMAT",
    uris = ['BUCKET_PATH'],
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = 'CACHE_MODE'
  );
Replace the following:
- PROJECT_ID: the name of the project that contains the table.
- DATASET: the name of the dataset that contains the table.
- EXTERNAL_TABLE_NAME: the name of the table.
- REGION: the region that contains the connection.
- CONNECTION_ID: the name of the connection to use.
- TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name']. You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.

  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:

  ['gs://bucket/path1/myfile.csv']
  ['gs://bucket/path1/*.csv']
  ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

  To disable metadata caching, specify 0. This is the default.

  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
Click Run.

For more information about how to run queries, see Run an interactive query.
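Example

As a sketch, the following statement, reusing the illustrative names from earlier in this document, switches mytable to automatic metadata refreshes with a 4 hour staleness interval; the format and uris values must still describe the existing data:

-- Illustrative names; substitute your own project, dataset, connection, and bucket.
CREATE OR REPLACE EXTERNAL TABLE `myproject.mydataset.mytable`
  WITH CONNECTION `us.myconnection`
  OPTIONS (
    format = "PARQUET",
    uris = ['gs://mybucket/*.parquet'],
    max_staleness = INTERVAL 4 HOUR,
    metadata_cache_mode = 'AUTOMATIC'
  );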
bq
Use the bq mkdef and bq update commands to update a table:

Generate an external table definition that describes the aspects of the table to change:

bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
    --source_format=TABLE_FORMAT \
    --metadata_cache_mode=CACHE_MODE \
    "BUCKET_PATH" > /tmp/DEFINITION_FILE
Replace the following:
- PROJECT_ID: the name of the project that contains the connection.
- REGION: the region that contains the connection.
- CONNECTION_ID: the name of the connection to use.
- TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name. You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.

  You can specify multiple buckets for the uris option by providing multiple paths. The following examples show valid uris values:

  gs://bucket/path1/myfile.csv
  gs://bucket/path1/*.csv
  gs://bucket/path1/*,gs://bucket/path2/file00*

  When you specify uris values that target multiple files, all of those files must share a compatible schema. For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- DEFINITION_FILE: the name of the table definition file that you are creating.
Update the table using the new external table definition:

bq update --max_staleness=STALENESS_INTERVAL \
    --external_table_definition=/tmp/DEFINITION_FILE \
    PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME
Replace the following:
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

  To disable metadata caching, specify 0. This is the default.

  To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- DEFINITION_FILE: the name of the table definition file that you created or updated.
- PROJECT_ID: the name of the project that contains the table.
- DATASET: the name of the dataset that contains the table.
- EXTERNAL_TABLE_NAME: the name of the table.
Example
The following example updates mytable to use cached metadata as long as it has been refreshed in the last 4.5 hours, and also to refresh cached metadata automatically:

bq update --project_id=myproject --max_staleness='0-0 0 4:30:0' \
    --external_table_definition=enable_metadata.json mydataset.mytable

where enable_metadata.json has the following contents:
{
"metadataCacheMode": "AUTOMATIC"
}
Audit logging
For information about logging in BigQuery, see Introduction to BigQuery monitoring. To learn more about logging in Google Cloud, see Cloud Logging.
What's next