Create and use clustered tables
This document describes how to create and use clustered tables in
BigQuery. For an overview of clustered table support in
BigQuery, see
Introduction to clustered tables
.
Create clustered tables
You can create a clustered table by using the following methods:
Table naming
When you create a table in BigQuery, the table name must
be unique per dataset. The table name can:
- Contain characters with a total of up to 1,024 UTF-8 bytes.
- Contain Unicode characters in category L (letter), M (mark), N (number),
Pc (connector, including underscore), Pd (dash), Zs (space). For more
information, see
General Category
.
The following are all examples of valid table names:
table 01
,
??????
,
00_お客?
,
etudiant-01
.
Caveats:
- Table names are case-sensitive by default.
mytable
and
MyTable
can
coexist in the same dataset, unless they are part of a
dataset with
case-sensitivity turned off
.
- Some table names and table name prefixes are reserved. If
you receive an error saying that your table name or prefix is
reserved, then select a different name and try again.
If you include multiple dot operators (
.
) in a sequence, the duplicate
operators are implicitly stripped.
For example, this:
project_name....dataset_name..table_name
Becomes this:
project_name.dataset_name.table_name
Required permissions
To create a table, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.jobs.create
Additionally, you might require the
bigquery.tables.getData
permission to
access the data that you write to the table.
Each of the following predefined IAM roles includes the
permissions that you need in order to create a table:
roles/bigquery.dataEditor
roles/bigquery.dataOwner
roles/bigquery.admin
(includes the
bigquery.jobs.create
permission)
roles/bigquery.user
(includes the
bigquery.jobs.create
permission)
roles/bigquery.jobUser
(includes the
bigquery.jobs.create
permission)
Additionally, if you have the
bigquery.datasets.create
permission, you can
create and update tables in the datasets that you create.
For more information on IAM roles and permissions in
BigQuery, see
Predefined roles and permissions
.
Create an empty clustered table with a schema definition
You specify clustering columns when you create a table in
BigQuery. After the table is created, you can modify the
clustering columns; see
Modifying clustering specification
for details.
Clustering columns must be top-level, non-repeated columns, and they must be one
of the following simple data types:
DATE
BOOLEAN
GEOGRAPHY
INTEGER
NUMERIC
BIGNUMERIC
STRING
TIMESTAMP
RANGE
(
preview
)
You can specify up to four clustering columns. When you specify multiple
columns, the order of the columns determines how the data is sorted. For
example, if the table is clustered by columns a, b and c, the data is sorted in
the same order: first by column a, then by column b, and then by column
c. As a best practice, place the most frequently filtered or aggregated column
first.
The order of your clustering columns also affects query performance and pricing.
For more information about query best practices for clustered tables, see
Querying clustered tables
.
To create an empty clustered table with a schema definition:
Console
In the Google Cloud console, go to the
BigQuery
page.
Go to BigQuery
- In the
Explorer
pane, expand your project, and then select a dataset.
- In the
Dataset info
section, click
add_box
Create table
.
- In the
Create table
panel, specify the following details:
- In the
Source
section, select
Empty table
in the
Create table from
list.
- In the
Destination
section, specify the following details:
- For
Dataset
, select the dataset in which you want to create the
table.
- In the
Table
field, enter the name of the table that you want to create.
- Verify that the
Table type
field is set to
Native table
.
- In the
Schema
section, enter the
schema
definition.
You can enter schema information manually by using one of
the following methods:
- For
Clustering order
, enter between one and four
comma-separated column names.
- Optional: In the
Advanced options
section, if you want to use a
customer-managed encryption key, then select the
Use a customer-managed
encryption key (CMEK)
option. By default, BigQuery
encrypts customer content stored at rest
by using a Google-owned and Google-managed key.
- Click
Create table
.
SQL
Use the
CREATE TABLE
DDL statement
command with the
CLUSTER BY
option. The following example creates a
clustered table named
myclusteredtable
in
mydataset
:
In the Google Cloud console, go to the
BigQuery
page.
Go to BigQuery
In the query editor, enter the following statement:
CREATE TABLE mydataset.myclusteredtable
(
customer_id STRING,
transaction_amount NUMERIC
)
CLUSTER
BY
customer_id
OPTIONS
(
description = 'a table clustered by customer_id');
Click
play_circle
Run
.
For more information about how to run queries, see
Run an interactive query
.
bq
Use the
bq mk
command
with the following flags:
--table
(or the
-t
shortcut).
--schema
. You can supply the table's schema definition inline or
use a JSON schema file.
--clustering_fields
. You can specify up to four clustering columns.
Optional parameters include
--expiration
,
--description
,
--time_partitioning_type
,
--time_partitioning_field
,
--time_partitioning_expiration
,
--destination_kms_key
, and
--label
.
If you are creating a table in a project other than your default project,
add the project ID to the dataset in the following format:
project_id:dataset
.
--destination_kms_key
is not demonstrated here. For information about
using
--destination_kms_key
, see
customer-managed encryption keys
.
Enter the following command to create an empty clustered table with a
schema definition:
bq mk \
--table \
--expiration
INTEGER1
\
--schema
SCHEMA
\
--clustering_fields
CLUSTER_COLUMNS
\
--description "
DESCRIPTION
" \
--label
KEY:VALUE,KEY:VALUE
\
PROJECT_ID
:
DATASET
.
TABLE
Replace the following:
INTEGER1
: the default lifetime, in seconds, for
the table.
The minimum value is 3,600 seconds (one hour). The expiration
time evaluates to the current UTC time plus the integer value. If you set
the table's expiration time when you create a table, the dataset's default
table expiration setting is ignored. Setting this value deletes the table
after the specified time.
SCHEMA
: an inline schema definition in the format
COLUMN:DATA_TYPE,COLUMN:DATA_TYPE
or the path to
the JSON schema file on your local machine.
CLUSTER_COLUMNS
: a comma-separated list of up to
four clustering columns. The list cannot contain any spaces.
DESCRIPTION
: a description of the table, in quotes.
KEY:VALUE
: the key-value pair that represents a
label
. You can enter multiple labels using
a comma-separated list.
PROJECT_ID
: your project ID.
DATASET
: a dataset in your project.
TABLE
: the name of the table you're creating.
When you specify the schema on the command line, you cannot include a
RECORD
(
STRUCT
)
type, you cannot include a column description, and you
cannot specify the column's mode. All modes default to
NULLABLE
. To
include descriptions, modes, and
RECORD
types,
supply a JSON schema
file
instead.
Examples:
Enter the following command to create a clustered table
named
myclusteredtable
in
mydataset
in your default project. The table's
expiration is set to 2,592,000 (1 30-day month), the description is set to
This is my clustered table
, and the label is set to
organization:development
. The command uses the
-t
shortcut instead of
--table
.
The schema is specified inline as:
timestamp:timestamp,customer_id:string,transaction_amount:float
. The
specified clustering field
customer_id
is used to cluster the table.
bq mk \
-t \
--expiration 2592000 \
--schema 'timestamp:timestamp,customer_id:string,transaction_amount:float' \
--clustering_fields customer_id \
--description "This is my clustered table" \
--label org:dev \
mydataset.myclusteredtable
Enter the following command to create a clustered table named
myclusteredtable
in
myotherproject
, not your default project. The
description is set to
This is my clustered table
, and the label is set
to
organization:development
. The command uses the
-t
shortcut instead of
--table
. This command does not specify a table expiration. If the dataset
has a default table expiration, it is applied. If the dataset has no default
table expiration, the table never expires.
The schema is specified in a local JSON file:
/tmp/myschema.json
. The
customer_id
field is used to cluster the table.
bq mk \
-t \
--expiration 2592000 \
--schema /tmp/myschema.json \
--clustering_fields=customer_id \
--description "This is my clustered table" \
--label org:dev \
myotherproject:mydataset.myclusteredtable
After the table is created, you can update the table's
description
and
labels
.
Use the
google_bigquery_table
resource.
To authenticate to BigQuery, set up Application Default
Credentials. For more information, see
Set up authentication for client libraries
.
The following example creates a table named
mytable
that is clustered
on the
ID
and
Created
columns:
To apply your Terraform configuration in a Google Cloud project, complete the steps in the
following sections.
Prepare Cloud Shell
- Launch
Cloud Shell
.
-
Set the default Google Cloud project
where you want to apply your Terraform configurations.
You only need to run this command once per project, and you can run it in any directory.
export GOOGLE_CLOUD_PROJECT=
PROJECT_ID
Environment variables are overridden if you set explicit values in the Terraform
configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also
called a
root module
).
-
In
Cloud Shell
, create a directory and a new
file within that directory. The filename must have the
.tf
extension—for example
main.tf
. In this
tutorial, the file is referred to as
main.tf
.
mkdir
DIRECTORY
&& cd
DIRECTORY
&& touch main.tf
-
If you are following a tutorial, you can copy the sample code in each section or step.
Copy the sample code into the newly created
main.tf
.
Optionally, copy the code from GitHub. This is recommended
when the Terraform snippet is part of an end-to-end solution.
- Review and modify the sample parameters to apply to your environment.
- Save your changes.
-
Initialize Terraform. You only need to do this once per directory.
terraform init
Optionally, to use the latest Google provider version, include the
-upgrade
option:
terraform init -upgrade
Apply the changes
-
Review the configuration and verify that the resources that Terraform is going to create or
update match your expectations:
terraform plan
Make corrections to the configuration as necessary.
-
Apply the Terraform configuration by running the following command and entering
yes
at the prompt:
terraform apply
Wait until Terraform displays the "Apply complete!" message.
- Open your Google Cloud project
to view
the results. In the Google Cloud console, navigate to your resources in the UI to make sure
that Terraform has created or updated them.
Create a clustered table from a query result
There are two ways to create a clustered table from a query result:
You can create a clustered table by querying either a partitioned table or a
non-partitioned table. You cannot change an existing table to a clustered table
by using query results.
When you create a clustered table from a query result, you must use standard
SQL. Currently, legacy SQL is not supported for querying clustered tables or
for writing query results to clustered tables.
SQL
To create a clustered table from a query result, use the
CREATE TABLE
DDL statement
with the
CLUSTER BY
option. The following example creates a new
table clustered by
customer_id
by querying an existing unclustered table:
In the Google Cloud console, go to the
BigQuery
page.
Go to BigQuery
In the query editor, enter the following statement:
CREATE TABLE mydataset.clustered_table
(
customer_id STRING,
transaction_amount NUMERIC
)
CLUSTER
BY
customer_id
AS (
SELECT * FROM mydataset.unclustered_table
);
Click
play_circle
Run
.
For more information about how to run queries, see
Run an interactive query
.
bq
Enter the following command to create a new, clustered destination
table from a query result:
bq --location=
LOCATION
query \
--use_legacy_sql=false '
QUERY
'
Replace the following:
LOCATION
: the name of your location. The
--location
flag is
optional. For example, if you are using BigQuery in the
Tokyo region, you can set the flag's value to
asia-northeast1
. You
can set a default value for the location using the
.bigqueryrc file
.
QUERY
: a query in GoogleSQL syntax. Currently, you cannot
use legacy SQL to query clustered tables or to write query results to
clustered tables. The query can contain a
CREATE TABLE
DDL
statement that specifies the options for creating your clustered table.
You can use DDL rather than specifying the individual command-line
flags.
Examples:
Enter the following command to write query results to a clustered
destination table named
myclusteredtable
in
mydataset
.
mydataset
is in
your default project. The query retrieves data from a non-partitioned table:
mytable. The table's
customer_id
column is used to cluster the
table. The table's
timestamp
column is used to create a partitioned table.
bq query --use_legacy_sql=false \
'CREATE TABLE
mydataset.myclusteredtable
PARTITION BY
DATE(timestamp)
CLUSTER BY
customer_id
AS (
SELECT
*
FROM
`mydataset.mytable`
);'
API
To save query results to a clustered table,
call the
jobs.insert
method
,
configure a
query
job
,
and include a
CREATE TABLE
DDL
statement that creates your clustered table.
Specify your location in the
location
property in the
jobReference
section of the
job resource
.
Create a clustered table when you load data
You can create a clustered table by specifying clustering columns when you load
data into a new table. You do not need to create an empty table before loading
data into it. You can create the clustered table and load your data at the same
time.
For more information about loading data, see
Introduction to loading data into BigQuery
.
To define clustering when defining a load job:
SQL
Use the
LOAD DATA
statement
.
The following example loads AVRO data to create a table that is partitioned
by the
transaction_date
field and clustered by the
customer_id
field.
It also configures the partitions to expire after three days.
In the Google Cloud console, go to the
BigQuery
page.
Go to BigQuery
In the query editor, enter the following statement:
LOAD
DATA
INTO mydataset.mytable
PARTITION
BY transaction_date
CLUSTER
BY customer_id
OPTIONS
(
partition_expiration_days = 3)
FROM FILES(
format = 'AVRO',
uris = ['gs://bucket/path/file.avro']);
Click
play_circle
Run
.
For more information about how to run queries, see
Run an interactive query
.
API
To define a clustering configuration when creating a table through a
load job, you can populate the
Clustering
properties for the table.
Control access to clustered tables
To configure access to tables and views, you can grant an
IAM role to an entity at the following levels, listed in
order of range of resources allowed (largest to smallest):
You can also restrict data access within tables, by using the following
methods:
Access with any resource protected by IAM is additive. For
example, if an entity does not have access at the high level such as a project,
you could grant the entity access at the dataset level, and then the entity will
have access to the tables and views in the dataset. Similarly, if the entity
does not have access at the high level or the dataset level, you could grant the
entity access at the table or view level.
Granting IAM roles at a higher level in the
Google Cloud
resource hierarchy
such as the project, folder, or organization level gives the entity access to a
broad set of resources. For example, granting a role to an entity at the project
level gives that entity permissions that apply to all datasets throughout the
project.
Granting a role at the dataset level specifies the operations an entity is
allowed to perform on tables and views in that specific dataset, even if the
entity does not have access at a higher level. For information on configuring
dataset-level access controls, see
Controlling access to datasets
.
Granting a role at the table or view level specifies the operations an entity is
allowed to perform on specific tables and views, even if the entity does not
have access at a higher level. For information on configuring table-level access
controls, see
Controlling access to tables and views
.
You can also create
IAM custom roles
.
If you create a custom role, the permissions you grant depend on the specific
operations you want the entity to be able to perform.
You can't set a "deny" permission on any resource protected by
IAM.
For more information about roles and permissions, see
Understanding roles
in the IAM documentation and the BigQuery
IAM roles and permissions
.
Use clustered tables
You can get information about tables in the following ways:
- Using the Google Cloud console.
- Using the bq command-line tool's
bq show
command.
- Calling the
tables.get
API method.
- Querying
INFORMATION_SCHEMA
views.
Required permissions
At a minimum, to get information about tables, you must be granted
bigquery.tables.get
permissions. The following predefined IAM
roles include
bigquery.tables.get
permissions:
bigquery.metadataViewer
bigquery.dataViewer
bigquery.dataOwner
bigquery.dataEditor
bigquery.admin
In addition, if a user has
bigquery.datasets.create
permissions, when that
user creates a dataset, they are granted
bigquery.dataOwner
access to it.
bigquery.dataOwner
access gives the user the ability to get information about
tables in a dataset.
For more information about IAM roles and permissions in
BigQuery, see
Predefined roles and permissions
.
To view information about a clustered table:
Console
In the Google Cloud console, go to the
Resources
pane.
Click your dataset name to expand it, and then click the table name
you want to view.
Click
Details
. This page displays the table's
details including the clustering columns.
SQL
For clustered tables, you can query the
CLUSTERING_ORDINAL_POSITION
column
in the
INFORMATION_SCHEMA.COLUMNS
view
to find the 1-indexed offset of the column within the table's clustering
columns:
In the Google Cloud console, go to the
BigQuery
page.
Go to BigQuery
In the query editor, enter the following statement:
CREATE TABLE mydataset.data (column1 INT64, column2 INT64)
CLUSTER
BY column1, column2;
SELECT
column_name, clustering_ordinal_position
FROM
mydataset.INFORMATION_SCHEMA.COLUMNS;
Click
play_circle
Run
.
For more information about how to run queries, see
Run an interactive query
.
The clustering ordinal position is 1 for
column1
and 2 for
column2
.
More table metadata is available through the
TABLES
,
TABLE_OPTIONS
,
COLUMNS
, and
COLUMN_FIELD_PATH
views in
INFORMATION_SCHEMA
.
bq
Issue the
bq show
command to display all table information. Use the
--schema
flag to display only table schema information. The
--format
flag can be used to control the output.
If you are getting information about a table in a project other than
your default project, add the project ID to the dataset in the following
format:
project_id:dataset
.
bq show \
--schema \
--format=prettyjson \
PROJECT_ID
:
DATASET
.
TABLE
Replace the following:
PROJECT_ID
: your project ID
DATASET
: the name of the dataset
TABLE
: the name of the table
Examples:
Enter the following command to display all information about
myclusteredtable
in
mydataset
.
mydataset
in your default project.
bq show --format=prettyjson mydataset.myclusteredtable
The output should look like the following:
{
"clustering": {
"fields": [
"customer_id"
]
},
...
}
List clustered tables in a dataset
You can list clustered tables in datasets in the following ways:
- Using the Google Cloud console.
- Using the bq command-line tool's
bq ls
command.
- Calling the
tables.list
API
method.
- Using the client libraries.
- Querying the
CLUSTERING_ORDINAL_POSITION
column in the
INFORMATION_SCHEMA.COLUMNS
view.
The permissions required to list clustered tables and the steps to list them
are the same as for standard tables. For more information about listing
tables, see
Listing tables in a dataset
.
Modify clustering specification
You can change or remove a table's clustering specifications, or change the set
of clustered columns in a clustered table. This method of updating the
clustering column set is useful for tables that use continuous streaming inserts
because those tables cannot be easily swapped by other methods.
Follow these steps to apply a new clustering specification to unpartitioned or
partitioned tables.
In the bq tool, update the clustering specification of your
table to match the new clustering:
bq update --clustering_fields=
CLUSTER_COLUMN
DATASET
.
ORIGINAL_TABLE
Replace the following:
CLUSTER_COLUMN
: the column you are clustering on—for example,
mycolumn
DATASET
: the name of the dataset containing the table—for example,
mydataset
ORIGINAL_TABLE
: the name of your original table—for example,
mytable
You can also call the
tables.update
or
tables.patch
API method to
modify the clustering specification
.
To cluster all rows according to the new clustering specification,
run the following
UPDATE
statement:
UPDATE
DATASET
.
ORIGINAL_TABLE
SET
CLUSTER_COLUMN
=
CLUSTER_COLUMN
WHERE true
Table security
To control access to tables in BigQuery, see
Introduction to table access controls
.
What's next