Create datasets
This document describes how to create datasets in BigQuery.
You can create datasets in the following ways:
- Using the Google Cloud console.
- Using a SQL query.
- Using the `bq mk` command in the bq command-line tool.
- Calling the `datasets.insert` API method.
- Using the client libraries.
- Copying an existing dataset.

To see steps for copying a dataset, including across regions, see Copying datasets.

To learn to query tables in a public dataset, see Query a public dataset with the Google Cloud console.
Dataset limitations
BigQuery datasets are subject to the following limitations:
- The dataset location can only be set at creation time. After a dataset is created, its location cannot be changed.
- All tables that are referenced in a query must be stored in datasets in the same location.
- When you copy a table, the datasets that contain the source table and destination table must reside in the same location.
- Dataset names must be unique for each project.
- If you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again (see the sketch after this list).
- You can't enroll a dataset in physical storage billing if you have any existing legacy flat-rate slot commitments located in the same region as the dataset.
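For example, if a dataset is eligible, you can switch its storage billing model with an `ALTER SCHEMA` statement. A minimal sketch, where mydataset is a placeholder name:

```sql
-- Switch mydataset to physical storage billing.
-- After this change, you must wait 14 days to change the model again.
ALTER SCHEMA mydataset
  SET OPTIONS (storage_billing_model = 'PHYSICAL');
```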
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.
Required permissions
To create a dataset, you need the `bigquery.datasets.create` IAM permission.

Each of the following predefined IAM roles includes the permissions that you need in order to create a dataset:
- `roles/bigquery.dataEditor`
- `roles/bigquery.dataOwner`
- `roles/bigquery.user`
- `roles/bigquery.admin`

For more information about IAM roles in BigQuery, see Predefined roles and permissions.
Name datasets
When you create a dataset in BigQuery, the dataset name must be unique for each project. The dataset name can contain the following:
- Up to 1,024 characters.
- Letters (uppercase or lowercase), numbers, and underscores.

Dataset names are case-sensitive by default. `mydataset` and `MyDataset` can coexist in the same project, unless one of them has case-sensitivity turned off.

Dataset names cannot contain spaces or special characters such as `-`, `&`, `@`, or `%`.
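As an illustrative sketch of these rules (the dataset names are hypothetical):

```sql
-- Valid: letters, numbers, and underscores, up to 1,024 characters.
CREATE SCHEMA my_dataset_01;

-- Invalid: hyphens and other special characters are rejected.
-- CREATE SCHEMA my-dataset;
```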
Hidden datasets
A hidden dataset is a dataset whose name begins with an underscore. You can query tables and views in hidden datasets the same way you would in any other dataset, as the sketch after the following list shows. Hidden datasets have the following restrictions:
- They are hidden from the Explorer panel in the Google Cloud console.
- They don't appear in any INFORMATION_SCHEMA views.
- They can't be used with linked datasets.
- They don't appear in Data Catalog.
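For example, a minimal sketch that creates and queries a hidden dataset (the dataset and table names are hypothetical):

```sql
-- The leading underscore makes the dataset hidden in the Explorer panel.
CREATE SCHEMA _staging;

CREATE TABLE _staging.events (id INT64, created_at TIMESTAMP);

-- Tables in hidden datasets are queried like tables in any other dataset.
SELECT id FROM _staging.events;
```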
Create datasets
To create a dataset:
Console
- Open the BigQuery page in the Google Cloud console.
  Go to the BigQuery page
- In the Explorer panel, select the project where you want to create the dataset.
- Expand the Actions option and click Create dataset.
- On the Create dataset page:
  - For Dataset ID, enter a unique dataset name.
  - For Location type, choose a geographic location for the dataset. After a dataset is created, the location can't be changed.
  - Optional: If you want tables in this dataset to expire, select Enable table expiration, then specify the Default maximum table age in days.
  - Optional: If you want to use a customer-managed encryption key (CMEK), expand Advanced options, then select Customer-managed encryption key (CMEK).
  - Optional: If you want to use case-insensitive table names, expand Advanced options, then select Enable case insensitive table names.
  - Optional: If you want to use a default collation, expand Advanced options, select Enable default collation, then select the Default Collation to use.
  - Optional: If you want to use a default rounding mode, expand Advanced options, then select the Default Rounding Mode to use.
  - Optional: If you want to enable the physical storage billing model, expand Advanced options, then select Enable physical storage billing model.
    When you change a dataset's billing model, it takes 24 hours for the change to take effect.
    Once you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.
  - Optional: If you want to set the dataset's time travel window, expand Advanced options, then select the Time travel window to use.
- Click Create dataset.
SQL
Use the `CREATE SCHEMA` statement. To create a dataset in a project other than your default project, add the project ID to the dataset ID in the following format: PROJECT_ID.DATASET_ID.

- In the Google Cloud console, go to the BigQuery page.
  Go to BigQuery
- In the query editor, enter the following statement:
```sql
CREATE SCHEMA PROJECT_ID.DATASET_ID
  OPTIONS (
    default_kms_key_name = 'KMS_KEY_NAME',
    default_partition_expiration_days = PARTITION_EXPIRATION,
    default_table_expiration_days = TABLE_EXPIRATION,
    description = 'DESCRIPTION',
    labels = [('LABEL_1','VALUE_1'),('LABEL_2','VALUE_2')],
    location = 'LOCATION',
    max_time_travel_hours = HOURS,
    storage_billing_model = BILLING_MODEL);
```
Replace the following:
- PROJECT_ID: your project ID.
- DATASET_ID: the ID of the dataset that you're creating.
- KMS_KEY_NAME: the name of the default Cloud Key Management Service key used to protect newly created tables in this dataset unless a different key is supplied at the time of creation. You cannot create a Google-encrypted table in a dataset with this parameter set.
- PARTITION_EXPIRATION: the default lifetime (in days) for partitions in newly created partitioned tables. The default partition expiration has no minimum value. The expiration time evaluates to the partition's date plus the integer value. Any partition created in a partitioned table in the dataset is deleted PARTITION_EXPIRATION days after the partition's date. If you supply the `time_partitioning_expiration` option when you create or update a partitioned table, the table-level partition expiration takes precedence over the dataset-level default partition expiration.
- TABLE_EXPIRATION: the default lifetime (in days) for newly created tables. The minimum value is 0.042 days (one hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted TABLE_EXPIRATION days after its creation time. This value is applied if you do not set a table expiration when you create the table.
- DESCRIPTION: a description of the dataset.
- LABEL_1:VALUE_1: the key-value pair that you want to set as the first label on this dataset.
- LABEL_2:VALUE_2: the key-value pair that you want to set as the second label.
- LOCATION: the dataset's location. After a dataset is created, the location can't be changed.
- HOURS: the duration in hours of the time travel window for the new dataset. The HOURS value must be an integer expressed in multiples of 24 (48, 72, 96, 120, 144, or 168) between 48 (2 days) and 168 (7 days). 168 hours is the default if this option isn't specified.
- BILLING_MODEL: sets the storage billing model for the dataset. Set the BILLING_MODEL value to PHYSICAL to use physical bytes when calculating storage charges, or to LOGICAL to use logical bytes. LOGICAL is the default.
  When you change a dataset's billing model, it takes 24 hours for the change to take effect.
  Once you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.
- Click Run.

For more information about how to run queries, see Run an interactive query.
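For example, the following statement creates a dataset named mydataset with its location set to US, a one-hour default table expiration, and a description, mirroring the bq example later in this document (myproject and mydataset are placeholder names):

```sql
CREATE SCHEMA `myproject.mydataset`
  OPTIONS (
    location = 'US',
    default_table_expiration_days = 0.042,  -- one hour
    description = 'This is my dataset');
```

The console's advanced options for case-insensitive table names, default collation, and default rounding mode correspond to additional schema options that aren't shown in the template above. A hedged sketch, assuming the `is_case_insensitive`, `default_collate`, and `default_rounding_mode` options with illustrative values:

```sql
CREATE SCHEMA `myproject.mydataset2`
  OPTIONS (
    is_case_insensitive = TRUE,            -- table names ignore case
    default_collate = 'und:ci',            -- case-insensitive string comparisons
    default_rounding_mode = 'ROUND_HALF_EVEN');
```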
bq
To create a new dataset, use the `bq mk` command with the `--location` flag. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_ID.
```
bq --location=LOCATION mk \
    --dataset \
    --default_kms_key=KMS_KEY_NAME \
    --default_partition_expiration=PARTITION_EXPIRATION \
    --default_table_expiration=TABLE_EXPIRATION \
    --description="DESCRIPTION" \
    --label=LABEL_1:VALUE_1 \
    --label=LABEL_2:VALUE_2 \
    --max_time_travel_hours=HOURS \
    --storage_billing_model=BILLING_MODEL \
    PROJECT_ID:DATASET_ID
```
Replace the following:
- LOCATION: the dataset's location. After a dataset is created, the location can't be changed. You can set a default value for the location by using the `.bigqueryrc` file.
- KMS_KEY_NAME: the name of the default Cloud Key Management Service key used to protect newly created tables in this dataset unless a different key is supplied at the time of creation. You cannot create a Google-encrypted table in a dataset with this parameter set.
- PARTITION_EXPIRATION: the default lifetime (in seconds) for partitions in newly created partitioned tables. The default partition expiration has no minimum value. The expiration time evaluates to the partition's date plus the integer value. Any partition created in a partitioned table in the dataset is deleted PARTITION_EXPIRATION seconds after the partition's date. If you supply the `--time_partitioning_expiration` flag when you create or update a partitioned table, the table-level partition expiration takes precedence over the dataset-level default partition expiration.
- TABLE_EXPIRATION: the default lifetime (in seconds) for newly created tables. The minimum value is 3600 seconds (one hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted TABLE_EXPIRATION seconds after its creation time. This value is applied if you don't set a table expiration when you create the table.
- DESCRIPTION: a description of the dataset.
- LABEL_1:VALUE_1: the key-value pair that you want to set as the first label on this dataset.
- LABEL_2:VALUE_2: the key-value pair that you want to set as the second label.
- HOURS: the duration in hours of the time travel window for the new dataset. The HOURS value must be an integer expressed in multiples of 24 (48, 72, 96, 120, 144, or 168) between 48 (2 days) and 168 (7 days). 168 hours is the default if this option isn't specified.
- BILLING_MODEL: sets the storage billing model for the dataset. Set the BILLING_MODEL value to PHYSICAL to use physical bytes when calculating storage charges, or to LOGICAL to use logical bytes. LOGICAL is the default.
  When you change a dataset's billing model, it takes 24 hours for the change to take effect.
  Once you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.
- PROJECT_ID: your project ID.
- DATASET_ID: the ID of the dataset that you're creating.
For example, the following command creates a dataset named mydataset with the data location set to US, a default table expiration of 3600 seconds (1 hour), and a description of This is my dataset. Instead of using the `--dataset` flag, the command uses the `-d` shortcut. If you omit `-d` and `--dataset`, the command defaults to creating a dataset.
```
bq --location=US mk -d \
    --default_table_expiration 3600 \
    --description "This is my dataset." \
    mydataset
```
To confirm that the dataset was created, enter the `bq ls` command. Also, you can create a table when you create a new dataset using the following format: `bq mk -t dataset.table`. For more information about creating tables, see Creating a table.
Terraform
Use the `google_bigquery_dataset` resource.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Create a dataset
The following example creates a dataset named mydataset:
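A minimal configuration along these lines (the friendly name, description, location, and expiration values are illustrative):

```hcl
resource "google_bigquery_dataset" "default" {
  dataset_id    = "mydataset"
  friendly_name = "My dataset"
  description   = "This is my dataset"
  location      = "US"

  # Optional: default lifetime for tables in this dataset, in milliseconds.
  default_table_expiration_ms = 3600000
}
```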
When you create a dataset using the `google_bigquery_dataset` resource, it automatically grants access to the dataset to all accounts that are members of project-level basic roles. You can inspect these grants by running the `terraform show` command after creating the dataset and examining the dataset's `access` block.
To grant access to the dataset, we recommend that you use one of the `google_bigquery_iam` resources, as shown in the following example, unless you plan to create authorized objects, such as authorized views, within the dataset. In that case, use the `google_bigquery_dataset_access` resource. Refer to that documentation for examples.
Create a dataset and grant access to it
The following example creates a dataset named mydataset, then uses the `google_bigquery_dataset_iam_policy` resource to grant access to it.
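A sketch of such a configuration (the role and member shown are placeholders; note that an `*_iam_policy` resource is authoritative and replaces any existing dataset-level access):

```hcl
resource "google_bigquery_dataset" "default" {
  dataset_id = "mydataset"
  location   = "US"
}

# Build an IAM policy document and attach it to the dataset.
data "google_iam_policy" "default" {
  binding {
    role    = "roles/bigquery.dataViewer"
    members = ["user:user@example.com"]
  }
}

resource "google_bigquery_dataset_iam_policy" "default" {
  dataset_id  = google_bigquery_dataset.default.dataset_id
  policy_data = data.google_iam_policy.default.policy_data
}
```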
Create a dataset with a customer-managed encryption key
The following example creates a dataset named mydataset, and also uses the `google_kms_crypto_key` and `google_kms_key_ring` resources to specify a Cloud Key Management Service key for the dataset. You must enable the Cloud Key Management Service API before running this example.
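A sketch along these lines (the key ring and key names are placeholders, and the key ring's location must be compatible with the dataset's location):

```hcl
resource "google_kms_key_ring" "default" {
  name     = "example-keyring"
  location = "us"
}

resource "google_kms_crypto_key" "default" {
  name     = "example-key"
  key_ring = google_kms_key_ring.default.id
}

resource "google_bigquery_dataset" "default" {
  dataset_id = "mydataset"
  location   = "US"

  # Encrypt new tables in this dataset with the CMEK by default.
  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.default.id
  }
}
```

The BigQuery service account must also hold the Cloud KMS CryptoKey Encrypter/Decrypter role on the key before tables can be created in the dataset.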
To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell
- Launch Cloud Shell.
- Set the default Google Cloud project where you want to apply your Terraform configurations. You only need to run this command once per project, and you can run it in any directory:

  export GOOGLE_CLOUD_PROJECT=PROJECT_ID

  Environment variables are overridden if you set explicit values in the Terraform configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also called a root module).

- In Cloud Shell, create a directory and a new file within that directory. The filename must have the `.tf` extension, for example `main.tf`. In this tutorial, the file is referred to as `main.tf`.

  mkdir DIRECTORY && cd DIRECTORY && touch main.tf
- If you are following a tutorial, you can copy the sample code in each section or step. Copy the sample code into the newly created `main.tf`. Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.
- Review and modify the sample parameters to apply to your environment.
- Save your changes.
- Initialize Terraform. You only need to do this once per directory:

  terraform init

  Optionally, to use the latest Google provider version, include the `-upgrade` option:

  terraform init -upgrade
Apply the changes
- Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:

  terraform plan

  Make corrections to the configuration as necessary.
- Apply the Terraform configuration by running the following command and entering yes at the prompt:

  terraform apply

  Wait until Terraform displays the "Apply complete!" message.
- Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Dataset security
To control access to datasets in BigQuery, see Controlling access to datasets.
For information about data encryption, see Encryption at rest.