This document describes how to stream data into BigQuery by using the legacy tabledata.insertAll method.

For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. If you are migrating an existing project from the tabledata.insertAll method to the Storage Write API, we recommend selecting the default stream. The tabledata.insertAll method is still fully supported.
Before you begin
- Ensure that you have write access to the dataset that contains your destination table. The table must exist before you begin writing data to it, unless you are using template tables. For more information on template tables, see Creating tables automatically using template tables.
- Check the quota policy for streaming data.
- Make sure that billing is enabled for your Google Cloud project. Streaming is not available through the free tier. If you attempt to use streaming without enabling billing, you receive the following error: BigQuery: Streaming insert is not allowed in the free tier.
- Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.
Required permissions
To stream data into BigQuery, you need the following IAM permissions:

- bigquery.tables.updateData (lets you insert data into the table)
- bigquery.tables.get (lets you obtain table metadata)
- bigquery.datasets.get (lets you obtain dataset metadata)
- bigquery.tables.create (required if you use a template table to create the table automatically)

Each of the following predefined IAM roles includes the permissions that you need in order to stream data into BigQuery:

- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin

For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Stream data into BigQuery
You don't need to populate the insertId field when you insert rows. Omitting the insertId field for each row is the recommended way to insert data; see Disabling best effort de-duplication.
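As a minimal sketch, the REST request body for tabledata.insertAll wraps each row in a `json` field and takes an optional `insertId`; leaving `insertId` out entirely skips best effort de-duplication. The row contents and field names below (other than `rows`, `json`, and `insertId`, which come from the API) are illustrative.

```python
import json

# Hypothetical rows to stream. In a real request, this body is POSTed to the
# tabledata.insertAll endpoint for the destination table.
rows = [
    {"name": "alice", "score": 91},
    {"name": "bob", "score": 87},
]

# Omit the optional "insertId" key entirely so that no de-duplication ID
# is sent for any row.
body = {"rows": [{"json": row} for row in rows]}

print(json.dumps(body))
```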
Send date and time data
For date and time fields, format the data in the tabledata.insertAll method as follows:

| Type | Format |
| --- | --- |
| DATE | A string in the form "YYYY-MM-DD" |
| DATETIME | A string in the form "YYYY-MM-DD [HH:MM:SS]" |
| TIME | A string in the form "HH:MM:SS" |
| TIMESTAMP | The number of seconds since 1970-01-01 (the Unix epoch), or a string in the form "YYYY-MM-DD HH:MM[:SS]" |
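The formats above can be produced from standard Python date and time values; a small sketch, where the helper names and field names are illustrative rather than part of the API:

```python
import datetime

# Hypothetical helpers that format Python values into the string forms
# that tabledata.insertAll expects for each type.
def fmt_date(d: datetime.date) -> str:
    return d.strftime("%Y-%m-%d")            # DATE: "YYYY-MM-DD"

def fmt_datetime(dt: datetime.datetime) -> str:
    return dt.strftime("%Y-%m-%d %H:%M:%S")  # DATETIME: "YYYY-MM-DD HH:MM:SS"

def fmt_time(t: datetime.time) -> str:
    return t.strftime("%H:%M:%S")            # TIME: "HH:MM:SS"

row = {
    "d": fmt_date(datetime.date(2021, 3, 1)),
    "dt": fmt_datetime(datetime.datetime(2021, 3, 1, 12, 30, 0)),
    "t": fmt_time(datetime.time(12, 30, 0)),
    # TIMESTAMP can alternatively be sent as seconds since the Unix epoch.
    "ts": datetime.datetime(
        2021, 3, 1, 12, 30, tzinfo=datetime.timezone.utc
    ).timestamp(),
}
```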
Send range data
For fields with type RANGE<T>, format the data in the tabledata.insertAll method as a JSON object with two fields, start and end. Null values for the start and end fields represent unbounded boundaries. These fields must use the supported JSON format of type T, where T can be one of DATE, DATETIME, or TIMESTAMP.

In the following example, the f_range_date field represents a RANGE<DATE> column in a table. A row is inserted into this column using the tabledata.insertAll API.
{
"f_range_date": {
"start": "1970-01-02",
"end": null
}
}
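A row value like the one above can be built with a small helper that maps Python None to JSON null for an unbounded boundary; the helper name is illustrative, while the start/end field names come from the API.

```python
# Hypothetical helper that builds a RANGE value for insertAll. Passing
# None for start or end produces a JSON null, meaning "unbounded".
def range_value(start=None, end=None):
    return {"start": start, "end": end}

# RANGE<DATE> with a bounded start and an unbounded end.
row = {"f_range_date": range_value(start="1970-01-02")}
```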
Stream data availability
Data is available for real-time analysis using GoogleSQL queries immediately after BigQuery successfully acknowledges a tabledata.insertAll request.

Rows recently streamed to an ingestion-time partitioned table temporarily have a NULL value for the _PARTITIONTIME pseudocolumn. For such rows, BigQuery assigns the final non-NULL value of the _PARTITIONTIME column in the background, typically within a few minutes. In rare cases, this can take up to 90 minutes.

Some recently streamed rows might not be available for table copy, typically for a few minutes. In rare cases, this can take up to 90 minutes. To see whether data is available for table copy, check the tables.get response for a section named streamingBuffer. If the streamingBuffer section is absent, your data is available for copy. You can also use the streamingBuffer.oldestEntryTime field to identify the age of records in the streaming buffer.
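That check can be sketched as a function over the tables.get response, shown here as a plain dict; the streamingBuffer and oldestEntryTime field names come from the API, while the sample values are illustrative.

```python
# Decide whether a table's streamed data is fully available for copy by
# inspecting a tables.get response (represented here as a dict).
def copy_safe(table_resource: dict) -> bool:
    # The streamingBuffer section is present only while data is buffered.
    return "streamingBuffer" not in table_resource

# Illustrative responses: one table with buffered data, one fully flushed.
buffered = {"streamingBuffer": {"oldestEntryTime": "1614556800000"}}
drained = {"id": "mydataset.mytable"}
```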
Best effort de-duplication
When you supply insertId for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same insertId more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.

The system expects that rows provided with identical insertIds are also identical. If two rows have identical insertIds, it is nondeterministic which row BigQuery preserves.

De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery, or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see Troubleshoot streaming inserts.
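One way to guarantee that a retry carries the same insertId is to derive the ID deterministically from the row content rather than generating it at send time. This is a sketch of one possible scheme, not an API feature; note that it also assigns identical IDs to genuinely identical rows, which is consistent with the expectation above.

```python
import hashlib
import json

def row_insert_id(row: dict) -> str:
    # Derive a stable ID from the row content so that rebuilding the
    # request for a retry yields the same insertId per row.
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

rows = [{"name": "alice", "score": 91}]

body = {"rows": [{"insertId": row_insert_id(r), "json": r} for r in rows]}
# A retry rebuilds the body and gets identical insertIds.
retry_body = {"rows": [{"insertId": row_insert_id(r), "json": r} for r in rows]}
```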
De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.

If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.
Disabling best effort de-duplication
You can disable best effort de-duplication by not populating the insertId field for each inserted row. This is the recommended way to insert data.

Apache Beam and Dataflow

To disable best effort de-duplication when you use Apache Beam's BigQuery I/O connector for Java, use the ignoreInsertIds() method.
Manually removing duplicates
To ensure that no duplicate rows exist after you are done streaming, use the following manual process:

- Add the insertId as a column in your table schema and include the insertId value in the data for each row.
- After streaming has stopped, run the following query to check for duplicates:

#standardSQL
SELECT
  MAX(count)
FROM (
  SELECT
    ID_COLUMN,
    COUNT(*) AS count
  FROM
    `TABLE_NAME`
  GROUP BY
    ID_COLUMN)

If the result is greater than 1, duplicates exist.
- To remove duplicates, run the following query. Specify a destination table, allow large results, and disable result flattening.

#standardSQL
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY ID_COLUMN) row_number
  FROM
    `TABLE_NAME`)
WHERE
  row_number = 1
Notes about the duplicate removal query:
- The safer strategy for the duplicate removal query is to target a new table. Alternatively, you can target the source table with write disposition WRITE_TRUNCATE.
- The duplicate removal query adds a row_number column with the value 1 to the end of the table schema. The query uses a SELECT * EXCEPT statement from GoogleSQL to exclude the row_number column from the destination table. The #standardSQL prefix enables GoogleSQL for this query. Alternatively, you can select by specific column names to omit this column.
- To query live data with duplicates removed, you can also create a view over your table using the duplicate removal query. Be aware that query costs against the view are calculated based on the columns selected in your view, which can result in large bytes scanned sizes.
Stream into time-partitioned tables
When you stream data to a time-partitioned table, each partition has a streaming buffer. The streaming buffer is retained when you perform a load, query, or copy job that overwrites a partition by setting the writeDisposition property to WRITE_TRUNCATE. If you want to remove the streaming buffer, verify that the streaming buffer is empty by calling tables.get on the partition.
Ingestion-time partitioning
When you stream to an ingestion-time partitioned table, BigQuery infers the destination partition from the current UTC time.

Newly arriving data is temporarily placed in the __UNPARTITIONED__ partition while it's in the streaming buffer. When there's enough unpartitioned data, BigQuery moves the data into the correct partition. However, there is no SLA for how long it takes for data to move out of the __UNPARTITIONED__ partition. A query can exclude data in the streaming buffer by filtering out the NULL values from the __UNPARTITIONED__ partition, using one of the pseudocolumns (_PARTITIONTIME or _PARTITIONDATE, depending on your preferred data type).
If you are streaming data into a daily partitioned table, you can override the date inference by supplying a partition decorator as part of the insertAll request. Include the decorator in the tableId parameter. For example, you can stream to the partition corresponding to 2021-03-01 for table table1 by using the partition decorator:

table1$20210301
When streaming using a partition decorator, you can stream to partitions within the last 31 days and up to 16 days in the future, relative to the current date based on current UTC time. To write to partitions for dates outside these bounds, use a load or query job instead, as described in Appending to and overwriting partitioned table data.

Streaming using a partition decorator is only supported for daily partitioned tables. It is not supported for hourly, monthly, or yearly partitioned tables.
For testing, you can use the bq command-line tool's bq insert command. For example, the following command streams a single row to the partition for the date January 1, 2017 ($20170101) in a partitioned table named mydataset.mytable:
echo '{"a":1, "b":2}' | bq insert 'mydataset.mytable$20170101'
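When building the decorated tableId programmatically, the decorator is just the table name plus `$` plus the partition date in YYYYMMDD form; a sketch, with an illustrative helper name:

```python
import datetime

def partitioned_table_id(table: str, day: datetime.date) -> str:
    # Append the $YYYYMMDD partition decorator to the tableId, targeting
    # the daily partition for the given date.
    return f"{table}${day:%Y%m%d}"

tid = partitioned_table_id("table1", datetime.date(2021, 3, 1))
```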
Time-unit column partitioning
You can stream data into a table partitioned on a DATE, DATETIME, or TIMESTAMP column if the column value is between 5 years in the past and 1 year in the future. Data outside this range is rejected.

When the data is streamed, it is initially placed in the __UNPARTITIONED__ partition. When there's enough unpartitioned data, BigQuery automatically repartitions the data, placing it into the appropriate partition. However, there is no SLA for how long it takes for data to move out of the __UNPARTITIONED__ partition.
- Note: Daily partitions are processed differently than hourly, monthly, and yearly partitions. For daily partitioned tables, only data outside of the date range (the last 7 days to the next 3 days) is extracted to the __UNPARTITIONED__ partition while it waits to be repartitioned. For hourly partitioned tables, on the other hand, data is always extracted to the __UNPARTITIONED__ partition and later repartitioned.
Create tables automatically using template tables
Template tables provide a mechanism to split a logical table into many smaller tables to create smaller sets of data (for example, by user ID). Template tables have a number of limitations described below. Instead, partitioned tables and clustered tables are the recommended ways to achieve this behavior.

To use a template table through the BigQuery API, add a templateSuffix parameter to your insertAll request. For the bq command-line tool, add the template_suffix flag to your insert command. If BigQuery detects a templateSuffix parameter or the template_suffix flag, it treats the targeted table as a base template. It creates a new table that shares the same schema as the targeted table and has a name that includes the specified suffix:

<targeted_table_name> + <templateSuffix>

By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset.
Tables created by using template tables are usually available within a few seconds.
On rare occasions, they may take longer to become available.
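A sketch of an insertAll request body that targets a template table: the templateSuffix field name comes from the REST API, while the base table name, suffix value, and row contents are illustrative. The generated table name is the base name with the suffix appended directly.

```python
# Illustrative insertAll body using a template table. BigQuery creates the
# generated table (if needed) with the base table's schema.
body = {
    "templateSuffix": "_user_1234",
    "rows": [{"json": {"event": "click"}}],
}

# With a base table named "events", writes land in this generated table:
target = "events" + body["templateSuffix"]
```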
Change the template table schema
If you change a template table schema, all tables that are generated subsequently use the updated schema. Previously generated tables are not affected, unless the existing table still has a streaming buffer.

For existing tables that still have a streaming buffer, if you modify the template table schema in a backward compatible way, the schema of those actively streamed generated tables is also updated. However, if you modify the template table schema in a non-backward compatible way, any buffered data that uses the old schema is lost. Additionally, you cannot stream new data to existing generated tables that use the old, now incompatible, schema.

After you change a template table schema, wait until the changes have propagated before you try to insert new data or query the generated tables. Requests to insert new fields should succeed within a few minutes. Attempts to query the new fields might require a longer wait of up to 90 minutes.

If you want to change a generated table's schema, do not change the schema until streaming through the template table has ceased and the generated table's streaming statistics section is absent from the tables.get() response, which indicates that no data is buffered on the table.

Partitioned tables and clustered tables do not suffer from the preceding limitations and are the recommended mechanism.
Template table details
- Template suffix value: The templateSuffix (or --template_suffix) value must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum combined length of the table name and the table suffix is 1024 characters.
- Quota: Template tables are subject to streaming quota limitations. Your project can create up to 10 tables per second with template tables, similar to the tables.insert API. This quota applies only to tables being created, not to tables being modified. If your application needs to create more than 10 tables per second, we recommend using clustered tables. For example, you can put the high-cardinality table ID into the key column of a single clustered table.
- Time to live: The generated table inherits its expiration time from the dataset. As with normal streaming data, generated tables cannot be copied immediately.
- Deduplication: Deduplication only happens between uniform references to a destination table. For example, if you simultaneously stream to a generated table using both template tables and a regular insertAll command, no deduplication occurs between rows inserted by template tables and rows inserted by a regular insertAll command.
- Views: The template table and the generated tables should not be views.
Troubleshoot streaming inserts
The following sections discuss how to troubleshoot errors that occur when you stream data into BigQuery using the legacy streaming API. For more information on how to resolve quota errors for streaming inserts, see Streaming insert quota errors.
Failure HTTP response codes
If you receive a failure HTTP response code such as a network error, there's no way to tell whether the streaming insert succeeded. If you try to re-send the request, you might end up with duplicated rows in your table. To help protect your table against duplication, set the insertId property when sending your request. BigQuery uses the insertId property for de-duplication.

If you receive a permission error, an invalid table name error, or an exceeded quota error, no rows are inserted and the entire request fails.
Success HTTP response codes
Even if you receive a success HTTP response code, you need to check the insertErrors property of the response to determine whether the row insertions were successful, because it's possible that BigQuery was only partially successful at inserting the rows. You might encounter one of the following scenarios:

- All rows inserted successfully. If the insertErrors property is an empty list, all of the rows were inserted successfully.
- Some rows inserted successfully. Except in cases where there is a schema mismatch in any of the rows, rows indicated in the insertErrors property are not inserted, and all other rows are inserted successfully. The errors property contains detailed information about why each unsuccessful row failed. The index property indicates the 0-based row index of the request that the error applies to.
- None of the rows inserted successfully. If BigQuery encounters a schema mismatch on individual rows in the request, none of the rows are inserted and an insertErrors entry is returned for each row, even the rows that did not have a schema mismatch. Rows that did not have a schema mismatch have an error with the reason property set to stopped, and can be re-sent as-is. Rows that failed include detailed information about the schema mismatch. To learn about the supported protocol buffer types for each BigQuery data type, see Data type conversions.
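This handling can be sketched as splitting the insertErrors entries into rows to re-send as-is (those whose only error reason is `stopped`) and rows that genuinely failed. The insertErrors, errors, reason, and index field names come from the API response; the sample response values are illustrative.

```python
# Illustrative partially failed insertAll response: row 0 had a schema
# mismatch, so row 1 was stopped and can be re-sent as-is.
response = {
    "insertErrors": [
        {"index": 0, "errors": [{"reason": "invalid",
                                 "message": "no such field"}]},
        {"index": 1, "errors": [{"reason": "stopped"}]},
    ]
}

def classify(insert_errors):
    # Return (row indexes safe to re-send as-is, row indexes that failed).
    resend, failed = [], []
    for entry in insert_errors:
        reasons = {e["reason"] for e in entry["errors"]}
        (resend if reasons == {"stopped"} else failed).append(entry["index"])
    return resend, failed

resend, failed = classify(response.get("insertErrors", []))
```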
Because BigQuery's streaming API is designed for high insertion rates, modifications to the underlying table metadata are eventually consistent when interacting with the streaming system. Most of the time, metadata changes are propagated within minutes, but during this period API responses might reflect the inconsistent state of the table. Some scenarios include:

- Schema changes. Modifying the schema of a table that has recently received streaming inserts can cause responses with schema mismatch errors, because the streaming system might not immediately pick up the schema change.
- Table creation and deletion. Streaming to a nonexistent table returns a variation of a notFound response. A table created in response might not immediately be recognized by subsequent streaming inserts. Similarly, deleting or recreating a table can create a period of time during which streaming inserts are effectively delivered to the old table. The streaming inserts might not be present in the new table.
- Table truncation. Truncating a table's data (for example, by using a query job with a writeDisposition of WRITE_TRUNCATE) can similarly cause subsequent inserts during the consistency period to be dropped.
Missing/Unavailable data
Streaming inserts reside temporarily in the write-optimized storage, which has different availability characteristics than managed storage. Certain operations in BigQuery do not interact with the write-optimized storage, such as table copy jobs and API methods like tabledata.list. Recent streaming data won't be present in the destination table or output.