Introduction to loading data
This document provides an overview of loading data into BigQuery.
Overview
There are several ways to ingest data into BigQuery:
- Batch load a set of data records.
- Stream individual records or batches of records.
- Use queries to generate new data and append or overwrite the results to a
table.
- Use a third-party application or service.
Batch loading
With batch loading, you load the source data into a BigQuery
table in a single batch operation. For example, the data source could be a CSV
file, an external database, or a set of log files. Traditional extract,
transform, and load (ETL) jobs fall into this category.
Options for batch loading in BigQuery include the following:
- Load jobs.
Load data from Cloud Storage or from a local file by
creating a
load job
. The records can be
in Avro, CSV, JSON, ORC, or Parquet format.
- SQL
. The
LOAD DATA
SQL statement loads data from one or more files into a new or existing table.
You can use the
LOAD DATA
statement to load Avro, CSV, JSON, ORC, or Parquet
files.
- BigQuery Data Transfer Service.
Use
BigQuery Data Transfer Service
to automate loading data from Google Software as a Service (SaaS) apps or from
third-party applications and services.
- BigQuery Storage Write API.
The
Storage Write API
lets you
batch-process an arbitrarily large number of records and commit them in a
single atomic operation. If the commit operation fails, you can safely retry
the operation. Unlike BigQuery load jobs, the
Storage Write API does not require staging the data to
intermediate storage such as Cloud Storage.
- Other managed services.
Use other managed services to export data from an
external data store and import it into BigQuery. For example,
you can load data from
Firestore exports
.
When choosing a batch load method, most file-based patterns should use either
load job
or
LOAD DATA
SQL statement
to batch load data. Other services should
generally use BigQuery Data Transfer Service or
export data from Google services
.
Batch loading can be done as a one-time operation or on a recurring schedule.
For example, you can do the following:
- You can run BigQuery Data Transfer Service transfers on a schedule.
- You can use an orchestration service such as Cloud Composer to schedule
load jobs.
- You can use a cron job to load data on a schedule.
Streaming
With streaming, you continually send smaller batches of data in real time, so
the data is available for querying as it arrives. Options for streaming in
BigQuery include the following:
- Storage Write API.
The
Storage Write API
supports high-throughput streaming ingestion with exactly-once delivery
semantics.
- Dataflow.
Use
Dataflow
with the
Apache Beam SDK to set up a streaming pipeline that writes to
BigQuery. For more information, see
BigQuery I/O connector
in the Apache Beam documentation and the
Stream from Pub/Sub to BigQuery tutorial
.
- Datastream.
Datastream
uses BigQuery change data capture functionality and the
Storage Write API
to replicate data and
schema updates from operational databases directly into BigQuery.
Follow this
quickstart
for an example of replicating from a Cloud SQL for PostgreSQL database into
BigQuery.
- BigQuery Connector for SAP.
The BigQuery
Connector for SAP enables near real time replication of SAP data directly into
BigQuery. For more information, see the
BigQuery Connector for SAP planning guide
.
- Pub/Sub.
Pub/Sub
is a messaging
service you can use to coordinate streaming analytics and data integration pipelines.
You can use
BigQuery subscriptions
to
write messages directly to an existing BigQuery table.
Generated data
You can use SQL to generate data and store the results in
BigQuery. Options for generating data include:
Use
data manipulation language
(DML) statements to perform bulk inserts into an existing table or store query
results in a new table.
Use a
CREATE TABLE ... AS
statement to create a new table from a query result.
Run a query and save the results to a table. You can append the results to an
existing table or write to a new table. For more information, see
Writing query results
.
Third-party applications
Some third-party applications and services provide
connectors that can ingest data into BigQuery. The details of how
to configure and manage the ingestion pipeline depend on the application.
For example, to load data from external sources to BigQuery's
storage, you can use Informatica Data Loader or Fivetran Data Pipelines. For more information, see
Load data using a third-party application
.
Choosing a data ingestion method
Here are some considerations to think about when you choose a data ingestion
method.
Data source.
The source of the data or the data format can determine whether
batch loading or streaming is simpler to implement and maintain. Consider the
following points:
If the BigQuery Data Transfer Service supports the data source, transferring the data
directly into BigQuery might be the simplest solution to
implement.
For data from third-party sources that aren't supported by the
BigQuery Data Transfer Service, transform the data into a format supported
by
batch loading
and use that method
instead.
If your data comes from Spark or Hadoop, consider using
BigQuery connectors
to simplify data ingestion.
For local files, consider batch load jobs, especially if
BigQuery supports the file format without requiring a
transformation or data cleansing step.
For application data such as application events or a log stream, it might be
easier to stream the data in real time, rather than implement batch
loading.
Slow-changing versus fast-changing data.
If you need to ingest and analyze
data in near real time, consider streaming the data. With streaming, the data is
available for querying as soon as each record arrives. Avoid using DML
statements to submit large numbers of individual row updates or insertions. For
frequently updated data, it's often better to stream a change log and use a view
to obtain the latest results. Another option is to use Cloud SQL as your
online transaction processing (OLTP) database and use federated queries to join
the data in BigQuery.
If your source data changes slowly or you don't need continuously updated
results, consider using a load job. For example, if you use the data to run a
daily or hourly report, load jobs can be less expensive and can use fewer system
resources.
Another scenario is data that arrives infrequently or in response to an event.
In that case, consider using Dataflow to stream the data or use
Cloud Functions to call the streaming API in response to a trigger.
Reliability of the solution
. BigQuery has a
Service Level
Agreement
(SLA). However, you also need to consider the
reliability of the particular solution that you implement. Consider the
following points:
- With loosely typed formats such as JSON or CSV, bad data can make an entire
load job fail. Consider whether you need a data cleansing step before
loading, and consider how to respond to errors. Also consider using a
strongly typed format such as Avro, ORC, or Parquet.
- Periodic load jobs require scheduling, using Cloud Composer, cron, or
another tool. The scheduling component could be a failure point in the
solution.
- With streaming, you can check the success of each record and quickly report
an error. Consider writing failed messages to an unprocessed messages queue
for later analysis and processing. For more information about
BigQuery streaming errors, see
Troubleshooting streaming inserts
.
- Streaming and load jobs are subject to
quotas
. For
information about how to handle quota errors, see
Troubleshooting BigQuery quota errors
.
- Third-party solutions might differ in configurability, reliability, ordering
guarantees, and other factors, so consider these before adopting a solution.
Latency.
Consider how much data you load and how soon you need the data to
be available. Streaming offers the lowest latency of data being available for
analysis. Periodic load jobs have a higher latency, because new data is only
available after each load job finishes.
Load jobs use a shared pool of
slots
by default. A load
job might wait in a pending state until slots are available, especially if you
load a very large amount of data. If that creates unacceptable wait times, you
can purchase dedicated slots, instead of using the shared slot pool. For more
information, see
Introduction to Reservations
.
Query performance for external data sources might not be as high as query
performance for data stored in BigQuery. If minimizing query
latency is important, then we recommend loading the data into
BigQuery.
Data ingestion format
. Choose a data ingestion format based on the
following factors:
Schema support.
Avro, ORC, Parquet, and Firestore exports are
self-describing formats. BigQuery creates the table schema
automatically based on the source data. For JSON and CSV data, you can
provide an explicit schema, or you can use
schema auto-detection
.
Flat data or nested and repeated fields.
Avro, CSV, JSON, ORC, and
Parquet all support flat data. Avro, JSON, ORC, Parquet, and
Firestore exports also support data with nested and repeated
fields. Nested and repeated data is useful for expressing hierarchical data.
Nested and repeated fields also reduce data duplication when
loading the data
.
Embedded newlines.
When you are loading data from JSON files, the rows
must be newline delimited. BigQuery expects newline-delimited
JSON files to contain a single record per line.
Encoding.
BigQuery supports UTF-8 encoding for both
nested or repeated and flat data. BigQuery supports
ISO-8859-1 encoding for flat data only for CSV files.
Load nested and repeated data
You can load data into nested and repeated fields in the following data formats:
- Avro
- JSON (newline delimited)
- ORC
- Parquet
- Datastore exports
- Firestore exports
For information about specifying nested and repeated fields in your schema
when you're loading data, see
Specifying nested and repeated fields
.
Load data from other Google services
Some Google services export data to BigQuery using
scheduled queries, exports, or transfers. For more information about services
that support exports to BigQuery, see
Load data from Google services
.
Other Google services support data exports initiated from BigQuery Data Transfer Service.
For more information about services that support exports initiated by
BigQuery Data Transfer Service, see
BigQuery Data Transfer Service
.
Quota
For information about quotas, see the following sections:
Alternatives to loading data
You don't need to load data before running queries in the following situations:
- Public datasets
- Public datasets are datasets stored in BigQuery and shared with
the public. For more information, see
BigQuery public datasets
.
- Shared datasets
- You can share datasets stored in BigQuery. If someone has
shared a dataset with you, you can run queries on that dataset without loading
the data.
- External data sources
- BigQuery can run queries on certain forms of external data,
without loading the data into BigQuery storage. This approach
lets you take advantage of the analytic capabilities of
BigQuery without moving data that is stored elsewhere. For
information about the benefits and limitations of this approach, see
external data sources
.
- Logging files
- Cloud Logging provides an option to export log files into
BigQuery. See
Configure and manage sinks
for more information.
Monitor usage of load jobs
You can monitor usage of load jobs using the following two ways:
Use Cloud Monitoring.
For more information, see
BigQuery metrics
.
Specifically, you can monitor the amount of data and number of rows uploaded to
a specific table. If your load jobs upload data to a specific table, this can
be a proxy for monitoring load job upload data usage.
Use
INFORMATION_SCHEMA.JOBS_BY_PROJECT
.
You can use the
INFORMATION_SCHEMA.JOBS_BY_PROJECT
view to
get the number of load jobs per table per day
.
Example use case
The following examples explain the methods to use based on your use case and how
to use them with other data analytics solutions.
Stream data using Storage Write API
Suppose that there is a pipeline processing event data from endpoint logs.
Events are generated continuously and need to be available for querying in
BigQuery as soon as possible. As data freshness is paramount for
this use case, the
Storage Write API
is the best choice to ingest data into BigQuery. A
recommended architecture
to keep these endpoints lean is sending events to Pub/Sub, from
where they are consumed by a streaming Dataflow pipeline which
directly streams to BigQuery.
A primary reliability concern for this architecture is how to deal with failing
to insert a record into BigQuery. If each record is important and
cannot be lost, data needs to be buffered before attempting to insert. In the
recommended architecture above, Pub/Sub can play the role of a
buffer with its message retention capabilities. The Dataflow
pipeline should be configured to retry BigQuery streaming inserts
with
truncated exponential backoff
.
After the capacity of Pub/Sub
as a buffer is exhausted, for example in the case of prolonged unavailability of
BigQuery or a network failure, data needs to be persisted on the
client and the client needs a mechanism to resume inserting persisted records
once availability is restored. For more information about how to handle this
situation, see the
Google Pub/Sub Reliability Guide
blog post.
Another failure case to handle is that of a
poison record
. A poison record is
either a record rejected by BigQuery because the record fails to
insert with a non-retryable error or a record that has not been successfully
inserted after the maximum number of retries. Both types of records should be
stored in a
"
dead letter queue
"
by the Dataflow pipeline for further investigation.
If exactly-once semantics are required, create a write stream in
committed type
,
with record offsets provided by the client. This avoids duplicates, as the write
operation is only performed if the offset value matches the next append offset.
Not providing an offset means records are appended to the current end of the
stream and retrying a failed append could result in the record appearing more
than once in the stream.
If exactly-once guarantees are not required,
writing to the default stream
allows for a higher throughput and also does not count against the
quota limit
on creating write streams.
Estimate the throughput of your network
and ensure in advance that you have an adequate quota to serve the throughput.
If your workload is generating or processing data at a very uneven rate, then
try to smooth out any load spikes on the client and stream into
BigQuery with a constant throughput. This can simplify your
capacity planning. If that is not possible, ensure you are prepared to handle
429
(resource exhausted) errors if and when your throughput goes over quota
during short spikes.
Batch data processing
Suppose there is a nightly batch processing pipeline that needs to be
completed by a fixed deadline. Data needs to be available by this deadline for
further processing by another batch process to generate reports to be sent to a
regulator. This use case is common in regulated industries such as finance.
Batch loading of data with load jobs
is the right approach for this use case because latency is not a concern
provided the deadline can be met. Ensure your Cloud Storage buckets
meet the location requirements
for loading data into the BigQuery dataset.
The result of a BigQuery load job is atomic; either all records
get inserted or none do. As a best practice, when inserting all data in a single
load job, create a new table by using the
WRITE_TRUNCATE
disposition of
the
JobConfigurationLoad
resource.
This is important when retrying a failed load job, as the client might
not be able to distinguish between jobs that have failed and the failure
caused by for example in communicating the success state back to the client.
Assuming data to be ingested has been successfully copied to
Cloud Storage already, retrying with exponential backoff is sufficient
to address ingestion failures.
It's recommended that a nightly batch job doesn't hit the
default quota
of 1,500 loads per table per day even with retries. When loading data
incrementally, the default quota is sufficient for running a load job every 5
minutes and have unconsumed quota for at least 1 retry per job on average.
What's next