Export data as Protobuf columns
This document describes how you can export BigQuery data as Protocol
Buffers (Protobuf) columns by using BigQuery user-defined
functions (UDFs).
When to use Protobuf columns
BigQuery offers a number of built-in functions to format selected
data. One option is to merge multiple column values into a single Protobuf value,
which has the following benefits:
- Object type safety.
- Improved compression and reduced data transfer time and cost compared with JSON.
- Flexibility, as most programming languages have libraries to handle Protobuf.
- Less overhead when reading from multiple columns and building a single object.
While other column types can also provide type safety, using Protobuf columns
provides a fully typed object, which can reduce the amount of work that needs to
be done on the application layer or on another part of the pipeline.
However, there are limitations to exporting BigQuery data as
Protobuf columns:
- Protobuf columns are not well indexed or filtered. Searching by the content of
the Protobuf columns can be less effective.
- Sorting data in Protobuf format can be difficult.
If these limitations apply to your export workflow, you might consider other
methods of exporting BigQuery data:
- Use Dataflow to export BigQuery data in either Avro or CSV file formats.
- Use scheduled queries to sort your exported BigQuery data by date or time, and to schedule exports on a recurring basis (see the example query after this list).
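For example, a scheduled query can run an EXPORT DATA statement that writes sorted results to Cloud Storage on a recurring basis. The following is a minimal sketch; the mydataset.orders table, the order_date column, and the example-bucket bucket are hypothetical placeholders for your own names:

EXPORT DATA
  OPTIONS (
    uri = 'gs://example-bucket/exports/orders-*.avro',
    format = 'AVRO',
    overwrite = true)
AS
  -- Sort the exported rows by date before writing them to Cloud Storage.
  SELECT order_id, customer_id, order_date
  FROM `mydataset.orders`
  ORDER BY order_date;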
Required roles
To get the permissions that you need to export BigQuery data as Protobuf columns, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.
Create a UDF
The following procedure shows how you can create a user-defined function that converts a BigQuery STRUCT data type into a Protobuf column:
In a command line, clone the bigquery-utils.git repository:
$ git clone https://github.com/GoogleCloudPlatform/bigquery-utils.git
Navigate into the Protobuf export folder:
$ cd bigquery-utils/tools/protobuf_export
Add your proto files to the ./protos folder.
Install the necessary packages from the GitHub repository:
$ npm install
Bundle the package by using webpack:
$ npx webpack --config webpack.config.js --stats-error-details
Locate the pbwrapper.js file in your dist/ folder and copy the file to a Cloud Storage bucket.
Create a UDF that builds a Protobuf column from your existing BigQuery columns:

CREATE FUNCTION
  DATASET_ID.toMyProtoMessage(input STRUCT<field_1 TYPE1, field_2 TYPE2>)
  RETURNS BYTES
  LANGUAGE js
  OPTIONS ( library=["gs://BUCKET_NAME/pbwrapper.js"] )
AS r"""
let message = pbwrapper.setup("PROTO_PACKAGE.PROTO_MESSAGE")
return pbwrapper.parse(message, input)
""";
Replace the following:
- DATASET_ID: the ID of the dataset where you store your function
- BUCKET_NAME: the name of your Cloud Storage bucket
- PROTO_PACKAGE: the package name for your proto file
- PROTO_MESSAGE: the message type for your proto file

For more information about using packages in the proto language, see Packages.
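For example, the following statement creates the function in a hypothetical dataset named mydataset, reads pbwrapper.js from a hypothetical bucket named example-bucket, and assumes that your proto file defines a package mypackage with a message MyProtoMessage whose field names and types match the STRUCT fields (item_name and item_count here); adjust all of these names to match your own schema:

CREATE FUNCTION
  mydataset.toMyProtoMessage(input STRUCT<item_name STRING, item_count INT64>)
  RETURNS BYTES
  LANGUAGE js
  OPTIONS ( library=["gs://example-bucket/pbwrapper.js"] )
AS r"""
// Look up the hypothetical mypackage.MyProtoMessage type bundled into pbwrapper.js,
// then serialize the input STRUCT into its binary Protobuf encoding.
let message = pbwrapper.setup("mypackage.MyProtoMessage")
return pbwrapper.parse(message, input)
""";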
After you create the UDF, run the function:

SELECT
  DATASET_ID.toMyProtoMessage(STRUCT(COLUMN_TYPE1, COLUMN_TYPE2)) AS protoResult
FROM
  `DATASET_NAME`
LIMIT
  100;
Replace the following:
- DATASET_ID: the ID of the dataset where you store your function
- DATASET_NAME: the name of your dataset and table, for example, dataset_name.table_name
- COLUMN_TYPE1: the name of a column. Columns can include any supported scalar value type or non-scalar type, including ARRAY and STRUCT.
- COLUMN_TYPE2: the name of a column. Columns can include any supported scalar value type or non-scalar type, including ARRAY and STRUCT.
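Continuing the hypothetical example from the previous step, the following query builds one Protobuf value per row from the item_name and item_count columns of an assumed mydataset.items table:

SELECT
  -- Pack the two columns into a STRUCT and serialize it with the UDF.
  mydataset.toMyProtoMessage(STRUCT(item_name, item_count)) AS protoResult
FROM
  `mydataset.items`
LIMIT
  100;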
Work with Protobuf values
With your BigQuery data exported in the Protobuf format, you can
now work with your data as a fully typed object or struct.
The following code samples show several ways to process or work with your exported data: