Manage vector indexes
To provide feedback or request support for this feature, send email to
bq-vector-search@google.com
.
This document describes how to create and manage vector indexes.
A vector index is a data structure designed to let the
VECTOR_SEARCH
function
perform a more efficient
vector search
of embeddings. When
VECTOR_SEARCH
is able to use a vector index, the function
uses the
Approximate Nearest Neighbor
search technique to help improve search performance, with the trade-off of
reducing
recall
and thus returning more approximate results.
Roles and permissions
To create a vector index, you need the
bigquery.tables.createIndex
IAM permission
on the table where you're creating the index. To drop a vector index, you need
the
bigquery.tables.deleteIndex
permission. Each of the following predefined
IAM roles includes the permissions that you need to work with
vector indexes:
- BigQuery Data Owner (
roles/bigquery.dataOwner
)
- BigQuery Data Editor (
roles/bigquery.dataEditor
)
Create a vector index
To create a vector index, use the
CREATE VECTOR INDEX
data definition language (DDL) statement:
Go to the
BigQuery
page.
Go to BigQuery
In the query editor, run the following SQL statement:
CREATE [ OR REPLACE ]
VECTOR
INDEX [ IF NOT EXISTS ]
INDEX_NAME
ON
DATASET_NAME
.
TABLE_NAME
(
COLUMN_NAME
)
STORING(
STORED_COLUMN_NAME
[, ...])
OPTIONS
(index_type =
INDEX_TYPE
,
distance_type =
DISTANCE_TYPE
,
ivf_options = '{"num_lists":
NUM_LISTS
}')
Replace the following:
INDEX_NAME
: the name of the vector index you're
creating. Since the index is always created in the same project and
dataset as the base table, there is no need to specify these in the name.
DATASET_NAME
: the name of the dataset that
contains the table.
TABLE_NAME
: the name of the table that contains
the column with embeddings data.
COLUMN_NAME
: the name of a column that contains
the embeddings data. The column must have a type of
ARRAY<FLOAT64>
. The
column can't have any child fields. All elements in the array must be
non-
NULL
, and all values in the column must have the same array
dimensions.
STORED_COLUMN_NAME
: the name of a top-level column
in the table to store in the vector index. The column type can't be
RANGE
. Stored columns are not used if the table has a row-level access
policy or the column has a policy tag. For information about how to
enable stored columns, see
Store columns and pre-filter
.
INDEX_TYPE
: the algorithm to use to build the
vector index.
IVF
is the only supported value.
Specifying
IVF
builds the vector index as inverted file index (IVF). An
IVF uses a k-means algorithm to cluster the vector data, and then
partitions the vector data based on those clusters. When you use the
VECTOR_SEARCH
function
to search the vector data, it can use these partitions to reduce the
amount of data it needs to read in order to determine a result.
DISTANCE_TYPE
: specifies the default distance type
to use when performing a vector search using this index. The supported
values are
EUCLIDEAN
and
COSINE
.
EUCLIDEAN
is the default.
The index creation itself always uses
EUCLIDEAN
distance for training
but the distance used in the
VECTOR_SEARCH
function can be different.
If you specify a value for the
distance_type
argument of the
VECTOR_SEARCH
function, that value is used instead of the
DISTANCE_TYPE
value.
NUM_LISTS
: an
INT64
value less than or
equal to 5,000 that determines how many lists the IVF algorithm creates.
The IVF algorithm divides the whole data space into a number of lists
equal to
NUM_LISTS
, with data points that
are closer to each other being more likely to be put on the same list. If
NUM_LISTS
is small, you have fewer lists
with more data points, while a larger value creates more lists with fewer
data points.
You can use
NUM_LISTS
in
combination with the
fraction_lists_to_search
argument in the
VECTOR_SEARCH
function to create an efficient vector search. If you have data that is
distributed in many small groups in the embedding space, then specify a
high
NUM_LISTS
to create an index with more
lists and specify a lower
fraction_lists_to_search
value to scan fewer
lists in vector search. Use a lower
NUM_LISTS
and a higher
fraction_lists_to_search
value when your data is distributed in fewer,
larger groups. Using a high
num_lists
value might make the
vector index take longer to build.
If you don't specify
NUM_LISTS
,
BigQuery calculates an appropriate value.
The following example creates a vector index on the
embedding
column
of
my_table
:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>);
CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS(index_type = 'IVF');
The following example creates a vector index on the
embedding
column
of
my_table
, and specifies the distance type to use and the IVF options:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>);
CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS(index_type = 'IVF', distance_type = 'COSINE',
ivf_options = '{"num_lists": 2500}')
Store columns and pre-filter
To further improve the efficiency of your vector index, you can specify columns
from your base table to store in your vector index. Using stored columns can
optimize queries that call the
VECTOR_SEARCH
function in the following ways:
The
VECTOR_SEARCH
function outputs a struct called
base
that contains
all columns from the base table. Without stored columns, a potentially
expensive join is needed to retrieve the columns stored in
base
. If
your query only selects stored columns from
base
, then
BigQuery optimizes your query to eliminate that join.
Instead of searching an entire table, you can call the
VECTOR_SEARCH
function on a query statement that
pre-filters
the base table with a
WHERE
clause. If your table has an index and you filter on only stored
columns, then BigQuery optimizes the query by filtering the
data before searching and then using the index to search the smaller result
set. If you filter on columns that aren't stored, then
BigQuery applies the filter after the table is searched, or
post-filters
.
Post-filtering is less efficient and can cause
fewer than
top_k
matches in the result set. In some cases,
pre-filtering can also reduce the size of the result set. If this happens,
try increasing the value of
fraction_lists_to_search
in your call to
VECTOR_SEARCH
.
To store columns, list them in the
STORING
clause of the
CREATE VECTOR INDEX
DDL statement
.
Storing columns increases the size of the vector index, so it's best
to store only the most frequently used or filtered columns.
The following example creates a vector index with stored columns and then
explains the behavior of different types of vector searches:
-- Create a table that contains an embedding.
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>, type STRING, creation_time DATETIME, id INT64);
-- Create a query table that contains an embedding.
CREATE TABLE my_dataset.my_testdata(embedding ARRAY<FLOAT64>, test_id INT64);
-- Create a vector index with stored columns.
CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
STORING (type, creation_time)
OPTIONS (index_type = 'IVF');
-- Select only stored columns from a vector search to avoid an expensive join.
SELECT query, base.type, distance
FROM
VECTOR_SEARCH(
TABLE my_dataset.my_table,
'embedding'
TABLE my_dataset.my_testdata);
-- Pre-filter on a stored column. The index speeds up the query.
SELECT *
FROM
VECTOR_SEARCH(
(SELECT * FROM my_dataset.my_table WHERE type = 'animal'),
'embedding',
TABLE my_dataset.my_testdata);
-- Filter on a column that isn't stored. The index is used to search the
-- entire table, and then the results are post-filtered. You might see fewer
-- than 5 matches returned for some embeddings.
SELECT query.test_id, base.type, distance
FROM
VECTOR_SEARCH(
(SELECT * FROM my_dataset.my_table WHERE id = 123),
'embedding',
TABLE my_dataset.my_testdata,
top_k => 5);
-- Use post-filters. The index is used, but the entire table is searched and
-- the post-filtering might reduce the number of results.
SELECT query.test_id, base.type, distance
FROM
VECTOR_SEARCH(
TABLE my_dataset.my_table,
'embedding',
TABLE my_dataset.my_testdata,
top_k => 5)
WHERE base.type = 'animal';
-- Use pre-filters with brute force. The data is filtered and then searched
-- with brute force for exact results.
SELECT query.test_id, base.type, distance
FROM
VECTOR_SEARCH(
(SELECT * FROM my_dataset.my_table WHERE id = 123),
'embedding',
TABLE my_dataset.my_testdata,
options => '{"use_brute_force":true}');
Limitations
- You can't use
logical views
in your pre-filter.
- If your pre-filter contains a
subquery
, it might
interfere with index usage.
- If you select a column of type
STRUCT
from the
query
output of a
VECTOR_SEARCH
query on a table that has an index with stored columns, then
the whole query might fail.
Understand index refresh
Vector indexes are fully managed by BigQuery and automatically
refreshed when the indexed table changes. If you delete the indexed column in
a table or rename the table itself, the vector index is deleted automatically.
If you create a vector index on a table that is smaller than 10 MB, then the
vector index isn't populated. Similarly, if you delete data from an indexed
table and the table size falls below 10 MB, then the vector index is
temporarily disabled. In this case, vector search queries don't use the index
and the
indexUnusedReasons
code in the
vectorSearchStatistics
section of the
Job
resource is
BASE_TABLE_TOO_SMALL
. Without the index,
VECTOR_SEARCH
automatically falls back to using brute force to find the
nearest neighbors of embeddings.
Queries that use the
VECTOR_SEARCH
function
always return correct results, even if some portion of data is not yet indexed.
You can verify the existence and the readiness of a vector index by querying
INFORMATION_SCHEMA
. The following views contain metadata on vector indexes:
The
INFORMATION_SCHEMA.VECTOR_INDEXES
view
has information about the vector indexes in a dataset.
After the
CREATE VECTOR INDEX
statement completes, the index must still
be populated before you can use it. You can use the
last_refresh_time
and
coverage_percentage
columns to verify the readiness of a vector
index. If the vector index isn't ready, you can still use the
VECTOR_SEARCH
function on a table, it just might run more slowly without
the index.
The
INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS
view
has information about the vector-indexed columns for all tables in a dataset.
The
INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS
view
has information about the options used by the vector indexes in a dataset.
Vector index examples
The following example shows all active vector indexes on tables in the dataset
my_dataset
, located in the project
my_project
. It includes their names, the
DDL statements used to create them, and their coverage percentage. If an
indexed base table is less than 10 MB, then its index is not populated, in
which case the
coverage_percentage
value is 0.
SELECT table_name, index_name, ddl, coverage_percentage
FROM my_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXES
WHERE index_status = 'ACTIVE';
The result is similar to the following:
+-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+
| table_name | index_name | ddl | coverage_percentage |
+-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+
| small_table | myindex1 | CREATE VECTOR INDEX `myindex1` ON `my_project.my_dataset.small_table`(embeddings) | 100 |
| | | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"numLists": 3}') | |
+-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+
| large_table | myindex2 | CREATE VECTOR INDEX `myindex2` ON `my_project.my_dataset.large_table`(vectors) | 42 |
| | | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"numLists": 12}') | |
+-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+
Vector index columns examples
The following query extracts information on columns that have vector indexes:
SELECT table_name, index_name, index_column_name, index_field_path
FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS;
The result is similar to the following:
+------------+------------+-------------------+------------------+
| table_name | index_name | index_column_name | index_field_path |
+------------+------------+-------------------+------------------+
| table1 | indexa | a | a |
| table2 | indexb | b | b |
| table3 | indexc | c | c |
+------------+------------+-------------------+------------------+
Vector index options examples
The following query extracts information on vector index options:
SELECT table_name, index_name, option_name, option_type, option_value
FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS;
The result is similar to the following:
+------------+------------+------------------+------------------+--------------------+
| table_name | index_name | option_name | option_type | option_value |
+------------+------------+------------------+------------------+--------------------+
| table1 | indexa | distance_type | STRING | EUCLIDEAN |
| table1 | indexa | index_type | STRING | IVF |
| table2 | indexb | ivf_options | STRING | {"num_lists": 100} |
| table2 | indexb | index_type | STRING | IVF |
+------------+------------+------------------+------------------+--------------------+
Vector index usage
Information on vector index usage is available in the job metadata of the
job that ran the vector search query. You can
view job metadata
by using
the Google Cloud console, the bq command-line tool, the BigQuery API, or the
client libraries.
When you use the Google Cloud console, you can find vector index usage
information in the
Vector Index Usage Mode
and
Vector Index Unused Reasons
fields.
When you use the bq tool or the BigQuery API, you can
find vector index usage information in the
VectorSearchStatistics
section of the
Job
resource.
The index usage mode indicates whether a vector index was used by providing
one of the following values:
UNUSED
: No vector index was used.
PARTIALLY_USED
: Some
VECTOR_SEARCH
functions in the query used vector
indexes and some didn't.
FULLY_USED
: Every
VECTOR_SEARCH
function in the query used a vector index.
When the index usage mode value is
UNUSED
or
PARTIALLY_USED
,
the index unused reasons indicate why vector indexes weren't used in the query.
For example, the following results returned by
bq show --format=prettyjson -j my_job_id
shows that the index was not used
because the
use_brute_force
option was specified in the
VECTOR_SEARCH
function:
"vectorSearchStatistics": {
"indexUnusedReasons": [
{
"baseTable": {
"datasetId": "my_dataset",
"projectId": "my_project",
"tableId": "my_table"
},
"code": "INDEX_SUPPRESSED_BY_FUNCTION_OPTION",
"message": "No vector index was used for the base table `my_project:my_dataset.my_table` because use_brute_force option has been specified."
}
],
"indexUsageMode": "UNUSED"
}
Index management options
To create indexes and have BigQuery maintain them,
you have two options:
- Use the default shared slot pool
: When the data you plan
to index is
below your per-organization
limit
, you
can use the free shared slot pool for index management.
- Use your own reservation
:
To achieve more predictable and consistent indexing
progress on your larger production workloads, you can use your own
reservations for index management.
Use shared slots
If you have not configured your project to use a
dedicated reservation
for indexing,
index management is handled in the free, shared slot pool, subject to the
following constraints.
If you add data to a table which causes the total size of indexed
tables to exceed your organization's
limit
,
BigQuery pauses index management
for all indexed tables. When this happens, the
index_status
field in the
INFORMATION_SCHEMA.VECTOR_INDEXES
view
displays
PENDING DISABLEMENT
and the index is queued for deletion. While
the index is pending disablement, it is
still used in queries and you are charged for the index storage.
After an index is deleted, the
index_status
field shows
the index as
TEMPORARILY DISABLED
. In this state, queries don't use the index,
and you are not charged for index storage. In this case, the
IndexUnusedReason
code
is
BASE_TABLE_TOO_LARGE
.
If you delete data from the table and the total size of indexed tables
falls below the per-organization limit, then index management is resumed for
all indexed tables. The
index_status
field in the
INFORMATION_SCHEMA.VECTOR_INDEXES
view is
ACTIVE
, queries can use the index, and you are charged for the
index storage.
BigQuery does not make guarantees about the available
capacity of the shared pool or the throughput of indexing you see.
For production applications, you might want to use
dedicated slots for your index processing.
Use your own reservation
Instead of using the default shared slot pool, you can optionally designate your
own reservation to index your tables. Using your own reservation ensures
predictable and consistent performance of index-management jobs, such as
creation, refresh, and background optimizations.
- There are no table size limits when an indexing job runs in your
reservation.
- Using your own reservation gives you flexibility in your index management.
If you need to create a very large index or
make a major update to an indexed table, you can temporarily add more
slots to the assignment.
To index the tables in a project with a designated reservation,
create a reservation
in the region where your tables are located. Then, assign the project to the
reservation with the
job_type
set to
BACKGROUND
:
SQL
Use the
CREATE ASSIGNMENT
DDL statement
.
In the Google Cloud console, go to the
BigQuery
page.
Go to BigQuery
In the query editor, enter the following statement:
CREATE
ASSIGNMENT
`
ADMIN_PROJECT_ID
.region-
LOCATION
.
RESERVATION_NAME
.
ASSIGNMENT_ID
`
OPTIONS (
assignee = 'projects/
PROJECT_ID
',
job_type = 'BACKGROUND');
Replace the following:
ADMIN_PROJECT_ID
: the project ID of the
administration project
that owns the reservation resource
LOCATION
: the
location
of the reservation
RESERVATION_NAME
: the name of the
reservation
ASSIGNMENT_ID
: the ID of the assignment
The ID must be unique to the project and location,
start and end with a lowercase letter or a number,
and contain only lowercase letters, numbers, and dashes.
PROJECT_ID
: the ID of the project containing
the tables to index. This project is assigned to the reservation.
Click
play_circle
Run
.
For more information about how to run queries, see
Run an interactive query
.
bq
Use the
bq mk
command:
bq mk \
--project_id=
ADMIN_PROJECT_ID
\
--location=
LOCATION
\
--reservation_assignment \
--reservation_id=
RESERVATION_NAME
\
--assignee_id=
PROJECT_ID
\
--job_type=BACKGROUND \
--assignee_type=PROJECT
Replace the following:
ADMIN_PROJECT_ID
: the project ID of the
administration project
that owns the reservation resource
LOCATION
: the
location
of the reservation
RESERVATION_NAME
: the name of the
reservation
PROJECT_ID
: the ID of the project to assign
to this reservation
View your indexing jobs
A new indexing job is created every time an index is created or updated on
a single table. To view information about the job, query the
INFORMATION_SCHEMA.JOBS*
views
. You
can filter for indexing jobs by
setting
job_type IS NULL AND SEARCH(job_id, '`search_index`')
in the
WHERE
clause of your query. The following example lists the five most recent indexing
jobs in the project
my_project
:
SELECT *
FROM
region-us.INFORMATION_SCHEMA.JOBS
WHERE
project_id = 'my_project'
AND job_type IS NULL
AND SEARCH(job_id, '`search_index`')
ORDER BY
creation_time DESC
LIMIT 5;
Choose your reservation size
To choose the right number of slots for your reservation, you should consider
when index-management jobs are run, how many slots they use, and what your usage
looks like over time. BigQuery triggers an index-management job
in the following situations:
- You create an index on a table.
- Data is modified in an indexed table.
- The schema of a table changes and this affects which columns are indexed.
- Index data and metadata are periodically optimized or updated.
The number of slots you need for an index-management job on a table depends on
the following factors:
- The size of the table
- The rate of data ingestion to the table
- The rate of DML statements applied to the table
- The acceptable delay for building and maintaining the index
- The complexity of the index, typically determined by attributes of the data,
such as the number of duplicate terms
Monitor Usage and Progress
The best way to assess the number of slots you need to efficiently run your
index-management jobs is to monitor your slot utilization and adjust the
reservation size accordingly. The following query produces the daily slot usage
for index-management jobs. Only the past 30 days are included in the
region
us-west1
:
SELECT
TIMESTAMP_TRUNC(job.creation_time, DAY) AS usage_date,
-- Aggregate total_slots_ms used for index-management jobs in a day and divide
-- by the number of milliseconds in a day. This value is most accurate for
-- days with consistent slot usage.
SAFE_DIVIDE(SUM(job.total_slot_ms), (1000 * 60 * 60 * 24)) AS average_daily_slot_usage
FROM
`region-us-west1`.INFORMATION_SCHEMA.JOBS job
WHERE
project_id = 'my_project'
AND job_type IS NULL
AND SEARCH(job_id, '`search_index`')
GROUP BY
usage_date
ORDER BY
usage_date DESC
limit 30;
When there are insufficient slots to run index-management jobs, an index can
become out of sync with its table and indexing jobs might fail.
In this case, BigQuery rebuilds the index from scratch. To
avoid having an out-of-sync index, ensure you have enough slots to support index
updates from data ingestion and optimization. For more information on
monitoring slot usage, see
admin resource charts
.
Delete a vector index
When you no longer need a vector index or want to change which column is
indexed on a table, you can delete the index on that table by using the
DROP VECTOR INDEX
DDL statement
.
For example:
DROP
VECTOR
INDEX my_index ON my_dataset.indexed_table;
If an indexed table is deleted, its index is deleted automatically.
What's next