GoogleSQL for BigQuery supports the following search functions.
Function list
Name
|
Summary
|
SEARCH
|
Checks to see whether a table or other
search data contains a set of search terms.
|
VECTOR_SEARCH
|
Performs a vector search on embeddings to find semantically similar
entities.
|
SEARCH
SEARCH(
data_to_search, search_query
[, json_scope=>{ 'JSON_VALUES' | 'JSON_KEYS' | 'JSON_KEYS_AND_VALUES' }]
[, analyzer=>{ 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER'}]
[, analyzer_options=>analyzer_options_values]
)
Description
The
SEARCH
function checks to see whether a BigQuery table or other
search data contains a set of search terms (tokens). It returns
TRUE
if all
search terms appear in the data, based on the
rules for search_query
and text analysis described in the
text analyzer
. Otherwise,
this function returns
FALSE
.
Definitions
data_to_search
: The data to search over. The value can be:
- Any GoogleSQL data type literal
- A list of columns
- A table reference
- A column of any type
A table reference is evaluated as a
STRUCT
whose fields are the columns of
the table.
data_to_search
can be any type, but
SEARCH
will return
FALSE
for all types except those listed here:
ARRAY<STRING>
ARRAY<STRUCT>
JSON
STRING
STRUCT
You can search for string literals in columns of the preceding types.
For additional rules, see
Search data rules
.
search_query
: A
STRING
literal, or a
STRING
constant expression that
represents the terms of the search query. If
search_query
is
NULL
, an
error is returned. If
search_query
produces no search tokens,
and the text analyzer is
LOG_ANALYZER
or
PATTERN_ANALYZER
, an error is
produced.
json_scope
: Optional mandatory-named argument that
takes one of the following values to indicate the scope of JSON data to be
searched. It has no effect if
data_to_search
isn't a JSON value or
doesn't contain a JSON field.
'JSON_VALUES'
(default): Only the JSON values are searched. If
json_scope
isn't provided, this is used by default.
'JSON_KEYS'
: Only the JSON keys are searched.
'JSON_KEYS_AND_VALUES'
: The JSON keys and values are searched.
analyzer
: Optional mandatory-named argument that takes
one of the following values to indicate the text analyzer to use:
'LOG_ANALYZER'
(default): Breaks the input into tokens when delimiters
are encountered and then normalizes the tokens.
For more information, see
LOG_ANALYZER
.
'NO_OP_ANALYZER'
: Extracts the text as a single token, but
doesn't apply normalization. For more information about this analyzer,
see
NO_OP_ANALYZER
.
'PATTERN_ANALYZER'
: Breaks the input into tokens that match a
regular expression. For more information, see
PATTERN_ANALYZER
text analyzer
.
analyzer_options
: Optional mandatory-named argument that takes a list of
text analysis rules as a JSON-formatted
STRING
. For more information,
see
Text analyzer options
.
Details
The
SEARCH
function is designed to work with
search indexes
to
optimize point lookups. Although the
SEARCH
function works for
tables that aren't indexed, its performance will be greatly improved with a
search index. If both the analyzer and analyzer options match the one used
to create the index, the search index will be used.
Rules for
search_query
A search query is initially broken down into terms (subqueries) using the
white spaces in the search query. Each term is then further broken down into
zero or more searchable tokens based on the text analyzer. This section contains
the rules for how different types of terms are analyzed and evaluated.
Rules for backticks in
search_query
:
If the
LOG_ANALYZER
text analyzer is used, text enclosed in backticks
forces an exact match.
For example,
`Hello World` happy days
becomes
Hello World
,
happy
,
and
days
.
Search terms enclosed in backticks must match exactly in
data_to_search
,
subject to the following conditions:
For example,
SEARCH('foo.bar', '`foo.`')
returns
FALSE
because the
text enclosed in the backticks
foo.
is immediately followed by the
character
b
in the search data
foo.bar
, rather than by a delimiter or
the end of the string. However,
SEARCH('foo..bar', '`foo.`')
returns
TRUE
because
foo.
is immediately followed by the delimiter
.
in the
search data.
The backtick itself can be escaped using a backslash,
as in
\`foobar\`
.
The following are reserved words and must be enclosed
in backticks:
AND
,
NOT
,
OR
,
IN
, and
NEAR
Rules for reserved characters in
search_query
:
Rules for phrases in
search_query
:
- A phrase is a type of term. If text is enclosed in double quotes and the
analyzer
is
LOG_ANALYZER
,
PATTERN_ANALYZER
, or not set
(
LOG_ANALYZER
by default), the term represents a phrase.
- When a phrase is analyzed, a subset of tokens is created for that phrase.
For example, from the phrase
"foo baz.bar"
, the analyzer called
LOG_ANALYZER
generates the phrase-specific tokens
foo
,
baz
, and
bar
.
The order of terms in a phrase matters. A match is only returned if
the tokens that were produced for the phrase are next to each other and in
the same order as the tokens for
data_to_search
.
For example:
-- FALSE because 'foo' and 'bar' are not next to each other in
-- 'foo baz.bar'.
SEARCH('foo baz.bar', '"foo bar"')
-- TRUE because 'foo' and 'baz' are next to each other in
-- 'foo baz.bar'.
SEARCH('foo baz.bar', '"foo baz"')
A single quote inside of the phrase is analyzed as a special character.
An escaped double quote (double quote after a backslash) is analyzed
as a double quote character.
How
search_query
is broken into searchable tokens
The following table shows how
search_query
is broken into
searchable tokens by the
LOG_ANALYZER
text analyzer. All entries are strings.
search_query
|
searchable tokens
|
127.0.0.1
|
127
0
1
127.0.0.1
.
127.0.0
127.0
0.0
0.0.1
0.1
|
foobar@example.com
|
foobar
example
com
foobar@example
example.com
foobar@example.com
|
The fox.
|
the
fox
The
The fox
The fox.
fox
fox.
|
The following table shows how
search_query
is broken into searchable tokens
by the
LOG_ANALYZER
text analyzer. All entries are strings.
search_query
|
searchable tokens
|
127.0.0.1
|
127
0
1
|
`127.0.0.1`
|
127.0.0.1
|
foobar@example.com
|
foobar
example
com
|
`foobar@example.com`
|
foobar@example.com
|
Rules for
data_to_search
General rules for
data_to_search
:
Return type
BOOL
Examples
The following queries show how tokens in
search_query
are analyzed
by a
SEARCH
function call using the default analyzer,
LOG_ANALYZER
:
SELECT
-- ERROR: `search_query` is NULL.
SEARCH('foobarexample', NULL) AS a,
-- ERROR: `search_query` contains no tokens.
SEARCH('foobarexample', '') AS b,
SELECT
-- TRUE: '-' and ' ' are delimiters.
SEARCH('foobar-example', 'foobar example') AS a,
-- TRUE: The search query is a constant expression evaluated to 'foobar'.
SEARCH('foobar-example', CONCAT('foo', 'bar')) AS b,
-- FALSE: The search_query is not split.
SEARCH('foobar-example', 'foobarexample') AS c,
-- TRUE: The double backslash escapes the ampersand which is a delimiter.
SEARCH('foobar-example', 'foobar\\&example') AS d,
-- TRUE: The single backslash escapes the ampersand in a raw string.
SEARCH('foobar-example', R'foobar\&example')AS e,
-- FALSE: The backticks indicate that there must be an exact match for
-- foobar&example.
SEARCH('foobar-example', '`foobar&example`') AS f,
-- TRUE: An exact match is found.
SEARCH('foobar&example', '`foobar&example`') AS g
/*-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g |
+-------+-------+-------+-------+-------+-------+-------+
| true | true | false | true | true | false | true |
*-------+-------+-------+-------+-------+-------+-------*/
SELECT
-- TRUE: The order of terms doesn't matter.
SEARCH('foobar-example', 'example foobar') AS a,
-- TRUE: Tokens are made lower-case.
SEARCH('foobar-example', 'Foobar Example') AS b,
-- TRUE: An exact match is found.
SEARCH('foobar-example', '`foobar-example`') AS c,
-- FALSE: Backticks preserve capitalization.
SEARCH('foobar-example', '`Foobar`') AS d,
-- FALSE: Backticks don't have special meaning for search_data and are
-- not delimiters in the default LOG_ANALYZER.
SEARCH('`foobar-example`', '`foobar-example`') AS e,
-- TRUE: An exact match is found after the delimiter in search_data.
SEARCH('foobar@example.com', '`example.com`') AS f,
-- TRUE: An exact match is found between the space delimiters.
SEARCH('a foobar-example b', '`foobar-example`') AS g;
/*-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g |
+-------+-------+-------+-------+-------+-------+-------+
| true | true | true | false | false | true | true |
*-------+-------+-------+-------+-------+-------+-------*/
SELECT
-- FALSE: No single array entry matches all search terms.
SEARCH(['foobar', 'example'], 'foobar example') AS a,
-- FALSE: The search query is equivalent to foobar\\=.
SEARCH('foobar=', '`foobar\\=`') AS b,
-- FALSE: This is equivalent to the previous example.
SEARCH('foobar=', R'`\foobar=`') AS c,
-- TRUE: The equals sign is a delimiter in the data and query.
SEARCH('foobar=', 'foobar\\=') AS d,
-- TRUE: This is equivalent to the previous example.
SEARCH('foobar=', R'foobar\=') AS e,
-- TRUE: An exact match is found.
SEARCH('foobar.example', '`foobar`') AS f,
-- FALSE: `foobar.\` is not analyzed because of backticks; it is not
-- followed by a delimiter in search_data 'foobar.example'.
SEARCH('foobar.example', '`foobar.\`') AS g,
-- TRUE: `foobar.` is not analyzed because of backticks; it is
-- followed by the delimiter '.' in search_data 'foobar..example'.
SEARCH('foobar..example', '`foobar.`') AS h;
/*-------+-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g | h |
+-------+-------+-------+-------+-------+-------+-------+-------+
| false | false | false | true | true | true | false | true |
*-------+-------+-------+-------+-------+-------+-------+-------*/
The following queries show how phrases in
search_query
are analyzed
by a
SEARCH
function call:
SELECT
-- TRUE: The phrase `foo bar` is in `foo bar baz`.
-- The tokens in `data_to_search` are `foo`, `bar`, and `baz`.
-- The searchable tokens in `query_string` are `foo` and `bar`
-- and because they appear in that exact order in `data_to_search`,
-- the function returns TRUE.
SEARCH(R'foo bar baz', R'"foo bar"') AS a,
-- TRUE: Case is ignored.
-- The tokens in `data_to_search` are `foo`, `bar`, and `baz`.
-- The searchable tokens in `query_string` are `foo` and `bar`
-- and because they appear in that exact order in `data_to_search`,
-- the function return TRUE.
SEARCH(R'Foo bar baz', R'"foo Bar"') AS b,
-- TRUE: Both `-` and `&` are delimiters used during tokenization.
-- The tokens in `data_to_search` are `foo`, `bar`, and `baz`.
-- The searchable tokens in `query_string` are `foo` and `bar`
-- and because they appear in that exact order in `data_to_search`,
-- the function returns TRUE.
SEARCH(R'foo-bar baz', R'"foo&bar"') AS c,
-- FALSE: Backticks in a phrase are treated as normal characters.
-- The tokens in `data_to_search` are `foo`, `bar`, and `baz`.
-- The searchable tokens in `query_string` are:
-- `foo
-- bar`
-- Because these searchable tokens don't appear in `data_to_search`,
-- the function returns FALSE.
SEARCH(R'foo bar baz', R'"`foo bar`"') AS d,
-- FALSE: `foo bar` is not in `foo else bar`.
-- The tokens in `data_to_search` are `foo`, `else`, and `bar`.
-- The searchable tokens in `query_string` are `foo` and `bar`.
-- Even though they appear in `data_to_search`, but because they
-- do not appear in that exact order (`foo` before `bar`),
-- the function returns FALSE.
SEARCH(R'foo else bar', R'"foo bar"') AS e,
-- FALSE: `foo baz` is not in `foo bar baz`.
-- The `search_query` produces two terms. The first term is `bar`, which
-- matches with the similar token in `data_to_search`. However, the second
-- term is the phrase "foo&baz" with two tokens, `foo` and `baz`. Because
-- `foo` and `baz` do not appear next to each other in `data_to_search`
-- (`bar` is in between), the function returns FALSE.
SEARCH(R'foo-bar-baz', R'bar "foo&baz"') AS f;
/*-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f |
+-------+-------+-------+-------+-------+-------+
| true | true | false | false | false | false |
*-------+-------+-------+-------+-------+-------*/
SELECT
-- FALSE: Only double quotes need to be escaped in a phrase.
-- The tokens in `data_to_search` are `foo`, `bar`, and `baz`.
-- The searchable tokens in `query_string` are `foo\` and `bar` and they
-- must appear in that exact order in `data_to_search`, but don't.
SEARCH(
R'foo bar baz',
R'"foo\ bar"',
analyzer_options=>'{"delimiters": [" "]}') AS a,
-- TRUE: `foo bar` is in `foo bar baz` after tokenization with the given
-- delimiters.
-- The tokens in `data_to_search` are `foo`, `bar`, and `baz`.
-- The searchable tokens in `query_string` are `foo` and `bar` and they
-- must appear in that exact order in `data_to_search`.
SEARCH(
R'foo bar baz',
R'"foo? bar"',
analyzer_options=>'{"delimiters": [" ", "?"]}') AS b,
-- TRUE: `read book` is in `read book now` after `the` is ignored.
-- The tokens in `data_to_search` are `read`, `book`, and `now`.
-- The searchable tokens in `query_string` are `read` and `book` and they
-- must appear in that exact order in `data_to_search`.
SEARCH(
'read the book now',
R'"read the book"',
analyzer_options => '{ "token_filters": [{"stop_words": ["the"]}] }') AS c,
-- FALSE: `c d` is not in `a`, `b`, `cd`, `e` or `f` after tokenization with
-- the given pattern.
-- The tokens in `data_to_search` are `a`, `b`, `cd`, `e` and `f`.
-- The searchable tokens in `query_string` are `c` and `d` and they
-- must appear in that exact order in `data_to_search`. `data_to_search`
-- contains a `cd` token, but not a `c` or `d` token.
SEARCH(
R'abcdef',
R'"c d"',
analyzer=>'PATTERN_ANALYZER',
analyzer_options=>'{"patterns": ["(?:cd)|[a-z]"]}') AS d,
-- TRUE: `ant apple` is in `ant apple avocado` after tokenization with
-- the given pattern.
-- The tokens in `data_to_search` are `ant`, `apple`, and `avocado`.
-- The searchable tokens in `query_string` are `ant` and `apple` and they
-- must appear in that exact order in `data_to_search`.
SEARCH(
R'ant orange apple avocado',
R'"ant apple"',
analyzer=>'PATTERN_ANALYZER',
analyzer_options=>'{"patterns": ["a[a-z]"]}') AS e;
/*-------+-------+-------+-------+-------*
| a | b | c | d | e |
+-------+-------+-------+-------+-------+
| false | true | true | false | true |
*-------+-------+-------+-------+-------*/
The following query shows examples of calls to the
SEARCH
function using the
NO_OP_ANALYZER
text analyzer and reasons for various return values:
SELECT
-- TRUE: exact match
SEARCH('foobar', 'foobar', analyzer=>'NO_OP_ANALYZER') AS a,
-- FALSE: Backticks are not special characters for `NO_OP_ANALYZER`.
SEARCH('foobar', '\`foobar\`', analyzer=>'NO_OP_ANALYZER') AS b,
-- FALSE: The capitalization does not match.
SEARCH('foobar', 'Foobar', analyzer=>'NO_OP_ANALYZER') AS c,
-- FALSE: There are no delimiters for `NO_OP_ANALYZER`.
SEARCH('foobar example', 'foobar', analyzer=>'NO_OP_ANALYZER') AS d,
-- TRUE: An exact match is found.
SEARCH('', '', analyzer=>'NO_OP_ANALYZER') AS e,
-- FALSE: 'foo bar' and "foo bar" are not considered an exact match.
SEARCH( R'foo bar baz', R'"foo bar"', analyzer=>'NO_OP_ANALYZER') AS f,
-- TRUE: "foo bar" and "foo Bar" are considered an exact match because the
-- analysis is case-insensitive.
SEARCH( R'"foo bar"', R'"foo Bar"', analyzer=>'NO_OP_ANALYZER') AS g;
/*-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g |
+-------+-------+-------+-------+-------+-------+-------+
| true | false | false | false | true | false | true |
*-------+-------+-------+-------+-------+-------+-------*/
Consider the following table called
meals
with columns
breakfast
,
lunch
,
and
dinner
:
/*-------------------+-------------------------+------------------*
| breakfast | lunch | dinner |
+-------------------+-------------------------+------------------+
| Potato pancakes | Toasted cheese sandwich | Beef soup |
| Avocado toast | Tomato soup | Chicken soup |
*-------------------+-------------------------+------------------*/
The following query shows how to search single columns, multiple columns, and
whole tables, using the default
LOG_ANALYZER
text analyzer
with the default analyzer options:
WITH
meals AS (
SELECT
'Potato pancakes' AS breakfast,
'Toasted cheese sandwich' AS lunch,
'Beef soup' AS dinner
UNION ALL
SELECT
'Avocado toast' AS breakfast,
'Tomato soup' AS lunch,
'Chicken soup' AS dinner
)
SELECT
SEARCH(lunch, 'soup') AS lunch_soup,
SEARCH((breakfast, dinner), 'soup') AS breakfast_or_dinner_soup,
SEARCH(meals, 'soup') AS anytime_soup
FROM meals;
/*------------+--------------------------+--------------*
| lunch_soup | breakfast_or_dinner_soup | anytime_soup |
+------------+--------------------------+--------------+
| false | true | true |
| true | true | true |
*------------+--------------------------+--------------*/
The following query shows additional ways to search, using the
default
LOG_ANALYZER
text analyzer with
default analyzer options:
WITH data AS ( SELECT 'Please use foobar@example.com as your email.' AS email )
SELECT
SEARCH(email, 'exam') AS a,
SEARCH(email, 'foobar') AS b,
SEARCH(email, 'example.com') AS c,
SEARCH(email, R'"please use"') AS d,
SEARCH(email, R'"as email"') AS e
FROM data;
/*-------+-------+-------+-------+-------*
| a | b | c | d | e |
+-------+-------+-------+-------+-------+
| false | true | true | true | false |
*-------+-------+-------+-------+-------*/
The following query shows additional ways to search, using the
default
LOG_ANALYZER
text analyzer with custom
analyzer options. Terms are only split when a space or
@
symbol is
encountered.
WITH data AS ( SELECT 'Please use foobar@example.com as your email.' AS email )
SELECT
SEARCH(email, 'foobar', analyzer_options=>'{"delimiters": [" ", "@"]}') AS a,
SEARCH(email, 'example', analyzer_options=>'{"delimiters": [" ", "@"]}') AS b,
SEARCH(email, 'example.com', analyzer_options=>'{"delimiters": [" ", "@"]}') AS c,
SEARCH(email, 'foobar@example.com', analyzer_options=>'{"delimiters": [" ", "@"]}') AS d,
SEARCH(email, R'use "foobar example.com" "as your"', analyzer_options=>'{"delimiters": [" ", "@"]}') AS e
FROM data;
/*-------+-------+-------+-------+-------*
| a | b | c | d | e |
+-------+-------+-------+-------+-------+
| true | false | true | true | true |
*-------+-------+-------+-------+-------*/
The following query shows how to search, using the
NO_OP_ANALYZER
text analyzer:
WITH meals AS ( SELECT 'Tomato soup' AS lunch )
SELECT
SEARCH(lunch, 'Tomato soup', analyzer=>'NO_OP_ANALYZER') AS a,
SEARCH(lunch, 'soup', analyzer=>'NO_OP_ANALYZER') AS b,
SEARCH(lunch, 'tomato soup', analyzer=>'NO_OP_ANALYZER') AS c,
SEARCH(lunch, R'"Tomato soup"', analyzer=>'NO_OP_ANALYZER') AS d
FROM meals;
/*-------+-------+-------+-------*
| a | b | c | d |
+-------+-------+-------+-------+
| true | false | false | false |
*-------+-------+-------+-------*/
The following query shows how to use the
PATTERN_ANALYZER
text analyzer with default analyzer options:
WITH data AS ( SELECT 'Please use foobar@example.com as your email.' AS email )
SELECT
SEARCH(email, 'exam', analyzer=>'PATTERN_ANALYZER') AS a,
SEARCH(email, 'foobar', analyzer=>'PATTERN_ANALYZER') AS b,
SEARCH(email, 'example.com', analyzer=>'PATTERN_ANALYZER') AS c,
SEARCH(email, R'foobar "EXAMPLE.com as" email', analyzer=>'PATTERN_ANALYZER') AS d
FROM data;
/*-------+-------+-------+-------*
| a | b | c | d |
+-------+-------+-------+-------+
| false | true | true | true |
*-------+-------+-------+-------*/
The following query shows additional ways to search, using the
PATTERN_ANALYZER
text analyzer with
custom analyzer options:
WITH data AS ( SELECT 'Please use foobar@EXAMPLE.com as your email.' AS email )
SELECT
SEARCH(email, 'EXAMPLE', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[A-Z]*"]}') AS a,
SEARCH(email, 'example', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[a-z]*"]}') AS b,
SEARCH(email, 'example.com', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[a-z]*"]}') AS c,
SEARCH(email, 'example.com', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[a-zA-Z.]*"]}') AS d
FROM data;
/*-------+-------+-------+-------+-------*
| a | b | c | d | e |
+-------+-------+-------+-------+-------+
| true | false | false | true | false |
*-------+-------+-------+-------+-------*/
For additional examples that include analyzer options,
see the
Text analysis
reference guide.
For helpful analyzer recipes that you can use to enhance
analyzer-supported queries, see the
Search with text analyzers
user guide.
VECTOR_SEARCH
To provide feedback or request support for this feature, send email to
bq-vector-search@google.com
.
VECTOR_SEARCH(
{TABLE base_table | base_table_query_statement},
column_to_search,
TABLE query_table
[, query_column_to_search => query_column_to_search_value]
[, top_k => top_k_value ]
[, distance_type => distance_type_value ]
[, options => options_value ]
)
VECTOR_SEARCH(
{TABLE base_table | base_table_query_statement},
column_to_search,
(query_statement)
[, query_column_to_search => query_column_to_search_value]
[, top_k => top_k_value ]
[, distance_type => distance_type_value ]
[, options => options_value ]
)
Description
The
VECTOR_SEARCH
function lets you search embeddings to find semantically
similar entities.
Embeddings are high-dimensional numerical vectors that represent a given entity,
like a piece of text or an audio file. Machine learning (ML) models use
embeddings to encode semantics about such entities to make it easier to
reason about and compare them. For example, a common operation in clustering,
classification, and recommendation models is to measure the distance between
vectors in an
embedding space
to
find items that are most semantically similar.
Definitions
base_table
: The table to search for nearest neighbor embeddings.
base_table_query_statement
: A query that you can use to pre-filter the base
table. Only
SELECT
,
FROM
, and
WHERE
clauses are allowed in this query.
Don't apply any filters to the embedding column.
You can't use
logical views
in this query.
Using a
subquery
might
interfere with index usage or cause your query to fail.
If the base table is indexed and the
WHERE
clause contains columns that are
not stored in the index, then
VECTOR_SEARCH
post-filters on those columns
instead. To learn more and enable pre-filtering, see
Store columns and pre-filter
.
column_to_search
: The name of the base table column
to search for nearest neighbor embeddings. The column must have
a type of
ARRAY<FLOAT64>
. All elements in the array must be non-
NULL
, and
all values in the column must have the same array dimensions.
If the column has a vector index, BigQuery attempts to use it.
To determine if an index was used in the vector search, see
Vector index usage
.
query_table
: The table that provides the
embeddings for which to find nearest neighbors. All columns are passed
through as output columns.
query_statement
: A query that provides the
embeddings for which to find nearest neighbors. All columns are passed
through as output columns.
query_column_to_search
: An optional
STRING
positional-named argument.
query_column_to_search_value
specifies the name of the column in the query
table or statement that contains the embeddings for which to find nearest
neighbors. The column must have a type of
ARRAY<FLOAT64>
. All elements in
the array must be non-
NULL
and all values in the column must have the same
array dimensions as the values in the
column_to_search
column. If you don't
specify
query_column_to_search_value
, the function uses the
column_to_search
value.
top_k
: An optional
INT64
mandatory-named argument.
top_k_value
specifies the number of nearest neighbors to
return. The default is
10
. A negative value is treated as infinity, meaning
that all values are counted as neighbors and returned.
distance_type
: An optional
STRING
mandatory-named argument.
distance_type_value
specifies the type of metric to use to
compute the distance between two vectors. Supported distance types are
EUCLIDEAN
and
COSINE
. The default is
EUCLIDEAN
.
If you don't specify
distance_type_value
and the
column_to_search
column has a vector index that is used,
VECTOR_SEARCH
uses the distance
type specified in the
distance_type
option
of the
CREATE VECTOR INDEX
statement.
options
: An optional JSON-formatted
STRING
mandatory-named argument.
options_value
is a literal that specifies the following vector search
options:
fraction_lists_to_search
: A JSON number that specifies the
percentage of lists to search. For example,
options => '{"fraction_lists_to_search":0.15}'
. The
fraction_lists_to_search
value must be in the range
0.0
to
1.0
,
exclusive.
Specifying a higher percentage leads to higher recall and slower
performance, and the converse is true when specifying a lower percentage.
fraction_lists_to_search
is only used when a vector index is also used.
If you don't specify a
fraction_lists_to_search
value but an index is
matched, the default number of lists to scan is calculated as
min(0.002 * number_of_lists, 10)
.
The number of available lists to search is determined by the
num_lists
option
in the
ivf_options
option of the
CREATE VECTOR INDEX
statement if that is specified. Otherwise,
BigQuery calculates an appropriate number.
You can't specify
fraction_lists_to_search
when
use_brute_force
is
set to
true
.
use_brute_force
: A JSON boolean that determines whether to use brute
force search by skipping the vector index if one is available. For
example,
options => '{"use_brute_force":true}'
. The
default is
false
. If you specify
use_brute_force=false
and there is
no useable vector index available, brute force is used anyway.
options
defaults to
'{}'
to denote that all underlying options use their
corresponding default values.
Details
You can optionally use
VECTOR_SEARCH
with a
vector index
. When
a vector index is used,
VECTOR_SEARCH
uses the
Approximate Nearest
Neighbor
search technique to help improve vector search performance, with
the trade-off of reducing
recall
and so returning more approximate
results. Brute force is used to return exact results when a vector index isn't
available, and you can choose to use brute force to get exact results even when
a vector index is available.
Output
For each row in the query data, the output contains multiple rows from the
base table that satisfy the search criteria. The number of results rows per
query table row is either 10 or the
top_k
value if it is specified. The
order of the output isn't guaranteed.
The output includes the following columns:
query
: A
STRUCT
value that contains all selected columns from the query
data.
base
: A
STRUCT
value that contains all columns from
base_table
or a
subset of the columns from
base_table
that you selected in the
base_table_query_statement
query.
distance
: A
FLOAT64
value that represents the distance between the base
data and the query data.
Limitations
BigQuery data security and governance rules apply to the use of
VECTOR_SEARCH
, which results in the following behavior:
- If the base table has
row-level security policies
,
VECTOR_SEARCH
applies the row-level
access policies to the query results.
- If the indexed column from the base table has
data masking policies
,
VECTOR_SEARCH
succeeds only if the user
running the query has the
Fine-Grained Reader
role on the policy tags
that are used. Otherwise,
VECTOR_SEARCH
fails with an invalid query error.
If any base table column or any column in the query table or statement has
column-level security policies
and you don't have appropriate
permissions to access the column,
VECTOR_SEARCH
fails with a permission
denied error.
The project that runs the query containing
VECTOR_SEARCH
must match the
project that contains the base table.
Examples
The following queries create test tables
table1
and
table2
to use in
subsequent query examples :
CREATE OR REPLACE TABLE mydataset.table1
(
id INT64,
my_embedding ARRAY<FLOAT64>
);
INSERT mydataset.table1 (id, my_embedding)
VALUES(1, [1.0, 2.0]),
(2, [2.0, 4.0]),
(3, [1.5, 7.0]),
(4, [1.0, 3.2]),
(5, [5.0, 5.4]),
(6, [3.7, 1.8]),
(7, [4.4, 2.9]);
CREATE OR REPLACE TABLE mydataset.table2
(
query_id STRING,
embedding ARRAY<FLOAT64>
);
INSERT mydataset.table2 (query_id, embedding)
VALUES('dog', [1.0, 2.0]),
('cat', [3.0, 5.2]);
The following example searches the
my_embedding
column of
table1
for the top
two embeddings that match each row of data in the
embedding
column of
table2
:
SELECT *
FROM
VECTOR_SEARCH(
TABLE mydataset.table1,
'my_embedding',
(SELECT query_id, embedding FROM mydataset.table2),
'embedding',
top_k => 2);
/*----------------+-----------------+---------+----------------------------------------*
| query.query_id | query.embedding | base.id | base.my_embedding | distance |
+----------------+-----------------+---------+-------------------+--------------------+
| dog | 1.0 | 1 | 1.0 | 0 |
| | 2.0 | | 2.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| dog | 1.0 | 4 | 1.0 | 1.2000000000000002 |
| | 2.0 | | 3.2 | |
+----------------+-----------------+---------+-------------------+--------------------+
| cat | 3.0 | 2 | 2.0 | 1.5620499351813311 |
| | 5.2 | | 4.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| cat | 3.0 | 5 | 5.0 | 2.0099751242241779 |
| | 5.2 | | 5.4 | |
*----------------+-----------------+---------+-------------------+--------------------*/
The following example pre-filters
table1
to rows where
id
is not equal to
4 and then searches the
my_embedding
column of
table1
for the top
two embeddings that match each row of data in the
embedding
column of
table2
. To enable pre-filtering, fill out the
enrollment form
.
SELECT *
FROM
VECTOR_SEARCH(
(SELECT * FROM mydataset.table1 WHERE id != 4),
'my_embedding',
(SELECT query_id, embedding FROM mydataset.table2),
'embedding',
top_k => 2,
options => '{"use_brute_force":true}');
/*----------------+-----------------+---------+----------------------------------------*
| query.query_id | query.embedding | base.id | base.my_embedding | distance |
+----------------+-----------------+---------+-------------------+--------------------+
| dog | 1.0 | 1 | 1.0 | 0 |
| | 2.0 | | 2.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| dog | 1.0 | 2 | 2.0 | 2.23606797749979 |
| | 2.0 | | 4.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| cat | 3.0 | 2 | 2.0 | 1.5620499351813311 |
| | 5.2 | | 4.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| cat | 3.0 | 5 | 5.0 | 2.0099751242241779 |
| | 5.2 | | 5.4 | |
*----------------+-----------------+---------+-------------------+--------------------*/
The following example searches the
my_embedding
column of
table1
for the top
two embeddings that match each row of data in the
embedding
column of
table2
, and uses the
COSINE
distance type to measure the distance between
the embeddings:
SELECT *
FROM
VECTOR_SEARCH(
TABLE mydataset.table1,
'my_embedding',
TABLE mydataset.table2,
'embedding',
top_k => 2,
distance_type => 'COSINE');
/*----------------+-----------------+---------+-------------------------------------------+
| query.query_id | query.embedding | base.id | base.my_embedding | distance |
+----------------+-----------------+---------+-------------------+-----------------------+
| dog | 1.0 | 2 | 2.0 | 0 |
| | 2.0 | | 4.0 | |
+----------------+-----------------+---------+-------------------+-----------------------+
| dog | 1.0 | 1 | 1.0 | 0 |
| | 2.0 | | 2.0 | |
+----------------+-----------------+---------+-------------------+-----------------------+
| cat | 3.0 | 2 | 2.0 | 0.0017773842088002478 |
| | 5.2 | | 4.0 | |
+----------------+-----------------+---------+-------------------+-----------------------+
| cat | 3.0 | 1 | 1.0 | 0.0017773842088002478 |
| | 5.2 | | 2.0 | |
*----------------+-----------------+---------+-------------------+-----------------------*/