Functions for Searching in Strings
All functions in this section search case-sensitively by default. Case-insensitive search is usually provided by separate function variants.
Case-insensitive search follows the lowercase-uppercase rules of the English language. E.g. Uppercased
i
in the English language is
I
whereas in the Turkish language it is
İ
- results for languages other than English may be unexpected.
Functions in this section also assume that the searched string (referred to in this section as
haystack
) and the search string (referred to in this section as
needle
) are single-byte encoded text. If this assumption is
violated, no exception is thrown and results are undefined. Search with UTF-8 encoded strings is usually provided by separate function
variants. Likewise, if a UTF-8 function variant is used and the input strings are not UTF-8 encoded text, no exception is thrown and the
results are undefined. Note that no automatic Unicode normalization is performed, however you can use the
normalizeUTF8*()
functions for that.
General strings functions and functions for replacing in strings are described separately.
position
Returns the position (in bytes, starting at 1) of a substring
needle
in a string
haystack
.
Syntax
position(haystack, needle[, start_pos])
Alias:
-
position(needle IN haystack)
Arguments
-
haystack
— String in which the search is performed. String . -
needle
— Substring to be searched. String . -
start_pos
– Position (1-based) inhaystack
at which the search starts. UInt . Optional.
Returned values
- Starting position in bytes and counting from 1, if the substring was found.
- 0, if the substring was not found.
If substring
needle
is empty, these rules apply:
-
if no
start_pos
was specified: return1
-
if
start_pos = 0
: return1
-
if
start_pos >= 1
andstart_pos <= length(haystack) + 1
: returnstart_pos
-
otherwise: return
0
The same rules also apply to functions
locate
,
positionCaseInsensitive
,
positionUTF8
and
positionCaseInsensitiveUTF8
.
Type:
Integer
.
Examples
Query:
SELECT position('Hello, world!', '!');
Result:
┌─position('Hello, world!', '!')─┐
│ 13 │
└────────────────────────────────┘
Example with
start_pos
argument:
Query:
SELECT
position('Hello, world!', 'o', 1),
position('Hello, world!', 'o', 7)
Result:
┌─position('Hello, world!', 'o', 1)─┬─position('Hello, world!', 'o', 7)─┐
│ 5 │ 9 │
└───────────────────────────────────┴───────────────────────────────────┘
Example for
needle IN haystack
syntax:
Query:
SELECT 6 = position('/' IN s) FROM (SELECT 'Hello/World' AS s);
Result:
┌─equals(6, position(s, '/'))─┐
│ 1 │
└─────────────────────────────┘
Examples with empty
needle
substring:
Query:
SELECT
position('abc', ''),
position('abc', '', 0),
position('abc', '', 1),
position('abc', '', 2),
position('abc', '', 3),
position('abc', '', 4),
position('abc', '', 5)
Result:
┌─position('abc', '')─┬─position('abc', '', 0)─┬─position('abc', '', 1)─┬─position('abc', '', 2)─┬─position('abc', '', 3)─┬─position('abc', '', 4)─┬─position('abc', '', 5)─┐
│ 1 │ 1 │ 1 │ 2 │ 3 │ 4 │ 0 │
└─────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┘
locate
Like
position
but with arguments
haystack
and
locate
switched.
The behavior of this function depends on the ClickHouse version:
-
in versions < v24.3,
locate
was an alias of functionposition
and accepted arguments(haystack, needle[, start_pos])
. -
in versions >= 24.3,,
locate
is an individual function (for better compatibility with MySQL) and accepts arguments(needle, haystack[, start_pos])
. The previous behavior can be restored using setting function_locate_has_mysql_compatible_argument_order = false ;
Syntax
locate(needle, haystack[, start_pos])
positionCaseInsensitive
A case insensitive invariant of position .
Example
Query:
SELECT position('Hello, world!', 'hello');
Result:
┌─position('Hello, world!', 'hello')─┐
│ 0 │
└────────────────────────────────────┘
positionUTF8
Like
position
but assumes
haystack
and
needle
are UTF-8 encoded strings.
Examples
Function
positionUTF8
correctly counts character
ö
(represented by two points) as a single Unicode codepoint:
Query:
SELECT positionUTF8('Motörhead', 'r');
Result:
┌─position('Motörhead', 'r')─┐
│ 5 │
└────────────────────────────┘
positionCaseInsensitiveUTF8
Like positionUTF8 but searches case-insensitively.
multiSearchAllPositions
Like
position
but returns an array of positions (in bytes, starting at 1) for multiple
needle
substrings in a
haystack
string.
All
multiSearch*()
functions only support up to 2
8
needles.
Syntax
multiSearchAllPositions(haystack, [needle1, needle2, ..., needleN])
Arguments
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array .
Returned values
- Array of the starting position in bytes and counting from 1, if the substring was found.
- 0, if the substring was not found.
Example
Query:
SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world']);
Result:
┌─multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])─┐
│ [0,13,0] │
└───────────────────────────────────────────────────────────────────┘
multiSearchAllPositionsCaseInsensitive
Like multiSearchAllPositions but ignores case.
Syntax
multiSearchAllPositionsCaseInsensitive(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array .
Returned value
- Array of the starting position in bytes and counting from 1 (if the substring was found).
- 0 if the substring was not found.
Example
Query:
SELECT multiSearchAllPositionsCaseInsensitive('ClickHouse',['c','h']);
Result:
["1","6"]
multiSearchAllPositionsUTF8
Like
multiSearchAllPositions
but assumes
haystack
and the
needle
substrings are UTF-8 encoded strings.
Syntax
multiSearchAllPositionsUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 encoded string in which the search is performed. String . -
needle
— UTF-8 encoded substrings to be searched. Array .
Returned value
- Array of the starting position in bytes and counting from 1 (if the substring was found).
- 0 if the substring was not found.
Example
Given
ClickHouse
as a UTF-8 string, find the positions of
C
(
\x43
) and
H
(
\x48
).
Query:
SELECT multiSearchAllPositionsUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x43','\x48']);
Result:
["1","6"]
multiSearchAllPositionsCaseInsensitiveUTF8
Like multiSearchAllPositionsUTF8 but ignores case.
Syntax
multiSearchAllPositionsCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 encoded string in which the search is performed. String . -
needle
— UTF-8 encoded substrings to be searched. Array .
Returned value
- Array of the starting position in bytes and counting from 1 (if the substring was found).
- 0 if the substring was not found.
Example
Given
ClickHouse
as a UTF-8 string, find the positions of
c
(
\x63
) and
h
(
\x68
).
Query:
SELECT multiSearchAllPositionsCaseInsensitiveUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x63','\x68']);
Result:
["1","6"]
multiSearchFirstPosition
Like
position
but returns the leftmost offset in a
haystack
string which matches any of multiple
needle
strings.
Functions
multiSearchFirstPositionCaseInsensitive
,
multiSearchFirstPositionUTF8
and
multiSearchFirstPositionCaseInsensitiveUTF8
provide case-insensitive and/or UTF-8 variants of this function.
Syntax
multiSearchFirstPosition(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array .
Returned value
-
Leftmost offset in a
haystack
string which matches any of multipleneedle
strings. - 0, if there was no match.
Example
Query:
SELECT multiSearchFirstPosition('Hello World',['llo', 'Wor', 'ld']);
Result:
3
multiSearchFirstPositionCaseInsensitive
Like
multiSearchFirstPosition
but ignores case.
Syntax
multiSearchFirstPositionCaseInsensitive(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Array of substrings to be searched. Array .
Returned value
-
Leftmost offset in a
haystack
string which matches any of multipleneedle
strings. - 0, if there was no match.
Example
Query:
SELECT multiSearchFirstPositionCaseInsensitive('HELLO WORLD',['wor', 'ld', 'ello']);
Result:
2
multiSearchFirstPositionUTF8
Like
multiSearchFirstPosition
but assumes
haystack
and
needle
to be UTF-8 strings.
Syntax
multiSearchFirstPositionUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 string in which the search is performed. String . -
needle
— Array of UTF-8 substrings to be searched. Array .
Returned value
-
Leftmost offset in a
haystack
string which matches any of multipleneedle
strings. - 0, if there was no match.
Example
Find the leftmost offset in UTF-8 string
hello world
which matches any of the given needles.
Query:
SELECT multiSearchFirstPositionUTF8('\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64',['wor', 'ld', 'ello']);
Result:
2
multiSearchFirstPositionCaseInsensitiveUTF8
Like
multiSearchFirstPosition
but assumes
haystack
and
needle
to be UTF-8 strings and ignores case.
Syntax
multiSearchFirstPositionCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 string in which the search is performed. String . -
needle
— Array of UTF-8 substrings to be searched. Array
Returned value
-
Leftmost offset in a
haystack
string which matches any of multipleneedle
strings, ignoring case. - 0, if there was no match.
Example
Find the leftmost offset in UTF-8 string
HELLO WORLD
which matches any of the given needles.
Query:
SELECT multiSearchFirstPositionCaseInsensitiveUTF8('\x48\x45\x4c\x4c\x4f\x20\x57\x4f\x52\x4c\x44',['wor', 'ld', 'ello']);
Result:
2
multiSearchFirstIndex
Returns the index
i
(starting from 1) of the leftmost found needle
i
in the string
haystack
and 0 otherwise.
Functions
multiSearchFirstIndexCaseInsensitive
,
multiSearchFirstIndexUTF8
and
multiSearchFirstIndexCaseInsensitiveUTF8
provide case-insensitive and/or UTF-8 variants of this function.
Syntax
multiSearchFirstIndex(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array .
Returned value
- index (starting from 1) of the leftmost found needle.
- 0, if there was no match.
Example
Query:
SELECT multiSearchFirstIndex('Hello World',['World','Hello']);
Result:
1
multiSearchFirstIndexCaseInsensitive
Returns the index
i
(starting from 1) of the leftmost found needle
i
in the string
haystack
and 0 otherwise. Ignores case.
Syntax
multiSearchFirstIndexCaseInsensitive(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array .
Returned value
- index (starting from 1) of the leftmost found needle.
- 0, if there was no match.
Example
Query:
SELECT multiSearchFirstIndexCaseInsensitive('hElLo WoRlD',['World','Hello']);
Result:
1
multiSearchFirstIndexUTF8
Returns the index
i
(starting from 1) of the leftmost found needle
i
in the string
haystack
and 0 otherwise. Assumes
haystack
and
needle
are UTF-8 encoded strings.
Syntax
multiSearchFirstIndexUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 string in which the search is performed. String . -
needle
— Array of UTF-8 substrings to be searched. Array
Returned value
- index (starting from 1) of the leftmost found needle.
- 0, if there was no match.
Example
Given
Hello World
as a UTF-8 string, find the first index of UTF-8 strings
Hello
and
World
.
Query:
SELECT multiSearchFirstIndexUTF8('\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64',['\x57\x6f\x72\x6c\x64','\x48\x65\x6c\x6c\x6f']);
Result:
1
multiSearchFirstIndexCaseInsensitiveUTF8
Returns the index
i
(starting from 1) of the leftmost found needle
i
in the string
haystack
and 0 otherwise. Assumes
haystack
and
needle
are UTF-8 encoded strings. Ignores case.
Syntax
multiSearchFirstIndexCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 string in which the search is performed. String . -
needle
— Array of UTF-8 substrings to be searched. Array .
Returned value
- index (starting from 1) of the leftmost found needle.
- 0, if there was no match.
Example
Given
HELLO WORLD
as a UTF-8 string, find the first index of UTF-8 strings
hello
and
world
.
Query:
SELECT multiSearchFirstIndexCaseInsensitiveUTF8('\x48\x45\x4c\x4c\x4f\x20\x57\x4f\x52\x4c\x44',['\x68\x65\x6c\x6c\x6f','\x77\x6f\x72\x6c\x64']);
Result:
1
multiSearchAny
Returns 1, if at least one string needle
i
matches the string
haystack
and 0 otherwise.
Functions
multiSearchAnyCaseInsensitive
,
multiSearchAnyUTF8
and []
multiSearchAnyCaseInsensitiveUTF8
](#multiSearchAnyCaseInsensitiveUTF8) provide case-insensitive and/or UTF-8 variants of this function.
Syntax
multiSearchAny(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array .
Returned value
- 1, if there was at least one match.
- 0, if there was not at least one match.
Example
Query:
SELECT multiSearchAny('ClickHouse',['C','H']);
Result:
1
multiSearchAnyCaseInsensitive
Like multiSearchAny but ignores case.
Syntax
multiSearchAnyCaseInsensitive(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— String in which the search is performed. String . -
needle
— Substrings to be searched. Array
Returned value
- 1, if there was at least one case-insensitive match.
- 0, if there was not at least one case-insensitive match.
Example
Query:
SELECT multiSearchAnyCaseInsensitive('ClickHouse',['c','h']);
Result:
1
multiSearchAnyUTF8
Like
multiSearchAny
but assumes
haystack
and the
needle
substrings are UTF-8 encoded strings.
*Syntax**
multiSearchAnyUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 string in which the search is performed. String . -
needle
— UTF-8 substrings to be searched. Array .
Returned value
- 1, if there was at least one match.
- 0, if there was not at least one match.
Example
Given
ClickHouse
as a UTF-8 string, check if there are any
C
('\x43') or
H
('\x48') letters in the word.
Query:
SELECT multiSearchAnyUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x43','\x48']);
Result:
1
multiSearchAnyCaseInsensitiveUTF8
Like multiSearchAnyUTF8 but ignores case.
*Syntax**
multiSearchAnyCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])
Parameters
-
haystack
— UTF-8 string in which the search is performed. String . -
needle
— UTF-8 substrings to be searched. Array
Returned value
- 1, if there was at least one case-insensitive match.
- 0, if there was not at least one case-insensitive match.
Example
Given
ClickHouse
as a UTF-8 string, check if there is any letter
h
(
\x68
) in the word, ignoring case.
Query:
SELECT multiSearchAnyCaseInsensitiveUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x68']);
Result:
1
match
Returns whether string
haystack
matches the regular expression
pattern
in
re2 regular syntax
.
Matching is based on UTF-8, e.g.
.
matches the Unicode code point
¥
which is represented in UTF-8 using two bytes. The regular
expression must not contain null bytes. If the haystack or the pattern are not valid UTF-8, then the behavior is undefined.
Unlike re2's default behavior,
.
matches line breaks. To disable this, prepend the pattern with
(?-s)
.
If you only want to search substrings in a string, you can use functions like or position instead - they work much faster than this function.
Syntax
match(haystack, pattern)
Alias:
haystack REGEXP pattern operator
multiMatchAny
Like
match
but returns 1 if at least one of the patterns match and 0 otherwise.
Functions in the
multi[Fuzzy]Match*()
family use the the (Vectorscan)
[https://github.com/VectorCamp/vectorscan]
library. As such, they are only enabled if ClickHouse is compiled with support for vectorscan.
To turn off all functions that use hyperscan, use setting
SET allow_hyperscan = 0;
.
Due to restrictions of vectorscan, the length of the
haystack
string must be less than 2
32
bytes.
Hyperscan is generally vulnerable to regular expression denial of service (ReDoS) attacks (e.g. see (here) [https://www.usenix.org/conference/usenixsecurity22/presentation/turonova] , (here) [https://doi.org/10.1007/s10664-021-10033-1] and (here) [https://doi.org/10.1145/3236024.3236027] . Users are adviced to check the provided patterns carefully.
If you only want to search multiple substrings in a string, you can use function multiSearchAny instead - it works much faster than this function.
Syntax
multiMatchAny(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
multiMatchAnyIndex
Like
multiMatchAny
but returns any index that matches the haystack.
Syntax
multiMatchAnyIndex(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
multiMatchAllIndices
Like
multiMatchAny
but returns the array of all indices that match the haystack in any order.
Syntax
multiMatchAllIndices(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
multiFuzzyMatchAny
Like
multiMatchAny
but returns 1 if any pattern matches the haystack within a constant
edit distance
. This function relies on the experimental feature of
hyperscan
library, and can be slow for some corner cases. The performance depends on the edit distance value and patterns used, but it's always more expensive compared to a non-fuzzy variants.
multiFuzzyMatch*()
function family do not support UTF-8 regular expressions (it threats them as a sequence of bytes) due to restrictions of hyperscan.
Syntax
multiFuzzyMatchAny(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
multiFuzzyMatchAnyIndex
Like
multiFuzzyMatchAny
but returns any index that matches the haystack within a constant edit distance.
Syntax
multiFuzzyMatchAnyIndex(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
multiFuzzyMatchAllIndices
Like
multiFuzzyMatchAny
but returns the array of all indices in any order that match the haystack within a constant edit distance.
Syntax
multiFuzzyMatchAllIndices(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
extract
Extracts a fragment of a string using a regular expression. If
haystack
does not match the
pattern
regex, an empty string is returned.
For regex without subpatterns, the function uses the fragment that matches the entire regex. Otherwise, it uses the fragment that matches the first subpattern.
Syntax
extract(haystack, pattern)
extractAll
Extracts all fragments of a string using a regular expression. If
haystack
does not match the
pattern
regex, an empty string is returned.
Returns an array of strings consisting of all matches of the regex.
The behavior with respect to subpatterns is the same as in function
extract
.
Syntax
extractAll(haystack, pattern)
extractAllGroupsHorizontal
Matches all groups of the
haystack
string using the
pattern
regular expression. Returns an array of arrays, where the first array includes all fragments matching the first group, the second array - matching the second group, etc.
This function is slower than extractAllGroupsVertical .
Syntax
extractAllGroupsHorizontal(haystack, pattern)
Arguments
-
haystack
— Input string. Type: String . -
pattern
— Regular expression with re2 syntax . Must contain groups, each group enclosed in parentheses. Ifpattern
contains no groups, an exception is thrown. Type: String .
Returned value
- Type: Array .
If
haystack
does not match the
pattern
regex, an array of empty arrays is returned.
Example
SELECT extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');
Result:
┌─extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','def','ghi'],['111','222','333']] │
└──────────────────────────────────────────────────────────────────────────────────────────┘
extractAllGroupsVertical
Matches all groups of the
haystack
string using the
pattern
regular expression. Returns an array of arrays, where each array includes matching fragments from every group. Fragments are grouped in order of appearance in the
haystack
.
Syntax
extractAllGroupsVertical(haystack, pattern)
Arguments
-
haystack
— Input string. Type: String . -
pattern
— Regular expression with re2 syntax . Must contain groups, each group enclosed in parentheses. Ifpattern
contains no groups, an exception is thrown. Type: String .
Returned value
- Type: Array .
If
haystack
does not match the
pattern
regex, an empty array is returned.
Example
SELECT extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');
Result:
┌─extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','111'],['def','222'],['ghi','333']] │
└────────────────────────────────────────────────────────────────────────────────────────┘
like
Returns whether string
haystack
matches the LIKE expression
pattern
.
A LIKE expression can contain normal characters and the following metasymbols:
-
%
indicates an arbitrary number of arbitrary characters (including zero characters). -
_
indicates a single arbitrary character. -
\
is for escaping literals%
,_
and\
.
Matching is based on UTF-8, e.g.
_
matches the Unicode code point
¥
which is represented in UTF-8 using two bytes.
If the haystack or the LIKE expression are not valid UTF-8, the behavior is undefined.
No automatic Unicode normalization is performed, you can use the normalizeUTF8*() functions for that.
To match against literal
%
,
_
and
\
(which are LIKE metacharacters), prepend them with a backslash:
\%
,
\_
and
\\
.
The backslash loses its special meaning (i.e. is interpreted literally) if it prepends a character different than
%
,
_
or
\
.
Note that ClickHouse requires backslashes in strings
to be quoted as well
, so you would actually need to write
\\%
,
\\_
and
\\\\
.
For LIKE expressions of the form
%needle%
, the function is as fast as the
position
function.