Functions for Searching in Strings

All functions in this section search case-sensitively by default. Case-insensitive search is usually provided by separate function variants.

note

Case-insensitive search follows the lowercase-uppercase rules of the English language. E.g. Uppercased i in the English language is I whereas in the Turkish language it is İ - results for languages other than English may be unexpected.

Functions in this section also assume that the searched string (referred to in this section as haystack ) and the search string (referred to in this section as needle ) are single-byte encoded text. If this assumption is violated, no exception is thrown and results are undefined. Search with UTF-8 encoded strings is usually provided by separate function variants. Likewise, if a UTF-8 function variant is used and the input strings are not UTF-8 encoded text, no exception is thrown and the results are undefined. Note that no automatic Unicode normalization is performed, however you can use the normalizeUTF8*() functions for that.

General strings functions and functions for replacing in strings are described separately.

position

Returns the position (in bytes, starting at 1) of a substring needle in a string haystack .

Syntax

position(haystack, needle[, start_pos])

Alias:

position(needle IN haystack)

Arguments

haystack — String in which the search is performed. String .
needle — Substring to be searched. String .
start_pos – Position (1-based) in haystack at which the search starts. UInt . Optional.

Returned values

Starting position in bytes and counting from 1, if the substring was found.
0, if the substring was not found.

If substring needle is empty, these rules apply:

if no start_pos was specified: return 1
if start_pos = 0 : return 1
if start_pos >= 1 and start_pos <= length(haystack) + 1 : return start_pos
otherwise: return 0

The same rules also apply to functions locate , positionCaseInsensitive , positionUTF8 and positionCaseInsensitiveUTF8 .

Type: Integer .

Examples

Query:

SELECT position('Hello, world!', '!');

Result:

┌─position('Hello, world!', '!')─┐
│                             13 │
└────────────────────────────────┘

Example with start_pos argument:

Query:

SELECT
    position('Hello, world!', 'o', 1),
    position('Hello, world!', 'o', 7)

Result:

┌─position('Hello, world!', 'o', 1)─┬─position('Hello, world!', 'o', 7)─┐
│                                 5 │                                 9 │
└───────────────────────────────────┴───────────────────────────────────┘

Example for needle IN haystack syntax:

Query:

SELECT 6 = position('/' IN s) FROM (SELECT 'Hello/World' AS s);

Result:

┌─equals(6, position(s, '/'))─┐
│                           1 │
└─────────────────────────────┘

Examples with empty needle substring:

Query:

SELECT
    position('abc', ''),
    position('abc', '', 0),
    position('abc', '', 1),
    position('abc', '', 2),
    position('abc', '', 3),
    position('abc', '', 4),
    position('abc', '', 5)

            
           

Result:

┌─position('abc', '')─┬─position('abc', '', 0)─┬─position('abc', '', 1)─┬─position('abc', '', 2)─┬─position('abc', '', 3)─┬─position('abc', '', 4)─┬─position('abc', '', 5)─┐
│                   1 │                      1 │                      1 │                      2 │                      3 │                      4 │                      0 │
└─────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┘

locate

Like position but with arguments haystack and locate switched.

The behavior of this function depends on the ClickHouse version:

in versions < v24.3, locate was an alias of function position and accepted arguments (haystack, needle[, start_pos]) .
in versions >= 24.3,, locate is an individual function (for better compatibility with MySQL) and accepts arguments (needle, haystack[, start_pos]) . The previous behavior can be restored using setting function_locate_has_mysql_compatible_argument_order = false ;

Syntax

locate(needle, haystack[, start_pos])

positionCaseInsensitive

A case insensitive invariant of position .

Example

Query:

SELECT position('Hello, world!', 'hello');

Result:

┌─position('Hello, world!', 'hello')─┐
│                                  0 │
└────────────────────────────────────┘

positionUTF8

Like position but assumes haystack and needle are UTF-8 encoded strings.

Examples

Function positionUTF8 correctly counts character ö (represented by two points) as a single Unicode codepoint:

Query:

SELECT positionUTF8('Motörhead', 'r');

Result:

┌─position('Motörhead', 'r')─┐
│                          5 │
└────────────────────────────┘

positionCaseInsensitiveUTF8

Like positionUTF8 but searches case-insensitively.

multiSearchAllPositions

Like position but returns an array of positions (in bytes, starting at 1) for multiple needle substrings in a haystack string.

note

All multiSearch*() functions only support up to 2 ⁸ needles.

Syntax

multiSearchAllPositions(haystack, [needle1, needle2, ..., needleN])

Arguments

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array .

Returned values

Array of the starting position in bytes and counting from 1, if the substring was found.
0, if the substring was not found.

Example

Query:

SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world']);

Result:

┌─multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])─┐
│ [0,13,0]                                                          │
└───────────────────────────────────────────────────────────────────┘

multiSearchAllPositionsCaseInsensitive

Like multiSearchAllPositions but ignores case.

Syntax

multiSearchAllPositionsCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array .

Returned value

Array of the starting position in bytes and counting from 1 (if the substring was found).
0 if the substring was not found.

Example

Query:

SELECT multiSearchAllPositionsCaseInsensitive('ClickHouse',['c','h']);

Result:

["1","6"]

multiSearchAllPositionsUTF8

Like multiSearchAllPositions but assumes haystack and the needle substrings are UTF-8 encoded strings.

Syntax

multiSearchAllPositionsUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 encoded string in which the search is performed. String .
needle — UTF-8 encoded substrings to be searched. Array .

Returned value

Array of the starting position in bytes and counting from 1 (if the substring was found).
0 if the substring was not found.

Example

Given ClickHouse as a UTF-8 string, find the positions of C ( \x43 ) and H ( \x48 ).

Query:

SELECT multiSearchAllPositionsUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x43','\x48']);

Result:

["1","6"]

multiSearchAllPositionsCaseInsensitiveUTF8

Like multiSearchAllPositionsUTF8 but ignores case.

Syntax

multiSearchAllPositionsCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 encoded string in which the search is performed. String .
needle — UTF-8 encoded substrings to be searched. Array .

Returned value

Array of the starting position in bytes and counting from 1 (if the substring was found).
0 if the substring was not found.

Example

Given ClickHouse as a UTF-8 string, find the positions of c ( \x63 ) and h ( \x68 ).

Query:

SELECT multiSearchAllPositionsCaseInsensitiveUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x63','\x68']);

Result:

["1","6"]

multiSearchFirstPosition

Like position but returns the leftmost offset in a haystack string which matches any of multiple needle strings.

Functions multiSearchFirstPositionCaseInsensitive , multiSearchFirstPositionUTF8 and multiSearchFirstPositionCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

multiSearchFirstPosition(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array .

Returned value

Leftmost offset in a haystack string which matches any of multiple needle strings.
0, if there was no match.

Example

Query:

SELECT multiSearchFirstPosition('Hello World',['llo', 'Wor', 'ld']);

Result:

multiSearchFirstPositionCaseInsensitive

Like multiSearchFirstPosition but ignores case.

Syntax

multiSearchFirstPositionCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Array of substrings to be searched. Array .

Returned value

Leftmost offset in a haystack string which matches any of multiple needle strings.
0, if there was no match.

Example

Query:

SELECT multiSearchFirstPositionCaseInsensitive('HELLO WORLD',['wor', 'ld', 'ello']);

Result:

multiSearchFirstPositionUTF8

Like multiSearchFirstPosition but assumes haystack and needle to be UTF-8 strings.

Syntax

multiSearchFirstPositionUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 string in which the search is performed. String .
needle — Array of UTF-8 substrings to be searched. Array .

Returned value

Leftmost offset in a haystack string which matches any of multiple needle strings.
0, if there was no match.

Example

Find the leftmost offset in UTF-8 string hello world which matches any of the given needles.

Query:

SELECT multiSearchFirstPositionUTF8('\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64',['wor', 'ld', 'ello']);

Result:

multiSearchFirstPositionCaseInsensitiveUTF8

Like multiSearchFirstPosition but assumes haystack and needle to be UTF-8 strings and ignores case.

Syntax

multiSearchFirstPositionCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 string in which the search is performed. String .
needle — Array of UTF-8 substrings to be searched. Array

Returned value

Leftmost offset in a haystack string which matches any of multiple needle strings, ignoring case.
0, if there was no match.

Example

Find the leftmost offset in UTF-8 string HELLO WORLD which matches any of the given needles.

Query:

SELECT multiSearchFirstPositionCaseInsensitiveUTF8('\x48\x45\x4c\x4c\x4f\x20\x57\x4f\x52\x4c\x44',['wor', 'ld', 'ello']);

Result:

multiSearchFirstIndex

Returns the index i (starting from 1) of the leftmost found needle _i in the string haystack and 0 otherwise.

Functions multiSearchFirstIndexCaseInsensitive , multiSearchFirstIndexUTF8 and multiSearchFirstIndexCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

multiSearchFirstIndex(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array .

Returned value

index (starting from 1) of the leftmost found needle.
0, if there was no match.

Example

Query:

SELECT multiSearchFirstIndex('Hello World',['World','Hello']);

Result:

multiSearchFirstIndexCaseInsensitive

Returns the index i (starting from 1) of the leftmost found needle _i in the string haystack and 0 otherwise. Ignores case.

Syntax

multiSearchFirstIndexCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array .

Returned value

index (starting from 1) of the leftmost found needle.
0, if there was no match.

Example

Query:

SELECT multiSearchFirstIndexCaseInsensitive('hElLo WoRlD',['World','Hello']);

Result:

multiSearchFirstIndexUTF8

Returns the index i (starting from 1) of the leftmost found needle _i in the string haystack and 0 otherwise. Assumes haystack and needle are UTF-8 encoded strings.

Syntax

multiSearchFirstIndexUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 string in which the search is performed. String .
needle — Array of UTF-8 substrings to be searched. Array

Returned value

index (starting from 1) of the leftmost found needle.
0, if there was no match.

Example

Given Hello World as a UTF-8 string, find the first index of UTF-8 strings Hello and World .

Query:

SELECT multiSearchFirstIndexUTF8('\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64',['\x57\x6f\x72\x6c\x64','\x48\x65\x6c\x6c\x6f']);

Result:

multiSearchFirstIndexCaseInsensitiveUTF8

Returns the index i (starting from 1) of the leftmost found needle _i in the string haystack and 0 otherwise. Assumes haystack and needle are UTF-8 encoded strings. Ignores case.

Syntax

multiSearchFirstIndexCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 string in which the search is performed. String .
needle — Array of UTF-8 substrings to be searched. Array .

Returned value

index (starting from 1) of the leftmost found needle.
0, if there was no match.

Example

Given HELLO WORLD as a UTF-8 string, find the first index of UTF-8 strings hello and world .

Query:

SELECT multiSearchFirstIndexCaseInsensitiveUTF8('\x48\x45\x4c\x4c\x4f\x20\x57\x4f\x52\x4c\x44',['\x68\x65\x6c\x6c\x6f','\x77\x6f\x72\x6c\x64']);

Result:

multiSearchAny

Returns 1, if at least one string needle _i matches the string haystack and 0 otherwise.

Functions multiSearchAnyCaseInsensitive , multiSearchAnyUTF8 and [] multiSearchAnyCaseInsensitiveUTF8 ](#multiSearchAnyCaseInsensitiveUTF8) provide case-insensitive and/or UTF-8 variants of this function.

Syntax

multiSearchAny(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array .

Returned value

1, if there was at least one match.
0, if there was not at least one match.

Example

Query:

SELECT multiSearchAny('ClickHouse',['C','H']);

Result:

multiSearchAnyCaseInsensitive

Like multiSearchAny but ignores case.

Syntax

multiSearchAnyCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — String in which the search is performed. String .
needle — Substrings to be searched. Array

Returned value

1, if there was at least one case-insensitive match.
0, if there was not at least one case-insensitive match.

Example

Query:

SELECT multiSearchAnyCaseInsensitive('ClickHouse',['c','h']);

Result:

multiSearchAnyUTF8

Like multiSearchAny but assumes haystack and the needle substrings are UTF-8 encoded strings.

*Syntax**

multiSearchAnyUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 string in which the search is performed. String .
needle — UTF-8 substrings to be searched. Array .

Returned value

1, if there was at least one match.
0, if there was not at least one match.

Example

Given ClickHouse as a UTF-8 string, check if there are any C ('\x43') or H ('\x48') letters in the word.

Query:

SELECT multiSearchAnyUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x43','\x48']);

Result:

multiSearchAnyCaseInsensitiveUTF8

Like multiSearchAnyUTF8 but ignores case.

*Syntax**

multiSearchAnyCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Parameters

haystack — UTF-8 string in which the search is performed. String .
needle — UTF-8 substrings to be searched. Array

Returned value

1, if there was at least one case-insensitive match.
0, if there was not at least one case-insensitive match.

Example

Given ClickHouse as a UTF-8 string, check if there is any letter h ( \x68 ) in the word, ignoring case.

Query:

SELECT multiSearchAnyCaseInsensitiveUTF8('\x43\x6c\x69\x63\x6b\x48\x6f\x75\x73\x65',['\x68']);

Result:

match

Returns whether string haystack matches the regular expression pattern in re2 regular syntax .

Matching is based on UTF-8, e.g. . matches the Unicode code point ¥ which is represented in UTF-8 using two bytes. The regular expression must not contain null bytes. If the haystack or the pattern are not valid UTF-8, then the behavior is undefined.

Unlike re2's default behavior, . matches line breaks. To disable this, prepend the pattern with (?-s) .

If you only want to search substrings in a string, you can use functions like or position instead - they work much faster than this function.

Syntax

match(haystack, pattern)

Alias: haystack REGEXP pattern operator

multiMatchAny

Like match but returns 1 if at least one of the patterns match and 0 otherwise.

note

Functions in the multi[Fuzzy]Match*() family use the the (Vectorscan) [https://github.com/VectorCamp/vectorscan] library. As such, they are only enabled if ClickHouse is compiled with support for vectorscan.

To turn off all functions that use hyperscan, use setting SET allow_hyperscan = 0; .

Due to restrictions of vectorscan, the length of the haystack string must be less than 2 ³² bytes.

Hyperscan is generally vulnerable to regular expression denial of service (ReDoS) attacks (e.g. see (here) [https://www.usenix.org/conference/usenixsecurity22/presentation/turonova] , (here) [https://doi.org/10.1007/s10664-021-10033-1] and (here) [https://doi.org/10.1145/3236024.3236027] . Users are adviced to check the provided patterns carefully.

If you only want to search multiple substrings in a string, you can use function multiSearchAny instead - it works much faster than this function.

Syntax

multiMatchAny(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])

multiMatchAnyIndex

Like multiMatchAny but returns any index that matches the haystack.

Syntax

multiMatchAnyIndex(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])

multiMatchAllIndices

Like multiMatchAny but returns the array of all indices that match the haystack in any order.

Syntax

multiMatchAllIndices(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])

multiFuzzyMatchAny

Like multiMatchAny but returns 1 if any pattern matches the haystack within a constant edit distance . This function relies on the experimental feature of hyperscan library, and can be slow for some corner cases. The performance depends on the edit distance value and patterns used, but it's always more expensive compared to a non-fuzzy variants.

note

multiFuzzyMatch*() function family do not support UTF-8 regular expressions (it threats them as a sequence of bytes) due to restrictions of hyperscan.

Syntax

multiFuzzyMatchAny(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])

multiFuzzyMatchAnyIndex

Like multiFuzzyMatchAny but returns any index that matches the haystack within a constant edit distance.

Syntax

multiFuzzyMatchAnyIndex(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])

multiFuzzyMatchAllIndices

Like multiFuzzyMatchAny but returns the array of all indices in any order that match the haystack within a constant edit distance.

Syntax

multiFuzzyMatchAllIndices(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])

extract

Extracts a fragment of a string using a regular expression. If haystack does not match the pattern regex, an empty string is returned.

For regex without subpatterns, the function uses the fragment that matches the entire regex. Otherwise, it uses the fragment that matches the first subpattern.

Syntax

extract(haystack, pattern)

extractAll

Extracts all fragments of a string using a regular expression. If haystack does not match the pattern regex, an empty string is returned.

Returns an array of strings consisting of all matches of the regex.

The behavior with respect to subpatterns is the same as in function extract .

Syntax

extractAll(haystack, pattern)

extractAllGroupsHorizontal

Matches all groups of the haystack string using the pattern regular expression. Returns an array of arrays, where the first array includes all fragments matching the first group, the second array - matching the second group, etc.

This function is slower than extractAllGroupsVertical .

Syntax

extractAllGroupsHorizontal(haystack, pattern)

Arguments

haystack — Input string. Type: String .
pattern — Regular expression with re2 syntax . Must contain groups, each group enclosed in parentheses. If pattern contains no groups, an exception is thrown. Type: String .

Returned value

Type: Array .

If haystack does not match the pattern regex, an array of empty arrays is returned.

Example

SELECT extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');

Result:

┌─extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','def','ghi'],['111','222','333']]                                                │
└──────────────────────────────────────────────────────────────────────────────────────────┘

extractAllGroupsVertical

Matches all groups of the haystack string using the pattern regular expression. Returns an array of arrays, where each array includes matching fragments from every group. Fragments are grouped in order of appearance in the haystack .

Syntax

extractAllGroupsVertical(haystack, pattern)

Arguments

haystack — Input string. Type: String .
pattern — Regular expression with re2 syntax . Must contain groups, each group enclosed in parentheses. If pattern contains no groups, an exception is thrown. Type: String .

Returned value

Type: Array .

If haystack does not match the pattern regex, an empty array is returned.

Example

SELECT extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');

Result:

┌─extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','111'],['def','222'],['ghi','333']]                                            │
└────────────────────────────────────────────────────────────────────────────────────────┘

like

Returns whether string haystack matches the LIKE expression pattern .

A LIKE expression can contain normal characters and the following metasymbols:

% indicates an arbitrary number of arbitrary characters (including zero characters).
_ indicates a single arbitrary character.
\ is for escaping literals % , _ and \ .

Matching is based on UTF-8, e.g. _ matches the Unicode code point ¥ which is represented in UTF-8 using two bytes.

If the haystack or the LIKE expression are not valid UTF-8, the behavior is undefined.

No automatic Unicode normalization is performed, you can use the normalizeUTF8*() functions for that.

To match against literal % , _ and \ (which are LIKE metacharacters), prepend them with a backslash: \% , \_ and \\ . The backslash loses its special meaning (i.e. is interpreted literally) if it prepends a character different than % , _ or \ . Note that ClickHouse requires backslashes in strings to be quoted as well , so you would actually need to write \\% , \\_ and \\\\ .

For LIKE expressions of the form %needle% , the function is as fast as the position function.

Functions for Searching in Strings

position ​

locate ​

positionCaseInsensitive ​

positionUTF8 ​

positionCaseInsensitiveUTF8 ​

multiSearchAllPositions ​

multiSearchAllPositionsCaseInsensitive ​

multiSearchAllPositionsUTF8 ​

multiSearchAllPositionsCaseInsensitiveUTF8 ​

multiSearchFirstPosition ​

multiSearchFirstPositionCaseInsensitive ​

multiSearchFirstPositionUTF8 ​

multiSearchFirstPositionCaseInsensitiveUTF8 ​

multiSearchFirstIndex ​

multiSearchFirstIndexCaseInsensitive ​

multiSearchFirstIndexUTF8 ​

multiSearchFirstIndexCaseInsensitiveUTF8 ​

multiSearchAny ​

multiSearchAnyCaseInsensitive ​

multiSearchAnyUTF8 ​

multiSearchAnyCaseInsensitiveUTF8 ​

match ​

multiMatchAny ​

multiMatchAnyIndex ​

multiMatchAllIndices ​

multiFuzzyMatchAny ​

multiFuzzyMatchAnyIndex ​

multiFuzzyMatchAllIndices ​

extract ​

extractAll ​

extractAllGroupsHorizontal ​

extractAllGroupsVertical ​

like ​

position

locate

positionCaseInsensitive

positionUTF8

positionCaseInsensitiveUTF8

multiSearchAllPositions

multiSearchAllPositionsCaseInsensitive

multiSearchAllPositionsUTF8

multiSearchAllPositionsCaseInsensitiveUTF8

multiSearchFirstPosition

multiSearchFirstPositionCaseInsensitive

multiSearchFirstPositionUTF8

multiSearchFirstPositionCaseInsensitiveUTF8

multiSearchFirstIndex

multiSearchFirstIndexCaseInsensitive

multiSearchFirstIndexUTF8

multiSearchFirstIndexCaseInsensitiveUTF8

multiSearchAny

multiSearchAnyCaseInsensitive

multiSearchAnyUTF8

multiSearchAnyCaseInsensitiveUTF8

match

multiMatchAny

multiMatchAnyIndex

multiMatchAllIndices

multiFuzzyMatchAny

multiFuzzyMatchAnyIndex

multiFuzzyMatchAllIndices

extract

extractAll

extractAllGroupsHorizontal

extractAllGroupsVertical

like