Multi-value dimensions
Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an
array of values instead of a single value, such as the
tags
values in the following JSON array example:
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
It is important to be aware that multi-value dimensions are distinct from array types . While array types behave like standard SQL arrays, multi-value dimensions do not. This document describes the behavior of multi-value dimensions, and some additional details can be found in the SQL data type documentation .
This document describes inserting, filtering, and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see segments documentation . Examples in this document are in the form of both SQL and native Druid queries . Refer to the Druid SQL documentation for details about the functions available for using multi-value string dimensions in SQL.
The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes a multi-value dimension,
tags
.
{"timestamp": "2011-01-12T00:00:00.000Z", "label": "row1", "tags": ["t1","t2","t3"]}
{"timestamp": "2011-01-13T00:00:00.000Z", "label": "row2", "tags": ["t3","t4","t5"]}
{"timestamp": "2011-01-14T00:00:00.000Z", "label": "row3", "tags": ["t5","t6","t7"]}
{"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []}
Ingestion
Native batch and streaming ingestion
When using native
batch
or streaming ingestion such as with
Apache Kafka
, the Druid web console data loader can detect multi-value dimensions and configure the
dimensionsSpec
accordingly.
For TSV or CSV data, you can specify the multi-value delimiters using the
listDelimiter
field in the
inputFormat
. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require
inputFormat
configuration.
The following shows an example
dimensionsSpec
for native ingestion of the data used in this document:
"dimensions": [
{
"type": "string",
"name": "label"
},
{
"type": "string",
"name": "tags",
"multiValueHandling": "SORTED_ARRAY",
"createBitmapIndex": true
}
],
By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the
SORTED_ARRAY
value of the
multiValueHandling
field. Alternatively, you can specify multi-value handling as:
-
SORTED_SET
: results in the removal of duplicate values -
ARRAY
: retains the original order of the values
See Dimension Objects for information on configuring multi-value handling.
SQL-based ingestion
Multi-value dimensions can also be inserted with
SQL-based ingestion
. The functions
MV_TO_ARRAY
and
ARRAY_TO_MV
can assist in converting
VARCHAR
to
VARCHAR ARRAY
and
VARCHAR ARRAY
into
VARCHAR
respectively.
multiValueHandling
is not available when using the multi-stage query engine to insert data.
For example, to insert the data used in this document:
REPLACE INTO "mvd_example" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
'{"type":"json"}',
'[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
)
)
)
SELECT
TIME_PARSE("timestamp") AS "__time",
"label",
ARRAY_TO_MV("tags") AS "tags"
FROM "ext"
PARTITIONED BY DAY
SQL-based ingestion with rollup
These input arrays can also be grouped prior to converting into a multi-value dimension:
REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
'{"type":"json"}',
'[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
)
)
)
SELECT
TIME_PARSE("timestamp") AS "__time",
"label",
ARRAY_TO_MV("tags") AS "tags",
COUNT(*) AS "count"
FROM "ext"
GROUP BY 1, 2, "tags"
PARTITIONED BY DAY
Notice that
ARRAY_TO_MV
is not present in the
GROUP BY
clause since we only wish to coerce the type
after
grouping.
The
EXTERN
is also able to refer to the
tags
input type as
VARCHAR
, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the
tags
column. If this is the case you must use
MV_TO_ARRAY
since the multi-stage query engine only supports grouping on multi-value dimensions as arrays. So, they must be coerced first. These arrays must then be coerced back into
VARCHAR
in the
SELECT
part of the statement with
ARRAY_TO_MV
.
REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
'{"type":"json"}'
)
) EXTEND ("timestamp" VARCHAR, "label" VARCHAR, "tags" VARCHAR)
)
SELECT
TIME_PARSE("timestamp") AS "__time",
"label",
ARRAY_TO_MV(MV_TO_ARRAY("tags")) AS "tags",
COUNT(*) AS "count"
FROM "ext"
GROUP BY 1, 2, MV_TO_ARRAY("tags")
PARTITIONED BY DAY
Querying multi-value dimensions
Filtering
All query types, as well as filtered aggregators , can filter on multi-value dimensions. Filters follow these rules on multi-value dimensions:
- Value filters (like "selector", "bound", and "in") match a row if any of the values of a multi-value dimension match the filter.
- The Column Comparison filter will match a row if the dimensions have any overlap.
-
Value filters that match
null
or""
(empty string) will match empty cells in a multi-value dimension. - Logical expression filters behave the same way they do on single-value dimensions: "and" matches a row if all underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row if the underlying filter does not match the row.
The following example illustrates these rules. This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3:
SELECT *
FROM "mvd_example_rollup"
WHERE tags = 't1' OR tags = 't3'
returns
{"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1}
{"__time":"2011-01-13T00:00:00.000Z","label":"row2","tags":"[\"t3\",\"t4\",\"t5\"]","count":1}
Native queries can also perform filtering that would be considered a "contradiction" in SQL, such as this "and" filter which would match only row1 of the dataset above:
{
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "tags",
"value": "t1"
},
{
"type": "selector",
"dimension": "tags",
"value": "t3"
}
]
}
which returns
{"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1}
Multi-value dimensions also consider an empty row as
null
, consider:
SELECT *
FROM "mvd_example_rollup"
WHERE tags is null
which results in:
{"__time":"2011-01-14T00:00:00.000Z","label":"row4","tags":null,"count":1}
Grouping
When grouping on a multi-value dimension with SQL or a native
topN
or
groupBy
queries,
all
values
from matching rows will be used to generate one group per value. This behaves similarly to an implicit SQL
UNNEST
operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the
dimension
tags
with filter
"t1" AND "t3"
would match only row1, and generate a result with three groups:
t1
,
t2
, and
t3
.
If you only need to include values that match your filter, you can use the SQL functions
MV_FILTER_ONLY
/
MV_FILTER_NONE
,
filtered virtual column
, or
filtered dimensionSpec
. This can also improve performance.
Example: SQL grouping query with no filtering
SELECT label, tags
FROM "mvd_example_rollup"
GROUP BY 1,2
results in:
{"label":"row1","tags":"t1"}
{"label":"row1","tags":"t2"}
{"label":"row1","tags":"t3"}
{"label":"row2","tags":"t3"}
{"label":"row2","tags":"t4"}
{"label":"row2","tags":"t5"}
{"label":"row3","tags":"t5"}
{"label":"row3","tags":"t6"}
{"label":"row3","tags":"t7"}
{"label":"row4","tags":null}
Example: SQL grouping query with a filter
SELECT label, tags
FROM "mvd_example_rollup"
WHERE tags = 't3'
GROUP BY 1,2
results:
{"label":"row1","tags":"t1"}
{"label":"row1","tags":"t2"}
{"label":"row1","tags":"t3"}
{"label":"row2","tags":"t3"}
{"label":"row2","tags":"t4"}
{"label":"row2","tags":"t5"}
Example: native GroupBy query with no filtering
See GroupBy querying for details.
{
"queryType": "groupBy",
"dataSource": "test",
"intervals": [
"1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
],
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "tags",
"outputName": "tags"
}
],
"aggregations": [
{
"type": "count",
"name": "count"
}
]
}
This query returns the following result:
[
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t1"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t2"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 2,
"tags": "t3"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t4"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 2,
"tags": "t5"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t6"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t7"
}
}
]
Notice that original rows are "exploded" into multiple rows and merged.
Example: native GroupBy query with a selector query filter
See query filters for details of selector query filter.
{
"queryType": "groupBy",
"dataSource": "test",
"intervals": [
"1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
],
"filter": {
"type": "selector",
"dimension": "tags",
"value": "t3"
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "tags",
"outputName": "tags"
}
],
"aggregations": [
{
"type": "count",
"name": "count"
}
]
}
This query returns the following result:
[
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t1"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t2"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 2,
"tags": "t3"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t4"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t5"
}
}
]
You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is applied on the row before explosion. For multi-value dimensions, a filter for value "t3" would match row1 and row2, after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside the multiple values matches the query filter.
Example: native GroupBy query with selector query and dimension filters
To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below.
See filtered
dimensionSpecs
in
dimensionSpecs
for details.
{
"queryType": "groupBy",
"dataSource": "test",
"intervals": [
"1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
],
"filter": {
"type": "selector",
"dimension": "tags",
"value": "t3"
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "listFiltered",
"delegate": {
"type": "default",
"dimension": "tags",
"outputName": "tags"
},
"values": ["t3"]
}
],
"aggregations": [
{
"type": "count",
"name": "count"
}
]
}
This query returns the following result:
[
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 2,
"tags": "t3"
}
}
]
Note that, for groupBy queries, you could get similar result with a
having spec
but using a filtered
dimensionSpec
is much more efficient because that gets applied at the lowest level in the query processing pipeline.
Having specs are applied at the outermost level of groupBy query processing.
Disable GroupBy on multi-value columns
You can disable the implicit unnesting behavior for groupBy by setting
groupByEnableMultiValueUnnesting: false
in your
query context
. In this mode, the groupBy engine will return an error instead of completing the query. This is a safety