In a single cluster, you can define as many indexes as you want.
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
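For illustration, a customer document might look something like the following (the field names are only an example, not a fixed schema):

{
  "name": "John Doe",
  "email": "john@example.com",
  "age": 32
}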
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
Sharding is important for two primary reasons: it allows you to horizontally split/scale your content volume, and it allows you to distribute and parallelize operations across shards (and therefore nodes), increasing performance and throughput.
Replication is important for two primary reasons: it provides high availability in case a shard or node fails (a replica is never allocated on the same node as the shard it was copied from), and it allows you to scale out your search volume and throughput, since searches can be executed on all replicas in parallel.
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards api.
With that out of the way, let’s get started with the fun part…
Elasticsearch requires at least Java 7. Specifically, as of this writing it is recommended that you use the Oracle JDK version 1.8.0_25. Java installation varies from platform to platform so we won't go into those details here; Oracle's recommended installation documentation can be found on Oracle's website. Suffice it to say, before you install Elasticsearch, please check your Java version first by running the following (and then install/upgrade accordingly if needed):
java -version
echo $JAVA_HOME
Once we have Java set up, we can then download and run Elasticsearch. The binaries are available from www.elastic.co/downloads along with all the releases that have been made in the past. For each release, you have a choice among a zip or tar archive, or a DEB or RPM package. For simplicity, let's use the tar file.
Let’s download the Elasticsearch 1.7.6 tar as follows (Windows users should download the zip package):
curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.tar.gz
Then extract it as follows (Windows users should unzip the zip package):
tar -xvf elasticsearch-1.7.6.tar.gz
It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:
cd elasticsearch-1.7.6/bin
And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):
./elasticsearch
If everything goes well, you should see a bunch of messages that look like below:
./elasticsearch
[2014-03-13 13:42:17,218][INFO ][node           ] [New Goblin] version[1.7.6], pid[2085], build[5c03844/2014-02-25T15:52:53Z]
[2014-03-13 13:42:17,219][INFO ][node           ] [New Goblin] initializing ...
[2014-03-13 13:42:17,223][INFO ][plugins        ] [New Goblin] loaded [], sites []
[2014-03-13 13:42:19,831][INFO ][node           ] [New Goblin] initialized
[2014-03-13 13:42:19,832][INFO ][node           ] [New Goblin] starting ...
[2014-03-13 13:42:19,958][INFO ][transport      ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.8.112:9300]}
[2014-03-13 13:42:23,030][INFO ][cluster.service] [New Goblin] new_master [New Goblin][rWMtGj3dQouz2r6ZFL9v4g][mwubuntu1][inet[/192.168.8.112:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-13 13:42:23,100][INFO ][discovery      ] [New Goblin] elasticsearch/rWMtGj3dQouz2r6ZFL9v4g
[2014-03-13 13:42:23,125][INFO ][http           ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.8.112:9200]}
[2014-03-13 13:42:23,629][INFO ][gateway        ] [New Goblin] recovered [1] indices into cluster_state
[2014-03-13 13:42:23,630][INFO ][node           ] [New Goblin] started
Without going too much into detail, we can see that our node named "New Goblin" (which will be a different Marvel character in your case) has started and elected itself as a master in a single cluster. Don't worry for now about what master means. The main thing that is important here is that we have started one node within one cluster.
As mentioned previously, we can override either the cluster or node name. This can be done from the command line when starting Elasticsearch as follows:
./elasticsearch --cluster.name my_cluster_name --node.name my_node_name
Also note the line marked http with information about the HTTP address (192.168.8.112) and port (9200) that our node is reachable from. By default, Elasticsearch uses port 9200 to provide access to its REST API. This port is configurable if necessary.
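If you do need to change it, the HTTP port can be set in config/elasticsearch.yml; for example (9201 here is just an arbitrary choice):

http.port: 9201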
To check the cluster health, we will be using the _cat API. Remember previously that our node HTTP endpoint is available at port 9200:
curl 'localhost:9200/_cat/health?v'
And the response:
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign
1394735289 14:28:09  elasticsearch green           1         1      0   0    0    0        0
We can see that our cluster named "elasticsearch" is up with a green status.
Whenever we ask for the cluster health, we get either green, yellow, or red. Green means everything is good (cluster is fully functional), yellow means all data is available but some replicas are not yet allocated (cluster is fully functional), and red means some data is not available for whatever reason. Note that even if a cluster is red, it is still partially functional (i.e. it will continue to serve search requests from the available shards), but you will likely need to fix it ASAP since you have missing data.
Also from the above response, we can see a total of 1 node and that we have 0 shards since we have no data in it yet. Note that since we are using the default cluster name (elasticsearch) and since Elasticsearch uses multicast network discovery by default to find other nodes, it is possible that you could accidentally start up more than one node in your network and have them all join a single cluster. In this scenario, you may see more than 1 node in the above response.
We can also get a list of nodes in our cluster as follows:
curl 'localhost:9200/_cat/nodes?v'
And the response:
curl 'localhost:9200/_cat/nodes?v'
host      ip        heap.percent ram.percent load node.role master name
mwubuntu1 127.0.1.1            8           4 0.00 d         *      New Goblin
Here, we can see our one node named "New Goblin", which is the single node that is currently in our cluster.
We’ve previously seen how we can index a single document. Let’s recall that command again:
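The original command is not reproduced in this excerpt; it was along these lines (the customer/external index/type and the document body follow the other examples in this section, so treat this as a sketch rather than the exact original):

curl -XPUT 'localhost:9200/customer/external/2?pretty' -d '
{
  "name": "Jane Doe"
}'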
The above indexes a new document with an ID of 2.
This example shows how to index a document without an explicit ID:
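A sketch of such a request: note the use of POST instead of PUT, since we let Elasticsearch generate the ID (the index/type names mirror the earlier examples):

curl -XPOST 'localhost:9200/customer/external?pretty' -d '
{
  "name": "Jane Doe"
}'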
Updates can also be performed by using simple scripts. Note that dynamic scripts like the following are disabled by default as of
1.4.3
, have a look at the
scripting docs
for more details. This example uses a script to increment the age by 5:
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '{
  "script" : "ctx._source.age += 5"
}'
In the above example,
ctx._source
refers to the current source document that is about to be updated.
Note that as of this writing, updates can only be performed on a single document at a time. In the future, Elasticsearch might provide the ability to update multiple documents given a query condition (like an
SQL UPDATE-WHERE
statement).
In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the
_bulk
API
. This functionality is important in that it provides a very efficient mechanism to do multiple operations as quickly as possible with as few network round trips as possible.
As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 - Jane Doe) in one bulk operation:
curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:
curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'
Note above that for the delete action, there is no corresponding source document after it since deletes only require the ID of the document to be deleted.
The bulk API executes all the actions sequentially and in order. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.
For the curious, I generated this data from
www.json-generator.com/
so please ignore the actual values and semantics of the data as these are all randomly generated.
You can download the sample dataset (accounts.json) from here . Extract it to our current directory and let’s load it into our cluster as follows:
curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary "@accounts.json"
curl 'localhost:9200/_cat/indices?v'
And the response:
curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size
yellow bank    5   1       1000            0    424.4kb        424.4kb
Which means that we just successfully bulk indexed 1000 documents into the bank index (under the account type).
Now let’s start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body . The request body method allows you to be more expressive and also to define your searches in a more readable JSON format. We’ll try one example of the request URI method but for the remainder of this tutorial, we will exclusively be using the request body method.
The REST API for search is accessible from the
_search
endpoint. This example returns all documents in the bank index:
curl 'localhost:9200/bank/_search?q=*&pretty'
Let’s first dissect the search call. We are searching (
_search
endpoint) in the bank index, and the
q=*
parameter instructs Elasticsearch to match all documents in the index. The
pretty
parameter, again, just tells Elasticsearch to return pretty-printed JSON results.
And the response (partially shown):
curl 'localhost:9200/bank/_search?q=*&pretty'
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"[email protected]","city":"Brogan","state":"IL"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "6",
      "_score" : 1.0,
      "_source" : {"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"[email protected]","city":"Dante","state":"TN"}
    }, {
      "_index" : "bank",
      "_type" : "account",
As for the response, we see the following parts:
took
– time in milliseconds for Elasticsearch to execute the search
timed_out
– tells us if the search timed out or not
_shards
– tells us how many shards were searched, as well as a count of the successful/failed searched shards
hits
– search results
hits.total
– total number of documents matching our search criteria
hits.hits
– actual array of search results (defaults to first 10 documents)
_score
and
max_score
- ignore these fields for now
Here is the same exact search above using the alternative request body method:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match_all": {} }
}'
The difference here is that instead of passing
q=*
in the URI, we POST a JSON-style query request body to the
_search
API. We’ll discuss this JSON query in the next section.
And the response (partially shown):
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match_all": {} }
}'
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"[email protected]","city":"Brogan","state":"IL"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "6",
      "_score" : 1.0,
      "_source" : {"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"[email protected]","city":"Dante","state":"TN"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "13",
It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL wherein you may initially get a partial subset of your query results up-front and then you have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.
Elasticsearch provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the Query DSL . The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.
Going back to our last example, we executed this query:
{ "query": { "match_all": {} } }
Dissecting the above, the
query
part tells us what our query definition is and the
match_all
part is simply the type of query that we want to run. The
match_all
query is simply a search for all documents in the specified index.
In addition to the
query
parameter, we also can pass other parameters to influence the search results. For example, the following does a
match_all
and returns only the first document:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match_all": {} },
  "size": 1
}'
Note that if
size
is not specified, it defaults to 10.
This example does a
match_all
and returns documents 11 through 20:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}'
The
from
parameter (0-based) specifies which document index to start from and the
size
parameter specifies how many documents to return starting at the from parameter. This feature is useful when implementing paging of search results. Note that if
from
is not specified, it defaults to 0.
This example does a
match_all
and sorts the results by account balance in descending order and returns the top 10 (default size) documents.
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}'

Executing Filters
Now let’s move on to the query part. Previously, we’ve seen how the
match_all
query is used to match all documents. Let’s now introduce a new query called the
match
query
, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).
This example returns the account numbered 20:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match": { "account_number": 20 } }
}'
This example returns all accounts containing the term "mill" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match": { "address": "mill" } }
}'
This example returns all accounts containing the term "mill" or "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match": { "address": "mill lane" } }
}'
This example is a variant of
match
(
match_phrase
) that returns all accounts containing the phrase "mill lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": { "match_phrase": { "address": "mill lane" } }
}'
Let's now introduce the bool(ean) query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.
This example composes two
match
queries and returns all accounts containing "mill" and "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}'
In the above example, the
bool must
clause specifies all the queries that must be true for a document to be considered a match.
In contrast, this example composes two
match
queries and returns all accounts containing "mill" or "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}'
In the above example, the
bool should
clause specifies a list of queries either of which must be true for a document to be considered a match.
This example composes two
match
queries and returns all accounts that contain neither "mill" nor "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}'
In the above example, the
bool must_not
clause specifies a list of queries none of which must be true for a document to be considered a match.
We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.
This example returns all accounts belonging to anybody who is 40 years old but doesn't live in Idaho (state code ID):
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}'

Executing Aggregations
All queries in Elasticsearch trigger computation of the relevance scores. In cases where we do not need the relevance scores, Elasticsearch provides another query capability in the form of filters . Filters are similar in concept to queries except that they are optimized for much faster execution speeds for two primary reasons:
To understand filters, let's first introduce the filtered query, which allows you to combine a query (like match_all, match, bool, etc.) together with a filter. As an example, let's introduce the range filter, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.
This example uses a filtered query to return all accounts with balances between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "balance": { "gte": 20000, "lte": 30000 }
        }
      }
    }
  }
}'
Dissecting the above, the filtered query contains a
match_all
query (the query part) and a
range
filter (the filter part). We can substitute any other query into the query part as well as any other filter into the filter part. In the above case, the range filter makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.
In general, the easiest way to decide whether you want a filter or a query is to ask yourself if you care about the relevance score or not. If relevance is not important, use filters, otherwise, use queries. If you come from a SQL background, queries and filters are similar in concept to the
SELECT WHERE
clause, although more so for filters than queries.
In addition to the match_all, match, bool, filtered, and range queries, there are a lot of other query/filter types that are available and we won't go into them here. Since we already have a basic understanding of how they work, it shouldn't be too difficult to apply this knowledge in learning and experimenting with the other query/filter types.
In SQL, the above aggregation is similar in concept to:
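The aggregation itself is not reproduced in this excerpt; assuming it groups accounts by state and counts them, the SQL equivalent would look roughly like this:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC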
And the response (partially shown):
Building on the previous aggregation, let’s now sort on the average balance in descending order:
There are many other aggregation capabilities that we won't go into detail on here. The aggregations reference guide is a great starting point if you want to do further experimentation.
This section includes information on how to set up Elasticsearch and get it running. If you haven't already, download it, and then check the installation docs.
Elasticsearch can also be installed from our repositories using
apt
or
yum
.
See
Repositories
.
After downloading the latest release and extracting it, elasticsearch can be started using:
$ bin/elasticsearch
On *nix systems, the command will start the process in the foreground.
To run it in the background, add the
-d
switch to it:
$ bin/elasticsearch -d
*NIX
There are added features when using the
elasticsearch
shell script.
The first, which was explained earlier, is the ability to easily run the
process either in the foreground or the background.
Another feature is the ability to pass
-X
and
-D
or getopt long style
configuration parameters directly to the script. When set, all override
anything set using either
JAVA_OPTS
or
ES_JAVA_OPTS
. For example:
$ bin/elasticsearch -Xmx2g -Xms2g -Des.index.store.type=memory --node.name=my-node
Elasticsearch is built using Java, and requires at least Java 7 in order to run. Only Oracle’s Java and the OpenJDK are supported. The same JVM version should be used on all Elasticsearch nodes and clients.
We recommend installing Java 8 update 20 or later, or Java 7 update 55 or later. Previous versions of Java 7 are known to have bugs that can cause index corruption and data loss. Elasticsearch will refuse to start if a known-bad version of Java is used.
The version of Java to use can be configured by setting the
JAVA_HOME
environment variable.
It is recommended to set the min and max memory to the same value, and enable mlockall.
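For example, assuming a 2 GB heap is appropriate for the machine (the 2g value is only an illustration), the heap can be pinned to a single size via the ES_HEAP_SIZE environment variable before starting the node, and combined with the mlockall setting described below:

export ES_HEAP_SIZE=2g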
Alternatively, you can retrieve the
max_file_descriptors
for each node
using the
Nodes Info
API, with:
curl localhost:9200/_nodes/process?pretty
Elasticsearch uses a hybrid mmapfs / niofs directory by default to store its indices. The default operating system limits on mmap counts are likely to be too low, which may result in out of memory exceptions. On Linux, you can increase the limits by running the following command as root:
sysctl -w vm.max_map_count=262144
To set this value permanently, update the
vm.max_map_count
setting in
/etc/sysctl.conf
.
If you installed Elasticsearch using a package (.deb, .rpm) this setting will be changed automatically. To verify, run
sysctl vm.max_map_count
.
The third option is to use mlockall on Linux/Unix systems, or VirtualLock on Windows, to try to lock the process address space into RAM, preventing any Elasticsearch memory from being swapped out. This can be done by adding this line to the config/elasticsearch.yml file:
bootstrap.mlockall: true
After starting Elasticsearch, you can see whether this setting was applied
successfully by checking the value of
mlockall
in the output from this
request:
curl http://localhost:9200/_nodes/process?pretty
If you see that mlockall is false, then it means that the mlockall request has failed. The most probable reason, on Linux/Unix systems, is that the user running Elasticsearch doesn't have permission to lock memory. This can be granted by running ulimit -l unlimited as root before starting Elasticsearch.
Another possible reason why
mlockall
can fail is that the temporary directory
(usually
/tmp
) is mounted with the
noexec
option. This can be solved by
specifying a new temp directory, by starting Elasticsearch with:
./bin/elasticsearch -Djna.tmpdir=/path/to/new/dir
mlockall
might cause the JVM or shell session to exit if it tries
to allocate more memory than is available!
The elasticsearch configuration files can be found under the ES_HOME/config folder. The folder comes with two files: elasticsearch.yml, for configuring the different Elasticsearch modules, and logging.yml, for configuring the Elasticsearch logging.
The configuration format is YAML . Here is an example of changing the address all network based modules will use to bind and publish to:
network:
  host: 10.0.0.4
In production use, you will almost certainly want to change paths for data and log files:
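For example, a minimal sketch using the path.data and path.logs settings (the paths shown are placeholders, not defaults):

path:
  data: /var/data/elasticsearch
  logs: /var/log/elasticsearch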
Internally, all settings are collapsed into "namespaced" settings. For example, the above gets collapsed into network.host. This means that it's easy to support other configuration formats, for example JSON. If JSON is a preferred configuration format, simply rename the elasticsearch.yml file to elasticsearch.json and add:
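A minimal sketch of such an elasticsearch.json, mirroring the YAML example above:

{
  "network": {
    "host": "10.0.0.4"
  }
}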
On execution of the
elasticsearch
command, you will be prompted to enter
the actual value like so:
The location of the configuration file can be set externally using a system property:
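This was done in 1.x with the es.config system property; a sketch (the path is a placeholder, so verify the property name against your version's documentation):

$ bin/elasticsearch -Des.config=/path/to/config/elasticsearch.yml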
All of the index level configuration can be found within each index module .
Elasticsearch uses an internal logging abstraction and comes, out of the
box, with
log4j
. It tries to simplify
log4j configuration by using
YAML
to configure it,
and the logging configuration file is
config/logging.yml
. The
JSON
and
properties
formats are also
supported. Multiple configuration files can be loaded, in which case they will
get merged, as long as they start with the
logging.
prefix and end with one
of the supported suffixes (either
.yml
,
.yaml
,
.json
or
.properties
)
The logger section contains the java packages and their corresponding log
level, where it is possible to omit the
org.elasticsearch
prefix. The
appender section contains the destinations for the logs. Extensive information
on how to customize logging and all the supported appenders can be found on
the
log4j documentation
.
Additional Appenders and other logging classes provided by log4j-extras are also available, out of the box.
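As a short sketch of what the logger section in config/logging.yml might look like (the packages and levels here are illustrative only):

logger:
  # the org.elasticsearch prefix can be omitted
  action: DEBUG
  index.search.slowlog: TRACE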
Each package features a configuration file, which allows you to set the following parameters
ES_USER
The user to run as, defaults to
elasticsearch
ES_GROUP
The group to run as, defaults to
elasticsearch
ES_HEAP_SIZE
The heap size to start with
ES_HEAP_NEWSIZE
The size of the new generation heap
ES_DIRECT_SIZE
The maximum size of the direct memory
MAX_OPEN_FILES
Maximum number of open files, defaults to
65535
MAX_LOCKED_MEMORY
Maximum locked memory size. Set to "unlimited" if you use the bootstrap.mlockall option in elasticsearch.yml. You must also set ES_HEAP_SIZE.
MAX_MAP_COUNT
Maximum number of memory map areas a process may have. If you use
mmapfs
as index store type, make sure this is set to a high value. For more information, check the
linux kernel documentation
about
max_map_count
. This is set via
sysctl
before starting elasticsearch. Defaults to
65535
LOG_DIR
Log directory, defaults to
/var/log/elasticsearch
DATA_DIR
Data directory, defaults to
/var/lib/elasticsearch
WORK_DIR
Work directory, defaults to
/tmp/elasticsearch
CONF_DIR
Configuration file directory (which needs to include
elasticsearch.yml
and
logging.yml
files), defaults to
/etc/elasticsearch
CONF_FILE
Path to configuration file, defaults to
/etc/elasticsearch/elasticsearch.yml
ES_JAVA_OPTS
Any additional java options you may want to apply. This may be useful, if you need to set the
node.name
property, but do not want to change the
elasticsearch.yml
configuration file, because it is distributed via a provisioning system like puppet or chef. Example:
ES_JAVA_OPTS="-Des.node.name=search-01"
RESTART_ON_UPGRADE
Configure restart on package upgrade, defaults to false. This means you will have to restart your elasticsearch instance manually after installing a package. The reason for this is to ensure that upgrades in a cluster do not result in continuous shard reallocation, which would cause high network traffic and degrade the response times of your cluster.
c:\elasticsearch-1.7.6\bin>service

Usage: service.bat install|remove|start|stop|manager [SERVICE_ID]
install
Install Elasticsearch as a service
remove
Remove the installed Elasticsearch service (and stop the service if started)
start
Start the Elasticsearch service (if installed)
stop
Stop the Elasticsearch service (if started)
manager
Start a GUI for managing the installed service
There are two ways to customize the service settings:

Service Manager GUI
Accessible through the manager command, the GUI offers insight into the installed service, including its status, startup type, JVM, and start and stop settings, among other things. Simply invoking service.bat from the command line with the aforementioned option will open up the manager window:
Customizing service.bat
At its core, service.bat relies on the Apache Commons Daemon project to install the service. For full flexibility, such as customizing the user under which the service runs, one can modify the installation parameters to tweak all the parameters accordingly. Do note that this requires reinstalling the service for the new settings to be applied.
There is also a community supported customizable MSI installer available: https://github.com/salyh/elasticsearch-msi-installer (by Hendrik Saly).
The directory layout of an installation is as follows:
Type | Description | Default Location | Setting
---|---|---|---
home | Home of elasticsearch installation. | |
bin | Binary scripts including | |
conf | Configuration files including | |
data | The location of the data files of each index / shard allocated on the node. Can hold multiple locations. | |
logs | Log files location. | |
plugins | Plugin files location. Each plugin will be contained in a subdirectory. | |
repo | Shared file system repository locations. Can hold multiple locations. A file system repository can be placed in to any subdirectory of any directory specified here. | |
path.data: /mnt/first,/mnt/second
path.data: ["/mnt/first", "/mnt/second"]
Below are the default paths that elasticsearch will use, if not explicitly changed.
Type | Description | Location Debian/Ubuntu | Location RHEL/CentOS
---|---|---|---
home | Home of elasticsearch installation. | |
bin | Binary scripts including | |
conf | Configuration files | |
conf | Environment variables including heap size, file descriptors. | |
data | The location of the data files of each index / shard allocated on the node. | |
logs | Log files location. | |
plugins | Plugin files location. Each plugin will be contained in a subdirectory. | |
Type | Description | Location
---|---|---
home | Home of elasticsearch installation |
bin | Binary scripts including |
conf | Configuration files |
conf | Environment variables including heap size, file descriptors |
data | The location of the data files of each index / shard allocated on the node |
logs | Log files location |
plugins | Plugin files location. Each plugin will be contained in a subdirectory |
We use the PGP key D88E42B4 , Elasticsearch Signing Key, with fingerprint
4609 5ACC 8548 582C 1A26 99A9 D27D 666C D88E 42B4
to sign all our packages. It is available from http://pgp.mit.edu .
Download and install the Public Signing Key:
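On Debian-based systems this is typically done along these lines (the key URL shown is the one published for the 1.x repositories; verify it against the current documentation):

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -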
Save the repository definition to
/etc/apt/sources.list.d/elasticsearch-{branch}.list
:
Run apt-get update and the repository is ready for use. You can install it with:
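For example (assuming the repository definition above points at the branch you want):

sudo apt-get update && sudo apt-get install elasticsearch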
Otherwise if your distribution is using systemd:
Download and install the public signing key:
And your repository is ready for use. You can install it with:
To determine whether a rolling upgrade is supported for your release, please consult this table:
Upgrade From | Upgrade To | Supported Upgrade Type
---|---|---
(table values omitted in this excerpt; one supported type listed is a rolling upgrade)
To back up a running 1.0 or later system, it is simplest to use the snapshot feature. See the complete instructions for backup and restore with snapshots .
This syntax applies to Elasticsearch 1.0 and later:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}'
(Only applicable to upgrades from ES 1.6.0 to a higher version) There is no problem continuing to index while doing the upgrade. However, you can speed the process considerably
by temporarily stopping non-essential indexing and issuing a manual synced flush.
A synced flush is a special kind of flush which can significantly speed up recovery of shards. Elasticsearch automatically uses it when an index has been inactive for a while (the default is 5m), but you can manually trigger it using the following command:

curl -XPOST localhost:9200/_all/_flush/synced

Note that a synced flush call is a best-effort operation. It will fail if there are any pending indexing operations. It is safe to issue it multiple times if needed.
curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
To upgrade using a .deb or .rpm package:
Use
rpm
or
dpkg
to install the new package. All files should be placed in their proper locations, and config files should not be overwritten.
Start the now upgraded node. Confirm that it joins the cluster.
Re-enable shard reallocation:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'

Observe that all shards are properly allocated on all nodes. Balancing may take some time. Repeat this process for all remaining nodes.
During a rolling upgrade, primary shards assigned to a node with the higher version will never have their replicas assigned to a node with the lower version, because the newer version may have a different data format which is not understood by the older version.
If it is not possible to assign the replica shards to another node with the
higher version — e.g. if there is only one node with the higher version in
the cluster — then the replica shards will remain unassigned, i.e. the
cluster health will be status
yellow
. As soon as another node with the
higher version joins the cluster, the replicas should be assigned and the
cluster health will reach status
green
.
It may be possible to perform the upgrade by installing the new software while the service is running. This would reduce downtime by ensuring the service was ready to run on the new version as soon as it is stopped on the node being upgraded. This can be done by installing the new version in its own directory and using the symbolic link method outlined above. It is important to test this procedure first to be sure that site-specific configuration data and production indices will not be overwritten during the upgrade process.
This syntax is from versions prior to 1.0:
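A sketch of the pre-1.0 equivalent, using the older disable_allocation setting (shown from memory, so double-check against the documentation for your exact version):

curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.disable_allocation" : true
  }
}'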
Upgrading from 1.x to 2.x requires a full cluster restart. Migration between minor versions (e.g. 1.x to 1.y) can be performed by upgrading one node at a time. See Upgrading for more info.
The More Like This API query has been deprecated and will be removed in 2.0. Instead use the More Like This Query .
top_children
query
Aliases
can include
filters
which
are automatically applied to any search performed via the alias.
Filtered aliases
created with version
1.4.0
or later can only
refer to field names which exist in the mappings of the index (or indices)
pointed to by the alias.
Add or update a mapping via the create index or put mapping apis.
The
get warmer api
will return a section for
warmers
even if there are
no warmers. This ensures that the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_warmers' curl -XGET 'http://localhost:9200/_warmers'
The
get alias api
will return a section for
aliases
even if there are
no aliases. This ensures that the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_aliases' curl -XGET 'http://localhost:9200/_aliases'
The
get mapping api
will return a section for
mappings
even if there are
no mappings. This ensures that the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_mappings' curl -XGET 'http://localhost:9200/_mappings'
Bulk UDP has been deprecated and will be removed in 2.0. You should use the standard bulk API instead. Each cluster must have an elected master node in order to be fully operational. Once a node loses its elected master node it will reject some or all operations.
On versions before 1.4.0.Beta1, all operations are rejected when a node loses its elected master. From 1.4.0.Beta1, only write operations will be rejected by default. Read operations will still be served based on the information available to the node, which may result in partial and possibly stale results. If the default is undesired, the pre-1.4.0.Beta1 behaviour can be enabled; see: no-master-block
The More Like This Field query has been deprecated in favor of the
More Like This Query
restrained set to a specific
field
. It will be removed in 2.0.
Groovy is the new default scripting language in Elasticsearch, and is enabled in
sandbox
mode
by default. MVEL has been removed from core, but is available as a plugin:
https://github.com/elasticsearch/elasticsearch-lang-mvel
The index store type now defaults to mmapfs. Make sure that you set MAX_MAP_COUNT to a sufficiently high number. The RPM and Debian packages default this value to 262144.
The RPM and Debian packages no longer start Elasticsearch by default.
The cluster.routing.allocation settings (disable_allocation, disable_new_allocation and disable_replica_allocation) have been replaced by the single setting:

cluster.routing.allocation.enable: all|primaries|new_primaries|none
The
cluster_state
,
nodes_info
,
nodes_stats
and
indices_stats
APIs have all been changed to make their format more RESTful and less clumsy.
For instance, if you just want the nodes section of the cluster_state, instead of:
GET /_cluster/state?filter_metadata&filter_routing_table&filter_blocks
you now use:
GET /_cluster/state/nodes
Similarly for the
nodes_stats
API, if you want the
transport
and
http
metrics only, instead of:
GET /_nodes/stats?clear&transport&http
you now use:
GET /_nodes/stats/transport,http
See the links above for full details.
These URLs have been unified as:
All of the
{indices}
,
{types}
and
{names}
parameters can be replaced by:
Similarly, the return values for
GET
have been unified with the following
rules:
See
put-mapping
,
get-mapping
,
get-field-mapping
,
delete-mapping
,
update-settings
,
get-settings
,
warmers
, and
aliases
for more details.
While the
search
API takes a top-level
query
parameter, the
count
,
delete-by-query
and
validate-query
requests expected the whole body to be a
query. These now
require
a top-level
query
parameter:
GET /_count
{
  "query": { "match": { "title": "Interesting stuff" } }
}
Also, the top-level
filter
parameter in search has been renamed to
post_filter
, to indicate that it should not
be used as the primary way to filter search results (use a
filtered
query
instead), but only to filter
results AFTER facets/aggregations have been calculated.
This example counts the top colors in all matching docs, but only returns docs
with color
red
:
GET /_search
{
  "query": { "match_all": {} },
  "aggs": {
    "colors": { "terms": { "field": "color" } }
  },
  "post_filter": { "term": { "color": "red" } }
}

Stopwords
Previously, the
standard
and
pattern
analyzers used the list of English stopwords
by default, which caused some hard to debug indexing issues. Now they are set to
use the empty stopwords list (ie
_none_
) instead.
When dates are specified without a year, for example Dec 15 10:00:00, they are treated as dates in 2000 during indexing and range searches, except for the upper included bound lte where they were treated as dates in 1970! Now, all dates without years use 1970 as the default.
Previously, miles was the default distance unit, and we all know what happened at NASA because of that decision. The new default unit is meters.
For all queries that support
fuzziness
, the
min_similarity
,
fuzziness
and
edit_distance
parameters have been unified as the single parameter
fuzziness
. See
the section called “Fuzziness
” for details of accepted values.
The
ignore_missing
parameter has been replaced by the
expand_wildcards
,
ignore_unavailable
and
allow_no_indices
parameters, all of which have
sensible defaults. See
the multi-index docs
for more.
An index name (or pattern) is now required for destructive operations like
deleting indices:
# v0.90 - delete all indices:
DELETE /

# v1.0 - delete all indices:
DELETE /_all
DELETE /*
Setting
action.destructive_requires_name
to
true
provides further safety
by disabling wildcard expansion on destructive actions.
The ok return value has been removed from all response bodies as it added no useful information.
The
found
,
not_found
and
exists
return values have been unified as
found
on all relevant APIs.
Field values, in response to the
fields
parameter, are now always returned as arrays. A field could have single or
multiple values, which meant that sometimes they were returned as scalars
and sometimes as arrays. By always returning arrays, this simplifies user
code. The only exception to this rule is when
fields
is used to retrieve
metadata like the
routing
value, which are always singular. Metadata
fields are always returned as scalars.
The
fields
parameter is intended to be used for retrieving stored fields,
rather than for fields extracted from the
_source
. That means that it can no
longer be used to return whole objects and it no longer accepts the
_source.fieldname
format. For these you should use the _source, _source_include and _source_exclude parameters instead.
Settings, like index.analysis.analyzer.default, are now returned as proper nested JSON objects, which makes them easier to work with programmatically:

"index": {
  "analysis": {
    "analyzer": {
      "default": xxx
    }
  }
}
You can choose to return them in flattened format by passing
?flat_settings
in the query string.
The analyze API no longer supports the text response format, but does support JSON and YAML.
Per-document boosting with the
_boost
field has
been removed. You can use the
function_score
instead.
The
path
parameter in mappings has been deprecated. Use the
copy_to
parameter instead.
The custom_score and custom_boost_score queries are no longer supported. You can use function_score instead.
API Conventions
The percolator has been redesigned and because of this the dedicated
_percolator
index is no longer used by the percolator,
but instead the percolator works with a dedicated
.percolator
type. Read the
redesigned percolator
blog post for the reasons why the percolator has been redesigned.
Elasticsearch will
not
delete the
_percolator
index when upgrading, only the percolate api will not use the queries
stored in the
_percolator
index. In order to use the already stored queries, you can just re-index the queries from the
_percolator
index into any index under the reserved
.percolator
type. The format in which the percolate queries
were stored has
not
been changed. So a simple script that does a scan search to retrieve all the percolator queries
and then does a bulk request into another index should be sufficient.
The elasticsearch REST APIs are exposed using JSON over HTTP .
The conventions listed in this chapter can be applied throughout the REST API, unless otherwise specified.
All multi indices API support the following url query string parameters:
The defaults settings for the above parameters depend on the api being used.
Single index APIs such as the
Document APIs
and the
single-index
alias
APIs
do not support multiple indices.
The following options can be applied to all of the REST APIs.
It also supports the
*
wildcard character to match any field or part
of a field’s name:
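For example, a request along these lines would keep only the fields under each index's metadata whose names start with "stat" (the exact response shape depends on your cluster):

curl -XGET 'localhost:9200/_cluster/state?filter_path=metadata.indices.*.stat*'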
Note that elasticsearch sometimes returns directly the raw value of a field,
like the
_source
field. If you want to filter _source fields, you should
consider combining the already existing
_source
parameter (see
Get API
for more details) with the
filter_path
parameter like this:
curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title'
{
  "hits" : {
    "hits" : [ {
      "_source":{"title":"Book #2"}
    }, {
      "_source":{"title":"Book #1"}
    }, {
      "_source":{"title":"Book #3"}
    } ]
  }
}
By default the
flat_settings
is set to
false
.
Wherever distances need to be specified (such as the distance parameter in the Geo Distance Filter), the default unit if none is specified is the meter. Distances can be specified in other units, such as "1km" or "2mi" (2 miles).
The full list of units is listed below:
Mile: mi or miles
Yard: yd or yards
Feet: ft or feet
Inch: in or inch
Kilometer: km or kilometers
Meter: m or meters
Centimeter: cm or centimeters
Millimeter: mm or millimeters
Nautical mile: NM, nmi or nauticalmiles
The
precision
parameter in the
Geohash Cell Filter
accepts
distances with the above units, but if no unit is specified, then the
precision is interpreted as the length of the geohash.
When querying numeric, date and IPv4 fields,
fuzziness
is interpreted as a
+/-
margin. It behaves like a
Range Query
where:
-fuzziness <= field value <= +fuzziness
The
fuzziness
parameter should be set to a numeric value, eg
2
or
2.0
. A
date
field interprets a long as milliseconds, but also accepts a string
containing a time value —
"1h"
— as explained in
the section called “Time units
”. An
ip
field accepts a long or another IPv4 address (which will be converted into a
long).
When querying
string
fields,
fuzziness
is interpreted as a
Levenshtein Edit Distance
— the number of one character changes that need to be made to one string to
make it the same as another string.
The
fuzziness
parameter can be specified as:
0, 1, 2
The maximum allowed Levenshtein Edit Distance (or number of edits).
AUTO
Generates an edit distance based on the length of the term: very short terms must match exactly, while longer terms allow one or two edits. AUTO should generally be the preferred value for fuzziness.
0.0..1.0
[1.7.0] Deprecated in 1.7.0. Support for similarity will be removed in Elasticsearch 2.0. Converted into an edit distance using the formula: length(term) * (1.0 - fuzziness), e.g. a fuzziness of 0.6 with a term of length 10 would result in an edit distance of 4. Note: in all APIs except for the Fuzzy Like This Query, the maximum allowed edit distance is 2.
When enabled, all REST APIs accept a
callback
parameter
resulting in a
JSONP
result. You can enable
this behavior by adding the following to
config.yaml
:
http.jsonp.enable: true
Please note, when enabled, due to the architecture of Elasticsearch, this may pose a security risk. Under some circumstances, an attacker may be able to exfiltrate data in your Elasticsearch server if they’re able to force your browser to make a JSONP request on your behalf (e.g. by including a <script> tag on an untrusted site with a legitimate query against a local Elasticsearch server).
Many users use a proxy with URL-based access control to secure access to Elasticsearch indices. For multi-search , multi-get and bulk requests, the user has the choice of specifying an index in the URL and on each individual request within the request body. This can make URL-based access control challenging.
To prevent the user from overriding the index which has been specified in the
URL, add this setting to the
config.yml
file:
rest.action.multi.allow_explicit_index: false
The default value is
true
, but when set to
false
, Elasticsearch will
reject requests that have an explicit index specified in the request body.
This section describes the following CRUD APIs:
Multi-document APIs
All CRUD APIs are single-index APIs. The
index
parameter accepts a single
index name, or an
alias
which points to a single index.
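The index operation itself is not shown in this excerpt; it was a simple PUT along these lines (twitter/tweet and the kimchy document match the other examples in this section, so treat it as a sketch):

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}'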
The result of the above index operation is:
The index operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if one has not yet been created (check out the put mapping API for manually creating a type mapping).
The mapping itself is very flexible and is schema-free. New fields and objects will automatically be added to the mapping definition of the type specified. Check out the mapping section for more information on mapping definitions.
Note that the format of the JSON document can also include the type (very handy
when using JSON mappers) if the
index.mapping.allow_type_wrapper
setting is
set to true, for example:
$ curl -XPOST 'http://localhost:9200/twitter' -d '{
  "settings": {
    "index": {
      "mapping.allow_type_wrapper": true
    }
  }
}'
{"acknowledged":true}

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "tweet" : {
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
  }
}'
Automatic index creation can be disabled by setting
action.auto_create_index
to
false
in the config file of all nodes.
Automatic mapping creation can be disabled by setting
index.mapper.dynamic
to
false
in the config files of all nodes (or
on the specific index settings).
Automatic index creation can include a pattern based white/black list,
for example, set
action.auto_create_index
to
+aaa*,-bbb*,+ccc*,-*
(+
meaning allowed, and - meaning disallowed).
Each indexed document is given a version number. The associated
version
number is returned as part of the response to the index API
request. The index API optionally allows for
optimistic
concurrency control
when the
version
parameter is specified. This
will control the version of the document the operation is intended to be
executed against. A good example of a use case for versioning is
performing a transactional read-then-update. Specifying a
version
from
the document initially read ensures no changes have happened in the
meantime (when reading in order to update, it is recommended to set
preference
to
_primary
). For example:
curl -XPUT 'localhost:9200/twitter/tweet/1?version=2' -d '{ "message" : "elasticsearch now has versioning support, double cool!" }'
NOTE: versioning is completely real time, and is not affected by the near real time aspects of search operations. If no version is provided, then the operation is executed without any version checks.
By default, internal versioning is used that starts at 1 and increments
with each update, deletes included. Optionally, the version number can be
supplemented with an external value (for example, if maintained in a
database). To enable this functionality,
version_type
should be set to
external
. The value provided must be a numeric, long value greater or equal to 0,
and less than around 9.2e+18. When using the external version type, instead
of checking for a matching version number, the system checks to see if
the version number passed to the index request is greater than the
version of the currently stored document. If true, the document will be
indexed and the new version number used. If the value provided is less
than or equal to the stored document’s version number, a version
conflict will occur and the index operation will fail.
A nice side effect is that there is no need to maintain strict ordering of async indexing operations executed as a result of changes to a source database, as long as version numbers from the source database are used. Even the simple case of updating the elasticsearch index using data from a database is simplified if external versioning is used, as only the latest version will be used if the index operations are out of order for whatever reason.
Here is an example of using the
op_type
parameter:
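A sketch of such a request, using op_type=create so the operation fails if a document with that ID already exists (the document body mirrors the other examples in this section):

curl -XPUT 'http://localhost:9200/twitter/tweet/1?op_type=create' -d '{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}'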
Another option to specify
create
is to use the following uri:
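A sketch of the equivalent request using the _create endpoint with the same document:

curl -XPUT 'http://localhost:9200/twitter/tweet/1/_create' -d '{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}'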
The result of the above index operation is:
A child document can be indexed by specifying its parent when indexing. For example:
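A sketch, assuming a blog_tag child type whose parent document lives in the same index (the index, type, and IDs are illustrative):

curl -XPUT 'http://localhost:9200/blogs/blog_tag/1122?parent=1111' -d '{
  "tag" : "something"
}'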
If the
timestamp
value is not provided externally or in the
_source
,
the
timestamp
will be automatically set to the date the document was
processed by the indexing chain. More information can be found on the
_timestamp mapping page
.
More information can be found on the _ttl mapping page .
Valid write consistency values are
one
,
quorum
, and
all
.
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?timeout=5m' -d '{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}'

Delete API
The result of the above get operation is:
The API also allows to check for the existence of a document using
HEAD
, for example:
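For example (the -i flag just makes curl print the response status line and headers, which is where the result of a HEAD request shows up):

curl -XHEAD -i 'http://localhost:9200/twitter/tweet/1'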
If you only want to specify includes, you can use a shorter notation:
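For example, a request along these lines returns only the listed fields from the source (the field names are illustrative):

curl -XGET 'http://localhost:9200/twitter/tweet/1?_source=user,message'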
For backward compatibility, if the requested fields are not stored, they will be fetched
from the
_source
(parsed and extracted). This functionality has been replaced by the
source filtering
parameter.
Field values fetched from the document itself are always returned as an array. Metadata fields like _routing and _parent are never returned as an array.
Also only leaf fields can be returned via the
field
option. So object fields can’t be returned and such requests
will fail.
The result of the above delete operation is:
The
parent
parameter can be set, which will basically be the same as
setting the routing parameter.
Note that deleting a parent document does not automatically delete its children. One way of deleting all child documents given a parent’s id is to perform a delete by query on the child index with the automatically generated (and indexed) field _parent, which is in the format parent_type#parent_id.
The delete operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if it has not been created before (check out the put mapping API for manually creating type mapping).
For example, let's index a simple doc. We can then execute a script that increments a counter, add a new field to the document, or remove a field from the document, as sketched below:
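A sketch of those four steps (the counter/tags document and the Groovy-style script syntax follow the patterns used elsewhere in the 1.x documentation; remember that dynamic scripts are disabled by default, as noted earlier):

curl -XPUT 'localhost:9200/test/type1/1' -d '{
  "counter" : 1,
  "tags" : ["red"]
}'
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.counter += count",
  "params" : { "count" : 4 }
}'
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.name_of_new_field = \"value_of_new_field\""
}'
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.remove(\"name_of_field\")"
}'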
detect_noop
If
name
was
new_name
before the request was sent then the entire update
request is ignored.
scripted_upsert
doc_as_upsert
The update operation supports the following query-string parameters:
retry_on_conflict
In between the get and indexing phases of the update, it is possible that
another process might have already updated the same document. By default, the
update will fail with a version conflict exception. The
retry_on_conflict
parameter controls how many times to retry the update before finally throwing
an exception.
routing
Routing is used to route the update request to the right shard and sets the
routing for the upsert request if the document being updated doesn’t exist.
Can’t be used to update the routing of an existing document.
parent
Parent is used to route the update request to the right shard and sets the
parent for the upsert request if the document being updated doesn’t exist.
Can’t be used to update the
parent
of an existing document.
timeout
Timeout waiting for a shard to become available.
consistency
The write consistency of the index/delete operation.
refresh
Refresh the relevant primary and replica shards (not the whole index)
immediately after the operation occurs, so that the updated document appears
in search results immediately.
fields
Return the relevant fields from the updated document. Specify
_source
to
return the full updated source.
version
&
version_type
The Update API uses Elasticsearch's versioning support internally to make sure the document doesn't change during the update. You can use the version parameter to specify that the document should only be updated if its version matches the one specified. By setting the version type to force you can force the new version of the document after update (use with care! with force there is no guarantee the document didn't change). Version types external & external_gte are not supported.
Bulk API
Multi GET API allows to get multiple documents based on an index, type
(optional) and id (and possibly routing). The response includes a
curl 'localhost:9200/_mget' -d '{ "docs" : [ "_index" : "test", "_type" : "type", "_id" : "1" "_index" : "test", "_type" : "type", "_id" : "2" }'
The
curl 'localhost:9200/test/_mget' -d '{ "docs" : [ "_type" : "type", "_id" : "1" "_type" : "type", "_id" : "2" }' And type: curl 'localhost:9200/test/type/_mget' -d '{ "docs" : [ "_id" : "1" "_id" : "2" }'
In which case, the
curl 'localhost:9200/test/type/_mget' -d '{ "ids" : ["1", "2"] }' Optional Type
You need in that case to explicitly set the
Source filtering
By default, the
For example: curl 'localhost:9200/_mget' -d '{ "docs" : [ "_index" : "test", "_type" : "type", "_id" : "1", "_source" : false "_index" : "test", "_type" : "type", "_id" : "2", "_source" : ["field3", "field4"] "_index" : "test", "_type" : "type", "_id" : "3", "_source" : { "include": ["user"], "exclude": ["user.location"] }' FieldsSpecific stored fields can be specified to be retrieved per document to get, similar to the fields parameter of the Get API. For example: curl 'localhost:9200/_mget' -d '{ "docs" : [ "_index" : "test", "_type" : "type", "_id" : "1", "fields" : ["field1", "field2"] "_index" : "test", "_type" : "type", "_id" : "2", "fields" : ["field3", "field4"] }'
Alternatively, you can specify the fields parameter in the query string as a default to be applied to all documents:
curl 'localhost:9200/test/type/_mget?fields=field1,field2' -d '{
    "docs" : [
        { "_id" : "1" },
        { "_id" : "2", "fields" : ["field3", "field4"] }
    ]
}'
See the section called "Generated fields" for fields that are generated only when indexing.
You can also specify routing value as a parameter:
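The routing example is not reproduced in this extract; a minimal sketch (the routing values key1 and key2 are illustrative) might be:
curl 'localhost:9200/_mget?routing=key1' -d '{
    "docs" : [
        { "_index" : "test", "_type" : "type", "_id" : "1", "_routing" : "key2" },
        { "_index" : "test", "_type" : "type", "_id" : "2" }
    ]
}'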
Bulk API
The REST API endpoint is /_bulk, and it expects the following JSON structure:
action_and_meta_data\n optional_source\n action_and_meta_data\n optional_source\n action_and_meta_data\n optional_source\n
NOTE: the final line of data must end with a newline character \n.
The possible actions are index, create, delete and update. index and create expect a source on the next line, and have the same semantics as the op_type parameter to the standard index API (i.e. create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary). delete does not expect a source on the following line, and has the same semantics as the standard delete API. update expects that the partial doc, upsert and script and its options are specified on the next line.
If you're providing text file input to curl, you must use the --data-binary flag instead of plain -d. The latter doesn't preserve newlines. Example:
$ cat requests { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } } { "field1" : "value1" } $ curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"; echo {"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1}}]}
Because this format uses literal \n's as delimiters, please be sure that the JSON actions and sources are not pretty printed. Here is an example of a correct sequence of bulk commands:
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }
In the above example, doc for the update action is a partial document that will be merged with the already stored document.
The endpoints are /_bulk, /{index}/_bulk, and {index}/{type}/_bulk.
When the index or the index/type are provided, they will be used by
default on bulk items that don’t provide them explicitly.
A note on the format: the idea here is to make processing of this as fast as possible. As some of the actions will be redirected to other shards on other nodes, only action_and_meta_data is parsed on the receiving node side. Client libraries using this protocol should try and strive to do something similar on the client side, and reduce buffering as much as possible.
The response to a bulk action is a large JSON structure with the individual results of each action that was performed. The failure of a single action does not affect the remaining actions.
There is no "correct" number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
If using the HTTP API, make sure that the client does not send HTTP chunks, as this will slow things down.
Each bulk item can include the version value using the _version / version field. It automatically follows the behavior of the index / delete operation based on the _version mapping. It also supports the version_type / _version_type (see versioning).
The delete by query API allows to delete documents from one or more indices and one or more types based on a query. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body. Here is an example:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/_query?q=user:kimchy'

$ curl -XDELETE 'http://localhost:9200/twitter/tweet/_query' -d '{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'
The query being sent in the body must be nested in a query key, the same as how the search api works.
{ "_indices" : { "twitter" : { "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }
Note, delete by query bypasses versioning support. Also, deleting large chunks of the data in an index is not recommended; many times it's better to simply reindex into a new index.
We can also delete within specific types:
Or even delete across all indices:
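The examples for those two cases are not reproduced in this extract; minimal sketches (the tweet2 type and tag query are illustrative) might look like:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet,tweet2/_query?q=user:kimchy'

$ curl -XDELETE 'http://localhost:9200/_all/_query?q=tag:wow'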
Name | Description |
---|---|
df |
The default field to use when no field prefix is defined within the query. |
analyzer |
The analyzer name to be used when analyzing the query string. |
default_operator |
The default operator to be used, can be AND or OR. Defaults to OR. |
The delete by query can use the Query DSL within its body in order to express the query that should be executed and delete all documents. The body content can also be passed as a REST parameter named source.
A Bulk UDP service is a service listening over UDP for bulk format requests. The idea is to provide a low latency UDP service that allows to easily index data that is not of critical nature.
The Bulk UDP service is disabled by default, but can be enabled by setting bulk.udp.enabled to true.
The bulk UDP service performs internal bulk aggregation of the data and then flushes it based on several parameters:
bulk.udp.bulk_actions
The number of actions to flush a bulk after, defaults to 1000.
bulk.udp.bulk_size
The size of the current bulk request to flush the request once exceeded, defaults to 5mb.
bulk.udp.flush_interval
An interval after which the current request is flushed, regardless of the above limits. Defaults to 5s.
bulk.udp.concurrent_requests
The number of max in flight bulk requests allowed. Defaults to 4.
The allowed network settings are:
bulk.udp.host
The host to bind to, defaults to network.host which defaults to any.
bulk.udp.port
The port to use, defaults to 9700-9800.
bulk.udp.receive_buffer_size
The receive buffer size, defaults to 10mb.
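Put together, a minimal sketch of the relevant elasticsearch.yml settings (the values shown are the defaults listed above, with the service enabled) might look like:
bulk.udp.enabled: true
bulk.udp.bulk_actions: 1000
bulk.udp.bulk_size: 5mb
bulk.udp.flush_interval: 5s
bulk.udp.concurrent_requests: 4
bulk.udp.port: 9700-9800
bulk.udp.receive_buffer_size: 10mb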
Here is an example of how it can be used:
> cat bulk.txt
{ "index" : { "_index" : "test", "_type" : "type1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_type" : "type1" } }
{ "field1" : "value1" }
> cat bulk.txt | nc -w 0 -u localhost 9700
Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting the realtime parameter to false.
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'
Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'
or by adding the requested fields in the request body (see example below). Fields can also be specified with wildcards in similar way to the multi match query .
Setting term_statistics to true (default is false) will additionally return the total term frequency (how often a term occurs in all documents) and the document frequency (the number of documents containing the current term).
Setting field_statistics to false (default is true) will omit the field statistics: the document count (how many documents contain this field), the sum of document frequencies and the sum of total term frequencies.
First, we create an index that stores term vectors, payloads etc. :
Second, we add some documents:
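The index-creation and indexing requests are not reproduced in this extract. A minimal sketch (the twitter index, tweet type, fulltext_analyzer and the two documents are illustrative) could look like:
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "type_as_payload"]
                }
            }
        }
    },
    "mappings": {
        "tweet": {
            "properties": {
                "text":     { "type": "string", "term_vector": "with_positions_offsets_payloads", "analyzer": "fulltext_analyzer" },
                "fullname": { "type": "string", "term_vector": "with_positions_offsets_payloads", "analyzer": "fulltext_analyzer" }
            }
        }
    }
}'

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "fullname" : "John Doe",
    "text" : "twitter test test test"
}'

curl -XPUT 'http://localhost:9200/twitter/tweet/2' -d '{
    "fullname" : "Jane Doe",
    "text" : "Another twitter test ..."
}'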
Term vectors can also be generated for artificial documents, that is for documents not present in the index. The syntax is similar to the percolator API. For example, the following request would return the same results as in example 1. The mapping used is determined by the index and type.
If dynamic mapping is turned on (default), the document fields not in the original mapping will be dynamically created.
curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{
    "doc" : {
        "fullname" : "John Doe",
        "text" : "twitter test test test"
    }
}'
{ "_index": "twitter", "_type": "tweet", "_version": 0, "found": true, "term_vectors": { "fullname": { "field_statistics": { "sum_doc_freq": 1, "doc_count": 1, "sum_ttf": 1 "terms": { "John Doe": { "term_freq": 1, "tokens": [ "position": 0, "start_offset": 0, "end_offset": 8 Search APIs
Multi termvectors API allows to get multiple termvectors at once. The
documents from which to retrieve the term vectors are specified by an index,
type and id. But the documents could also be artificially provided.
The response includes a docs array with all the fetched termvectors, each element having the structure provided by the termvectors API. Here is an example:
curl 'localhost:9200/_mtermvectors' -d '{
    "docs": [
        { "_index": "testidx", "_type": "test", "_id": "2", "term_statistics": true },
        { "_index": "testidx", "_type": "test", "_id": "1", "fields": ["text"] }
    ]
}'
See the termvectors API for a description of possible parameters.
The _mtermvectors endpoint can also be used against an index (in which case it is not required in the body):
curl 'localhost:9200/testidx/_mtermvectors' -d '{
    "docs": [
        { "_type": "test", "_id": "2", "fields": ["text"], "term_statistics": true },
        { "_type": "test", "_id": "1" }
    ]
}'
And type:
curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
    "docs": [
        { "_id": "2", "fields": ["text"], "term_statistics": true },
        { "_id": "1" }
    ]
}'
If all requested documents are on same index and have same type and also the parameters are the same, the request can be simplified:
curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
    "ids" : ["1", "2"],
    "parameters": {
        "fields": ["text"],
        "term_statistics": true
    }
}'
Additionally, just like for the termvectors API, term vectors could be generated for user provided documents. The syntax is similar to the percolator API. The mapping used is determined by _index and _type.
curl 'localhost:9200/_mtermvectors' -d '{
    "docs": [
        {
            "_index": "testidx",
            "_type": "test",
            "doc" : {
                "fullname" : "John Doe",
                "text" : "twitter test test test"
            }
        },
        {
            "_index": "testidx",
            "_type": "test",
            "doc" : {
                "fullname" : "Jane Doe",
                "text" : "Another twitter test ..."
            }
        }
    ]
}'
Search APIs
Most search APIs are multi-index, multi-type, with the exception of the Explain API endpoints.
A search can be associated with stats groups, which maintains a statistics aggregation per group. It can later be retrieved using the indices stats API specifically. For example, here is a search body request that associates the request with two different groups:
{ "query" : { "match_all" : {} "stats" : ["group1", "group2"] URI Search
The search API allows to execute a search query and get back search hits that match the query. The query can either be provided using a simple query string as a parameter , or using a request body .
All search APIs can be applied across multiple types within an index, and across multiple indices with support for the multi index syntax . For example, we can search on all documents across all types within the twitter index:
$ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'
We can also search within specific types:
$ curl -XGET 'http://localhost:9200/twitter/tweet,user/_search?q=user:kimchy'
We can also search all tweets with a certain tag across several indices (for example, when each user has his own index):
$ curl -XGET 'http://localhost:9200/kimchy,elasticsearch/tweet/_search?q=tag:wow'
Or we can search all tweets across all available indices using
_all
placeholder:
$ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=tag:wow'
Or even search across all indices and all types:
$ curl -XGET 'http://localhost:9200/_search?q=tag:wow'
And here is a sample response:
The parameters allowed in the URI are:
Name | Description |
---|---|
q |
The query string (maps to the query_string query, see Query String Query for more details). |
df |
The default field to use when no field prefix is defined within the query. |
analyzer |
The analyzer name to be used when analyzing the query string. |
lowercase_expanded_terms |
Should terms be automatically lowercased or not. Defaults to true. |
analyze_wildcard |
Should wildcard and prefix queries be analyzed or not. Defaults to false. |
default_operator |
The default operator to be used, can be AND or OR. Defaults to OR. |
lenient |
If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
explain |
For each hit, contain an explanation of how scoring of the hits was computed. |
_source |
Set to false to disable retrieval of the _source field. |
fields |
The selective stored fields of the document to return for each hit, comma delimited. Not specifying any value will cause no fields to return. |
sort |
Sorting to perform. Can either be in the form of fieldName, or fieldName:asc / fieldName:desc. The fieldName can either be an actual field within the document, or the special _score name to indicate sorting based on scores. There can be several sort parameters (order is important). |
track_scores |
When sorting, set to true in order to still track scores and return them as part of each hit. |
timeout |
A search timeout, bounding the search request to be executed within the specified time value and bail with the hits accumulated up to that point when expired. Defaults to no timeout. |
terminate_after |
[experimental] This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features. The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated early. Defaults to no terminate_after. |
from |
The starting from index of the hits to return. Defaults to 0. |
size |
The number of hits to return. Defaults to 10. |
search_type |
The type of the search operation to perform. Can be dfs_query_then_fetch, dfs_query_and_fetch, query_then_fetch or query_and_fetch. Defaults to query_then_fetch. |
The search request can be executed with a search DSL, which includes the Query DSL , within its body. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'
And here is a sample response:
{ "_shards":{ "total" : 5, "successful" : 5, "failed" : 0 "hits":{ "total" : 1, "hits" : [ "_index" : "twitter", "_type" : "tweet", "_id" : "1", "_source" : { "user" : "kimchy", "postDate" : "2009-11-15T14:12:12", "message" : "trying out Elasticsearch" }
timeout
A search timeout, bounding the search request to be executed within the specified time value and bail with the hits accumulated up to that point when expired. Defaults to no timeout. See the section called "Time units".
from
The starting from index of the hits to return. Defaults to 0.
size
The number of hits to return. Defaults to 10.
search_type
The type of the search operation to perform. Can be dfs_query_then_fetch, dfs_query_and_fetch, query_then_fetch or query_and_fetch. Defaults to query_then_fetch. See Search Type for more.
query_cache
Set to true or false to enable or disable the caching of search results for requests where ?search_type=count, i.e. aggregations and suggestions. See Shard query cache.
terminate_after
[experimental] This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features. The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated early. Defaults to no terminate_after.
The query element within the search request body allows to define a query using the Query DSL .
{ "query" : { "term" : { "user" : "kimchy" }
The order option can have the following values:
asc
Sort in ascending order.
desc
Sort in descending order.
The mode option controls which array value is picked when sorting on an array field. It can have the following values:
min
Pick the lowest value.
max
Pick the highest value.
sum
Use the sum of all values as sort value. Only applicable for number based array fields.
avg
Use the average of all values as sort value. Only applicable for number based array fields.
Allow to sort by _geo_distance. An example is shown below.
Note: the geo distance sorting supports sort_mode options: min, max and avg.
The following format is supported in providing the coordinates: an array in [lon, lat] form; note the order of lon/lat here, which is used in order to conform with GeoJSON.
{ "sort" : [ "_geo_distance" : { "pin.location" : [-70, 40], "order" : "asc", "unit" : "km" "query" : { "term" : { "user" : "kimchy" } }
Allows to control how the _source field is returned with every hit. You can turn off _source retrieval by using the _source parameter.
To disable _source retrieval set it to false:
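The example is not reproduced in this extract; a minimal sketch could look like the following (a wildcard pattern or an array of patterns can be passed instead of false to return only parts of the source):
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}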
Finally, for complete control, you can specify both include and exclude patterns:
{ "_source": { "include": [ "obj1.*", "obj2.*" ], "exclude": [ "*.description" ] "query" : { "term" : { "user" : "kimchy" } Script Fields
Allows to selectively load specific stored fields for each document represented by a search hit.
* can be used to load all stored fields from the document.
An empty array will cause only the _id and _type for each hit to be returned, for example:
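The example is not reproduced in this extract; minimal sketches (the field names are illustrative) could look like:
{
    "fields" : [],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

{
    "fields" : ["user", "postDate"],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}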
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to false), it will load the _source and extract it from it. This functionality has been replaced by the source filtering parameter.
Field values fetched from the document itself are always returned as an array. Metadata fields like _routing and _parent are never returned as an array.
Also only leaf fields can be returned via the field option. So object fields can't be returned and such requests will fail.
Script fields can also be automatically detected and used as fields, so
things like
_source.obj1.field1
can be used, though not recommended, as
obj1.field1
will work as well.
And one that will also exclude obj1.obj3:
Both include and exclude support multiple patterns:
{ "query" : { "match_all" : {} "partial_fields" : { "partial1" : { "include" : ["obj1.obj2.*", "obj1.obj4.*"], "exclude" : "obj1.obj3.*" Field Data Fields
Allows to return a script evaluation (based on different fields) for each hit, for example:
{ "query" : { "script_fields" : { "test1" : { "script" : "doc['my_field_name'].value * 2" "test2" : { "script" : "doc['my_field_name'].value * factor", "params" : { "factor" : 2.0 }
Script fields can work on fields that are not stored (my_field_name in the above case), and allow custom values to be returned (the evaluated value of the script).
Script fields can also access the actual _source document indexed and extract specific elements to be returned from it (can be an "object" type). Here is an example:
{ "query" : { "script_fields" : { "test1" : { "script" : "_source.obj1.obj2" }
Note the _source keyword here to navigate the json-like model.
It's important to understand the difference between doc['my_field'].value and _source.my_field. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can't return a json object from it) and makes sense only for non-analyzed or single term based fields.
The _source on the other hand causes the source to be loaded, parsed, and then only the relevant part of the json is returned.
Field Data Fields
Allows to return the field data representation of a field for each hit, for example:
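The example is not reproduced in this extract; a minimal sketch (the field names test1 and test2 are illustrative) could look like:
{
    "query" : {
        "match_all" : {}
    },
    "fielddata_fields" : ["test1", "test2"]
}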
Imagine that you are selling shirts, and the user has specified two filters: color:red and brand:gucci. You only want to show them red shirts made by Gucci in the search results. Normally you would do this with a filtered query:
curl -XGET localhost:9200/shirts/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "color": "red"   }},
            { "term": { "brand": "gucci" }}
          ]
        }
      }
    }
  }
}'
However, you would also like to use faceted navigation to display a list of other options that the user could click on. Perhaps you have a model field that would allow the user to limit their search results to red Gucci t-shirts or dress-shirts.
This can be done with a terms aggregation:
curl -XGET localhost:9200/shirts/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "color": "red"   }},
            { "term": { "brand": "gucci" }}
          ]
        }
      }
    }
  },
  "aggs": {
    "models": {
      "terms": { "field": "model" }
    }
  }
}'
Returns the most popular models of red shirts by Gucci.
curl -XGET localhost:9200/shirts/_search -d '{
  "query": {
    "filtered": {
      "filter": { "term": { "brand": "gucci" }}
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    },
    "color_red": {
      "filter": { "term": { "color": "red" }},
      "aggs": {
        "models": {
          "terms": { "field": "model" }
        }
      }
    }
  },
  "post_filter": {
    "term": { "color": "red" }
  }
}'
The main query now finds all shirts by Gucci, regardless of color. The colors agg returns popular colors for shirts by Gucci. The color_red agg limits the models sub-aggregation to red Gucci shirts. Finally, the post_filter removes colors other than red from the search hits.
The postings highlighter does support highlighting of multi term queries, like prefix queries, wildcard queries and so on. On the other hand, this requires the queries to be rewritten using a proper rewrite method that supports multi term extraction, which is a potentially expensive operation.
The fast vector highlighter:
Is faster especially for large fields (> 1MB)
Can be customized with boundary_chars, boundary_max_scan, and fragment_offset (see below)
Requires setting term_vector to with_positions_offsets which increases the size of the index
Can combine matches from multiple fields into one result. See matched_fields
Can assign different weights to matches at different positions allowing for things like phrase matches being sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches
Here is an example of setting the content field to allow for highlighting using the fast vector highlighter on it (this will cause the index to be bigger):
{ "type_name" : { "content" : {"term_vector" : "with_positions_offsets"} }
Rescoring can help to improve precision by reordering just the top (eg
100 - 500) documents returned by the
query
and
post_filter
phases, using a
secondary (usually more costly) algorithm, instead of applying the
costly algorithm to all documents in the index.
A
rescore
request is executed on each shard before it returns its
results to be sorted by the node handling the overall search request.
Currently the rescore API has only one implementation: the query rescorer, which uses a query to tweak the scoring. In the future, alternative rescorers may be made available, for example, a pair-wise rescorer.
The rescore phase is not executed when search_type is set to scan or count.
when exposing pagination to your users, you should not change
window_size
as you step through each page (by passing different
from
values) since that can alter the top hits causing results to
confusingly shift as the user steps through pages.
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to from and size.
By default the scores from the original query and the rescore query are combined linearly to produce the final _score for each document. The relative importance of the original query and of the rescore query can be controlled with the query_weight and rescore_query_weight respectively. Both default to 1.
For example:
curl -s -XPOST 'localhost:9200/_search' -d '{
   "query" : {
      "match" : {
         "field1" : {
            "operator" : "or",
            "query" : "the quick brown",
            "type" : "boolean"
         }
      }
   },
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "match" : {
               "field1" : {
                  "query" : "the quick brown",
                  "type" : "phrase",
                  "slop" : 2
               }
            }
         },
         "query_weight" : 0.7,
         "rescore_query_weight" : 1.2
      }
   }
}'
The way the scores are combined can be controlled with the score_mode:
Score Mode | Description |
---|---|
total |
Add the original score and the rescore query score. The default. |
multiply |
Multiply the original score by the rescore query score. Useful for function query rescores. |
avg |
Average the original score and the rescore query score. |
max |
Take the max of original score and the rescore query score. |
min |
Take the min of the original score and the rescore query score. |
The scan search type disables sorting in order to allow very efficient scrolling through large result sets. See Efficient scrolling with Scroll-Scan for more.
The results that are returned from a scroll request reflect the state of
the index at the time that the initial
search
request was made, like a
snapshot in time. Subsequent changes to documents (index, update or delete)
will only affect later search requests.
In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the "search context" alive (see Keeping the search context alive), eg ?scroll=1m.
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}'
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1'
Deep pagination with from and size — e.g. ?size=10&from=10000 — is very inefficient as (in this example) 100,000 sorted results have to be retrieved from each shard and resorted in order to return just 10 results. This process has to be repeated for every page requested.
The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.
Normally, you just want to retrieve all results and the order doesn’t matter.
Scrolling can be combined with the scan search type to disable any scoring or sorting and to return results in the most efficient way possible. All that is needed is to add search_type=scan to the query string of the initial search request:
curl 'localhost:9200/twitter/tweet/_search?scroll=1m&search_type=scan' -d '{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}'
Setting search_type to scan disables sorting and makes scrolling very efficient.
A scanning scroll request differs from a standard scroll request in four ways:
No scoring or sorting is performed and documents are returned in the order they exist in the index.
Aggregations are not supported.
The response of the initial search request will not contain any results in the hits array. The first results will be returned by the first scroll request.
The size parameter controls the number of results per shard, not per request, so a size of 10 which hits 5 shards will return a maximum of 50 results per scroll request.
If you want the scoring to happen, even without sorting on it, set the track_scores parameter to true.
The scroll parameter (passed to the search request and to every scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see the section called "Time units") does not need to be long enough to process all data — it just needs to be long enough to process the previous batch of results. Each scroll request (with the scroll parameter) sets a new expiry time.
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
Keeping older segments alive means that more file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See the section called “File Descriptors ”.
You can check how many search contexts are open with the nodes stats API :
curl -XGET localhost:9200/_nodes/stats/indices/search?pretty
The preference is a query string parameter which can be set to:
_primary
The operation will go and be executed only on the primary shards.
_primary_first
The operation will go and be executed on the primary shard, and if not available (failover), will execute on other shards.
_local
The operation will prefer to be executed on a locally allocated shard if possible.
_only_node:xyz
Restricts the search to execute only on a node with the provided node id (xyz in this case).
_prefer_node:xyz
Prefers execution on the node with the provided node id (xyz in this case) if applicable.
_shards:2,3
Restricts the operation to the specified shards (2 and 3 in this case). This preference can be combined with other preferences but it has to appear first: _shards:2,3;_primary
_only_nodes
Restricts the operation to nodes specified in the node specification (see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster.html). [1.7.0] Added in 1.7.0.
Custom (string) value
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
For instance, use the user’s session ID to ensure consistent ordering of results for the user:
curl localhost:9200/_search?preference=xyzabc123 -d '{
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    }
}'
Enables explanation for each hit on how its score was computed.
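The example is not reproduced in this extract; a minimal sketch could look like:
{
    "explain": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}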
Exclude documents which have a _score less than the minimum specified in min_score:
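The example is not reproduced in this extract; a minimal sketch (the threshold value is illustrative) could look like:
{
    "min_score": 0.5,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}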
Note, most times, this does not make much sense, but is provided for advanced use cases.
The parent/child and nested features allow the return of documents that have matches in a different scope. In the parent/child case, parent documents are returned based on matches in child documents, or child documents are returned based on matches in parent documents. In the nested case, documents are returned based on matches in nested inner objects.
In both cases, the actual matches in the different scopes that caused a document to be returned are hidden. In many cases it's very useful to know which inner nested objects (in the case of nested), or which children or parent documents (in the case of parent/child), caused certain information to be returned. The inner hits feature can be used for this: for each search hit in the search response, it returns the additional nested hits that caused the search hit to match in a different scope.
Inner hits can be used by defining an inner_hits definition on a nested, has_child or has_parent query and filter.
The structure looks like this:
"<query>" : { "inner_hits" : { <inner_hits_options> }
If inner_hits is defined on a query that supports it then each search hit will contain an inner_hits json object with the following structure:
"hits": [ "_index": ..., "_type": ..., "_id": ..., "inner_hits": { "<inner_hits_name>": { "hits": { "total": ..., "hits": [ "_type": ..., "_id": ..., ]
Inner hits support the following options:
from
The offset from where the first hit to fetch for each inner_hits in the returned regular search hits.
size
The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.
sort
How the inner hits should be sorted per inner_hits. By default the hits are sorted by the score.
name
The name to be used for the particular inner hit definition in the response. Useful when multiple inner hits have been defined in a single search request. The default depends on which query the inner hit is defined in: for the has_child query and filter this is the child type, for the has_parent query and filter this is the parent type, and for the nested query and filter this is the nested path.
Inner hits also supports the following per document features:
The nested
inner_hits
can be used to include nested inner objects as inner hits to a search hit.
The example below assumes that there is a nested object field defined with the name
comments
:
{ "query" : { "nested" : { "path" : "comments", "query" : { "match" : {"comments.message" : "[actual query]"} "inner_hits" : {}The inner hit definition in the nested query. No other options need to be defined.
An example of a response snippet that could be generated from the above search request:
... "hits": { "hits": [ "_index": "my-index", "_type": "question", "_id": "1", "_source": ..., "inner_hits": { "comments": {"hits": { "total": ..., "hits": [ "_type": "question", "_id": "1", "_nested": { "field": "comments", "offset": 2 "_source": ... The name used in the inner hit definition in the search request. A custom key can be used via the
name
option.
The parent/child inner_hits can be used to include parent or child inner hits with a search hit.
The example below assumes that there is a _parent field mapping in the comment type:
{
    "query" : {
        "has_child" : {
            "type" : "comment",
            "query" : {
                "match" : {"message" : "[actual query]"}
            },
            "inner_hits" : {}
        }
    }
}
The inner hit definition like in the nested example.
An example of a response snippet that could be generated from the above search request:
An example that shows the use of nested inner hits via the top level notation:
{
    "query" : {
        "nested" : {
            "path" : "comments",
            "query" : {
                "match" : {"comments.message" : "[actual query]"}
            }
        }
    },
    "inner_hits" : {
        "comment" : {
            "path" : {
                "comments" : {
                    "query" : {
                        "match" : {"comments.message" : "[different query]"}
                    }
                }
            }
        }
    }
}
The inner hit definition is nested and requires the path option, which refers to the nested object field comments. The query runs to collect the nested inner documents for each search hit returned. If no query is defined, all nested inner documents belonging to a search hit will be included. This shows that the top level inner hit definition only makes sense if no query or a different query is specified.
Additional options that are only available when using the top level inner hits notation:
path
Defines the nested scope where hits will be collected from.
type
Defines the parent or child type scope where hits will be collected from.
query
Defines the query that will run in the defined nested, parent or child scope to collect and score hits. By default all documents in the scope will be matched.
For more information on Mustache templating and what kind of templating you can do with it, check out the online documentation of the mustache project.
The mustache language is implemented in Elasticsearch as a sandboxed scripting language, hence it obeys settings that may be used to enable or disable scripts per language, source and operation as described in the scripting docs. [1.6.0] Added in 1.6.0: mustache scripts were always on before and it wasn't possible to disable them.
A default value is written as {{var}}{{^var}}default{{/var}}. For instance, when params is { "start": 10, "end": 15 } the query is rendered with both values filled in, but when params is { "start": 10 } the query uses the default value for end.
{ "params": { "text": "words to search for", "line_no": {"start": 10,
"end": 20
All three of these elements are optional.
{ "query": { "filtered": { "query": { "match": { "line": "{{text}}""filter": { {{#line_no}}
"range": { "line_no": { {{#start}}
"gte": "{{start}}"
{{#end}},{{/end}}
{{/start}}
{{#end}}
"lte": "{{end}}"
{{/end}}
{{/line_no}}
Fill in the value of param
text
Include therange
filter only ifline_no
is specified Include thegte
clause only ifline_no.start
is specified Fill in the value of paramline_no.start
Add a comma after thegte
clause only ifline_no.start
ANDline_no.end
are specified Include thelte
clause only ifline_no.end
is specified Fill in the value of paramline_no.end
As written above, this template is not valid JSON because it includes the section markers like {{#line_no}}. For this reason, the template should either be stored in a file (see the section called "Pre-registered template") or, when used via the REST API, should be written as a string:
"template": "{\"query\":{\"filtered\":{\"query\":{\"match\":{\"line\":\"{{text}}\"}},\"filter\":{{{#line_no}}\"range\":{\"line_no\":{{{#start}}\"gte\":\"{{start}}\"{{#end}},{{/end}}{{/start}}{{#end}}\"lte\":\"{{end}}\"{{/end}}}}{{/line_no}}}}}}"
GET /_search/template
{
    "template": {
        "file": "storedTemplate"
    },
    "params": {
        "query_string": "search for these words"
    }
}
Name of the query template in config/scripts/, i.e. storedTemplate.mustache.
This template can be retrieved with a GET request and deleted with a DELETE request on the template endpoint.
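The concrete calls are not shown in this extract; assuming a template indexed under the id templateName, sketches might look like:
GET /_search/template/templateName

DELETE /_search/template/templateName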
To use an indexed template at search time use:
GET /_search/template
{
    "template": {
        "id": "templateName"
    },
    "params": {
        "query_string": "search for these words"
    }
}
Name of the query template stored in the .scripts index.
Aggregations
The index and type parameters may be single values, or comma-separated.
This will yield the following result:
And specifying the same request, this time with a routing value:
This will yield the following result:
routing
A comma-separated list of routing values to take into account when determining which shards a request would be executed against.
preference
Controls a preference of which shard replicas to execute the search request on. By default, the operation is randomized between the shard replicas. See the preference documentation for a list of all acceptable values.
local
A boolean value whether to read the cluster state locally in order to determine where shards are allocated instead of using the Master node's cluster state.
Min Aggregation
Aggregations grew out of the facets module and the long experience of how users use it (and would like to use it) for real-time data analytics purposes. As such, it serves as the next generation replacement for the functionality we currently refer to as "faceting".
Facets provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that can be defined (filtered queries, top-level filters, and facet level filters). While powerful, their implementation is not designed from the ground up to support complex aggregations and is thus limited.
The aggregations module breaks the barriers the current facet implementation put in place. The new name ("Aggregations") also indicates the intention here - a generic yet extremely powerful framework for building aggregations - any types of aggregations.
An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents. The context of the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed query/filters of the search request).
There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into two main families: bucketing aggregations, which build sets of documents (buckets) based on some criterion, and metric aggregations, which keep track of and compute metrics over a set of documents.
The interesting part comes next. Since each bucket effectively defines a document set (all documents belonging to the bucket), one can potentially associate aggregations on the bucket level, and those will execute within the context of that bucket. This is where the real power of aggregations kicks in: aggregations can be nested!
Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations will be computed for the buckets which their parent aggregation generates. There is no hard limit on the level/depth of nested aggregations (one can nest an aggregation under a "parent" aggregation, which is itself a sub-aggregation of another higher-level aggregation).
Structuring Aggregations
The following snippet captures the basic structure of aggregations:
Values Source
Some aggregations work on values extracted from the aggregated documents. Typically, the values will be extracted from
a specific document field which is set using the
The
When both
Metrics AggregationsBucket AggregationsCaching heavy aggregationsSee Shard query cache for more details. Returning only aggregation results
Setting
Computing the min price value across all documents: { "aggs" : { "min_price_in_euros" : { "min" : { "field" : "price", "script" : "_value * conversion_rate", "params" : { "conversion_rate" : 1.2 Sum Aggregation Computing the max price value across all documents { "aggs" : { "max_price_in_euros" : { "max" : { "field" : "price", "script" : "_value * conversion_rate", "params" : { "conversion_rate" : 1.2 Avg Aggregation Computing the intraday return based on a script: Computing the sum of squares over all stock tick changes: { "aggs" : { "aggs" : { "daytime_return" : { "sum" : { "field" : "change", "script" : "_value * _value" } Stats Aggregation Assuming the data consists of documents representing exams grades (between 0 and 100) of students Computing the average grade based on a script: { "aggs" : { "aggs" : { "avg_corrected_grade" : { "avg" : { "field" : "grade", "script" : "_value * correction", "params" : { "correction" : 1.2 Extended Stats Aggregation
The stats that are returned consist of:
Assuming the data consists of documents representing exams grades (between 0 and 100) of students Computing the grades stats based on a script: { "aggs" : { "aggs" : { "grades_stats" : { "stats" : { "field" : "grade", "script" : "_value * correction", "params" : { "correction" : 1.2 Value Count Aggregation
The
Assuming the data consists of documents representing exams grades (between 0 and 100) of students { "aggs" : { "grades_stats" : { "extended_stats" : { "field" : "grade" } } }
The above aggregation computes the grades statistics over all documents. The aggregation type is
{ "aggregations": { "grade_stats": { "count": 9, "min": 72, "max": 99, "avg": 86, "sum": 774, "sum_of_squares": 67028, "variance": 51.55555555555556, "std_deviation": 7.180219742846005, "std_deviation_bounds": { "upper": 100.36043948569201, "lower": 71.63956051430799 }
The name of the aggregation (
|
Computing the grades stats based on a script:
{ "aggs" : { "aggs" : { "grades_stats" : { "extended_stats" : { "field" : "grade", "script" : "_value * correction", "params" : { "correction" : 1.2 Percentiles Aggregation
Let’s look at a range of percentiles representing load time:
{ "aggs" : { "load_time_outlier" : { "percentiles" : { "field" : "load_time"The field
load_time
must be a numeric field
{ "aggs" : { "load_time_outlier" : { "percentiles" : { "field" : "load_time", "percents" : [95, 99, 99.9]Use the
percents
parameter to specify particular percentiles to calculate
{ "aggs" : { "load_time_outlier" : { "percentiles" : { "script" : "doc['load_time'].value / timeUnit","params" : { "timeUnit" : 1000
The
field
parameter is replaced with ascript
parameter, which uses the script to generate values which percentiles are calculated on Scripting supports parameterized input just like any other script
The algorithm used by the percentile metric is called TDigest (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests).
When using this metric, there are a few guidelines to keep in mind:
Accuracy is proportional to q(1-q). This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.
For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated.
The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:
It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
Please see Percentiles are (usually) approximate and Compression for advice regarding approximation and memory use of the percentile ranks aggregation
Percentile rank show the percentage of observed values which are below certain value. For example, if a value is greater than or equal to 95% of the observed values it is said to be at the 95th percentile rank.
Assume your data consists of website load times. You may have a service agreement that 95% of page loads completely within 15ms and 99% of page loads complete within 30ms.
Let’s look at a range of percentiles representing load time:
{ "aggs" : { "load_time_outlier" : { "percentile_ranks" : { "field" : "load_time","values" : [15, 30] The field
load_time
must be a numeric field
The response will contain, for each of the provided values, the percentage of observed values that fall below it.
Just as with the percentiles aggregation, the values can also be generated by a script:
{
    "aggs" : {
        "load_time_outlier" : {
            "percentile_ranks" : {
                "values" : [3, 5],
                "script" : "doc['load_time'].value / timeUnit",
                "params" : {
                    "timeUnit" : 1000
                }
            }
        }
    }
}
The field parameter is replaced with a script parameter, which uses the script to generate the values on which the percentile ranks are calculated. Scripting supports parameterized input just like any other script.
Assume you are indexing books and would like to count the unique authors that match a query:
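The example is not reproduced in this extract; a minimal sketch of a cardinality aggregation (the author field is illustrative) could look like:
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}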
This aggregation also supports the precision_threshold and rehash options:
{ "aggs" : { "author_count" : { "cardinality" : { "field" : "author_hash", "precision_threshold": 100,"rehash": false
The precision_threshold option allows to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value depends on the number of parent aggregations that create multiple buckets (such as terms or histograms).
If you computed a hash on the client side, stored it into your documents and want Elasticsearch to use it to compute counts without rehashing values, it is possible to specify rehash: false. The default value is true. Please note that the hash must be indexed as a long when rehash is false.
This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties.
For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
The following chart shows how the error varies before and after the threshold:
For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.
A metric aggregation that computes the bounding box containing all geo_point values for a field.
{ "query" : { "match" : { "business_type" : "shop" } "aggs" : { "viewport" : { "geo_bounds" : { "field" : "location","wrap_longitude" : true
The geo_bounds aggregation specifies the field to use to obtain the bounds. wrap_longitude is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true.
The response for the above aggregation:
{ "aggregations": { "viewport": { "bounds": { "top_left": { "lat": 80.45, "lon": -160.22 "bottom_right": { "lat": 40.65, "lon": 42.57 Scripted Metric Aggregation
If the
top_hits
aggregator is wrapped in a
nested
or
reverse_nested
aggregator then nested hits are being returned.
Nested hits are in a sense hidden mini documents that are part of regular document where in the mapping a nested field type
has been configured. The
top_hits
aggregator has the ability to un-hide these documents if it is wrapped in a
nested
or
reverse_nested
aggregator. Read more about nested in the
nested type mapping
.
If nested type has been configured a single document is actually indexed as multiple Lucene documents and they share
the same id. In order to determine the identity of a nested hit there is more needed than just the id, so that is why
nested hits also include their nested identity. The nested identity is kept under the
_nested
field in the search hit
and includes the array field and the offset in the array field the nested hit belongs to. The offset is zero based.
Top hits response snippet with a nested hit, which resides in the third slot of array field
nested_field1
in document with id
1
:
... "hits": { "total": 25365, "max_score": 1, "hits": [ "_index": "a", "_type": "b", "_id": "1", "_score": 1, "_nested" : { "field" : "nested_field1", "offset" : 2 "_source": ... ...
If
_source
is requested then just the part of the source of the nested object is returned, not the entire source of the document.
Also stored fields on the
nested
inner object level are accessible via
top_hits
aggregator residing in a
nested
or
reverse_nested
aggregator.
Only nested hits will have a
_nested
field in the hit, non nested (regular) hits will not have a
_nested
field.
The information in
_nested
can also be used to parse the original source somewhere else if
_source
isn’t enabled.
If there are multiple levels of nested object types defined in mappings then the
_nested
information can also be hierarchical
in order to express the identity of nested hits that are two layers deep or more.
In the example below a nested hit resides in the first slot of the field nested_grand_child_field, which in turn resides in the second slot of the nested_child_field field:
... "hits": { "total": 2565, "max_score": 1, "hits": [ "_index": "a", "_type": "b", "_id": "1", "_score": 1, "_nested" : { "field" : "nested_child_field", "offset" : 1, "_nested" : { "field" : "nested_grand_child_field", "offset" : 0 "_source": ... Global Aggregation
A metric aggregation that executes using scripts to provide a metric output.
{ "query" : { "match_all" : {} "aggs": { "profit": { "scripted_metric": { "init_script" : "_agg['transactions'] = []", "map_script" : "if (doc['type'].value == \"sale\") { _agg.transactions.add(doc['amount'].value) } else { _agg.transactions.add(-1 * doc['amount'].value) }","combine_script" : "profit = 0; for (t in _agg.transactions) { profit += t }; return profit", "reduce_script" : "profit = 0; for (a in _aggs) { profit += a }; return profit"
map_script is the only required parameter.
The response for the above aggregation:
The scripted metric aggregation uses scripts at 4 stages of its execution: init_script, map_script, combine_script and reduce_script.
init_script - Executed prior to any collection of documents and allows the aggregation to set up any initial state. In the above example, the init_script creates an array transactions in the _agg object.
map_script - Executed once per document collected; this is the only required script. The script has access to the _agg object. In the above example, the map_script checks the value of the type field. If the value is sale the value of the amount field is added to the transactions array. If the value of the type field is not sale the negated value of the amount field is added to transactions.
combine_script - Executed once on each shard after document collection is complete and allows the aggregation to consolidate the state returned from each shard. In the above example, the combine_script iterates through all the stored transactions, summing the values in the profit variable and finally returns profit.
reduce_script - Executed once on the coordinating node after all shards have returned their results. The script has access to a variable _aggs which is an array of the results of the combine_script on each shard. If a reduce_script is not provided the reduce phase will return the _aggs variable. In the above example, the reduce_script iterates through the profit returned by each shard, summing the values before returning the final combined profit, which will be returned in the response of the aggregation.
Imagine a situation where you index the following documents into an index with 2 shards:
params
Optional. An object whose contents will be passed as variables to the init_script, map_script and combine_script. This can be useful to allow the user to control the behavior of the aggregation and for storing state between the scripts. If this is not specified, the default is the equivalent of providing:
"params" : {
    "_agg" : {}
}
reduce_params
Optional. An object whose contents will be passed as variables to the reduce_script.
The response for the above aggregation:
{ "aggregations" : { "all_products" : { "doc_count" : 100,"avg_price" : { "value" : 56.3 The number of documents that were aggregated (in our case, all documents within the search context) Filters Aggregation
In the above example, we calculate the average price of all the products that are red.
{ "aggs" : { "red_products" : { "doc_count" : 100, "avg_price" : { "value" : 56.3 } Missing Aggregation
The filters field can also be provided as an array of filters, as in the following request:
... "aggs" : { "messages" : { "buckets" : [ "doc_count" : 34, "monthly" : { "buckets : [ ... // the histogram monthly breakdown "doc_count" : 439, "monthly" : { "buckets" : [ ... // the histogram monthly breakdown Nested Aggregation
In the above example, we get the total number of products that do not have a price.
A special single bucket aggregation that enables aggregating nested documents.
{ "product" : { "properties" : { "resellers" : {"type" : "nested", "properties" : { "name" : { "type" : "string" }, "price" : { "type" : "double" } The
resellers
is an array that holds nested documents under theproduct
object.
The following aggregations will return the minimum price products can be purchased in:
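The request is not reproduced in this extract; a minimal sketch against the mapping above (the led tv query is illustrative) could look like:
{
    "query" : {
        "match" : { "name" : "led tv" }
    },
    "aggs" : {
        "resellers" : {
            "nested" : {
                "path" : "resellers"
            },
            "aggs" : {
                "min_price" : { "min" : { "field" : "resellers.price" } }
            }
        }
    }
}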
The reverse_nested aggregation must be defined inside a nested aggregation.
{ "issue" : { "properties" : { "tags" : { "type" : "string" } "comments" : {"type" : "nested" "properties" : { "username" : { "type" : "string", "index" : "not_analyzed" }, "comment" : { "type" : "string" } The
comments
is an array that holds nested documents under theissue
object.
{ "query": { "match": { "name": "led tv" "aggs": { "comments": { "nested": { "path": "comments" "aggs": { "top_usernames": { "terms": { "field": "comments.username" "aggs": { "comment_to_issue": { "reverse_nested": {},"aggs": { "top_tags_per_comment": { "terms": { "field": "tags" }
A
|
{ "aggregations": { "comments": { "top_usernames": { "buckets": [ "key": "username_1", "doc_count": 12, "comment_to_issue": { "top_tags_per_comment": { "buckets": [ "key": "tag1", "doc_count": 9 Terms Aggregation
This aggregation relies on the _parent field in the mapping. This aggregation has a single option:
type - The child type that the buckets in the parent space should be mapped to.
For example, let's say we have an index of questions and answers. The answer type has the following _parent field in the mapping:
{
    "answer" : {
        "_parent" : {
            "type" : "question"
        }
    }
}
The question typed documents contain a tags field and the answer typed documents contain an owner field. With the children aggregation the tag buckets can be mapped to the owner buckets in a single request, even though the two fields exist in two different kinds of documents.
An example of a question typed document:
{ "body": "<p>I have Windows 2003 server and i bought a new Windows 2008 server...", "title": "Whats the best way to file transfer my site from server to a newer one?", "tags": [ "windows-server-2003", "windows-server-2008", "file-transfer" }
An example of an answer typed document:
{ "owner": { "location": "Norfolk, United Kingdom", "display_name": "Sam", "id": 48 "body": "<p>Unfortunately your pretty much limited to FTP...", "creation_date": "2009-05-04T13:45:37.030" }
The following request can be built that connects the two together:
{ "aggs": { "top-tags": { "terms": { "field": "tags", "size": 10 "aggs": { "to-answers": { "children": { "type" : "answer""aggs": { "top-names": { "terms": { "field": "owner.display_name", "size": 10 The
type
points to type / mapping with the nameanswer
.
The above example returns the top question tags and per tag the top answer owners.
{ "aggregations": { "top-tags": { "buckets": [ "key": "windows-server-2003", "doc_count": 25365,"to-answers": { "doc_count": 36004,
"top-names": { "buckets": [ "key": "Sam", "doc_count": 274 "key": "chris", "doc_count": 19 "key": "david", "doc_count": 14 "key": "linux", "doc_count": 18342, "to-answers": { "doc_count": 6655, "top-names": { "buckets": [ "key": "abrams", "doc_count": 25 "key": "ignacio", "doc_count": 25 "key": "vazquez", "doc_count": 25 "key": "windows", "doc_count": 18119, "to-answers": { "doc_count": 24051, "top-names": { "buckets": [ "key": "molly7244", "doc_count": 265 "key": "david", "doc_count": 27 "key": "chris", "doc_count": 26 "key": "osx", "doc_count": 10971, "to-answers": { "doc_count": 5902, "top-names": { "buckets": [ "key": "diago", "doc_count": 4 "key": "albert", "doc_count": 3 "key": "asmus", "doc_count": 3 "key": "ubuntu", "doc_count": 8743, "to-answers": { "doc_count": 8784, "top-names": { "buckets": [ "key": "ignacio", "doc_count": 9 "key": "abrams", "doc_count": 8 "key": "molly7244", "doc_count": 8 "key": "windows-xp", "doc_count": 7517, "to-answers": { "doc_count": 13610, "top-names": { "buckets": [ "key": "molly7244", "doc_count": 232 "key": "chris", "doc_count": 9 "key": "john", "doc_count": 9 "key": "networking", "doc_count": 6739, "to-answers": { "doc_count": 2076, "top-names": { "buckets": [ "key": "molly7244", "doc_count": 6 "key": "alnitak", "doc_count": 5 "key": "chris", "doc_count": 3 "key": "mac", "doc_count": 5590, "to-answers": { "doc_count": 999, "top-names": { "buckets": [ "key": "abrams", "doc_count": 2 "key": "ignacio", "doc_count": 2 "key": "vazquez", "doc_count": 2 "key": "wireless-networking", "doc_count": 4409, "to-answers": { "doc_count": 6497, "top-names": { "buckets": [ "key": "molly7244", "doc_count": 61 "key": "chris", "doc_count": 5 "key": "mike", "doc_count": 5 "key": "windows-8", "doc_count": 3601, "to-answers": { "doc_count": 4263, "top-names": { "buckets": [ "key": "molly7244", "doc_count": 3 "key": "msft", "doc_count": 2 "key": "user172132", "doc_count": 2 The number of question documents with the tag
windows-server-2003
. The number of answer documents that are related to question documents with the tag windows-server-2003.
Significant Terms Aggregation
{ "aggregations" : { "genders" : { "doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,
"buckets" : [
"key" : "male", "doc_count" : 10 "key" : "female", "doc_count" : 10 an upper bound of the error on the document counts for each term, see below when there are lots of unique terms, elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response the list of the top buckets, the meaning of
top
being defined by the order
By default, the
terms
aggregation will return the buckets for the top ten terms ordered by the
doc_count
. One can
change this default behaviour by setting the
size
parameter.
 | Shard A | Shard B | Shard C |
---|---|---|---|
1 | Product A (25) | Product A (30) | Product A (45) |
2 | Product B (18) | Product B (25) | Product C (44) |
3 | Product C (6) | Product F (17) | Product Z (36) |
4 | Product D (3) | Product Z (16) | Product G (30) |
5 | Product E (2) | Product G (15) | Product E (29) |
6 | Product F (2) | Product H (14) | Product H (28) |
7 | Product G (2) | Product I (10) | Product Q (2) |
8 | Product H (2) | Product Q (6) | Product D (1) |
9 | Product I (1) | Product J (8) | |
10 | Product J (1) | Product C (4) | |
The shards will return their top 5 terms so the results from the shards will be:
 | Shard A | Shard B | Shard C |
---|---|---|---|
1 | Product A (25) | Product A (30) | Product A (45) |
2 | Product B (18) | Product B (25) | Product C (44) |
3 | Product C (6) | Product F (17) | Product Z (36) |
4 | Product D (3) | Product Z (16) | Product G (30) |
5 | Product E (2) | Product G (15) | Product E (29) |

Taking the top 5 terms from the combined shard results then gives:

 | Combined |
---|---|
1 | Product A (100) |
2 | Product Z (52) |
3 | Product C (50) |
4 | Product G (45) |
5 | Product B (43) |
Ordering the buckets by their
doc_count
in an ascending manner:
Ordering the buckets alphabetically by their terms in an ascending manner:
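The request snippets for these two orderings did not survive here; minimal sketches (assuming a hypothetical gender field), using the built-in _count and _term order keys:
{
  "aggs" : {
    "genders" : {
      "terms" : {
        "field" : "gender",
        "order" : { "_count" : "asc" }
      }
    }
  }
}
and alphabetically by term:
{
  "aggs" : {
    "genders" : {
      "terms" : {
        "field" : "gender",
        "order" : { "_term" : "asc" }
      }
    }
  }
}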
Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name):
Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name):
The path must be defined in the following form:
AGG_SEPARATOR       :=  '>'
METRIC_SEPARATOR    :=  '.'
AGG_NAME            :=  <the name of the aggregation>
METRIC              :=  <the name of the metric (in case of multi-value metrics aggregation)>
PATH                :=  <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
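The request that the next sentence refers to is missing here; a minimal sketch of what it could look like, assuming hypothetical country, gender and height fields (the order path females>height_stats.avg follows the grammar above):
{
  "aggs" : {
    "countries" : {
      "terms" : {
        "field" : "country",
        "order" : { "females>height_stats.avg" : "desc" }
      },
      "aggs" : {
        "females" : {
          "filter" : { "term" : { "gender" : "female" } },
          "aggs" : {
            "height_stats" : { "stats" : { "field" : "height" } }
          }
        }
      }
    }
  }
}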
The above will sort the countries buckets based on the average height among the female population.
The regular expressions are based on the Java™ Pattern class, and as such, it is also possible to pass in flags that will determine how the compiled regular expression will work:
{ "aggs" : { "tags" : { "terms" : { "field" : "tags", "include" : { "pattern" : ".*sport.*", "flags" : "CANON_EQ|CASE_INSENSITIVE""exclude" : { "pattern" : "water_.*", "flags" : "CANON_EQ|CASE_INSENSITIVE" the flags are concatenated using the
|
character as a separator
The possible flags that can be used are: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS and UNIX_LINES.
For matching based on exact values the
include
and
exclude
parameters can simply take an array of
strings that represent the terms as they are found in the index:
{ "aggs" : { "JapaneseCars" : { "terms" : { "field" : "make", "include" : ["mazda", "honda"] "ActiveCarManufacturers" : { "terms" : { "field" : "make", "exclude" : ["rover", "jensen"] }
The
terms
aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the
terms
agg doesn’t collect the
string term values themselves, but rather uses
global ordinals
to produce a list of all of the unique values in the field. The use of global ordinals
results in an important performance boost which would not be possible across
multiple fields.
There are two approaches that you can use to perform a
terms
agg across
multiple fields:
copy_to
field
If you know ahead of time that you want to collect the terms from two or more
fields, then use
copy_to
in your mapping to create a new dedicated field at
index time which contains the values from both fields. You can aggregate on
this single field, which will benefit from the global ordinals optimization.
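A minimal sketch of such a mapping, assuming hypothetical first_name and last_name fields copied into a dedicated full_name field:
{
  "person" : {
    "properties" : {
      "first_name" : { "type" : "string", "copy_to" : "full_name" },
      "last_name"  : { "type" : "string", "copy_to" : "full_name" },
      "full_name"  : { "type" : "string" }
    }
  }
}
A terms aggregation on full_name then effectively collects the terms of both original fields while still benefiting from global ordinals.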
There are different mechanisms by which terms aggregations can be executed:
{ "aggs" : { "tags" : { "terms" : { "field" : "tags", "execution_hint": "map"[experimental] This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features. the possible values are
map
,global_ordinals
,global_ordinals_hash
andglobal_ordinals_low_cardinality
An aggregation that returns interesting or unusual occurrences of terms in a set.
Example using a parent aggregation for segmentation:
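The request for this example did not survive extraction; a minimal sketch of such a request, assuming hypothetical force and crime_type fields on crime documents:
{
  "aggregations": {
    "forces": {
      "terms": { "field": "force" },
      "aggregations": {
        "significant_crime_types": {
          "significant_terms": { "field": "crime_type" }
        }
      }
    }
  }
}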
Now we have anomaly detection for each of the police forces using a single request.
Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 ( http://arxiv.org/pdf/cs/0412098v3.pdf ) can be used as significance score by adding the parameter
"gnd": { }
gnd
also accepts the
background_is_superset
parameter.
It is hard to say which one of the different heuristics will be the best choice as it depends on what the significant terms are used for (see for example [Yang and Pedersen, "A Comparative Study on Feature Selection in Text Categorization", 1997]( http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf ) for a study on using significant terms for feature selection for text classification).
If none of the above measures suits your use case, then another option is to implement a custom significance measure:
Customized scores can be implemented via a script:
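The example referred to below is missing here; a minimal sketch of such a request, assuming a hypothetical crime_type field and an inline script built from the parameters listed below:
{
  "aggregations": {
    "significant_crime_types": {
      "significant_terms": {
        "field": "crime_type",
        "script_heuristic": {
          "script": "_subset_freq/(_superset_freq - _subset_freq + 1)"
        }
      }
    }
  }
}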
Scripts can be inline (as in above example), indexed or stored on disk. For details on the options, see script documentation . Parameters need to be set as follows:
script
Inline script, name of script file or name of indexed script. Mandatory.
script_type
One of "inline" (default), "indexed" or "file".
lang
Script language (default "groovy").
params
Script parameters (default empty).
Available parameters in the script are
_subset_freq
Number of documents the term appears in in the subset.
_superset_freq
Number of documents the term appears in in the superset.
_subset_size
Number of documents in the subset.
_superset_size
Number of documents in the superset.
If set to
0
, the
size
will be set to
Integer.MAX_VALUE
.
If set to
0
, the
shard_size
will be set to
Integer.MAX_VALUE
.
It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the
include
and
exclude
parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features
described in the
terms aggregation
documentation.
There are different mechanisms by which terms aggregations can be executed:
{ "aggs" : { "tags" : { "significant_terms" : { "field" : "tags", "execution_hint": "map"the possible values are
map
,global_ordinals
andglobal_ordinals_hash
Please note that Elasticsearch will ignore this execution hint if it is not applicable.
{ "aggs" : { "price_ranges" : { "range" : { "field" : "price", "ranges" : [ { "to" : 50 }, { "from" : 50, "to" : 100 }, { "from" : 100 } "aggs" : { "price_stats" : { "stats" : {}We don’t need to specify the
price
as we "inherit" it by default from the parentrange
aggregation IPv4 Range Aggregation
A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal
range
aggregation is that the
from
and
to
values can be expressed in
Date Math
expressions, and it is also possible to specify a date format by which the
from
and
to
response fields will be returned.
Note that this aggregation includes the
from
value and excludes the
to
value for each range.
Example:
{ "aggs": { "range": { "date_range": { "field": "date", "format": "MM-yyy", "ranges": [ { "to": "now-10M/M" },{ "from": "now-10M/M" }
< now minus 10 months, rounded down to the start of the month. >= now minus 10 months, rounded down to the start of the month.
All ASCII letters are reserved as format pattern letters, which are defined as follows:
Symbol | Meaning | Presentation | Examples |
---|---|---|---|
G | era | text | AD |
C | century of era (>=0) | number | 20 |
Y | year of era (>=0) | year | 1996 |
x | weekyear | year | 1996 |
w | week of weekyear | number | 27 |
e | day of week | number | 2 |
E | day of week | text | Tuesday; Tue |
y | year | year | 1996 |
D | day of year | number | 189 |
M | month of year | month | July; Jul; 07 |
d | day of month | number | 10 |
a | halfday of day | text | PM |
K | hour of halfday (0~11) | number | 0 |
h | clockhour of halfday (1~12) | number | 12 |
H | hour of day (0~23) | number | 0 |
k | clockhour of day (1~24) | number | 24 |
m | minute of hour | number | 30 |
s | second of minute | number | 55 |
S | fraction of second | number | 978 |
z | time zone | text | Pacific Standard Time; PST |
Z | time zone offset/id | zone | -0800; -08:00; America/Los_Angeles |
' | escape for text | delimiter | '' |
The count of pattern letters determines the format.
Any characters in the pattern that are not in the ranges of [ a .. z ] and [ A .. Z ] will be treated as quoted text. For instance, characters like : , . , ', # and ? will appear in the resulting time text even if they are not enclosed within single quotes.
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IPv4 typed fields:
Example:
{ "aggs" : { "ip_ranges" : { "ip_range" : { "field" : "ip", "ranges" : [ { "to" : "10.0.0.5" }, { "from" : "10.0.0.5" } }
Response:
{ "aggregations": { "ip_ranges": { "buckets" : [ "to": 167772165, "to_as_string": "10.0.0.5", "doc_count": 4 "from": 167772165, "from_as_string": "10.0.0.5", "doc_count": 6 }
IP ranges can also be defined as CIDR masks:
{ "aggs" : { "ip_ranges" : { "ip_range" : { "field" : "ip", "ranges" : [ { "mask" : "10.0.0.0/25" }, { "mask" : "10.0.0.127/25" } }
Response:
{ "aggregations": { "ip_ranges": { "buckets": [ "key": "10.0.0.0/25", "from": 1.6777216E+8, "from_as_string": "10.0.0.0", "to": 167772287, "to_as_string": "10.0.0.127", "doc_count": 127 "key": "10.0.0.127/25", "from": 1.6777216E+8, "from_as_string": "10.0.0.0", "to": 167772287, "to_as_string": "10.0.0.127", "doc_count": 127 Date Histogram Aggregation
Each value is rounded down to its closest bucket key (the value minus its remainder when divided by the interval), and the intervals themselves must be integers.
The following snippet "buckets" the products based on their
price
by interval of
50
:
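The request snippet itself is missing here; a minimal sketch of such a request, assuming a price field on the product documents:
{
  "aggs" : {
    "prices" : {
      "histogram" : {
        "field" : "price",
        "interval" : 50
      }
    }
  }
}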
And the following may be the response:
{ "aggregations": { "prices" : { "buckets": [ "key": 0, "doc_count": 2 "key": 50, "doc_count": 4 "key" : 100, "doc_count" : 0"key": 150, "doc_count": 3 No documents were found that belong in this bucket, yet it is still returned with zero
doc_count
.
To understand why, let’s look at an example:
Ordering the buckets by their key - descending:
Ordering the buckets by their
doc_count
- ascending:
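The request snippets for these two orderings are missing here; minimal sketches using the built-in _key and _count order keys for the prices histogram above:
{
  "aggs" : {
    "prices" : {
      "histogram" : {
        "field" : "price",
        "interval" : 50,
        "order" : { "_key" : "desc" }
      }
    }
  }
}
and by doc_count ascending:
{
  "aggs" : {
    "prices" : {
      "histogram" : {
        "field" : "price",
        "interval" : 50,
        "order" : { "_count" : "asc" }
      }
    }
  }
}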
{ "aggs" : { "prices" : { "histogram" : { "field" : "price", "interval" : 50, "order" : { "price_stats.min" : "asc" }"aggs" : { "price_stats" : { "stats" : {} }
The
{ "price_stats.min" : asc" }
will sort the buckets based onmin
value of theirprice_stats
sub-aggregation. There is no need to configure theprice
field for theprice_stats
aggregation as it will inherit it by default from its parent histogram aggregation.
The path must be defined in the following form:
AGG_SEPARATOR       :=  '>'
METRIC_SEPARATOR    :=  '.'
AGG_NAME            :=  <the name of the aggregation>
METRIC              :=  <the name of the metric (in case of multi-value metrics aggregation)>
PATH                :=  <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
{ "aggs" : { "prices" : { "histogram" : { "field" : "price", "interval" : 50, "order" : { "promoted_products>rating_stats.avg" : "desc" }"aggs" : { "promoted_products" : { "filter" : { "term" : { "promoted" : true }}, "aggs" : { "rating_stats" : { "stats" : { "field" : "rating" }} }
The above will sort the buckets based on the avg rating among the promoted products
{ "aggregations": { "prices": { "buckets": { "0": { "key": 0, "doc_count": 2 "50": { "key": 50, "doc_count": 4 "150": { "key": 150, "doc_count": 3 Geo Distance Aggregation
A multi-bucket aggregation similar to the
histogram
except it can
only be applied on date values. Since dates are represented in elasticsearch internally as long values, it is possible
to use the normal
histogram
on dates as well, though accuracy will be compromised. The reason for this is the fact
that time based intervals are not fixed (think of leap years and the number of days in a month). For this reason,
we need special support for time based data. From a functionality perspective, this histogram supports the same features
as the normal
histogram
. The main difference is that the interval can be specified by date/time expressions.
Requesting bucket intervals of a month.
{ "aggs" : { "articles_over_time" : { "date_histogram" : { "field" : "date", "interval" : "month" }
Available expressions for interval: year, quarter, month, week, day, hour, minute, second.
Fractional values are allowed for seconds, minutes, hours, days and weeks. For example 1.5 hours:
{ "aggs" : { "articles_over_time" : { "date_histogram" : { "field" : "date", "interval" : "1.5h" }
See the section called “Time units ” for accepted abbreviations.
The
offset
option accepts positive or negative time durations like "1h" for an hour or "1M" for a Month. See
the section called “Time units
” for more
possible time duration options.
{ "aggs" : { "articles_over_time" : { "date_histogram" : { "field" : "date", "interval" : "1M", "format" : "yyyy-MM-dd"Supports expressive date format pattern
Response:
{ "aggregations": { "articles_over_time": { "buckets": [ "key_as_string": "2013-02-02", "key": 1328140800000, "doc_count": 1 "key_as_string": "2013-03-02", "key": 1330646400000, "doc_count": 2 }
Like with the normal
histogram
, both document level scripts and
value level scripts are supported. It is also possible to control the order of the returned buckets using the
order
settings and filter the returned buckets based on a
min_doc_count
setting (by default all buckets with
min_doc_count > 0
will be returned). This histogram also supports the
extended_bounds
setting, which enables extending
the bounds of the histogram beyond the data itself (to read more on why you’d want to do that please refer to the
explanation
here
).
A multi-bucket aggregation that works on
geo_point
fields and conceptually works very similar to the
range
aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the bucket it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
{ "aggs" : { "rings_around_amsterdam" : { "geo_distance" : { "field" : "location", "origin" : "52.3760, 4.894", "ranges" : [ { "to" : 100 }, { "from" : 100, "to" : 300 }, { "from" : 300 } }
Response:
{ "aggregations": { "rings" : { "buckets": [ "key": "*-100.0", "from": 0, "to": 100.0, "doc_count": 3 "key": "100.0-300.0", "from": 100.0, "to": 300.0, "doc_count": 1 "key": "300.0-*", "from": 300.0, "doc_count": 7 }
The specified field must be of type
geo_point
(which can only be set explicitly in the mappings). And it can also hold an array of
geo_point
fields, in which case all will be taken into account during aggregation. The origin point can accept all formats supported by the
geo_point
type
:
{ "lat" : 52.3760, "lon" : 4.894 }
- this is the safest format as it is the most explicit about the
lat
&
lon
values
String format:
"52.3760, 4.894"
- where the first number is the
lat
and the second is the
lon
Array format:
[4.894, 52.3760]
- which is based on the
GeoJson
standard and where the first number is the
lon
and the second one is the
lat
By default, the distance unit is m (metres) but it can also accept: mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), mm (millimeters).
{ "aggs" : { "rings" : { "geo_distance" : { "field" : "location", "origin" : "52.3760, 4.894", "unit" : "mi","ranges" : [ { "to" : 100 }, { "from" : 100, "to" : 300 }, { "from" : 300 } The distances will be computed as miles
{ "aggs" : { "rings" : { "geo_distance" : { "field" : "location", "origin" : "52.3760, 4.894", "distance_type" : "plane", "ranges" : [ { "to" : 100 }, { "from" : 100, "to" : 300 }, { "from" : 300 } Facets
A multi-bucket aggregation that works on
geo_point
fields and groups points into buckets that represent cells in a grid.
The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a
geohash
which is of user-definable precision.
Geohashes used in this aggregation can have a choice of precision between 1 and 12.
The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes. Please see the example below on how to first filter the aggregation to a smaller geographic area before requesting high-levels of detail.
The specified field must be of type
geo_point
(which can only be set explicitly in the mappings) and it can also hold an array of
geo_point
fields, in which case all points will be taken into account during aggregation.
When requesting detailed buckets (typically for displaying a "zoomed in" map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
{ "aggregations" : { "zoomedInView" : { "filter" : { "geo_bounding_box" : { "location" : { "top_left" : "51.73, 0.9", "bottom_right" : "51.55, 1.1" "aggregations":{ "zoom1":{ "geohash_grid" : { "field":"location", "precision":8, }
GeoHash length | Area width x height |
---|---|
1 | 5,009.4km x 4,992.6km |
2 | 1,252.3km x 624.1km |
3 | 156.5km x 156km |
4 | 39.1km x 19.5km |
5 | 4.9km x 4.9km |
6 | 1.2km x 609.4m |
7 | 152.9m x 152.4m |
8 | 38.2m x 19m |
9 | 4.8m x 4.8m |
10 | 1.2m x 59.5cm |
11 | 14.9cm x 14.9cm |
12 | 3.7cm x 1.9cm |
field
Mandatory. The name of the field indexed with GeoPoints.
precision
Optional. The string length of the geohashes used to define
cells/buckets in the results. Defaults to 5.
size
Optional. The maximum number of geohash buckets to return
(defaults to 10,000). When results are trimmed, buckets are
prioritised based on the volumes of documents they contain.
A value of
0
will return all buckets that
contain a hit, use with caution as this could use a lot of CPU
and network bandwidth if there are many buckets.
shard_size
Optional. To allow for more accurate counting of the top cells
returned in the final result the aggregation defaults to
returning
max(10,(size x number-of-shards))
buckets from each
shard. If this heuristic is undesirable, the number considered
from each shard can be over-ridden using this parameter.
A value of
0
makes the shard size unlimited.
Terms Facet
The usual purpose of a full-text search engine is to return a small number of documents matching your query. Facets provide aggregated data based on a search query. In the simplest case, a terms facet can return facet counts for various facet values for a specific field . Elasticsearch supports more facet implementations, such as statistical or date histogram facets. The field used for facet calculations must be of type numeric, date/time or be analyzed as a single token — see the Mapping guide for details on the analysis process. You can give the facet a custom name and return multiple facets in one request.
Let’s try it out with a simple example. Suppose we have a number of
articles with a field called tags.
We will store some example data first:
curl -X DELETE "http://localhost:9200/articles"
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "One", "tags" : ["foo"]}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "Two", "tags" : ["foo", "bar"]}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "Three", "tags" : ["foo", "bar", "baz"]}'
Now, let’s query the index for articles beginning with the letter "T":
curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
  "query" : { "query_string" : {"query" : "T*"} },
  "facets" : {
    "tags" : { "terms" : {"field" : "tags"} }
  }
}
'
This request will return articles Two and Three, together with a terms facet on their tags:
"facets" : { "tags" : { "_type" : "terms", "missing" : 0, "total": 5, "other": 0, "terms" : [ { "term" : "foo", "count" : 2 "term" : "bar", "count" : 2 "term" : "baz", "count" : 1 }
In the output above, notice that the counts are scoped to the current query: foo is counted only twice (not three times), bar is counted twice and baz once. Also note that terms are counted once per document, even if they occur more frequently in that document. That’s because the primary purpose of facets is to enable faceted navigation , allowing the user to refine her query based on the insight from the facet, i.e. restrict the search to a specific category, price or date range. Facets can be used, however, for other purposes: computing histograms, statistical aggregations, and more. See the blog about data visualization for inspiration.
Scope
All facets can be configured with an additional filter (explained in the Query DSL section), which will reduce the documents they use for computing results. An example with a term filter:
{ "facets" : { "<FACET NAME>" : { "<FACET TYPE>" : { "facet_filter" : { "term" : { "user" : "kimchy"} }
Note that this is different from a facet of the filter type.
Nested mapping allows for better support for "inner" documents faceting, especially when it comes to multi valued key and value facets (like histograms, or term stats).
What is it good for? First of all, this is the only way to use facets on nested documents once they are used (possibly for other reasons). But there is also a facet specific reason why nested documents should be used: facets that work on different key and value fields (like terms_stats, or histogram) can now properly support cases where both fields are multi valued.
For example, let’s use the following mapping:
{ "type1" : { "properties" : { "obj1" : { "type" : "nested" }
And, here is a sample data:
{ "obj1" : [ "name" : "blue", "count" : 4 "name" : "green", "count" : 6 }
{ "query": { "match_all": {} "facets": { "facet1": { "terms_stats": { "key_field" : "name", "value_field": "count" "nested": "obj1", "facet_filter" : { "term" : {"name" : "blue"} Range Facets
The equivalent aggregation would be the
terms
aggregation.
Allow to specify field facets that return the N most frequent terms. For example:
{ "query" : { "match_all" : { } "facets" : { "tag" : { "terms" : { "field" : "tag", "size" : 10 }
It is preferred to have the terms facet executed on a non analyzed field, or on a field that analysis does not break into a large number of terms.
Check
Pattern API
for more details about
regex_flags
options.
The equivalent aggregation would be the
range
aggregation.
A range
facet allows you to specify a set of ranges and get both the number
of docs (count) that fall within each range, and aggregated data either
based on the field, or using another field. Here is a simple example:
{ "query" : { "match_all" : {} "facets" : { "range1" : { "range" : { "field" : "field_name", "ranges" : [ { "to" : 50 }, { "from" : 20, "to" : 70 }, { "from" : 70, "to" : 120 }, { "from" : 150 } }
Another option which is a bit more DSL enabled is to provide the ranges on the actual field name, for example:
{ "query" : { "match_all" : {} "facets" : { "range1" : { "range" : { "my_field" : [ { "to" : 50 }, { "from" : 20, "to" : 70 }, { "from" : 70, "to" : 120 }, { "from" : 150 } }
The
range
facet always includes the
from
parameter and excludes the
to
parameter for each range.
The equivalent aggregation would be the
histogram
aggregation.
The histogram facet works with numeric data by building a histogram across intervals of the field values. Each value is "rounded" into an interval (or placed in a bucket), and statistics are provided per interval/bucket (count and total). Here is a simple example:
{ "query" : { "match_all" : {} "facets" : { "histo1" : { "histogram" : { "field" : "field_name", "interval" : 100 }
The above example will run a histogram facet on the
field_name
field,
with an
interval
of
100
(so, for example, a value of
1055
will be
placed within the
1000
bucket).
The interval can also be provided as a time based interval (using the time format). This mainly makes sense when working on date fields or fields that represent absolute milliseconds. Here is an example:
{ "query" : { "match_all" : {} "facets" : { "histo1" : { "histogram" : { "field" : "field_name", "time_interval" : "1.5h" }
Sometimes, some munging of both the key and the value is needed: the key before it is rounded into a bucket, and the value when the statistical data is computed per bucket. Scripts can be used for both. Here is an example:
{ "query" : { "match_all" : {} "facets" : { "histo1" : { "histogram" : { "key_script" : "doc['date'].date.minuteOfHour", "value_script" : "doc['num1'].value" }
In the above sample, we can use a date type field called
date
to get
the minute of hour from it, and the total will be computed based on
another field
num1
. Note, in this case, no
interval
was provided, so
the bucket will be based directly on the
key_script
(no rounding).
Parameters can also be provided to the different scripts (preferable if the script is the same, with different values for a specific parameter, like "factor"):
{ "query" : { "match_all" : {} "facets" : { "histo1" : { "histogram" : { "key_script" : "doc['date'].date.minuteOfHour * factor1", "value_script" : "doc['num1'].value + factor2", "params" : { "factor1" : 2, "factor2" : 3 }
The equivalent aggregation would be the
date_histogram
aggregation.
A specific histogram facet that can work with
date
field types
enhancing it over the regular
histogram facet
. Here is a quick example:
{ "query" : { "match_all" : {} "facets" : { "histo1" : { "date_histogram" : { "field" : "field_name", "interval" : "day" }
The zone value accepts either a numeric value for the hours offset, for
example:
"time_zone" : -2
. It also accepts a format of hours and
minutes, like
"time_zone" : "-02:30"
. Another option is to provide a
time zone accepted as one of the values listed
here
.
Let’s take an example. For
2012-04-01T04:15:30Z
, with a
pre_zone
of
-08:00
. For
day
interval, the actual time by applying the time zone
and rounding falls under
2012-03-31
, so the returned value will be (in
millis) of
2012-03-31T00:00:00Z
(UTC). For
hour
interval, applying
the time zone results in
2012-03-31T20:15:30
, rounding it results in
2012-03-31T20:00:00
, but, we want to return it in UTC (
post_zone
is
not set), so we convert it back to UTC:
2012-04-01T04:00:00Z
. Note, we
are consistent in the results, returning the rounded value in UTC.
post_zone
simply takes the result, and adds the relevant offset.
Sometimes, we want to apply the same conversion to UTC we did above for
hour
also for
day
(and up) intervals. We can set
pre_zone_adjust_large_interval
to
true
, which will apply the same
conversion done for
hour
interval in the example, to
day
and above
intervals (it can be set regardless of the interval, but only kick in
when using
day
and higher intervals).
{ "query" : { "match_all" : {} "facets" : { "histo1" : { "date_histogram" : { "key_field" : "timestamp", "value_script" : "doc['price'].value * 2", "interval" : "day" Query Facets
The equivalent aggregation would be the
filter
aggregation.
A filter facet (not to be confused with a facet filter ) allows you to return a count of the hits matching the filter. The filter itself can be expressed using the Query DSL . For example:
{ "facets" : { "wow_facet" : { "filter" : { "term" : { "tag" : "wow" } }
Note, filter facet filters are faster than query facet when using native filters (non query wrapper ones).
There is no equivalent aggregation but you can use the
filter
aggregation and wrap
the query inside a
query filter
.
A facet query allows to return a count of the hits matching the facet query. The query itself can be expressed using the Query DSL. For example:
{ "facets" : { "wow_facet" : { "query" : { "term" : { "tag" : "wow" } Terms Stats Facet
The equivalent aggregation would be the
stats
aggregation.
The statistical facet allows you to compute statistical data on a numeric field. The statistical data include count, total, sum of squares, mean (average), minimum, maximum, variance, and standard deviation. Here is an example:
{ "query" : { "match_all" : {} "facets" : { "stat1" : { "statistical" : { "field" : "num1" }
When using
field
, the numeric value of the field is used to compute
the statistical information. Sometimes, several fields values represent
the statistics we want to compute, or some sort of mathematical
evaluation. The script field allows to define a
script
to evaluate, with
its value used to compute the statistical information. For example:
{ "query" : { "match_all" : {} "facets" : { "stat1" : { "statistical" : { "script" : "doc['num1'].value + doc['num2'].value" }
Parameters can also be provided to the different scripts (preferable if the script is the same, with different values for a specific parameter, like "factor"):
{ "query" : { "match_all" : {} "facets" : { "stat1" : { "statistical" : { "script" : "(doc['num1'].value + doc['num2'].value) * factor", "params" : { "factor" : 5 }
The
terms_stats
facet combines both the
terms
and
statistical
facets, allowing you to compute stats on a field, per term value driven by
another field. For example:
{ "query" : { "match_all" : { } "facets" : { "tag_price_stats" : { "terms_stats" : { "key_field" : "tag", "value_field" : "price" }
The
size
parameter controls how many facet entries will be returned.
It defaults to
10
. Setting it to 0 will return all terms matching the
hits (be careful not to return too many results).
One can also set
shard_size
(in addition to
size
) which will determine
how many term entries will be requested from each shard. When dealing
with a field with high cardinality (at least higher than the requested
size
), the greater
shard_size
is, the more accurate the result will be (and the
more expensive the overall facet computation will be).
shard_size
is there
to enable you to increase accuracy yet still avoid returning too many
terms_stats entries back to the client.
Ordering is done by setting order, with possible values of term, reverse_term, count, reverse_count, total, reverse_total, min, reverse_min, max, reverse_max, mean, reverse_mean. Defaults to count.
The value computed can also be a script, using the
value_script
instead of
value_field
, in which case the
lang
can control its
language, and
params
allow to provide custom parameters (as in other
scripted components).
Note, the terms stats can work with multi valued key fields, or multi valued value fields, but not when both are multi valued (as ordering is not maintained).
The equivalent aggregation would be the
geo_distance
aggregation.
The geo_distance facet is a facet providing information for ranges of distances from a provided geo_point including count of the number of hits that fall within each range, and aggregation information (like total).
Assuming the following sample doc:
{ "pin" : { "location" : { "lat" : 40.12, "lon" : -71.34 }
Here is an example that creates a
geo_distance
facet from a
pin.location
of 40,-70, and a set of ranges:
{ "query" : { "match_all" : {} "facets" : { "geo1" : { "geo_distance" : { "pin.location" : { "lat" : 40, "lon" : -70 "ranges" : [ { "to" : 10 }, { "from" : 10, "to" : 20 }, { "from" : 20, "to" : 100 }, { "from" : 100 } }
The location can also be provided in [lon, lat] format; note the order of lon/lat here, in order to conform with GeoJSON.
{ "query" : { "match_all" : {} "facets" : { "geo1" : { "geo_distance" : { "pin.location" : [40, -70], "ranges" : [ { "to" : 10 }, { "from" : 10, "to" : 20 }, { "from" : 20, "to" : 100 }, { "from" : 100 } }
Option | Description |
---|---|
unit | The unit the ranges are provided in. Defaults to km. |
distance_type | How to compute the distance. Can either be arc (default) or plane. |
Facets have been deprecated in favor of aggregations and as such it is recommended to migrate existing code using facets to aggregations.
It is recommended to read the documentation about aggregations before this section.
There is no
query
aggregation so such facets must be migrated to the
filter
aggregation.
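The original facet in this migration example did not survive extraction; a sketch of such a query facet (counting documents matching tag:wow, to match the aggregation below):
{
  "facets" : {
    "wow" : {
      "query" : {
        "query_string" : {
          "query" : "tag:wow"
        }
      }
    }
  }
}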
can be replaced with the following filter aggregation that uses the query filter :
{ "aggs" : { "wow" : { "filter" : { "query" : { "query_string" : { "query" : "tag:wow" }
There is no
term_stats
aggregation, so you actually need to create a
terms aggregation
that will
create buckets that will be processed with a
stats aggregation
.
For example
{ "facets" : { "tag_price_stats" : { "terms_stats" : { "key_field" : "tag", "value_field" : "price" }
can be replaced with
{ "aggs" : { "tags" : { "terms" : { "field" : "tag" "aggs" : { "price_stats" : { "stats" : { "field" : "price" }
The
histogram
,
date_histogram
,
range
and
geo_distance
facets have a
value_field
parameter that allows to compute statistics per bucket. With
aggregations this needs to be changed to a sub
stats aggregation
.
For example
{ "facets" : { "histo1" : { "date_histogram" : { "key_field" : "timestamp", "value_field" : "price", "interval" : "day" }
can be replaced with
{ "aggs" : { "histo1" : { "date_histogram" : { "field" : "timestamp", "interval" : "day" "aggs" : { "price_stats" : { "stats" : { "field" : "price" }
Facets allow to set a global scope by setting
global : true
in the facet
definition. With aggregations, you will need to put your aggregation under a
global aggregation
instead.
For example
{ "facets" : { "terms1" : { "terms" : { ... }, "global" : true }
can be replaced with
{ "aggs" : { "global_count" : { "global" : {}, "aggs" : { "terms1" : { "terms" : { ... } }
Facet filters can be replaced with a filter aggregation .
For example
{ "facets" : { "<FACET NAME>" : { "<FACET TYPE>" : { "facet_filter" : { "term" : { "user" : "mvg" } }
can be replaced with
{ "aggs" : { "filter1" : { "filter" : { "term" : { "user" : "mvg" } "aggs" : { "<AGG NAME>" : { "<AGG TYPE>" : { }
Aggregations have a dedicated nested aggregation to deal with nested objects.
For example
{ "facets" : { "facet1" : { "terms" : { "field" : "name" "nested" : "obj1" }
can be replaced with
{ "aggs" : { "agg1" : { "nested" : { "path" : "obj1" "aggs" : { "agg1": { "terms": { "field" : "obj1.name" }
Note how fields are identified with their full path instead of relative path.
Similarly, this more complex facet that combines
nested
and facet filters:
{ "facets" : { "facet1" : { "terms" : { "field" : "name" "nested" : "obj1", "facet_filter" : { "term" : { "color" : "blue" } }
can be replaced with the following aggregation, which puts a terms aggregation under a filter aggregation, and the filter aggregation under a nested aggregation:
{ "aggs" : { "nested_obj1" : { "nested" : { "path" : "obj1" "aggs" : { "color_filter" : { "filter" : { "term" : { "obj1.color" : "blue" } "aggs" : { "name_terms" : { "terms" : { "field" : "obj1.name" }
In short, this aggregation first moves from the root documents to their nested
documents following the path
obj1
. Then for each nested document, it filters
out those that are not blue, and for the remaining documents, it computes a
terms aggregation on the
name
field.
The
term
suggester suggests terms based on edit distance. The provided
suggest text is analyzed before terms are suggested. The suggested terms
are provided per analyzed suggest text token. The
term
suggester
doesn’t take into account the query that is part of the request.
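No request example survives in this part of the section; a minimal sketch of a term suggest request, assuming a hypothetical body field and the misspelled text "noble prize":
curl -XPOST 'localhost:9200/_suggest' -d '
{
  "my-suggestion" : {
    "text" : "noble prize",
    "term" : {
      "field" : "body"
    }
  }
}
'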
text
The suggest text. The suggest text is a required option that
needs to be set globally or per suggestion.
field
The field to fetch the candidate suggestions from. This is
an required option that either needs to be set globally or per
suggestion.
analyzer
The analyzer to analyse the suggest text with. Defaults
to the search analyzer of the suggest field.
size
The maximum corrections to be returned per suggest text
token.
sort
Defines how suggestions should be sorted per suggest text
term. Two possible values:
score
: Sort by score first, then document frequency and
then the term itself.
frequency
: Sort by document frequency first, then similarity
score and then the term itself.
suggest_mode
The suggest mode controls what suggestions are
included or controls for what suggest text terms, suggestions should be
suggested. Three possible values can be specified:
missing
: Only provide suggestions for suggest text terms that are
not in the index. This is the default.
popular
: Only suggest suggestions that occur in more docs than
the original suggest text term.
always
: Suggest any matching suggestions based on terms in the
suggest text.
lowercase_terms
Lower cases the suggest text terms after text analysis.
max_edits
The maximum edit distance candidate suggestions can
have in order to be considered as a suggestion. Can only be a value
between 1 and 2. Any other value results in a bad request error being
thrown. Defaults to 2.
prefix_length
The number of minimal prefix characters that must
match in order to be considered a candidate suggestion. Defaults to 1. Increasing
this number improves spellcheck performance. Usually misspellings don’t
occur in the beginning of terms. (Old name "prefix_len" is deprecated)
min_word_length
The minimum length a suggest text term must have in
order to be included. Defaults to 4. (Old name "min_word_len" is deprecated)
shard_size
Sets the maximum number of suggestions to be retrieved
from each individual shard. During the reduce phase only the top N
suggestions are returned based on the
size
option. Defaults to the
size
option. Setting this to a value higher than the
size
can be
useful in order to get a more accurate document frequency for spelling
corrections at the cost of performance. Due to the fact that terms are
partitioned amongst shards, the shard level document frequencies of
spelling corrections may not be precise. Increasing this will make these
document frequencies more precise.
max_inspections
A factor that is used to multiply with the
shards_size
in order to inspect more candidate spell corrections on
the shard level. Can improve accuracy at the cost of performance.
Defaults to 5.
min_doc_freq
The minimal threshold in number of documents a
suggestion should appear in. This can be specified as an absolute number
or as a relative percentage of number of documents. This can improve
quality by only suggesting high frequency terms. Defaults to 0f and is
not enabled. If a value higher than 1 is specified then the number
cannot be fractional. The shard level document frequencies are used for
this option.
max_term_freq
The maximum threshold in number of documents a
suggest text token can exist in order to be included. Can be a relative
percentage number (e.g 0.4) or an absolute number to represent document
frequencies. If a value higher than 1 is specified then the number cannot
be fractional. Defaults to 0.01f. This can be used to exclude high
frequency terms from being spellchecked. High frequency terms are
usually spelled correctly; excluding them also improves the spellcheck
performance. The shard level document frequencies are used for this
option.
Completion Suggester
The
curl -XPOST 'localhost:9200/_search' -d { "suggest" : { "text" : "Xor the Got-Jewel", "simple_phrase" : { "phrase" : { "field" : "bigram", "size" : 1, "direct_generator" : [ { "field" : "body", "suggest_mode" : "always", "min_word_length" : 1 "collate": { "query": { |
stupid_backoff
a simple backoff model that backs off to lower
order n-gram models if the higher order count is
0
and discounts the
lower order n-gram model by a constant factor. The default
discount
is
0.4
. Stupid Backoff is the default model.
laplace
a smoothing model that uses an additive smoothing where a
constant (typically
1.0
or smaller) is added to all counts to balance
weights. The default
alpha
is
0.5
.
linear_interpolation
a smoothing model that takes the weighted
mean of the unigrams, bigrams and trigrams based on user supplied
weights (lambdas). Linear Interpolation doesn’t have any default values.
All parameters (
trigram_lambda
,
bigram_lambda
,
unigram_lambda
)
must be supplied.
The direct generators support the following parameters:
field
The field to fetch the candidate suggestions from. This is
a required option that either needs to be set globally or per
suggestion.
size
The maximum corrections to be returned per suggest text token.
suggest_mode
The suggest mode controls what suggestions are
included or controls for what suggest text terms, suggestions should be
suggested. Three possible values can be specified:
missing
: Only suggest terms in the suggest text that aren’t in the
index. This is the default.
popular
: Only suggest suggestions that occur in more docs than the
original suggest text term.
always
: Suggest any matching suggestions based on terms in the
suggest text.
max_edits
The maximum edit distance candidate suggestions can have
in order to be considered as a suggestion. Can only be a value between 1
and 2. Any other value results in a bad request error being thrown.
Defaults to 2.
prefix_length
The number of minimal prefix characters that must
match in order to be considered a candidate suggestion. Defaults to 1. Increasing
this number improves spellcheck performance. Usually misspellings don’t
occur in the beginning of terms. (Old name "prefix_len" is deprecated)
min_word_length
The minimum length a suggest text term must have in
order to be included. Defaults to 4. (Old name "min_word_len" is deprecated)
max_inspections
A factor that is used to multiply with the
shards_size
in order to inspect more candidate spell corrections on
the shard level. Can improve accuracy at the cost of performance.
Defaults to 5.
min_doc_freq
The minimal threshold in number of documents a
suggestion should appear in. This can be specified as an absolute number
or as a relative percentage of number of documents. This can improve
quality by only suggesting high frequency terms. Defaults to 0f and is
not enabled. If a value higher than 1 is specified then the number
cannot be fractional. The shard level document frequencies are used for
this option.
max_term_freq
The maximum threshold in number of documents a
suggest text token can exist in order to be included. Can be a relative
percentage number (e.g 0.4) or an absolute number to represent document
frequencies. If a value higher than 1 is specified then the number cannot
be fractional. Defaults to 0.01f. This can be used to exclude high
frequency terms from being spellchecked. High frequency terms are
usually spelled correctly; excluding them also improves the spellcheck
performance. The shard level document frequencies are used for this
option.
pre_filter
a filter (analyzer) that is applied to each of the
tokens passed to this candidate generator. This filter is applied to the
original token before candidates are generated.
post_filter
a filter (analyzer) that is applied to each of the
generated tokens before they are passed to the actual phrase scorer.
The
completion
suggester is a so-called prefix suggester. It does not
do spell correction like the
term
or
phrase
suggesters but allows
basic
auto-complete
functionality.
The following parameters are supported:
The suggest data structure might not reflect deletes on
documents immediately. You may need to do an
Optimize
for that.
You can call optimize with the
only_expunge_deletes=true
to only target
deletions for merging. By default
only_expunge_deletes=true
will only select
segments for merge where the percentage of deleted documents is greater than
10%
of
the number of document in that segment. To adjust this
index.merge.policy.expunge_deletes_allowed
can
be updated to a value between
[0..100]
. Please remember even with this option set, optimize
is considered a extremely heavy operation and should be called rarely.
Suggesting works as usual, except that you have to specify the suggest
type as
completion
.
The basic completion suggester query supports the following two parameters:
The completion suggester considers all documents in the index. See Context Suggester for an explanation of how to query a subset of documents instead.
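The two parameters themselves did not survive extraction here; a minimal sketch of a basic completion suggest request (field and size are the usual parameters), reusing the suggest_field mapping shown later in this section:
curl -XPOST 'localhost:9200/services/_suggest' -d '
{
  "service-suggest" : {
    "text" : "m",
    "completion" : {
      "field" : "suggest_field",
      "size" : 10
    }
  }
}
'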
The fuzzy query can take specific fuzzy parameters. The following parameters are supported:
fuzziness
The fuzziness factor, defaults to
AUTO
.
See
the section called “Fuzziness
” for allowed settings.
transpositions
Sets if transpositions should be counted
as one or two changes, defaults to
true
min_length
Minimum length of the input before fuzzy
suggestions are returned, defaults
3
prefix_length
Minimum length of the input, which is not
checked for fuzzy alternatives, defaults to
1
unicode_aware
If true, all measurements (like edit distance,
transpositions and lengths) are measured in Unicode code points
(actual letters) instead of bytes.
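A minimal sketch of a fuzzy completion request, again reusing the suggest_field mapping from this section:
curl -XPOST 'localhost:9200/services/_suggest' -d '
{
  "service-suggest" : {
    "text" : "serv",
    "completion" : {
      "field" : "suggest_field",
      "fuzzy" : {
        "fuzziness" : 2
      }
    }
  }
}
'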
PUT services/_mapping/service
{
  "service": {
    "properties": {
      "name": {
        "type" : "string"
      },
      "tag": {
        "type" : "string"
      },
      "suggest_field": {
        "type": "completion",
        "context": {
          "color": {
            "type": "category",
            "path": "color_field",
            "default": ["red", "green", "blue"]
          },
          "location": {
            "type": "geo",
            "precision": "5m",
            "neighbors": true,
            "default": "u33"
          }
        }
      }
    }
  }
}
See the section called “Category Context”
See the section called “Geo location Context”
However contexts are specified (as type
category
or
geo
, which are discussed below), each
context value generates a new sub-set of documents which can be queried by the completion
suggester. Both context types accept a
default
parameter which provides a default value to use
if the corresponding context value is absent.
The basic structure of this element is that each field forms a new context and the fieldname
is used to reference this context information later on during indexing or querying. All context
mappings have the
default
and the
type
option in common. The value of the
default
field
is used whenever no specific value is provided for the given context. Note that a context is
defined by at least one value. The
type
option defines the kind of information held by this
context. These types are explained further in the following sections.
or as reference to another field within the documents indexed:
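The snippet this refers to is missing; a minimal sketch of a category context that takes its values from another field of the indexed documents, reusing the color_field reference from the mapping above:
"context": {
  "color": {
    "type": "category",
    "path": "color_field",
    "default": ["red", "green", "blue"]
  }
}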
The mapping for a geo context accepts four settings, of which only
precision
is required:
precision
This defines the precision of the geohash and can be specified as
5m
,
10km
,
or as a raw geohash precision:
1
..
12
. It’s also possible to setup multiple
precisions by defining a list of precisions:
["5m", "10km"]
neighbors
Geohashes are rectangles, so a geolocation, which in reality is only 1 metre
away from the specified point, may fall into the neighbouring rectangle. Set
neighbors
to
true
to include the neighbouring geohashes in the context.
(default is
on
)
path
Optionally specify a field to use to look up the geopoint.
default
The geopoint to use if no geopoint has been specified.
POST services/_suggest
{
  "suggest" : {
    "text" : "m",
    "completion" : {
      "field" : "suggest_field",
      "size": 10,
      "context": {
        "location": {
          "lat": 0,
          "lon": 0,
          "precision": "1km"
        }
      }
    }
  }
}
Count API
The count API allows to easily execute a query and get the number of matches for that query. It can be executed across one or more indices and across one or more types. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=user:kimchy'

$ curl -XGET 'http://localhost:9200/twitter/tweet/_count' -d '
{
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}'
The query being sent in the body must be nested in a
query
key, same as how the
search api
works.
Both examples above do the same thing, which is count the number of tweets from the twitter index for a certain user. The result is:
{ "count" : 1, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }
The query is optional, and when not provided, it will use
match_all
to
count all the docs.
The count API can be applied to multiple types in multiple indices .
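A minimal sketch of that, assuming hypothetical index and type names:
$ curl -XGET 'http://localhost:9200/index1,index2/type1,type2/_count?q=user:kimchy'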
Name | Description |
---|---|
df | The default field to use when no field prefix is defined within the query. |
analyzer | The analyzer name to be used when analyzing the query string. |
default_operator | The default operator to be used, can be AND or OR. Defaults to OR. |
lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
lowercase_expanded_terms | Should terms be automatically lowercased or not. Defaults to true. |
analyze_wildcard | Should wildcard and prefix queries be analyzed or not. Defaults to false. |
terminate_after | [ experimental ] The API for this feature may change in the future. The maximum count for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early indicating whether the query execution has actually terminated early. |
The count can use the
Query DSL
within
its body in order to express the query that should be executed. The body
content can also be passed as a REST parameter named
source
.
Both HTTP GET and HTTP POST can be used to execute count with body. Since not all clients support GET with body, POST is allowed as well.
The exists API allows to easily determine if any matching documents exist for a provided query. It can be executed across one or more indices and across one or more types. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search/exists?q=user:kimchy'

$ curl -XGET 'http://localhost:9200/twitter/tweet/_search/exists' -d '
{
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}'
The query being sent in the body must be nested in a
query
key, same as
how the
search api
works.
Both the examples above do the same thing, which is determine the existence of tweets from the twitter index for a certain user. The response body will be of the following format:
{ "exists" : true }
The exists API can be applied to multiple types in multiple indices .
Name | Description |
---|---|
df | The default field to use when no field prefix is defined within the query. |
analyzer | The analyzer name to be used when analyzing the query string. |
default_operator | The default operator to be used, can be AND or OR. Defaults to OR. |
lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
lowercase_expanded_terms | Should terms be automatically lowercased or not. Defaults to true. |
analyze_wildcard | Should wildcard and prefix queries be analyzed or not. Defaults to false. |
The exists API can use the
Query DSL
within
its body in order to express the query that should be executed. The body
content can also be passed as a REST parameter named
source
.
HTTP GET and HTTP POST can be used to execute exists with body. Since not all clients support GET with body, POST is allowed as well.
When the query is valid, the response contains
valid:true
:
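The example response did not survive extraction; a minimal sketch of a valid response, mirroring the invalid-query response shown further below:
{"valid":true,"_shards":{"total":1,"successful":1,"failed":0}}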
Name | Description |
---|---|
df | The default field to use when no field prefix is defined within the query. |
analyzer | The analyzer name to be used when analyzing the query string. |
default_operator | The default operator to be used, can be AND or OR. Defaults to OR. |
lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
lowercase_expanded_terms | Should terms be automatically lowercased or not. Defaults to true. |
analyze_wildcard | Should wildcard and prefix queries be analyzed or not. Defaults to false. |
If the query is invalid,
valid
will be
false
. Here the query is
invalid because Elasticsearch knows the post_date field should be a date
due to dynamic mapping, and
foo
does not correctly parse into a date:
curl -XGET 'http://localhost:9200/twitter/tweet/_validate/query?q=post_date:foo'
{"valid":false,"_shards":{"total":1,"successful":1,"failed":0}}
An
explain
parameter can be specified to get more detailed information
about why a query failed:
curl -XGET 'http://localhost:9200/twitter/tweet/_validate/query?q=post_date:foo&pretty=true&explain=true'
{
  "valid" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [ {
    "index" : "twitter",
    "valid" : false,
    "error" : "org.elasticsearch.index.query.QueryParsingException: [twitter] Failed to parse; org.elasticsearch.ElasticsearchParseException: failed to parse date field [foo], tried both date format [dateOptionalTime], and timestamp number; java.lang.IllegalArgumentException: Invalid format: \"foo\""
  } ]
}
[1.6.0] Added in 1.6.0.
When the query is valid, the explanation defaults to the string representation of that query. With rewrite set to true, the explanation is more detailed, showing the actual Lucene query that will be executed.
For Fuzzy Queries:
curl -XGET 'http://localhost:9200/imdb/movies/_validate/query?rewrite=true' -d '
{
  "query": {
    "fuzzy": {
      "actors": "kyle"
    }
  }
}'
Response:
{ "valid": true, "_shards": { "total": 1, "successful": 1, "failed": 0 "explanations": [ "index": "imdb", "valid": true, "explanation": "filtered(plot:kyle plot:kylie^0.75 plot:kyne^0.75 plot:lyle^0.75 plot:pyle^0.75)->cache(_type:movies)" }For More Like This:
curl -XGET 'http://localhost:9200/imdb/movies/_validate/query?rewrite=true' -d '
{
  "query": {
    "more_like_this": {
      "like": {
        "_id": "88247"
      },
      "boost_terms": 1
    }
  }
}'
Response:
{ "valid": true, "_shards": { "total": 1, "successful": 1, "failed": 0 "explanations": [ "index": "imdb", "valid": true, "explanation": "filtered(((title:terminator^3.71334 plot:future^2.763601 plot:human^2.8415773 plot:sarah^3.4193945 plot:kyle^3.8244398 plot:cyborg^3.9177752 plot:connor^4.040236 plot:reese^4.7133346 ... )~6) -ConstantScore(_uid:movies#88247))->cache(_type:movies)" }The request is executed on a single shard only, which is randomly selected. The detailed explanation of the query may depend on which shard is being hit, and therefore may vary from one request to another.
The
index
and
type
parameters expect a single index and a single
type respectively.
This will yield the following result:
This will yield the same result as the previous request.
_source
Set to
true
to retrieve the
_source
of the document explained. You can also
retrieve part of the document by using
_source_include
&
_source_exclude
(see
Get API
for more details)
fields
Allows to control which stored fields to return as part of the
document explained.
routing
Controls the routing in the case the routing was used
during indexing.
parent
Same effect as setting the routing parameter.
preference
Controls on which shard the explain is executed.
source
Allows the data of the request to be put in the query
string of the url.
q
The query string (maps to the query_string query).
df
The default field to use when no field prefix is defined within
the query. Defaults to _all field.
analyzer
The analyzer name to be used when analyzing the query
string. Defaults to the analyzer of the _all field.
analyze_wildcard
Should wildcard and prefix queries be analyzed or
not. Defaults to false.
lowercase_expanded_terms
Should terms be automatically lowercased
or not. Defaults to true.
lenient
If set to true will cause format based failures (like
providing text to a numeric field) to be ignored. Defaults to false.
default_operator
The default operator to be used, can be AND or
OR. Defaults to OR.
More Like This API
Sample Usage
Create an index with a mapping for the field
Register a query in the percolator:
Because
.percolate
is a type it also has a mapping. By default the following mapping is active:
If needed, this mapping can be modified with the update mapping API.
Percolating an Existing Document:
The response is the same as with the regular percolate API.
The
index
and
type
defined in the url path are the default index and type.
The
inner_hits
feature on the
nested
query isn’t supported in the percolate api.
The API simply results in executing a search request with
moreLikeThis
query (http
parameters match the parameters to the
more_like_this
query). This
means that the body of the request can optionally include all the
request body options in the
search API
(aggs, from/to and so on). Internally, the more like this
API is equivalent to performing a boolean query of
more_like_this_field
queries, with one query per specified
mlt_fields
.
Rest parameters relating to search are also allowed, including
search_type
,
search_indices
,
search_types
,
search_scroll
,
search_size
and
search_from
.
When no
mlt_fields
are specified, all the fields of the document will
be used in the
more_like_this
query generated.
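A minimal sketch of such a call against the _mlt endpoint, assuming a hypothetical twitter index; mlt_fields limits which fields are used:
$ curl -XGET 'http://localhost:9200/twitter/tweet/1/_mlt?mlt_fields=tag,content&min_doc_freq=1'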
By default, the queried document is excluded from the response (
include
set to false).
Note: In order to use the mlt feature, a mlt_field needs to either be stored, store term_vector, or source needs to be enabled.
The field stats api by defaults executes on all indices, but can execute on specific indices too.
fields
A list of fields to compute stats for.
level
Defines if field stats should be returned on a per index level or on a cluster wide level. Valid values are indices and cluster (default).
max_doc
The total number of documents.
doc_count
The number of documents that have at least one term for this field, or -1 if
this measurement isn’t available on one or more shards.
density
The percentage of documents that have at least one value for this field. This
is a derived statistic and is based on the
max_doc
and
doc_count
.
sum_doc_freq
The sum of each term’s document frequency in this field, or -1 if this
measurement isn’t available on one or more shards.
Document frequency is the number of documents containing a particular term.
sum_total_term_freq
The sum of the term frequencies of all terms in this field across all
documents, or -1 if this measurement isn’t available on one or more shards.
Term frequency is the total number of occurrences of a term in a particular
document and field.
min_value
The lowest value in the field represented in a displayable form.
max_value
The highest value in the field represented in a displayable form.
Example 1. Cluster level field statistics
GET /_field_stats?fields=rating,answer_count,creation_date,display_name
{ "_shards": { "total": 1, "successful": 1, "failed": 0 "indices": { "_all": {"fields": { "creation_date": { "max_doc": 1326564, "doc_count": 564633, "density": 42, "sum_doc_freq": 2258532, "sum_total_term_freq": -1, "min_value": "2008-08-01T16:37:51.513Z", "max_value": "2013-06-02T03:23:11.593Z" "display_name": { "max_doc": 1326564, "doc_count": 126741, "density": 9, "sum_doc_freq": 166535, "sum_total_term_freq": 166616, "min_value": "0", "max_value": "정혜선" "answer_count": { "max_doc": 1326564, "doc_count": 139885, "density": 10, "sum_doc_freq": 559540, "sum_total_term_freq": -1, "min_value": 0, "max_value": 160 "rating": { "max_doc": 1326564, "doc_count": 437892, "density": 33, "sum_doc_freq": 1751568, "sum_total_term_freq": -1, "min_value": -14, "max_value": 1277 The
_all
key indicates that it contains the field stats of all indices in the cluster.
Example 2. Indices level field statistics
GET /_field_stats?fields=rating,answer_count,creation_date,display_name&level=indices
{ "_shards": { "total": 1, "successful": 1, "failed": 0 "indices": { "stack": {"fields": { "creation_date": { "max_doc": 1326564, "doc_count": 564633, "density": 42, "sum_doc_freq": 2258532, "sum_total_term_freq": -1, "min_value": "2008-08-01T16:37:51.513Z", "max_value": "2013-06-02T03:23:11.593Z" "display_name": { "max_doc": 1326564, "doc_count": 126741, "density": 9, "sum_doc_freq": 166535, "sum_total_term_freq": 166616, "min_value": "0", "max_value": "정혜선" "answer_count": { "max_doc": 1326564, "doc_count": 139885, "density": 10, "sum_doc_freq": 559540, "sum_total_term_freq": -1, "min_value": 0, "max_value": 160 "rating": { "max_doc": 1326564, "doc_count": 437892, "density": 33, "sum_doc_freq": 1751568, "sum_total_term_freq": -1, "min_value": -14, "max_value": 1277 The
stack
key means it contains all field stats for thestack
index.
Each index created can have specific settings associated with it.
$ curl -XPUT 'http://localhost:9200/twitter/'

$ curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
    number_of_shards : 3
    number_of_replicas : 2
'

Default for number_of_shards is 5. Default for number_of_replicas is 1 (ie one replica for each primary shard).
The above second curl example shows how an index called twitter can be created with specific settings for it using YAML. In this case, creating an index with 3 shards, each with 2 replicas. The index settings can also be defined with JSON:

$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}'

or more simplified

$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "number_of_shards" : 3,
        "number_of_replicas" : 2
    }
}'

You do not have to explicitly specify the index section inside the settings section.
For more information regarding all the different index level settings that can be set when creating an index, please check the index modules section.
The create index API allows to provide a set of one or more mappings:
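A minimal sketch of such a request (the test index, type1 type, and field1 field are illustrative names, not taken from this document):

curl -XPOST localhost:9200/test -d '{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "type1" : {
            "properties" : {
                "field1" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'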
The create index API allows also to provide a set of warmers :
curl -XPUT localhost:9200/test -d '{
    "warmers" : {
        "warmer_1" : {
            "source" : {
                "query" : {
                    ...
                }
            }
        }
    }
}'
The create index API allows also to provide a set of aliases :
curl -XPUT localhost:9200/test -d '{
    "aliases" : {
        "alias_1" : {},
        "alias_2" : {
            "filter" : {
                "term" : { "user" : "kimchy" }
            },
            "routing" : "kimchy"
        }
    }
}'
curl -XPUT localhost:9200/test -d '{
    "creation_date" : 1407751337000
}'

creation_date is set using epoch time in milliseconds.

Get Index
The get index API allows to retrieve information about one or more indexes.
The above command will only return the settings and mappings for the index called twitter. The available features are _settings, _mappings, _warmers and _aliases.
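For example, a request along these lines (twitter is a placeholder index name) retrieves just two of those features:

curl -XGET 'http://localhost:9200/twitter/_settings,_mappings'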
The put mapping API allows to register specific mapping definition for a specific type.
More information on how to define type mappings can be found in the mapping section.
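A hedged sketch of registering a mapping for a type (the index, type, and field names are illustrative):

curl -XPUT 'http://localhost:9200/twitter/_mapping/tweet' -d '{
    "tweet" : {
        "properties" : {
            "message" : { "type" : "string", "store" : true }
        }
    }
}'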
{index}
blank | * | _all | glob pattern | name1, name2, …
{type}
Name of the type to add. Must be the name of the type defined in the body.
The get mapping API allows to retrieve mapping definitions for an index or index/type.
If you want to get mappings of all indices and types then the following two examples are equivalent:
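These would typically be (sketch):

curl -XGET 'http://localhost:9200/_all/_mapping'
curl -XGET 'http://localhost:9200/_mapping'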
The get field mapping API allows you to retrieve mapping definitions for one or more fields. This is useful when you do not need the complete type mapping returned by the Get Mapping API.
The following returns the mapping of the field text only:
curl -XGET 'http://localhost:9200/twitter/_mapping/tweet/field/text'
For which the response is (assuming text is a default string field):

{
   "twitter": {
      "tweet": {
         "text": {
            "full_name": "text",
            "mapping": {
               "text": { "type": "string" }
            }
         }
      }
   }
}
Full names
The full path, including any parent object name the field is part of (ex. user.id).
Index names
The name of the lucene field (can be different than the field name if the index_name option of the mapping is used).
Field names
The name of the field without the path to it (ex. id for { "user" : { "id" : 1 } }).
For example, consider the following mapping:
include_defaults
Adding include_defaults=true to the query string will cause the response to include default values, which are normally suppressed.
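For instance (a sketch, reusing the illustrative twitter index and tweet type from above):

curl -XGET 'http://localhost:9200/twitter/_mapping/tweet/field/text?include_defaults=true'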
Delete Mapping

Allow to delete a mapping (type) along with its data. The REST endpoints are:

Index Aliases

Here is a sample of associating an alias with an index:

An alias can also be removed, for example:

Associating an alias with more than one index is simply several add actions in a single request.
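Alias changes go through the _aliases endpoint; a rough sketch of an add and a remove action (index and alias names are illustrative):

curl -XPOST 'http://localhost:9200/_aliases' -d '{
    "actions" : [
        { "add"    : { "index" : "test1", "alias" : "alias1" } },
        { "remove" : { "index" : "test1", "alias" : "alias1" } }
    ]
}'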
It is an error to index to an alias which points to more than one index.

Filtered Aliases

To create a filtered alias, first we need to ensure that the fields already exist in the mapping:
Now we can create an alias that uses a filter on one of those fields:

Routing

It’s also possible to specify different routing values for searching and indexing operations:

Add a single alias

An alias can also be added with the endpoint PUT /{index}/_alias/{name}.

You can also use the plural _aliases.

Examples:

Aliases during index creation

Aliases can also be specified during index creation:

curl -XPUT localhost:9200/logs_20142801 -d '{
    "mappings" : {
        "type" : {
            "properties" : {
                "year" : { "type" : "integer" }
            }
        }
    },
    "aliases" : {
        "current_day" : {},
        "2014" : {
            "filter" : {
                "term" : { "year" : 2014 }
            }
        }
    }
}'

Delete aliases
The rest endpoint is: DELETE /{index}/_alias/{name}

Alternatively you can use the plural _aliases.

Retrieving existing aliases

The rest endpoint is: GET /{index}/_alias/{alias}

For future versions of Elasticsearch, the default Multiple Indices options will error if a requested index is unavailable. This is to bring this API in line with the other indices GET APIs.

Examples:

All aliases for the index users:

All aliases with the name 2013 in any index:

Change specific index level settings in real time. Below is the list of settings that can be changed using the update settings API:

Deprecated in 1.5.0. As of 2.0, index.fail_on_merge_failure is removed and the engine will always fail on an unexpected merge exception.
index.translog.flush_threshold_ops
When to flush based on operations.
index.translog.flush_threshold_size
When to flush based on translog (bytes) size.
index.translog.flush_threshold_period
When to flush based on a period of not flushing.
index.translog.disable_flush
Disables flushing. Note, should be set for a short
interval and then enabled.
index.cache.filter.max_size
The maximum size of filter cache (per segment in shard).
Set to
-1
to disable.
index.cache.filter.expire
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
The expire after access time for filter cache.
Set to
-1
to disable.
index.gateway.snapshot_interval
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
The gateway snapshot interval (only applies to shared
gateways). Defaults to 10s.
merge policy
All the settings for the merge policy currently configured.
A different merge policy can’t be set.
index.routing.allocation.include.*
A node matching any rule will be allowed to host shards from the index.
index.routing.allocation.exclude.*
A node matching any rule will NOT be allowed to host shards from the index.
index.routing.allocation.require.*
Only nodes matching all rules will be allowed to host shards from the index.
index.routing.allocation.disable_allocation
Disable allocation. Defaults to false. Deprecated in favour of index.routing.allocation.enable.
index.routing.allocation.disable_new_allocation
Disable new allocation. Defaults to false. Deprecated in favour of index.routing.allocation.enable.
index.routing.allocation.disable_replica_allocation
Disable replica allocation. Defaults to false. Deprecated in favour of index.routing.allocation.enable.
index.routing.allocation.enable
Enables shard allocation for a specific index. It can be set to:
all
(default) - Allows shard allocation for all shards.
primaries
- Allows shard allocation only for primary shards.
new_primaries
- Allows shard allocation only for primary shards for new indices.
none
- No shard allocation is allowed.
index.routing.allocation.total_shards_per_node
Controls the total number of shards (replicas and primaries) allowed to be allocated on a single node. Defaults to unbounded (
-1
).
index.recovery.initial_shards
When using local gateway a particular shard is recovered only if there can be allocated quorum shards in the cluster. It can be set to:
quorum
(default)
quorum-1
(or
half
)
full-1
.
Number values are also supported, e.g.
1
.
index.gc_deletes
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
index.ttl.disable_purge
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
Disables temporarily the purge of expired docs.
store level throttling
All the settings for the store level throttling policy currently configured.
index.translog.fs.type
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
Either
simple
or
buffered
(default).
index.compound_format
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
See
index.compound_format
in
the section called “Index Settings
”.
index.compound_on_flush
[
experimental
]
This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
See
`index.compound_on_flush
in
the section called “Index Settings
”.
Index Slow Log
All the settings for slow log.
index.warmer.enabled
See
Warmers
. Defaults to
true
.
Bulk Indexing Usage

Then, once bulk indexing is done, the settings can be updated (back to the defaults for example):

And, an optimize should be called:

Updating Index Analysis

It is also possible to define new analyzers for the index. But it is required to close the index first and open it after the changes are made.

For example, adding a content analyzer to myindex:

curl -XPOST 'localhost:9200/myindex/_close'

curl -XPUT 'localhost:9200/myindex/_settings' -d '{
    "analysis" : {
        "analyzer" : {
            "content" : {
                "type" : "custom",
                "tokenizer" : "whitespace"
            }
        }
    }
}'

curl -XPOST 'localhost:9200/myindex/_open'

Get Settings

The get settings API allows to retrieve settings of index/indices:

Multiple Indices and Types

Filtering settings by name

The settings that are returned can be filtered with wildcard matching as follows:

Analyze

Performs the analysis process on a text and returns the tokens breakdown of the text. Can be used without specifying an index against one of the many built in analyzers:

It can also run against a specific index:

Also, the analyzer can be derived based on a field mapping, for example:

Also, the text can be provided as part of the request body, and not as a parameter.

Index Templates

It is also possible to include aliases in an index template as follows:

curl -XPUT localhost:9200/_template/template_1 -d '
{
    "template" : "te*",
    "settings" : {
        "number_of_shards" : 1
    },
    "aliases" : {
        "alias1" : {},
        "alias2" : {
            "filter" : {
                "term" : { "user" : "kimchy" }
            },
            "routing" : "kimchy"
        },
        "{index}-alias" : {}
    }
}
'
Index templates are identified by a name (in the above case template_1) and can be deleted as well:
You can also match several templates by using wildcards like:
To get list of all index templates you can run:
Used to check if the template exists or not. For example:
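Hedged sketches of these template operations, reusing the illustrative template_1 name and te* pattern from above:

curl -XDELETE localhost:9200/_template/template_1

curl -XDELETE localhost:9200/_template/temp*

curl -XGET localhost:9200/_template/

curl -XHEAD -i localhost:9200/_template/template_1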
Please note that templates added this way will not appear in the /_template/* API request.
Warmers can be registered when an index gets created, for example:
And an example that registers a warmup against specific types:
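A hedged sketch of both cases (the test index, type1 type, and warmer_1 name are illustrative, and the match_all query is only a placeholder warm-up search):

curl -XPUT localhost:9200/test -d '{
    "warmers" : {
        "warmer_1" : {
            "types" : [],
            "source" : {
                "query" : {
                    "match_all" : {}
                }
            }
        }
    }
}'

curl -XPUT localhost:9200/test/type1/_warmer/warmer_1 -d '{
    "query" : {
        "match_all" : {}
    }
}'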
{index}
* | _all | glob pattern | name1, name2, …
{type}
* | _all | glob pattern | name1, name2, …
Instead of
_warmer
you can also use the plural
_warmers
.
Warmers can be deleted using the following endpoint:
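For instance (a sketch, reusing the illustrative names above):

curl -XDELETE localhost:9200/test/_warmer/warmer_1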
{index}
* | _all | glob pattern | name1, name2, …
{name}
* | _all | glob pattern | name1, name2, …
Instead of
_warmer
you can also use the plural
_warmers
.
The indices status API allows to get a comprehensive status information of one or more indices.
The following returns high level aggregation and index level stats for all indices:
Specific index stats can be retrieved using:
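Roughly (index names are illustrative):

curl localhost:9200/_stats

curl localhost:9200/index1,index2/_stats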
The number of docs / deleted docs (docs not yet merged out).
Note, affected by refreshing the index.
store
The size of the index.
indexing
Indexing statistics, can be combined with a comma
separated list of
types
to provide document type level stats.
Get statistics, including missing stats.
search
Search statistics. You can include statistics for custom groups by adding
an extra
groups
parameter (search operations can be associated with one or more
groups). The
groups
parameter accepts a comma separated list of group names.
Use
_all
to return statistics for all groups.
completion
Completion suggest statistics.
fielddata
Fielddata statistics.
flush
Flush statistics.
merge
Merge statistics.
query_cache
Shard query cache
statistics.
refresh
Refresh statistics.
suggest
Suggest statistics.
warmer
Warmer statistics.
translog
Translog statistics.
fields
List of fields to be included in the statistics. This is used as the
default list unless a more specific field list is provided (see below).
completion_fields
List of fields to be included in the Completion Suggest statistics.
fielddata_fields
List of fields to be included in the Fielddata statistics.
In order to get back shard level stats, set the level parameter to shards.
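For example (sketch):

curl 'localhost:9200/_stats?level=shards'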
Endpoints include segments for a specific index, several indices, or all:
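These would typically look like the following (index names are illustrative):

curl -XGET 'http://localhost:9200/test/_segments'
curl -XGET 'http://localhost:9200/test1,test2/_segments'
curl -XGET 'http://localhost:9200/_segments'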
{ "_3": { "generation": 3, "num_docs": 1121, "deleted_docs": 53, "size_in_bytes": 228288, "memory_in_bytes": 3211, "committed": true, "search": true, "version": "4.6", "compound": true The key of the JSON document is the name of the segment. This name is used to generate file names: all files starting with this segment name in the directory of the shard belong to this segment. generation A generation number that is basically incremented when needing to write a new segment. The segment name is derived from this generation number. num_docs The number of non-deleted documents that are stored in this segment. deleted_docs The number of deleted documents that are stored in this segment. It is perfectly fine if this number is greater than 0, space is going to be reclaimed when this segment gets merged. size_in_bytes The amount of disk space that this segment uses, in bytes. memory_in_bytes Segments need to store some data into memory in order to be searchable efficiently. This number returns the number of bytes that are used for that purpose. A value of -1 indicates that Elasticsearch was not able to compute this number. committed Whether the segment has been sync’ed on disk. Segments that are committed would survive a hard reboot. No need to worry in case of false, the data from uncommitted segments is also stored in the transaction log so that Elasticsearch is able to replay changes on the next start. search Whether the segment is searchable. A value of false would most likely mean that the segment has been written to disk but no refresh occurred since then to make it searchable. version The version of Lucene that has been used to write this segment. compound Whether the segment is stored in a compound file. When true, this means that Lucene merged all files from the segment in a single one in order to save file descriptors. Clear Cache
To see cluster-wide recovery status simply leave out the index names.
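For example (index names are illustrative):

curl -XGET 'http://localhost:9200/index1,index2/_recovery'

curl -XGET 'http://localhost:9200/_recovery?pretty&human'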
Here is a complete list of options:
detailed
Display a detailed view. This is primarily useful for viewing the recovery of physical index files. Default: false.
active_only
Display only those recoveries that are currently on-going. Default: false.
Shard ID
Recovery type:
gateway
snapshot
replica
relocating
stage
Recovery stage:
init: Recovery has not started
index: Reading index meta-data and copying bytes from source to destination
start: Starting the engine; opening the index for use
translog: Replaying transaction log
finalize: Cleanup
done: Complete
primary
True if shard is primary, false otherwise
start_time
Timestamp of recovery start
stop_time
Timestamp of recovery finish
total_time_in_millis
Total time to recover shard in milliseconds
source
Recovery source:
repository description if recovery is from a snapshot
description of source node otherwise
target
Destination node
index
Statistics about physical index recovery
translog
Statistics about translog recovery
start
Statistics about time to open and start the index
Flush
The flush API allows to flush one or more indices through an API. The flush process of an index basically frees memory from the index by flushing data to the index storage and clearing the internal transaction log. By default, Elasticsearch uses memory heuristics in order to automatically trigger flush operations as required in order to clear memory.

POST /twitter/_flush

Request Parameters

The flush API accepts the following request parameters:
Multi Index

The flush API can be applied to more than one index with a single call, or even on _all the indices.
To check whether a shard has a marker or not, look for the sync_id in the commit section of the shard stats returned by:

GET /twitter/_stats/commit?level=shards

which returns something similar to:

{
   "indices": {
      "twitter": {
         "primaries": {},
         "total": {},
         "shards": {
            "0": [
               {
                  "routing": {
                     ...
                  },
                  "commit": {
                     "id": "te7zF7C4UsirqvL6jp/vUg==",
                     "generation": 2,
                     "user_data": {
                        "sync_id": "AU2VU0meX-VX2aNbEUsD"
                     }
                  }
               }
            ]
         }
      }
   }
}
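The marker itself is written by a synced flush, which can also be requested explicitly, roughly like this:

curl -XPOST 'localhost:9200/twitter/_flush/synced'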
While handy, there are a couple of caveats for this API:
The optimize API accepts the following request parameters as query arguments:
max_num_segments
The number of segments to optimize to. To fully
optimize the index, set it to
1
. Defaults to simply checking if a
merge needs to execute, and if so, executes it.
only_expunge_deletes
Should the optimize process only expunge segments with
deletes in it. In Lucene, a document is not deleted from a segment, just marked
as deleted. During a merge process of segments, a new segment is created that
does not have those deletes. This flag allows to only merge segments that have
deletes. Defaults to
false
. Note that this won’t override the
index.merge.policy.expunge_deletes_allowed
threshold.
flush
Should a flush be performed after the optimize. Defaults to
true
.
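For example, fully optimizing an index down to a single segment might look like this (twitter is a placeholder index name):

curl -XPOST 'http://localhost:9200/twitter/_optimize?max_num_segments=1'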
Below is the list of settings that can be changed using the update settings API:
index.data_path
(string)
Path to use for the index’s data. Note that by default Elasticsearch will append the node ordinal to the path to ensure multiple instances of Elasticsearch on the same machine do not share a data directory.
index.shadow_replicas
Boolean value indicating this index should use shadow replicas. Defaults to
false
.
index.shared_filesystem
Boolean value indicating this index uses a shared filesystem. Defaults to
the
true
if
index.shadow_replicas
is set to true,
false
otherwise.
index.shared_filesystem.recover_on_any_node
Boolean value indicating whether the primary shards for the index should be
allowed to recover on any node in the cluster, regardless of the number of
replicas or whether the node has previously had the shard allocated to it
before. Defaults to
false
.
These are non-dynamic settings that need to be configured in
elasticsearch.yml
node.add_id_to_custom_path
Boolean setting indicating whether Elasticsearch should append the node’s
ordinal to the custom data path. For example, if this is enabled and a path
of "/tmp/foo" is used, the first locally-running node will use "/tmp/foo/0",
the second will use "/tmp/foo/1", the third "/tmp/foo/2", etc. Defaults to
true
.
node.enable_custom_paths
Boolean value that must be set to
true
in order to use the
index.data_path
setting. Defaults to
false
.
cat APIs
The
upgrade
API accepts the following request parameters:
only_ancient_segments
If true, only very old segments (from a
previous Lucene major release) will be upgraded. While this will do
the minimal work to ensure the next major release of Elasticsearch can
read the segments, it’s dangerous because it can leave other very old
segments in sub-optimal formats. Defaults to
false
.
Each of the commands accepts a query string parameter
v
to turn on
verbose output.
Each of the commands accepts a query string parameter
help
which will
output its available columns.
Each of the commands accepts a query string parameter
h
which forces
only those columns to appear.
fielddata
shows information about currently loaded fielddata on a per-node
basis.
Fields can be specified either as a query parameter, or in the URL path:
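For example (field names are illustrative):

curl 'localhost:9200/_cat/fielddata?v&fields=body,text'

curl 'localhost:9200/_cat/fielddata/body,text?v'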
The output shows the total fielddata and then the individual fielddata for the
body
and
text
fields.
The
indices
command provides a cross-section of each index. This
information
spans nodes
.
What’s my largest index by disk usage not including replicas?
How many merge operations have the shards for the wiki completed?
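Hedged sketches of answering these two questions with _cat/indices (the exact column names used may differ by version):

curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8

curl 'localhost:9200/_cat/indices/wiki?pri&v&h=health,index,docs.count,mt,pri.mt'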
The
nodes
command shows the cluster topology.
The next few give a picture of your heap, memory, and load.
Header | Alias | Appear by Default | Description | Example |
---|---|---|---|---|
 | | No | Unique node ID | k0zy |
 | | No | Process ID | 13061 |
 | | Yes | Host name | n1 |
 | | Yes | IP address | 127.0.1.1 |
 | | No | Bound transport port | 9300 |
 | | No | Elasticsearch version | 1.7.6 |
 | | No | Elasticsearch Build hash | 5c03844 |
 | | No | Running Java version | 1.8.0 |
 | | No | Available disk space | 1.8gb |
 | | No | Used heap | 311.2mb |
 | | Yes | Used heap percentage | 7 |
 | | No | Maximum configured heap | 1015.6mb |
 | | No | Used total memory | 513.4mb |
 | | Yes | Used total memory percentage | 47 |
 | | No | Total memory | 2.9gb |
 | | No | Used file descriptors | 123 |
 | | Yes | Used file descriptors percentage | 1 |
 | | No | Maximum number of file descriptors | 1024 |
 | | No | Most recent load average | 0.22 |
 | | No | Node uptime | 17.3m |
 | | Yes | Data node (d); Client node (c) | d |
 | | Yes | Current master (*); master eligible (m) | m |
 | | Yes | Node name | Venom |
 | | No | Size of completion | 0b |
 | | No | Used fielddata cache memory | 0b |
 | | No | Fielddata cache evictions | 0 |
 | | No | Used filter cache memory | 0b |
 | | No | Filter cache evictions | 0 |
 | | No | Number of flushes | 1 |
 | | No | Time spent in flush | 1 |
 | | No | Number of current get operations | 0 |
 | | No | Time spent in get | 14ms |
 | | No | Number of get operations | 2 |
 | | No | Time spent in successful gets | 14ms |
 | | No | Number of successful get operations | 2 |
 | | No | Time spent in failed gets | 0s |
 | | No | Number of failed get operations | 1 |
 | | No | Used ID cache memory | 216b |
 | | No | Number of current deletion operations | 0 |
 | | No | Time spent in deletions | 2ms |
 | | No | Number of deletion operations | 2 |
 | | No | Number of current indexing operations | 0 |
 | | No | Time spent in indexing | 134ms |
 | | No | Number of indexing operations | 1 |
 | | No | Number of current merge operations | 0 |
 | | No | Number of current merging documents | 0 |
 | | No | Size of current merges | 0b |
 | | No | Number of completed merge operations | 0 |
 | | No | Number of merged documents | 0 |
 | | No | Size of current merges | 0b |
 | | No | Time spent merging documents | 0s |
 | | No | Number of current percolations | 0 |
 | | No | Memory used by current percolations | 0b |
 | | No | Number of registered percolation queries | 0 |
 | | No | Time spent percolating | 0s |
 | | No | Total percolations | 0 |
 | | No | Number of refreshes | 16 |
 | | No | Time spent in refreshes | 91ms |
 | | No | Current fetch phase operations | 0 |
 | | No | Time spent in fetch phase | 37ms |
 | | No | Number of fetch operations | 7 |
 | | No | Open search contexts | 0 |
 | | No | Current query phase operations | 0 |
 | | No | Time spent in query phase | 43ms |
 | | No | Number of query operations | 9 |
 | | No | Number of segments | 4 |
 | | No | Memory used by segments | 1.4kb |
 | | No | Memory used by index writer | 18mb |
 | | No | Maximum memory index writer may use before it must write buffered documents to a new segment | 32mb |
 | | No | Memory used by version map | 1.0kb |
pending_tasks
provides the same information as the
/_cluster/pending_tasks
API in a
convenient tabular format.
% curl 'localhost:9200/_cat/pending_tasks?v' insertOrder timeInQueue priority source 1685 855ms HIGH update-mapping [foo][t] 1686 843ms HIGH update-mapping [foo][t] 1693 753ms HIGH refresh-mapping [foo][[t]] 1688 816ms HIGH update-mapping [foo][t] 1689 802ms HIGH update-mapping [foo][t] 1690 787ms HIGH update-mapping [foo][t] 1691 773ms HIGH update-mapping [foo][t]
The
plugins
command provides a view per node of running plugins. This information
spans nodes
.
We can tell quickly how many plugins per node we have and which versions.
The
recovery
command is a view of index shard recoveries, both on-going and previously
completed. It is a more compact view of the JSON
recovery
API.
A recovery event occurs anytime an index shard moves to a different node in the cluster. This can happen during a snapshot recovery, a change in replication level, node failure, or on node startup. This last type is called a local gateway recovery and is the normal way for shards to be loaded from disk when a node starts up.
As an example, here is what the recovery state of a cluster may look like when there are no shards in transit from one node to another:
> curl -XGET 'localhost:9200/_cat/recovery?v' index shard time type stage source target files percent bytes percent wiki 0 73 gateway done hostA hostA 36 100.0% 24982806 100.0% wiki 1 245 gateway done hostA hostA 33 100.0% 24501912 100.0% wiki 2 230 gateway done hostA hostA 36 100.0% 30267222 100.0%
In the above case, the source and target nodes are the same because the recovery type was gateway, i.e. they were read from local storage on node start.
Now let’s see what a live recovery looks like. By increasing the replica count of our index and bringing another node online to host the replicas, we can see what a live shard recovery looks like.
> curl -XPUT 'localhost:9200/wiki/_settings' -d'{"number_of_replicas":1}' {"acknowledged":true} > curl -XGET 'localhost:9200/_cat/recovery?v' index shard time type stage source target files percent bytes percent wiki 0 1252 gateway done hostA hostA 4 100.0% 23638870 100.0% wiki 0 1672 replica index hostA hostB 4 75.0% 23638870 48.8% wiki 1 1698 replica index hostA hostB 4 75.0% 23348540 49.4% wiki 1 4812 gateway done hostA hostA 33 100.0% 24501912 100.0% wiki 2 1689 replica index hostA hostB 4 75.0% 28681851 40.2% wiki 2 5317 gateway done hostA hostA 36 100.0% 30267222 100.0%
We can see in the above listing that our 3 initial shards are in various stages
of being replicated from one node to another. Notice that the recovery type is
shown as
replica
. The files and bytes copied are real-time measurements.
Finally, let’s see what a snapshot recovery looks like. Assuming I have previously made a backup of my index, I can restore it using the snapshot and restore API.
> curl -XPOST 'localhost:9200/_snapshot/imdb/snapshot_2/_restore' {"acknowledged":true} > curl -XGET 'localhost:9200/_cat/recovery?v' index shard time type stage repository snapshot files percent bytes percent imdb 0 1978 snapshot done imdb snap_1 79 8.0% 12086 9.0% imdb 1 2790 snapshot index imdb snap_1 88 7.7% 11025 8.1% imdb 2 2790 snapshot index imdb snap_1 85 0.0% 12072 0.0% imdb 3 2796 snapshot index imdb snap_1 85 2.4% 12048 7.2% imdb 4 819 snapshot init imdb snap_1 0 0.0% 0 0.0%
The first two columns contain the host and ip of a node.
The next three columns show the active queue and rejected statistics for the bulk thread pool.
Also other statistics of different thread pools can be retrieved by using the
h
(header) parameter.
Currently available thread pools :
Thread Pool | Alias | Description |
---|---|---|
bulk | | Thread pool used for bulk operations |
flush | | Thread pool used for flush operations |
generic | | Thread pool used for generic operations (e.g. background node discovery) |
get | | Thread pool used for get operations |
index | | |
management | | Thread pool used for management of Elasticsearch (e.g. cluster management) |
merge | | Thread pool used for merge operations |
optimize | | Thread pool used for optimize operations |
percolate | | Thread pool used for percolator operations |
refresh | | Thread pool used for refresh operations |
search | | |
snapshot | | Thread pool used for snapshot operations |
suggest | | Thread pool used for suggester operations |
warmer | | Thread pool used for index warm-up operations |
The thread pool name (or alias) must be combined with a thread pool field below to retrieve the requested information.
Field Name | Alias | Description |
---|---|---|
type | | The current (*) type of thread pool |
active | | The number of active threads in the current thread pool |
size | | The number of threads in the current thread pool |
queue | | The number of tasks in the queue for the current thread pool |
queue_size | | The maximum number of tasks in the queue for the current thread pool |
rejected | | The number of rejected threads in the current thread pool |
largest | | The highest number of active threads in the current thread pool |
completed | | The number of completed threads in the current thread pool |
min | | The configured minimum number of active threads allowed in the current thread pool |
max | | The configured maximum number of active threads allowed in the current thread pool |
keep_alive | | The configured keep alive time for threads |
Field Name | Alias | Description |
---|---|---|
id | | The unique node ID |
pid | | The process ID of the running node |
host | | The hostname for the current node |
ip | | The IP address for the current node |
port | | The bound transport port for the current node |
The
segments
command provides low level information about the segments
in the shards of an index. It provides information similar to the
_segments
endpoint.
% curl 'http://localhost:9200/_cat/segments?v' index shard prirep ip segment generation docs.count [...] test 4 p 192.168.2.105 _0 0 1 test1 2 p 192.168.2.105 _0 0 1 test1 3 p 192.168.2.105 _2 2 1
[...] docs.deleted size size.memory committed searchable version compound 0 2.9kb 7818 false true 4.10.2 true 0 2.9kb 7818 false true 4.10.2 true 0 2.9kb 7818 false true 4.10.2 true
The output shows information about index names and shard numbers in the first two columns.
If you only want to get information about segments in one particular index,
you can add the index name in the URL, for example
/_cat/segments/test
. Also,
several indexes can be queried like
/_cat/segments/test,test1
The following columns provide additional monitoring information:
The cluster health API allows to get a very simple status on the health of the cluster.
The API can also be executed against one or more indices to get just the specified indices health:
The cluster health API accepts the following request parameters:
The following is an example of getting the cluster health at the shards level:
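A sketch of such a request (twitter is an illustrative index name):

curl -XGET 'http://localhost:9200/_cluster/health/twitter?level=shards'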
The cluster state API allows to get a comprehensive state information of the whole cluster.
{ "timestamp": 1439326129256, "cluster_name": "elasticsearch", "status": "green", "indices": { "count": 3, "shards": { "total": 35, "primaries": 15, "replication": 1.333333333333333, "index": { "shards": { "min": 10, "max": 15, "avg": 11.66666666666666 "primaries": { "min": 5, "max": 5, "avg": 5 "replication": { "min": 1, "max": 2, "avg": 1.3333333333333333 "docs": { "count": 2, "deleted": 0 "store": { "size": "5.6kb", "size_in_bytes": 5770, "throttle_time": "0s", "throttle_time_in_millis": 0 "fielddata": { "memory_size": "0b", "memory_size_in_bytes": 0, "evictions": 0 "filter_cache": { "memory_size": "0b", "memory_size_in_bytes": 0, "evictions": 0 "id_cache": { "memory_size": "0b", "memory_size_in_bytes": 0 "completion": { "size": "0b", "size_in_bytes": 0 "segments": { "count": 2, "memory": "6.4kb", "memory_in_bytes": 6596, "index_writer_memory": "0b", "index_writer_memory_in_bytes": 0, "index_writer_max_memory": "275.7mb", "index_writer_max_memory_in_bytes": 289194639, "version_map_memory": "0b", "version_map_memory_in_bytes": 0, "fixed_bit_set": "0b", "fixed_bit_set_memory_in_bytes": 0 "percolate": { "total": 0, "get_time": "0s", "time_in_millis": 0, "current": 0, "memory_size_in_bytes": -1, "memory_size": "-1b", "queries": 0 "nodes": { "count": { "total": 2, "master_only": 0, "data_only": 0, "master_data": 2, "client": 0 "versions": [ "1.7.6" "os": { "available_processors": 4, "mem": { "total": "8gb", "total_in_bytes": 8589934592 "cpu": [ "vendor": "Intel", "model": "MacBookAir5,2", "mhz": 2000, "total_cores": 4, "total_sockets": 4, "cores_per_socket": 16, "cache_size": "256b", "cache_size_in_bytes": 256, "count": 1 "process": { "cpu": { "percent": 3 "open_file_descriptors": { "min": 200, "max": 346, "avg": 273 "jvm": { "max_uptime": "24s", "max_uptime_in_millis": 24054, "versions": [ "version": "1.6.0_45", "vm_name": "Java HotSpot(TM) 64-Bit Server VM", "vm_version": "20.45-b01-451", "vm_vendor": "Apple Inc.", "count": 2 "mem": { "heap_used": "38.3mb", "heap_used_in_bytes": 40237120, "heap_max": "1.9gb", "heap_max_in_bytes": 2130051072 "threads": 89 "fs": "total": "232.9gb", "total_in_bytes": 250140434432, "free": "31.3gb", "free_in_bytes": 33705881600, "available": "31.1gb", "available_in_bytes": 33443737600, "disk_reads": 21202753, "disk_writes": 27028840, "disk_io_op": 48231593, "disk_read_size": "528gb", "disk_read_size_in_bytes": 566980806656, "disk_write_size": "617.9gb", "disk_write_size_in_bytes": 663525366784, "disk_io_size": "1145.9gb", "disk_io_size_in_bytes": 1230506173440 "plugins": [ // all plugins installed on nodes "name": "inquisitor", "description": "", "url": "/_plugin/inquisitor/", "jvm": false, "site": true Cluster Reroute
{ "tasks": [ "insert_order": 101, "priority": "URGENT", "source": "create-index [foo_9], cause [api]", "time_in_queue_millis": 86, "time_in_queue": "86ms" "insert_order": 46, "priority": "HIGH", "source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from gateway]", "time_in_queue_millis": 842, "time_in_queue": "842ms" "insert_order": 45, "priority": "HIGH", "source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from gateway]", "time_in_queue_millis": 858, "time_in_queue": "858ms" Cluster Update Settings
The cluster responds with the settings updated. So the response for the last example will be:
Cluster wide settings can be returned using:
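A rough sketch of reading the cluster settings and of a transient update (the setting shown is only an illustrative choice from the list below):

curl -XGET localhost:9200/_cluster/settings

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "20mb"
    }
}'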
There is a specific list of settings that can be updated, those include:
cluster.routing.allocation.awareness.attributes
See
Cluster
.
cluster.routing.allocation.awareness.force.*
See
Cluster
.
cluster.routing.allocation.disable_allocation
See
Cluster
.
cluster.routing.allocation.disable_replica_allocation
See
Cluster
.
cluster.routing.allocation.disable_new_allocation
See
Cluster
.
cluster.routing.allocation.node_initial_primaries_recoveries
See
Cluster
.
cluster.routing.allocation.node_concurrent_recoveries
See
Cluster
.
cluster.routing.allocation.include.*
See
Cluster
.
cluster.routing.allocation.exclude.*
See
Cluster
.
cluster.routing.allocation.require.*
See
Cluster
.
discovery.zen.minimum_master_nodes
See
Zen Discovery
discovery.zen.publish_timeout
See
Zen Discovery
indices.cache.filter.size
See
Cache
indices.cache.filter.expire
(time)
See
Cache
indices.recovery.concurrent_streams
See
Indices
indices.recovery.concurrent_small_file_streams
See
Indices
indices.recovery.file_chunk_size
See
Indices
indices.recovery.translog_ops
See
Indices
indices.recovery.translog_size
See
Indices
indices.recovery.compress
See
Indices
indices.recovery.max_bytes_per_sec
See
Indices
indices.store.throttle.type
See
Store
indices.store.throttle.max_bytes_per_sec
See
Store
indices.breaker.fielddata.limit
See
Field data
indices.breaker.fielddata.overhead
See
Field data
Nodes Info
The cluster nodes stats API allows to retrieve one or more (or all) of the cluster nodes statistics.
The first command retrieves stats of all the nodes in the cluster. The
second command selectively retrieves nodes stats of only
nodeId1
and
nodeId2
. All the nodes selective options are explained
here
.
By default, all stats are returned. You can limit this by combining any
of
indices
,
os
,
process
,
jvm
,
network
,
transport
,
http
,
fs
,
breaker
and
thread_pool
. For example:
indices
Indices stats about size, document count, indexing and deletion times, search times, field cache size, merges and flushes
fs
File system information, data path, free disk space, read/write stats
http
HTTP connection information
jvm
JVM stats, memory pool information, garbage collection, buffer pools
network
TCP information
os
Operating system stats, load average, cpu, mem, swap
process
Process statistics, memory consumption, cpu usage, open file descriptors
thread_pool
Statistics about each thread pool, including current size, queue and rejected tasks
transport
Transport statistics about sent and received bytes in cluster communication
breaker
Statistics about the field data circuit breaker
# return indices and os
curl -XGET 'http://localhost:9200/_nodes/stats/os'
# return just os and process
curl -XGET 'http://localhost:9200/_nodes/stats/os,process'
# specific type endpoint
curl -XGET 'http://localhost:9200/_nodes/stats/process'
curl -XGET 'http://localhost:9200/_nodes/10.0.0.1/stats/process'
All stats can be explicitly requested via
/_nodes/stats/_all
or
/_nodes/stats?metric=_all
.
You can get information about field data memory usage on node level or on index level.
You can get statistics about search groups for searches executed on this node.
The cluster nodes info API allows to retrieve one or more (or all) of the cluster nodes information.
The first command retrieves information of all the nodes in the cluster.
The second command selectively retrieves nodes information of only
nodeId1
and
nodeId2
. All the nodes selective options are explained
here
.
By default, it just returns all attributes and core settings for a node.
It also allows to get only information on
settings
,
os
,
process
,
jvm
,
thread_pool
,
network
,
transport
,
http
and
plugins
:
curl -XGET 'http://localhost:9200/_nodes/process'
curl -XGET 'http://localhost:9200/_nodes/_all/process'
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/jvm,process'
# same as above
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/info/jvm,process'
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/_all'
The
_all
flag can be set to return all the information - or you can simply omit it.
plugins
- if set, the result will contain details about the loaded
plugins per node:
name
: plugin name
description
: plugin description if any
site
:
true
if the plugin is a site plugin
jvm
:
true
if the plugin is a plugin running in the JVM
url
: URL if the plugin is a site plugin
The result will look similar to:
{ "cluster_name" : "test-cluster-MacBook-Air-de-David.local", "nodes" : { "hJLXmY_NTrCytiIMbX4_1g" : { "name" : "node4", "transport_address" : "inet[/172.18.58.139:9303]", "hostname" : "MacBook-Air-de-David.local", "version" : "0.90.0.Beta2-SNAPSHOT", "http_address" : "inet[/172.18.58.139:9203]", "plugins" : [ { "name" : "test-plugin", "description" : "test-plugin description", "site" : true, "jvm" : false "name" : "test-no-version-plugin", "description" : "test-no-version-plugin description", "site" : true, "jvm" : false "name" : "dummy", "description" : "No description found for dummy.", "url" : "/_plugin/dummy/", "site" : false, "jvm" : true }
if your
plugin
data is subject to change use
plugins.info_refresh_interval
to change or disable the caching
interval:
# Change cache to 20 seconds
plugins.info_refresh_interval: 20s

# Infinite cache
plugins.info_refresh_interval: -1

# Disable cache
plugins.info_refresh_interval: 0
The output is plain text with a breakdown of each node’s top hot threads. Parameters allowed are:
threads
number of hot threads to provide, defaults to 3.
interval
the interval to do the second sampling of threads.
Defaults to 500ms.
The type to sample, defaults to cpu, but supports wait and
block to see hot threads that are in wait or block state.
ignore_idle_threads
If true, known idle threads (e.g. waiting in a socket select, or to
get a task from an empty queue) are filtered out. Defaults to true.
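A sketch of such a request combining a few of these parameters:

curl -XGET 'http://localhost:9200/_nodes/hot_threads?threads=5&interval=1s&type=wait'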
Specific node(s) can be shutdown as well using their respective node ids (or other selective options as explained here):

$ curl -XPOST 'http://localhost:9200/_cluster/nodes/nodeId1,nodeId2/_shutdown'

The master (of the cluster) can also be shutdown using:

$ curl -XPOST 'http://localhost:9200/_cluster/nodes/_master/_shutdown'

Finally, all nodes can be shutdown using one of the options below:

$ curl -XPOST 'http://localhost:9200/_shutdown'
$ curl -XPOST 'http://localhost:9200/_cluster/nodes/_shutdown'
$ curl -XPOST 'http://localhost:9200/_cluster/nodes/_all/_shutdown'

Delay

Disable Shutdown
The shutdown API can be disabled by setting
elasticsearch provides a full Query DSL based on JSON to define queries. In general, there are basic queries such as term or prefix. There are also compound queries like the bool query. Queries can also have filters associated with them such as the filtered or constant_score queries, with specific filter queries.

Think of the Query DSL as an AST of queries. Certain queries can contain other queries (like the bool query), others can contain filters (like the constant_score), and some can contain both a query and a filter (like the filtered). Each of those can contain any query of the list of queries or any filter from the list of filters, resulting in the ability to build quite complex (and interesting) queries.

Both queries and filters can be used in different APIs. For example, within a search query, or as a facet filter. This section explains the components (queries and filters) that can form the AST one can use.

Filters are very handy since they perform an order of magnitude better than plain queries since no scoring is performed and they are automatically cached. As a general rule, filters should be used instead of queries for binary yes/no searches and for queries on exact values.

Types of Match Queries

boolean
The default
The
The
Here is an example when providing additional parameters (note the slight
change in structure,
{ "match" : { "message" : { "query" : "this is a test", "operator" : "and" }
zero_terms_query.
If the analyzer used removes all tokens in a query like a
{ "match" : { "message" : { "query" : "to be or not to be", "operator" : "and", "zero_terms_query": "all" }
cutoff_frequency.
The match query supports a
This query allows handling
The
Here is an example showing a query composed of stopwords exclusively: { "match" : { "message" : { "query" : "to be or not to be", "cutoff_frequency" : 0.001 }
The
phrase
Since
match_phrase_prefixComparison to query_string / fieldOther options
The
{ "multi_match" : { "query": "this is a test", |
fields
and per-field boosting
Fields can be specified with wildcards, eg:
{ "multi_match" : { "query": "Will Smith", "fields": [ "title", "*_name" ]Query the
title
,first_name
andlast_name
fields.
Individual fields can be boosted with the caret (
^
) notation:
{ "multi_match" : { "query" : "this is a test", "fields" : [ "subject^3", "message" ]The
subject
field is three times as important as themessage
field.
use_dis_max
multi_match
query:
best_fields
(
default
) Finds documents which match any field, but
uses the
_score
from the best field. See
best_fields
.
most_fields
Finds documents which match any field and combines
the
_score
from each field. See
most_fields
.
cross_fields
Treats fields with the same
analyzer
as though they
were one big field. Looks for each word in
any
field. See
cross_fields
.
phrase
Runs a
match_phrase
query on each field and combines
the
_score
from each field. See
phrase
and
phrase_prefix
.
phrase_prefix
Runs a
match_phrase_prefix
query on each field and
combines the
_score
from each field. See
phrase
and
phrase_prefix
.
The
best_fields
type generates a
match
query
for
each field and wraps them in a
dis_max
query, to
find the single best matching field. For instance, this query:
{ "multi_match" : { "query": "brown fox", "type": "best_fields", "fields": [ "subject", "message" ], "tie_breaker": 0.3 }
would be executed as:
{ "dis_max": { "queries": [ { "match": { "subject": "brown fox" }}, { "match": { "message": "brown fox" }} "tie_breaker": 0.3 }
Normally the
best_fields
type uses the score of the
single
best matching
field, but if
tie_breaker
is specified, then it calculates the score as
follows: the score from the best matching field, plus tie_breaker * _score for all other matching fields.
Also, accepts
analyzer
,
boost
,
operator
,
minimum_should_match
,
fuzziness
,
prefix_length
,
max_expansions
,
rewrite
,
zero_terms_query
and
cutoff_frequency
, as explained in
match query
.
operator
and
minimum_should_match
{ "multi_match" : { "query": "Will Smith", "type": "best_fields", "fields": [ "first_name", "last_name" ], "operator": "and"All terms must be present.
(+first_name:will +first_name:smith) | (+last_name:will +last_name:smith)
In other words, all terms must be present in a single field for a document to match.
See
cross_fields
for a better solution.
The score from each
match
clause is added together, then divided by the
number of
match
clauses.
Also, accepts
analyzer
,
boost
,
operator
,
minimum_should_match
,
fuzziness
,
prefix_length
,
max_expansions
,
rewrite
,
zero_terms_query
and
cutoff_frequency
, as explained in
match query
, but
see
operator
and
minimum_should_match
.
The
phrase
and
phrase_prefix
types behave just like
best_fields
,
but they use a
match_phrase
or
match_phrase_prefix
query instead of a
match
query.
This query:
{ "multi_match" : { "query": "quick brown f", "type": "phrase_prefix", "fields": [ "subject", "message" ] }
would be executed as:
{ "dis_max": { "queries": [ { "match_phrase_prefix": { "subject": "quick brown f" }}, { "match_phrase_prefix": { "message": "quick brown f" }} }
Also, accepts
analyzer
,
boost
,
slop
and
zero_terms_query
as explained
in
Match Query
. Type
phrase_prefix
additionally accepts
max_expansions
.
This sounds like a job for
most_fields
but there are two problems
with that approach. The first problem is that
operator
and
minimum_should_match
are applied per-field, instead of per-term (see
explanation above
).
The second problem is to do with relevance: the different term frequencies in
the
first_name
and
last_name
fields can produce unexpected results.
For instance, imagine we have two people: “Will Smith” and “Smith Jones”. “Smith” as a last name is very common (and so is of low importance) but “Smith” as a first name is very uncommon (and so is of great importance).
If we do a search for “Will Smith”, the “Smith Jones” document will
probably appear above the better matching “Will Smith” because the score of
first_name:smith
has trumped the combined scores of
first_name:will
plus
last_name:smith
.
One way of dealing with these types of queries is simply to index the
first_name
and
last_name
fields into a single
full_name
field. Of
course, this can only be done at index time.
The
cross_field
type tries to solve these problems at query time by taking a
term-centric
approach. It first analyzes the query string into individual
terms, then looks for each term in any of the fields, as though they were one
big field.
A query like:
{ "multi_match" : { "query": "Will Smith", "type": "cross_fields", "fields": [ "first_name", "last_name" ], "operator": "and" }
is executed as:
+(first_name:will last_name:will) +(first_name:smith last_name:smith)
In other words,
all terms
must be present
in at least one field
for a
document to match. (Compare this to
the logic used for
best_fields
and
most_fields
.)
That solves one of the two problems. The problem of differing term frequencies is solved by blending the term frequencies for all fields in order to even out the differences.
In practice,
first_name:smith
will be treated as though it has the same
frequencies as
last_name:smith
, plus one. This will make matches on
first_name
and
last_name
have comparable scores, with a tiny advantage
for
last_name
since it is the most likely field that contains
smith
.
Note that
cross_fields
is usually only useful on short string fields
that all have a
boost
of
1
. Otherwise boosts, term freqs and length
normalization contribute to the score in such a way that the blending of term
statistics is not meaningful anymore.
If you run the above query through the Validate API , it returns this explanation:
+blended("will", fields: [first_name, last_name]) +blended("smith", fields: [first_name, last_name])
Also, accepts
analyzer
,
boost
,
operator
,
minimum_should_match
,
zero_terms_query
and
cutoff_frequency
, as explained in
match query
.
blended("jon", fields: [first, last]) blended("j", fields: [first.edge, last.edge]) blended("jo", fields: [first.edge, last.edge]) blended("jon", fields: [first.edge, last.edge]) )
Having multiple groups is fine, but when combined with
operator
or
minimum_should_match
, it can suffer from the
same problem
as
most_fields
or
best_fields
.
You can easily rewrite this query yourself as two separate
cross_fields
queries combined with a
bool
query, and apply the
minimum_should_match
parameter to just one of them:
{ "bool": { "should": [ "multi_match" : { "query": "Will Smith", "type": "cross_fields", "fields": [ "first", "last" ], "minimum_should_match": "50%""multi_match" : { "query": "Will Smith", "type": "cross_fields", "fields": [ "*.edge" ] Either
will
orsmith
must be present in either of thefirst
orlast
fields
You can force all fields into the same group by specifying the
analyzer
parameter in the query.
{ "multi_match" : { "query": "Jon", "type": "cross_fields", "analyzer": "standard","fields": [ "first", "last", "*.edge" ] Use the
standard
analyzer for all fields.
blended("will", fields: [first, first.edge, last.edge, last]) blended("smith", fields: [first, first.edge, last.edge, last])
0.0
Take the single best score out of (eg) first_name:will and last_name:will (default)
1.0
Add together the scores for (eg) first_name:will and last_name:will
0.0 < n < 1.0
Take the single best score plus tie_breaker multiplied by each of the scores from other matching fields.
Boosting Query
{ "bool" : { "must" : { "term" : { "user" : "kimchy" } "must_not" : { "range" : { "age" : { "from" : 10, "to" : 20 } "should" : [ "term" : { "tag" : "wow" } "term" : { "tag" : "elasticsearch" } "minimum_should_match" : 1, "boost" : 1.0 Common Terms Query { "boosting" : { "positive" : { "term" : { "field1" : "value1" "negative" : { "term" : { "field2" : "value2" "negative_boost" : 0.2 Constant Score Query The problemThe solution
If a query consists only of high frequency terms, then a single query is
executed as an
Terms are allocated to the high or low frequency groups based on the
Perhaps the most interesting property of this query is that it adapts to
domain specific stopwords automatically. For example, on a video hosting
site, common terms like
Examples
The number of terms which should match can be controlled with the
For low frequency terms, set the
{ "common": { "body": { "query": "nelly the elephant as a cartoon", "cutoff_frequency": 0.001, "low_freq_operator": "and" } which is roughly equivalent to: { "bool": { "must": [ { "term": { "body": "nelly"}}, { "term": { "body": "elephant"}}, { "term": { "body": "cartoon"}} "should": [ { "term": { "body": "the"}} { "term": { "body": "as"}} { "term": { "body": "a"}} }
Alternatively use
{ "common": { "body": { "query": "nelly the elephant as a cartoon", "cutoff_frequency": 0.001, "minimum_should_match": 2 } which is roughly equivalent to: { "bool": { "must": { "bool": { "should": [ { "term": { "body": "nelly"}}, { "term": { "body": "elephant"}}, { "term": { "body": "cartoon"}} "minimum_should_match": 2 "should": [ { "term": { "body": "the"}} { "term": { "body": "as"}} { "term": { "body": "a"}} } minimum_should_match
A different
{ "common": { "body": { "query": "nelly the elephant not as a cartoon", "cutoff_frequency": 0.001, "minimum_should_match": { "low_freq" : 2, "high_freq" : 3 } which is roughly equivalent to: { "bool": { "must": { "bool": { "should": [ { "term": { "body": "nelly"}}, { "term": { "body": "elephant"}}, { "term": { "body": "cartoon"}} "minimum_should_match": 2 "should": { "bool": { "should": [ { "term": { "body": "the"}}, { "term": { "body": "not"}}, { "term": { "body": "as"}}, { "term": { "body": "a"}} "minimum_should_match": 3 }
In this case it means the high frequency terms have only an impact on
relevance when there are at least three of them. But the most
interesting use of the
{ "common": { "body": { "query": "how not to be", "cutoff_frequency": 0.001, "minimum_should_match": { "low_freq" : 2, "high_freq" : 3 } which is roughly equivalent to: { "bool": { "should": [ { "term": { "body": "how"}}, { "term": { "body": "not"}}, { "term": { "body": "to"}}, { "term": { "body": "be"}} "minimum_should_match": "3<50%" }
The high frequency generated query is then slightly less restrictive
than with an
The
A query can also be wrapped in a
This query maps to Lucene
{ "dis_max" : { "tie_breaker" : 0.7, "boost" : 1.2, "queries" : [ "term" : { "age" : 34 } "term" : { "age" : 35 } Fuzzy Like This Query
The
Exclude as many documents as you can with a filter, then query just the documents that remain.

{
  "filtered": {
    "query": {
      "match": { "tweet": "full text search" }
    },
    "filter": {
      "range": { "created": { "gte": "now-1d/d" }}
    }
  }
}
The
curl -XGET localhost:9200/_search -d ' "query": { "filtered": { |
If a
query
is not specified, it defaults to the
match_all
query
. This means that the
filtered
query can be used to wrap just a filter, so that it can be used
wherever a query is expected.
curl -XGET localhost:9200/_search -d '
{
  "query": {
    "filtered": {
      "filter": {
        "range": { "created": { "gte": "now-1d/d" }}
      }
    }
  }
}'

No query has been specified, so this request applies just the filter, returning all documents created since yesterday.
Multiple filters can be applied by wrapping them in a bool filter, for example:
{ "filtered": { "query": { "match": { "tweet": "full text search" }}, "filter": { "bool": { "must": { "range": { "created": { "gte": "now-1d/d" }}}, "should": [ { "term": { "featured": true }}, { "term": { "starred": true }} "must_not": { "term": { "deleted": false }} }
Similarly, multiple queries can be combined with a bool query.
You can control how the filter and query are executed with the strategy parameter:
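For example, a filtered query with an explicit strategy might look roughly like the following sketch (the query and filter shown are illustrative):
{
    "filtered" : {
        "query" :  { "match" : { "tweet" : "full text search" } },
        "filter" : { "range" : { "created" : { "gte" : "now-1d/d" } } },
        "strategy" : "leap_frog"
    }
}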
The strategy parameter accepts the following options:
leap_frog_query_first
Look for the first document matching the query, and then alternatively
advance the query and the filter to find common matches.
leap_frog_filter_first
Look for the first document matching the filter, and then alternatively
advance the query and the filter to find common matches.
leap_frog
Same as
leap_frog_query_first
.
query_first
If the filter supports random access, then search for documents using the
query, and then consult the filter to check whether there is a match.
Otherwise fall back to
leap_frog_query_first
.
random_access_${threshold}
If the filter supports random access and if there is at least one matching
document among the first
threshold
ones, then apply the filter first.
Otherwise fall back to
leap_frog_query_first
.
${threshold}
must be
greater than or equal to
1
.
random_access_always
Apply the filter first if it supports random access. Otherwise fall back
to
leap_frog_query_first
.
|
fuzzy_like_this can be shortened to flt.
The fuzzy_like_this top level parameters include:
Parameter | Description |
---|---|
|
A list of the fields to run the more like this query against.
Defaults to the
|
|
The text to find documents like it, required . |
|
Should term frequency be ignored. Defaults to
|
|
The maximum number of query terms that will be
included in any generated query. Defaults to
|
|
The minimum similarity of the term variants. Defaults
to
|
|
Length of required common prefix on variant terms.
Defaults to
|
|
Sets the boost value of the query. Defaults to
|
|
The analyzer that will be used to analyze the text. Defaults to the analyzer associated with the field. |
fuzzy_like_this_field can be shortened to flt_field.
The fuzzy_like_this_field top level parameters include:
Parameter | Description |
---|---|
|
The text to find documents like it, required . |
|
Should term frequency be ignored. Defaults to
|
|
The maximum number of query terms that will be
included in any generated query. Defaults to
|
|
The fuzziness of the term variants. Defaults
to
|
|
Length of required common prefix on variant terms.
Defaults to
|
|
Sets the boost value of the query. Defaults to
|
|
The analyzer that will be used to analyze the text. Defaults to the analyzer associated with the field. |
function_score can be used with only one function like this:
"function_score": {
    "(query|filter)": {},
    "boost": "boost for the whole query",
    "FUNCTION": {},
    "boost_mode": "(multiply|replace|...)"
}
See Score functions for a list of supported functions.
Furthermore, several functions can be combined. In this case one can optionally choose to apply the function only if a document matches a given filter:
"function_score": { "(query|filter)": {}, "boost": "boost for the whole query", "functions": [ "filter": {}, "FUNCTION": {},"weight": number "FUNCTION": {}
"filter": {}, "weight": number "max_boost": number, "score_mode": "(multiply|max|...)", "boost_mode": "(multiply|replace|...)", "min_score" : number See Score functions for a list of supported functions.
If no filter is given with a function, this is equivalent to specifying "match_all": {}.
First, each document is scored by the defined functions. The parameter score_mode specifies how the computed scores are combined:
multiply
scores are multiplied (default)
sum
scores are summed
avg
scores are averaged
first
the first function that has a matching filter is applied
max
maximum score is used
min
minimum score is used
|
Because scores can be on different scales (for example, between 0 and 1 for decay functions but arbitrary for field_value_factor) and also because sometimes a different impact of functions on the score is desirable, the score of each function can be adjusted with a user defined weight. The weight can be defined per function in the functions array (example above) and is multiplied with the score computed by the respective function.
If weight is given without any other function declaration,
weight
acts as a function that simply returns the
weight
.
The new score can be restricted to not exceed a certain limit by setting
the
max_boost
parameter. The default for
max_boost
is FLT_MAX.
The newly computed score is combined with the score of the
query. The parameter
boost_mode
defines how:
multiply
query score and function score is multiplied (default)
replace
only function score is used, the query score is ignored
sum
query score and function score are added
avg
average of query score and function score
max
max of query score and function score
min
min of query score and function score
|
By default, modifying the score does not change which documents match. To exclude
documents that do not meet a certain score threshold the
min_score
parameter can be set to the desired score threshold.
The
function_score
query provides several types of score functions:
script_score
weight
random_score
field_value_factor
decay functions
:
gauss
,
linear
,
exp
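For example, a field_value_factor function roughly like the following sketch (assuming a numeric popularity field, a factor of 1.2 and the sqrt modifier):
"field_value_factor": {
    "field": "popularity",
    "factor": 1.2,
    "modifier": "sqrt"
}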
Which will translate into the following formula for scoring:
sqrt(1.2 * doc['popularity'].value)
There are a number of options for the
field_value_factor
function:
field
Field to be extracted from the document.
factor
Optional factor to multiply the field value with, defaults to
1
.
modifier
Modifier to apply to the field value, can be one of:
none
,
log
,
log1p
,
log2p
,
ln
,
ln1p
,
ln2p
,
square
,
sqrt
, or
reciprocal
.
Defaults to
none
.
missing
Value used if the document doesn’t have that field. The modifier
and factor are still applied to it as though it were read from the document.
|
"DECAY_FUNCTION": {"FIELD_NAME": {
"origin": "11, 12", "scale": "2km", "offset": "0km", "decay": 0.33 The
DECAY_FUNCTION
should be one oflinear
,exp
, orgauss
. The specified field must be a numeric field.
In the above example, the field is a
Geo Point Type
and origin can be provided in geo format.
scale
and
offset
must be given with a unit in this case. If your field is a date field, you can set
scale
and
offset
as days, weeks, and so on. Example:
"gauss": { "date": { "origin": "2013-09-17","scale": "10d", "offset": "5d",
"decay" : 0.5
The date format of the origin depends on the Date Format defined in your mapping. If you do not define the origin, the current time is used. The offset and decay parameters are optional.
offset
If an offset is defined, the decay function will only compute the decay function for documents with a distance greater than the defined offset. The default is 0.
decay
The decay parameter defines how documents are scored at the distance given at scale. If no decay is defined, documents at the distance scale will be scored 0.5.
In the first example, your documents might represent hotels and contain a geo location field. You want to compute a decay function depending on how far the hotel is from a given location. You might not immediately see what scale to choose for the gauss function, but you can say something like: "At a distance of 2km from the desired location, the score should be reduced by one third." The parameter "scale" will then be adjusted automatically to assure that the score function computes a score of 0.5 for hotels that are 2km away from the desired location.
In the second example, documents with a field value between 2013-09-12 and 2013-09-22 would get a weight of 1.0 and documents which are 15 days from that date a weight of 0.5.
The
DECAY_FUNCTION
determines the shape of the decay:
gauss
See
Normal decay, keyword
gauss
for graphs demonstrating the curve generated by the
gauss
function.
Exponential decay, computed as:
where again the parameter
is computed to assure that the score takes the value
decay
at distance
scale
from
origin
+-
offset
See
Exponential decay, keyword
exp
for graphs demonstrating the curve generated by the
exp
function.
linear
Linear decay, computed as:
.
where again the parameter
s
is computed to assure that the score takes the value
decay
at distance
scale
from
origin
+-
offset
In contrast to the normal and exponential decay, this function actually sets the score to 0 if the field value exceeds twice the user given scale value.
See
Linear decay, keyword
linear
for graphs demonstrating the curve generated by the
linear
function.
min
Distance is the minimum distance
max
Distance is the maximum distance
avg
Distance is the average distance
sum
Distance is the sum of all distances
The function for
price
in this case would be
"gauss": {"location": { "origin": "11, 12", "scale": "2km" The decay function could also be
linear
orexp
.
Next, we show what the computed score looks like for each of the three possible decay functions.
"function_score": { "functions": [ "weight": "3", "filter": {...} "filter": {...}, "script_score": { "params": { "param1": 2, "param2": 3.1 "script": "_score * doc['my_numeric_field'].value / pow(param1, param2)" "query": {...}, "score_mode": "first" GeoShape Query
Or with more advanced settings:
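A rough sketch of such a query, with illustrative values for the parameters described below:
{
    "fuzzy" : {
        "user" : {
            "value" : "ki",
            "boost" : 1.0,
            "fuzziness" : 2,
            "prefix_length" : 0,
            "max_expansions" : 100
        }
    }
}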
fuzziness
The maximum edit distance. Defaults to
AUTO
. See
the section called “Fuzziness
”.
prefix_length
The number of initial characters which will not be “fuzzified”. This
helps to reduce the number of terms which must be examined. Defaults
to
0
.
max_expansions
The maximum number of terms that the
fuzzy
query will expand to.
Defaults to
50
.
|
Performs a
Range Query
“around” the value using the
fuzziness
value as a
+/-
range, where:
-fuzziness <= field value <= +fuzziness
For example:
{ "fuzzy" : { "price" : { "value" : 12, "fuzziness" : 2 }
Will result in a range query between 10 and 14. Date fields support time values , eg:
{ "fuzzy" : { "created" : { "value" : "2010-02-05T12:05:07", "fuzziness" : "1d" }
See the section called “Fuzziness ” for more details about accepted values.
Query version of the geo_shape Filter .
Requires the geo_shape Mapping .
Given a document that looks like this:
{ "name": "Wind & Wetter, Berlin, Germany", "location": { "type": "Point", "coordinates": [13.400544, 52.530286] }
The following query will find the point:
{ "query": { "geo_shape": { "location": { "shape": { "type": "envelope", "coordinates": [[13, 53],[14, 52]] }
See the Filter’s documentation for more information.
Currently Elasticsearch does not have any notion of geo shape relevancy,
consequently the Query internally uses a
constant_score
Query which
wraps a
geo_shape filter
.
The geo_shape strategy mapping parameter determines which spatial relation operators may be used at search time.
The following is a complete list of spatial relation operators available:
INTERSECTS
- (default) Return all documents whose
geo_shape
field
intersects the query geometry.
DISJOINT
- Return all documents whose
geo_shape
field
has nothing in common with the query geometry.
WITHIN
- Return all documents whose
geo_shape
field
is within the query geometry.
The
has_child
query works the same as the
has_child
filter,
by automatically wrapping the filter with a
constant_score
(when using the default score type). It has the same syntax as the
has_child
filter:
{ "has_child" : { "type" : "blog_tag", "query" : { "term" : { "tag" : "something" }
An important difference with the top_children query is that this query is always executed in two iterations whereas the top_children query can be executed in one or more iterations. When using the has_child query the total_hits is always correct.
{ "has_child" : { "type" : "blog_tag", "score_mode" : "sum", "min_children": 2,"max_children": 10,
"query" : { "term" : { "tag" : "something" Both
min_children
andmax_children
are optional.
The
min_children
and
max_children
parameters can be combined with
the
score_mode
parameter.
In order to support parent-child joins, all of the (string) parent IDs must be resident in memory (in the field data cache). Additionally, every child document is mapped to its parent using a long value (approximately). It is advisable to keep the string parent ID short in order to reduce memory usage.
You can check how much memory is being used by the ID cache using the indices stats or nodes stats APIs, eg:
curl -XGET "http://localhost:9200/_stats/id_cache?pretty&human"
The
has_parent
query works the same as the
has_parent
filter, by automatically wrapping the filter with a constant_score (when
using the default score type). It has the same syntax as the
has_parent
filter.
{ "has_parent" : { "parent_type" : "blog", "query" : { "term" : { "tag" : "something" }
In order to support parent-child joins, all of the (string) parent IDs must be resident in memory (in the field data cache . Additionally, every child document is mapped to its parent using a long value (approximately). It is advisable to keep the string parent ID short in order to reduce memory usage.
You can check how much memory is being used by the ID cache using the indices stats or nodes stats APIS, eg:
curl -XGET "http://localhost:9200/_stats/id_cache?pretty&human"
Filters documents that only have the provided ids. Note, this filter does not require the _id field to be indexed since it works using the _uid field.
{ "ids" : { "type" : "my_type", "values" : ["1", "4", "100"] }
The
type
is optional and can be omitted, and can also accept an array
of values.
A query that matches all documents. Maps to Lucene
MatchAllDocsQuery
.
Which can also have boost associated with it:
Another use case consists of asking for similar documents to ones already existing in the index. In this case, the syntax to specify a document is similar to the one used in the Multi GET API .
{ "more_like_this" : { "fields" : ["title", "description"], "docs" : [ "_index" : "imdb", "_type" : "movies", "_id" : "1" "_index" : "imdb", "_type" : "movies", "_id" : "2" "min_term_freq" : 1, "max_query_terms" : 12 }
Finally, users can also provide documents not necessarily present in the index using a syntax similar to artificial documents.
{ "more_like_this" : { "fields" : ["name.first", "name.last"], "docs" : [ "_index" : "marvel", "_type" : "quotes", "doc" : { "name": { "first": "Ben", "last": "Grimm" "tweet": "You got no idea what I'd... what I'd give to be invisible." "_index" : "marvel", "_type" : "quotes", "_id" : "2" "min_term_freq" : 1, "max_query_terms" : 12 }
Suppose we wanted to find all documents similar to a given input document.
Obviously, the input document itself should be its best match for that type of
query. And the reason would be mostly, according to
Lucene scoring formula
,
due to the terms with the highest tf-idf. Therefore, the terms of the input
document that have the highest tf-idf are good representatives of that
document, and could be used within a disjunctive query (or
OR
) to retrieve
similar documents. The MLT query simply extracts the text from the input
document, analyzes it, usually using the same analyzer as the field, then
selects the top K terms with highest tf-idf to form a disjunctive query of
these terms.
The fields on which to perform MLT must be indexed and of type
string
. Additionally, when using
like
with documents, either
_source
must be enabled or the fields must be
stored
or have
term_vector
enabled.
In order to speed up analysis, it could help to store term vectors at index
time, but at the expense of disk usage.
For example, if we wish to perform MLT on the "title" and "tags.raw" fields,
we can explicitly store their
term_vector
at index time. We can still
perform MLT on the "description" and "tags" fields, as
_source
is enabled by
default, but there will be no speed up on analysis for these fields.
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields" : {
            "raw": {
              "type" : "string",
              "index" : "not_analyzed",
              "term_vector" : "yes"
            }
          }
        }
      }
    }
  }
}'
The list of documents to find documents like it. The syntax to specify
documents is similar to the one used by the
Multi GET API
.
The text is fetched from
fields
unless overridden in each document request.
The text is analyzed by the analyzer at the field, but could also be
overridden. The syntax to override the analyzer at the field follows a similar
syntax to the
per_field_analyzer
parameter of the
Term Vectors API
. Additionally, to
provide documents not necessarily present in the index,
artificial documents
are also supported.
A list of document ids, shortcut to
docs
if
_index
and
_type
are the
same as the request.
like_text
The text to find documents like it.
required
if
ids
or
docs
are not
specified.
fields
A list of the fields to run the more like this query against. Defaults to the
_all
field for
like_text
and to all possible fields for
ids
or
docs
.
|
max_query_terms
The maximum number of query terms that will be selected. Increasing this value
gives greater accuracy at the expense of query execution speed. Defaults to
min_term_freq
The minimum term frequency below which the terms will be ignored from the
input document. Defaults to
2
.
min_doc_freq
The minimum document frequency below which the terms will be ignored from the
input document. Defaults to
5
.
max_doc_freq
The maximum document frequency above which the terms will be ignored from the
input document. This could be useful in order to ignore highly frequent words
such as stop words. Defaults to unbounded (
0
).
min_word_length
The minimum word length below which the terms will be ignored. Defaults to
0
.
max_word_length
The maximum word length above which the terms will be ignored. Defaults to unbounded (
0
).
stop_words
An array of stop words. Any word in this set is considered "uninteresting" and
ignored. If the analyzer allows for stop words, you might want to tell MLT to
explicitly ignore them, as for the purposes of document similarity it seems
reasonable to assume that "a stop word is never interesting".
analyzer
The analyzer that is used to analyze the free form text. Defaults to the
analyzer associated with the first field in
fields
.
|
minimum_should_match
After the disjunctive query has been formed, this parameter controls the
number of terms that must match. The syntax is the same as the
minimum should match
. (Defaults to
"30%"
).
percent_terms_to_match
Each term in the formed query could be further boosted by their tf-idf score.
This sets the boost factor to use when using this feature. Defaults to
deactivated (
0
). Any other positive value activates terms boosting with the
given boost factor.
include
Specifies whether the input documents should also be included in the search
results returned. Defaults to
false
.
boost
Sets the boost value of the whole query. Defaults to
1.0
.
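For reference, a minimal more_like_this query using free form text might look roughly like this (field names and text are illustrative):
{
    "more_like_this" : {
        "fields" : ["title", "description"],
        "like_text" : "Once upon a time",
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}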
Nested query allows to query nested objects / docs (see nested mapping). The query is executed against the nested objects / docs as if they were indexed as separate docs (they are, internally) and resulting in the root parent doc (or parent nested mapping). Here is a sample mapping we will work with:
{
    "type1" : {
        "properties" : {
            "obj1" : {
                "type" : "nested"
            }
        }
    }
}
And here is a sample nested query usage:
{
    "nested" : {
        "path" : "obj1",
        "score_mode" : "avg",
        "query" : {
            "bool" : {
                "must" : [
                    { "match" : { "obj1.name" : "blue" } },
                    { "range" : { "obj1.count" : { "gt" : 5 } } }
                ]
            }
        }
    }
}
The query
The
Multi level nesting is automatically supported, and detected, resulting in an inner nested query to automatically match the relevant nesting level (and not root) if it exists within another nested query. A boost can also be associated with the query: This multi term query allows you to control how it gets rewritten using the rewrite parameter. A query that uses a query parser in order to parse its content. Here is an example:
The
When a multi term query is being generated, one can control how it gets rewritten using the rewrite parameter. Default Field
So, if
Multi Fieldfield1:query_term OR field2:query_term | ... For example, the following query
The query string “mini-language” is used by the
Query String Query
and by the
The query string is parsed into a series of
terms
and
operators
. A
term can be a single word —
Operators allow you to customize the search — the available options are explained below.
As mentioned in
Query String Query
, the
name:/joh?n(ath[oa]n)/ The supported regular expression syntax is explained in Regular expression syntax .
The
/.*n/ Use with caution! quikc~ brwn~ foks~ This uses the Damerau-Levenshtein distance to find all terms with a maximum of two changes, where a change is the insertion, deletion or substitution of a single character, or transposition of two adjacent characters.
The default
edit distance
is
quikc~1 Curly and square brackets can be combined: Ranges with one side unbounded can use the following syntax: age:>10 age:>=10 age:<10 age:<=10
The parsing of ranges in query strings can be complex and error prone. It is much more reliable to use an explicit range query.
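For instance, the last range above is expressed more robustly as an explicit range query (a minimal sketch using the age field):
{
    "range" : {
        "age" : {
            "gt" : 10
        }
    }
}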
quick brown +fox -news
Rewriting the above query using
In contrast, the same query rewritten using the
{ "bool": { "must": { "match": "fox" }, "should": { "match": "quick brown" }, "must_not": { "match": "news" } } Multiple terms or clauses can be grouped together with parentheses, to form sub-queries: (quick OR brown) AND fox Groups can be used to target a particular field, or to boost the result of a sub-query: status:(active OR pending) title:(full text search)^2
The
Simple Query String Syntax
The
In order to search for any of these special characters, they will need to
be escaped with
Default Field
So, if
Multi FieldFlags
The
Date options
In the above example,
{ "range" : { "born" : { "gte": "01/01/2012", "lte": "2013", "format": "dd/MM/yyyy||yyyy" Span First Query
The
Note
: The performance of a
{ "regexp":{ "name.first": "s.*y" } Boosting is also supported { "regexp":{ "name.first":{ "value":"s.*y", "boost":1.2 } You can also use special flags { "regexp":{ "name.first": { "value": "s.*y", "flags" : "INTERSECTION|COMPLEMENT|EMPTY" }
Possible flags are
Regular expressions are dangerous because it’s easy to accidentally
create an innocuous looking one that requires an exponential number of
internal determinized automaton states (and corresponding RAM and CPU)
for Lucene to execute. Lucene prevents these using the
{ "regexp":{ "name.first": { "value": "s.*y", "flags" : "INTERSECTION|COMPLEMENT|EMPTY", "max_determinized_states": 20000 }
A boost can also be associated with the query:
{
    "span_multi":{
        "match":{
            "prefix" : { "user" : { "value" : "ki", "boost" : 1.08 } }
        }
    }
}
|
By default, however,
string
fields are
analyzed
. This means that their
values are first passed through an
analyzer
to produce a list of
terms, which are then added to the inverted index.
There are many ways to analyze text: the default
standard
analyzer
drops most punctuation,
breaks up text into individual words, and lower cases them. For instance,
the
standard
analyzer would turn the string “Quick Brown Fox!” into the
terms [
quick
,
brown
,
fox
].
This analysis process makes it possible to search for individual words within a big block of full text.
The
term
query looks for the
exact
term in the field’s inverted index — it doesn’t know anything about the field’s analyzer. This makes it useful for
looking up values in
not_analyzed
string fields, or in numeric or date
fields. When querying full text fields, use the
match
query
instead, which understands how the field
has been analyzed.
To demonstrate, try out the example below. First, create an index, specifying the field mappings, and index a document:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_text": {
          "type": "string"
        },
        "exact_value": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "full_text":   "Quick Foxes!",
  "exact_value": "Quick Foxes!"
}
The
full_text
field isanalyzed
by default. Theexact_value
field is set to benot_analyzed
. Thefull_text
inverted index will contain the terms: [quick
,foxes
]. Theexact_value
inverted index will contain the exact term: [Quick Foxes!
].
Now, compare the results for the
term
query and the
match
query:
GET my_index/my_type/_search
{
  "query": {
    "term": {
      "exact_value": "Quick Foxes!"
    }
  }
}

GET my_index/my_type/_search
{
  "query": {
    "term": {
      "full_text": "Quick Foxes!"
    }
  }
}

GET my_index/my_type/_search
{
  "query": {
    "term": {
      "full_text": "foxes"
    }
  }
}

GET my_index/my_type/_search
{
  "query": {
    "match": {
      "full_text": "Quick Foxes!"
    }
  }
}
This query matches because the
exact_value
field contains the exact termQuick Foxes!
. This query does not match, because thefull_text
field only contains the termsquick
andfoxes
. It does not contain the exact termQuick Foxes!
. Aterm
query for the termfoxes
matches thefull_text
field. Thismatch
query on thefull_text
field first analyzes the query string, then looks for documents containingquick
orfoxes
or both. Top Children Query
In order to support parent-child joins, all of the (string) parent IDs must be resident in memory (in the field data cache . Additionally, every child document is mapped to its parent using a long value (approximately). It is advisable to keep the string parent ID short in order to reduce memory usage.
You can check how much memory is being used by the ID cache using the indices stats or nodes stats APIS, eg:
curl -XGET "http://localhost:9200/_stats/id_cache?pretty&human"
A boost can also be associated with the query:
This multi term query allows to control how it gets rewritten using the rewrite parameter.
The
minimum_should_match
parameter possible values:
Type | Example | Description |
---|---|---|
Integer |
|
Indicates a fixed value regardless of the number of optional clauses. |
Negative integer |
|
Indicates that the total number of optional clauses, minus this number should be mandatory. |
Percentage |
|
Indicates that this percent of the total number of optional clauses are necessary. The number computed from the percentage is rounded down and used as the minimum. |
Negative percentage |
|
Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum. |
Combination |
|
A positive integer, followed by the less-than symbol, followed by any of the previously mentioned specifiers is a conditional specification. It indicates that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it’s greater than the integer, the specification applies. In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required. |
Multiple combinations |
|
Multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it. In this example: if there are 1 or 2 clauses both are required, if there are 3-9 clauses all but 25% are required, and if there are more than 9 clauses, all but three are required. |
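For illustration, a bool query using one of these specifications might look roughly like the following sketch (the tags are made up):
{
    "bool" : {
        "should" : [
            { "term" : { "tag" : "wow" } },
            { "term" : { "tag" : "elasticsearch" } },
            { "term" : { "tag" : "search" } }
        ],
        "minimum_should_match" : "2<75%"
    }
}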
rewrite
parameter:
constant_score_auto
, defaults to
automatically choosing either
constant_score_boolean
or
constant_score_filter
based on query characteristics.
scoring_boolean
: A rewrite method that first translates each term
into a should clause in a boolean query, and keeps the scores as
computed by the query. Note that typically such scores are meaningless
to the user, and require non-trivial CPU to compute, so it’s almost
always better to use
constant_score_auto
. This rewrite method will hit
too many clauses failure if it exceeds the boolean query limit (defaults
to
1024
).
constant_score_boolean
: Similar to
scoring_boolean
except scores
are not computed. Instead, each matching document receives a constant
score equal to the query’s boost. This rewrite method will hit too many
clauses failure if it exceeds the boolean query limit (defaults to
1024
).
constant_score_filter
: A rewrite method that first creates a private
Filter by visiting each term in sequence and marking all docs for that
term. Matching documents are assigned a constant score equal to the
query’s boost.
top_terms_N
: A rewrite method that first translates each term into
should clause in boolean query, and keeps the scores as computed by the
query. This rewrite method only uses the top scoring terms so it will
not overflow boolean max clause count. The
N
controls the size of the
top scoring terms to use.
top_terms_boost_N
: A rewrite method that first translates each term
into should clause in boolean query, but the scores are only computed as
the boost. This rewrite method only uses the top scoring terms so it
will not overflow the boolean max clause count. The
N
controls the
size of the top scoring terms to use.
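For illustration, the rewrite parameter is set directly on a multi term query such as prefix; a rough sketch (field and value are illustrative):
{
    "prefix" : {
        "user" : {
            "value" : "ki",
            "rewrite" : "constant_score_boolean"
        }
    }
}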
Filters
The above request is translated into:
Alternatively passing the template as an escaped string works as well:
GET /_search
{
    "query": {
        "template": {
            "query": "{ \"match\": { \"text\": \"{{query_string}}\" }}",
            "params" : {
                "query_string" : "all about search"
            }
        }
    }
}
New line characters (\n) should be escaped as \\n or removed, and quotes (") should be escaped as \\".
GET /_search
{
    "query": {
        "template": {
            "file": "my_template",
            "params" : {
                "query_string" : "all about search"
            }
        }
    }
}
Name of the query template in config/scripts/, i.e., my_template.mustache.
Alternatively, you can register a query template in the special
.scripts
index with:
and refer to it in the
template
query with the
id
parameter:
GET /_search
{
    "query": {
        "template": {
            "id": "my_template",
            "params" : {
                "query_string" : "all about search"
            }
        }
    }
}
Name of the query template stored in the .scripts index.
There is also a dedicated template endpoint which allows you to template an entire search request. Please see Search Template for more details.
As a general rule, filters should be used instead of queries:
Some filters already produce a result that is easily cacheable, and the difference between caching and not caching them is the act of placing the result in the cache or not. These filters, which include the term , terms , prefix , and range filters, are by default cached and are recommended to use (compared to the equivalent query version) when the same filter (same parameters) will be used across multiple different queries (for example, a range filter with age higher than 10).
Other filters, usually already working with the field data loaded into memory, are not cached by default. Those filters are already very fast, and the process of caching them requires extra processing in order to allow the filter result to be used with different queries than the one executed. These filters, including the geo, and script filters are not cached by default.
The last type of filters are those working with other filters. The and , not and or filters are not cached as they basically just manipulate the internal filters.
All filters allow to set
_cache
element on them to explicitly control
caching. They also allow to set
_cache_key
which will be used as the
caching key for that filter. This can be handy when using very large
filters (like a terms filter with many elements in it).
{ "filtered" : { "query" : { "term" : { "name.first" : "shay" } "filter" : { "and" : { "filters": [ "range" : { "postDate" : { "from" : "2010-03-01", "to" : "2010-04-01" "prefix" : { "name.second" : "ba" } "_cache" : true Exists Filter
A filter that matches documents matching boolean combinations of other queries. Similar in concept to Boolean query , except that the clauses are other filters. Can be placed within queries that accept a filter.
{ "filtered" : { "query" : { "query_string" : { "default_field" : "message", "query" : "elasticsearch" "filter" : { "bool" : { "must" : { "term" : { "tag" : "wow" } "must_not" : { "range" : { "age" : { "gte" : 10, "lt" : 20 } "should" : [ "term" : { "tag" : "sometag" } "term" : { "tag" : "sometagtag" } }
Returns documents that have at least one non-
null
value in the original field:
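A minimal sketch of such a filter, assuming the user field used throughout this section:
{
    "constant_score" : {
        "filter" : {
            "exists" : { "field" : "user" }
        }
    }
}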
For instance, these documents would all match the above filter:
An empty string is a non-
|
These documents would not match the above filter:
This field has no values.
At least one non-
|
null_value
mapping
If the field mapping includes the
null_value
setting (see
Core Types
)
then explicit
null
values are replaced with the specified
null_value
. For
instance, if the
user
field were mapped as follows:
"user": { "type": "string", "null_value": "_null_" }
then explicit
null
values would be indexed as the string
_null_
, and the
following docs would match the
exists
filter:
{ "user": null } { "user": [null] }
However, these docs—without explicit
null
values—would still have
no values in the
user
field and thus would not match the
exists
filter:
{ "user": [] } { "foo": "bar" }
Then the following simple query can be executed with a
geo_bounding_box
filter:
Format in
[lon, lat]
, note, the order of lon/lat here in order to
conform with
GeoJSON
.
{ "filtered" : { "query" : { "match_all" : {} "filter" : { "geo_bounding_box" : { "pin.location" : { "top_left" : [-74.1, 40.73], "bottom_right" : [-71.12, 40.01] }
The filter
requires
the
geo_point
type to be set on the relevant
field.
Then the following simple query can be executed with a
geo_distance
filter:
Format in
[lon, lat]
, note, the order of lon/lat here in order to
conform with
GeoJSON
.
{ "filtered" : { "query" : { "match_all" : {} "filter" : { "geo_distance" : { "distance" : "12km", "pin.location" : [-70, 40] }
The following are options allowed on the filter:
distance
The radius of the circle centred on the specified location. Points which
fall into this circle are considered to be matches. The
distance
can be
specified in various units. See
the section called “Distance Units
”.
distance_type
How to compute the distance. Can either be
sloppy_arc
(default),
arc
(slightly more precise but significantly slower) or
plane
(faster, but inaccurate on long distances and close to the poles).
optimize_bbox
Whether to use the optimization of first running a bounding box check
before the distance check. Defaults to
memory
which will do in memory
checks. Can also have values of
indexed
to use indexed value check (make
sure the
geo_point
type index lat lon in this case), or
none
which
disables bounding box optimization.
|
The filter
requires
the
geo_point
type to be set on the relevant
field.
Filters documents that exist within a range from a specific point:
Supports the same point location parameter as the geo_distance filter. And also support the common parameters for range (lt, lte, gt, gte, from, to, include_upper and include_lower).
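A rough sketch of such a filter (the distances and location are illustrative):
{
    "filtered" : {
        "query" : {
            "match_all" : {}
        },
        "filter" : {
            "geo_distance_range" : {
                "from" : "200km",
                "to" : "400km",
                "pin.location" : {
                    "lat" : 40,
                    "lon" : -70
                }
            }
        }
    }
}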
A filter allowing to include hits that only fall within a polygon of points. Here is an example:
Format in
[lon, lat]
, note, the order of lon/lat here in order to
conform with
GeoJSON
.
{ "filtered" : { "query" : { "match_all" : {} "filter" : { "geo_polygon" : { "person.location" : { "points" : [ [-70, 40], [-80, 30], [-90, 20] }
The filter requires the geo_point type to be set on the relevant field.
Filter documents indexed using the
geo_shape
type.
Requires the geo_shape Mapping .
You may also use the geo_shape Query .
The
geo_shape
Filter uses the same grid square representation as the
geo_shape mapping to find documents that have a shape that intersects
with the query shape. It will also use the same PrefixTree configuration
as defined for the field mapping.
Similar to the
geo_shape
type, the
geo_shape
Filter uses
GeoJSON
to represent shapes.
Given a document that looks like this:
{ "name": "Wind & Wetter, Berlin, Germany", "location": { "type": "Point", "coordinates": [13.400544, 52.530286] }
The following query will find the point using the Elasticsearch’s
envelope
GeoJSON extension:
{ "query":{ "filtered": { "query": { "match_all": {} "filter": { "geo_shape": { "location": { "shape": { "type": "envelope", "coordinates" : [[13.0, 53.0], [14.0, 52.0]] }
The following is an example of using the Filter with a pre-indexed shape:
The
geohash_cell
filter provides access to a hierarchy of geohashes.
By defining a geohash cell, only
geopoints
within this cell will match this filter.
To get this filter to work, all prefixes of a geohash need to be indexed. For example, a geohash u30 needs to be decomposed into three terms: u30, u3 and u. This decomposition must be enabled in the mapping of the geopoint field that's going to be filtered by setting the geohash_prefix option:
{ "mappings" : { "location": { "properties": { "pin": { "type": "geo_point", "geohash": true, "geohash_prefix": true, "geohash_precision": 10 }
The geohash cell can be defined by all formats of geo_points. If such a cell is defined by a latitude and longitude pair, the size of the cell needs to be set up. This can be done by the precision parameter of the filter. This parameter can be set to an integer value which sets the length of the geohash prefix. Instead of setting a geohash length directly it is also possible to define the precision as a distance, for example "precision": "50m". (See the section called “Distance Units”.)
The
neighbor
option of the filter offers the possibility to filter cells
next to the given cell.
{ "filtered" : { "query" : { "match_all" : {} "filter" : { "geohash_cell": { "pin": { "lat": 13.4080, "lon": 52.5186 "precision": 3, "neighbors": true }
The
has_child
filter also accepts a filter instead of a query:
{ "has_child" : { "type" : "comment", "min_children": 2,"max_children": 10,
"filter" : { "term" : { "user" : "john" Both
min_children
andmax_children
are optional.
In order to support parent-child joins, all of the (string) parent IDs must be resident in memory (in the field data cache . Additionally, every child document is mapped to its parent using a long value (approximately). It is advisable to keep the string parent ID short in order to reduce memory usage.
You can check how much memory is being used by the ID cache using the indices stats or nodes stats APIS, eg:
curl -XGET "http://localhost:9200/_stats/id_cache?pretty&human"
The
parent_type
field name can also be abbreviated to
type
.
The
has_parent
filter also accepts a filter instead of a query:
In order to support parent-child joins, all of the (string) parent IDs must be resident in memory (in the field data cache . Additionally, every child document is mapped to its parent using a long value (approximately). It is advisable to keep the string parent ID short in order to reduce memory usage.
You can check how much memory is being used by the ID cache using the indices stats or nodes stats APIS, eg:
curl -XGET "http://localhost:9200/_stats/id_cache?pretty&human"
Filters documents that only have the provided ids. Note, this filter does not require the _id field to be indexed since it works using the _uid field.
{ "ids" : { "type" : "my_type", "values" : ["1", "4", "100"] }
The
type
is optional and can be omitted, and can also accept an array
of values.
A limit filter limits the number of documents (per shard) to execute on. For example:
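A minimal sketch (the wrapped query is illustrative):
{
    "filtered" : {
        "query" : {
            "term" : { "name.first" : "shay" }
        },
        "filter" : {
            "limit" : { "value" : 100 }
        }
    }
}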
A filter that matches on all documents:
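A minimal sketch:
{
    "constant_score" : {
        "filter" : {
            "match_all" : { }
        }
    }
}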
Returns documents that have only
null
values or no value in the original field:
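A minimal sketch, again assuming the user field:
{
    "constant_score" : {
        "filter" : {
            "missing" : { "field" : "user" }
        }
    }
}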
For instance, the following docs would match the above filter:
These documents would not match the above filter:
An empty string is a non-
|
null_value
mapping
If the field mapping includes a
null_value
(see
Core Types
) then explicit
null
values
are replaced with the specified
null_value
. For instance, if the
user
field were mapped
as follows:
"user": { "type": "string", "null_value": "_null_" }
then explicit
null
values would be indexed as the string
_null_
, and the
the following docs would
not
match the
missing
filter:
{ "user": null } { "user": [null] }
However, these docs—without explicit
null
values—would still have
no values in the
user
field and thus would match the
missing
filter:
{ "user": [] } { "foo": "bar" }
existence
and
null_value
parameters
{ "constant_score" : { "filter" : { "missing" : { "field" : "user", "existence" : true, "null_value" : falseexistence
When the
existence
parameter is set totrue
(the default), the missing filter will include documents where the field has no values, ie:{ "user": [] } { "foo": "bar" }When set to
false
, these documents will not be included.null_value
When the null_value
parameter is set to true
, the missing
filter will include documents where the field contains a null
value, ie:
When set to false
(the default), these documents will not be included.
A
nested
filter works in a similar fashion to the
nested
query, except it’s
used as a filter. It follows exactly the same structure, but also allows
to cache the results (set
_cache
to
true
), and have it named (set
the
_name
value). For example:
{ "filtered" : { "query" : { "match_all" : {} }, "filter" : { "nested" : { "path" : "obj1", "filter" : { "bool" : { "must" : [ "term" : {"obj1.name" : "blue"} "range" : {"obj1.count" : {"gt" : 5}} "_cache" : true }
{ "query" : { "nested" : { "path" : "offers", "query" : { "match" : { "offers.color" : "blue" "facets" : { "size" : { "terms" : { "field" : "offers.size" "facet_filter" : { "nested" : { "path" : "offers", "query" : { "match" : { "offers.color" : "blue" "join" : false "nested" : "offers" Or Filter
Or, in a longer form with a
filter
element:
{ "filtered" : { "query" : { "term" : { "name.first" : "shay" } "filter" : { "not" : { "filter" : { "range" : { "postDate" : { "from" : "2010-03-01", "to" : "2010-04-01" "_cache" : true Prefix Filter
{ "filtered" : { "query" : { "term" : { "name.first" : "shay" } "filter" : { "or" : { "filters" : [ "term" : { "name.second" : "banon" } "term" : { "name.nick" : "kimchy" } "_cache" : true Query Filter
Wraps any query to be used as a filter. Can be placed within queries that accept a filter.
Setting the
_cache
element requires a different format for the
query
:
{ "constantScore" : { "filter" : { "fquery" : { "query" : { "query_string" : { "query" : "this AND that OR thus" "_cache" : true Regexp Filter
Filters documents with fields that have terms within a certain range. Similar to range query , except that it acts as a filter. Can be placed within queries that accept a filter.
{ "constant_score" : { "filter" : { "range" : { "age" : { "gte": 10, "lte": 20 }
The
range
filter accepts the following parameters:
gte
Greater-than or equal to
gt
Greater-than
lte
Less-than or equal to
lt
Less-than
In the above example,
gte
will be actually moved to
2011-12-31T23:00:00
UTC date.
index
Uses the field’s inverted index in order to determine whether documents fall within the specified range.
fielddata
Uses fielddata in order to determine whether documents fall within the specified range.
|
The
regexp
filter is similar to the
regexp
query, except
that it is cacheable and can speedup performance in case you are reusing
this filter in your queries.
See Regular expression syntax for details of the supported regular expression language.
{ "filtered": { "query": { "match_all": {} "filter": { "regexp":{ "name.first" : "s.*y" }
You can also select the cache name and use the same regexp flags in the filter as in the query.
Regular expressions are dangerous because it’s easy to accidentally
create an innocuous looking one that requires an exponential number of
internal determinized automaton states (and corresponding RAM and CPU)
for Lucene to execute. Lucene prevents these using the
max_determinized_states
setting (defaults to 10000). You can raise
this limit to allow more complex regular expressions to execute.
You have to enable caching explicitly in order to have the
regexp
filter cached.
{ "filtered": { "query": { "match_all": {} "filter": { "regexp":{ "name.first" : { "value" : "s.*y", "flags" : "INTERSECTION|COMPLEMENT|EMPTY", "max_determinized_states": 20000 "_name":"test", "_cache" : true, "_cache_key" : "key" }
Filters documents that have fields that contain a term ( not analyzed ). Similar to term query , except that it acts as a filter. Can be placed within queries that accept a filter, for example:
{ "constant_score" : { "filter" : { "term" : { "user" : "kimchy"} }
The
terms
filter is also aliased with
in
as the filter name for
simpler usage.
The
execution
option now has the following options :
plain
The default. Works as today. Iterates over all the terms,
building a bit set matching it, and filtering. The total filter is
cached.
fielddata
Generates a terms filters that uses the fielddata cache to
compare terms. This execution mode is great to use when filtering
on a field that is already loaded into the fielddata cache from
faceting, sorting, or index warmers. When filtering on
a large number of terms, this execution can be considerably faster
than the other modes. The total filter is not cached unless
explicitly configured to do so.
Generates a term filter (which is cached) for each term, and
wraps those in a bool filter. The bool filter itself is not cached as it
can operate very quickly on the cached term filters.
Generates a term filter (which is cached) for each term, and
wraps those in an and filter. The and filter itself is not cached.
Generates a term filter (which is cached) for each term, and
wraps those in an or filter. The or filter itself is not cached.
Generally, the
bool
execution mode should be preferred.
|
The terms lookup mechanism supports the following options:
index
The index to fetch the term values from. Defaults to the
current index.
The type to fetch the term values from.
The id of the document to fetch the term values from.
The field specified as path to fetch the actual values for the
terms
filter.
routing
A custom routing value to be used when retrieving the
external terms doc.
cache
Whether to cache the filter built from the retrieved document
(
true
- default) or whether to fetch and rebuild the filter on every
request (
false
). See "
Terms lookup caching
" below
|
The structure of the external terms document can also include array of inner objects, for example:
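A rough sketch (index, type and field names are illustrative), where the values are fetched from an inner field via a dotted path:
curl -XPUT localhost:9200/users/user/2 -d '{
    "followers" : [
        { "id" : "1" },
        { "id" : "2" }
    ]
}'

curl -XGET localhost:9200/tweets/_search -d '{
    "query" : {
        "filtered" : {
            "filter" : {
                "terms" : {
                    "user" : {
                        "index" : "users",
                        "type" : "user",
                        "id" : "2",
                        "path" : "followers.id"
                    }
                }
            }
        }
    }
}'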
Filters documents matching the provided document / mapping type. Note,
this filter can work even when the
_type
field is not indexed (using
the
_uid
field).
{ "type" : { "value" : "my_type"
To create a mapping, you will need the Put Mapping API , or you can add multiple mappings when you create an index .
The
_type
field can be stored as well, for example:
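Roughly along these lines:
{
    "tweet" : {
        "_type" : { "store" : true }
    }
}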
The
_source
field is an automatically generated field that stores the actual
JSON that was used as the indexed document. It is not indexed (searchable),
just stored. When executing "fetch" requests, like
get
or
search
, the
_source
field is returned by default.
update
API
.
On the fly
highlighting
.
The ability to reindex from one Elasticsearch index to another, either
to change mappings or analysis, or to upgrade an index to a new major
version.
The ability to debug queries or aggregations by viewing the original
document used at index time.
Potentially in the future, the ability to repair index corruption
automatically.
The metrics use case
The metrics use case is distinct from other time-based or logging use cases in that there are many small documents which consist only of numbers, dates, or keywords. There are no updates, no highlighting requests, and the data ages quickly so there is no need to reindex. Search requests typically use simple queries to filter the dataset by date or tags, and the results are returned as aggregations.
In this case, disabling the
_source
field will save space and reduce I/O.
It is also advisable to disable the
_all
field
in the
metrics case.
PUT logs
{
  "mappings": {
    "event": {
      "_source": {
        "includes": [
          "*.count",
          "meta.*"
        ],
        "excludes": [
          "meta.description",
          "meta.other.*"
        ]
      }
    }
  }
}

PUT logs/event/1
{
  "requests": {
    "count": 10,
    "foo": "bar"
  },
  "meta": {
    "name": "Some metric",
    "description": "Some metric description",
    "other": {
      "foo": "one",
      "baz": "two"
    }
  }
}

GET logs/event/_search
{
  "query": {
    "match": {
      "meta.other.foo": "one"
    }
  }
}
These fields will be removed from the stored
_source
field. We can still search on this field, even though it is not in the stored_source
.
For any field to allow
highlighting
it has
to be either stored or part of the
_source
field. By default the
_all
field does not qualify for either, so highlighting for it does not yield
any data.
Although it is possible to
store
the
_all
field, it is basically an
aggregation of all fields, which means more data will be stored, and
highlighting it might produce strange results.
Boosting is the process of enhancing the relevancy of a document or field. Field level mapping allows to define an explicit boost level on a specific field. The boost field mapping (applied on the root object ) allows to define a boost field mapping where its content will control the boost level of the document . For example, consider the following mapping:
{ "tweet" : { "_boost" : {"name" : "my_boost", "null_value" : 1.0} }
The above mapping defines a mapping for a field named
my_boost
. If the
my_boost
field exists within the JSON document indexed, its value will
control the boost level of the document indexed. For example, the
following JSON document will be indexed with a boost value of
2.2
:
{ "my_boost" : 2.2, "message" : "This is a tweet!" }
Instead, the Function Score Query can be used to achieve the desired functionality by boosting each document by the value in any field of the document:
{ "query": { "function_score": { "query": {"match": { "title": "your main query" "functions": [{ "field_value_factor": {
"field": "my_boost_field" "score_mode": "multiply" The original query, now wrapped in a
function_score
query. This function returns the value inmy_boost_field
, which is then multiplied by the query_score
for each document.
_field_names
_routing
Will cause the following doc to be routed based on the
111222
value:
In order to also store it, use:
The
_timestamp
field allows to automatically index the timestamp of a
document. It can be provided externally via the index request or in the
_source
. If it is not provided externally it will be automatically set
to a
default date
.
By default it is disabled. In order to enable it, the following mapping should be defined:
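A minimal sketch of such a mapping:
{
    "tweet" : {
        "_timestamp" : { "enabled" : true }
    }
}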
Will cause
2009-11-15T14:12:12
to be used as the timestamp value for:
You can define the date format used to parse the provided timestamp value. For example:
{ "tweet" : { "_timestamp" : { "enabled" : true, "path" : "post_date", "format" : "YYYY-MM-dd" }
Note, the default format is
dateOptionalTime
. The timestamp value will
first be parsed as a number and if it fails the format will be tried.
You can also set the default value to any date respecting timestamp format :
{ "tweet" : { "_timestamp" : { "enabled" : true, "format" : "YYYY-MM-dd", "default" : "1970-01-01" }
If you don’t provide any timestamp value, _timestamp will be set to this default value.
_ttl
accepts two parameters which are described below, every other setting will be silently ignored.
By default it is disabled, in order to enable it, the following mapping should be defined:
_ttl
can only be enabled once and never be disabled again.
You can provide a per index/type default
_ttl
value as follows:
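For example, a sketch with a default of one day:
{
    "tweet" : {
        "_ttl" : { "enabled" : true, "default" : "1d" }
    }
}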
The following sample tweet JSON document will be used to explain the core types:
Explicit mapping for the above JSON tweet can be:
The following table lists all the attributes that can be used with the
string
type:
Attribute | Description |
---|---|
|
[
1.5.0
]
Deprecated in 1.5.0.
Use
|
|
Set to
|
|
Set to
|
|
Set to
|
|
Possible values are
|
|
The boost value. Defaults to
|
|
When there is a (JSON) null value for the field, use the
|
|
Boolean value if norms should be enabled or
not. Defaults to
|
|
Describes how norms should be loaded, possible values are
|
|
Allows to set the indexing
options, possible values are
|
|
The analyzer used to analyze the text contents when
|
|
The analyzer used to analyze the text contents when
|
|
The analyzer used to analyze the field when part of a query string. Can be updated on an existing field. |
|
Should the field be included in the
|
|
The analyzer will ignore strings larger than this size.
Useful for generic
This option is also useful for protecting against Lucene’s term byte-length
limit of
|
|
Position increment gap between field instances with the same field name. Defaults to 0. |
In case you would like to disable norms after the fact, it is possible to do so by using the PUT mapping API , like this:
PUT my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "norms": {
        "enabled": false
      }
    }
  }
}
Please however note that norms won’t be removed instantly, but will be removed as old segments are merged into new segments as you continue indexing new documents. Any score computation on a field that has had norms removed might return inconsistent results since some documents won’t have norms anymore while other documents might still have norms.
A number based type supporting
float
,
double
,
byte
,
short
,
integer
, and
long
. It uses specific constructs within Lucene in
order to support numeric values. The number types have the same ranges
as corresponding
types
. An example mapping can be:
{ "tweet" : { "properties" : { "rank" : { "type" : "float", "null_value" : 1.0 }
The following table lists all the attributes that can be used with a numbered type:
Attribute | Description |
---|---|
|
The type of the number. Can be
|
|
[
1.5.0
]
Deprecated in 1.5.0.
Use
|
|
Set to
|
|
Set to
|
|
Set to
|
|
The precision step (influences the number of terms
generated for each number value). Defaults to
|
|
The boost value. Defaults to
|
|
When there is a (JSON) null value for the field, use the
|
|
Should the field be included in the
|
|
Ignored a malformed number. Defaults to
|
|
Try convert strings to numbers and truncate fractions for integers. Defaults to
|
The following table lists all the attributes that can be used with a date type:
Attribute | Description |
---|---|
|
[
1.5.0
]
Deprecated in 1.5.0.
Use
|
|
The
date format
. Defaults to
|
|
Set to
|
|
Set to
|
|
Set to
|
|
The precision step (influences the number of terms
generated for each number value). Defaults to
|
|
The boost value. Defaults to
|
|
When there is a (JSON) null value for the field, use the
|
|
Should the field be included in the
|
|
Ignored a malformed number. Defaults to
|
|
The unit to use when passed in a numeric values. Possible
values include
|
The following table lists all the attributes that can be used with the boolean type:
Attribute | Description |
---|---|
|
[
1.5.0
]
Deprecated in 1.5.0.
Use
|
|
Set to
|
|
Set to
|
|
The boost value. Defaults to
|
|
When there is a (JSON) null value for the field, use the
|
The following table lists all the attributes that can be used with the binary type:
index_name
Deprecated in 1.5.0.
Use
copy_to
instead
The name of the field that will be stored in the index.
Defaults to the property/field name.
store
Set to
true
to store actual field in the index,
false
to not store it.
Defaults to
false
(note, the JSON document itself is already stored, so
the binary field can be retrieved from there).
doc_values
Set to
true
to store field values in a column-stride fashion.
compress
Set to
true
to compress the stored binary value.
compress_threshold
Compression will only be applied to stored binary fields that are greater
than this size. Defaults to
-1
|
It is possible to control which field values are loaded into memory, which is particularly useful for aggregating on string fields, using fielddata filters, which are explained in detail in the Fielddata section.
Fielddata filters can exclude terms which do not match a regex, or which
don’t fall between a
min
and
max
frequency range:
{
    "tweet" : {
        "type" : "string",
        "analyzer" : "whitespace",
        "fielddata" : {
            "filter" : {
                "regex" : {
                    "pattern" : "^#.*"
                },
                "frequency" : {
                    "min" : 0.001,
                    "max" : 0.1,
                    "min_segment_size" : 500
                }
            }
        }
    }
}
These filters can be updated on an existing field mapping and will take effect the next time the fielddata for a segment is loaded. Use the Clear Cache API to reload the fielddata using the new filters.
You can configure similarities via the similarity module
The following Similarities are configured out-of-box:
Multiple fields are also supported:
{ "tweet" : { "properties": { "first_name": { "type": "string", "index": "analyzed", "path": "just_name", "fields": { "any_name": {"type": "string","index": "analyzed"} "last_name": { "type": "string", "index": "analyzed", "path": "just_name", "fields": { "any_name": {"type": "string","index": "analyzed"} Object Type
JSON documents allow to define an array (list) of fields or objects. Mapping array types could not be simpler since arrays gets automatically detected and mapping them can be done either with Core Types or Object Type mappings. For example, the following JSON defines several arrays:
{ "tweet" : { "message" : "some arrays in this tweet...", "tags" : ["elasticsearch", "wow"], "lists" : [ "name" : "prog_list", "description" : "programming list" "name" : "cool_list", "description" : "cool stuff list" }
The above JSON has the
tags
property defining a list of a simple
string
type, and the
lists
property is an
object
type array. Here
is a sample explicit mapping:
{ "tweet" : { "properties" : { "message" : {"type" : "string"}, "tags" : {"type" : "string"}, "lists" : { "properties" : { "name" : {"type" : "string"}, "description" : {"type" : "string"} }
Array types are supported automatically; the following JSON document is also perfectly fine:
{ "tweet" : { "message" : "some arrays in this tweet...", "tags" : "elasticsearch", "lists" : { "name" : "prog_list", "description" : "programming list" Root Object Type
The above shows an example where a tweet includes the actual
person
details. A
person
is an object, with a
sid
, and a
name
object
which has
first_name
and
last_name
. It’s important to note that
tweet
is also an object, although it is a special
root object type
which allows for additional mapping definitions.
The following is an example of explicit mapping for the above JSON:
{ "tweet" : { "properties" : { "person" : { "type" : "object", "properties" : { "name" : { "type" : "object", "properties" : { "first_name" : {"type" : "string"}, "last_name" : {"type" : "string"} "sid" : {"type" : "string", "index" : "not_analyzed"} "message" : {"type" : "string"} }
In order to mark a mapping of type
object
, set the
type
to object.
This is an optional step, since if there are
properties
defined for
it, it will automatically be identified as an
object
mapping.
An object mapping can optionally define one or more properties using the
properties
tag for a field. Each property can be either another
object
, or one of the
core_types
.
When processing dynamic new fields, their type is automatically derived.
For example, if it is a
number
, it will automatically be treated as
number
core_type
. Dynamic
fields default to their default attributes, for example, they are not
stored and they are always indexed.
Date fields are special since they are represented as a
string
. Date
fields are detected if they can be parsed as a date when they are first
introduced into the system. The set of date formats that are tested
against can be configured using the
dynamic_date_formats
on the root object,
which is explained later.
Note, once a field has been added, its type can not change . For example, if we added age and its value is a number, then it can’t be treated as a string.
The
dynamic
parameter can also be set to
strict
, meaning that not
only will new fields not be introduced into the mapping, but also that parsing
(indexing) docs with such new fields will fail.
In the above,
name
and its content will not be indexed at all.
In the
core_types
section, a field can have a
index_name
associated with it in order to
control the name of the field that will be stored within the index. When
that field exists within an object(s) that are not the root object, the
name of the field of the index can either include the full "path" to the
field with its
index_name
, or just the
index_name
. For example
(under mapping of
type
person
, removed the tweet type for clarity):
{ "person" : { "properties" : { "name1" : { "type" : "object", "path" : "just_name", "properties" : { "first1" : {"type" : "string"}, "last1" : {"type" : "string", "index_name" : "i_last_1"} "name2" : { "type" : "object", "path" : "full", "properties" : { "first2" : {"type" : "string"}, "last2" : {"type" : "string", "index_name" : "i_last_2"} }
In the above example, the
name1
and
name2
objects within the
person
object have different combination of
path
and
index_name
.
The document fields that will be stored in the index as a result of that
are:
JSON Name | Document Field Name |
---|---|
name1 / first1 | first1 |
name1 / last1 | i_last_1 |
name2 / first2 | name2.first2 |
name2 / last2 | name2.i_last_2 |
Note, when querying or using a field name in any of the APIs provided
(search, query, selective loading, …), there is an automatic detection
from logical full path and into the
index_name
and vice versa. For
example, even though
name1
/
last1
defines that it is stored with
just_name
and a different
index_name
, it can either be referred to
using
name1.last1
(logical name), or its actual indexed name of
i_last_1
.
Moreover, where applicable (for example, in queries), the full path
including the type can be used, such as person.name.last1. In this
case, the actual indexed name will be resolved to match against the
index, and an automatic query filter will be added to only match
person types.
The root object mapping is an object type mapping that maps the root object (the type itself). It supports all of the different mappings that can be set using the object type mapping .
The root object mapping allows you to index a JSON document that only contains its
fields. For example, the following
tweet
JSON can be indexed without
specifying the
tweet
type in the document itself:
{ "message" : "This is a tweet!" }
In the above mapping, if a new JSON field of type string is detected,
the date formats specified will be used in order to check if it is a date.
If it passes parsing, then the field will be declared with
date
type,
and will use the matching format as its format attribute. The date
format itself is explained
here
.
The default formats are:
dateOptionalTime
(ISO) and
yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z
.
Note:
dynamic_date_formats
are used
only
for dynamically added
date fields, not for
date
fields that you specify in your mapping.
The
nested
type works like the
object
type
except
that an array of
objects
is flattened, while an array of
nested
objects
allows each object to be queried independently. To explain, consider this
document:
{ "group" : "fans", "user" : [ "first" : "John", "last" : "Smith" "first" : "Alice", "last" : "White" }
If the
user
field is of type
object
, this document would be indexed
internally something like this:
{ "group" : "fans", "user.first" : [ "alice", "john" ], "user.last" : [ "smith", "white" ] }
The
first
and
last
fields are flattened, and the association between
alice
and
white
is lost. This document would incorrectly match a query
for
alice AND smith
.
If the
user
field is of type
nested
, each object is indexed as a separate
document, something like this:
Searching on nested docs can be done using either the nested query or nested filter .
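For example, a minimal sketch of a nested query against the user field of the document above (the query values are illustrative):
{
  "query" : {
    "nested" : {
      "path" : "user",
      "query" : {
        "bool" : {
          "must" : [
            { "match" : { "user.first" : "alice" } },
            { "match" : { "user.last" : "white" } }
          ]
        }
      }
    }
  }
}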
The mapping for
nested
fields is the same as
object
fields, except that it
uses type
nested
:
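A minimal sketch of such a mapping for the user field above (the type1 type name is illustrative):
{
  "type1" : {
    "properties" : {
      "user" : {
        "type" : "nested"
      }
    }
  }
}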
The result of indexing our example document would be something like this:
{"user.first" : "alice", "user.last" : "white" "user.first" : "john", "user.last" : "smith" "group" : "fans", "user.first" : [ "alice", "john" ], "user.last" : [ "smith", "white" ] Hidden nested documents. Visible “parent” document.
The
include_in_parent
and
include_in_root
options do not apply
to
geo_shape
fields
, which are only ever
indexed inside the nested document.
Nested docs will automatically use the root doc
_all
field only.
Internal Implementation
Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.
Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.
Because nested docs are always masked to the parent doc, the nested docs
can never be accessed outside the scope of the
nested
query. For example
stored fields can be enabled on fields inside nested objects, but there is
no way of retrieving them, since stored fields are fetched outside of
the
nested
query scope.
The
_source
field is always associated with the parent document and
because of that, field values for nested objects can be fetched via the source.
The following table lists all the attributes that can be used with an ip type:
Attribute | Description |
---|---|
|
[
1.5.0
]
Deprecated in 1.5.0.
Use
|
|
Set to
|
|
Set to
|
|
The precision step (influences the number of terms
generated for each number value). Defaults to
|
|
The boost value. Defaults to
|
|
When there is a (JSON) null value for the field, use the
|
|
Should the field be included in the
|
|
Set to
|
The geo_point mapper type supports geo-based points. The
declaration looks as follows:
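A minimal sketch of such a declaration (the pin type and location field names match the examples below):
{
  "pin" : {
    "properties" : {
      "location" : {
        "type" : "geo_point"
      }
    }
  }
}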
More usefully, set the
geohash_prefix
option to
true
to not only index
the geohash value, but all the enclosing cells as well. For instance, a
geohash of
u30
will be indexed as
[u,u3,u30]
. This option can be used
by the
Geohash Cell Filter
to find geopoints within a
particular cell very efficiently.
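For example, a minimal sketch of a mapping that enables geohash prefixes (the precision value is illustrative):
{
  "pin" : {
    "properties" : {
      "location" : {
        "type" : "geo_point",
        "geohash_prefix" : true,
        "geohash_precision" : "1km"
      }
    }
  }
}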
Format in [lon, lat]. Note the lon/lat order here, which conforms with GeoJSON.
{ "pin" : { "location" : [-71.34, 41.12] }
Option | Description |
---|---|
|
Set to
|
|
Set to
|
|
Sets the geohash precision. It can be set to an absolute geohash length or a distance value (eg 1km, 1m, 1ml) defining the size of the smallest cell. Defaults to an absolute length of 12. |
|
If this option is set to
|
|
Set to
|
|
Set to
|
|
Set to
|
|
Set to
|
|
Set to
|
|
Set to
|
|
The precision step (influences the number of terms
generated for each number value) for
|
Precision | Bytes per point | Size reduction |
---|---|---|
1km | 4 | 75% |
3m | 6 | 62.5% |
1cm | 8 | 50% |
1mm | 10 | 37.5% |
Precision can be changed on a live index by using the update mapping API.
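For example, a minimal sketch of such an update, assuming a twitter index with a tweet type and a location geo_point field, and assuming the compressed fielddata format whose precision values are listed above:
$ curl -XPUT 'http://localhost:9200/twitter/_mapping/tweet' -d '
{
  "tweet" : {
    "properties" : {
      "location" : {
        "type" : "geo_point",
        "fielddata" : {
          "format" : "compressed",
          "precision" : "3m"
        }
      }
    }
  }
}
'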
You can query documents of this type using the geo_shape Filter or the geo_shape Query.
Option | Description | Default |
---|---|---|
|
Name of the PrefixTree implementation to be used:
|
|
|
This parameter may be used instead of
|
|
|
Maximum number of layers to be used by the PrefixTree.
This can be used to control the precision of shape representations and
therefore how many terms are indexed. Defaults to the default value of
the chosen PrefixTree implementation. Since this parameter requires a
certain level of understanding of the underlying implementation, users
may use the
|
|
|
The strategy parameter defines the approach for how to
represent shapes at indexing and search time. It also influences the
capabilities available so it is recommended to let Elasticsearch set
this parameter automatically. There are two strategies available:
|
|
|
Used as a hint to the PrefixTree about how
precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum
supported value. PERFORMANCE NOTE: This value will default to 0 if a
|
|
|
Optionally define how to interpret vertex order for
polygons / multipolygons. This parameter defines one of two coordinate
system rules (Right-hand or Left-hand) each of which can be specified in three
different ways. 1. Right-hand rule:
|
|
Multiple PrefixTree implementations are provided:
The following Strategy implementations (with corresponding capabilities) are provided:
Strategy | Supported Shapes | Supported Queries | Multiple Shapes |
---|---|---|---|
|
|
Yes |
|
|
|
Yes |
The GeoJSON format is used to represent shapes as input as follows:
GeoJSON Type | Elasticsearch Type | Description |
---|---|---|
|
|
A single geographic coordinate. |
|
|
An arbitrary line given two or more points. |
|
|
A
closed
polygon whose first and last point
must match, thus requiring
|
|
|
An array of unconnected, but likely related points. |
|
|
An array of separate linestrings. |
|
|
An array of separate polygons. |
|
|
A GeoJSON shape similar to the
|
|
|
A bounding rectangle, or envelope, specified by specifying only the top left and bottom right points. |
|
|
A circle specified by a center point and radius with
units, which default to
|
For all types, both the inner
type
and
coordinates
fields are
required.
In GeoJSON, and therefore Elasticsearch, the correct coordinate order is longitude, latitude (X, Y) within coordinate arrays. This differs from many Geospatial APIs (e.g., Google Maps) that generally use the colloquial latitude, longitude (Y, X).
A point is a single geographic coordinate, such as the location of a building or the current position given by a smartphone’s Geolocation API.
{ "location" : { "type" : "point", "coordinates" : [-77.03653, 38.897676] }
A
linestring
defined by an array of two or more positions. By
specifying only two points, the
linestring
will represent a straight
line. Specifying more than two points creates an arbitrary path.
{ "location" : { "type" : "linestring", "coordinates" : [[-77.03653, 38.897676], [-77.009051, 38.889939]] }
The above
linestring
would draw a straight line starting at the White
House to the US Capitol Building.
A polygon is defined by a list of lists of points. The first and last points in each (outer) list must be the same (the polygon must be closed).
{ "location" : { "type" : "polygon", "coordinates" : [ [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ] }
The first array represents the outer boundary of the polygon, the other arrays represent the interior shapes ("holes"):
{ "location" : { "type" : "polygon", "coordinates" : [ [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ], [ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2] ] }
IMPORTANT NOTE: GeoJSON does not mandate a specific order for vertices thus ambiguous polygons around the dateline and poles are possible. To alleviate ambiguity the Open Geospatial Consortium (OGC) Simple Feature Access specification defines the following vertex ordering:
For polygons that do not cross the dateline, vertex order will not matter in Elasticsearch. For polygons that do cross the dateline, Elasticsearch requires vertex ordering to comply with the OGC specification. Otherwise, an unintended polygon may be created and unexpected query/filter results will be returned.
The following provides an example of an ambiguous polygon. Elasticsearch will apply OGC standards to eliminate ambiguity resulting in a polygon that crosses the dateline.
{ "location" : { "type" : "polygon", "coordinates" : [ [ [-177.0, 10.0], [176.0, 15.0], [172.0, 0.0], [176.0, -15.0], [-177.0, -10.0], [-177.0, 10.0] ], [ [178.2, 8.2], [-178.8, 8.2], [-180.8, -8.8], [178.2, 8.8] ] }
An
orientation
parameter can be defined when setting the geo_shape mapping (see
the section called “Mapping Options
”). This will define vertex
order for the coordinate list on the mapped geo_shape field. It can also be overridden on each document. The following is an example for
overriding the orientation on a document:
{ "location" : { "type" : "polygon", "orientation" : "clockwise", "coordinates" : [ [ [-177.0, 10.0], [176.0, 15.0], [172.0, 0.0], [176.0, -15.0], [-177.0, -10.0], [-177.0, 10.0] ], [ [178.2, 8.2], [-178.8, 8.2], [-180.8, -8.8], [178.2, 8.8] ] }
A list of geojson points.
{ "location" : { "type" : "multipoint", "coordinates" : [ [102.0, 2.0], [103.0, 2.0] }
A list of geojson linestrings.
{ "location" : { "type" : "multilinestring", "coordinates" : [ [ [102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0] ], [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0] ], [ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8] ] }
A list of geojson polygons.
{ "location" : { "type" : "multipolygon", "coordinates" : [ [ [[102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0], [102.0, 2.0]] ], [ [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]], [[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]] ] }
A collection of geojson geometry objects.
{ "location" : { "type": "geometrycollection", "geometries": [ "type": "point", "coordinates": [100.0, 0.0] "type": "linestring", "coordinates": [ [101.0, 0.0], [102.0, 1.0] ] }
Elasticsearch supports a
circle
type, which consists of a center
point with a radius:
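A minimal sketch (the center point and radius are illustrative):
{
  "location" : {
    "type" : "circle",
    "coordinates" : [-45.0, 45.0],
    "radius" : "100m"
  }
}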
The
attachment
type is provided as a
plugin
extension
. It uses
Apache Tika
behind the scenes.
See README file for details.
The parsing of dates uses Joda . The default date parsing used if no format is specified is ISODateTimeFormat.dateOptionalTimeParser .
An extension to the format allows defining several formats using the || separator. This lets you define less strict formats; for example, the yyyy/MM/dd HH:mm:ss||yyyy/MM/dd format will parse both yyyy/MM/dd HH:mm:ss and yyyy/MM/dd. The first format also acts as the one that converts back from milliseconds to a string representation.
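For example, a minimal sketch of a mapping that uses such a multi-format date field (the type and field names are illustrative):
{
  "tweet" : {
    "properties" : {
      "postdate" : {
        "type" : "date",
        "format" : "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd"
      }
    }
  }
}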
Here are some samples:
now+1h
,
now+1h+1m
,
now+1h/d
,
2012-01-01||+1M/d
.
To change this behavior, set
"mapping.date.round_ceil": false
.
The following table lists all the default ISO formats supported:
Name | Description |
---|---|
|
A basic formatter for a full date as four digit year, two digit month of year, and two digit day of month (yyyyMMdd). |
|
A basic formatter that combines a basic date and time, separated by a T (yyyyMMdd’T'HHmmss.SSSZ). |
|
A basic formatter that combines a basic date and time without millis, separated by a T (yyyyMMdd’T'HHmmssZ). |
|
A formatter for a full ordinal date, using a four digit year and three digit dayOfYear (yyyyDDD). |
|
A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear (yyyyDDD’T'HHmmss.SSSZ). |
|
A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear (yyyyDDD’T'HHmmssZ). |
|
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset (HHmmss.SSSZ). |
|
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset (HHmmssZ). |
|
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone off set prefixed by T ('T’HHmmss.SSSZ). |
|
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by T ('T’HHmmssZ). |
|
A basic formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week (xxxx’W'wwe). |
|
A basic formatter that combines a basic weekyear date and time, separated by a T (xxxx’W'wwe’T'HHmmss.SSSZ). |
|
A basic formatter that combines a basic weekyear date and time without millis, separated by a T (xxxx’W'wwe’T'HHmmssZ). |
|
A formatter for a full date as four digit year, two digit month of year, and two digit day of month (yyyy-MM-dd). |
|
A formatter that combines a full date and two digit hour of day. |
|
A formatter that combines a full date, two digit hour of day, and two digit minute of hour. |
|
A formatter that combines a full date, two digit hour of day, two digit minute of hour, and two digit second of minute. |
|
A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (yyyy-MM-dd’T'HH:mm:ss.SSS). |
|
A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (yyyy-MM-dd’T'HH:mm:ss.SSS). |
|
a generic ISO datetime parser where the date is mandatory and the time is optional. |
|
A formatter that combines a full date and time, separated by a T (yyyy-MM-dd’T'HH:mm:ss.SSSZZ). |
|
A formatter that combines a full date and time without millis, separated by a T (yyyy-MM-dd’T'HH:mm:ssZZ). |
|
A formatter for a two digit hour of day. |
|
A formatter for a two digit hour of day and two digit minute of hour. |
|
A formatter for a two digit hour of day, two digit minute of hour, and two digit second of minute. |
|
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (HH:mm:ss.SSS). |
|
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second (HH:mm:ss.SSS). |
|
A formatter for a full ordinal date, using a four digit year and three digit dayOfYear (yyyy-DDD). |
|
A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear (yyyy-DDD’T'HH:mm:ss.SSSZZ). |
|
A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear (yyyy-DDD’T'HH:mm:ssZZ). |
|
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset (HH:mm:ss.SSSZZ). |
|
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset (HH:mm:ssZZ). |
|
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset prefixed by T ('T’HH:mm:ss.SSSZZ). |
|
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by T ('T’HH:mm:ssZZ). |
|
A formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week (xxxx-'W’ww-e). |
|
A formatter that combines a full weekyear date and time, separated by a T (xxxx-'W’ww-e’T'HH:mm:ss.SSSZZ). |
|
A formatter that combines a full weekyear date and time without millis, separated by a T (xxxx-'W’ww-e’T'HH:mm:ssZZ). |
|
A formatter for a four digit weekyear. |
|
A formatter for a four digit weekyear and two digit week of weekyear. |
|
A formatter for a four digit weekyear, two digit week of weekyear, and one digit day of week. |
|
A formatter for a four digit year. |
|
A formatter for a four digit year and two digit month of year. |
|
A formatter for a four digit year, two digit month of year, and two digit day of month. |
Allows for a completely customizable date format explained here .
Default mappings allow generic mapping definitions to be automatically applied to types that do not have mappings predefined. This works mainly because the object mapping, and in particular the root object mapping, allow for schema-less dynamic addition of unmapped fields.
The default mapping definition is a plain mapping definition that is embedded within the distribution:
{ "_default_" : { }
Pretty short, isn’t it? Basically, everything is defaulted, especially the
dynamic nature of the root object mapping. The default mapping
definition can be overridden in several manners. The simplest manner is
to simply define a file called
default-mapping.json
and to place it
under the
config
directory (which can be configured to exist in a
different location). It can also be explicitly set using the
index.mapper.default_mapping_location
setting.
The dynamic creation of mappings for unmapped types can be completely
disabled by setting
index.mapper.dynamic
to
false
.
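For example, a minimal sketch of the node-level setting (assuming it is placed in config/elasticsearch.yml):
index.mapper.dynamic: false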
The dynamic creation of fields within a type can be completely
disabled by setting the
dynamic
property of the type to
strict
.
Here is a
Put Mapping
example that
disables dynamic field creation for a
tweet
:
$ curl -XPUT 'http://localhost:9200/twitter/_mapping/tweet' -d '
{
  "tweet" : {
    "dynamic" : "strict",
    "properties" : {
      "message" : { "type" : "string", "store" : true }
    }
  }
}
'
Here is how we can change the default date_formats used in the root and inner object types:
{ "_default_" : { "dynamic_date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"] }
When registering a new percolator query or creating a filtered alias, the index.query.parse.allow_unmapped_fields setting is forcefully overwritten to disallow unmapped fields.
Creating new mappings can be done using the Put Mapping API. When a document is indexed with no mapping associated with it in the specific index, the dynamic / default mapping feature will kick in and automatically create a mapping definition for it.
Mappings can also be provided on the node level, meaning that each index created will automatically be started with all the mappings defined within a certain location.
Mappings can be defined within files called
[mapping_name].json
and be
placed either under
config/mappings/_default
location, or under
config/mappings/[index_name]
(for mappings that should be associated
only with a specific index).
Analyzers are composed of a single
Tokenizer
and zero or more
TokenFilters
. The tokenizer may
be preceded by one or more
CharFilters
. The
analysis module allows one to register
TokenFilters
,
Tokenizers
and
Analyzers
under logical names that can then be referenced either in
mapping definitions or in certain APIs. The Analysis module
automatically registers (
if not explicitly defined
) built in
analyzers, token filters, and tokenizers.
Here is a sample configuration:
index :
  analysis :
    analyzer :
      standard :
        type : standard
        stopwords : [stop1, stop2]
      myAnalyzer1 :
        type : standard
        stopwords : [stop1, stop2, stop3]
        max_token_length : 500
      # configure a custom analyzer which is
      # exactly like the default standard analyzer
      myAnalyzer2 :
        tokenizer : standard
        filter : [standard, lowercase, stop]
    tokenizer :
      myTokenizer1 :
        type : standard
        max_token_length : 900
      myTokenizer2 :
        type : keyword
        buffer_size : 512
    filter :
      myTokenFilter1 :
        type : stop
        stopwords : [stop1, stop2, stop3, stop4]
      myTokenFilter2 :
        type : length
        min : 0
        max : 2000
Analyzers are composed of a single
Tokenizer
and zero or more
TokenFilters
. The tokenizer may
be preceded by one or more
CharFilters
.
The analysis module allows you to register
Analyzers
under logical
names which can then be referenced either in mapping definitions or in
certain APIs.
Elasticsearch comes with a number of prebuilt analyzers which are ready to use. Alternatively, you can combine the built in character filters, tokenizers and token filters to create custom analyzers .
An analyzer of type
standard
is built using the
Standard Tokenizer
with the
Standard Token Filter
,
Lower Case Token Filter
, and
Stop Token Filter
.
The following are settings that can be set for a
standard
analyzer
type:
Setting | Description |
---|---|
|
A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list. Check Stop Analyzer for more details. |
|
The maximum token length. If a token is seen that exceeds
this length then it is split at
|
An analyzer of type
stop
that is built using a
Lower Case Tokenizer
, with
Stop Token Filter
.
The following are settings that can be set for a
stop
analyzer type:
Setting | Description |
---|---|
|
A list of stopwords to initialize the stop filter with. Defaults to the english stop words. |
|
A path (either relative to
|
Use
stopwords: _none_
to explicitly specify an
empty
stopword list.
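For example, a minimal sketch of a stop analyzer with a custom stopword list (the analyzer name and words are illustrative):
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_stop_analyzer" : {
          "type" : "stop",
          "stopwords" : ["the", "and", "or"]
        }
      }
    }
  }
}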
The following are settings that can be set for a
pattern
analyzer
type:
Setting | Description |
---|---|
lowercase | Should terms be lowercased or not. Defaults to true. |
pattern | The regular expression pattern, defaults to \W+. |
flags | The regular expression flags. |
stopwords | A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list. Check Stop Analyzer for more details. |
IMPORTANT : The regular expression should match the token separators , not the tokens themselves.
Flags should be pipe-separated, eg
"CASE_INSENSITIVE|COMMENTS"
. Check
Pattern API
for more details about
flags
options.
In order to try out these examples, you should delete the
test
index
before running each example.
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
The regex above is easier to understand as:
([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)               # or non-number followed by number,
| (?<=\d)(?=\D)               # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])  # or lower case
  (?=\p{Lu})                  #   followed by upper case,
| (?<=\p{Lu})                 # or upper case
  (?=\p{Lu}                   #   followed by upper case
    [\p{L}&&[^\p{Lu}]]        #   then lower case
  )
Snowball Analyzer
A set of analyzers aimed at analyzing specific language text. The following types are supported: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
All analyzers support setting custom
stopwords
either internally in
the config, or by using an external stopwords file by setting
stopwords_path
. Check
Stop Analyzer
for
more details.
The
stem_exclusion
parameter allows you to specify an array
of lowercase words that should not be stemmed. Internally, this
functionality is implemented by adding the
keyword_marker
token filter
with the
keywords
set to the value of the
stem_exclusion
parameter.
The following analyzers support setting a custom stem_exclusion list: arabic, armenian, basque, catalan, bulgarian, catalan, czech, finnish, dutch, english, finnish, french, galician, german, irish, hindi, hungarian, indonesian, italian, latvian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.
The
arabic
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "arabic_stop": { "type": "stop", "stopwords": "_arabic_""arabic_keywords": { "type": "keyword_marker", "keywords": []
"arabic_stemmer": { "type": "stemmer", "language": "arabic" "analyzer": { "arabic": { "tokenizer": "standard", "filter": [ "lowercase", "arabic_stop", "arabic_normalization", "arabic_keywords", "arabic_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
armenian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "armenian_stop": { "type": "stop", "stopwords": "_armenian_""armenian_keywords": { "type": "keyword_marker", "keywords": []
"armenian_stemmer": { "type": "stemmer", "language": "armenian" "analyzer": { "armenian": { "tokenizer": "standard", "filter": [ "lowercase", "armenian_stop", "armenian_keywords", "armenian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
basque
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "basque_stop": { "type": "stop", "stopwords": "_basque_""basque_keywords": { "type": "keyword_marker", "keywords": []
"basque_stemmer": { "type": "stemmer", "language": "basque" "analyzer": { "basque": { "tokenizer": "standard", "filter": [ "lowercase", "basque_stop", "basque_keywords", "basque_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
brazilian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "brazilian_stop": { "type": "stop", "stopwords": "_brazilian_""brazilian_keywords": { "type": "keyword_marker", "keywords": []
"brazilian_stemmer": { "type": "stemmer", "language": "brazilian" "analyzer": { "brazilian": { "tokenizer": "standard", "filter": [ "lowercase", "brazilian_stop", "brazilian_keywords", "brazilian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
bulgarian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "bulgarian_stop": { "type": "stop", "stopwords": "_bulgarian_""bulgarian_keywords": { "type": "keyword_marker", "keywords": []
"bulgarian_stemmer": { "type": "stemmer", "language": "bulgarian" "analyzer": { "bulgarian": { "tokenizer": "standard", "filter": [ "lowercase", "bulgarian_stop", "bulgarian_keywords", "bulgarian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
catalan
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "catalan_elision": { "type": "elision", "articles": [ "d", "l", "m", "n", "s", "t"] "catalan_stop": { "type": "stop", "stopwords": "_catalan_""catalan_keywords": { "type": "keyword_marker", "keywords": []
"catalan_stemmer": { "type": "stemmer", "language": "catalan" "analyzer": { "catalan": { "tokenizer": "standard", "filter": [ "catalan_elision", "lowercase", "catalan_stop", "catalan_keywords", "catalan_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
chinese
analyzer cannot be reimplemented as a
custom
analyzer
because it depends on the ChineseTokenizer and ChineseFilter classes,
which are not exposed in Elasticsearch. These classes are
deprecated in Lucene 4 and the
chinese
analyzer will be replaced
with the
Standard Analyzer
in Lucene 5.
The
cjk
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "english_stop": { "type": "stop", "stopwords": "_english_""analyzer": { "cjk": { "tokenizer": "standard", "filter": [ "cjk_width", "lowercase", "cjk_bigram", "english_stop" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters.
The
czech
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "czech_stop": { "type": "stop", "stopwords": "_czech_""czech_keywords": { "type": "keyword_marker", "keywords": []
"czech_stemmer": { "type": "stemmer", "language": "czech" "analyzer": { "czech": { "tokenizer": "standard", "filter": [ "lowercase", "czech_stop", "czech_keywords", "czech_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
danish
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "danish_stop": { "type": "stop", "stopwords": "_danish_""danish_keywords": { "type": "keyword_marker", "keywords": []
"danish_stemmer": { "type": "stemmer", "language": "danish" "analyzer": { "danish": { "tokenizer": "standard", "filter": [ "lowercase", "danish_stop", "danish_keywords", "danish_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
dutch
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "dutch_stop": { "type": "stop", "stopwords": "_dutch_""dutch_keywords": { "type": "keyword_marker", "keywords": []
"dutch_stemmer": { "type": "stemmer", "language": "dutch" "dutch_override": { "type": "stemmer_override", "rules": [ "fiets=>fiets", "bromfiets=>bromfiets", "ei=>eier", "kind=>kinder" "analyzer": { "dutch": { "tokenizer": "standard", "filter": [ "lowercase", "dutch_stop", "dutch_keywords", "dutch_override", "dutch_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
english
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "english_stop": { "type": "stop", "stopwords": "_english_""english_keywords": { "type": "keyword_marker", "keywords": []
"english_stemmer": { "type": "stemmer", "language": "english" "english_possessive_stemmer": { "type": "stemmer", "language": "possessive_english" "analyzer": { "english": { "tokenizer": "standard", "filter": [ "english_possessive_stemmer", "lowercase", "english_stop", "english_keywords", "english_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
finnish
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "finnish_stop": { "type": "stop", "stopwords": "_finnish_""finnish_keywords": { "type": "keyword_marker", "keywords": []
"finnish_stemmer": { "type": "stemmer", "language": "finnish" "analyzer": { "finnish": { "tokenizer": "standard", "filter": [ "lowercase", "finnish_stop", "finnish_keywords", "finnish_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
french
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "french_elision": { "type": "elision", "articles": [ "l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu" "french_stop": { "type": "stop", "stopwords": "_french_""french_keywords": { "type": "keyword_marker", "keywords": []
"french_stemmer": { "type": "stemmer", "language": "light_french" "analyzer": { "french": { "tokenizer": "standard", "filter": [ "french_elision", "lowercase", "french_stop", "french_keywords", "french_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
galician
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "galician_stop": { "type": "stop", "stopwords": "_galician_""galician_keywords": { "type": "keyword_marker", "keywords": []
"galician_stemmer": { "type": "stemmer", "language": "galician" "analyzer": { "galician": { "tokenizer": "standard", "filter": [ "lowercase", "galician_stop", "galician_keywords", "galician_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
german
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "german_stop": { "type": "stop", "stopwords": "_german_""german_keywords": { "type": "keyword_marker", "keywords": []
"german_stemmer": { "type": "stemmer", "language": "light_german" "analyzer": { "german": { "tokenizer": "standard", "filter": [ "lowercase", "german_stop", "german_keywords", "german_normalization", "german_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
greek
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "greek_stop": { "type": "stop", "stopwords": "_greek_""greek_lowercase": { "type": "lowercase", "language": "greek" "greek_keywords": { "type": "keyword_marker", "keywords": []
"greek_stemmer": { "type": "stemmer", "language": "greek" "analyzer": { "greek": { "tokenizer": "standard", "filter": [ "greek_lowercase", "greek_stop", "greek_keywords", "greek_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
hindi
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "hindi_stop": { "type": "stop", "stopwords": "_hindi_""hindi_keywords": { "type": "keyword_marker", "keywords": []
"hindi_stemmer": { "type": "stemmer", "language": "hindi" "analyzer": { "hindi": { "tokenizer": "standard", "filter": [ "lowercase", "indic_normalization", "hindi_normalization", "hindi_stop", "hindi_keywords", "hindi_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
hungarian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "hungarian_stop": { "type": "stop", "stopwords": "_hungarian_""hungarian_keywords": { "type": "keyword_marker", "keywords": []
"hungarian_stemmer": { "type": "stemmer", "language": "hungarian" "analyzer": { "hungarian": { "tokenizer": "standard", "filter": [ "lowercase", "hungarian_stop", "hungarian_keywords", "hungarian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
indonesian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "indonesian_stop": { "type": "stop", "stopwords": "_indonesian_""indonesian_keywords": { "type": "keyword_marker", "keywords": []
"indonesian_stemmer": { "type": "stemmer", "language": "indonesian" "analyzer": { "indonesian": { "tokenizer": "standard", "filter": [ "lowercase", "indonesian_stop", "indonesian_keywords", "indonesian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
irish
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "irish_elision": { "type": "elision", "articles": [ "h", "n", "t" ] "irish_stop": { "type": "stop", "stopwords": "_irish_""irish_lowercase": { "type": "lowercase", "language": "irish" "irish_keywords": { "type": "keyword_marker", "keywords": []
"irish_stemmer": { "type": "stemmer", "language": "irish" "analyzer": { "irish": { "tokenizer": "standard", "filter": [ "irish_stop", "irish_elision", "irish_lowercase", "irish_keywords", "irish_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
italian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "italian_elision": { "type": "elision", "articles": [ "c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell", "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d" "italian_stop": { "type": "stop", "stopwords": "_italian_""italian_keywords": { "type": "keyword_marker", "keywords": []
"italian_stemmer": { "type": "stemmer", "language": "light_italian" "analyzer": { "italian": { "tokenizer": "standard", "filter": [ "italian_elision", "lowercase", "italian_stop", "italian_keywords", "italian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
latvian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "latvian_stop": { "type": "stop", "stopwords": "_latvian_""latvian_keywords": { "type": "keyword_marker", "keywords": []
"latvian_stemmer": { "type": "stemmer", "language": "latvian" "analyzer": { "latvian": { "tokenizer": "standard", "filter": [ "lowercase", "latvian_stop", "latvian_keywords", "latvian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
norwegian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "norwegian_stop": { "type": "stop", "stopwords": "_norwegian_""norwegian_keywords": { "type": "keyword_marker", "keywords": []
"norwegian_stemmer": { "type": "stemmer", "language": "norwegian" "analyzer": { "norwegian": { "tokenizer": "standard", "filter": [ "lowercase", "norwegian_stop", "norwegian_keywords", "norwegian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
persian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "char_filter": { "zero_width_spaces": { "type": "mapping", "mappings": [ "\\u200C=> "]"filter": { "persian_stop": { "type": "stop", "stopwords": "_persian_"
"analyzer": { "persian": { "tokenizer": "standard", "char_filter": [ "zero_width_spaces" ], "filter": [ "lowercase", "arabic_normalization", "persian_normalization", "persian_stop" Replaces zero-width non-joiners with an ASCII space. The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters.
The
portuguese
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "portuguese_stop": { "type": "stop", "stopwords": "_portuguese_""portuguese_keywords": { "type": "keyword_marker", "keywords": []
"portuguese_stemmer": { "type": "stemmer", "language": "light_portuguese" "analyzer": { "portuguese": { "tokenizer": "standard", "filter": [ "lowercase", "portuguese_stop", "portuguese_keywords", "portuguese_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
romanian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "romanian_stop": { "type": "stop", "stopwords": "_romanian_""romanian_keywords": { "type": "keyword_marker", "keywords": []
"romanian_stemmer": { "type": "stemmer", "language": "romanian" "analyzer": { "romanian": { "tokenizer": "standard", "filter": [ "lowercase", "romanian_stop", "romanian_keywords", "romanian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
russian
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "russian_stop": { "type": "stop", "stopwords": "_russian_""russian_keywords": { "type": "keyword_marker", "keywords": []
"russian_stemmer": { "type": "stemmer", "language": "russian" "analyzer": { "russian": { "tokenizer": "standard", "filter": [ "lowercase", "russian_stop", "russian_keywords", "russian_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
sorani
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "sorani_stop": { "type": "stop", "stopwords": "_sorani_""sorani_keywords": { "type": "keyword_marker", "keywords": []
"sorani_stemmer": { "type": "stemmer", "language": "sorani" "analyzer": { "sorani": { "tokenizer": "standard", "filter": [ "sorani_normalization", "lowercase", "sorani_stop", "sorani_keywords", "sorani_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
spanish
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "spanish_stop": { "type": "stop", "stopwords": "_spanish_""spanish_keywords": { "type": "keyword_marker", "keywords": []
"spanish_stemmer": { "type": "stemmer", "language": "light_spanish" "analyzer": { "spanish": { "tokenizer": "standard", "filter": [ "lowercase", "spanish_stop", "spanish_keywords", "spanish_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
swedish
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "swedish_stop": { "type": "stop", "stopwords": "_swedish_""swedish_keywords": { "type": "keyword_marker", "keywords": []
"swedish_stemmer": { "type": "stemmer", "language": "swedish" "analyzer": { "swedish": { "tokenizer": "standard", "filter": [ "lowercase", "swedish_stop", "swedish_keywords", "swedish_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
turkish
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "turkish_stop": { "type": "stop", "stopwords": "_turkish_""turkish_lowercase": { "type": "lowercase", "language": "turkish" "turkish_keywords": { "type": "keyword_marker", "keywords": []
"turkish_stemmer": { "type": "stemmer", "language": "turkish" "analyzer": { "turkish": { "tokenizer": "standard", "filter": [ "apostrophe", "turkish_lowercase", "turkish_stop", "turkish_keywords", "turkish_stemmer" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters. This filter should be removed unless there are words which should be excluded from stemming.
The
thai
analyzer could be reimplemented as a
custom
analyzer as follows:
{ "settings": { "analysis": { "filter": { "thai_stop": { "type": "stop", "stopwords": "_thai_""analyzer": { "thai": { "tokenizer": "thai", "filter": [ "lowercase", "thai_stop" The default stopwords can be overridden with the
stopwords
or stopwords_path
parameters.
An analyzer of type
snowball
that uses the
standard tokenizer
, with
standard filter
,
lowercase filter
,
stop filter
, and
snowball filter
.
The Snowball Analyzer is a stemming analyzer from Lucene that is originally based on the snowball project from snowball.tartarus.org .
Sample usage:
{ "index" : { "analysis" : { "analyzer" : { "my_analyzer" : { "type" : "snowball", "language" : "English" }
The
language
parameter can have the same values as the
snowball filter
and defaults to
English
. Note that not all the language
analyzers have a default set of stopwords provided.
The
stopwords
parameter can be used to provide stopwords for the
languages that have no defaults, or to simply replace the default set
with your custom list. Check
Stop Analyzer
for more details. A default set of stopwords for many of these
languages is available, for instance,
here.
A sample configuration (in YAML format) specifying Swedish with stopwords:
index : analysis : analyzer : my_analyzer: type: snowball language: Swedish stopwords: "och,det,att,i,en,jag,hon,som,han,på,den,med,var,sig,för,så,till,är,men,ett,om,hade,de,av,icke,mig,du,henne,då,sin,nu,har,inte,hans,honom,skulle,hennes,där,min,man,ej,vid,kunde,något,från,ut,när,efter,upp,vi,dem,vara,vad,över,än,dig,kan,sina,här,ha,mot,alla,under,någon,allt,mycket,sedan,ju,denna,själv,detta,åt,utan,varit,hur,ingen,mitt,ni,bli,blev,oss,din,dessa,några,deras,blir,mina,samma,vilken,er,sådan,vår,blivit,dess,inom,mellan,sådant,varför,varje,vilka,ditt,vem,vilket,sitta,sådana,vart,dina,vars,vårt,våra,ert,era,vilkas"
The following are settings that can be set for a
custom
analyzer type:
Setting | Description |
---|---|
|
The logical / registered name of the tokenizer to use. |
|
An optional list of logical / registered name of token filters. |
|
An optional list of logical / registered name of char filters. |
|
An optional number of positions to increment between each field value of a field using this analyzer. |
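For example, a minimal sketch of a custom analyzer definition (the analyzer name and the particular char filter and token filters chosen here are illustrative):
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_custom_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "char_filter" : ["html_strip"],
          "filter" : ["lowercase", "asciifolding"]
        }
      }
    }
  }
}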
A tokenizer of type
standard
providing grammar based tokenizer that is
a good tokenizer for most European language documents. The tokenizer
implements the Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29
.
The following are settings that can be set for a
standard
tokenizer
type:
Setting | Description |
---|---|
|
The maximum token length. If a token is seen that
exceeds this length then it is split at
|
A tokenizer of type
edgeNGram
.
The following are settings that can be set for a
edgeNGram
tokenizer
type:
Setting | Description | Default value |
---|---|---|
|
Minimum size in codepoints of a single n-gram |
|
|
Maximum size in codepoints of a single n-gram |
|
|
Characters classes to keep in the tokens, Elasticsearch will split on characters that don’t belong to any of these classes. |
|
token_chars accepts the following character classes:
letter, for example a, b, ï or 京
digit, for example 3 or 7
whitespace, for example " " or "\n"
punctuation, for example ! or "
symbol, for example $ or √
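For example, a minimal sketch of an edgeNGram tokenizer wired into a custom analyzer (the names and gram sizes are illustrative):
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_edge_analyzer" : {
          "type" : "custom",
          "tokenizer" : "my_edge_tokenizer"
        }
      },
      "tokenizer" : {
        "my_edge_tokenizer" : {
          "type" : "edgeNGram",
          "min_gram" : 2,
          "max_gram" : 5,
          "token_chars" : ["letter", "digit"]
        }
      }
    }
  }
}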
side (deprecated)
There used to be a side parameter up to 0.90.1 but it is now deprecated. In order to emulate the behavior of "side" : "BACK" a reverse token filter should be used together with the edgeNGram token filter. The edgeNGram filter must be enclosed in reverse filters like this:
"filter" : ["reverse", "edgeNGram", "reverse"]
which essentially reverses the token, builds front EdgeNGrams and reverses the ngram again. This has the same effect as the previous "side" : "BACK" setting.
A tokenizer of type
keyword
that emits the entire input as a single
output.
The following are settings that can be set for a
keyword
tokenizer
type:
Setting | Description |
---|---|
|
The term buffer size. Defaults to
|
A tokenizer of type
lowercase
that performs the function of
Letter Tokenizer
and
Lower Case Token Filter
together. It divides text at non-letters and converts
them to lower case. While it is functionally equivalent to the
combination of
Letter Tokenizer
and
Lower Case Token Filter
, there is a performance advantage to doing the two
tasks at once, hence this (redundant) implementation.
The following are settings that can be set for a
nGram
tokenizer type:
Setting | Description | Default value |
---|---|---|
|
Minimum size in codepoints of a single n-gram |
|
|
Maximum size in codepoints of a single n-gram |
|
|
Characters classes to keep in the tokens, Elasticsearch will split on characters that don’t belong to any of these classes. |
|
token_chars accepts the following character classes:
letter, for example a, b, ï or 京
digit, for example 3 or 7
whitespace, for example " " or "\n"
punctuation, for example ! or "
symbol, for example $ or √
Setting | Description |
---|---|
|
The regular expression pattern, defaults to
|
|
The regular expression flags. |
|
Which group to extract into tokens. Defaults to
|
IMPORTANT : The regular expression should match the token separators , not the tokens themselves.
Path Hierarchy Tokenizer
The following are settings that can be set for a
uax_url_email
tokenizer type:
Setting | Description |
---|---|
|
The maximum token length. If a token is seen that
exceeds this length then it is discarded. Defaults to
|
The
path_hierarchy
tokenizer takes something like this:
/something/something/else
/something /something/something /something/something/else
Setting | Description |
---|---|
|
The character delimiter to use, defaults to
|
|
An optional replacement character to use. Defaults to
the
|
|
The buffer size to use, defaults to
|
|
Generates tokens in reverse order, defaults to
|
|
Controls initial tokens to skip, defaults to
|
|
Required. Must be set to
|
The following are settings that can be set for a
classic
tokenizer
type:
Setting | Description |
---|---|
|
The maximum token length. If a token is seen that
exceeds this length then it is discarded. Defaults to
|
Token filters accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).
Elasticsearch has a number of built in token filters which can be used to build custom analyzers .
A token filter of type
standard
that normalizes tokens extracted with
Standard Tokenizer
.
The
standard
token filter currently does nothing. It remains as a placeholder
in case some filtering function needs to be added in a future version.
"index" : { "analysis" : { "analyzer" : { "default" : { "tokenizer" : "standard", "filter" : ["standard", "my_ascii_folding"] "filter" : { "my_ascii_folding" : { "type" : "asciifolding", "preserve_original" : true Lowercase Token Filter
A token filter of type
length
that removes words that are too long or
too short for the stream.
The following are settings that can be set for a
length
token filter
type:
Setting | Description |
---|---|
|
The minimum number. Defaults to
|
|
The maximum number. Defaults to
|
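For example, a minimal sketch of a length filter used in a custom analyzer (the names and the min/max values are illustrative):
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_length" : {
          "type" : "length",
          "min" : 2,
          "max" : 20
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["lowercase", "my_length"]
        }
      }
    }
  }
}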
A token filter of type
nGram
.
The following are settings that can be set for a
nGram
token filter
type:
Setting | Description |
---|---|
|
Defaults to
|
|
Defaults to
|
A token filter of type
edgeNGram
.
The following are settings that can be set for a
edgeNGram
token
filter type:
Setting | Description |
---|---|
|
Defaults to
|
|
Defaults to
|
|
Either
|
A token filter of type `porter_stem` that stems the English language based on the Porter stemming algorithm.

Note, the input to the stemming filter must already be in lower case, so you will need to use Lower Case Token Filter or Lower Case Tokenizer farther down the Tokenizer chain in order for this to work properly! For example, when using a custom analyzer, make sure the `lowercase` filter comes before the `porter_stem` filter in the list of filters.
The following are settings that can be set for a `shingle` token filter type:

Setting | Description |
---|---|
`max_shingle_size` | The maximum shingle size. Defaults to `2`. |
`min_shingle_size` | The minimum shingle size. Defaults to `2`. |
`output_unigrams` | If `true`, the output will contain the input tokens (unigrams) as well as the shingles. Defaults to `true`. |
`output_unigrams_if_no_shingles` | If `output_unigrams` is `false`, the output will contain the input tokens (unigrams) when no shingles are available. Defaults to `false`. |
`token_separator` | The string to use when joining adjacent tokens to form a shingle. Defaults to `" "`. |
`filler_token` | The string to use as a replacement for each position at which there is no actual token in the stream. For instance this string is used if the position increment is greater than one when a `stop` filter is used. Defaults to `"_"`. |
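A hedged sketch of a shingle setup (the names `my_shingles` and `shingle_analyzer` are illustrative):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_shingles"]
        }
      }
    }
  }
}
```

With these settings the phrase `quick brown fox` would produce the unigrams plus shingles such as `quick brown`, `brown fox` and `quick brown fox`.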
A token filter of type
stop
that removes stop words from token
streams.
The following are settings that can be set for a `stop` token filter type:

- `stopwords`: A list of stop words to use. Defaults to the `_english_` stop words.
- `stopwords_path`: A path (either relative to the `config` location, or absolute) to a stopwords file configuration. Each stop word should be in its own "line" (separated by a line break). The file must be UTF-8 encoded.
- `ignore_case`: Set to `true` to lower case all words first. Defaults to `false`.
- `remove_trailing`: Set to `false` in order to not ignore the last term of a search if it is a stop word. This is very useful for the completion suggester, as a query like `green a` can be extended to `green apple` even though you remove stop words in general. Defaults to `true`.
The `stopwords` parameter accepts either an array of stopwords or a predefined language-specific list (such as `_english_`). Elasticsearch provides predefined stop word lists for a number of languages. For the empty stopwords list (to disable stopwords) use `_none_`.
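For example, the two forms look like the following sketch (the filter names `my_stop` and `english_stop` are illustrative):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```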
XML based hyphenation grammar files can be found in the Objects For Formatting Objects (OFFO) Sourceforge project. You can download `offo-hyphenation.zip` directly and look in the `offo-hyphenation/hyph/` directory.

Credits for the hyphenation code go to the Apache FOP project.
The following parameters can be used to configure a compound word token filter (see the sketch after this list for an example):

- `type`: Either `dictionary_decompounder` or `hyphenation_decompounder`.
- `word_list`: An array containing a list of words to use for the word dictionary.
- `word_list_path`: The path (either absolute or relative to the `config` directory) to the word dictionary.
- `hyphenation_patterns_path`: The path (either absolute or relative to the `config` directory) to a FOP XML hyphenation pattern file (required for hyphenation).
- `min_word_size`: Minimum word size. Defaults to `5`.
- `min_subword_size`: Minimum subword size. Defaults to `2`.
- `max_subword_size`: Maximum subword size. Defaults to `15`.
- `only_longest_match`: Whether to include only the longest matching subword or not. Defaults to `false`.
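A minimal sketch of a `dictionary_decompounder` configuration; the filter and analyzer names and the word list are illustrative only:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["foot", "ball", "basket"]
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  }
}
```

A token such as `basketball` would then also be indexed as the subwords `basket` and `ball`.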
Accepts an `articles` setting, which is a set of stop word articles. For example:

```json
"index" : {
    "analysis" : {
        "analyzer" : {
            "default" : {
                "tokenizer" : "standard",
                "filter" : ["standard", "elision"]
            }
        },
        "filter" : {
            "elision" : {
                "type" : "elision",
                "articles" : ["l", "m", "t", "qu", "n", "s", "j"]
            }
        }
    }
}
```

Unique Token Filter
would produce the tokens: [ `abc123`, `abc`, `123`, `def456`, `def`, `456` ]
Another example is analyzing email addresses:
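The analyzer configuration itself is missing from this copy of the text. The following sketch, reconstructed to be consistent with the tokens listed below, shows one way it could look; the filter and analyzer name `email` and the exact patterns are assumptions:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["([^@]+)", "(\\p{L}+)", "(\\d+)", "@(.+)"]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": ["email", "lowercase", "unique"]
        }
      }
    }
  }
}
```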
When the above analyzer is used on an email address like `john-smith_123@foo-bar.com`, it would produce the following tokens:

```
john-smith_123@foo-bar.com, john-smith_123, john, smith, 123, foo-bar.com, foo, bar, com
```
The `pattern_replace` token filter makes it easy to handle string replacements based on a regular expression. The regular expression is defined using the `pattern` parameter, and the replacement string can be provided using the `replacement` parameter (which supports referencing the original text, as explained here).
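As a sketch (the filter name `digits_to_hash` and the analyzer name `masked` are illustrative), a filter that replaces every digit with a `#` could be configured like this:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "digits_to_hash": {
          "type": "pattern_replace",
          "pattern": "\\d",
          "replacement": "#"
        }
      },
      "analyzer": {
        "masked": {
          "tokenizer": "standard",
          "filter": ["lowercase", "digits_to_hash"]
        }
      }
    }
  }
}
```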
Limits the number of tokens that are indexed per document and field.
Setting | Description |
---|---|
`max_token_count` | The maximum number of tokens that should be indexed per document and field. The default is `1`. |
`consume_all_tokens` | If set to `true`, the filter exhausts the token stream even if `max_token_count` tokens have already been produced. The default is `false`. |
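For example, the following sketch limits a field to the first 100 tokens (the names `limit_100` and `limited` are illustrative):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "limit_100": {
          "type": "limit",
          "max_token_count": 100
        }
      },
      "analyzer": {
        "limited": {
          "tokenizer": "standard",
          "filter": ["lowercase", "limit_100"]
        }
      }
    }
  }
}
```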
Each dictionary can be configured with one setting: `ignore_case` (defaults to `false`). This setting can be configured globally in `elasticsearch.yml` using `indices.analysis.hunspell.dictionary.ignore_case`.

One can use the hunspell stem filter by configuring it in the analysis settings:
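The configuration example itself is missing from this copy. The following is a sketch of a typical setup, assuming a dictionary for the `en_US` locale has been installed in the Hunspell dictionary directory; the filter and analyzer names are illustrative:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_US"]
        }
      },
      "filter": {
        "en_US": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": true
        }
      }
    }
  }
}
```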
The hunspell token filter accepts four options: `locale`, `dictionary`, `dedup` and `recursion_level`.
Token filter that generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the Stop Token Filter when we don't want to completely ignore common terms.
For example, the text "the quick brown is a fox" will be tokenized as "the", "the_quick", "quick", "brown", "brown_is", "is_a", "a_fox", "fox". Assuming "the", "is" and "a" are common words.
When
query_mode
is enabled, the token filter removes common words and
single terms followed by a common word. This parameter should be enabled
in the search analyzer.
For example, the query "the quick brown is a fox" will be tokenized as "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
The following are settings that can be set:
Setting | Description |
---|---|
`common_words` | A list of common words to use. |
`common_words_path` | A path (either relative to the `config` location, or absolute) to a list of common words. |
`ignore_case` | If true, common words matching will be case insensitive (defaults to `false`). |
`query_mode` | Generates bigrams then removes common words and single terms followed by a common word (defaults to `false`). |

Note, either `common_words` or `common_words_path` is required.
Here is an example:
```yaml
index:
    analysis:
        analyzer:
            index_grams:
                tokenizer: whitespace
                filter: [common_grams]
            search_grams:
                tokenizer: whitespace
                filter: [common_grams_query]
        filter:
            common_grams:
                type: common_grams
                common_words: [a, an, the]
            common_grams_query:
                type: common_grams
                query_mode: true
                common_words: [a, an, the]
```
Elasticsearch provides the following language-specific normalization token filters:

- Arabic: `arabic_normalization`
- German: `german_normalization`
- Hindi: `hindi_normalization`
- Indic: `indic_normalization`
- Kurdish (Sorani): `sorani_normalization`
- Persian: `persian_normalization`
- Scandinavian: `scandinavian_normalization`, `scandinavian_folding`
The `cjk_width` token filter normalizes CJK width differences, folding fullwidth ASCII variants into the equivalent basic Latin characters and halfwidth Katakana variants into the equivalent Kana. This token filter can be viewed as a subset of NFKC/NFKD Unicode normalization. See the ICU Analysis Plugin for full normalization support.
CJK Bigram Token Filter

The `cjk_bigram` token filter forms bigrams out of CJK (Chinese, Japanese, and Korean) tokens.

By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you always want to output both unigrams and bigrams, set the `output_unigrams` flag to `true`.

Bigrams are generated for characters in the Han, Hiragana, Katakana and Hangul scripts, but bigrams can be disabled for particular scripts with the `ignore_scripts` parameter. For example:
{ "index" : { "analysis" : { "analyzer" : { "han_bigrams" : { "tokenizer" : "standard", "filter" : ["han_bigrams_filter"] "filter" : { "han_bigrams_filter" : { "type" : "cjk_bigram", "ignore_scripts": [ "hiragana", "katakana", "hangul" "output_unigrams" : true Keep Words Token Filter Options
Settings example:

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "words_till_three"]
                },
                "my_analyzer1" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "words_on_file"]
                }
            },
            "filter" : {
                "words_till_three" : {
                    "type" : "keep",
                    "keep_words" : [ "one", "two", "three"]
                },
                "words_on_file" : {
                    "type" : "keep",
                    "keep_words_path" : "/path/to/word/file"
                }
            }
        }
    }
}
```

Keep Types Token Filter

A token filter of type `keep_types` that only keeps tokens whose token type is contained in a predefined set of types.
Settings example:

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "extract_numbers"]
                }
            },
            "filter" : {
                "extract_numbers" : {
                    "type" : "keep_types",
                    "types" : [ "<NUM>" ]
                }
            }
        }
    }
}
```

Classic Token Filter
The `classic` token filter does optional post-processing of terms that are generated by the classic tokenizer. This filter removes the English possessive from the end of words, and it removes dots from acronyms.

Apostrophe Token Filter

The `apostrophe` token filter strips all characters after an apostrophe, including the apostrophe itself.
Character filters are used to preprocess the string of
characters before it is passed to the
tokenizer
.
A character filter may be used to strip out HTML markup, or to convert `&` characters to the word `and`.

Elasticsearch has built-in character filters which can be used to build custom analyzers.
A char filter of type `mapping` replaces characters of the analyzed text according to a list of mappings. The mappings can be specified inline with the `mappings` setting; otherwise the `mappings_path` setting can specify a file (relative to the `config` location) containing the list of char mappings. Here is a sample configuration:
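The original sample is missing from this copy; the following is a sketch of what such a configuration might look like, where the char filter name `my_mapping` and the mapping values are illustrative:

```json
{
  "index": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": ["ph=>f", "qu=>k"]
        }
      },
      "analyzer": {
        "custom_with_char_filter": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"]
        }
      }
    }
  }
}
```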
The `pattern_replace` char filter allows the use of a regex to manipulate the characters in a string before analysis.
Here is a sample configuration:

```json
{
    "index" : {
        "analysis" : {
            "char_filter" : {
                "my_pattern":{
                    "type":"pattern_replace",
                    "pattern":"sample(.*)",
                    "replacement":"replacedSample $1"
                }
            },
            "analyzer" : {
                "custom_with_char_filter" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["my_pattern"]
                }
            }
        }
    }
}
```

The ICU analysis plugin allows for unicode normalization, collation and folding. The plugin is called elasticsearch-analysis-icu. The plugin includes the following analysis components:

ICU Normalization
Normalizes characters as explained here. It registers itself by default under `icu_normalizer` using the default settings.

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
```

ICU Folding

Filtering
The folding can be filtered by a set of unicode characters with the `unicodeSetFilter` parameter.

The following example exempts Swedish characters from the folding. Note that the filtered characters are NOT lowercased, which is why we add the `lowercase` filter below.

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
```

ICU Collation
Uses the collation token filter. Allows to either specify the rules for collation (defined here) using the `rules` parameter, or to use a predefined collation by specifying the `language` parameter (as in the custom example below).

Here is a sample settings:

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
```

And here is a sample of custom collation:

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
```

Options
Expert options:
ICU Tokenizer

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
```

ICU Normalization CharFilter
Normalizes characters as explained here. It registers itself by default under `icu_normalizer` as a char filter, using the default settings.

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
```

Discovery

Shards Allocation

The following settings may be used:

Shard Allocation Awareness

```yaml
node.rack_id: rack_one
cluster.routing.allocation.awareness.attributes: rack_id
```
Now, if we start two more nodes with `node.rack_id` set to `rack_two`, shards will relocate to even the number of shards across the nodes, but a shard and its replica will not be allocated in the same `rack_id` value.

The awareness attributes can hold several values, for example:

```yaml
cluster.routing.allocation.awareness.attributes: rack_id,zone
```

NOTE: When using awareness attributes, shards will not be allocated to nodes that don't have values set for those attributes.

NOTE: The number of primary/replica shards allocated on a specific group of nodes with the same awareness attribute value is determined by the number of attribute values. When the number of nodes in groups is unbalanced and there are many replicas, replica shards may be left unassigned.

Forced Awareness

Automatic Preference When Searching / GETing

Realtime Settings Update

The settings can be updated using the cluster update settings API on a live cluster.

Shard Allocation Filtering
In the same manner, nodes can be excluded from allocation based on their IP address:

```sh
curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "10.0.0.1"
    }
}'
```

Azure Discovery

Azure discovery allows to use the Azure APIs to perform automatic discovery (similar to multicast). Please check the plugin website to find the full documentation.

EC2 discovery allows to use the EC2 APIs to perform automatic discovery (similar to multicast). Please check the plugin website to find the full documentation.

Google Compute Engine (GCE) discovery allows to use the GCE APIs to perform automatic discovery (similar to multicast). Please check the plugin website to find the full documentation.

The zen discovery is integrated with other modules; for example, all communication between nodes is done using the transport module. It is separated into several sub modules, which are explained below:

Ping

Multicast
Unicast
The unicast discovery uses the transport module to perform the discovery.

Master Election

Fault Detection
The following settings control the fault detection process, using the `discovery.zen.fd` prefix:

External Multicast
The master node is the only node in a cluster that can make changes to the
cluster state. The master node processes one cluster state update at a time,
applies the required changes and publishes the updated cluster state to all
the other nodes in the cluster. Each node receives the publish message,
updates its own cluster state and replies to the master node, which waits for
all nodes to respond, up to a timeout, before going ahead processing the next
updates in the queue. The `discovery.zen.publish_timeout` setting controls how long the master waits for nodes to respond to the publish message before moving on to the next update.
No master block
The `discovery.zen.no_master_block` setting controls what operations should be rejected when there is no active master.
Dangling indices