
Best practices in Elasticsearch

Best Indexing:
When searching across multiple fields for a single concept, we want to match
as many of the query words as possible within the same field.
We compared the following two indexing approaches, and the second one turned
out to be the better one:
i.
Search across many fields within the index, OR
ii.
Consolidate all relevant keywords into a single keyword field
i) Search across many fields within the index:
We searched for the text across multiple fields and got the following
result.

Data Set:
"hits": [
  {
    "_index": "searchtest_28",
    "_type": "test28",
    "_id": "0",
    "_score": 1,
    "_source": {
      "Keyword": "Chad Holan 10070723 1000110679 40ZS West Des Moines PENDING 100001",
      "Module": "Contract",
      "Sub Moudule": "Contracts",
      "First Name": "Chad",
      "Last Name": "Holan",
      "Customer Id": "10070723",
      "Context": "ContractInfo",
      "Date": "10/17/2016",
      "Contract#": "1000110679",
      "Description": "40ZS",
      "Address": "West Des Moines",
      "Status": "PENDING",
      "PO#": "100001"
    }
  },

Search across multiple fields:
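The original query is not reproduced here. As a minimal sketch, assuming a multi_match query is used, a request body that searches several fields of the sample document could be built as below. The search text and the choice of fields are illustrative assumptions, not the exact request used in the test.

```python
import json

# Field names come from the sample document; which of them to search is an assumption.
SEARCH_FIELDS = ["First Name", "Last Name", "Address", "Status"]


def multi_field_query(text, fields=SEARCH_FIELDS):
    """Build a request body that looks for `text` in each of the listed fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
            }
        }
    }


# Illustrative search text, not taken from the original test.
multi_body = multi_field_query("Chad Holan")
print(json.dumps(multi_body, indent=2))
```

The printed JSON can be sent as the body of a search request against the index.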

ii) Consolidate all relevant keywords into a keyword field:


We searched for the text in the single consolidated field (Keyword) and got
the following result.

Data Set:
"hits": [
  {
    "_index": "searchtest_28",
    "_type": "test28",
    "_id": "0",
    "_score": 1,
    "_source": {
      "Keyword": "Chad Holan 10070723 1000110679 40ZS West Des Moines PENDING 100001",
      "Module": "Contract",
      "Sub Moudule": "Contracts",
      "First Name": "Chad",
      "Last Name": "Holan",
      "Customer Id": "10070723",
      "Context": "ContractInfo",
      "Date": "10/17/2016",
      "Contract#": "1000110679",
      "Description": "40ZS",
      "Address": "West Des Moines",
      "Status": "PENDING",
      "PO#": "100001"
    }
  },

Search across the single field (Keyword field):
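The original query is not reproduced here. A minimal sketch of a single-field search, assuming a match query against the Keyword field, might look like the following. The "and" operator is an assumption chosen to reflect the goal of finding as many words as possible in the same field; the search text is illustrative.

```python
import json


def keyword_field_query(text):
    """Build a request body that requires every word of `text` in the Keyword field."""
    return {
        "query": {
            "match": {
                "Keyword": {
                    "query": text,
                    "operator": "and",  # assumption: all words must match
                }
            }
        }
    }


# Illustrative search text mixing a name and a status from the sample document.
keyword_body = keyword_field_query("Chad Holan PENDING")
print(json.dumps(keyword_body, indent=2))
```

Because every value is consolidated into one field, a query that mixes a name, an address and a status can be answered by a single field lookup.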

Observation: We found that the second approach is the better one.


Use index aliases instead of index names:
Except for special use cases like dropping a specific index, always use index
aliases instead of index names. The mapping of an existing field cannot be
changed in place; to change it, we need to drop the index and create a new one.
If a live service refers to the index by name, we have to stop the service in
order to drop the old index. With an alias, we can create the new index with the
new mapping and run the data migration script while the service keeps using the
old index. Once everything is OK, just switch the alias to the new index. If
something bad happens, we can still switch back. This is called index versioning.
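As a minimal sketch, the alias switch can be expressed as a single request body for the _aliases endpoint, which removes the alias from the old index and adds it to the new one in one call. The alias and index names below are hypothetical.

```python
import json


def alias_switch_body(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the service always queries "searchtest", never a versioned index.
switch_body = alias_switch_body("searchtest", "searchtest_v1", "searchtest_v2")
# Switching back is the same call with the two index names swapped.
rollback_body = alias_switch_body("searchtest", "searchtest_v2", "searchtest_v1")
print(json.dumps(switch_body, indent=2))
```

Since the service only knows the alias, neither the switch nor the rollback requires a restart.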
Explicit mapping:
Set explicit mappings, even for primitive types like float, boolean, decimal,
etc.
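A minimal sketch of an explicit mapping, written as an index-creation body, could look like this. The document type and field names are hypothetical, and the layout assumes the typed-mapping syntax used elsewhere in this document; newer Elasticsearch versions differ.

```python
import json


def explicit_mapping(doc_type):
    """Build an index-creation body that declares every field type explicitly."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # Hypothetical fields; nothing is left to dynamic type detection.
                    "price": {"type": "float"},
                    "in_stock": {"type": "boolean"},
                    "quantity": {"type": "integer"},
                    "created": {"type": "date", "format": "MM/dd/yyyy"},
                }
            }
        }
    }


mapping_body = explicit_mapping("my_doctype")
print(json.dumps(mapping_body, indent=2))
```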

Aggregate field and explicit search field:


We can specify which fields are copied into a combined search field using the
copy_to property. For example:
{
  "my_doctype": {
    "properties": {
      "search": {
        "type": "string",
        "analyzer": "my_fulltext_analyzer"
      },
      "first_name": {
        "type": "string",
        "copy_to": "search"
      },
      "last_name": {
        "type": "string",
        "copy_to": "search"
      }
    }
  }
}

We can specify which field to search with the default_field key. For example,
for a search query with filters, we may have a query body like:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "search",
            "query": "foobar"
          }
        },
        {
          "term": {
            "foo": true
          }
        }
      ]
    }
  }
}

Best practices in system level configuration:

Change default location of data and logs.

Routing: Route each document to a particular shard, e.g. for an e-commerce site, you can use the category name as the routing value.

Use a unique cluster name rather than the default, elasticsearch.

Give nodes meaningful names, e.g. according to the rack or instance name, such as rack1.vm1.node1.

Avoid split brain by setting discovery.zen.minimum_master_nodes to (num_of_nodes/2)+1 on clusters with num_of_nodes greater than 2.

The Elasticsearch heap should be around 50% of the available memory on the
machine.

Configure Elasticsearch to instruct the OS to never swap, by setting bootstrap.mlockall to true.

Gateway: Avoid shard shuffling on recovery by setting gateway.recover_after_nodes, gateway.recover_after_time and
gateway.expected_nodes.

File Descriptors: Raise the number of file descriptors available to the user
running Elasticsearch to 65535.

Update Indices Settings: Increase indexing performance by setting a higher refresh_interval as needed; the default refresh_interval is 1 second.

Disable _source if we don't need it.

If you are using _source, there is no need to store individual fields.

Disable _all if we don't need it.

Set norms enabled to false for analyzed fields.

Date Format: Decide whether we need to compute/sort/search dates to the precision of seconds or milliseconds, and set the date format accordingly.
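Some of the settings in the list above can be derived rather than typed by hand. The sketch below computes discovery.zen.minimum_master_nodes with the (num_of_nodes/2)+1 rule, using integer division, and renders a few elasticsearch.yml-style lines. The cluster name is hypothetical, and the helper names are illustrative only.

```python
def minimum_master_nodes(num_of_nodes):
    """Quorum size from the rule above: (num_of_nodes / 2) + 1, with integer division."""
    return num_of_nodes // 2 + 1


def render_node_settings(cluster_name, node_name, num_of_nodes):
    """Render a few of the recommended settings as 'key: value' lines."""
    settings = [
        ("cluster.name", cluster_name),
        ("node.name", node_name),
        ("bootstrap.mlockall", "true"),
    ]
    # The rule is only applied on clusters with more than 2 nodes, as stated above.
    if num_of_nodes > 2:
        settings.append(
            ("discovery.zen.minimum_master_nodes", minimum_master_nodes(num_of_nodes))
        )
    return "\n".join(f"{key}: {value}" for key, value in settings)


# Hypothetical cluster name; the node name follows the rack/instance naming example.
config_text = render_node_settings("search-prod", "rack1.vm1.node1", 3)
print(config_text)
```

For a three-node cluster the computed quorum is 2, so a single isolated node cannot elect itself master.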
