Manage Elasticsearch documents with indices and shards

Last Updated : 11 Jul, 2024

Elasticsearch is an open-source search and analytics engine designed to handle large volumes of data efficiently. Two of its core concepts are indices and shards, which govern how documents are stored, managed, and retrieved.

This article explains the basics of managing Elasticsearch documents using indices and shards, the roles these components play in Elasticsearch, and how to configure them well.

Understanding these elements matters whether you are new to Elasticsearch or want to improve the performance and scalability of your search solutions.

Elasticsearch Architecture

Elasticsearch is a distributed, scalable, and highly available search and analytics engine that is used for indexing large data sets. Here are the key components that constitute Elasticsearch's architecture:

Nodes

A node is a single running instance of Elasticsearch. It runs on a physical or virtual machine, stores part of the cluster's data, and takes part in the cluster's indexing and search operations. Each node has a unique name and belongs to one cluster.

Cluster

A cluster is a set of one or more nodes that together hold all of the data and provide indexing and search across all nodes. A cluster is identified by its name, which defaults to “elasticsearch”. Nodes join a cluster by being configured with its name.

Indices

Indices are logical groupings of documents. An index is loosely comparable to a table in a relational database. Every index has a name, which is used to refer to it when indexing, searching, updating, and deleting documents.

Shards

Shards are the basic building blocks of an index and make data distribution and parallel operations possible. Each index is split into one or more shards, and each shard is a self-contained Lucene index.

Understanding Elasticsearch Indices

An Elasticsearch index is a logical grouping of documents that share a purpose, a type of data, or a common source. Documents in the same index typically have a similar structure. Key points about indices include:

  1. Creation: An index can be created explicitly with the create index API, or automatically when the first document is indexed under an index name that does not exist yet, in which case Elasticsearch applies its default settings.
  2. Settings and Mapping: Each index has settings that determine its properties (e.g., the number of shards or replicas), and mappings that determine how fields in the documents are analyzed and stored.
  3. Usage: Indices organize documents so they can be searched and queried quickly. They can be tuned for high availability and throughput.

Elasticsearch Documents

Documents are the basic units of data in Elasticsearch. They are represented in JSON and contain the actual data to be indexed and searched. Key aspects of documents include:

  1. Structure: Documents are JSON (JavaScript Object Notation) objects of varying size that can contain nested objects and arrays.
  2. Indexing: When a document is indexed, Elasticsearch stores it and makes its content searchable. Every document has an identifier (_id) that is unique within its index.
  3. Fields: Documents consist of fields, which can be of different types such as strings, numbers, or dates. Fields are indexed according to the index's mappings.

Sharding in Elasticsearch

Shards are the smallest components that can be stored in Elasticsearch and they are used to help distribute data within the cluster. Understanding shards is essential for managing data distribution, scalability, and performance.

Purpose

Sharding scales an index horizontally: the index is divided into smaller units known as shards, which are distributed across the nodes of the cluster.

Types

There are two categories of shards:

  • Primary Shards: Every document belongs to exactly one primary shard. Primary shards hold the index's original data and handle both indexing and search requests.
  • Replica Shards: These are copies of primary shards. Their main purpose is to protect against data loss when a node fails, and they can also serve search requests.

Configuration

The number of primary shards is set when the index is created and determines how data is distributed across, and retrieved from, the cluster. It cannot be changed in place afterwards (changing it means reindexing or using the split or shrink APIs), whereas the number of replicas can be adjusted at any time. Choosing a suitable shard configuration is therefore an important part of performance optimization and resource utilization.
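Why the primary shard count is fixed becomes clearer from how a document is routed to a shard. The following is a minimal Python sketch, assuming the simplified rule shard = hash(routing value) % number of primary shards, where the routing value is the document _id by default. Elasticsearch itself uses a Murmur3 hash and a somewhat more involved formula, so zlib.crc32 and the helper name route_to_shard are stand-ins for illustration only.

```python
import zlib


def route_to_shard(routing_value: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document (simplified model).

    zlib.crc32 is only a stand-in for the Murmur3 hash Elasticsearch uses.
    """
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


doc_ids = [str(i) for i in range(1, 9)]

# Same documents, routed for an index with 3 and with 5 primary shards.
with_3 = {doc_id: route_to_shard(doc_id, 3) for doc_id in doc_ids}
with_5 = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}

# Documents whose shard would change if the shard count changed.
moved = [doc_id for doc_id in doc_ids if with_3[doc_id] != with_5[doc_id]]

print("3 shards:", with_3)
print("5 shards:", with_5)
print("would land on a different shard:", moved)
```

Because the shard number depends on the shard count, changing the count would send lookups for existing documents to the wrong shard, which is why the data has to be rewritten instead.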

Shard Management

1. Shard Allocation

  • By default, Elasticsearch distributes shards across the nodes so that load is balanced.
  • Shard allocation can be controlled with shard allocation awareness and allocation filtering.

2. Shard Rebalancing

  • Elasticsearch rebalances shards automatically whenever nodes join or leave the cluster, so that shards stay evenly spread.
  • This keeps resource usage even across the nodes of the cluster.

3. Shard Recovery

  • If a node leaves the cluster, Elasticsearch promotes replica shards to primaries where needed and rebuilds the missing copies on the remaining nodes.
  • Shard recovery is critical to ensure data is preserved and can always be accessed when needed.
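The placement rules behind allocation and recovery can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification, not the real Elasticsearch allocator (which also weighs disk usage, awareness attributes, filters, and more): it only spreads shard copies over the least-loaded nodes and never puts two copies of the same shard on one node. The function name allocate and the node names are hypothetical.

```python
from collections import defaultdict


def allocate(num_primaries, num_replicas, nodes):
    """Toy allocator: returns (placement per node, unassigned copies)."""
    placement = defaultdict(list)  # node name -> list of (shard, role)
    unassigned = []
    for shard in range(num_primaries):
        used = set()  # nodes that already hold a copy of this shard
        for copy_no in range(1 + num_replicas):
            role = "primary" if copy_no == 0 else "replica"
            candidates = [n for n in nodes if n not in used]
            if not candidates:
                # No node left that does not already hold this shard.
                unassigned.append((shard, role))
                continue
            # Least-loaded eligible node; ties go to the first in the list.
            target = min(candidates, key=lambda n: len(placement[n]))
            placement[target].append((shard, role))
            used.add(target)
    return dict(placement), unassigned


three_nodes, left_over = allocate(3, 1, ["node-a", "node-b", "node-c"])
print(three_nodes, left_over)

# With a single node, replicas have nowhere to go and stay unassigned.
one_node, left_over_single = allocate(3, 1, ["node-a"])
print(one_node, left_over_single)
```

In this model, as in a real cluster, a replica is only useful when it lives on a different node than its primary, which is why a one-node cluster cannot assign replicas.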

Managing Documents with Indices and Shards

Index Creation and Management

  • When creating an index, choose the number of shards and replicas based on the expected data volume and typical query patterns.
  • Use aliases for index management, for example to group multiple indices under one logical name.

Document Indexing and Retrieval

  • Index documents using the Elasticsearch REST API or a client library.
  • Use search queries written in Elasticsearch’s Query DSL (Domain Specific Language) to retrieve matching documents.

Monitoring and Optimization

  • Check and manage shard distribution and health through the Elasticsearch APIs and monitoring tools.
  • Keep shards balanced in size. If you use custom routing, choose routing values that spread documents evenly so that no single shard becomes a hotspot.
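When checking shard health, the status colors reported by Elasticsearch follow a simple rule: red if any primary shard is not active, yellow if all primaries are active but some replica is not, green otherwise. The Python sketch below applies that rule to a hand-written list of shard copies. The record layout and the function name health_from_shards are assumptions made for this example; the state names mirror those shown by the cat shards API.

```python
# Shard states that count as active in this simplified model.
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def health_from_shards(shard_copies):
    """Derive green/yellow/red from a list of shard copy records."""
    primaries_ok = all(
        c["state"] in ACTIVE_STATES for c in shard_copies if c["primary"]
    )
    replicas_ok = all(
        c["state"] in ACTIVE_STATES for c in shard_copies if not c["primary"]
    )
    if not primaries_ok:
        return "red"      # some data is currently unavailable
    if not replicas_ok:
        return "yellow"   # data is available but not fully redundant
    return "green"


# A one-node cluster: the primary is started, its replica cannot be assigned.
single_node = [
    {"shard": 0, "primary": True, "state": "STARTED"},
    {"shard": 0, "primary": False, "state": "UNASSIGNED"},
]
print(health_from_shards(single_node))  # yellow
```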

Scaling and Resilience

  • Add nodes to the cluster to spread shards over more machines and handle a higher request volume.
  • Configure replica shards so that the cluster keeps redundant copies of the data in case of node failure.

Main Concepts and Syntax

1. Creating an Index:

To create an index in Elasticsearch, you can use the following syntax:

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}

Explanation:

  • PUT /my_index: This API call creates an index named my_index.
  • "number_of_shards": 5: Specifies that the index should be divided into 5 primary shards.
  • "number_of_replicas": 1: Configures 1 replica shard for each primary shard, ensuring data redundancy.
  • "mappings": Defines the structure of the documents within the index, specifying types and properties of fields like title, content, and timestamp.
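As a quick check on what these two settings imply, the following Python sketch works out how many shard copies the cluster has to host. The function name shard_copy_counts is hypothetical; the arithmetic follows from the explanation above, where number_of_replicas is the number of replicas per primary shard.

```python
def shard_copy_counts(number_of_shards, number_of_replicas):
    """Count the shard copies implied by an index's settings."""
    primaries = number_of_shards
    replicas = number_of_shards * number_of_replicas
    return {
        "primaries": primaries,
        "replicas": replicas,
        "total_copies": primaries + replicas,
        # A replica never shares a node with its primary, so every copy
        # of one shard needs its own node.
        "min_nodes_for_all_copies": number_of_replicas + 1,
    }


# Settings used for my_index: 5 primary shards, 1 replica each.
counts = shard_copy_counts(5, 1)
print(counts)
```

For my_index this gives 5 primaries, 5 replicas, 10 shard copies in total, and at least 2 nodes before every replica can be assigned.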

2. Indexing a Document:

To index a document with an explicit ID, send a POST (or PUT) request to the _doc endpoint:

POST /my_index/_doc/1
{
  "title": "Introduction to Elasticsearch",
  "content": "Elasticsearch is a distributed search engine built on Apache Lucene.",
  "timestamp": "2024-07-04"
}

Explanation:

  • POST /my_index/_doc/1: Indexes a document with ID 1 into the my_index index.
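  • "_doc": The endpoint for single-document operations. User-defined mapping types were deprecated in Elasticsearch 7.x and removed in 8.x, so _doc is now a fixed part of the path rather than a type you choose.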
  • Document fields (title, content, timestamp) are indexed according to the mappings defined earlier.

3. Searching Documents:

To search for documents within an index, use the GET method:

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch"
    }
  }
}

Explanation:

  • GET /my_index/_search: Executes a search query within the my_index index.
  • "match": { "title": "Elasticsearch" }: Specifies a query that matches documents where the title field contains the term "Elasticsearch".

4. Updating Documents

Documents can be updated using the Elasticsearch update API.

POST /my_index/_update/1
{
  "doc": {
    "content": "Updated content of the document."
  }
}

Explanation:

  • POST /my_index/_update/1: Updates the document with ID 1 in the my_index index.
  • "doc": A partial document. Its fields are merged into the existing document, and fields that are not listed are left unchanged.
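The effect of a partial update can be modelled in a few lines of Python. This is a sketch of the merge semantics only, assuming that nested objects are merged recursively while other values (including arrays) are replaced; it does not talk to Elasticsearch, and the function name apply_partial_update is hypothetical.

```python
import copy


def apply_partial_update(source, partial):
    """Return a new document with `partial` merged into `source`."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Scalars and arrays simply replace the old value.
            merged[key] = copy.deepcopy(value)
    return merged


existing = {
    "title": "Introduction to Elasticsearch",
    "content": "Elasticsearch is a distributed search engine built on Apache Lucene.",
    "timestamp": "2024-07-04",
}
updated = apply_partial_update(
    existing, {"content": "Updated content of the document."}
)
print(updated)
```

Only content changes in the result; title and timestamp keep their previous values.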

5. Deleting Documents

Delete documents using the delete API.

DELETE /my_index/_doc/1

Explanation:

  • DELETE /my_index/_doc/1: Deletes the document with ID 1 from the my_index index.

6. Managing Indices

i. Viewing Index Information

Retrieve index information to monitor and manage indices.

GET /my_index

Explanation:

  • GET /my_index: Retrieves information about the my_index index.

ii. Deleting an Index

Delete indices when they are no longer needed.

DELETE /my_index

Explanation:

  • DELETE /my_index: Deletes the my_index index.

Examples

Index Creation

Creating an index with 3 primary shards and 1 replica per primary shard:

PUT /library
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "author": {
        "type": "text"
      },
      "publish_date": {
        "type": "date"
      },
      "content": {
        "type": "text"
      }
    }
  }
}

Output:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "library"
}

Indexing a Document

Indexing a book document into the library index:

POST /library/_doc/1
{
  "title": "Elasticsearch Basics",
  "author": "John Doe",
  "publish_date": "2023-06-01",
  "content": "This book covers the basics of Elasticsearch."
}

Output (as returned by Elasticsearch 7.x; 8.x responses omit "_type"):

{
  "_index": "library",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
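The "_shards" section of this response reports how many shard copies the write was meant to reach (here 2: the primary plus its one replica) and how many actually accepted it. The Python sketch below reads those counters from a response dictionary; the helper name summarize_write is hypothetical and the messages are only one possible interpretation.

```python
def summarize_write(response):
    """Describe how widely a write was replicated, based on `_shards`."""
    shards = response["_shards"]
    total = shards["total"]
    successful = shards["successful"]
    failed = shards["failed"]
    if failed > 0:
        return f"{failed} of {total} shard copies reported a failure"
    if successful < total:
        # Typical when replicas are unassigned, e.g. on a one-node cluster.
        return f"written to {successful} of {total} copies; some replicas are missing"
    return f"written to all {total} shard copies"


index_response = {
    "_index": "library",
    "_id": "1",
    "result": "created",
    "_shards": {"total": 2, "successful": 2, "failed": 0},
}
print(summarize_write(index_response))  # written to all 2 shard copies
```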

Searching Documents

Searching for documents with the title containing "Elasticsearch":

GET /library/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch"
    }
  }
}

Output (as returned by Elasticsearch 7.x; 8.x responses omit "_type"):

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "library",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "title": "Elasticsearch Basics",
          "author": "John Doe",
          "publish_date": "2023-06-01",
          "content": "This book covers the basics of Elasticsearch."
        }
      }
    ]
  }
}

Conclusion

In this guide, we covered how indices and shards are used to manage documents in Elasticsearch, how to create indices and index documents, and how to run queries and read the results. With these concepts in hand, you can use Elasticsearch to build large-scale, efficient data stores and search solutions.
