Elasticsearch is an open-source search and analytics engine designed to handle large volumes of data efficiently. Two of its core building blocks are indices and shards, which govern how documents are stored, managed, and retrieved.
This article explains the basics of managing Elasticsearch documents with indices and shards, covering the role each component plays and the practices to follow when tuning them.
Understanding these elements is important whether you are new to Elasticsearch or want to deepen your knowledge to improve the performance and scalability of your search solutions.
Elasticsearch Architecture
Elasticsearch is a distributed, scalable, and highly available search and analytics engine that is used for indexing large data sets. Here are the key components that constitute Elasticsearch's architecture:
Nodes
A node is a single running instance of Elasticsearch. It runs on a physical or virtual machine and stores part of the cluster's data. Nodes carry out indexing and search operations directly. Each node has a unique name and identifier and belongs to exactly one cluster.
Cluster
A cluster is a set of one or more nodes that together hold all of the data and provide indexing and search capabilities across every node. A cluster is identified by a name, which defaults to “elasticsearch”. Nodes use this name to join the cluster.
Indices
Indices are logical collections of documents. An index is loosely comparable to a table in a relational database, while an individual document corresponds to a row. Every index has a name, which is used to refer to it when indexing, searching, updating, and deleting documents.
Shards
Shards are the basic building blocks of an index in Elasticsearch and enable distribution and parallel execution of operations. Each index is split into one or more shards, and each shard is itself a self-contained Lucene index.
Understanding Elasticsearch Indices
An Elasticsearch index is a logical grouping of documents that share a purpose, a type of data, or a common source. The documents in an index typically have a similar structure. Key points about indices include:
- Creation: An index can be created explicitly through the create index API, or automatically when the first document is indexed under a name that does not exist yet, in which case Elasticsearch applies default settings.
- Settings and Mapping: Each index has config settings that determine its properties (e.g., number of shards or replication settings), and mappings that determine how fields in the documents are analyzed and stored.
- Usage: Indices help in the organization, quick search, and querying of documents. They can be controlled and fine tuned for high availability and operational capacity.
Elasticsearch Documents
Documents are the basic units of data that Elasticsearch stores and searches. They are represented in JSON and contain the actual data to be indexed. Key aspects of documents include:
- Structure: Documents are JSON (JavaScript Object Notation) objects of varying size and may contain nested objects and arrays.
- Indexing: When a document is indexed, Elasticsearch stores it and makes its content searchable. Every document has an identifier (_id) that is unique within its index.
- Fields: Documents consist of fields of different types, such as strings, numeric values, or dates, and each field is indexed according to the index mappings.
Sharding in Elasticsearch
Shards are the units in which Elasticsearch stores index data and distributes it across the cluster. Understanding shards is essential for managing data distribution, scalability, and performance.
Purpose
Sharding scales an index horizontally by dividing it into smaller units, known as shards, that can be spread across the nodes of a cluster.
Types
There are two categories of shards:
- Primary Shards: These hold the original copy of an index's data. Every document belongs to exactly one primary shard, which handles indexing and can serve searches.
- Replica Shards: These are copies of primary shards. They protect against data loss when a node fails and can also serve search requests.
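The relationship between primaries and replicas comes down to simple arithmetic. The sketch below is only an illustration, and the function names are hypothetical. It computes how many shard copies an index has in total and the smallest number of nodes on which every copy can be assigned, relying on the fact that Elasticsearch does not place a replica on the same node as its primary.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Each primary shard exists once, plus one copy per configured replica."""
    return number_of_shards * (1 + number_of_replicas)


def min_nodes_for_all_copies(number_of_replicas: int) -> int:
    """Copies of the same shard live on different nodes, so one node per copy is needed."""
    return number_of_replicas + 1


# An index with 5 primary shards and 1 replica per primary:
print(total_shard_copies(5, 1))        # 10
print(min_nodes_for_all_copies(1))     # 2
```

On a single-node cluster, an index with one replica would therefore keep its replica shards unassigned until a second node joins.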
Configuration
The number of primary shards for an index is defined when the index is created and cannot be changed afterwards without reindexing or using the split or shrink APIs, whereas the number of replicas can be adjusted at any time. Because the shard count determines how data is distributed and recovered across the cluster, configuring it properly is a vital part of performance optimization and resource utilization.
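To see why the shard count has to be chosen carefully up front, here is a minimal sketch of how a document could be routed to a shard. It is a simplification under stated assumptions: Elasticsearch hashes the routing value (by default the document _id) with Murmur3 and uses a somewhat more involved formula, whereas this sketch uses zlib.crc32 as a stand-in hash and a plain modulo. The function name pick_shard and the sample IDs are hypothetical.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document _id) to a shard number."""
    # crc32 is only a stand-in for the hash Elasticsearch actually uses.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


doc_ids = ["1", "2", "3", "42", "user-17"]  # hypothetical document IDs

# Where each document lands with 5 primary shards versus 6.
with_5 = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
with_6 = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}

# Documents whose shard number differs between the two layouts.
moved = [doc_id for doc_id in doc_ids if with_5[doc_id] != with_6[doc_id]]

print("5 shards:", with_5)
print("6 shards:", with_6)
print("would be looked up on a different shard:", moved)
```

Any document in the last list would be searched for on a shard that does not hold it if the shard count simply changed, which is the intuition behind treating the primary shard count as fixed once the index exists.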
Shard Management
1. Shard Allocation
- By default, Elasticsearch distributes shards across nodes so that load is balanced.
- Shard allocation can be controlled with shard allocation awareness and allocation filtering.
2. Shard Rebalancing
- Elasticsearch redistributes shards whenever nodes join or leave the cluster so that shards stay evenly spread.
- This keeps resource usage balanced across the cluster.
3. Shard Recovery
- If a node stops responding, Elasticsearch promotes replica shards to primaries where needed and rebuilds the missing copies on the remaining nodes.
- Shard recovery is critical for keeping data preserved and accessible.
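The allocation and recovery ideas above can be illustrated with a toy model. This is not how the real allocator works (it also weighs disk usage, awareness attributes, and other factors). It is a round-robin sketch with hypothetical node names that places each copy of a shard on a different node and then checks which shards survive the loss of one node.

```python
from collections import defaultdict


def allocate(number_of_shards, number_of_replicas, nodes):
    """Place shard copies round-robin so copies of one shard never share a node."""
    if number_of_replicas + 1 > len(nodes):
        raise ValueError("need at least number_of_replicas + 1 nodes")
    layout = defaultdict(list)
    for shard in range(number_of_shards):
        for copy in range(number_of_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "p" if copy == 0 else "r"  # first copy is the primary
            layout[node].append((shard, role))
    return dict(layout)


def surviving_shards(layout, failed_node):
    """Return the shard numbers that still have a copy after one node is lost."""
    return {
        shard
        for node, copies in layout.items()
        if node != failed_node
        for shard, _role in copies
    }


layout = allocate(3, 1, ["node-a", "node-b", "node-c"])
for node, copies in layout.items():
    print(node, copies)
# node-a [(0, 'p'), (2, 'r')]
# node-b [(0, 'r'), (1, 'p')]
# node-c [(1, 'r'), (2, 'p')]

print(surviving_shards(layout, "node-a"))  # {0, 1, 2}
```

With one replica per primary, every shard still has a copy after a single node failure, which is what allows Elasticsearch to promote replicas and rebuild the missing copies elsewhere.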
Managing Documents with Indices and Shards
Index Creation and Management
- When creating indices, choose the number of shards and replicas based on the expected data volume and typical query patterns.
- Use aliases for index management, especially when you want to address multiple indices under one logical name.
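As a sketch of the alias idea, the helper below builds a request body for the aliases API (POST /_aliases) that points one alias at several indices. The helper name and the names library_archive and library_all are hypothetical; only the "actions", "add", and "remove" structure comes from the API itself.

```python
import json


def build_alias_actions(alias, add_indices=(), remove_indices=()):
    """Build a body for POST /_aliases that adds and removes indices for one alias."""
    actions = [{"remove": {"index": name, "alias": alias}} for name in remove_indices]
    actions += [{"add": {"index": name, "alias": alias}} for name in add_indices]
    return {"actions": actions}


# Group two indices under a single logical name.
body = build_alias_actions("library_all", add_indices=["library", "library_archive"])

print("POST /_aliases")
print(json.dumps(body, indent=2))
```

Searches sent to the alias then cover every index behind it, and the set of indices can be changed later without touching client code.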
Document Indexing and Retrieval
- Index documents using the Elasticsearch REST API or client libraries.
- Write search queries in Elasticsearch’s query DSL (Domain Specific Language) to retrieve matching documents.
Monitoring and Optimization
- Check and manage shard distribution and health through the Elasticsearch APIs and monitoring tools.
- To keep per-shard queries efficient, make sure routing keys spread data evenly across shards.
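One way to check shard distribution is to request GET /_cat/shards?format=json and summarize the rows. The sketch below works on a hand-written sample that mimics a few of the columns of that response (index, shard, prirep, state, node). The rows and node names are made up for illustration, and the helper name is hypothetical.

```python
from collections import Counter


def summarize_shards(rows):
    """Count started shards per node and list the copies that are unassigned."""
    per_node = Counter(row["node"] for row in rows if row["state"] == "STARTED")
    unassigned = [
        (row["index"], row["shard"], row["prirep"])
        for row in rows
        if row["state"] == "UNASSIGNED"
    ]
    return per_node, unassigned


# Hypothetical sample shaped like rows from GET /_cat/shards?format=json
rows = [
    {"index": "library", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-a"},
    {"index": "library", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-b"},
    {"index": "library", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-b"},
    {"index": "library", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": None},
]

per_node, unassigned = summarize_shards(rows)
print(dict(per_node))  # {'node-a': 1, 'node-b': 2}
print(unassigned)      # [('library', '1', 'r')]
```

An uneven per-node count or a non-empty unassigned list is a signal to look at cluster health and allocation settings more closely.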
Scaling and Resilience
- Add nodes to the cluster to spread shards more widely and handle larger request volumes.
- Configure replica shards so the cluster keeps redundant copies of the data in case of node failure.
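Unlike the primary shard count, the replica count is a dynamic index setting, so it can be raised after new nodes have joined. The sketch below only assembles such a request for the update index settings API. The helper name is hypothetical, and sending the request is left to whichever client you use.

```python
import json


def replica_update_request(index_name, number_of_replicas):
    """Return the method, path, and body that change an index's replica count."""
    if number_of_replicas < 0:
        raise ValueError("number_of_replicas must not be negative")
    path = f"/{index_name}/_settings"
    body = {"index": {"number_of_replicas": number_of_replicas}}
    return "PUT", path, body


method, path, body = replica_update_request("my_index", 2)
print(method, path)  # PUT /my_index/_settings
print(json.dumps(body, indent=2))
```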
Main Concepts and Syntax
1. Creating an Index:
To create an index in Elasticsearch, you can use the following syntax:
PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}
Explanation:
- PUT /my_index: This API call creates an index named my_index.
- "number_of_shards": 5: Specifies that the index should be divided into 5 primary shards.
- "number_of_replicas": 1: Configures 1 replica shard for each primary shard, ensuring data redundancy.
- "mappings": Defines the structure of the documents within the index, specifying types and properties of fields like title, content, and timestamp.
2. Indexing a Document:
To index a document into Elasticsearch, use the POST method:
POST /my_index/_doc/1
{
  "title": "Introduction to Elasticsearch",
  "content": "Elasticsearch is a distributed search engine built on Apache Lucene.",
  "timestamp": "2024-07-04"
}
Explanation:
- POST /my_index/_doc/1: Indexes a document with ID 1 into the my_index index.
- "_doc": The document endpoint. Custom mapping types were deprecated in Elasticsearch 7.x and removed in 8.x, and _doc remains as the fixed endpoint name in the path.
- Document fields (title, content, timestamp) are indexed according to the mappings defined earlier.
3. Searching Documents:
To search for documents within an index, use the GET method:
GET /my_index/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch"
    }
  }
}
Explanation:
- GET /my_index/_search: Executes a search query within the my_index index.
- "match": { "title": "Elasticsearch" }: Specifies a query that matches documents where the title field contains the term "Elasticsearch".
4. Updating Documents
Documents can be updated using the Elasticsearch update API.
POST /my_index/_update/1
{
  "doc": {
    "content": "Updated content of the document."
  }
}
Explanation:
- POST /my_index/_update/1: Updates the document with ID 1 in the my_index index.
- "doc": The fields to be updated.
5. Deleting Documents
Delete documents using the delete API.
DELETE /my_index/_doc/1
Explanation:
- DELETE /my_index/_doc/1: Deletes the document with ID 1 from the my_index index.
6. Managing Indices
i. Viewing Index Information
Retrieve index information to monitor and manage indices.
GET /my_index
Explanation:
- GET /my_index: Retrieves information about the my_index index.
ii. Deleting an Index
Delete indices when they are no longer needed.
DELETE /my_index
Explanation:
- DELETE /my_index: Deletes the my_index index.
Examples
Index Creation
Creating an index with 3 primary shards and 1 replica of each primary:
PUT /library
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "author": {
        "type": "text"
      },
      "publish_date": {
        "type": "date"
      },
      "content": {
        "type": "text"
      }
    }
  }
}
Output:
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "library"
}
Indexing a Document
Indexing a book document into the library index:
POST /library/_doc/1
{
  "title": "Elasticsearch Basics",
  "author": "John Doe",
  "publish_date": "2023-06-01",
  "content": "This book covers the basics of Elasticsearch."
}
Output:
{
  "_index": "library",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
Searching Documents
Searching for documents with the title containing "Elasticsearch":
GET /library/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch"
    }
  }
}
Output:
{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "library",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "title": "Elasticsearch Basics",
          "author": "John Doe",
          "publish_date": "2023-06-01",
          "content": "This book covers the basics of Elasticsearch."
        }
      }
    ]
  }
}
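A match query on a text field finds the document because the field was analyzed into tokens at index time and the query text goes through the same analysis. The toy inverted index below sketches that idea. It is a strong simplification: the analyzer here just lowercases and splits on word characters, there is no relevance scoring, and the second document is hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase the text and split it into word tokens (toy analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs, field):
    """Map each token to the set of document IDs whose field contains it."""
    index = defaultdict(set)
    for doc_id, source in docs.items():
        for token in analyze(source[field]):
            index[token].add(doc_id)
    return index


def match(index, query_text):
    """Return IDs of documents containing at least one of the query tokens."""
    hits = set()
    for token in analyze(query_text):
        hits |= index.get(token, set())
    return hits


docs = {
    "1": {"title": "Elasticsearch Basics"},
    "2": {"title": "Lucene in Depth"},  # hypothetical second document
}
title_index = build_inverted_index(docs, "title")

print(sorted(match(title_index, "Elasticsearch")))         # ['1']
print(sorted(match(title_index, "elasticsearch lucene")))  # ['1', '2']
```

The second call returns both documents because a match query combines its terms with OR by default.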
Conclusion
In this guide, we covered when and how to use indices and shards to manage documents in Elasticsearch, how to create indices and index documents, and how to run queries and interpret the results. With these ideas in hand, you can use Elasticsearch to build large-scale, efficient data stores and search solutions.