I thought about OpenSearch configuration patterns
Recently, we experienced an issue with our OpenSearch Service used for business operations.
We couldn't access the Dashboard, and when checking the logs, there was an endless stream of Java error messages.
Upon investigation, we found that the t3.small instances being used for data nodes had become unstable, resulting in a loss of quorum between these two nodes.
Since the usage was limited to internal company purposes, the impact wasn't too severe. However, during our investigation, we discovered that our current configuration of two data nodes (using t3.small instances) is not recommended for production environments.
Since we had been using the existing setup somewhat mindlessly, I'd like to take this opportunity to consider various configuration patterns for OpenSearch from the perspective of instance types and server count.
※This article focuses on OpenSearch version 2.13.
- Reference documents
- AWS OpenSearch Service Best Practices
- Amazon OpenSearch Service Quotas
- Cost Optimization
- Amazon OpenSearch Service Pricing
- How do I improve the fault tolerance of my Amazon OpenSearch Service domain?
- Amazon OpenSearch Service Dedicated Master Nodes
- Four Common Misunderstandings About Operating Elasticsearch
Instance Type Selection
First, let's look at instance type selection.
The documentation recommends r6g.large for small production workloads (both as data nodes and dedicated master nodes).
I've listed relatively low-cost instance types in this range:
- Instance usage fees (ap-northeast-1) ※As of September 2024
| Instance Type | vCPU | Memory (GiB) | Hourly Cost (USD) | Monthly Cost (USD) |
|---|---|---|---|---|
| t3.small.search | 2 | 2 | 0.056 | 40.8 |
| t3.medium.search | 2 | 4 | 0.112 | 81.6 |
| r6g.large.search | 2 | 16 | 0.202 | 147.6 |
| r6g.xlarge.search | 4 | 32 | 0.404 | 295.2 |
| r7g.medium.search | 1 | 8 | 0.107 | 78.12 |
| r7g.large.search | 2 | 16 | 0.214 | 156.24 |
| r7g.xlarge.search | 4 | 32 | 0.429 | 312.48 |
| m6g.large.search | 2 | 8 | 0.164 | 119.52 |
| m7g.large.search | 2 | 8 | 0.175 | 127.68 |
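As a quick sanity check on the monthly figures, here is a small Python helper. It assumes a ~730-hour month (24 × 365 ÷ 12); the table's own figures appear to use slightly different rounding, so treat the results as estimates.

```python
# Rough monthly-cost helper for the table above.
# Assumes ~730 hours per month; the article's figures round slightly
# differently, so results are estimates, not billing-exact numbers.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_usd: float) -> float:
    """Estimate the monthly on-demand cost from an hourly rate."""
    return round(hourly_usd * HOURS_PER_MONTH, 2)

# r6g.large.search at $0.202/hour:
print(monthly_cost(0.202))  # ≈ 147.46, close to the table's 147.6
# t3.small.search at $0.056/hour:
print(monthly_cost(0.056))  # ≈ 40.88, close to the table's 40.8
```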
- T-series instances
Burstable T-series instances (such as T2 or T3) can handle temporary load spikes but become unstable under sustained load, so they should be avoided in production environments.
Conversely, for development environments, T-series instances may be chosen to keep costs down.
- r6g.xxx.search series instances
OpenSearch Service is constantly adopting new Amazon EC2 instance types that provide better performance at a lower cost. We recommend always using the latest generation instances.
Don't use T2 or t3.small instances for production domains because they can become unstable under sustained heavy loads. r6g.large instances are an option for small production workloads (both as data nodes and dedicated master nodes).
For stability, it seems safest to use r6g.large.search in production environments.
If higher specifications are needed, r6g.xlarge.search would be the next best choice.
- r7g.xxx.search series instances
The r7g series, which is the next generation after the recommended r6g.large, includes the r7g.medium which is less expensive despite having somewhat lower specifications.
Since it's not a burstable instance, it offers stability, but the CPU and memory directly impact search and write performance, so caution is needed.
If you can sacrifice some performance, this could be an option to consider when trying to reduce costs.
On the other hand, r7g.large.search is more expensive than the sixth-generation r6g.large.search.
OpenSearch Service is constantly adopting new Amazon EC2 instance types that provide better performance at a lower cost. We recommend always using the latest generation instances.
Although the documentation states this, the pricing may be revised downward in the future. Let's hope for that.
- m6g.xxx.search series instances
According to this document, m6g.large.search is included in the recommendations as a minimum instance type for dedicated master nodes.
It's less expensive compared to the r6g.xxx series, so it might be a viable option.
- Other instance types
There are other instance types like c6g.xxx.search and higher-spec instances, but including them would give us too many options to consider, so I'll omit them for now.
Master Node & Data Node Count Selection
Next, let's look at selecting the number of master and data nodes.
Here we'll consider:
- What is the appropriate number of master and data nodes?
- Whether to add dedicated master nodes or not?
The recommended configuration is quite large with 3 dedicated master nodes + 3 data nodes, so I've enumerated some smaller patterns below.
| Pattern | Master Nodes | Data Nodes | Characteristics | Use Case |
|---|---|---|---|---|
| Minimum Configuration | - | 1 | Low fault tolerance | Development/Testing |
| Low Workload Configuration | - | 3 | Can continue with 1 node down | Development or light production |
| Recommended Configuration | 3 | 3 | Cluster stability and redundancy | Production |
| Serverless Configuration | - | - | Automatic provisioning | Production |
※In configurations with data nodes only, the data nodes also serve as master nodes.
- Minimum Configuration
With only one data node, fault tolerance is low, and the entire cluster will stop if the node goes down.
Therefore, this is an option to consider for development or test environments where cost is a priority.
- Low Workload Configuration
In a low workload configuration with 3 data nodes, operations can continue even if one node fails.
However, without dedicated master nodes to focus on cluster management (master election, shard allocation, etc.), there's a risk of unstable performance.
Since the documentation recommends using dedicated master nodes for production workloads, this configuration is suitable for development environments or light production workloads.
- Recommended Configuration
The recommended configuration consists of 3 dedicated master nodes + 3 data nodes, providing high cluster stability and redundancy.
This configuration also allows the use of Multi-AZ with Standby for higher availability.
However, the increased number of instances results in higher costs.
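To make the recommended layout concrete, here is a sketch of the cluster settings as they might be passed to boto3's `create_domain` for the OpenSearch Service client. The domain name and the surrounding call are illustrative only; verify the key names against the boto3 documentation for your SDK version before use.

```python
# Sketch of the ClusterConfig for the recommended 3 masters + 3 data nodes
# layout. This is a plain dict, not a live API call.
recommended_cluster_config = {
    "InstanceType": "r6g.large.search",         # data nodes
    "InstanceCount": 3,
    "DedicatedMasterEnabled": True,
    "DedicatedMasterType": "r6g.large.search",  # dedicated master nodes
    "DedicatedMasterCount": 3,
    "ZoneAwarenessEnabled": True,               # spread nodes across AZs
    "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    "MultiAZWithStandbyEnabled": True,          # Multi-AZ with Standby
}

# The actual call (requires AWS credentials and permissions) would look
# roughly like:
# import boto3
# boto3.client("opensearch").create_domain(
#     DomainName="my-domain", ClusterConfig=recommended_cluster_config)
```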
- Serverless Configuration
For larger-scale operations, OpenSearch Service Serverless can also be considered.
OpenSearch Service Serverless doesn't require provisioning instances in advance and automatically scales, ensuring flexible scalability.
Computing power is measured in OpenSearch Compute Units (OCUs), and you're billed based on the number of OCUs used per hour.
For pricing details, this blog is helpful.
- Minimum price with replicas enabled (2 OCU: 0.5 × 4) ※for production: $488.976/month
- Minimum price with replicas disabled (1 OCU: 0.5 × 2) ※for development/testing: $244.488/month
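As a back-of-envelope check on these minimums, the per-OCU hourly rate below is derived from the $488.976 figure and a ~730-hour month, not taken from the current AWS price list, so treat it as an approximation.

```python
# Back-of-envelope check of the serverless minimum prices above.
# OCU_HOURLY is derived from the article's monthly figure, not from the
# official AWS price list.
HOURS_PER_MONTH = 730
FULL_MIN_MONTHLY = 488.976  # 2-OCU minimum, replicas enabled
OCU_HOURLY = FULL_MIN_MONTHLY / (2 * HOURS_PER_MONTH)  # ≈ $0.335/OCU-hour

def serverless_monthly(ocus: float) -> float:
    """Estimated monthly cost for a given average OCU count."""
    return round(ocus * OCU_HOURLY * HOURS_PER_MONTH, 3)

print(serverless_monthly(2))  # 488.976 (replicas enabled)
print(serverless_monthly(1))  # 244.488 (replicas disabled)
```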
- Other patterns
First, configurations with an even number of nodes, such as 2 data nodes, should be avoided.
This was something I misunderstood as well, but with a configuration of only 2 data nodes, if one data node goes down, the cluster will stop due to quorum loss.
OpenSearch's quorum is calculated as "(number of master-eligible nodes / 2, rounded down) + 1". With only 2 master-eligible nodes the quorum is 2, so losing one node stops the cluster; to keep the cluster running through a single-node failure, at least 3 data nodes are needed.
So, a configuration with only 2 data nodes might seem highly available, but from an availability perspective, it's no different from a configuration with just 1 data node. For more details, refer to this document.
Having an odd number of nodes in the cluster ensures that during a network partition, there will be a group that meets the quorum (majority) requirement and can elect a new master.
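The quorum arithmetic above can be sketched in a few lines of Python:

```python
# Quorum needed for a cluster to keep electing a master:
# floor(master_eligible_nodes / 2) + 1.
def quorum(master_eligible: int) -> int:
    return master_eligible // 2 + 1

def survives_one_failure(master_eligible: int) -> bool:
    """Can the cluster still reach quorum after losing one node?"""
    return master_eligible - 1 >= quorum(master_eligible)

print(quorum(2), survives_one_failure(2))  # 2 False -> 2 nodes: quorum lost
print(quorum(3), survives_one_failure(3))  # 2 True  -> 3 nodes: still fine
```

This is why 2 data nodes buy no more availability than 1: the quorum for 2 nodes is 2, so any single failure halts the cluster.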
When adding dedicated master nodes, you can choose between 3 or 5 master nodes, but a 5-node configuration becomes too expensive, so I've omitted it from this discussion.
Personally Recommended Patterns
For Development Environments
If cost is the primary consideration, for development environments:
- Instance type: t3.small.search
- Data nodes: 1

would be sufficient.
If you want to increase the specifications, selecting t3.medium.search might be a good option.
For Low Workload Environments
Personally, since the recommended configuration is quite costly, if you can sacrifice some performance and stability:
- Instance type: r7g.medium.search
- Data nodes: 3

might be a good starting point.
From there, you can adjust the instance type and node count, or add dedicated master nodes, based on your workload.
T-series burstable instances often become unstable when CPU usage increases during shard reallocation (such as when adding nodes or updating versions), so I recommend avoiding them for production use.
In my experience, identifying the cause and recovering from such issues can take quite some time.
For Production Environments
It's a difficult decision, but starting with the recommended configuration:
- Instance type: r6g.large.search
- Nodes: 3 dedicated master nodes + 3 data nodes

would be the safest approach. (Though it costs quite a bit: $147.6/month × 6 nodes = $885.6/month.)
On the other hand, a serverless configuration (with replicas enabled) would have a minimum usage fee of $488.976/month, which is cheaper than the recommended configuration. However, if you have a stable workload over a long period, the serverless configuration might end up costing more.
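The trade-off can be put in rough numbers. All figures below are the article's, and the break-even point is simple arithmetic on them, not an official AWS number.

```python
# Rough comparison of the two production cost floors discussed above.
provisioned = round(147.6 * 6, 1)   # 3 masters + 3 data, r6g.large.search
serverless_min = 488.976            # 2-OCU serverless minimum (replicas on)
per_ocu_month = serverless_min / 2  # approx. monthly cost of one OCU

# Serverless stays cheaper only while average usage is below this many OCUs:
break_even_ocus = provisioned / per_ocu_month

print(provisioned, round(break_even_ocus, 1))  # 885.6 3.6
```

So the serverless minimum is roughly half the recommended configuration's cost, but a steady workload averaging more than about 3.6 OCUs would tip the balance the other way.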
Also, OpenSearch Serverless doesn't support some OpenSearch API operations and OpenSearch plugins.
If these constraints are acceptable, considering a serverless configuration might be a good idea.
One more point to note is that data migration between an OpenSearch Service domain and an OpenSearch Serverless Collection isn't yet provided by AWS, so if you decide to switch, you'll need to migrate the data yourself.
Conclusion
I found there's quite a lot to consider when selecting instance types and server configurations for OpenSearch.
Especially for production environments, it's important to use an odd number of master-eligible nodes to avoid losing quorum and to introduce dedicated master nodes to keep the cluster stable.
I hope this article will be helpful for future OpenSearch deployments and operations.