Getting Started with Elasticsearch: A Practical Introduction

Elasticsearch is a distributed search and analytics engine used for log analysis, full-text search, and business intelligence. This post covers the fundamentals, local setup, indexing sample data, and querying it.

What is Elasticsearch?

Elasticsearch is an open-source, distributed, RESTful search engine based on Apache Lucene.
It is designed for full-text search, structured search, analytics, and near real-time data retrieval.
It is used in products such as the ELK stack (Elasticsearch, Logstash, Kibana).

What is REST/RESTful?
REST stands for Representational State Transfer, an architectural style for building scalable APIs on top of HTTP verbs (GET, POST, PUT, DELETE). Elasticsearch exposes a RESTful API, so every operation is an HTTP request against a URL.
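
To make the verb-to-operation idea concrete, here is a minimal sketch of how common document operations line up with HTTP methods and paths. The helper name `rest_call` and the index name `stocks` are illustrative only and not part of any client library; the paths follow the `_doc` and `_search` endpoints.

```python
# Hypothetical helper: maps a document operation to (HTTP method, URL path).
def rest_call(operation, index, doc_id=None):
    routes = {
        "create_or_replace": ("PUT", f"/{index}/_doc/{doc_id}"),
        "read": ("GET", f"/{index}/_doc/{doc_id}"),
        "delete": ("DELETE", f"/{index}/_doc/{doc_id}"),
        "search": ("POST", f"/{index}/_search"),  # GET is accepted as well
    }
    return routes[operation]


for op in ("create_or_replace", "read", "delete", "search"):
    method, path = rest_call(op, "stocks", doc_id=1)
    print(f"{op:18} -> {method} {path}")
```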

Core Concepts

Cluster: A group of nodes (servers) working together.
Node: An instance of Elasticsearch running on a machine.
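Index: A named collection of documents. Older guides compare it to a database in an RDBMS; since mapping types were removed, a table is the closer analogy.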
Document: A JSON object representing the basic unit of data.
Shard: A horizontal partition of an index.
Replica: A copy of a primary shard, kept on a different node for fault tolerance and high availability.
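
To see how documents end up on shards, the sketch below hashes a routing key (by default the document `_id`) and takes it modulo the number of primary shards. This is a simplification: Elasticsearch uses a Murmur3 hash internally, and `zlib.crc32` is only a stand-in here. The function name `pick_shard` is hypothetical.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    # Stand-in hash for illustration; Elasticsearch itself uses Murmur3, not CRC32.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# The same key always maps to the same shard for a given shard count.
for doc_id in ("0", "1", "2", "3"):
    print(doc_id, "-> shard", pick_shard(doc_id, num_primary_shards=3))
```

Because the result depends on the shard count, this also hints at why the number of primary shards is fixed when an index is created.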

Querying Data with Query DSL

Query DSL (Domain-Specific Language) is Elasticsearch’s powerful, JSON-based language for building queries.

Example queries:

Match Query (full-text search):

  {
    "query": {
      "match": {
        "message": "error"
      }
    }
  }

Term Query (exact match):

  {
    "query": {
      "term": {
        "status.keyword": "active"
      }
    }
  }

Range Query (e.g., prices greater than 180):

  {
    "query": {
      "range": {
        "close": {
          "gt": 180
        }
      }
    }
  }
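
The three query types above can also be combined. As a rough sketch, the Python snippet below builds each clause as a plain dictionary and nests them in a `bool` query, with the scored full-text clause under `must` and the exact-match and range clauses under `filter`. The helper names are made up for illustration, and the field names are just the ones from the examples above; a real index would not necessarily contain all three.

```python
import json


def match(field, text):
    return {"match": {field: text}}


def term(field, value):
    return {"term": {field: value}}


def range_gt(field, value):
    return {"range": {field: {"gt": value}}}


# "must" clauses contribute to the relevance score; "filter" clauses only include/exclude.
body = {
    "query": {
        "bool": {
            "must": [match("message", "error")],
            "filter": [term("status.keyword", "active"), range_gt("close", 180)],
        }
    }
}
print(json.dumps(body, indent=2))
```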

Setting Up Elasticsearch Locally

Run the Docker Container

Run Elasticsearch (version 8.10.2) as a single-node cluster:

docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.10.2

Elasticsearch 8.x enables security (HTTPS, authentication) by default.
For quick local testing, the command above turns these features off by passing the environment variable -e "xpack.security.enabled=false".

Open http://localhost:9200 in your browser. A JSON response showing the cluster name and version confirms that the node is up.
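
The same check can be scripted. This is a minimal sketch using only the standard library; the function names `fetch_root_info` and `summarize` are made up for this post, and it assumes the container from the previous step is listening on port 9200 with security disabled.

```python
import json
from urllib.request import urlopen


def fetch_root_info(url="http://localhost:9200", timeout=5):
    """GET the root endpoint and return the decoded JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(info):
    """Pull out the two fields worth checking from the root response."""
    return f'{info["cluster_name"]} running Elasticsearch {info["version"]["number"]}'


# Usage (needs the container running):
#   print(summarize(fetch_root_info()))
```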

Ingesting Yahoo Finance Data

1. Download Data with yfinance

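import pandas as pd
import yfinance as yf

data = yf.download("AAPL", period="1y", interval="1d")
# Newer yfinance releases may return MultiIndex columns (field, ticker); keep only the field names
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)
data.to_csv("aapl_1year.csv")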

2. Index Data into Elasticsearch

from elasticsearch import Elasticsearch, helpers
import pandas as pd

# Connect to local Elasticsearch (HTTP - security disabled)
es = Elasticsearch("http://localhost:9200")

# Load the CSV data (or directly use the dataframe)
df = pd.read_csv("aapl_1year.csv")

# Drop rows with missing values; filling prices with 0 would distort range queries
df = df.dropna()

# Prepare data for bulk indexing
actions = []
for idx, row in df.iterrows():
    doc = {
        "_index": "stocks",
        "_id": idx,
        "_source": {
            "date": row['Date'],
            "open": row['Open'],
            "high": row['High'],
            "low": row['Low'],
            "close": row['Close'],
            "volume": int(row['Volume'])
        }
    }
    actions.append(doc)

# Bulk index; helpers.bulk returns (number of successful actions, errors)
success, _ = helpers.bulk(es, actions)
print(f"Indexed {success} documents successfully!")
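
The script above never defines a mapping, so Elasticsearch infers field types from the first documents it sees (dynamic mapping). If you would rather pin the types yourself, one option is to create the index with an explicit mapping before the bulk step. The sketch below is an assumption about what a reasonable mapping for this data looks like (for example `float` rather than `double` for prices), and `create_stocks_index` is a hypothetical helper that expects the 8.x Python client object `es` from the script.

```python
# Explicit field types for the "stocks" index (a suggested mapping, not the only valid one).
STOCKS_MAPPING = {
    "properties": {
        "date": {"type": "date"},
        "open": {"type": "float"},
        "high": {"type": "float"},
        "low": {"type": "float"},
        "close": {"type": "float"},
        "volume": {"type": "long"},
    }
}


def create_stocks_index(es, index="stocks"):
    """Create the index with an explicit mapping unless it already exists."""
    if not es.indices.exists(index=index):
        es.indices.create(index=index, mappings=STOCKS_MAPPING)
```

Call `create_stocks_index(es)` once, before `helpers.bulk`, so the mapping is in place when the first document arrives.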

Testing and Querying Your Data

Once your data is indexed, you can test various queries using curl commands or Python. Here are some practical examples with real outputs:

1. Basic Health Check

curl -X GET "http://localhost:9200"

Output:

{
  "name" : "xxxxxx",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "xxxxxx",
  "version" : {
    "number" : "8.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "6d20dd8ce62365be9b1aca96427de4622e970e9e",
    "build_date" : "2023-09-19T08:16:24.564900370Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

2. Check the Document Count

curl -X GET "http://localhost:9200/stocks/_count?pretty"

Output:

{
  "count" : 250,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
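
The same kind of check works from Python. As a sketch, the function below runs the range query from earlier against the `stocks` index and returns the highest closing prices above a threshold. The name `closes_above` is hypothetical, and it assumes the 8.x Python client, where `search` accepts `query`, `sort`, and `size` as keyword arguments.

```python
def closes_above(es, threshold, index="stocks", size=5):
    """Return (date, close) pairs for days that closed above `threshold`, highest first."""
    resp = es.search(
        index=index,
        query={"range": {"close": {"gt": threshold}}},
        sort=[{"close": "desc"}],
        size=size,
    )
    return [(hit["_source"]["date"], hit["_source"]["close"]) for hit in resp["hits"]["hits"]]


# Usage (needs the local node and the indexed data):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   for date, close in closes_above(es, 180):
#       print(date, close)
```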

Scaling and Resilience

Elasticsearch is built for distributed environments and high availability:

Distributed Architecture: Elasticsearch clusters can span multiple nodes (servers)
Sharding: Large indices are split into smaller pieces distributed across nodes
Replication: Creates copies of shards for fault tolerance and high availability
Master Election: Nodes coordinate via master elections to manage cluster state
Auto-Recovery: Supports automatic failover and rebalancing when nodes join/leave
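
Sharding and replication are configured per index through two settings, `number_of_shards` and `number_of_replicas`. The sketch below builds such a settings block and works out how many shard copies the cluster has to place; the helper names are made up for illustration, and the values 3 and 1 are arbitrary.

```python
def index_settings(primaries, replicas):
    """Settings block as it would be sent when creating an index."""
    return {"number_of_shards": primaries, "number_of_replicas": replicas}


def total_shard_copies(primaries, replicas):
    # Each primary shard has `replicas` additional copies.
    return primaries * (1 + replicas)


settings = index_settings(primaries=3, replicas=1)
print(settings, "->", total_shard_copies(3, 1), "shard copies in total")
```

With 3 primaries and 1 replica each, the cluster holds 6 shard copies. A replica is never placed on the same node as its primary, so on the single-node setup from this post any replicas stay unassigned.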

Common Use Cases

1. Log and Event Analysis

ELK Stack (Elasticsearch + Logstash + Kibana)
Real-time monitoring and troubleshooting
Application performance monitoring (APM)

2. Search Applications

Website search engines with relevance ranking
E-commerce product search
Document management systems

3. Analytics and Monitoring

Real-time dashboards and alerting
Business intelligence and metrics
Security information and event management (SIEM)

4. Advanced Analytics

Text analytics and natural language processing
Geospatial search and location-based services
Machine learning and anomaly detection

Next Steps

With your Elasticsearch foundation in place, consider exploring:

Kibana Integration - Add visual dashboards and data exploration
Logstash/Beats - Implement real-time data ingestion pipelines
Production Deployment - Enable security, clustering, and monitoring
Advanced Queries - Explore full-text search, aggregations, and machine learning features
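
As a small taste of aggregations, here is a sketch of a request body that groups the stock data by calendar month and averages the closing price in each bucket. It assumes the `date` field is mapped as a date type; the aggregation names `per_month` and `avg_close` are arbitrary labels.

```python
import json

monthly_avg_close = {
    "size": 0,  # skip individual hits and return only the aggregation buckets
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "date", "calendar_interval": "month"},
            "aggs": {"avg_close": {"avg": {"field": "close"}}},
        }
    },
}
print(json.dumps(monthly_avg_close, indent=2))
```

This body can be sent to the `stocks/_search` endpoint with curl, in the same way as the queries shown earlier.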

Feel free to leave your comments below if you want to see any other topics covered! 💬