System Design Cheatsheets: ElasticSearch
Introduction
ElasticSearch
When not to use ElasticSearch
How to use ElasticSearch in your system design
Conclusion

Understand how and when to use ElasticSearch in your systems, with three practical system design examples

Towards Data Science

What is search, and why is it important?

If you’ve read my previous articles on search, you know how critical search is to an application. Think about it: of all the web and mobile apps you use daily, be it Netflix, Amazon, Swiggy, etc., the search bar is probably the one common UI element in all of them, and it is usually on the homepage, right at the top. If you are designing a system, ninety-nine times out of a hundred you’ll have to think about how to power search.

Building a search system is no small feat, but a great starting point is ElasticSearch. If you don’t know anything about how search or recommendation systems work, this blog post is a great starting point for you. We will discuss what ElasticSearch is, where it works and where it doesn’t, and three common designs in which ElasticSearch is used. There is a lot more to a search system, but more on that towards the end of the article.

What’s ElasticSearch?

ElasticSearch is a popular database that does something most databases struggle with: searching. Searching is so core to ElasticSearch, it’s literally in its name!

But if you haven’t heard about ElasticSearch, you’re probably wondering: why is searching so difficult? Why can’t a relational database perform a search? Most relational databases support various ways to look up and filter data, such as the WHERE clause, the LIKE keyword, or indexes. Or why can’t a document database like MongoDB work? You can write find queries in MongoDB as well.

To understand the answer, imagine you are building a news website. When the user searches for news using your search bar, maybe for “COVID19 infections in New Delhi”, the user is interested in all the articles that talk about COVID infections in New Delhi. In a simple search system, this would mean scanning all the articles in the database and returning the ones that contain the words “COVID19”, “infections” or “New Delhi”. You can’t do this efficiently with a relational database. A relational database lets you search for articles based on specific attributes, for example, articles written by a particular author or articles published today, but it can’t (at least, not efficiently) perform a search in which it scans every news article (often numbering in the millions) and returns the ones that contain certain words.

Furthermore, there are a lot more intricacies to consider. How do you score these articles? Maybe one article talks about the spread of COVID19 infections and another talks about recent infections; how do you know which is more relevant to the user’s query? In other words, how do you sort these articles by relevance?

Answer: ElasticSearch! ElasticSearch can do all this and much more right out of the box.

But, like everything else in the world, it comes with its fair share of disadvantages. Let’s discuss what ElasticSearch is, when to use it, and most importantly when it doesn’t make sense.

Searching Capabilities

ElasticSearch provides a way to perform a “full-text search”. Full-text search refers to searching for a phrase or a word in a huge corpus of documents. Let’s continue with our previous example: imagine you are building a news website that contains hundreds of thousands of news articles. Each article contains some data, like a heading, a subheading, the content of the article, when it was published, etc. In the context of ElasticSearch, each article is stored as a JSON document.

You can load all these documents into ElasticSearch and then search for specific words or phrases within each of those documents in a few milliseconds. So if you load up all the news articles and then perform a search for “COVID19 infections in Delhi”, ElasticSearch returns all the articles that have the words “COVID19”, “infections”, or “Delhi”.
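
One way to picture why such lookups can be fast: rather than scanning every article at query time, ElasticSearch (through Apache Lucene) maintains an inverted index that maps each term to the documents containing it. The sketch below is a toy Python version of that idea, not how ElasticSearch is actually implemented; the helper names (`tokenize`, `build_inverted_index`, `search_any`) and the sample headlines are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_any(index, query):
    """Return ids of documents containing at least one query term."""
    matches = set()
    for term in tokenize(query):
        matches |= index.get(term, set())
    return matches


# Made-up sample documents, keyed by a hypothetical document id.
docs = {
    1: "COVID19 infections rise in Delhi",
    2: "Delhi weather update",
    3: "Stock markets rally",
}
index = build_inverted_index(docs)
print(sorted(search_any(index, "COVID19 infections in Delhi")))  # [1, 2]
```

Each query term becomes a dictionary lookup instead of a scan over every document, which is roughly why the cost of a search does not grow with the full size of the text.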

To demonstrate searching in ElasticSearch, let’s set up Elasticsearch and load some data into it. For this post, I’ll use this News dataset I found on Kaggle (Misra, Rishabh. “News Category Dataset.” arXiv preprint arXiv:2209.11429 (2022)) (Source) (License). The dataset is pretty simple: it contains around 210,000 news articles, with their headlines, short descriptions, authors, and a few other fields we don’t care much about. We don’t really need all 210,000 documents, so I’ll load around 10,000 documents into ES and start searching.

These are a few examples of the documents in the dataset:

[
  {
    "link": "https://www.huffpost.com/entry/new-york-city-board-of-elections-mess_n_60de223ee4b094dd26898361",
    "headline": "Why New York City’s Board Of Elections Is A Mess",
    "short_description": "“There’s a fundamental problem having partisan boards of elections,” said a New York elections attorney.",
    "category": "POLITICS",
    "authors": "Daniel Marans",
    "country": "IN",
    "timestamp": 1689878099
  },
  ....
]

Each document represents a news article. Each article contains a link, a headline, a short_description, a category, authors, a country (random values, added by me), and a timestamp (again random values, added by me).
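
For readers who want to try loading documents like these, here is a minimal sketch of preparing a request body for Elasticsearch’s `_bulk` endpoint, which expects newline-delimited JSON: an action line followed by the document itself. The `news` index name matches the query used later in this post; the helper name `build_bulk_payload` and the two sample articles are made up for illustration, and the sketch only builds the payload without sending it anywhere.

```python
import json


def build_bulk_payload(index_name, documents):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc in documents:
        # Action line: index this document into index_name with an auto-generated id.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample articles using a subset of the dataset's fields.
articles = [
    {"headline": "Example headline one", "category": "POLITICS", "country": "IN"},
    {"headline": "Example headline two", "category": "WORLD NEWS", "country": "US"},
]
payload = build_bulk_payload("news", articles)
print(payload)
```

The resulting string would be sent as the body of a `POST _bulk` request with the `Content-Type: application/x-ndjson` header, using whichever HTTP client or Elasticsearch client you prefer.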

Elasticsearch queries are written in JSON. Instead of diving deep into all the different syntaxes you can use to create search queries, let’s start simple and build from there.

One of the simplest full-text queries is the multi_match query (don’t worry too much about querying data in ElasticSearch; it’s pretty simple and we’ll talk about it towards the end of the article). The idea is simple: you write a query and Elasticsearch performs a full-text search, essentially looking through all the documents in your database, finding the ones that contain the words in that query, assigning a score to them, and returning them. For example,

GET news/_search
{
  "query": {
    "multi_match": {
      "query": "COVID19 infections"
    }
  }
}

The above query finds relevant articles for the query “COVID19 infections”. These are the results I got back:

[
  {
    "_index" : "news",
    "_id" : "czrouIsBC1dvdsZHkGkd",
    "_score" : 8.842152,
    "_source" : {
      "link" : "https://www.huffpost.com/entry/china-shanghai-lockdown-coronavirus_n_62599aa1e4b0723f8018b9c2",
      "headline" : "Strict Coronavirus Shutdowns In China Continue As Infections Rise",
      "short_description" : "Access to Guangzhou, an industrial center of 19 million people near Hong Kong, was suspended this week.",
      "category" : "WORLD NEWS",
      "authors" : "Joe McDonald, AP",
      "country" : "IN",
      "timestamp" : 1695106458
    }
  },
  {
    "_index" : "news",
    "_id" : "ODrouIsBC1dvdsZHlmoc",
    "_score" : 8.064016,
    "_source" : {
      "link" : "https://www.huffpost.com/entry/who-covid-19-pandemic-report_n_6228912fe4b07e948aed68f9",
      "headline" : "COVID-19 Cases, Deaths Continue To Drop Globally, WHO Says",
      "short_description" : "The World Health Organization said new infections declined by 5 percent in the last week, continuing the downward trend in COVID-19 infections globally.",
      "category" : "WORLD NEWS",
      "authors" : "",
      "country" : "US",
      "timestamp" : 1695263499
    }
  },
  ....
]

As you can see, it returns documents that discuss COVID19 infections. It also returns them sorted in order of relevance (the _score field indicates how relevant a particular document is).
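
To give a feel for where a number like `_score` can come from: recent Elasticsearch versions use the BM25 similarity by default, which rewards rare query terms and repeated matches while normalizing for document length. The sketch below is a simplified BM25 in plain Python, assuming whitespace tokenization and the usual `k1=1.2`, `b=0.75` parameters; the function name and sample sentences are made up, and the scores it prints will not match what Elasticsearch returns.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            freq = term_freq[term]
            if freq == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up sample documents.
docs = [
    "covid19 infections rise as covid19 cases surge",
    "new infections reported in the city",
    "stock markets rally after earnings",
]
scores = bm25_scores("covid19 infections", docs)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

The first sentence matches both query terms and ranks highest, the second matches only “infections”, and the third matches nothing and scores zero.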

ElasticSearch has a rich query language with a lot of features, but for now, it is enough to know that building a simple search system is very easy: simply load all your data into ElasticSearch and use a simple query like the one we discussed. We have a plethora of options to improve, configure, and tweak search performance and relevance (again, more on search queries towards the end of this post).
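
As one small example of such tweaking, a `multi_match` query can list the fields to search and weight some of them more heavily with the `^` boost syntax. The sketch below only builds the request body as a Python dictionary; the helper name `build_news_query` and the choice of a boost of 2 are assumptions for illustration, not settings used elsewhere in this post.

```python
import json


def build_news_query(text, size=10):
    """Build a multi_match request body that weights headline matches higher."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # "^2" boosts matches in the headline field by a factor of 2.
                "fields": ["headline^2", "short_description"],
            }
        },
    }


body = build_news_query("COVID19 infections")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of `GET news/_search`, in the same way as the simpler query above.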

Distributed Architecture

ElasticSearch works as a distributed database. This means that there are multiple nodes in a single ElasticSearch cluster. If a single node becomes unavailable or fails, that doesn’t usually mean downtime for our system; the other nodes usually pick up the extra work and continue to serve user requests. So multiple nodes facilitate higher availability.

Multiple nodes also help us scale our systems: data and user requests can be divided across these nodes, which results in less load per node. For example, if you want to store 100 million news articles in ElasticSearch, you can split that data across multiple nodes, with each node storing a certain set of articles. And it’s pretty easy to do; in fact, ElasticSearch comes with built-in features to make this as easy and seamless as possible.
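
Conceptually, deciding which piece of the cluster holds a given article comes down to hashing a routing value (by default the document id) and taking it modulo the number of primary shards. The sketch below illustrates that idea only: Elasticsearch uses its own Murmur3-based hashing, whereas this example uses CRC32 as a stand-in, and the function name `pick_shard` and the `article-N` ids are hypothetical.

```python
import zlib
from collections import Counter


def pick_shard(doc_id, num_primary_shards):
    """Map a document id to a shard number in a stable, deterministic way."""
    # CRC32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Hypothetical document ids spread over 5 shards.
doc_ids = [f"article-{i}" for i in range(1000)]
distribution = Counter(pick_shard(doc_id, 5) for doc_id in doc_ids)
print(sorted(distribution.items()))
```

Because the same id always hashes to the same shard, any node can work out where a document lives without consulting a central lookup table.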

Scalability

ElasticSearch scales horizontally and can partition data across multiple nodes. This means you can usually improve query performance by adding more nodes to your ElasticSearch cluster.

There’s a lot more to architecting your ElasticSearch cluster than just running more servers, though. There are different types of nodes, the nodes host partitions of your data called “shards”, and every shard and node can have multiple types and configuration options. There’s a lot to discuss about the architecture of an ElasticSearch cluster and how it works, so I’ve written an entire post on the architecture here if you want to dive deeper into it.

TLDR: you can add more machines to scale your cluster and improve performance. Data and queries can be divided across multiple machines. This facilitates higher performance and high scalability.

Document-based data modeling

ElasticSearch is a document database that stores data in JSON document format, much like MongoDB. So, in our example, every news article is stored as a JSON document in the cluster.

Real-time data analysis

Real-time data analysis means observing user actions in real time and understanding user patterns and behavior. We can chart user behavior to better understand our users and use that to improve our product. For example, let’s say we measure each click, scroll event, and reading time per user on our news website. We chart these metrics in a dashboard and observe them for a few days. From this, we can collect a lot of actionable insights to improve our news app. Suppose we find that users usually visit the website between 9 and 10 AM, and that users generally click on articles that are relevant to their country. Using this information, we can overprovision resources during peak times (9 to 10 AM) and maybe show articles from the user’s country on their homepage.

Elasticsearch is well suited for real-time data analysis because of its distributed architecture and powerful search capabilities. When dealing with real-time data, such as logs, metrics, or social media updates, Elasticsearch efficiently indexes and stores this information. Its near real-time indexing allows data to be searchable almost immediately after ingestion. ElasticSearch also works well with other tools, like Kibana for visualization or Logstash and Beats for collecting metrics.
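
Charts like “clicks per hour” are typically backed by aggregations. As a rough sketch, the code below counts click events per hourly bucket in plain Python and also builds an Elasticsearch request body that asks for similar counts. The event shape, the helper names, and the field names are assumptions for illustration; the request body also assumes `timestamp` is mapped as a date field and `country` as a keyword field.

```python
import json
from collections import Counter


def clicks_per_hour(events):
    """Count events per hourly bucket, keyed by the bucket's start (epoch seconds)."""
    return Counter(event["timestamp"] - event["timestamp"] % 3600 for event in events)


def build_hourly_aggregation():
    """Request body asking Elasticsearch for similar hourly and per-country counts."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "clicks_per_hour": {
                "date_histogram": {"field": "timestamp", "calendar_interval": "hour"}
            },
            "clicks_per_country": {"terms": {"field": "country"}},
        },
    }


# Made-up click events.
events = [
    {"user": "u1", "country": "IN", "timestamp": 1689878099},
    {"user": "u2", "country": "IN", "timestamp": 1689878200},
    {"user": "u3", "country": "US", "timestamp": 1689885000},
]
print(clicks_per_hour(events))
print(json.dumps(build_hourly_aggregation(), indent=2))
```

The first two events fall into the same hourly bucket and the third into a later one, which is the kind of grouping a dashboard tool would plot as a time series.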

Towards the end of the article, we’ll look at an architecture that facilitates this.

When not to use ElasticSearch

Cost

ElasticSearch is expensive to run and maintain. As with everything in this world, everything good comes at a price. To perform full-text search, ElasticSearch keeps a large amount of data in RAM and builds complex indices. This means it requires a lot of RAM to run, which is expensive.

So, in short, it gives you amazing performance for full-text search, but it isn’t cheap.

ACID compliance

ElasticSearch, like most NoSQL databases, has very limited support for ACID, so if you want strong consistency or transactional support, ElasticSearch may not be the right database for you. One consequence is that when you insert a document (called “indexing” a document in ElasticSearch), it may not be searchable immediately; it can take a short while (by default up to about a second, until the next index refresh) before it becomes visible to searches.

Let’s say you are building a banking system; if a user deposits money into their account, you want that data to be visible immediately to every other transaction that the user performs. On the other hand, if you are using ElasticSearch to power search on your news website, then when a new article gets published, it’s probably acceptable that the article is not visible to all users for the first few moments.
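
This delay can be pictured with a toy model: newly indexed documents sit in a buffer and only become searchable after a refresh, which Elasticsearch performs periodically (by default roughly once per second). The `ToyIndex` class below is a made-up illustration of that behavior and not how Elasticsearch is implemented internally.

```python
class ToyIndex:
    """Toy model of near real-time search: writes become searchable after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Make everything written so far visible to searches.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [doc for doc in self._searchable if word.lower() in doc.lower()]


idx = ToyIndex()
idx.index("New article about Delhi")
before = idx.search("delhi")  # the document is not visible yet
idx.refresh()
after = idx.search("delhi")   # the document is visible after the refresh
print(before, after)
```

For a news site this brief gap is harmless, while for a bank balance it is exactly the kind of inconsistency you would want to avoid.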

When you need complex joins

ElasticSearch doesn’t support JOIN operations or relationships among different tables the way relational databases do. If you’ve been using relational databases, this might come as a bit of a shock to you, but most NoSQL databases have limited support for these kinds of operations.

If you want to perform JOINs or use foreign keys for highly related structured data, ElasticSearch is probably not the best choice for your use case.
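
When related data does need to be searchable, a common workaround is to denormalize it: copy the related fields into each document at indexing time, so that no join is needed at query time. The sketch below shows the idea in plain Python with hypothetical `articles` and `authors_by_id` structures; it comes at the cost of duplicated data that must be re-indexed when, say, an author’s details change.

```python
def denormalize(articles, authors_by_id):
    """Embed author details into each article so no join is needed at query time."""
    documents = []
    for article in articles:
        author = authors_by_id[article["author_id"]]
        documents.append({
            "headline": article["headline"],
            "author_name": author["name"],
            "author_country": author["country"],
        })
    return documents


# Made-up relational-style data: articles reference authors by id.
authors_by_id = {1: {"name": "Author One", "country": "IN"}}
articles = [
    {"headline": "Example headline", "author_id": 1},
    {"headline": "Another headline", "author_id": 1},
]
documents = denormalize(articles, authors_by_id)
for doc in documents:
    print(doc)
```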

Small dataset or simple query needs

ElasticSearch is complex and expensive. Running and managing a large ElasticSearch cluster not only requires the knowledge and skill of software engineers and DevOps engineers but might even require specialists who excel at managing and architecting ElasticSearch clusters, called “ElasticSearch Architects”. There’s a plethora of configuration options and architectural choices to play around with, and each of them has a significant impact on your queries and ingestion, and thus an indirect impact on the user experience of core flows in your system.

If you want to execute simple queries or have relatively little data, then a simpler database may be better for your application.

How to use ElasticSearch in your system design

A single software system usually requires multiple databases, each powering a different set of functionalities. Let’s take an example to better understand the design choices involved in using ElasticSearch.

Let’s say you want to build a video streaming service, something like Netflix. Let’s see where ElasticSearch can fit into this example.

As a Search system

A very common use case of ElasticSearch is as a secondary database powering full-text search queries. This can be very useful for our video streaming application. We can’t store the videos in ElasticSearch, and we probably don’t want to store data related to billing or users in ElasticSearch either.

For that, we can have other databases, but we can store the titles of movies, along with their descriptions, genres, ratings, etc. in ElasticSearch.

We can have an architecture similar to this:

Image by author

We can ingest the data on which we want to power full-text search into ElasticSearch. When the user performs a search operation, we query the ElasticSearch cluster. This way we get the full-text search capabilities of ElasticSearch, and when we want to update user information, we perform those updates in our primary storage.
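
One way to picture the ingestion step: whenever a movie record changes in primary storage, only its searchable fields are copied into the search index, and everything else stays behind. The sketch below is a minimal illustration in plain Python; the record layout, the `SEARCHABLE_FIELDS` tuple, and the helper name `to_search_document` are assumptions made up for this example.

```python
SEARCHABLE_FIELDS = ("title", "description", "genres", "rating")


def to_search_document(movie_record):
    """Keep only the fields we want to search on; the rest stays in primary storage."""
    return {
        field: movie_record[field]
        for field in SEARCHABLE_FIELDS
        if field in movie_record
    }


# Made-up movie record as it might look in the primary database.
movie_record = {
    "id": 42,
    "title": "Example Movie",
    "description": "A made-up movie used for illustration.",
    "genres": ["drama"],
    "rating": 4.5,
    "video_path": "internal/placeholder/path",
    "licensing_cost": 1000,
}
search_doc = to_search_document(movie_record)
print(search_doc)
```

The resulting document is what would be indexed into ElasticSearch, typically keyed by the same id as the primary record so that search hits can be mapped back to it.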

As a real-time data analysis pipeline

As we discussed, understanding user behavior and patterns is an important step in deciding how to evolve the product. We can publish events, such as clickstream events and scroll events, to better understand how our users use our product.

For example, in our video streaming application, we can publish an event with user and movie data every time a user clicks on a movie or a show. We can then analyze and chart aggregations to better understand how users are using our product. For example, we might notice that users use our product more in the evening than in the afternoon, or that users prefer shows or movies in their local language over other languages. Using this, we can develop our product to improve the user experience.

This is what a basic system for real-time data analysis using ElasticSearch and Kibana (a dashboarding tool that works well with ElasticSearch) would look like:

Image by author

As a recommendation system

We can build basic recommendation systems with ElasticSearch. We can store information about the user, such as the user’s country, age, preferences, etc., and generate queries to get popular movies, shows, or series for that user.

We can construct queries in ElasticSearch that give more preference (called boosting) to certain attributes.
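
As a rough illustration of boosting for recommendations, the sketch below builds a `bool` query whose `should` clauses score movies higher when they match the user’s country or preferred genres. The user profile shape, the `country` and `genres` field names (assumed to be keyword fields on a hypothetical movies index), the helper name, and the boost values are all assumptions for illustration.

```python
import json


def build_recommendation_query(user, size=10):
    """Build a bool query that prefers movies matching the user's country and genres."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    # Movies from the user's country get a stronger boost.
                    {"term": {"country": {"value": user["country"], "boost": 2.0}}},
                    # Movies in any of the user's preferred genres get a smaller boost.
                    {"terms": {"genres": user["preferred_genres"], "boost": 1.5}},
                ]
            }
        },
    }


# Made-up user profile.
user = {"country": "IN", "preferred_genres": ["drama", "thriller"]}
query = build_recommendation_query(user)
print(json.dumps(query, indent=2))
```

Since this query has only `should` clauses, a movie needs to match at least one of them to be returned, and movies matching both rank higher. A real system would likely combine this with a popularity signal, which is left out here.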

Understanding the query language, boosting certain fields, and performing aggregations is a big topic in itself, but I’ve written a blog post covering the basics here:

How to Architect ElasticSearch Clusters?

Architecting an ElasticSearch cluster is no easy feat; it requires knowledge of nodes, shards, and indexes, and of how to orchestrate all of them. There are near-infinite architectural choices to make, and the field is constantly evolving (especially with the popularity of AI and AI-powered search). To discuss it more, I’ve written an entire blog post that starts from the very basics and covers everything you’d need to know to architect a search cluster:

Understanding Search Queries and Improving Search Systems

Search is complex, very complex. There are a lot of ways we can improve search systems, making them more powerful and more understanding of user needs. You have already learned about ElasticSearch and what it is. Continue this journey as we start from here, build a basic search query, understand the problems in the query and our system, and evolve and improve the system, step by step with examples.

Context-aware Searching

I recently read a great analogy about search systems. You can think of the search system we have discussed so far as a mechanical, rigid search. When a user enters a word, we find all the documents where the word appears and return them.

Or you can think of a search system as a librarian. When a user asks a question, let’s say, “What was Winston Churchill’s role in the Second World War?”, the librarian doesn’t just point him to the books that have the words “Winston”, “Churchill” or “Second World War”. Instead, the librarian evaluates and understands the customer and the context. Maybe it’s a school kid, so instead of recommending a huge textbook, she finds a book more suitable for a younger reader. Or maybe she doesn’t have any book with Winston Churchill in the title, so she finds a book that talks about the Second World War or British prime ministers and recommends that instead. The librarian may even recommend different books for exams and for summer vacation homework (some of you may not know this, but in some countries, you are given a huge amount of homework for summer vacations).

This is easy for you and me to understand, but how would our system know that Winston Churchill was a British prime minister and recommend books on Britain during the Second World War? And how would our system understand the context of the discussion, understand the user, and recommend appropriate books?

As difficult as it may seem, it’s actually not so hard. It’s called Semantic Search and it’s how most big tech companies build their search systems.
Semantic search is a set of search techniques that aims to understand the meaning behind user queries and the context of content, enabling more accurate and contextually relevant search results by considering the relationships between words and the intent behind the search.
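
One common building block for this is to represent queries and documents as vectors (embeddings) and rank documents by how close their vectors are to the query’s. The sketch below uses cosine similarity on tiny hand-written vectors; in a real system the vectors would come from a trained embedding model, and the book titles and numbers here are invented purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written 3-dimensional "embeddings", invented for this sketch.
book_vectors = {
    "Britain in the Second World War": [0.9, 0.8, 0.1],
    "A History of British Prime Ministers": [0.7, 0.3, 0.2],
    "Gardening for Beginners": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.9, 0.1]  # stands in for the Winston Churchill question

ranked = sorted(
    book_vectors,
    key=lambda title: cosine_similarity(query_vector, book_vectors[title]),
    reverse=True,
)
print(ranked)
```

None of the titles contains the words “Winston” or “Churchill”, yet the two history books rank above the gardening book because their vectors point in a similar direction to the query’s.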

This is a big topic, and I’m still reading and learning more about it, but a blog post that starts at the basics is coming soon, so if you want to know more about this topic, follow me here on Medium.

Other databases

I write about system design concepts, like databases, queues, and pub-sub systems, so follow me here on Medium for similar articles. I also write a lot of byte-sized content on LinkedIn (for example, this post on the differences between RabbitMQ and Kafka), so follow me on LinkedIn for shorter-form content here.

Meanwhile, you can check out my blog posts on other databases and system design concepts.
