Wednesday, April 6, 2016

Find causes for slower elasticsearch responses

Leave a Comment

I have been using elasticsearch for an e-commerce site for quite some time - not only for search, but also to retrieve product data (/index/type/{id}) to avoid SQL queries.

Generally this works really well and most requests are answered between 1ms and 3ms. But there are some requests which take 100ms-250ms - just for a GET request like /index/type/{id}, where no actual searching is done and which normally takes 1-2ms. It seems to me that something must be wrong if such a response takes more than 100ms, because the server has a lot of RAM & a fast 6-core-CPU, the data is stored on very fast SSDs, there are only 150'000 entries (about 300MB in Elasticsearch) and there is almost no load. Elasticsearch has 5GB of RAM, and there is enough spare RAM for Lucene to cache all entries all the time. Requests are made through a local network with a dedicated switch. The index has only one shard and I am running Elasticsearch 2.3.

I am doing the requests in PHP. I have already tried using Nginx as reverse proxy for Elasticsearch, but this did not solve anything - it happens with and without Nginx inbetween.

Edit: Slow requests happen about 1% of the time (in relation to total number of requests). I can also reproduce it by just making 1000 requests in PHP to /index/type/{id} in Elasticsearch - always 1% will be really slow, even when using the same ID like /index/type/55 (as long as the ID exists). This also means there is no "cache effect" - after the first request Elasticsearch should have the data "ready", yet the number of slow requests is the same no matter what IDs I request or if I request the same ID over and over.

Edit2: I have looked at the stats of my nodes with Marvel & Kibana, and nothing indicates slowdowns there: between 20-40% of JVM heap memory is used, and almost no latency (between 0.1ms and 0.5ms). It confirms that there are more than enough resources and I see no correlations or hints for the cause of any slow requests.

After a lot of testing:

These are now my definite testing results:

  • The larger the response from Elasticsearch, the more likely slow requests are going to happen. Many small responses have a MUCH larger chance of not being exceptionally slow than one large response.
  • Bombarding Elasticsearch with simple GET requests reduces the likelihood of slow responses when I run more requests in parallel.
  • When using a simple search for one keyword over and over again, Elasticsearch tells me in the response it "took" 2-3ms, even when a response takes 200ms until my application receives it. But also here: the larger the response, the higher the chances are of slow responses. 1KB response is never slow when I run loops of requests, 2.5KB is only a little slow (30ms) in very few instances, 10KB response always has up to 1% of slow requests with up to 200ms.

I have taken into account that it might be a network "problem", especially when Elasticsearch thinks it is fast even when it is slow. But it would be a strange root cause, because my setup is so standard (Debian Jessie). Also, keep-alive connections and TCP_NODELAY do nothing to improve this problem.

Anybody know how to find the root cause, and what could possibly be happening?

0 Answers

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment