Thursday, December 21, 2017

Ways to only process new (indexed after last run) data in Elasticsearch?

Is there a way to get the date and time at which an Elasticsearch document was written?

I am running ES queries via Spark and would prefer NOT to look through all the documents that I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now.

What is the most efficient way to do this?

I have looked at:

  • Updating each document to add a field holding an array of booleans that records which analytics have already looked at it. The downside is waiting for the update to occur.
  • An index-per-time-frame approach, which would break the current indexes down into smaller ones, for example one per hour. The downside I see is the number of open file descriptors.
  • ??

Elasticsearch version 5.6

3 Answers

Answers 1

I posted the question on the Elasticsearch discussion board, and it appears that using an ingest pipeline is the best option.
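As a rough illustration of the ingest pipeline idea, the sketch below builds a pipeline with a single `set` processor that copies the ingest timestamp (`{{_ingest.timestamp}}`) into each document. The cluster address, the pipeline name `add-ingest-time`, and the field name `ingested_at` are assumptions made for this example, not values from the discussion.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumed cluster address
PIPELINE_ID = "add-ingest-time"    # hypothetical pipeline name
FIELD = "ingested_at"              # hypothetical field name


def build_pipeline(field=FIELD):
    """Pipeline body: one `set` processor that stores the ingest timestamp."""
    return {
        "description": "Record the time each document was ingested",
        "processors": [
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


def put_pipeline(es_url=ES_URL, pipeline_id=PIPELINE_ID):
    """Register the pipeline via PUT _ingest/pipeline/<id>.

    Only works against a reachable cluster, so it is not called here.
    """
    req = urllib.request.Request(
        url=f"{es_url}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(build_pipeline()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Once the pipeline is registered, index or bulk requests that pass `pipeline=add-ingest-time` as a request parameter should have the field added on an ingest node, so the writer does not need to set it.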

Answers 2

I am running ES queries via Spark and would prefer NOT to look through all the documents that I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now.

A workaround could be:

When data is inserted into Elasticsearch through Logstash, Logstash adds a @timestamp key to each document, representing the time (in UTC) at which the document was created. Alternatively, we can use an ingest pipeline.

After that we can query based on the timestamp.
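Querying by timestamp could look roughly like the sketch below: it builds a range query covering only the window between the previous run and now, and keeps the previous run time in a small checkpoint file. The helper names and the checkpoint file are hypothetical, and the field defaults to `@timestamp` as added by Logstash.

```python
import json
from datetime import datetime, timezone

EPOCH = "1970-01-01T00:00:00.000Z"  # used when no previous run is recorded


def utc_now_iso():
    """Current UTC time as ISO 8601 with millisecond precision."""
    now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def build_window_query(last_run, now, field="@timestamp"):
    """Range query matching documents with last_run < field <= now."""
    return {"query": {"range": {field: {"gt": last_run, "lte": now}}}}


def load_checkpoint(path, default=EPOCH):
    """Read the timestamp of the previous run, or the default on first run."""
    try:
        with open(path) as f:
            return f.read().strip() or default
    except FileNotFoundError:
        return default


def save_checkpoint(path, value):
    """Persist the upper bound of the window that was just processed."""
    with open(path, "w") as f:
        f.write(value)


if __name__ == "__main__":
    last_run = load_checkpoint("last_run.txt")  # hypothetical file name
    now = utc_now_iso()
    print(json.dumps(build_window_query(last_run, now)))
```

With elasticsearch-hadoop, the printed JSON can be supplied as the `es.query` setting so that Spark only reads the matching documents. Calling `save_checkpoint` with `now` only after the job succeeds means a failed run is retried over the same window. Because documents become searchable a short time after they are indexed, it may be safer to end the window slightly before the current time.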

For more on this, please have a look at:

  1. Mapping changes
  2. There is no way to ask ES to insert a timestamp at index time

Answers 3

Elasticsearch doesn't have such functionality.

You need to save a date with each document manually. You will then be able to search by date range.
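A minimal sketch of saving the date manually, assuming the writer is free to add a field before indexing. The function name and the `indexed_at` field are made up for this example.

```python
from datetime import datetime, timezone


def with_index_date(doc, field="indexed_at", now=None):
    """Return a copy of doc with a UTC ISO 8601 timestamp stored under `field`."""
    now = now or datetime.now(timezone.utc)
    stamped = dict(doc)  # leave the caller's dict untouched
    stamped[field] = now.isoformat(timespec="milliseconds")
    return stamped
```

The timestamp here comes from the client clock at write time, so it can differ slightly from the moment Elasticsearch actually indexes the document.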
