I’ve been a heavy user of ElasticSearch for coming up on 7 years now. During that time I’ve used it for a few main usecases: a search engine, an APM solution (after NewRelic started being stupidly expensive), a backend for Jaeger, and a log storage system. In all of those usecases I’ve really pushed ElasticSearch to its limits, with hundreds of terabytes of data across dozens of machines and tens of thousands of shards. In all that time, I’ve found that it really only works well for one of those situations. Particularly with Elastic’s push towards being anti-user, I wanted to question whether storing log data is a good usecase for ElasticSearch and suggest some better options.
Why ElasticSearch doesn’t work well for log data
ElasticSearch (actually Lucene, but for our purposes here they are interchangeable) was designed with the search engine index usecase in mind (and I’ve found it works exceedingly well for this). To that end, it expects documents to follow a relatively static (at least within a given index) structure, and optimises for full text search (notably through the use of an Inverted Index). But are these what we want for log storage? I’d argue no.
In a large company, you generally find many different services logging many different types of things. If you’re using structured logging then this means many different teams logging many different field names. In order to not place too much of a burden on your users, you’re probably using something like dynamic mappings to dynamically update your index mapping whenever a new field is seen. This works fine, until one day your index just suddenly stops accepting new documents. You’ve hit your maximum number of fields per index. You bump that and continue on, but it’ll keep happening. And what’s worse, your search performance will degrade more and more, with your index size bloating as more fields are in the index.
What you might consider doing then is to limit the fields in your documents to a strict set and closely guard additions to that set. But that only decreases the usefulness of your logs and annoys the engineers who have to conform to the format.
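To make the field limit problem concrete, here is a minimal sketch that simulates dynamic mapping growth. It assumes the default `index.mapping.total_fields.limit` of 1000 and only counts leaf fields; real ElasticSearch also counts object mappings and multi-fields towards the limit, so this undercounts. The function names and the `sample_docs` generator are hypothetical, made up purely for illustration.

```python
from typing import Any, Dict, Iterable, List, Set

# Assumed default value of index.mapping.total_fields.limit.
DEFAULT_FIELD_LIMIT = 1000


def flatten_field_names(doc: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Return the dotted leaf field names of a (possibly nested) log document."""
    names: Set[str] = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            names |= flatten_field_names(value, path)
        else:
            names.add(path)
    return names


def docs_until_limit(docs: Iterable[Dict[str, Any]], limit: int = DEFAULT_FIELD_LIMIT) -> int:
    """Count how many documents are accepted before a new field would exceed the limit."""
    seen: Set[str] = set()
    accepted = 0
    for doc in docs:
        new_fields = flatten_field_names(doc) - seen
        if len(seen) + len(new_fields) > limit:
            break  # this is the point where indexing would start being rejected
        seen |= new_fields
        accepted += 1
    return accepted


def sample_docs(services: int, fields_per_service: int) -> List[Dict[str, Any]]:
    """Hypothetical workload: every service logs its own set of structured fields."""
    return [
        {
            "message": "something happened",
            "service": f"svc{i}",
            "ctx": {f"svc{i}_field{j}": j for j in range(fields_per_service)},
        }
        for i in range(services)
    ]
```

With 50 services each adding 30 of their own fields, `docs_until_limit(sample_docs(50, 30))` stops after the 33rd service. Bumping the limit just moves that point further out.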
The Inverted Index
For text fields in ElasticSearch, like one might use for a log message or an error string, ElasticSearch maintains an Inverted Index of the data. This makes it very efficient for term matches, like if you want to find a web page with certain words in it (the search engine usecase). But how often do we actually do full text searches? Particularly across a whole dataset (without filtering by other things)? I ran a survey across my production ElasticSearch cluster and found that out of a total of ~30,000 queries in the last week, only ~40% had any sort of full text search, and a grand total of none of them were full text searches without at least one other filter.
This Inverted Index comes at a cost, however - it has to be updated on every insert, and it takes up disk space and RAM to keep in memory (don’t @ me about mmap). So in the logging usecase, we are expending a number of resources to store an index that is arguably not actually useful.
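For anyone who hasn't looked inside one, here is a toy Inverted Index. It is a sketch of the general technique, not of how Lucene actually implements it; the class name and the simple `\w+` tokenizer are my own assumptions. It shows both sides of the trade: lookups by word are cheap, but every insert has to touch one postings entry per distinct word in the message.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumed tokenizer: lowercase runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text.lower())


class ToyInvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self) -> None:
        self.docs: List[str] = []
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, message: str) -> int:
        """Store a message; the write cost grows with the number of terms in it."""
        doc_id = len(self.docs)
        self.docs.append(message)
        for term in tokenize(message):
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query: str) -> Set[int]:
        """Return ids of documents containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

Note that `postings` grows with every new word that ever appears in a log line, whether or not anyone ever searches for it.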
So what can we do?
ElasticSearch is a pretty entrenched behemoth when it comes to logging and search, but if not ElasticSearch then what?
Grafana Labs' Loki is very exciting. Instead of storing a costly Inverted Index, Loki only indexes on fields (the equivalent of keyword fields in ElasticSearch) and leaves the full text search up to a more costly search after that initial filtering. I’m of the opinion that this is a fantastic model for storing log data (NB: I’m not associated with Grafana Labs in any way). Based on the search patterns I outlined above, with sufficient filtering using keyword data it doesn’t really matter that we do an expensive search across not that many documents, and we save all the resources of not having to maintain and process an Inverted Index. My only issue with Loki is the hard dependence on some sort of object store (S3, or something S3 compatible). That makes it great for Cloud based systems (albeit very costly for large deployments), but pretty unworkable for on-prem deployments (unless you have a few Ceph experts lying around).
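The model described above can be sketched in a few lines. This is not Loki's code or its storage format, just a toy illustration of "index the keyword fields, brute force the text"; the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]


class ToyLabelStore:
    """Groups log lines by their exact label set; the line text itself is never indexed."""

    def __init__(self) -> None:
        self.streams: Dict[Labels, List[str]] = defaultdict(list)

    def push(self, labels: Dict[str, str], line: str) -> None:
        # The write path is a single append: no tokenizing, no postings to update.
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: Dict[str, str], contains: str = "") -> List[str]:
        """Filter streams by labels first, then substring scan only the survivors."""
        wanted = set(selector.items())
        hits: List[str] = []
        for labels, lines in self.streams.items():
            if wanted <= labels:  # cheap filter on the indexed fields
                hits.extend(line for line in lines if contains in line)  # expensive scan
        return hits
```

The expensive `contains in line` scan only ever runs over streams that survived the label filter, which is why the approach holds up when nearly every query carries at least one such filter.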
Uber's Clickhouse as a Log Storage thing
Uber recently blogged about using Clickhouse as an unstructured log storage backend, proxying requests between Kibana and Clickhouse through a translator. I’m really interested in this - Clickhouse in general, as a Column-oriented DBMS, will become (I believe) one of the datastores of the future for storing the sort of “wide events” that really enable Observability. Unfortunately, Uber has not open sourced this work, so we are unable to benchmark it and see how it performs. But from my initial testing on compression and ingest volume, I was able to turn a 5TB index in ElasticSearch into 800GB in Clickhouse - a stat that I was quite astonished by.
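If you want to poke at the column-oriented idea yourself, here is a rough sketch that pivots row-shaped events into one list per field and compresses both layouts with zlib. To be loud about the assumptions: this is not Clickhouse's storage format, it is not how I measured the numbers above, and the sizes you get depend entirely on your data. The function names are hypothetical.

```python
import json
import zlib
from typing import Any, Dict, List, Tuple


def to_columns(events: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-shaped events into one list per field; missing fields become None."""
    fields = sorted({key for event in events for key in event})
    return {field: [event.get(field) for event in events] for field in fields}


def compressed_sizes(events: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (row layout bytes, column layout bytes) after zlib compression."""
    # Row layout: one JSON document per line, as a log shipper might emit it.
    row_blob = "\n".join(json.dumps(event, sort_keys=True) for event in events).encode()
    # Column layout: each field's values are serialized and compressed on their own.
    column_total = sum(
        len(zlib.compress(json.dumps(values).encode()))
        for values in to_columns(events).values()
    )
    return len(zlib.compress(row_blob)), column_total
```

A query that only needs one field can read just that field's list from `to_columns`, and an event with a brand new field simply adds another column rather than running into a mapping limit.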
I honestly believe that ElasticSearch should be relegated to its intended usecase as a search engine. There are better logging solutions out there, with more coming due to the exciting recent work in new datastores. I suspect (and hope) that as an industry we will move on from ElasticSearch for log storage and embrace a more appropriate storage solution in the future.
Like this post, or just want to yell at me? Follow me on Twitter: @sinkingpoint