I’ve been thinking about this for a while.
Maybe we are doing something wrong.
Take the standard log prologue:
02/22/2020 14:05:00 Application started
02/22/2020 14:05:01 Verifying dependencies
02/22/2020 14:05:02 Database and upstreams OK
02/22/2020 14:05:03 Got request from IP:184.108.40.206
02/22/2020 14:05:03 Calling upstream blah, request ID 1
02/22/2020 14:05:04 Got success response for request ID 1
02/22/2020 14:05:05 Responding to caller 220.127.116.11 request ID 1
Now add on top of this structured logging with enriched information.
A few points.
How much of that data is really necessary ?
How often do you read it all ?
Have you ever measured the ratio logged/viewed items ?
Wanna bet the above ratio is less than 5% ?
So what to do ?
Suppose you have a ‘special’ logging library that scans your code and builds a dictionary. “Application started” becomes 1, “Verifying dependencies” becomes 2, and so on. For more complex statements, you save an entry in the dictionary with placeholders, something like 5->”Request %s solved in %d seconds”. Also, you save the timestamp only once, at startup, and then you either put a displacement in millis, or you put a market every second, so you have reduced also the timestamp part. So what you log is actually something like:
6 5 3 40
First int is the time displacement, second int is the dictionary index, the rest are substitution parameters.
My guess is a 80–90% reduction in data that is being send over the wire and stored somewhere. One issue that is a little bit thorny is exceptions thrown ( think Java ) that tend to spit out a full stack trace from which, again, 80% is completely useless. Map them in a separate dictionary.
OK, now what to do when you really want to look at your logs ?
Merge the dictionary with the logged values on the fly, stream them to the GUI and… Tada, you get back log as you expected. You pay the price of waiting a little longer to actually “see” the logs, but you cut your traffic and log storage needs by 80%. And you actually expand only what you’re looking at. Think of it as the double-slit experiment, when the electron behaves as a particle or as a wave, depending on the fact that you’re observing it or not.
Is it worthy ? I tend to say yes. Would love to hear other opinions …
One step further.
Assume you have a fleet of a few hundred servers. And you do the logging as above. You might think as every log entry as an integer (its index in the dictionary). So you get a time serie of ints. You want anomaly detection ? Run something like SARIMA or Facebook Prophet on this series. I bet there are clear patterns in this logging.
But wait, there’s more …
How about you correlate all the series of integers across all your servers on the same time base, so you will have an array of a few hundred ints for every timestamp. If you look at each timestamp being a row, and use the ints to map them into a color space, one int to one pixel of a certain color, you can generate a picture that has for each line the status of all your servers — maybe status is too strong, just the current log. But you have one picture that gives you an overall view of your system over a certain timespan. Now, collect a lot of these pictures, timestamp them. Then label them according to your current watch system, as normal, or with the kind of error or warning that comes for that time for the existing system. Train a CNN on these pictures, pretty easy to label. Or plug it into a variational autoencoder.
Boom, here’s your anomaly detection.
One step further, by constantly removing oldest line from the picture and adding newest on top, you can generate a video of your system in motion, from the logging perspective. Do as above, with a video detection system. By the way, you can still have human in the loop, recognizing patterns in images and/or video.
So … is it worthy ?
PS. I found a while ago a C library that does this kind of substitution, but it doesn’t goes to the end. Unfortunately I lost the link. If someone knows it, please feel free to add it in comments.