Grand Unified Theory of logging
Is there a Grand Unified Theory of logging? Shall we develop one? The question (just to show this is not a discussion :) is: how can I improve on the following? (Note that I live mainly in the embedded world, but non-embedded suggestions are also welcome.)
How do you log, when do you log, what do you log, what do you do with log files?
How do you log - I generally have macros, #ifdef TESTING, sort of thing. They write to RAM, and a low-priority process writes them out when the system is idle (over UDP, since I do embedded systems).
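A minimal sketch of that idea in C is below, assuming a POSIX-style socket API for the UDP part. The names (LOG, log_buf, log_flush_udp), the buffer size, and the destination handling are illustrative assumptions, not the actual code described above; a real target would use its own network/driver API, and a lock or lock-free ring buffer if several tasks log at once.

```c
/* Hedged sketch: log to RAM in the hot path, flush over UDP when idle. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define LOG_BUF_SIZE 4096

static char   log_buf[LOG_BUF_SIZE];   /* RAM buffer written by the hot path */
static size_t log_len;                 /* bytes currently queued             */

#ifdef TESTING
#define LOG(fmt, ...)                                                        \
    do {                                                                     \
        if (log_len < LOG_BUF_SIZE) {                                        \
            int n_ = snprintf(log_buf + log_len, LOG_BUF_SIZE - log_len,     \
                              fmt "\n", ##__VA_ARGS__);                      \
            if (n_ > 0)                                                      \
                log_len += ((size_t)n_ < LOG_BUF_SIZE - log_len)             \
                               ? (size_t)n_                                  \
                               : LOG_BUF_SIZE - log_len;                     \
        }                                                                    \
    } while (0)
#else
#define LOG(fmt, ...) do { } while (0)   /* compiled out of release builds */
#endif

/* Called from a low-priority/idle task: ship the buffer as one UDP datagram. */
static void log_flush_udp(const char *host, uint16_t port)
{
    if (log_len == 0)
        return;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(port);
    inet_pton(AF_INET, host, &dst.sin_addr);

    sendto(fd, log_buf, log_len, 0, (struct sockaddr *)&dst, sizeof dst);
    close(fd);
    log_len = 0;   /* buffer is free for the next burst */
}
```

The hot path then just does something like `LOG("event %d received", id)`, and the idle task calls `log_flush_udp()` whenever anything is queued.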
When do you log - same as voting: early and often. At every (in)significant program event, I log at varying levels: events received, transactions succeeding or failing, data updates, etc.
What do you log - Fatal/Error/Warning/Info/Debug/Trace is covered in When to use the different log levels?
What do you do with log files - 1) keep them (in CVS), for both passing and failing runs, and 2) capture everything and filter later, in case I can't repeat a problem. I have tools to filter the log by "level" (Fatal/Error/etc.), process, file, and so on, and to draw message sequence charts, dump data structures, and draw histograms of memory usage - what am I missing?
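As one example of the filtering tools mentioned, here is a rough sketch of a standalone level filter in C. The level names, their ordering, and the assumption that every line carries a textual tag such as ERROR are mine, not taken from the actual log format above.

```c
/* Hedged sketch: keep only lines at or above a chosen severity. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *levels[] = { "FATAL", "ERROR", "WARNING",
                                "INFO", "DEBUG", "TRACE" };
static const int n_levels = sizeof levels / sizeof levels[0];

/* Rank a line by the first level tag found; untagged lines rank lowest. */
static int level_index(const char *line)
{
    for (int i = 0; i < n_levels; i++)
        if (strstr(line, levels[i]))
            return i;
    return n_levels;
}

/* Usage: logfilter 2 < input.log   keeps FATAL/ERROR/WARNING only. */
int main(int argc, char **argv)
{
    int max = (argc > 1) ? atoi(argv[1]) : 2;
    char line[1024];

    while (fgets(line, sizeof line, stdin))
        if (level_index(line) <= max)
            fputs(line, stdout);
    return 0;
}
```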
Hmmm, binary or ASCII log file format? ASCII is bulkier, but binary requires more processing. I have done both; currently I use ASCII.
Question - did I miss anything, and how can I improve on this?
You could "instrument" your code in many different ways, everything from start-up/shut-down events to individual machine instruction execution (using a processor emulator). Of all the possibilities, what's worth doing? Don't just do it for the sake of completeness; have a specific goal in mind. A business case if you like, with a benefit you expect to receive. E.g.:
- Insight into CPU task execution times/patterns to enable optimisation (if you need to improve performance).
- Insight into other systems to resolve system integration issues (e.g. what messages is your VoIP box sending and receiving when it connects to a particular peer?)
- Insight into the nature of errors (for field diagnostics)
- Aid in development
- Aid in validation testing
I imagine that there's no grand unified theory of logging, because what you do would depend on many details:
- Quantity of data
- Type of data
  - Events
  - Streamed audio/video
- Available storage
  - Storage speed
  - Storage capacity
- Available channels to extract data
  - Bandwidth
  - Cost
  - Availability
    - Internet connected 24×7
    - Site visit required
    - Need to unlock a rusty gate, climb a ladder onto a roof to plug in a cable, after filling out OHS documentation
    - Need to wait until the Antarctic winter is over and the ice sheets thaw
- Random access vs linear access (e.g. if you compress it, do you need to read from the start to decompress and access some random point?)
- Need to survive error conditions
  - Watchdog reboots
  - Possible data corruption
    - Due to failing power supply
    - Due to unreliable storage media
- Need to survive a plane crash
As for ASCII vs binary, I usually prefer to keep the logging simple and put any nice presentation in a PC application that decodes the data. It's usually easier to create a user-friendly presentation in PC software (written in, e.g., Python) than in the embedded system itself.
> did I miss anything, and how can I improve on this?
- Asynchronous logging.
- Using multiple log files for the same process for different logging abstractions, e.g. the process's activities are logged in a normal log file, while the process's stats (periodic statistics you might be interested in) are logged in a separate stats log file (see the sketch below).
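A minimal sketch of that activity-log/stats-log split in C. The file names, counters, and function names are invented for this example, and error handling is omitted for brevity.

```c
/* Hedged sketch: one file for events, one file for periodic counters. */
#include <stdio.h>
#include <time.h>

static FILE *activity_log;   /* events, errors, traces         */
static FILE *stats_log;      /* periodic numeric counters only */

static void logs_open(void)
{
    activity_log = fopen("proc.log", "a");
    stats_log    = fopen("proc.stats", "a");
}

/* Normal activity logging: one human-readable line per event. */
static void log_event(const char *msg)
{
    fprintf(activity_log, "%ld EVENT %s\n", (long)time(NULL), msg);
}

/* Called from a periodic timer; keeping the numbers in their own file means
 * each file can be parsed, rotated, or graphed independently. */
static void log_stats(unsigned tx_pkts, unsigned rx_pkts, unsigned heap_free)
{
    fprintf(stats_log, "%ld STATS tx=%u rx=%u heap=%u\n",
            (long)time(NULL), tx_pkts, rx_pkts, heap_free);
}
```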
> Hmmm, binary or ASCII log file format? ASCII is bulkier, but binary requires more processing. I have done both; currently I use ASCII.
ASCII is good. More often than not, logs are meant to be used for debugging, and a human-readable form eases and speeds that up. However, if your logs are used mostly to record information for later analysis and report generation (e.g. stats or latencies), a binary format would be preferred. You can go one step further and use a custom format along with a DB service that does index-based sorting, where the index can be a tuple of time and event type.
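To make the trade-off concrete, here is a small sketch of the same event written both ways in C. The record layout, event codes, and field widths are invented for illustration; a real binary format would also need versioning and byte-order rules.

```c
/* Hedged sketch: ASCII line vs packed binary record for one event. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t timestamp;    /* seconds since epoch                        */
    uint16_t event_type;   /* doubles as an index key for later sorting  */
    uint16_t value;
} log_record_t;
#pragma pack(pop)

/* Binary: 8 bytes per event, cheap to write, needs a decoder to read. */
static void log_binary(FILE *f, uint16_t event, uint16_t value)
{
    log_record_t r = { (uint32_t)time(NULL), event, value };
    fwrite(&r, sizeof r, 1, f);
}

/* ASCII: grep-able and human readable, but several times larger per event. */
static void log_ascii(FILE *f, const char *event, unsigned value)
{
    fprintf(f, "%ld %s value=%u\n", (long)time(NULL), event, value);
}
```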
--
One thing which may be helpful is to have a "maybeLogger" object that accepts log records for an operation which may or may not succeed, and then either ditches those records if the operation succeeds or fails in an uninteresting way, or logs them if it does something interesting. This is relatively easy to do in something like .NET. In an embedded system, it can only be done really easily if the amount of stuff to be logged is small enough to fit in free RAM, but one could probably use a garbage-collection-based approach to hold stuff in flash: have one 'stream' of data in flash for new log entries and another for ones that are confirmed to be interesting, and periodically move data which is known to be good from the first stream to the second.
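A rough sketch of the RAM-only variant of that idea in C (the names, buffer size, and commit/discard API are assumptions for illustration; the flash-backed two-stream scheme is not shown):

```c
/* Hedged sketch: hold records for one operation, then commit or discard. */
#include <stdio.h>
#include <string.h>

#define MAYBE_BUF_SIZE 1024

static char   maybe_buf[MAYBE_BUF_SIZE];
static size_t maybe_len;

/* Queue a line; silently drops it if the holding buffer is full. */
static void maybe_log(const char *line)
{
    size_t n = strlen(line);
    if (maybe_len + n + 1 <= MAYBE_BUF_SIZE) {
        memcpy(maybe_buf + maybe_len, line, n);
        maybe_buf[maybe_len + n] = '\n';
        maybe_len += n + 1;
    }
}

/* The operation turned out interesting: write everything to the real log. */
static void maybe_commit(FILE *real_log)
{
    fwrite(maybe_buf, 1, maybe_len, real_log);
    maybe_len = 0;
}

/* Boring success: throw the queued records away. */
static void maybe_discard(void)
{
    maybe_len = 0;
}
```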
Here's my $0.02.
I only log when I'm having a problem and need to track down the source. Usually this has to do with a customer's environment, so I can't just attach the debugger. My solution is to enable the Telnet port and use that to print out statements about where the program is and the values of variables.
I do ASCII only, because it goes over Telnet.
Another aspect of Telnet is that it is pretty simple: it's a TCP port with text being thrown out, and there's very little processing other than the normal TCP headaches.
The log files are dumped as soon as I get them, because I have not tried to capture and save a Telnet session. I guess I could with Wireshark, but I don't need a history of that session; I just need to find the problem and verify a fix.
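For completeness, a bare-bones sketch of that "text over a TCP port" approach in C. The port number, helper names, and the decision to block until a client connects are assumptions for illustration; real Telnet option negotiation and error handling are omitted.

```c
/* Hedged sketch: push plain ASCII debug text to whoever connects via TCP. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int debug_fd = -1;   /* connected client, or -1 if nobody is listening */

/* Wait for a debugging client to connect (e.g. "telnet <box> 2323"). */
static void debug_port_accept(uint16_t port)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 1);
    debug_fd = accept(srv, NULL, NULL);   /* blocks until someone connects */
    close(srv);
}

/* Push a plain ASCII line to the connected client, if any. */
static void debug_print(const char *msg)
{
    if (debug_fd >= 0)
        write(debug_fd, msg, strlen(msg));
}
```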