Your challenges with using Splunk

In our application, we log critical information to text log files for later debugging. With Splunk it's easy to identify a problem if I already have some data points, such as an order number or an "object reference not found" type of error. But it's challenging for me to get an overall picture of a problem using Splunk. To identify the actual problem in the software, I have to read through possibly multiple log files, or an entire log file, to see what the application was doing before the problem happened. Reading an entire log file the way a human would helps me see how the application behaved with other data points before the actual problem happened. In other words, it's hard for me to see the "real root cause" of the error with Splunk. What has been your experience out there in the field of software development?


The proper place to ask questions about Splunk is

http://splunkbase.com

where there's a community of people that will help you...

@Cninroh is wrong on most counts. I work at Splunk in development.

  1. With Splunk you do NOT need to modify your app or logs.
  2. Events without dates will work fine.
  3. You can craft searches to find the root cause as you want. You need to craft the search to your need. I don't know your data, but, for example, the search below would find the user that logged in as root right before a "disk-error":

    login root [ search disk-error | head 1 | eval latest = _time - 5*60 ] | head 1

You can also set up alerting to look for anomalous values, and other unexpected things, so you don't have to be actively looking for problems. They can be pushed to you.


It's very difficult to remove the human aspect. That being said, I've recently had to head the development side of a Splunk rollout, and there are some fantastic tools that fulfill at least some of your needs. Using Splunk's built-in alerts is the easiest way to do some of this. Unfortunately, there is a dearth of practical answers and examples for many Splunk-related things, both on splunkbase and elsewhere on the internet (I mean, seriously, please do not use curl with an unsecured flag in every example of a web service or REST API).

Either way, some of the most elegant solutions I've found for finding particular types of logs or log data have involved heavy use of piping to the "rex" command in my searches. It lets you specify Perl-style regular expressions to extract the right information from the right fields. Splunk's website has a new-ish page on it.
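To give a feel for the kind of extraction "rex" performs, here is a minimal Python sketch using named capture groups. It is only an offline analogy, not Splunk code: the log line, the field names `order_id` and `error`, and the helper `extract_fields` are all made up for illustration. Note that Python spells named groups `(?P<name>...)`, while the Perl-style syntax used with "rex" is `(?<name>...)`.

```python
import re

# Hypothetical log line; the format is an assumption made for illustration.
LINE = '2011-03-02 10:15:42 ERROR order=12345 msg="object reference not found"'

# Each named group plays the role of an extracted field.
PATTERN = re.compile(r'order=(?P<order_id>\d+)\s+msg="(?P<error>[^"]*)"')


def extract_fields(line):
    """Return a dict of named-group matches, or an empty dict if nothing matches."""
    match = PATTERN.search(line)
    return match.groupdict() if match else {}


fields = extract_fields(LINE)
print(fields)
```

Once values like these are available as fields, they can be filtered and grouped on instead of being matched as raw text.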

This of course assumes that you know which fields contain the data you're looking for. Unfortunately, this can be an issue with Windows logs if things are not set up correctly at the indexer.


For Splunk to work the way you probably want (judging from your description), you will be "forced" to modify some parts of your application to fit the Splunk approach to indexing and storing data.

Splunk relies on indexes of events, these events being error or status messages from various logs in the system. The main key for a Splunk index is the date, so no dates means no index, which will result in less attractive results in your Splunk searches.

It is also good practice to keep a list or cheat sheet of search terms for different situations, so that you can easily reproduce searches that worked for you before.

In our company we created a specific version of the software just for clients who chose Splunk as their monitoring tool. Basically, the only difference between this custom application and our standard application is that it relies very strongly on writing everything it does to log files, saved as logical events in the form "segment - time - root cause - user - group - file - action - result - note". This helps us find and identify problems in real time, and at much lower cost than before.
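A minimal sketch of what writing events in such a fixed layout could look like is shown below. Only the field order comes from the description above; the " - " separator handling, the "n/a" placeholder for missing values, the ISO timestamp format, and the helper name `format_event` are assumptions made for illustration, not the actual format our application uses.

```python
from datetime import datetime, timezone

# Field order follows the layout described above; everything else is assumed.
FIELDS = ("segment", "time", "root_cause", "user", "group",
          "file", "action", "result", "note")


def format_event(when, **values):
    """Build one log line with the fields in a fixed order.

    Missing fields are written as "n/a" so every line has the same shape.
    """
    parts = []
    for name in FIELDS:
        if name == "time":
            parts.append(when.isoformat())
        else:
            parts.append(str(values.get(name, "n/a")))
    return " - ".join(parts)


line = format_event(
    datetime(2011, 3, 2, 10, 15, 42, tzinfo=timezone.utc),
    segment="billing", root_cause="timeout", user="jdoe", group="ops",
    file="invoice.py", action="charge", result="failed",
)
print(line)
```

Keeping every event on one line with a timestamp and a stable field order makes the events easy to search, whichever tool ends up reading them.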


Just like you need to learn how to read and evaluate a log file, you need to learn how to use Splunk to leverage your efforts.

For a developer running code on a single machine, you can easily read one log file and see what happened when. Once you release that code and run on a multi-tiered or distributed architecture you can't easily track what happened where and when to trace an issue. Maybe you don't even have access to the production system logs.

Here's an example. Find your error with a simple keyword search. Click on the timestamp of the interesting event. This snaps Splunk's time range to 1 second. Clear your keyword search and enter * to see all events during that 1 second. Restrict the hosts and devices you're seeing events from to the systems/applications you're interested in.

You are now seeing the events from all your systems involved in the app in one view. You can zoom out to more time if needed to catch the root events. Use the "| reverse" command if you want to read in chronological order - top (oldest) to bottom (newest) instead of Splunk's default reverse chronological order.
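The workflow above (find the error, open a time window around it, restrict to the hosts you care about, read oldest to newest) can be sketched outside Splunk in a few lines of Python. This is only an analogy to make the idea concrete: the sample events, the host names, and the helper `events_around` are hypothetical, and in Splunk you would do this with the time range picker and "| reverse" rather than with code.

```python
from datetime import datetime, timedelta

# Hypothetical events already parsed from several hosts: (timestamp, host, message).
EVENTS = [
    (datetime(2011, 3, 2, 10, 15, 42), "web01", "ERROR object reference not found"),
    (datetime(2011, 3, 2, 10, 15, 41), "db01", "connection pool exhausted"),
    (datetime(2011, 3, 2, 10, 15, 40), "app01", "retrying order 12345"),
    (datetime(2011, 3, 2, 9, 0, 0), "web01", "service started"),
    (datetime(2011, 3, 2, 10, 15, 41), "mail01", "queue flushed"),
]


def events_around(events, keyword, hosts,
                  before=timedelta(seconds=5), after=timedelta(seconds=1)):
    """Return events from `hosts` near the earliest event containing `keyword`.

    The result is sorted oldest first, like reading a single log file.
    """
    hit_times = [ts for ts, _host, msg in events if keyword in msg]
    if not hit_times:
        return []
    anchor = min(hit_times)
    window = [
        event for event in events
        if anchor - before <= event[0] <= anchor + after and event[1] in hosts
    ]
    return sorted(window, key=lambda event: event[0])


timeline = events_around(EVENTS, "ERROR", {"web01", "db01", "app01"})
for ts, host, msg in timeline:
    print(ts.isoformat(), host, msg)
```

Widening `before` corresponds to zooming out in time to catch events further upstream of the error.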

Once you locate your issue, you can convert that search into a Splunk alert, so you get notified in the future. After you fix that issue, you can leave the alert enabled so if your fix wasn't 100% effective in all situations, or doesn't get rolled into your next major/minor release you'll get notified.
