A regex pattern for different Tomcat log entries

I’m new to regex.

Suppose I have the following line from Tomcat’s access log file:

123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"

The following pattern works fine with entries that look exactly like the one above:

"^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\""

But not all log entries look exactly like the one above: sometimes an entry contains 9 fields, sometimes 7. An example of a 9-field entry:

82.132.139.79 - - [14/Jul/2011:18:52:44 +0100] "GET /~roger/cpp/introans.htm HTTP/1.1" 200 11195 "http://www.dcs.bbk.ac.uk/~roger/cpp/intro3.htm" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5"

However, I’m only interested in the IP, the date and time, and the URL. Is there a pattern that matches log entries regardless of how many fields they contain?


The line you give in the example is in the pseudo-standard combined log format. This 9-field format extends the widely used 7-field common log format with two additional fields: referrer and user-agent.

By making the final two fields optional in your regex you can match lines in either common or combined format:

"^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?"

The capture groups are:

  1. remote host
  2. RFC 1413 identity
  3. userid
  4. datetime
  5. request
  6. status
  7. bytes
  8. optional combined fields
  9. referrer
  10. user-agent

This pattern is deliberately non-specific about the contents of the individual fields in the log message. Generally, when parsing a log, you want to extract whatever you can rather than attempt to validate against a specification.
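
Since you only need the IP, the date and time, and the URL, here is a minimal Java sketch of how the pattern above might be used. The class name `AccessLogFields` and the method `extract` are hypothetical names chosen for illustration, and the code assumes the URL is the second whitespace-separated token of the request field (group 5), which holds for requests of the form `GET /path HTTP/1.1`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method are illustrative only.
class AccessLogFields {
    // The pattern from the answer: the referrer and user-agent fields are optional.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?");

    // Returns {ip, datetime, url}, or an empty Optional if the line does not match.
    static Optional<String[]> extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        // Group 5 is the whole request, e.g. "GET /path HTTP/1.1".
        String[] request = m.group(5).split(" ");
        // Assumption: the URL is the second token; otherwise fall back to the raw request.
        String url = request.length >= 2 ? request[1] : m.group(5);
        return Optional.of(new String[] { m.group(1), m.group(4), url });
    }

    public static void main(String[] args) {
        // A 7-field (common log format) line; 9-field lines go through the same call.
        String line = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
            + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450";
        extract(line).ifPresent(f -> System.out.println(String.join(" | ", f)));
    }
}
```

Lines that do not match come back as an empty `Optional`, so malformed entries can be skipped or logged rather than causing an exception.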
