Searching in an unordered log file
Where I work we have a log file which contains lines like this:
31201007061308000000161639030001
Which is to be read like this:
31|year(4)|month(2)|day(2)|hour(2)|min(2)|000000|facility(3)|badge(5)|0001
There is supposed to be one line per record, but things like this happen:
31201007192000000000161206930004 31201007192001000000161353900004 31201031201007192004000000161204690004 31201007192004000000090140470004 31201007192005000000090148140004 3120100719200500031201007191515000000161597180001 31201007191700000000161203490001 31201007191700000000161203490001 31201007191700000000161202830001 31201007191700000000
That's because the software that is supposed to read the file sometimes misses some of the newest records, and the person in charge then copies the older records to the end of the file. So basically the file looks like that because of human error.
When a record isn't saved in the DB I have to search the file. At first I just looped through every record in the file, but that is really slow, and the problems mentioned above make it slower. My current approach uses a regular expression:
//Regex Matcher, built once instead of on every iteration
Regex rx = new Regex(@"31\d\d\d\d\d\d\d\d\d\d\d\d000000161\d\d\d\d\d0001");
//Starts Reader; the using block closes it when done
using (StreamReader reader = new StreamReader(path))
{
    string fileLine;
    //Reads until ReadLine returns null, so every line (including the last) is checked
    while ((fileLine = reader.ReadLine()) != null)
    {
        //Looks for all valid records in the line
        MatchCollection matches = rx.Matches(fileLine);
        //Compares each match against what we are looking for
        foreach (Match m in matches)
        {
            string s = m.Value;
            compareLine(date, badge, s);
        }
    }
}
My question is this: What's a good way to search through the file? Should I order/clean it first?
You'd probably be best off following these steps:
- Parse each line into an object. A struct should be appropriate for these lines. Include a DateTime object as well as any other related fields. This can be done easily with Regex if you clean it up a bit. Use capture groups and repeaters. For a year, you can use (\d{4}) to get 4 digits in a row, instead of \d\d\d\d.
- Create a List<MyStruct> that holds each line as an object. Use LINQ to search through the list, for example:
var searchResults = from eachEntry in MyList
                    where eachEntry.Date > DateTime.Now && eachEntry.facility.Contains("003")
                    select eachEntry;
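Also, pass RegexOptions.Compiled to the Regex constructor and create the Regex once, before the loop. It may speed matching up, if only by a few milliseconds:
Regex rx = new Regex(@"31\d\d\d\d\d\d\d\d\d\d\d\d000000161\d\d\d\d\d0001", RegexOptions.Compiled);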
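The parsing step from this answer could look roughly like the sketch below. LogEntry, LogParser and the property names are made up for illustration, and the pattern assumes the layout given in the question (31, a 12-digit timestamp, 000000, facility, badge, then a 4-digit suffix whose meaning the question does not explain). Because the regex scans the whole line, it also picks up records that sit behind a truncated one.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical record type; field names are assumptions based on the question's layout.
public struct LogEntry
{
    public LogEntry(DateTime date, string facility, string badge, string suffix)
    {
        Date = date;
        Facility = facility;
        Badge = badge;
        Suffix = suffix;
    }

    public DateTime Date { get; }
    public string Facility { get; }
    public string Badge { get; }
    public string Suffix { get; }
}

public static class LogParser
{
    // 31 | yyyyMMddHHmm | 000000 | facility(3) | badge(5) | suffix(4)
    private static readonly Regex RecordRx = new Regex(
        @"31(\d{12})000000(\d{3})(\d{5})(\d{4})", RegexOptions.Compiled);

    public static List<LogEntry> Parse(IEnumerable<string> lines)
    {
        var entries = new List<LogEntry>();
        foreach (string line in lines)
        {
            // A single physical line may hold several records, some of them truncated
            foreach (Match m in RecordRx.Matches(line))
            {
                DateTime date;
                // Skip matches whose timestamp digits do not form a real date
                if (DateTime.TryParseExact(m.Groups[1].Value, "yyyyMMddHHmm",
                        CultureInfo.InvariantCulture, DateTimeStyles.None, out date))
                {
                    entries.Add(new LogEntry(date, m.Groups[2].Value,
                        m.Groups[3].Value, m.Groups[4].Value));
                }
            }
        }
        return entries;
    }
}
```

With the list in hand, a lookup is a plain LINQ filter, for example entries.Where(e => e.Facility == "161" && e.Badge == "20469"). If you search by badge often, grouping the list into a Dictionary keyed by badge is probably worth it, but that is a design choice beyond what the answer states.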
If you know in advance which entry you are looking for, i.e. you know the exact date, facility and badge, you do not need to parse the data at all. It might be faster to generate the expected string and do a simple string search instead of using regular expressions:
string expectedValue = getExpectedValue(date, badge);
// expectedValue = "31201007192000000000161206930004"
foreach (string line in lines)
{
if (line.IndexOf(expectedValue) >= 0)
{
// record found
}
}
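getExpectedValue is not defined in this answer. One possible shape for it, as a sketch only: RecordKey.Build below is a hypothetical helper that assumes the layout from the question, takes the facility as an extra parameter, and leaves out the trailing 4 digits because the question does not say what they mean. Searching for this shorter prefix still finds the record whatever its suffix is.

```csharp
using System;
using System.Globalization;

public static class RecordKey
{
    // Hypothetical helper: builds the part of a record that is known in advance.
    // Assumed layout: 31 | yyyyMMddHHmm | 000000 | facility(3) | badge(5) | suffix(4).
    // The suffix is deliberately omitted because its meaning is not given.
    public static string Build(DateTime date, string facility, string badge)
    {
        return "31"
            + date.ToString("yyyyMMddHHmm", CultureInfo.InvariantCulture)
            + "000000"
            + facility.PadLeft(3, '0')   // zero-pad to the fixed field width
            + badge.PadLeft(5, '0');
    }
}
```

For example, RecordKey.Build(new DateTime(2010, 7, 19, 20, 0, 0), "161", "20693") gives "3120100719200000000016120693", which is the sample record above without its final 0004.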
If you are only interested in whether the file contains your record or not, you can read the complete file into a single string and search that:
string completeFile = GetFileContents(file);
if (completeFile.IndexOf(expectedValue) >= 0)
{
// record found
}