
Regex to extract portions of file name

I have text files formatted as such:

R156484COMP_004A7001_20100104_065119.txt

I need to consistently extract the R******COMP part, the 004A7001 serial number, and the 20100104 date; I don't care about the trailing 065119 number. The problem is that not all of the files being parsed follow this exact naming convention. Some may look like this:

R168166CRIT_156B2075_SU2_20091223_123456.txt

or

R285476COMP_SU1_125A6025_20100407_123456.txt

So how can I use a regex instead of Split to make sure I always get the serial (e.g. 004A7001), the date (e.g. 20100104), and the R******COMP (or CRIT) part?

Here is what I do now, but it only handles files named like my first example.

if (file.Count(c => c == '_') != 3) continue;

and further down in the code I have:

string RNumber = Path.GetFileNameWithoutExtension(file);

string RNumberE = RNumber.Split('_')[0];

string RNumberD = RNumber.Split('_')[1];

string RNumberDate = RNumber.Split('_')[2];

DateTime dateTime = DateTime.ParseExact(RNumberDate, "yyyyMMdd", Thread.CurrentThread.CurrentCulture);
string cmmDate = dateTime.ToString("dd-MMM-yyyy");

UPDATE: This is where I am now. I get an error when I try to parse RNumberDate into an actual date: "Cannot implicitly convert type 'RegularExpressions.Match' to 'string'".

 string RNumber = Path.GetFileNameWithoutExtension(file);

 Match RNumberE = Regex.Match(RNumber, @"^(R|L)\d{6}(COMP|CRIT|TEST|SU[1-9])(?=_)", RegexOptions.IgnoreCase);

 Match RNumberD = Regex.Match(RNumber, @"(?<=_)\d{3}[A-Z]\d{4}(?=_)", RegexOptions.IgnoreCase);
 Match RNumberDate = Regex.Match(RNumber, @"(?<=_)\d{8}(?=_)", RegexOptions.IgnoreCase);



DateTime dateTime = DateTime.ParseExact(RNumberDate, "yyyyMMdd", Thread.CurrentThread.CurrentCulture);
string cmmDate = dateTime.ToString("dd-MMM-yyyy");


You can solve this with several regular expressions, one per part of the name.

compNumber:   /^R\d{6}(COMP|CRIT)(?=_)/
date:         /(?<=_)\d{8}(?=_)/
serialNumber: /(?<=_)\d{3}[A-Z]\d{4}(?=_)/

part:         /(?<=_).*?(?=_)/

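Run each regular expression on the string separately to pull out the parts. In C#, read the matched text from the Value property of the returned Match; passing the Match object itself to DateTime.ParseExact is what causes the "Cannot implicitly convert type" error.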
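
As a minimal sketch of that approach, the C# below runs the three patterns one at a time and converts the date. The class FileNameParts and the helpers Find and FormatDate are hypothetical names, not part of the question's code, and the sketch assumes CultureInfo.InvariantCulture is acceptable in place of the current thread culture used in the question.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text.RegularExpressions;

public static class FileNameParts
{
    // Hypothetical helper: returns the matched text, or null when the pattern does not match.
    public static string Find(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : null;
    }

    // Hypothetical helper: converts a yyyyMMdd string to dd-MMM-yyyy.
    // Assumption: the invariant culture is used so month names do not depend on the machine.
    public static string FormatDate(string yyyymmdd)
    {
        DateTime parsed = DateTime.ParseExact(yyyymmdd, "yyyyMMdd", CultureInfo.InvariantCulture);
        return parsed.ToString("dd-MMM-yyyy", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        string name = Path.GetFileNameWithoutExtension("R168166CRIT_156B2075_SU2_20091223_123456.txt");

        // One regular expression per part, each run separately on the same string.
        string compNumber = Find(name, @"^R\d{6}(COMP|CRIT)(?=_)");
        string serial = Find(name, @"(?<=_)\d{3}[A-Z]\d{4}(?=_)");
        string date = Find(name, @"(?<=_)\d{8}(?=_)");

        if (compNumber == null || serial == null || date == null)
        {
            Console.WriteLine("Skipping unrecognised file name: " + name);
            return;
        }

        Console.WriteLine(compNumber);       // R168166CRIT
        Console.WriteLine(serial);           // 156B2075
        Console.WriteLine(FormatDate(date)); // 23-Dec-2009
    }
}
```

Because each part is located independently, the optional SU token can sit anywhere between the underscores without affecting the result.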


I don't completely understand the rules for parsing your string, but here is some advice that might help:

Have a look at Regex.Split and Regex.Matches to break your string up using a regex.

To create your regex, I suggest an excellent regex builder/checker/tutorial. With that tool, you can enter a bunch of strings in the big text area (e.g. your serial numbers or whatever they are) and interactively enter your regex, seeing which parts currently match. There's a "tutorial" on the right side of the page that will help you learn how to build the regex.
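
One way this could look, as a rough sketch: split the name on underscores with Regex.Split, then decide what each token is from its shape rather than its position. The class TokenClassifier and the method Classify are hypothetical names, and the token patterns are assumptions based on the three example file names in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenClassifier
{
    // Hypothetical helper. Returns { R-number, serial, date }; an entry stays null when no token matches.
    public static string[] Classify(string nameWithoutExtension)
    {
        string[] result = new string[3];

        foreach (string token in Regex.Split(nameWithoutExtension, "_"))
        {
            if (result[0] == null && Regex.IsMatch(token, @"^R\d{6}(COMP|CRIT)$", RegexOptions.IgnoreCase))
                result[0] = token; // e.g. R285476COMP
            else if (result[1] == null && Regex.IsMatch(token, @"^\d{3}[A-Z]\d{4}$", RegexOptions.IgnoreCase))
                result[1] = token; // e.g. 125A6025
            else if (result[2] == null && Regex.IsMatch(token, @"^\d{8}$"))
                result[2] = token; // e.g. 20100407
            // Anything else (SU1, the trailing 6-digit number) is ignored.
        }

        return result;
    }

    public static void Main()
    {
        string[] parts = Classify("R285476COMP_SU1_125A6025_20100407_123456");
        Console.WriteLine(string.Join(" | ", parts)); // R285476COMP | 125A6025 | 20100407
    }
}
```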


string filename = "R285476COMP_SU1_125A6025_20100407_123456.txt";

Match m = Regex.Match(filename,
    @"^(R\d+(?:COMP|CRIT))_(?:SU\d+_)?(\d+[A-Z]+\d+)_(?:SU\d+_)?(\d{8})_.*$",
    RegexOptions.IgnoreCase);

if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value);    // R285476COMP
    Console.WriteLine(m.Groups[2].Value);    // 125A6025
    Console.WriteLine(m.Groups[3].Value);    // 20100407
}
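
To check that this single pattern covers all three naming variants from the question, it can be wrapped in a small helper and run over each example. This is only a sketch; the class SinglePattern and the method Parse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglePattern
{
    // Same pattern as above: the optional SU token may come before or after the serial.
    private static readonly Regex NamePattern = new Regex(
        @"^(R\d+(?:COMP|CRIT))_(?:SU\d+_)?(\d+[A-Z]+\d+)_(?:SU\d+_)?(\d{8})_.*$",
        RegexOptions.IgnoreCase);

    // Hypothetical helper. Returns { R-number, serial, date }, or null when the name does not match.
    public static string[] Parse(string fileName)
    {
        Match m = NamePattern.Match(fileName);
        if (!m.Success) return null;
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        string[] fileNames =
        {
            "R156484COMP_004A7001_20100104_065119.txt",
            "R168166CRIT_156B2075_SU2_20091223_123456.txt",
            "R285476COMP_SU1_125A6025_20100407_123456.txt"
        };

        foreach (string fileName in fileNames)
        {
            string[] parts = Parse(fileName);
            Console.WriteLine(parts == null ? fileName + ": no match" : string.Join(" | ", parts));
        }
        // R156484COMP | 004A7001 | 20100104
        // R168166CRIT | 156B2075 | 20091223
        // R285476COMP | 125A6025 | 20100407
    }
}
```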