How to split a string into an array

I have a string of attribute names and definitions. I am trying to split the string on the attribute names into a Dictionary<string, string>, where the key is the attribute name and the value is the definition. I won't know the attribute names ahead of time, so I have been trying to split on the ":" character, but I am having trouble because the attribute name is not included in the split.

For example, I need to split this string on "Organization:", "OrganizationType:", and "Nationality:" into a Dictionary. Any ideas on the best way to do this with C# .NET?

Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)


Here is some sample code to help:

private static void Main()
{
    const string str = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";

    var array = str.Split(':');
    var dictionary = array.ToDictionary(x => x[0], x => x[1]);

    foreach (var item in dictionary)
    {
        Console.WriteLine("{0}: {1}", item.Key, item.Value);
    }

    // Expecting to see the following output:

    // Organization: Name of a governmental, military or other organization.
    // OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party.
    // Nationality: Organization nationality if mentioned in the document. (required)
}

Here is a visual explanation of what I am trying to do:

http://farm5.static.flickr.com/4081/4829708565_ac75b119a0_b.jpg


I'd do it in two phases. First, split the input into the property pairs using something like this:

Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)")

This looks for any whitespace character followed by an alphabetic string and then a colon. The alphabetic string must start with a capital letter. The split happens on that whitespace, which gives you three strings of the form "PropertyName: PropertyValue". Splitting each of those on its first colon is then pretty easy (I'd personally just use Substring and IndexOf rather than another regular expression), but you sound like you can do that bit fine on your own. Shout if you do want help with the second split.

The only thing to say is be careful in case you get false matches because the input is awkward, for example a capitalized word followed by a colon inside a description. In that case you'll just have to make the regex more complicated to compensate.
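
For completeness, here is a minimal sketch of both phases together. The class and method names (AttributeParser, Parse) are made up for illustration, and it assumes every attribute name is a single alphabetic word starting with a capital letter, as in the sample input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the question.
public static class AttributeParser
{
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();

        // Phase 1: split on whitespace that is directly followed by "Name:".
        // The lookahead keeps the attribute name in the following piece.
        string[] pairs = Regex.Split(input.Trim(), @"\s(?=[A-Z][A-Za-z]*:)");

        foreach (string pair in pairs)
        {
            // Phase 2: cut at the first colon only, so colons that appear
            // later in the description stay part of the value.
            int colon = pair.IndexOf(':');
            if (colon < 0)
            {
                continue; // no attribute name in this fragment, skip it
            }

            string key = pair.Substring(0, colon).Trim();
            string value = pair.Substring(colon + 1).Trim();
            result[key] = value; // assumption: the last value wins on duplicate names
        }

        return result;
    }
}
```

Calling AttributeParser.Parse on the sample string from the question should give three entries keyed by Organization, OrganizationType and Nationality. The colon after "types" stays inside the OrganizationType value because "types" is lowercase and so is not treated as an attribute name.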


You would need some delimiter to indicate the end of each pair, as opposed to having one large string with sections in between, e.g.

Organization: Name of a governmental, military or other organization.|OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) |Nationality: Organization nationality if mentioned in the document. (required)

Notice the | character, which indicates the end of each pair. Then it is just a case of using a very specific delimiter between name and value, something that is not likely to appear in the description text. Instead of one colon you could use two (::), since a single colon could crop up in the text on occasion, as others have pointed out. With "::" after each attribute name, you would just need to do:

// split the string into rows
string[] rows = myString.Split('|');
Dictionary<string, string> pairs = new Dictionary<string, string>();
foreach (var r in rows)
{
    // split each row into a pair and add to the dictionary
    string[] split = Regex.Split(r, "::");
    pairs.Add(split[0], split[1]);
}

You can use LINQ as others have suggested; the above is written out for readability so you can see what is happening.
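
For reference, a LINQ version might look roughly like the sketch below. The class name DelimitedParser is made up, and it assumes the input has been produced with "|" between pairs and "::" between name and value, which is not the format of the string in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; assumes "|" ends each pair
// and "::" separates the attribute name from its description.
public static class DelimitedParser
{
    public static Dictionary<string, string> Parse(string input)
    {
        return input
            // split the string into rows, ignoring an empty trailing row
            .Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries)
            // split each row into at most two parts: name and description
            .Select(row => row.Split(new[] { "::" }, 2, StringSplitOptions.None))
            // drop rows that do not contain the "::" delimiter
            .Where(parts => parts.Length == 2)
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());
    }
}
```

Note that ToDictionary throws an ArgumentException if the same attribute name appears twice, so this only suits input where the names are unique.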

Another alternative is to devise a custom regex to do what you need, but again you would be making a lot of assumptions about how the description text is formatted.


Considering that each word in front of the colon always has at least one capital (please confirm), you could solve this by using regular expressions (otherwise you'd end up splitting on all colons, which also appear inside the sentences):

var resultDict = Regex.Split(input, @"(?<= [A-Z][a-zA-Z]+):")
                 .ToDictionary(a => a[0], a => a[1]);

The (?<=...) is a positive look-behind expression that doesn't "eat up" the characters, thus only the colon is removed from the output. Tested with your input here.

The [A-Z][a-zA-Z]+ means: a word that starts with a capital.

Note that, as others have suggested, a "smarter" delimiter will make parsing easier, as does escaping the delimiter (e.g. "::" or ":") when you are required to use colons. I'm not sure whether those are options for you, though, hence the solution with regular expressions above.

Edit

For one reason or another, I kept getting errors when using ToDictionary, so here's the unrolled version, which at least works. Apologies for the earlier non-working version. Note that the regular expression has changed: the first one did not separate the key from the preceding value.

var splitArray = Regex.Split(input, @"(?<=( |^)[A-Z][a-zA-Z]+):|( )(?=[A-Z][a-zA-Z]+:)")
                            .Where(a => a.Trim() != "").ToArray();

Dictionary<string, string> resultDict = new Dictionary<string, string>();
for(int i = 0; i < splitArray.Count(); i+=2)
{
    resultDict.Add(splitArray[i], splitArray[i+1]);
}

Note: the regular expression becomes a tad complex in this scenario. As suggested in the thread below, you can split it into smaller steps. Also note that the current regex produces a few empty entries, which I remove with the Where expression above. The for loop should not be needed if you manage to get ToDictionary working.
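
One possible way to get ToDictionary working is to match whole key/value pairs instead of splitting, so that each match carries both halves. This is only a sketch under the same assumption as above (attribute names are single capitalized alphabetic words); the class name MatchParser and the group names key and value are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not the code from the answer above.
public static class MatchParser
{
    public static Dictionary<string, string> Parse(string input)
    {
        // key:   a capitalized alphabetic word directly before a colon
        // value: everything up to the next " Name:" or the end of the input
        const string pattern =
            @"(?<key>[A-Z][A-Za-z]*):\s*(?<value>.*?)(?=\s+[A-Z][A-Za-z]*:|\s*$)";

        return Regex.Matches(input, pattern, RegexOptions.Singleline)
            .Cast<Match>() // MatchCollection is non-generic on older frameworks
            .ToDictionary(m => m.Groups["key"].Value,
                          m => m.Groups["value"].Value);
    }
}
```

Because each match already contains one key and one value, no filtering of empty entries and no index arithmetic is needed. As with any ToDictionary call, a repeated attribute name would throw an ArgumentException.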
