
Regex gurus! Here's a teaser: mixed thousands separators and CSVs

I've got a string like...

"labour 18909, liberals 12,365,conservatives 14,720"

...and I'd like a regex which can get rid of any thousands separators so I can pull out the numbers easily. Or even a regex which could give me a tidy array like:

(labour => 18909, liberals => 12365, conservatives => 14720)

Oh, I wish I had the time to figure out regexes! Maybe I'll buy one as a toilet book, mmm.


Two-liner. Will also get Independents:

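preg_match_all('/([a-zA-Z]+)\s*([\d,]+)(?:,|$)/', $str, $matches);
// str_replace() accepts an array subject, so this strips the thousands separators from every captured number
$totals = array_combine($matches[1], str_replace(',', '', $matches[2]));

/* $totals:
Array
(
    [labour] => 18909
    [liberals] => 12365
    [conservatives] => 14720
)
*/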


You could do a search and replace such as with sed:

> echo '"labour 18909, liberals 12,365,conservatives 14,720"' |
    sed -r -e 's/([0-9]),([0-9]{3})/\1\2/g'
"labour 18909, liberals 12365,conservatives 14720"

I'm not entirely certain what the PHP syntax would be but it basically takes a pattern consisting of a digit (X), a comma, and three other digits (Y) and replaces them with just the XY bit.
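For anyone who wants to try the same substitution outside sed, here is a minimal sketch in Python (used purely for illustration, it is not the PHP the question asks about). The function name strip_thousands_simple is made up for this example; the pattern is the one from the sed command.

```python
import re

def strip_thousands_simple(text):
    # Same idea as the sed command: a digit, a comma, then three digits,
    # replaced by the same digits without the comma.
    return re.sub(r'(\d),(\d{3})', r'\1\2', text)

s = 'labour 18909, liberals 12,365,conservatives 14,720'
print(strip_thousands_simple(s))
# labour 18909, liberals 12365,conservatives 14720

# Caveat: the digits around each comma are consumed by the match, so a
# second separator in the same number is skipped on a single pass.
print(strip_thousands_simple('1,234,567'))
# 1234,567
```

If numbers in the millions can occur, running the substitution until nothing changes, or matching the three digits with a lookahead instead of a group, should cover them.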


What you want seems to be to remove commas only if they are surrounded by digits. Sorry, I don't know the particulars of PHP regex syntax, but a couple of more abstract examples are

str.replace("(\d+),(\d+)", "$1$2")
s/([0-9]+),([0-9]+)/\1\2/g

These would handle all correctly formatted numbers, but they would also change input that isn't really a proper number, such as "2,41,11".
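One way to avoid touching badly grouped input is to use lookarounds instead of capture groups. This is a different technique from the two patterns above, sketched here in Python under the assumption that a real thousands separator always has a digit before it and exactly three digits after it. The names THOUSANDS_SEP and strip_thousands_strict are invented for the example.

```python
import re

# A comma counts as a thousands separator only when it has a digit before it
# and exactly three digits (not followed by a fourth) after it.
THOUSANDS_SEP = re.compile(r'(?<=\d),(?=\d{3}(?!\d))')

def strip_thousands_strict(text):
    # Lookarounds consume nothing, so every separator in a long number
    # is found in a single pass.
    return THOUSANDS_SEP.sub('', text)

print(strip_thousands_strict('labour 18909, liberals 12,365,conservatives 14,720'))
# labour 18909, liberals 12365,conservatives 14720
print(strip_thousands_strict('1,234,567'))
# 1234567
print(strip_thousands_strict('2,41,11'))
# 2,41,11
```

This still cannot tell a separator from a field delimiter when two three-digit values sit side by side with no space, e.g. "100,200", so it only helps if the data never does that.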


Well, using the following regular expression you can separate the numbers from the rest:

labour\s*([\d,.]+),\s*liberals\s*([\d,.]+),\s*conservatives\s*([\d,.]+)

After all, a number clearly ends at the point where no digit follows any more. You can then remove the commas from the captured values.

PowerShell demo (a little bit condensed, sorry):

PS Home:\> $s -match 'labour\s*(?<labour>[\d,.]+),\s*liberals\s*(?<liberals>[\d,.]+),\s*conservatives\s*(?<conservatives>[\d,.]+)' |
               Out-Null
PS Home:\> "Labour: {0}`nLiberals: {1}`nConservatives: {2}" -f `
           ($Matches['labour'],$Matches['liberals'],$Matches['conservatives'] |
               foreach { $_ -replace ',' })

Labour: 18909
Liberals: 12365
Conservatives: 14720
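The same named-group idea can be sketched in Python, for readers without PowerShell to hand. This is only an illustration: PATTERN and party_totals are names invented here, and it assumes the three parties always appear in this order with whole-number totals.

```python
import re

# One named group per party; the pattern is the one shown above.
PATTERN = re.compile(
    r'labour\s*(?P<labour>[\d,.]+),\s*'
    r'liberals\s*(?P<liberals>[\d,.]+),\s*'
    r'conservatives\s*(?P<conservatives>[\d,.]+)'
)

def party_totals(text):
    match = PATTERN.search(text)
    if match is None:
        return None  # the string does not have the expected layout
    # Drop the thousands separators, then convert. Assumes integers:
    # a value containing '.' would make int() raise ValueError.
    return {name: int(value.replace(',', ''))
            for name, value in match.groupdict().items()}

print(party_totals('labour 18909, liberals 12,365,conservatives 14,720'))
# {'labour': 18909, 'liberals': 12365, 'conservatives': 14720}
```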


In a former life, I did a lot of data processing like this, except there were hundreds of millions of records taking days to process.

I always found it wise to follow this strategy:

  1. Know your data

    . The customer will always say their data is perfect, well formed and correct

    . It invariably is a pile of steaming dodo poop.

  2. Define the rules for the data; sometimes it is easier to define what the data isn't

  3. Use a regex or even macro search and replace within an editor to find where the data breaks the rules

  4. Repair, request new data sets, discard data

  5. Repeat steps 3 and 4 till the data is clean

  6. Now think about the format of the data: can the regex matching be simplified by some simple manipulation of the data?

    . For example, in your case, replace a comma followed by multiple white space with a single comma

    . Then strip every comma surrounded by numbers

    . Strip multiple white space (leave single white space)

    . Strip white space immediately before an alpha character

  7. Define rules for this new data set and make sure it's clean

    . this can now include range checking on the numeric data

    . even more complex rules

  8. Now your data looks like "labour 18909,liberals 12365,conservatives 14720"

  9. Build your import tool for this new data set (the easy bit)

  10. Make sure you have a repeatable system for 1..9 as the customer will want a simple change or just this extra little bit they need right now.
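As a rough illustration of steps 6 to 9 for the string in the question, here is a Python sketch. Everything in it is an assumption made for the example: the function names normalise and parse, the RECORD_RULE pattern, and the reading of "multiple white space" after a comma as "one or more" (so that the single space in the sample is covered too).

```python
import re

def normalise(raw):
    # Step 6 clean-ups, in the order listed above.
    text = re.sub(r',\s+', ',', raw)             # comma + white space -> single comma
    text = re.sub(r'(?<=\d),(?=\d)', '', text)   # comma surrounded by digits
    text = re.sub(r'\s{2,}', ' ', text)          # collapse runs of white space
    text = re.sub(r'\s+(?=[A-Za-z])', '', text)  # white space right before a letter
    return text.strip()

# Step 7: a rule for the cleaned data, e.g. "name number,name number,..."
RECORD_RULE = re.compile(r'[A-Za-z]+ \d+(?:,[A-Za-z]+ \d+)*')

def parse(raw):
    clean = normalise(raw)
    if not RECORD_RULE.fullmatch(clean):
        raise ValueError(f'data breaks the rules: {clean!r}')
    # Step 9: the import itself is now the easy bit.
    totals = {}
    for field in clean.split(','):
        name, number = field.split(' ')
        totals[name] = int(number)
    return totals

raw = 'labour 18909, liberals 12,365,conservatives 14,720'
print(normalise(raw))
# labour 18909,liberals 12365,conservatives 14720
print(parse(raw))
# {'labour': 18909, 'liberals': 12365, 'conservatives': 14720}
```

Raising an error on anything that breaks the rule, rather than guessing, keeps the "find where the data breaks the rules" loop of steps 3 to 5 repeatable.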
