Regex gurus! Here's a teaser: mixed thousands separators and CSVs

2022-12-28 15:56

I've got a string like...

"labour 18909, liberals 12,365,conservatives 14,720"

...and I'd like a regex that gets rid of any thousands separators so I can pull out the numbers easily. Or even a regex that gives me a tidy array like:

(labour => 18909, liberals => 12365, conservatives => 14720)

Oh, I wish I had the time to figure out regexes! Maybe I'll buy one as a toilet book, mmm.

Two-liner. Will also get Independents:

preg_match_all('/([a-zA-Z]+)\s*([\d,]+)(?:,|$)/', $str, $matches);
$totals = array_combine($matches[1], $matches[2]);

/* $totals:
Array
(
    [labour] => 18909
    [liberals] => 12,365
    [conservatives] => 14,720
)
*/
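Note that the values above still carry their thousands separators. A minimal follow-up sketch, assuming the same pattern and an input held in $str without the surrounding quotes, strips the commas from each captured number before combining. The array_map step is an addition for illustration, not part of the two-liner:

```php
<?php
$str = 'labour 18909, liberals 12,365,conservatives 14,720';

// Same pattern as the two-liner: a name, optional whitespace, then digits and commas.
preg_match_all('/([a-zA-Z]+)\s*([\d,]+)(?:,|$)/', $str, $matches);

// Drop the thousands separators and cast each captured value to an integer.
$values = array_map(function ($n) {
    return (int) str_replace(',', '', $n);
}, $matches[2]);

$totals = array_combine($matches[1], $values);

print_r($totals);
/* prints:
Array
(
    [labour] => 18909
    [liberals] => 12365
    [conservatives] => 14720
)
*/
```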

You could do a search and replace, for example with sed:

> echo '"labour 18909, liberals 12,365,conservatives 14,720"' |
    sed -r -e 's/([0-9]),([0-9]{3})/\1\2/g'
"labour 18909, liberals 12365,conservatives 14720"

I'm not entirely certain what the PHP syntax would be but it basically takes a pattern consisting of a digit (X), a comma, and three other digits (Y) and replaces them with just the XY bit.
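On the PHP side, the rough equivalent would presumably be preg_replace with the same pattern. This is only a sketch; the variable names $str and $clean are picked here for illustration:

```php
<?php
$str = '"labour 18909, liberals 12,365,conservatives 14,720"';

// A digit (X), a comma, then three digits (Y): keep X and Y, drop the comma.
$clean = preg_replace('/([0-9]),([0-9]{3})/', '$1$2', $str);

echo $clean, "\n";
// prints: "labour 18909, liberals 12365,conservatives 14720"
```

The commas that separate the records survive because they are followed by a space or a letter rather than three digits.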

What you want seems to be to remove commas only if they are surrounded by digits. Sorry, I don't know the particulars of PHP regex syntax, but a couple of more abstract examples are

str.replace("(\d+),(\d+)", "$1$2")
s/([0-9]+),([0-9]+)/\1\2/g

These would catch the well-formed numbers in the sample, but would also strip commas from something that isn't really a proper number, such as "2,41,11".
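One way to tighten this up in PHP is to use lookarounds, so the comma is only removed when a digit precedes it and exactly three digits follow. The sketch below is an assumption about what "proper" should mean here, and stripSeparators is a hypothetical helper name. It also sidesteps a side effect of the capture-group versions: because matches cannot overlap, a single pass over "1,234,567" only removes the first comma.

```php
<?php
// Hypothetical helper: remove a comma only when it sits between a digit
// and a group of exactly three digits (not followed by a fourth digit).
function stripSeparators($text)
{
    return preg_replace('/(?<=\d),(?=\d{3}(?!\d))/', '', $text);
}

$sample    = stripSeparators('labour 18909, liberals 12,365,conservatives 14,720');
$malformed = stripSeparators('2,41,11');    // left untouched
$millions  = stripSeparators('1,234,567');  // both commas removed in one pass

// The capture-group pattern, for comparison, leaves the second comma behind.
$onePass = preg_replace('/(\d+),(\d+)/', '$1$2', '1,234,567');

echo $sample, "\n", $malformed, "\n", $millions, "\n", $onePass, "\n";
```

The lookarounds consume no characters, which is why neighbouring separators in the same number can all match. A record such as "a 12,365,720" stays ambiguous, though: no regex can tell whether that is one number or two values.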

Well, using the following regular expression you can separate the numbers from the rest:

labour\s*([\d,.]+),\s*liberals\s*([\d,.]+),\s*conservatives\s*([\d,.]+)

After all, a number clearly ends at the point where no digit follows anymore. You can then proceed with removing the commas from the values.

PowerShell demo (a little bit condensed, sorry):

PS Home:\> $s -match 'labour\s*(?<labour>[\d,.]+),\s*liberals\s*(?<liberals>[\d,.]+),\s*conservatives\s*(?<conservatives>[\d,.]+)' |
               Out-Null
PS Home:\> "Labour: {0}`nLiberals: {1}`nConservatives: {2}" -f `
           ($Matches['labour'],$Matches['liberals'],$Matches['conservatives'] |
               foreach { $_ -replace ',' })

Labour: 18909
Liberals: 12365
Conservatives: 14720
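Since the question is about PHP, the same fixed-label idea might be sketched there with preg_match and named groups. This assumes the three party names always appear in this order; $str and $totals are illustrative names:

```php
<?php
$str = 'labour 18909, liberals 12,365,conservatives 14,720';

// Same expression as above, with a named group per party.
$pattern = '/labour\s*(?<labour>[\d,.]+),\s*liberals\s*(?<liberals>[\d,.]+)'
         . ',\s*conservatives\s*(?<conservatives>[\d,.]+)/';

$totals = [];
if (preg_match($pattern, $str, $m)) {
    foreach (['labour', 'liberals', 'conservatives'] as $party) {
        // Remove the commas from the captured value, then cast to an integer.
        $totals[$party] = (int) str_replace(',', '', $m[$party]);
    }
}

print_r($totals);
```

The trailing comma after each number is handed back by backtracking, because the pattern requires a literal comma before the next label.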

In a former life, I did a lot of data processing like this, except there were hundreds of millions of records taking days to process.

I always found it wise to follow this strategy:

1. Know your data.
   - The customer will always say their data is perfect, well formed and correct.
   - It invariably is a pile of steaming dodo poop.
2. Define the rules for the data; sometimes it is easier to define what the data isn't.
3. Use a regex, or even a macro search and replace within an editor, to find where the data breaks the rules.
4. Repair, request new data sets, or discard data.
5. Repeat steps 3 and 4 until the data is clean.
6. Now think about the format of the data: can the regex matching be simplified by some simple manipulation of the data? For example, in your case:
   - Replace a comma followed by multiple white space with a single comma.
   - Then strip every comma surrounded by numbers.
   - Strip multiple white space (leave single white space).
   - Strip white space immediately before an alpha character.
7. Define rules for this new data set and make sure it's clean.
   - This can now include range checking on the numeric data.
   - Even more complex rules.
8. Now your data looks like "labour 18909,liberals 12365,conservatives 14720".
9. Build your import tool for this new data set (the easy bit).
10. Make sure you have a repeatable system for 1..9, as the customer will want a simple change or just this extra little bit they need right now.
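The clean-up and import steps for the sample string could be sketched in PHP roughly as follows. The function names normalise and importTotals are hypothetical, and the validation rule (letters, one space, digits) is an assumption about what "clean" means for this data:

```php
<?php
// Hypothetical helper: apply the simple manipulations in the order listed.
function normalise($raw)
{
    $s = trim($raw);
    $s = preg_replace('/,\s+/', ',', $s);            // comma + white space -> single comma
    $s = preg_replace('/(?<=\d),(?=\d)/', '', $s);   // strip commas surrounded by digits
    $s = preg_replace('/\s+/', ' ', $s);             // collapse runs of white space
    $s = preg_replace('/\s+(?=[a-zA-Z])/', '', $s);  // no white space right before a letter
    return $s;
}

// Hypothetical import step: every record must look like "name number".
function importTotals($clean)
{
    $totals = [];
    foreach (explode(',', $clean) as $record) {
        if (!preg_match('/^([a-zA-Z]+) (\d+)$/', $record, $m)) {
            // The data breaks the rules: surface it instead of guessing.
            throw new UnexpectedValueException("Bad record: $record");
        }
        $totals[$m[1]] = (int) $m[2];
    }
    return $totals;
}

$clean  = normalise('labour 18909, liberals 12,365,conservatives 14,720');
$totals = importTotals($clean);

echo $clean, "\n";
print_r($totals);
```

Keeping the normalisation and the import as separate functions makes the whole run repeatable: when the customer sends a new file, the same two calls apply, and any record that breaks the rules raises an exception rather than slipping through.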
