Exploding acronyms to ensure a synthesizer reads them properly?
If I feed a speech synthesizer (Festival, in this case, but it applies to all) the following bit of text:
"At the USPGA championship in the US, the BBC reporter went MIA"
it reads "At the uspga championship in the us, the BBC reporter went mia".
In other words, I guess that because it's a cluster of consonants, it reads "BBC" properly but makes "words" out of the others.
The simplest thing to do, I suppose, would be to run it through a PHP script which looks for words of 2 or more capital letters and simply "explodes" each one into letters separated by spaces, like U S P G A.
I realise it would cause weirdness with things like "I told him N O T to do that", but in news reports that tends to happen less.
Here's the thing: I can "explode" a word OK. The problem is, I'm one of those people who, despite months of trying, just can't get their head round certain aspects of regex. In this case, the pattern I'm looking for is: two or more capital letters next to each other.
The reason I gave all the preamble above is in case there's a better way of doing this that I hadn't found or thought of, perhaps a db of acronyms to words or something.
A pattern to match acronyms:
/\b([A-Z]{2,})\b/
That matches any 'word' made up entirely of two or more capital letters.
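To see what that pattern actually picks up, here is a minimal sketch using preg_match_all() on the sentence from the question. The variable names are just for illustration.

```php
<?php
// Sample sentence taken from the question.
$input = "At the USPGA championship in the US, the BBC reporter went MIA";

// With the default PREG_PATTERN_ORDER, full matches end up in $matches[0].
preg_match_all('/\b([A-Z]{2,})\b/', $input, $matches);

// "At" is skipped because only one of its letters is a capital.
echo implode(', ', $matches[0]); // USPGA, US, BBC, MIA
```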
You can greatly simplify your code by using a lookahead assertion:
$input = "At the USPGA championship in the US, the BBC reporter went MIA";
echo preg_replace('~[A-Z](?=[A-Z])~', '$0 ', $input);
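[A-Z](?=[A-Z])
says "every capital that is followed by another capital". The lookahead only checks the next character without consuming it, so the replacement '$0 ' puts a space after each such capital.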
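One thing worth knowing is that the lookahead pattern has no word boundaries, so it also splits runs of capitals inside mixed-case words. The sketch below compares it with a word-boundary pattern on a made-up input ("McDONALD" is my own example, not from the question).

```php
<?php
// Hypothetical input containing a mixed-case word.
$input = "McDONALD joined the BBC";

// Lookahead version: a space after every capital that precedes another capital.
$lookahead = preg_replace('~[A-Z](?=[A-Z])~', '$0 ', $input);

// Word-boundary version: only words made up entirely of capitals are split.
$bounded = preg_replace_callback(
    '/\b[A-Z]{2,}\b/',
    function ($m) {
        return implode(' ', str_split($m[0]));
    },
    $input
);

echo $lookahead, "\n"; // McD O N A L D joined the B B C
echo $bounded, "\n";   // McDONALD joined the B B C
```

Which behaviour you want depends on the text being fed to the synthesizer. For plain news copy the two will mostly agree.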
Using Delan's regular expression with preg_replace_callback() makes it very easy to put a single space between all the letters of the identified acronyms:
$input = "At the USPGA championship in the US, the BBC reporter went MIA";
function cb_separateCapitals($matches) {
    // $matches[0] is the whole acronym; split it into letters and rejoin them with spaces
    return implode(' ', str_split($matches[0]));
}
echo $input,'<br />';
$output = preg_replace_callback('/\b([A-Z]{2,})\b/','cb_separateCapitals',$input);
echo $output;
giving
At the USPGA championship in the US, the BBC reporter went MIA
At the U S P G A championship in the U S, the B B C reporter went M I A
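The callback approach also leaves room for the "N O T" problem mentioned in the question. A rough sketch, assuming you keep a small list of all-caps words that should be left alone (the $keepAsWords list and the sample sentence are hypothetical):

```php
<?php
// Hypothetical exception list: all-caps words the synthesizer should read as words.
$keepAsWords = ['NOT', 'NATO'];

// Made-up sample sentence for illustration.
$input = "He was told NOT to call the BBC or NATO";

$output = preg_replace_callback(
    '/\b[A-Z]{2,}\b/',
    function ($matches) use ($keepAsWords) {
        // Leave listed words untouched; space out everything else.
        if (in_array($matches[0], $keepAsWords, true)) {
            return $matches[0];
        }
        return implode(' ', str_split($matches[0]));
    },
    $input
);

echo $output; // He was told NOT to call the B B C or NATO
```

The same lookup could just as well come from a database table, or map an acronym to its expanded words instead of leaving it as is.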
"[A-Z][A-Z]"
will match any instance of two capital letters next to each other.
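Note that this pattern matches exactly two letters at a time, so a longer acronym comes back in pieces. A quick sketch comparing it with a {2,} quantifier on the sentence from the question:

```php
<?php
// Sample sentence taken from the question.
$input = "At the USPGA championship in the US, the BBC reporter went MIA";

// Exactly two capitals per match: longer acronyms are cut into pairs.
preg_match_all('/[A-Z][A-Z]/', $input, $pairs);

// Two or more capitals per match: each acronym is matched whole.
preg_match_all('/[A-Z]{2,}/', $input, $runs);

echo implode(',', $pairs[0]), "\n"; // US,PG,US,BB,MI
echo implode(',', $runs[0]), "\n";  // USPGA,US,BBC,MIA
```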