Programmatically determine whether to describe an object with "a" or "an"?
I have a database of nouns (e.g. "house", "exclamation point", "apple") that I need to output and describe in my application. It's hard to put together a natural-sounding sentence to describe an item without using "a" or "an" - "a house is BIG", "an exclamation point is SMALL", etc.
Is there any function, library, or hack I can use in PHP to determine whether it is more appropriate to describe any given noun with A or AN?
I needed this for a C# project, so here's the C# port of the Python code mentioned above. Make sure to include using System.Text.RegularExpressions; in your source file.
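private string GetIndefiniteArticle(string noun_phrase)
{
    // Use the first word of the phrase; default to "an" if there is none.
    var m = Regex.Match(noun_phrase, @"\w+");
    if (!m.Success)
        return "an";
    string word = m.Groups[0].Value;
    var wordi = word.ToLower();

    // Word beginnings that take "an" despite starting with a consonant letter.
    foreach (string anword in new string[] { "euler", "heir", "honest", "hono" })
        if (wordi.StartsWith(anword))
            return "an";
    if (wordi.StartsWith("hour") && !wordi.StartsWith("houri"))
        return "an";

    // Letters whose spoken names start with a vowel sound ('f' is "ef", so it belongs here; 'd' does not).
    var char_list = new char[] { 'a', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'x' };
    if (wordi.Length == 1)
    {
        if (wordi.IndexOfAny(char_list) == 0)
            return "an";
        else
            return "a";
    }

    // Capitalized abbreviations. Anchored with ^ because Regex.Match searches the whole string,
    // whereas the Python original used re.match, which only matches at the start.
    if (Regex.Match(word, "^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]").Success)
        return "an";

    // Vowel-initial words that take "a". Verbatim strings keep \b as a regex word boundary
    // instead of a backspace character.
    foreach (string regex in new string[] { @"^e[uw]", @"^onc?e\b", @"^uni([^nmd]|mo)", @"^u[bcfhjkqrst][aeiou]" })
    {
        if (Regex.IsMatch(wordi, regex))
            return "a";
    }

    // Special capital words (UK, UN)
    if (Regex.IsMatch(word, "^U[NK][AIEO]"))
        return "a";
    else if (word == word.ToUpper())
    {
        // All-caps words are treated as initialisms read letter by letter.
        if (wordi.IndexOfAny(char_list) == 0)
            return "an";
        else
            return "a";
    }

    // Ordinary words beginning with a vowel
    if (wordi.IndexOfAny(new char[] { 'a', 'e', 'i', 'o', 'u' }) == 0)
        return "an";

    // 'y' followed by certain consonants sounds like a vowel
    if (Regex.IsMatch(wordi, "^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)"))
        return "an";

    return "a";
}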
I was also looking for such a solution, but in JavaScript, so I ported it over to JS. You can check out the actual project on GitHub: https://github.com/rigoneri/indefinite-article.js
Here is the code snippet:
function indefinite_article(phrase) {
    // Getting the first word
    var match = /\w+/.exec(phrase);
    if (!match)
        return "an";
    var word = match[0];
    var l_word = word.toLowerCase();

    // Specific start of words that should be preceded by 'an'
    var alt_cases = ["euler", "heir", "honest", "hono"];
    for (var i = 0; i < alt_cases.length; i++) {
        if (l_word.indexOf(alt_cases[i]) == 0)
            return "an";
    }
    // 'hour...' takes 'an', except for 'houri'
    if (l_word.indexOf("hour") == 0 && l_word.indexOf("houri") != 0)
        return "an";

    // Single letter word which should be preceded by 'an'
    if (l_word.length == 1) {
        if ("aefhilmnorsx".indexOf(l_word) >= 0)
            return "an";
        else
            return "a";
    }

    // Capital words which should likely be preceded by 'an' (anchored at the start of the word)
    if (word.match(/^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]/)) {
        return "an";
    }

    // Special cases where a word that begins with a vowel should be preceded by 'a'
    var regexes = [/^e[uw]/, /^onc?e\b/, /^uni([^nmd]|mo)/, /^u[bcfhjkqrst][aeiou]/];
    for (var j = 0; j < regexes.length; j++) {
        if (l_word.match(regexes[j]))
            return "a";
    }

    // Special capital words (UK, UN)
    if (word.match(/^U[NK][AIEO]/)) {
        return "a";
    }
    else if (word == word.toUpperCase()) {
        // All-caps words are treated as initialisms read letter by letter
        if ("aefhilmnorsx".indexOf(l_word[0]) >= 0)
            return "an";
        else
            return "a";
    }

    // Basic method of words that begin with a vowel being preceded by 'an'
    if ("aeiou".indexOf(l_word[0]) >= 0)
        return "an";

    // Instances where y followed by specific letters is preceded by 'an'
    if (l_word.match(/^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)/))
        return "an";

    return "a";
}
What you want is to determine the appropriate indefinite article. Lingua::EN::Inflect is a Perl module that does a great job. I've extracted the relevant code and pasted it below. It's just a bunch of cases and some regular expressions, so it shouldn't be difficult to port to PHP. A friend ported it to Python here if anyone is interested.
# 2. INDEFINITE ARTICLES
# (enclose(), ud_match(), $persistent_count, $PL_count_one and @A_a_user_defined
# are defined elsewhere in the module.)

# THIS PATTERN MATCHES STRINGS OF CAPITALS STARTING WITH A "VOWEL-SOUND"
# CONSONANT FOLLOWED BY ANOTHER CONSONANT, AND WHICH ARE NOT LIKELY
# TO BE REAL WORDS (OH, ALL RIGHT THEN, IT'S JUST MAGIC!)
my $A_abbrev = q{
(?! FJO | [HLMNS]Y. | RY[EO] | SQU
  | ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)?) [AEIOU])
[FHLMNRSX][A-Z]
};

# THIS PATTERN CODES THE BEGINNINGS OF ALL ENGLISH WORDS BEGINNING WITH A
# 'y' FOLLOWED BY A CONSONANT. ANY OTHER Y-CONSONANT PREFIX THEREFORE
# IMPLIES AN ABBREVIATION.
my $A_y_cons = 'y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)';

# EXCEPTIONS TO EXCEPTIONS
my $A_explicit_an = enclose join '|',
(
    "euler",
    "hour(?!i)", "heir", "honest", "hono",
);

my $A_ordinal_an = enclose join '|',
(
    "[aefhilmnorsx]-?th",
);

my $A_ordinal_a = enclose join '|',
(
    "[bcdgjkpqtuvwyz]-?th",
);

sub A {
    my ($str, $count) = @_;
    my ($pre, $word, $post) = ( $str =~ m/\A(\s*)(?:an?\s+)?(.+?)(\s*)\Z/i );
    return $str unless $word;
    my $result = _indef_article($word,$count);
    return $pre.$result.$post;
}

sub AN { goto &A }

sub _indef_article {
    my ( $word, $count ) = @_;

    $count = $persistent_count
        if !defined($count) && defined($persistent_count);

    return "$count $word"
        if defined $count && $count!~/^($PL_count_one)$/io;

    # HANDLE USER-DEFINED VARIANTS
    my $value;
    return "$value $word"
        if defined($value = ud_match($word, @A_a_user_defined));

    # HANDLE ORDINAL FORMS
    $word =~ /^($A_ordinal_a)/i and return "a $word";
    $word =~ /^($A_ordinal_an)/i and return "an $word";

    # HANDLE SPECIAL CASES
    $word =~ /^($A_explicit_an)/i and return "an $word";
    $word =~ /^[aefhilmnorsx]$/i and return "an $word";
    $word =~ /^[bcdgjkpqtuvwyz]$/i and return "a $word";

    # HANDLE ABBREVIATIONS
    $word =~ /^($A_abbrev)/ox and return "an $word";
    $word =~ /^[aefhilmnorsx][.-]/i and return "an $word";
    $word =~ /^[a-z][.-]/i and return "a $word";

    # HANDLE CONSONANTS
    $word =~ /^[^aeiouy]/i and return "a $word";

    # HANDLE SPECIAL VOWEL-FORMS
    $word =~ /^e[uw]/i and return "a $word";
    $word =~ /^onc?e\b/i and return "a $word";
    $word =~ /^uni([^nmd]|mo)/i and return "a $word";
    $word =~ /^ut[th]/i and return "an $word";
    $word =~ /^u[bcfhjkqrst][aeiou]/i and return "a $word";

    # HANDLE SPECIAL CAPITALS
    $word =~ /^U[NK][AIEO]?/ and return "a $word";

    # HANDLE VOWELS
    $word =~ /^[aeiou]/i and return "an $word";

    # HANDLE y... (BEFORE CERTAIN CONSONANTS IMPLIES (UNNATURALIZED) "i.." SOUND)
    $word =~ /^($A_y_cons)/io and return "an $word";

    # OTHERWISE, GUESS "a"
    return "a $word";
}
Was looking for just such a solution, so thanks marcog. Here's an attempt to port your friend's Python version (I don't know Python or Perl, so there are probably some mistakes):
function indefinite_article($word) {
    // Lowercase version of the word
    $word_lower = strtolower($word);

    // An 'an' word (specific start of words that should be preceded by 'an')
    $an_words = array('euler', 'heir', 'honest', 'hono');
    foreach ($an_words as $an_word) {
        if (substr($word_lower, 0, strlen($an_word)) == $an_word) return "an";
    }
    if (substr($word_lower, 0, 4) == "hour" and substr($word_lower, 0, 5) != "houri") return "an";

    // An 'an' letter (single letter word which should be preceded by 'an')
    $an_letters = array('a','e','f','h','i','l','m','n','o','r','s','x');
    if (strlen($word) == 1) {
        if (in_array($word_lower, $an_letters)) return "an";
        else return "a";
    }

    // Capital words which should likely be preceded by 'an' (anchored at the start of the word)
    if (preg_match('/^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]/', $word)) return "an";

    // Special cases where a word that begins with a vowel should be preceded by 'a'
    $regex_array = array('^e[uw]', '^onc?e\b', '^uni([^nmd]|mo)', '^u[bcfhjkqrst][aeiou]');
    foreach ($regex_array as $regex) {
        if (preg_match('/'.$regex.'/', $word_lower)) return "a";
    }

    // Special capital words
    if (preg_match('/^U[NK][AIEO]/', $word)) return "a";
    // Not sure what this does
    else if ($word == strtoupper($word)) {
        // Same letter list as $an_letters above ('f', not 'd')
        if (in_array($word_lower[0], $an_letters)) return "an";
        else return "a";
    }

    // Basic method of words that begin with a vowel being preceded by 'an'
    $vowels = array('a','e','i','o','u');
    if (in_array($word_lower[0], $vowels)) return "an";

    // Instances where y followed by specific letters is preceded by 'an'
    if (preg_match('/^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)/', $word_lower)) return "an";

    // Default to 'a'
    return "a";
}
There's one bit (below the comment "// Not sure what this does") whose purpose I couldn't work out. If anyone can figure it out, I'd be happy to know.
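As far as I can tell, that branch treats an all-caps word as an initialism that is read out letter by letter, so the article depends on how the first letter's name is pronounced rather than on the word itself. The Perl code above uses the letter set [aefhilmnorsx] for its single-letter rule, and the same set seems to be intended here. A minimal sketch of just that idea, where the function name articleForInitialism is made up for illustration:

```javascript
// Letters whose spoken names begin with a vowel sound ("ef", "aitch", "em", ...),
// taken from the single-letter rule in the Perl code.
const AN_LETTERS = "aefhilmnorsx";

// Hypothetical helper: assumes `word` is an initialism read letter by letter.
function articleForInitialism(word) {
  if (word.length === 0) return "a"; // nothing to inspect, fall back to "a"
  const first = word.charAt(0).toLowerCase();
  return AN_LETTERS.includes(first) ? "an" : "a";
}

console.log(articleForInitialism("FBI")); // an
console.log(articleForInitialism("CIA")); // a
```

This only makes sense for words that really are spelled out. An acronym pronounced as a word, such as "NASA", would get "an" from this rule even though "a NASA mission" is what people say.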
The problem with rule-based systems is that they deal poorly with edge cases and that they're complicated. If you can base your decisions on actual data, you'll do better. In this answer I describe how you might use Wikipedia to build a lookup dictionary, and link to a (very simple) JavaScript implementation using such a dictionary.
A prefix-dictionary will deal fairly well with acronyms and numbers, though with some effort you could probably do better.
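To make the idea concrete, here is a minimal sketch of a prefix-dictionary lookup. It is not the linked implementation: the table, the function name articleFromPrefixes, and nearly all counts are invented for illustration (only the "08" counts echo the sample output shown further down). The lookup tries the longest known prefix first and backs off one character at a time.

```javascript
// Hypothetical prefix table: how often text starting with each prefix was seen
// after "a" versus "an". A real table would be built from a large corpus.
const PREFIX_COUNTS = new Map([
  ["",     { a: 10, an: 1 }],  // fallback when no longer prefix is known
  ["a",    { a: 1,  an: 20 }],
  ["e",    { a: 1,  an: 20 }],
  ["i",    { a: 1,  an: 20 }],
  ["o",    { a: 1,  an: 20 }],
  ["u",    { a: 3,  an: 10 }],
  ["uni",  { a: 15, an: 2 }],  // "a university"
  ["unin", { a: 1,  an: 9 }],  // "an uninformed guess"
  ["hour", { a: 0,  an: 18 }],
  ["08",   { a: 8,  an: 25 }],
]);

function articleFromPrefixes(phrase, table = PREFIX_COUNTS) {
  // Lowercasing keeps the sketch small; it also throws away capitalization,
  // which a real dictionary would want to keep for acronyms.
  const text = phrase.trim().toLowerCase();
  let maxLen = 0;
  for (const key of table.keys()) maxLen = Math.max(maxLen, key.length);
  // Longest prefix first, backing off down to the empty prefix.
  for (let len = Math.min(maxLen, text.length); len >= 0; len--) {
    const counts = table.get(text.slice(0, len));
    if (counts) return counts.an > counts.a ? "an" : "a";
  }
  return "a";
}

console.log(articleFromPrefixes("university"));  // a
console.log(articleFromPrefixes("0800 number")); // an
```

The appeal of this design is that exceptions need no special code: "uni" overrides "u", and "unin" overrides "uni", simply because longer prefixes win.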
I've written a PHP port of the popular JS a-vs-an code as described in this Stack Overflow post: https://stackoverflow.com/a/1288473/1526020.
Github page: https://github.com/UseAllFive/a-vs-an.
E.g.
$result = $aVsAn->query('0800 number');
print_r($result);
Returns
Array
(
    [aCount] => 8
    [anCount] => 25
    [prefix] => 08
    [article] => an
)
Make an array with the vowels in it, then check whether the first letter of the word you are checking is in that array. This will work except for acronyms and for words whose first letter and first sound differ, such as "hour" or "university".
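A minimal sketch of that check, with a made-up function name (naiveArticle), mainly to show where it breaks down:

```javascript
const VOWELS = ["a", "e", "i", "o", "u"];

// Hypothetical helper: looks only at the first letter, never at the sound.
function naiveArticle(word) {
  const first = word.trim().charAt(0).toLowerCase();
  return VOWELS.includes(first) ? "an" : "a";
}

console.log(naiveArticle("apple")); // an
console.log(naiveArticle("house")); // a
console.log(naiveArticle("FBI"));   // a  (wrong: people say "an FBI agent")
```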
It should be pretty easy to write from scratch, tbh. If a word starts with a vowel, it gets an 'an'; if it begins with a consonant, it gets an 'a'. Programmatically it's easy to do - if you have any edge cases (e.g. you might use the BBC English-style 'an historic occasion') you can handle them individually.
Kind of like using an inflector, only with the 'a'/'an' grammar rule instead of plurals. Look into how CakePHP or Rails handle inflection for a more thorough discussion of the concept, including how to handle edge cases - you don't want to inflect 'deer' as 'deers' in the plural, for example, or 'goose' as 'gooses', so they need to be handled individually, just like your own edge cases like 'universe' or aspirated/non-aspirated 'H's.
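A rough sketch of that inflector-style approach: a default rule plus a table of individually handled exceptions, consulted first. The exception entries and the name articleWithExceptions are illustrative assumptions, not taken from CakePHP or Rails.

```javascript
// Hypothetical exception table, checked before the default rule.
const ARTICLE_EXCEPTIONS = new Map([
  ["hour", "an"],
  ["honest", "an"],
  ["heir", "an"],
  ["universe", "a"],
  ["unicorn", "a"],
  ["one", "a"],
]);

function articleWithExceptions(word, exceptions = ARTICLE_EXCEPTIONS) {
  const key = word.trim().toLowerCase();
  if (exceptions.has(key)) return exceptions.get(key);
  // Default rule: vowel letter gets "an", anything else gets "a".
  return /^[aeiou]/.test(key) ? "an" : "a";
}

console.log(articleWithExceptions("universe")); // a
console.log(articleWithExceptions("hour"));     // an

// House style can be layered on by passing a different table:
const bbcStyle = new Map([...ARTICLE_EXCEPTIONS, ["historic", "an"]]);
console.log(articleWithExceptions("historic", bbcStyle)); // an
```

Whole-word matching keeps the table easy to audit, at the cost of missing derived forms such as "hourly", which would need their own entries or a prefix match.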