Programmatically determine whether to describe an object with "a" or "an"?

2023-02-02 04:34

I have a database of nouns (e.g. "house", "exclamation point", "apple") that I need to output and describe in my application. It's hard to put together a natural-sounding sentence to describe an item without using "a" or "an": "a house is BIG", "an exclamation point is SMALL", etc.

Is there any function, library, or hack I can use in PHP to determine whether it is more appropriate to describe any given noun with "a" or "an"?

I needed this for a C# project, so here's a C# port of the Python code mentioned in another answer. Make sure to include using System.Text.RegularExpressions; in your source file.

private string GetIndefiniteArticle(string noun_phrase)
{
    string word = null;
    var m = Regex.Match(noun_phrase, @"\w+");
    if (m.Success)
        word = m.Groups[0].Value;
    else
        return "an";

    var wordi = word.ToLower();
    foreach (string anword in new string[] { "euler", "heir", "honest", "hono" })
        if (wordi.StartsWith(anword))
            return "an";

    if (wordi.StartsWith("hour") && !wordi.StartsWith("houri"))
        return "an";

    // Letters whose spoken names begin with a vowel sound ("eff", "aitch", "em", ...)
    var char_list = new char[] { 'a', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'x' };
    if (wordi.Length == 1)
    {
        if (wordi.IndexOfAny(char_list) == 0)
            return "an";
        else
            return "a";
    }

    // Anchored at the start of the word, as in the Perl original
    if (Regex.Match(word, "^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]").Success)
        return "an";

    // Verbatim string for \b: in a regular C# string "\b" is a backspace, not a word boundary
    foreach (string regex in new string[] { "^e[uw]", @"^onc?e\b", "^uni([^nmd]|mo)", "^u[bcfhjkqrst][aeiou]" })
    {
        if (Regex.IsMatch(wordi, regex))
            return "a";
    }

    if (Regex.IsMatch(word, "^U[NK][AIEO]"))
        return "a";
    else if (word == word.ToUpper())
    {
        if (wordi.IndexOfAny(char_list) == 0)
            return "an";
        else
            return "a";
    }

    if (wordi.IndexOfAny(new char[] { 'a', 'e', 'i', 'o', 'u' }) == 0)
        return "an";

    if (Regex.IsMatch(wordi, "^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)"))
        return "an";

    return "a";
}

I was also looking for a solution like this, but in JavaScript, so I ported it to JS. You can check out the actual project on GitHub: https://github.com/rigoneri/indefinite-article.js

Here is the code snippet:

 function indefinite_article(phrase) {

    // Getting the first word 
    var match = /\w+/.exec(phrase);
    if (match)
        var word = match[0];
    else
        return "an";

    var l_word = word.toLowerCase();
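    // Specific starts of words that should be preceded by 'an'
    var alt_cases = ["euler", "heir", "honest", "hono"];
    for (var k = 0; k < alt_cases.length; k++) {
        if (l_word.indexOf(alt_cases[k]) == 0)
            return "an";
    }
    // "hour..." takes 'an', except for "houri"
    if (l_word.indexOf("hour") == 0 && l_word.indexOf("houri") != 0)
        return "an";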

    // Single-letter word which should be preceded by 'an'
    if (l_word.length == 1) {
        if ("aefhilmnorsx".indexOf(l_word) >= 0)
            return "an";
        else
            return "a";
    }

    // Capitalized abbreviations which should likely be preceded by 'an'
    // (anchored at the start of the word, as in the Perl original)
    if (word.match(/^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]/)) {
        return "an";
    }

    // Special cases where a word that begins with a vowel should be preceded by 'a'
    var regexes = [/^e[uw]/, /^onc?e\b/, /^uni([^nmd]|mo)/, /^u[bcfhjkqrst][aeiou]/];
    for (var j = 0; j < regexes.length; j++) {
        if (l_word.match(regexes[j]))
            return "a";
    }

    // Special capital words (UK, UN)
    if (word.match(/^U[NK][AIEO]/)) {
        return "a";
    }
    else if (word == word.toUpperCase()) {
        if ("aefhilmnorsx".indexOf(l_word[0]) >= 0)
            return "an";
        else 
            return "a";
    }

    // Basic rule: words that begin with a vowel are preceded by 'an'
    if ("aeiou".indexOf(l_word[0]) >= 0)
        return "an";

    // Instances where y followed by specific letters is preceded by 'an'
    if (l_word.match(/^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)/))
        return "an";

    return "a";
}

What you want is to determine the appropriate indefinite article. Lingua::EN::Inflect is a Perl module that does a great job. I've extracted the relevant code and pasted it below. It's just a bunch of cases and some regular expressions, so it shouldn't be difficult to port to PHP. A friend ported it to Python here if anyone is interested.

# 2. INDEFINITE ARTICLES

# THIS PATTERN MATCHES STRINGS OF CAPITALS STARTING WITH A "VOWEL-SOUND"
# CONSONANT FOLLOWED BY ANOTHER CONSONANT, AND WHICH ARE NOT LIKELY
# TO BE REAL WORDS (OH, ALL RIGHT THEN, IT'S JUST MAGIC!)

my $A_abbrev = q{
(?! FJO | [HLMNS]Y.  | RY[EO] | SQU
  | ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)?) [AEIOU])
[FHLMNRSX][A-Z]
};

# THIS PATTERN CODES THE BEGINNINGS OF ALL ENGLISH WORDS BEGINNING WITH A
# 'y' FOLLOWED BY A CONSONANT. ANY OTHER Y-CONSONANT PREFIX THEREFORE
# IMPLIES AN ABBREVIATION.

my $A_y_cons = 'y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)';

# EXCEPTIONS TO EXCEPTIONS

my $A_explicit_an = enclose join '|',
(
    "euler",
    "hour(?!i)", "heir", "honest", "hono",
);

my $A_ordinal_an = enclose join '|',
(
    "[aefhilmnorsx]-?th",
);

my $A_ordinal_a = enclose join '|',
(
    "[bcdgjkpqtuvwyz]-?th",
);

sub A {
    my ($str, $count) = @_;
    my ($pre, $word, $post) = ( $str =~ m/\A(\s*)(?:an?\s+)?(.+?)(\s*)\Z/i );
    return $str unless $word;
    my $result = _indef_article($word,$count);
    return $pre.$result.$post;
}

sub AN { goto &A }

sub _indef_article {
    my ( $word, $count ) = @_;

    $count = $persistent_count
        if !defined($count) && defined($persistent_count);

    return "$count $word"
        if defined $count && $count!~/^($PL_count_one)$/io;

    # HANDLE USER-DEFINED VARIANTS

    my $value;
    return "$value $word"
        if defined($value = ud_match($word, @A_a_user_defined));

    # HANDLE ORDINAL FORMS

    $word =~ /^($A_ordinal_a)/i         and return "a $word";
    $word =~ /^($A_ordinal_an)/i        and return "an $word";

    # HANDLE SPECIAL CASES

    $word =~ /^($A_explicit_an)/i       and return "an $word";
    $word =~ /^[aefhilmnorsx]$/i        and return "an $word";
    $word =~ /^[bcdgjkpqtuvwyz]$/i      and return "a $word";


    # HANDLE ABBREVIATIONS

    $word =~ /^($A_abbrev)/ox           and return "an $word";
    $word =~ /^[aefhilmnorsx][.-]/i     and return "an $word";
    $word =~ /^[a-z][.-]/i              and return "a $word";

    # HANDLE CONSONANTS

    $word =~ /^[^aeiouy]/i              and return "a $word";

    # HANDLE SPECIAL VOWEL-FORMS

    $word =~ /^e[uw]/i                  and return "a $word";
    $word =~ /^onc?e\b/i                and return "a $word";
    $word =~ /^uni([^nmd]|mo)/i         and return "a $word";
    $word =~ /^ut[th]/i                 and return "an $word";
    $word =~ /^u[bcfhjkqrst][aeiou]/i   and return "a $word";

    # HANDLE SPECIAL CAPITALS

    $word =~ /^U[NK][AIEO]?/            and return "a $word";

    # HANDLE VOWELS

    $word =~ /^[aeiou]/i                and return "an $word";

    # HANDLE y... (BEFORE CERTAIN CONSONANTS IMPLIES (UNNATURALIZED) "i.." SOUND)

    $word =~ /^($A_y_cons)/io           and return "an $word";

    # OTHERWISE, GUESS "a"
    return "a $word";
}

I was looking for just such a solution, so thanks, marcog. Here's an attempt to port your friend's Python version (I don't know Python or Perl, so there are probably some mistakes):

function indefinite_article($word) {
    // Lowercase version of the word
    $word_lower = strtolower($word);

    // An 'an' word (specific start of words that should be preceded by 'an')
    $an_words = array('euler', 'heir', 'honest', 'hono');
    foreach($an_words as $an_word) {
        if(substr($word_lower,0,strlen($an_word)) == $an_word) return "an";
    }
    if(substr($word_lower,0,4) == "hour" and substr($word_lower,0,5) != "houri") return "an";

    // An 'an' letter (single-letter word which should be preceded by 'an')
    $an_letters = array('a','e','f','h','i','l','m','n','o','r','s','x');
    if(strlen($word) == 1) {
        if(in_array($word_lower,$an_letters)) return "an";
        else return "a";
    }

    // Capitalized abbreviations which should likely be preceded by 'an'
    // (anchored at the start of the word, as in the Perl original)
    if(preg_match('/^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]/', $word)) return "an";

    // Special cases where a word that begins with a vowel should be preceded by 'a'
    $regex_array = array('^e[uw]','^onc?e\b','^uni([^nmd]|mo)','^u[bcfhjkqrst][aeiou]');
    foreach($regex_array as $regex) {
        if(preg_match('/'.$regex.'/',$word_lower)) return "a";
    }

    // Special capital words
    if(preg_match('/^U[NK][AIEO]/',$word)) return "a";
    // Not sure what this does
    else if($word == strtoupper($word)) {
        $array = array('a','e','f','h','i','l','m','n','o','r','s','x');
        if(in_array($word_lower[0],$array)) return "an";
        else return "a";
    }

    // Basic rule: words that begin with a vowel are preceded by 'an'
    $vowels = array('a','e','i','o','u');
    if(in_array($word_lower[0],$vowels)) return "an";

    // Instances where y followed by specific letters is preceded by 'an'
    if(preg_match('/^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)/', $word_lower)) return "an";

    // Default to 'a'
    return "a";
}

There's one bit (below the comment "// Not sure what this does") that I was unsure of what it did. If anyone can figure it out, I'd be happy to know.
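
For what it's worth, that branch appears to handle words written entirely in capitals, which are assumed to be acronyms read out letter by letter. In that case the article depends on how the name of the first letter is pronounced, not on whether the letter is a vowel: "an FBI agent" but "a UFO". Here is a minimal JavaScript sketch of just that branch. The name `acronymArticle` is made up for illustration, and the letter list follows the Perl pattern `[aefhilmnorsx]` from the Lingua::EN::Inflect excerpt.

```javascript
// Letters whose spoken names start with a vowel sound ("eff", "aitch", "em", ...).
// Assumption: same set as the Perl pattern [aefhilmnorsx].
var AN_LETTERS = "aefhilmnorsx";

// Hypothetical helper: returns "an" or "a" for an all-caps word, or null when
// the word is not all caps and this rule does not apply.
function acronymArticle(word) {
    if (word === "" || word !== word.toUpperCase()) {
        return null;
    }
    var first = word[0].toLowerCase();
    return AN_LETTERS.indexOf(first) >= 0 ? "an" : "a";
}

console.log(acronymArticle("FBI"));   // an
console.log(acronymArticle("UFO"));   // a
console.log(acronymArticle("house")); // null
```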

The problem with rule-based systems is that they deal poorly with edge cases, and that they're complicated. If you can base your decisions on actual data, you'll do better. In this answer I describe how you might use Wikipedia to build a lookup dictionary, and link to a (very simple) JavaScript implementation using such a dictionary.

A prefix-dictionary will deal fairly well with acronyms and numbers, though with some effort you could probably do better.
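
To illustrate the idea, here is a minimal JavaScript sketch of a longest-prefix lookup. The dictionary entries are invented for the example rather than taken from any Wikipedia-derived data, and `prefixArticle` is a hypothetical name; a real dictionary would be generated from corpus counts.

```javascript
// Toy prefix dictionary (illustrative entries only, not real corpus data).
// Keys are case-sensitive so that acronyms can be listed separately.
// The empty prefix holds the overall default.
var PREFIXES = {
    "": "a",
    "a": "an", "e": "an", "i": "an", "o": "an", "u": "an",
    "eu": "a",
    "uni": "a",
    "unin": "an",
    "hour": "an",
    "8": "an",
    "FB": "an"
};

// Hypothetical helper: return the article stored under the longest prefix
// of the word that appears in the dictionary.
function prefixArticle(word) {
    for (var len = word.length; len >= 0; len--) {
        var prefix = word.slice(0, len);
        if (Object.prototype.hasOwnProperty.call(PREFIXES, prefix)) {
            return PREFIXES[prefix];
        }
    }
    return "a"; // not reached while "" is in the dictionary
}

console.log(prefixArticle("university")); // a
console.log(prefixArticle("uninformed")); // an
console.log(prefixArticle("800 number")); // an
console.log(prefixArticle("house"));      // a
```

Because longer prefixes override shorter ones, an exception such as "unin" can sit alongside the more general "uni" entry without any special-case code.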

I've written a PHP port of the popular JS a-vs-an code described in this Stack Overflow post: https://stackoverflow.com/a/1288473/1526020.

GitHub page: https://github.com/UseAllFive/a-vs-an

E.g.

$result = $aVsAn->query('0800 number');
print_r($result);

Returns

Array
(
    [aCount] => 8
    [anCount] => 25
    [prefix] => 08
    [article] => an
)

Make an array with the vowels in it, then check whether the first letter of the word is in that array. This works for most words, but not for acronyms or for words whose first sound differs from their first letter (e.g. "hour" or "university").
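
As a sketch of that approach in JavaScript (the function name `simpleArticle` is made up for illustration):

```javascript
var VOWELS = ["a", "e", "i", "o", "u"];

// Hypothetical helper: choose the article purely from the first letter.
function simpleArticle(word) {
    var first = word.charAt(0).toLowerCase();
    return VOWELS.indexOf(first) >= 0 ? "an" : "a";
}

console.log(simpleArticle("apple")); // an
console.log(simpleArticle("house")); // a
console.log(simpleArticle("FBI"));   // a  (wrong: it is read "eff-bee-eye")
console.log(simpleArticle("hour"));  // a  (wrong: the h is silent)
```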

It should be pretty easy to write from scratch, tbh. If a word starts with a vowel, it gets an 'an'; if it begins with a consonant, it gets an 'a'. Programmatically it's easy to do; if you have any edge cases (e.g. you might use the BBC English-style 'an historic occasion'), you can handle them individually.

Kind of like using an inflector, only with the 'a'/'an' grammar rule instead of plurals. Look into how CakePHP or Rails handle inflection for a more thorough discussion of the concept, including how to handle edge cases - you don't want to inflect 'deer' as 'deers' in the plural, for example, or 'goose' as 'gooses', so they need to be handled individually, just like your own edge cases like 'universe' or aspirated/non-aspirated 'H's.
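
A minimal JavaScript sketch of that inflector-style layout: a small table of exceptions is checked first, and the general vowel rule is the fallback. The exception list and the name `articleFor` are assumptions for illustration (the patterns are borrowed from the Perl rules quoted in this thread); you would fill the table with whatever edge cases show up in your own data.

```javascript
// Irregular cases, checked before the general rule (illustrative list only).
var EXCEPTIONS = [
    { pattern: /^(hour(?!i)|heir|honest|hono)/i, article: "an" }, // silent h
    { pattern: /^uni([^nmd]|mo)/i, article: "a" },                // "yoo" sound
    { pattern: /^onc?e\b/i, article: "a" },                       // "wun" sound
    { pattern: /^e[uw]/i, article: "a" }                          // "yoo" sound
];

// Hypothetical helper: exceptions first, then the first-letter rule.
function articleFor(word) {
    for (var i = 0; i < EXCEPTIONS.length; i++) {
        if (EXCEPTIONS[i].pattern.test(word)) {
            return EXCEPTIONS[i].article;
        }
    }
    // General rule: a leading vowel letter gets "an", anything else gets "a".
    return /^[aeiou]/i.test(word) ? "an" : "a";
}

console.log(articleFor("universe")); // a
console.log(articleFor("hour"));     // an
console.log(articleFor("apple"));    // an
console.log(articleFor("house"));    // a
```

A house style such as 'an historic occasion' would then be one more row in the table rather than a change to the rule itself.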
