Regular Expression approach

2023-02-09 19:08 问答作者：

I'm trying to come up with a function that produces possible database matches for specified strings where most these specified strings can't easily be matched as they are in different naming forms for example acronyms for movies. Database values only use fullnames at this stage. What i have come up with so far is a function that produces a pattern which has the initial letter of each word separated by .* from database candidates:

pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*0.*M.*, title: 007
Moonraker   pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*1.*A.*M.*, title: 12
Angry Men  pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*3.*, title: 300 
pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*P.*, title: A
Prophet  pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*, title: Adaptation 
pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*, title:
Adventureland  pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*, title: Amelie 
pkgName开发者_运维问答 matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*P.*, title: American
Psycho

The problem is this method is producing too many unwanted suggested matches (all unwanted in my previous example). Can anyone suggest a better method that would trim down unwanted these matches? Are regular expressions even suitable for this?

public ArrayList<Movie> databaseMatches(String pkgName) {
    Connection conn = getConnection();
    ArrayList<Movie> dbMatches = new ArrayList<Movie>();
    try {
        for (Movie dbTitle : getDatabaseMovies(conn)) {
            Pattern p = Pattern.compile(createTitlePattern(dbTitle.getTitle()));
            Matcher m = p.matcher(pkgName);
            if (m.find()) {
                System.out.println("pkgName matched: " + pkgName + ", for pattern: " + createTitlePattern(dbTitle.getTitle()) + ", title: " + dbTitle.getTitle());
                dbMatches.add(dbTitle);
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return dbMatches;
}

private String createTitlePattern(String dbTitle) {

    // System.out.println("dbTitle: " + dbTitle + "split(' ')");

    String titleParts[] = dbTitle.split(" ");
    String searchPattern = ".*";
    for (int i = 0; i < titleParts.length; i++) {
        char c = titleParts[i].charAt(0);
        searchPattern += (c + ".*");
    }
    // System.out.println("pattern produced: " + searchPattern);
    return searchPattern;
}

Edit: I have come across instances of strings with various characters between each acronym's letter so i thought this pattern would be appropriate.

Because you have so little criteria around the format of your data you may need to use a slightly different approach, which may or may not be feasible depending on the size of your data / throughput need for the application. One suggestion is begin with a full text match and only if that fails to produce results move to a more generalized search, or other variations.

With the preceeding example you may start with a full keyword search:

.*American.*Psycho.*

and if that fails to produce results try a pure acronym search

.*AP.*

and if that fails a single keyword search

.*((American)|(Psycho)).*

and then onto a mixed keyword / abbreviation search

.*(A|(American)).*(P|(Psycho))

etc. Again this approach can be significantly hindered based on how quickly the searches run / how quickly you need them to run.

If this is not acceptable you might try using a single "loose" pattern as above with the modifications of trying to allow a full word match if possible, and also minimizing the grouping between keywords.

.*(A[merican]*)(.*?)(P[sycho]*)

Note that we a character class (square brackets) instead of a regular grouping (round brackets) to allow a partial match on the remaining title. i.e. the previous would match "Amer. Psy.". Then based on the matches you get you could do further examination of the grouping to eliminate false positives. For example if group 1 only matched "A" you might expect group 2 to be empty, or contain only non alpha-numerics, and if it didn't you would reject it as a false positive.

In order to match unpredictable abbreviations, you need to use a better technique than First Letters. This post on Stack Overflow has some ideas, including alternative algorithms for matching the distance between two words:

Regex - Matching Abbreviations of a Word

A regex in the form of .*x.*y.*z.* means "any string in which we can find x, y, z in that order, separated by any amount of any characher", and there is no indication that x, y, or z must be at the first letter of a single word.

Before the initials, you have to put a character class with all charachters you expect as word separator.

You can use the predefined \W character class to consider all non-word characters as word separator.

Word characters are A-Z, a-z, 0-9 and _ (underscore). All others are non-word charachters.

If this can go well for you, replace ".*" with ".*\W".

继续阅读：regex

Regular Expression approach

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？