
Grouping strings by substring in MySQL/Python or MySQL/.NET

The data will be stored in a MySQL database like this:

5911    CD  $4.99   Eben, Landscapes of Patmos {w.Martin Lenniger, percussion}; 2 Choral Phantasies; Laudes. (All w.Sieglinde Ahrens, organ)
5913    CD  $5.99   Turina, Sevilliana; Rafaga; Hommage a Tarrega; Sonata. Rodrigo, 3 Piezas Espanolas; En Los Trigales; Sarabande Lointaine. (Eric Hill, guitar^)
145460  CD  $13.98  Wagner, The Flying Dutchman. (Hans Hotter, Astrid Varnay, Set Svanholm et al. Cond. Reiner. Rec.1950. PLEASE NOTE: Limited-pressing CDRs)
145461  CD  $13.98  Montemezzi, L'Amore dei Tre Re. (Virgilio Lazzari, Dorothy Kirsten, Charles Kullman, Robert Weede, Leslie Chabay et al. Cond. Giuseppe Antonicelli. Rec. 1949. PLEASE NOTE: Limited-pressing CDRs)
145462  CD  $13.98  Ponchielli, La Gioconda. (Zinka Milanov, Giacomo Vaghi, Leonard Warren, Rise Stevens, Richard Tucker, Margaret Harshaw et al. Cond. Emil Cooper. Rec. 1946. PLEASE NOTE: Limited-pressing CDRs)
145465  CD  $5.99   ' Yankele: Yiddish Songs'. (16 titles incl. Az der Rebe, Rozhinkes mit Mandlekh, Shabes, Yankele, Belz, Di Grine Kuzine. Moshe Leiser, voice and guitar. Ami Flammer, violin. Gerard Barreaux, accordion. Rec. 'live', Lyon Opera. Total time: 78')
145467  CD  $4.99   Brahms, Piano Trios 2 & 3. (Trio Bamberg: Evgeny Schuk, violin; Stephan Gerlinghaus, cello. Robert Benz, piano. Rec. Nuremberg, 4/7/2000. Total time: 51'45')
145468  CD  $4.99   Gaubert, Piece Romantique; Trois Aquarelles. Debussy, Premier Trio in G. Francaix, Trio. (Trio Cantabile: Hans-Jorg Wegner, flute. Guido Larisch, cello. Christiane Kroeker, piano. Rec. Hannover, 3/2001. Total time: 62'35')
145469  CD  $4.99   Gattermeyer, Heinrich [b.1923]: Ophelias Schattentheater [text by Michael Ende]. Matthias Drude [b.1960], Jorinde und Joringel. Christoph J. Keller [b.1959], Die Kristallkugel [both texts by Brother Grimm]. (Helmut Thiele, narrator w.Bernd-Christian Schulze, piano. Total time: 68'08')
145470  CD  $2.99   Morrill, Dexter [b.1938]: Dance Bagatelles for Viola & Piano; Three Lyric Pieces for Violin and Piano [Laura Klugherz, viola & violin. Jill Timmons, piano]; Fantasy for Solo Cello [James Kirkwood, cello]; String Quartet #2 [Tremont String Quartet]. (Total time: 51'03')
145471  CD  $2.99   Werntz, Julia: String Trio with Homage to Chopin [Curtis Macomber, violin. Lois Martin, viola. Ted Mook, cello]; 'To You Strangers'- Five Poems of Dylan Thomas for Mezzo-Soprano Solo [Christina Ascher]; Piano Piece [John McDonald]. John Mallia, Lock [Stephanie Kay, clarinet]; Poor Denizens of Hell [chamber ensemble/ Daniel Hosken]; Plexus 2. (Aura Group for New Music)
145472  CD  $2.99   Morrill, Dexter [b.1938]- 'Music for Trumpets': 'Ponzo' for Two Trumpets; 'Nine Pieces' for Solo Trumpet; 'TARR' for Four Trumpets & Computer; 'Studies' for Trumpet & Computer; 'Trumpet Concerto' for Trumpet & Piano. (Mark Ponzo, trumpet with Barbara Butler [trumpet] & William Koehler, piano. Total time: 52'02')
145473  CD  $2.99   Kallstrom, Michael [b.1956]: 'Stories'. (A chamber opera for solo performer with puppets and electronic tape based on Old Testament stories)
145474  CD  $2.99   Carosio, Vailati, Lechi, Ponchielli, D'Alessandro, Sterzati, Riva, Pucci, Casazza, Denti, Gnaga, Anelli, Feroldi: 'The Mandolins of Stradivari'. (16 pieces for mandolin ensemble et al. Ugo Orlandi, mandolin. Alessandro Bono, guitar. Maura Mazzonetto, piano. Giampaolo Baldin, baritone. Quartetto romantico a plettro 'Umbert Sterzati'. Orchestra di Mandolini e Chitarre 'Citta di Brescia'/ Mandonico. Total time: 77'19')
145475  CD  $3.99   Rachmaninov, Symphony #3; Symphonic Dances. (St. Petersburg Philharmonic/ Jansons. Total time: 72'16')

I need each title to be grouped with 4 other titles that have words in common. For example, I would like it to group 4 CDs that have both the word BEETHOVEN and the word MOZART in the string.

HOWEVER, I do not want to specify which words it should group by. I would like this to be done in a sort of artificially intelligent way.

Here's what I think the algorithm should look like:

  1. Do a frequency distribution on all words.
  2. Throw out any words that are frequently used in English (like if, or, the, of). Where can I get a list of these?
  3. Start to group by the words that occur least often.

Does anyone know an intelligent way of grouping this?


Re (2), what you want are called "stopwords" -- e.g., in NLTK (which is Python, but I imagine there will be C# equivalents), per chapter 2 in its excellent online book,

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always',
 ...]

The book I quoted can also help with your point 1, but point 3 is really a different field -- clustering. You want a very peculiar kind of clustering (specified and identical cluster size), so existing algorithms may not be suitable for you, but it's not too hard to devise some based on what you mention.

Basically you want each word to be worth a "score" that's higher for words that are rarer in English (and NLTK, or any equivalently powerful natural language processing toolkit in C#, can of course help you with that). For example, minus the logarithm of the word's frequency could be a start.
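A minimal sketch of that scoring idea, under some loud assumptions: it measures rarity within your own titles (the fraction of titles containing the word) as a stand-in for rarity in English, and `STOPWORDS`, `tokenize` and `word_scores` are hypothetical names made up for illustration. The stopword set here is a tiny placeholder, not the NLTK list.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword set for illustration only;
# a real list (such as NLTK's) is much longer.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "with"}

def tokenize(title):
    """Lowercase a title and return its distinct non-stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 1}

def word_scores(titles):
    """Score each word as -log(fraction of titles that contain it)."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(tokenize(title))
    n = len(titles)
    return {w: -math.log(count / n) for w, count in doc_freq.items()}

titles = [
    "Brahms, Piano Trios 2 & 3",
    "Debussy, Premier Trio in G",
    "Rachmaninov, Symphony #3; Symphonic Dances",
    "Brahms, Symphony #1",
]
scores = word_scores(titles)
# A word found in one title out of four scores higher than one found in two.
```

To score against general English instead, you would swap the per-title counts for frequencies taken from a reference corpus; the shape of the function stays the same.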

You only need to score non-stop words that occur in at least five documents, according to the specs you mentioned, so the number of meaningful words should be pretty low, and exhaustive search might even be feasible.
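As a rough illustration of that pruning step, the sketch below builds an inverted index from word to title positions and keeps only the words present in at least five titles. `candidate_words`, `GROUP_SIZE` and the small `STOPWORDS` set are assumptions made up for this example, not part of any library.

```python
import re
from collections import defaultdict

# Hypothetical placeholder stopword set; substitute a real list in practice.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "with"}
GROUP_SIZE = 5  # each title plus 4 others, as in the question

def candidate_words(titles, min_docs=GROUP_SIZE):
    """Map each non-stop word to the indices of the titles containing it,
    keeping only words that occur in at least `min_docs` titles."""
    index = defaultdict(set)
    for i, title in enumerate(titles):
        for w in set(re.findall(r"[a-z]+", title.lower())):
            if w not in STOPWORDS:
                index[w].add(i)
    return {w: ids for w, ids in index.items() if len(ids) >= min_docs}

example_titles = [
    "Mozart, Requiem",
    "Mozart, Symphony 40",
    "Mozart and Beethoven, Sonatas",
    "Mozart, Concerto",
    "Mozart, Quartets",
    "Beethoven, Symphony 5",
]
candidates = candidate_words(example_titles)
# Only "mozart" appears in five titles; "beethoven" and "symphony" are dropped.
```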

In fact the biggest issue might be a different one: what if there's a group of fewer than 5 docs that, collectively, don't have any non-stop words in common with any of the others? The possibility of such occurrences shows you'll have to relax your specs in some respect (since I don't know anything about your app I can't give specific suggestions, of course, but it might be anything from allowing groups with a number of docs different from 5, to relaxing the criteria for grouping, etc).

Or, would you rather just diagnose that some situation exists where actually meeting your tight constraints is impossible, and provide an error message instead of any results if it occurs?
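One possible way to put the pieces together, sketched as a simple greedy heuristic rather than an optimal clustering: repeatedly take the rarest word still shared by at least five unassigned titles, form a group from it, and report whatever cannot be placed. `group_titles` and `STOPWORDS` are hypothetical names, and the tie-breaking rules are arbitrary choices made only to keep the output deterministic.

```python
import re
from collections import defaultdict

# Hypothetical placeholder stopword set; substitute a real list in practice.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "with"}

def group_titles(titles, group_size=5):
    """Greedily form groups of `group_size` titles sharing a word,
    rarest word first. Returns (groups, leftovers), where groups is a
    list of (word, [title indices]) and leftovers is a sorted list of
    the indices that could not be placed in any group."""
    index = defaultdict(set)
    for i, title in enumerate(titles):
        for w in re.findall(r"[a-z]+", title.lower()):
            if w not in STOPWORDS:
                index[w].add(i)

    unassigned = set(range(len(titles)))
    groups = []
    while True:
        # Words still shared by enough unassigned titles to fill a group.
        usable = {w: ids & unassigned for w, ids in index.items()
                  if len(ids & unassigned) >= group_size}
        if not usable:
            break
        # Rarest usable word first; ties broken alphabetically.
        word = min(usable, key=lambda w: (len(usable[w]), w))
        members = sorted(usable[word])[:group_size]
        groups.append((word, members))
        unassigned -= set(members)
    return groups, sorted(unassigned)

example_titles = [
    "Mozart, Requiem",
    "Mozart, Symphony 40",
    "Mozart, Piano Sonatas",
    "Mozart, Clarinet Concerto",
    "Mozart, String Quartets",
    "Wagner, The Flying Dutchman",
    "Wagner, Parsifal",
]
groups, leftovers = group_titles(example_titles)
```

A non-empty `leftovers` list is exactly the situation described above: the caller can either relax the rules (smaller groups, looser matching) or raise an error instead of returning partial results.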
