Searching for Number of Term Appearances in Mathematica

2023-04-06 22:42

I'm trying to search across a large collection of text files (12k+) in Mathematica 8. So far, I've been able to plot the total number of times that a word appears (e.g. the word "love" appears 5,000 times across those 12k files). However, I'm having difficulty determining the number of files in which "love" appears once, which might be only 1,000 files, with the word repeating several times in others.

I'm finding the documentation for FindList, streams, RecordSeparators, etc. a bit murky. Is there a way to set it up so that it finds one incidence of a term in a file and then moves on to the next file?

Example of filelist:

{"89001.txt", "89002.txt", "89003.txt", "89004.txt", "89005.txt", "89006.txt", "89007.txt", "89008.txt", "89009.txt", "89010.txt", "89011.txt", "89012.txt", "89013.txt", "89014.txt", "89015.txt", "89016.txt", "89017.txt", "89018.txt", "89019.txt", "89020.txt", "89021.txt", "89022.txt", "89023.txt", "89024.txt"}

The following returns all of the lines containing "love" across every file. Is there a way to return only the first incidence of "love" in each file before moving on to the next one?

FindList[filelist, "love"]

Thanks so much. This is my first post and I'm largely learning Mathematica through peer/supervisory help, online tutorials, and the documentation.

In addition to Daniel's answer, you also seem to be asking for a list of files where the word occurs only once. To do that, I'd continue to run FindList across all the files:

res = FindList[filelist, "love"]

Then, reduce the results to single lines only, via

lines = Select[ res, Length[#]==1& ]

But this doesn't eliminate the cases where there is more than one occurrence in a single line. To do that, you could use StringCount and only accept instances where it is 1, as follows (First extracts the single line from each one-element list, since comparing a list of counts to 1 would never give True):

Select[ lines, StringCount[ First[#], RegularExpression[ "\\blove\\b" ] ] == 1& ]

The RegularExpression specifies that "love" must be a distinct word using the word boundary marker (\\b), so that words like "lovely" won't be included.
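As a quick illustration of the word-boundary behaviour described above, here is a minimal sketch using two made-up sample strings (they are not from the original files):

```mathematica
(* "lovely" is not counted, because \b requires a word boundary after "love" *)
StringCount["I love lovely things", RegularExpression["\\blove\\b"]]
(* 1 *)

(* every standalone "love" on the line is counted *)
StringCount["love, love, love", RegularExpression["\\blove\\b"]]
(* 3 *)
```

Note that the match is case-sensitive as written, so "Love" at the start of a sentence would not be counted.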

Edit: It appears that FindList when passed a list of files returns a flattened list, so you can't determine which item goes with which file. For instance, if you have 3 files, and they contain the word "love", 0, 1, and 2 times, respectively, you'd get a list that looked like

{, love, love, love }

which is clearly not useful. To overcome this, you'll have to process each file individually, and that is best done via Map (/@), as follows

res = FindList[#, "love"]& /@ filelist

and the rest of the above code works as expected.

But, if you want to associate the results with a file name, you have to change it a little.

res = {#, FindList[#, "love"]}& /@ filelist
lines = Select[res, 
         Length[ #[[2]] ] ==1 &&  (* <-- Note the use of [[2]] *)
         StringCount[ #[[2, 1]], RegularExpression[ "\\blove\\b" ] ] == 1&  (* [[2, 1]] is the single matching line *)
        ]

which returns a list of the form

{ {filename, { "string with love in it" } }, 
  {filename, { "string with love in it" } }, ... }

To extract the file names, you simply type lines[[All, 1]].

Note, in order to Select on the properties you wanted, I used Part ([[ ]]) to specify the second element in each datum, and the same goes for extracting the file names.

Help > Documentation Center > FindList item 4:

"FindList[files,text,n] includes only the first n lines found."

So you could set n to 1.

Daniel Lichtblau
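One way to apply the n = 1 suggestion to the original question (how many files contain the term at all) is to call FindList on each file separately, so that the search stops at the first matching line in that file. The following is only a sketch: the names hits and filesWithTerm are illustrative, and it assumes filelist holds file names that can be opened from the current directory.

```mathematica
(* pair each file name with at most one matching line; {} means no match *)
hits = {#, FindList[#, "love", 1]} & /@ filelist;

(* keep the names of files whose result holds exactly one line *)
filesWithTerm = Cases[hits, {name_, {_}} :> name];

(* number of files in which "love" appears at least once *)
Length[filesWithTerm]
```

Because FindList matches plain text here, a file that contains only "lovely" would also be counted; the RegularExpression check from the other answer can be applied to the returned line if whole-word matches are required.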
