Word frequencies from strings in Postgres?

2023-02-15 04:18 问答作者：

Is it possible to identify distinct words and a count开发者_运维百科 for each, from fields containing text strings in Postgres?

Something like this?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is easy then:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

or getting the count for each word:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word

You could also use the PostgreSQL text-searching functionality for this, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL applies language-dependent stemming and stop-word removal, which could be what you want, or maybe not. Stop-word removal and stemming can be disabled by using the simple instead of the english dictionary, see below.)

The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute a function that applies the to_tsvector function to any number of text fields, and concatenates them into a single tsvector, over any subset of your documents, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) from my_documents id < 500') ORDER BY nentry DESC;

Would yield a matrix of total word counts taken from the title and body fields of the first 500 documents, sorted by descending number of occurrences. For each word, you'll also get the number of documents it occurs in (the ndoc column).

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html

Should be split by a space ' ' or other delimit symbol between words; not by an 's', unless intended to do so, e.g., treating 'myWordshere' as 'myWord' and 'here'.

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word

继续阅读：postgresql text word-frequency

Word frequencies from strings in Postgres?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？