How to extract the nth word and count word occurrences in a MySQL string?
I would like to have a mysql query like this:
select <second word in text> word, count(*) from table group by word;
All the regex examples in mysql are used to query if the text matches the expr开发者_Python百科ession, but not to extract text out of an expression. Is there such a syntax?
The following is a proposed solution for the OP's specific problem (extracting the 2nd word of a string), but it should be noted that, as mc0e's answer states, actually extracting regex matches is not supported out-of-the-box in MySQL. If you really need this, then your choices are basically to 1) do it in post-processing on the client, or 2) install a MySQL extension to support it.
BenWells has it very almost correct. Working from his code, here's a slightly adjusted version:
SUBSTRING(
sentence,
LOCATE(' ', sentence) + CHAR_LENGTH(' '),
LOCATE(' ', sentence,
( LOCATE(' ', sentence) + 1 ) - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
)
As a working example, I used:
SELECT SUBSTRING(
sentence,
LOCATE(' ', sentence) + CHAR_LENGTH(' '),
LOCATE(' ', sentence,
( LOCATE(' ', sentence) + 1 ) - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
) as string
FROM (SELECT 'THIS IS A TEST' AS sentence) temp
This successfully extracts the word IS
Shorter option to extract the second word in a sentence:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('THIS IS A TEST', ' ', 2), ' ', -1) as FoundText
MySQL docs for SUBSTRING_INDEX
According to http://dev.mysql.com/ the SUBSTRING function uses start position then the length so surely the function for the second word would be:
SUBSTRING(sentence,LOCATE(' ',sentence),(LOCATE(' ',LOCATE(' ',sentence))-LOCATE(' ',sentence)))
No, there isn't a syntax for extracting text using regular expressions. You have to use the ordinary string manipulation functions.
Alternatively select the entire value from the database (or the first n characters if you are worried about too much data transfer) and then use a regular expression on the client.
As others have said, mysql does not provide regex tools for extracting sub-strings. That's not to say you can't have them though if you're prepared to extend mysql with user-defined functions:
https://github.com/mysqludf/lib_mysqludf_preg
That may not be much help if you want to distribute your software, being an impediment to installing your software, but for an in-house solution it may be appropriate.
I used Brendan Bullen's answer as a starting point for a similar issue I had which was to retrive the value of a specific field in a JSON string. However, like I commented on his answer, it is not entirely accurate. If your left boundary isn't just a space like in the original question, then the discrepancy increases.
Corrected solution:
SUBSTRING(
sentence,
LOCATE(' ', sentence) + 1,
LOCATE(' ', sentence, (LOCATE(' ', sentence) + 1)) - LOCATE(' ', sentence) - 1
)
The two differences are the +1 in the SUBSTRING index parameter and the -1 in the length parameter.
For a more general solution to "find the first occurence of a string between two provided boundaries":
SUBSTRING(
haystack,
LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'),
LOCATE(
'<rightBoundary>',
haystack,
LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>')
)
- (LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'))
)
I don't think such a thing is possible. You can use SUBSTRING
function to extract the part you want.
My home-grown regular expression replace function can be used for this.
Demo
See this DB-Fiddle demo, which returns the second word ("I") from a famous sonnet and the number of occurrences of it (1).
SQL
Assuming MySQL 8 or later is being used (to allow use of a Common Table Expression), the following will return the second word and the number of occurrences of it:
WITH cte AS (
SELECT digits.idx,
SUBSTRING_INDEX(SUBSTRING_INDEX(words, '~', digits.idx + 1), '~', -1) word
FROM
(SELECT reg_replace(UPPER(txt),
'[^''’a-zA-Z-]+',
'~',
TRUE,
1,
0) AS words
FROM tbl) delimited
INNER JOIN
(SELECT @row := @row + 1 as idx FROM
(SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t1,
(SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t2,
(SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t3,
(SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t4,
(SELECT @row := -1) t5) digits
ON LENGTH(REPLACE(words, '~' , '')) <= LENGTH(words) - digits.idx)
SELECT c.word,
subq.occurrences
FROM cte c
LEFT JOIN (
SELECT word,
COUNT(*) AS occurrences
FROM cte
GROUP BY word
) subq
ON c.word = subq.word
WHERE idx = 1; /* idx is zero-based so 1 here gets the second word */
Explanation
A few tricks are used in the SQL above and some accreditation is needed. Firstly the regular expression replacer is used to replace all continuous blocks of non-word characters - each being replaced by a single tilda (~
) character. Note: A different character could be chosen instead if there is any possibility of a tilda appearing in the text.
The technique from this answer is then used for transforming a string with delimited values into separate row values. It's combined with the clever technique from this answer for generating a table consisting of a sequence of incrementing numbers: 0 - 10,000 in this case.
The field's value is:
"- DE-HEB 20% - DTopTen 1.2%"
SELECT ....
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DE-HEB ', -1), '-', 1) DE-HEB ,
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DTopTen ', -1), '-', 1) DTopTen ,
FROM TABLA
Result is:
DE-HEB DTopTEn
20% 1.2%
精彩评论