Extract title tags from normal text
I am working on a task that requires extracting the contents of title tags from plain text (it is not an HTML DOM). I need to handle the following cases:
Case 1 :
<html>
<head>
<title>Title of the document</title>
</head>
<body>
The content of the document......
</body>
</html>
Expected : Title of the document
Case 2 :
<html>
<head>
<title>Title of the document</title>
<title>Continuing title</title>
</head>
<body>
The content of the document......
</body>
</html>
Expected : Title of the document Continuing title
Case 3 (Nested title tags)
<html>
<head>
<title>Title of the document
<title>Continuing title</title></title>
</head>
<body>
The content of the document......
</body>
</html>
Expected : Title of the document Continuing title
I want to extract the title tags using a regular expression in JavaScript. The regex should work for all of the cases above.
Does anyone know how to do this? Please let me know. Thanks in advance.
This is a solution for this specific problem using this broken "pseudo-HTML". It's not applicable to normal HTML:
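function extractTitle(text) {
  // [\s\S]* (rather than .*) lets the match span line breaks; being greedy,
  // it runs from the first <title> to the last </title>.
  var m = /<title>([\s\S]*)<\/title>/i.exec(text);
  if (m && m[1]) {
    // Drop any inner title tags, then collapse every run of whitespace.
    return m[1].replace(/<\/?title>/gi, " ").replace(/\s+/g, " ").trim();
  }
  return; // returns undefined
}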
Don't parse HTML with regexen! Seriously, it's literally impossible in the general case. And in fact, you cannot do what you want with regexen. This is the same problem as matching balanced nested pairs of parentheses, except you want to match nested <title>/</title> pairs, and that is not a regular language.
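To make the nesting point concrete, here is a minimal sketch of the usual workaround: a regex is used only to locate individual <title> and </title> tags, and a plain counter tracks how deeply they are nested. The function name extractNestedTitles is hypothetical, and the sketch assumes the tags carry no attributes and that no comments or CDATA sections are present.

```javascript
// Hypothetical helper (not part of the answers in this thread): collects the
// text found inside <title>...</title> pairs, tracking nesting with a counter.
// Assumes tags have no attributes and the input has no comments or CDATA.
function extractNestedTitles(text) {
  var tagPattern = /<(\/?)title\s*>/gi; // finds <title> and </title> only
  var depth = 0;     // number of <title> tags currently open
  var lastIndex = 0; // position just after the previous tag
  var pieces = [];
  var m;
  while ((m = tagPattern.exec(text)) !== null) {
    if (depth > 0) {
      // Text between the previous tag and this one lies inside a title.
      pieces.push(text.slice(lastIndex, m.index));
    }
    if (m[1] === '/') {
      depth = Math.max(0, depth - 1); // closing tag; ignore stray closers
    } else {
      depth += 1;                     // opening tag
    }
    lastIndex = tagPattern.lastIndex;
  }
  // Join the fragments and normalise whitespace.
  return pieces.join(' ').replace(/\s+/g, ' ').trim();
}
```

The counter is exactly the piece of state a single regular expression cannot keep, which is why the regex here only does the tokenising.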
(Edit 1: I had to revise my answer since I saw that you didn't have access to a DOM; for what I originally had, see below.)
So, why do you need to do this? Perhaps there's a better way. This is tagged JavaScript, but you never mention it in your question. If you aren't in JavaScript, there's probably an HTML parser you can use, which would likely be a better choice. If you are in JavaScript, there may still be, but I'm not a JavaScript guru.
Now, a note: having multiple or nested title tags isn't actually legal HTML, so you shouldn't need to worry about it. If this is true, and if we can make some more assumptions, you could build a use case that would probably work. For instance: no comments, no CDATA blocks, etc. (Although you might be able to handle these, because they can't nest.) But there may be edge cases I'm forgetting! Also, neither Safari nor Firefox treated your third case as nested title tags, instead viewing it as one title tag containing the literal string "Title of the document <title> Continuing title". Thus, if you can ignore that case, it might be possible to hack together a fragile set of regular expressions which would work. Perhaps (lightly tested!) something like this:
// Edit 2: Made this function case-insensitive where it needed to be.
// Edit 3: Used substring() instead of replace() to remove the extraneous
// title tags and fixed the "not matching" case.
function getTitle(html) {
  return (html.replace( /<!\[CDATA\[(.+?)\]\]>/g
                      , function (_match, body) {
                          // Escape the characters that are illegal outside CDATA.
                          return body.replace(/&/g, '&amp;')
                                     .replace(/</g, '&lt;')
                                     .replace(/>/g, '&gt;')
                        } )
              .replace(/<!--.+?-->/g, '')
              .match(/<title>.+?<\/title>/ig) || [])
    .map(function (t) { return t.substring(7, t.length - 8) })
    .join(' ')
}
I am not an HTML guru, so I probably missed a couple of edge cases, but here's what this does. First, we find every CDATA section. We take its innards and turn every illegal character into its entity equivalent, and get rid of the <![CDATA[ and the ]]>. Next, we delete every comment. After that, we match each title and get an array of the matches (getting an array of matches is not compatible with extracting subgroups), in case we're in the invalid-multiple-titles case. Edit 3: We then check whether nothing matched, in which case .match() returns null, and use [] instead; this way, we always have an array. We then trim the tags from the beginning and the end (edit 3: no longer using regexen for this step), and finally join the title fragments together with spaces. This will handle, I think, your case one and case two. If you only need the legal case (case one), replace the last three lines (except the }) with the single line .match(/<title>(.+?)<\/title>/)[1]; the subgroup is at index 1, because index 0 holds the whole match, tags included. However, although this will work (I think) in many cases, I make assumptions, both about our input (e.g., the title tags all appear together and where you want them) and about the fact that we're only looking for a single (set of) <title>...</title>s, and I probably missed some edge case or other. Hopefully it will turn out that you can use a nicer solution.
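The remark that a global match cannot return subgroups can be illustrated with a small sketch. With the g flag, match() returns only the whole matches, so the tags have to be trimmed afterwards; calling exec() in a loop is one alternative that exposes the capture group of each match. The name collectTitleBodies and the sample string are made up for illustration.

```javascript
// Hypothetical helper: gathers the first capture group of every match by
// calling exec() repeatedly on a regex that has the g flag set.
function collectTitleBodies(text) {
  var pattern = /<title>([\s\S]+?)<\/title>/gi;
  var bodies = [];
  var m;
  while ((m = pattern.exec(text)) !== null) {
    bodies.push(m[1]); // m[0] is the whole match, m[1] the subgroup
  }
  return bodies;
}

var sample = '<title>One</title>\n<title>Two</title>';

// With the g flag, match() returns the whole matches, tags included.
var wholeMatches = sample.match(/<title>([\s\S]+?)<\/title>/gi);

// The exec() loop returns only the captured bodies.
var bodies = collectTitleBodies(sample);
```

Here wholeMatches still contains the surrounding tags, while bodies holds just the text, so no substring() step is needed.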
Edit 1: I missed the fact that you need to work on plain text; the rest of my original answer assumed that you had access to a DOM. I'll leave it here for posterity, but it isn't particularly relevant to you.
If you had access to a DOM in JavaScript, you could do the following if you had proper HTML with one title tag:
var titles = document.getElementsByTagName('title')
var titleText = titles.length > 0 ? titles[0].text : ''
However, if you actually have HTML which looks like the second two cases you showed us (I hope not, but you never know), then you'll have to do something else. Neither Firefox nor Safari treated your third case as nested title tags, instead viewing it as one title tag containing the literal string "Title of the document <title> Continuing title". Thus, if you only have to deal with the first two cases, this will work:
var titles = document.getElementsByTagName('title')
var tlength = titles.length
var titleText = ''
for (var i = 0; i < tlength; ++i)
  titleText += (i > 0 ? ' ' : '') + titles[i].text // space between fragments
And if you have the third case, then what you need to do is remove the extraneous <title> tag, which could be slightly tricky but probably isn't. If you know that <title> will never show up except because of malformed HTML like above, then you can use the replace method to get rid of it. In the single-standalone-<title> case, you want
// Edit 2: Case-insensitivity
var titles = document.getElementsByTagName('title')
var titleText = titles.length > 0 ? titles[0].text.replace(/<title>/ig,'') : ''
In the malformed multiple-standalone-<title> case, you want
// Edit 2: Case-insensitivity
var titles = document.getElementsByTagName('title')
var tlength = titles.length
var titleText = ''
for (var i = 0; i < tlength; ++i)
  titleText += (i > 0 ? ' ' : '') + titles[i].text.replace(/<title>/ig,'')
If <title> could occur as a valid string for other reasons, however, then you're in trouble; you'd have to figure out why it was in the string and only replace it if you were supposed to. And as far as I can tell, there's no good general way to do that. But hopefully (though not necessarily) you have legal HTML.