Parsing XHTML with inline tags
I'm trying to parse an XHTML document using TBXML on the iPhone (although I would be happy to use either libxml2 or NSXMLParser if it would be easier). I need to extract the content of the body as a series of paragraphs and maintain the inline tags, for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Title</title>
<link rel="stylesheet" href="css/style.css" type="text/css"/>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
</head>
<body>
<div class="body">
<div>
<h3>Title</h3>
<p>Paragraph with <em>inline</em> tags</p>
<img src="image.png" />
</div>
</div>
</body>
</html>
I need to extract the paragraph but keep the <em>inline</em> content in place within it. In all my testing so far, that content has been extracted as a subelement, and I can't tell exactly where it fitted in the paragraph.
Can anyone suggest a way to do this?
Thanks.
Assumption 1. You are only interested in the data in the p (paragraph) element, and you are using NSXMLParser.
Assumption 2. You want to keep any element inside of p intact.
The strategy that you want to use is to create a state machine for your parser so that it knows when it needs to save data and when to ignore data as it is received.
Set up your NSXMLParser delegate using the sample code from Apple. Your delegate will need a BOOL ivar, inParagraph, for tracking whether data is retained or discarded. The initial value of inParagraph is NO.
When your delegate receives the parser:didStartElement:namespaceURI:qualifiedName:attributes: message, if ([elementName isEqual:@"p"]), clear your receivedData variable and set inParagraph = YES.
EDIT: receivedData is an NSMutableString. Fixed the code examples
At this point your parser delegate wants to save the data it receives. When the parser delegate receives the parser:foundCharacters: message, append the string to receivedData, as in the sample code.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    // Keep text only while inside a <p> element.
    if (inParagraph) [receivedData appendString:string];
}
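One detail worth checking: NSXMLParser delivers text with the predefined entities already resolved, so `&amp;` in the source arrives as `&`. If the saved paragraph has to stay valid markup, the text probably needs re-escaping before it is appended. A minimal sketch, assuming a helper named EscapeXMLText (a hypothetical name, not part of NSXMLParser):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not an NSXMLParser API): re-escape text received in
// parser:foundCharacters: so the stored paragraph remains valid markup.
static NSString *EscapeXMLText(NSString *text)
{
    NSMutableString *escaped = [NSMutableString stringWithString:text];
    // Replace '&' first so the entities added below are not escaped twice.
    [escaped replaceOccurrencesOfString:@"&" withString:@"&amp;"
                                options:NSLiteralSearch
                                  range:NSMakeRange(0, escaped.length)];
    // The range is recomputed each time because the length has changed.
    [escaped replaceOccurrencesOfString:@"<" withString:@"&lt;"
                                options:NSLiteralSearch
                                  range:NSMakeRange(0, escaped.length)];
    [escaped replaceOccurrencesOfString:@">" withString:@"&gt;"
                                options:NSLiteralSearch
                                  range:NSMakeRange(0, escaped.length)];
    return escaped;
}
```

In the delegate you would then append EscapeXMLText(string) instead of string. If the paragraphs are only ever displayed as plain text, this step can be skipped.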
When the parser encounters the inline element, the delegate will receive the parser:didStartElement:namespaceURI:qualifiedName:attributes: message again. This is when the inParagraph state variable is important. The delegate does not receive the enclosing '<' and '>' characters of an element, so you will have to wrap the elementName in '<' and '>' characters and append it to receivedData. Something like:
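- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qualifiedName attributes:(NSDictionary *)attributeDict
{
    if (inParagraph)
    {
        // Already inside <p>, so this is an inline element: re-create its start tag.
        NSString *inlineElementName = [NSString stringWithFormat:@"<%@>", elementName];
        [receivedData appendString:inlineElementName];
    }
    // The check for the opening "p" element belongs after this test,
    // so that "<p>" itself is not appended to receivedData.
    ....
}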
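Note that wrapping only elementName drops any attributes on the inline element (for example the href of an a element). If those matter, the start tag can be rebuilt from attributeDict as well. A rough sketch, assuming a helper named StartTagString (a hypothetical name) and attribute values that contain no quotes or ampersands:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: rebuild a start tag, including its attributes,
// from the values passed to parser:didStartElement:...attributes:.
// Assumption: attribute values need no escaping (no '"' or '&' in them).
static NSString *StartTagString(NSString *elementName, NSDictionary *attributeDict)
{
    NSMutableString *tag = [NSMutableString stringWithFormat:@"<%@", elementName];
    // Sort the keys so the attribute order in the output is deterministic.
    NSArray *keys = [[attributeDict allKeys] sortedArrayUsingSelector:@selector(compare:)];
    for (NSString *key in keys) {
        [tag appendFormat:@" %@=\"%@\"", key, [attributeDict objectForKey:key]];
    }
    [tag appendString:@">"];
    return tag;
}
```

Inside the delegate, [receivedData appendString:StartTagString(elementName, attributeDict)] would then replace the stringWithFormat: call. The original attribute order is not preserved, since the attributes arrive in a dictionary.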
When the parser delegate receives the parser:didEndElement:namespaceURI:qualifiedName: message, it checks whether it is in the "p" element. If (inParagraph && ![elementName isEqual:@"p"]), close the inline element. If ([elementName isEqual:@"p"]), add the contents of receivedData to the NSMutableArray holding your paragraphs.
- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if (inParagraph)
    {
        if (![elementName isEqual:@"p"])
        {
            // Closing an inline element: re-create its end tag.
            NSString *inlineElementName = [NSString stringWithFormat:@"</%@>", elementName];
            [receivedData appendString:inlineElementName];
        } else { // received closing </p> tag, add receivedData to the paragraph array
            // Copy, so later changes to receivedData do not alter the stored paragraph.
            [paragraphsArray addObject:[receivedData copy]];
            [self setInParagraph:NO];
        }
    }
}
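Putting the steps together, the whole delegate might look roughly like the sketch below. It assumes ARC and uses properties rather than plain ivars. The class name ParagraphCollector and the function ParagraphsFromXHTMLData are made up for illustration. The inParagraph, receivedData and paragraphsArray names come from the steps above.

```objc
#import <Foundation/Foundation.h>

// Sketch of a delegate that collects the contents of each <p>, keeping inline tags.
@interface ParagraphCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, assign) BOOL inParagraph;                 // state: inside <p>?
@property (nonatomic, strong) NSMutableString *receivedData;    // current paragraph
@property (nonatomic, strong) NSMutableArray *paragraphsArray;  // finished paragraphs
@end

@implementation ParagraphCollector

- (instancetype)init
{
    if ((self = [super init])) {
        _receivedData = [NSMutableString string];
        _paragraphsArray = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict
{
    if (self.inParagraph) {
        // Inline element inside <p>: re-create its start tag (attributes are dropped).
        [self.receivedData appendFormat:@"<%@>", elementName];
    } else if ([elementName isEqualToString:@"p"]) {
        // Opening <p>: start a fresh paragraph.
        [self.receivedData setString:@""];
        self.inParagraph = YES;
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    if (self.inParagraph) [self.receivedData appendString:string];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if (!self.inParagraph) return;
    if ([elementName isEqualToString:@"p"]) {
        // Closing </p>: store an immutable copy of the paragraph.
        [self.paragraphsArray addObject:[self.receivedData copy]];
        self.inParagraph = NO;
    } else {
        // Closing an inline element: re-create its end tag.
        [self.receivedData appendFormat:@"</%@>", elementName];
    }
}

@end

// Hypothetical convenience function: parse the data and return the paragraphs.
static NSArray *ParagraphsFromXHTMLData(NSData *data)
{
    ParagraphCollector *collector = [[ParagraphCollector alloc] init];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    parser.delegate = collector;
    [parser parse];
    return collector.paragraphsArray;
}
```

For the paragraph in the question, this should produce a single entry reading "Paragraph with <em>inline</em> tags". Empty inline elements such as <br/> would come out as <br></br>, because the delegate gets a start and an end callback for them.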