
WebKit & Objective-C: how to parse an HTML string into a DOMDocument?

How do you get a DOMDocument from a given HTML string using WebKit? In other words, what's the implementation for DOMDocumentFromHTML: for something like the following:

NSString * htmlString = @"<html><body><p>Test</body></html>";
DOMDocument * document = [self DOMDocumentFromHTML: htmlString];

DOMNode * bodyNode = [[document getElementsByTagName: @"body"] item: 0];
// ... etc.

This seems like it should be straightforward to do, yet I'm still having trouble figuring out how ...

This is not an actual answer to the question, but I've now concluded that WebKit and DOMDocument are probably not the most appropriate tools for what I want to do, which is to process an HTML document that is never shown to the user. The NSXMLDocument class supports turning an HTML document into a manipulable object structure directly and synchronously:

NSError * error = nil;
NSString * htmlString = @"<html><body><p>Test</body></html>";

NSXMLDocument * doc =
  [[NSXMLDocument alloc]
     initWithXMLString: htmlString
     options: NSXMLDocumentTidyHTML
     error: &error];
NSLog(@"Error is: %@", error);
NSLog(@"Doc is: %@", doc);
NSLog(@"Root element is: %@", [doc rootElement]);
NSLog(@"Root element's children are: %@", [[doc rootElement] children]);
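Once parsed, the tree can be searched with the XPath support that NSXMLNode provides through nodesForXPath:error:. The following is a minimal sketch rather than a definitive recipe: it assumes ARC, and it assumes that the tidied tree exposes the paragraph as a plain p element that the illustrative query //p will match.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSError * error = nil;
        NSString * htmlString = @"<html><body><p>Test</body></html>";

        // Parse the HTML exactly as in the snippet above.
        NSXMLDocument * doc =
          [[NSXMLDocument alloc]
             initWithXMLString: htmlString
             options: NSXMLDocumentTidyHTML
             error: &error];

        // Illustrative query (an assumption): collect every <p> element in the tree.
        NSArray * paragraphs = [doc nodesForXPath: @"//p" error: &error];
        for (NSXMLNode * node in paragraphs) {
            NSLog(@"Paragraph text: %@", [node stringValue]);
        }
    }
    return 0;
}
```

If parsing fails, doc is nil, the query returns nil, and the loop body simply never runs, so the error variable is the place to look.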

As far as I can tell from another answer on this site, WebKit offers no synchronous method like the DOMDocumentFromHTML: I asked for.

So far, the best I've been able to do is the following asynchronous combination of giveDOMDocumentFromHTML:usingBaseURL: and takeDOMDocument:.

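- (void) giveDOMDocumentFromHTML: (NSString *) htmlString
         usingBaseURL: (NSURL *) baseURL
{
    // Off-screen WebView used only for parsing; it is never displayed.
    WebView * webView = [[WebView alloc] init];
    // Ask to be notified through webView:didFinishLoadForFrame:.
    [webView setFrameLoadDelegate: self];
    [[webView mainFrame] loadHTMLString: htmlString
                         baseURL: baseURL];
}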

- (void) takeDOMDocument: (DOMDocument *) document
{
    // Look up the first <body> element and log its markup.
    DOMHTMLElement * bodyNode =
        (DOMHTMLElement *) [[document getElementsByTagName: @"body"] item: 0];
    NSLog(@"Body is: %@", [bodyNode innerHTML]);
}

They are hooked together through the following delegate method:

- (void) webView: (WebView *) webView
         didFinishLoadForFrame: (WebFrame *) frame
{
    // Subframes also trigger this callback; react only to the main frame.
    if (frame == [webView mainFrame]) {
        [self takeDOMDocument: [frame DOMDocument]];
    }
}

The above works, but has at least the following remaining issues:

  • I'm not sure where the allocated WebView should be sent a release or autorelease message.
  • I would prefer/need the application to remain blocked until the HTML page has been processed. In the above scheme the application will be processing any user input while the WebView is loading/parsing the HTML. (Note that the WebView will never be shown on screen.)

So there is still plenty of room for improvement. Can anyone provide a synchronous implementation of DOMDocumentFromHTML: as outlined in the original question?
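One direction that might get closer to this, offered as an untested sketch rather than a verified solution: start the load, then spin the current run loop until the WebView's isLoading flag reports that loading has stopped. The class name HTMLDOMLoader and the parserWebView property are hypothetical names introduced for illustration. The sketch assumes ARC, assumes it is called on the main thread, and assumes that isLoading becomes true as soon as loadHTMLString:baseURL: returns.

```objc
#import <Foundation/Foundation.h>
#import <WebKit/WebKit.h>

// Hypothetical helper class; not part of WebKit.
@interface HTMLDOMLoader : NSObject
// Keeps the off-screen WebView alive, since the returned DOMDocument belongs to its frame.
@property (strong) WebView * parserWebView;
- (DOMDocument *) DOMDocumentFromHTML: (NSString *) htmlString;
@end

@implementation HTMLDOMLoader

- (DOMDocument *) DOMDocumentFromHTML: (NSString *) htmlString
{
    self.parserWebView = [[WebView alloc] init];
    [[self.parserWebView mainFrame] loadHTMLString: htmlString
                                    baseURL: nil];

    // Block the caller by servicing the run loop in short slices
    // until the WebView no longer reports that it is loading.
    while ([self.parserWebView isLoading]) {
        [[NSRunLoop currentRunLoop]
            runMode: NSDefaultRunLoopMode
            beforeDate: [NSDate dateWithTimeIntervalSinceNow: 0.05]];
    }

    return [[self.parserWebView mainFrame] DOMDocument];
}

@end
```

Holding the WebView in a property would also sidestep the question of when to release it: it goes away together with the loader object. Note that the caller is only blocked in the sense that the method does not return until loading stops; timers and other run loop sources scheduled in the default mode can still fire while it waits.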





