Select elements added to the DOM by a script

Posted 2023-01-14 09:02

I've been trying to get either an <object> or an <embed> tag using:

HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");

This doesn't seem to work.

Can anyone please tell me how to get these tags and their InnerHtml?

A YouTube embedded video looks like this:

    <embed height="385" width="640" type="application/x-shockwave-flash"
           src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
           allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">

I have a feeling the JavaScript might stop the SWF player from working; I hope not...

Cheers

Update 2010-08-26 (in response to OP's comment):

I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:

string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";

Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.

Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.

In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).

I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
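To make the escaping point concrete, here is a minimal sketch that uses only the .NET base class library. The sample string is made up for illustration (it is not copied from a real YouTube page), and `EscapedHtmlSketch` is a hypothetical name. It prints the markup as it might sit inside a JavaScript string literal, then the same markup with each backslash-escaped character restored.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapedHtmlSketch
{
    // Hypothetical helper: drops the backslash from every "\x" pair.
    // It does NOT interpret JavaScript escapes such as \n or \uXXXX;
    // those would simply lose their backslash.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Made-up sample: an <embed> tag as escaped text inside JavaScript.
        string escaped =
            "<embed id=\\\"movie_player\\\" src=\\\"http:\\/\\/example.com\\/player.swf\\\">";

        Console.WriteLine(escaped);           // the text an HTML parser would have to cope with
        Console.WriteLine(Unescape(escaped)); // the markup after removing the backslashes
    }
}
```

An HTML parser handed the first line would see attribute values full of stray backslashes, which is why some unescaping step has to run before the fragment can be loaded as HTML.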

`YouTubeScraper`

OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.

using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
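        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (scriptNodes == null)
        {
            return null;
        }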

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int objectNodeLocation = javascript.IndexOf("<object");

            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");

                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var objectDoc = new HtmlDocument();

                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");

                    return objectNode;
                }
            }
        }

        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
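        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (scriptNodes == null)
        {
            return null;
        }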

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");

            if (approxEmbedNodeLocation != -1)
            {
                // Skip the 15 characters of <\/object>" : " so the substring starts at <embed.
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                int embedNodeEndLocation = htmlStart.IndexOf(">\";");

                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var embedDoc = new HtmlDocument();

                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");

                    return videoEmbedNode;
                }
            }
        }

        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");

        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();

        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}

And in case you're interested, here's a little demo I threw together (super fancy, I know):

using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}

Original Answer

Why not try using the element's Id instead?

HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");

Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.
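A rough sketch of that idea, run against a made-up inline page rather than a live URL. It assumes the HtmlAgilityPack NuGet package is referenced; `ScriptHtmlSketch`, `FindEmbedInScripts`, the sample markup, and the `>";` end marker are all hypothetical choices for illustration. The first lookup searches the page directly, where the `<embed>` should not be found because it is only text inside the `<script>`; the second pulls the fragment out of the script text, unescapes it, and parses it as its own document.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class ScriptHtmlSketch
{
    // Made-up page: the <embed> exists only as escaped text inside a script.
    const string SamplePage =
        "<html><body><script>" +
        "var player = \"<embed id=\\\"movie_player\\\" src=\\\"http:\\/\\/example.com\\/player.swf\\\">\";" +
        "</script></body></html>";

    // Hypothetical helper: finds the first <embed ...> fragment in any script,
    // removes the backslash escaping, and parses the fragment as HTML.
    public static HtmlNode FindEmbedInScripts(string pageHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        HtmlNodeCollection scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts == null) return null; // SelectNodes returns null when nothing matches

        foreach (HtmlNode script in scripts)
        {
            string js = script.InnerHtml;

            int start = js.IndexOf("<embed", StringComparison.Ordinal);
            if (start == -1) continue;

            // Assumed end marker: the tag's closing '>' followed by the end of the JS string.
            int end = js.IndexOf(">\";", start, StringComparison.Ordinal);
            if (end == -1) continue;

            string escaped = js.Substring(start, end - start + 1);
            string unescaped = Regex.Replace(escaped, @"\\(.)", "$1");

            var fragment = new HtmlDocument();
            fragment.LoadHtml(unescaped);
            return fragment.DocumentNode.SelectSingleNode("//embed");
        }

        return null;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SamplePage);

        // Direct lookup on the page itself.
        HtmlNode direct = doc.DocumentNode.SelectSingleNode("//embed");
        Console.WriteLine(direct == null ? "direct lookup: not found" : "direct lookup: found");

        // Lookup via the script text.
        HtmlNode embed = FindEmbedInScripts(SamplePage);
        Console.WriteLine(embed == null ? "script lookup: not found" : embed.OuterHtml);
    }
}
```

The end marker is the fragile part: it depends entirely on how the page's JavaScript happens to be written, so any change to that script would need a matching change here.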
