
Parsing Html With The Html Agility Pack And Linq

I have the following HTML (..) Test1 Data Data 2 &l

Solution 1:

As for your attempt, you have two issues with your code:

  1. ChildNodes is weird: it also returns whitespace text nodes, which don't have a class attribute (text nodes can't have attributes, of course).
  2. As James Walford commented, the spaces around the text are significant, so you probably want to trim them.

With these two corrections, the following works:

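var data = from tr in doc.DocumentNode.Descendants("tr")
           // GetAttributeValue returns the default ("") instead of throwing when a <td> has no class attribute
           from td in tr.Descendants("td").Where(x => x.GetAttributeValue("class", "") == "name")
           where td.InnerText.Trim() == "Test1"
           select tr;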
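
To see both issues in isolation, here is a minimal sketch. The markup in SampleHtml is a hypothetical stand-in for the question's table (the original HTML is not reproduced here), and the class name ChildNodesDemo is made up for illustration. It prints every child of the <tr>, including the whitespace text nodes, and then the untrimmed and trimmed cell text side by side.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ChildNodesDemo
{
    // Hypothetical markup standing in for the question's table; note the
    // line breaks and spaces between the tags and around the cell text.
    public const string SampleHtml =
        "<table><tr>\n" +
        "  <td class=\"name\"> Test1 </td>\n" +
        "  <td class=\"data\"> Data </td>\n" +
        "</tr></table>";

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        HtmlNode tr = doc.DocumentNode.Descendants("tr").First();

        // ChildNodes mixes the <td> elements with whitespace-only text nodes.
        foreach (HtmlNode child in tr.ChildNodes)
        {
            // Attributes["class"] is null for nodes without that attribute,
            // so GetAttributeValue with a default is used instead of .Value.
            string cls = child.GetAttributeValue("class", "(none)");
            Console.WriteLine($"{child.NodeType}: class={cls}");
        }

        // Elements("td") yields only the <td> child elements.
        foreach (HtmlNode td in tr.Elements("td"))
        {
            Console.WriteLine($"[{td.InnerText}] vs [{td.InnerText.Trim()}]");
        }
    }
}
```

Filtering with Elements("td") or Descendants("td") sidesteps the text nodes entirely, which is why the query above does not need a separate node-type check.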

Solution 2:

Here is the XPath way - hmmm... everyone seems to have forgotten about the power of XPath and to concentrate exclusively on C# XLinq these days :-)

This function gets all data values associated with a name:

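public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = document.DocumentNode.SelectNodes(
        "//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td");
    if (nodes == null)
        return Enumerable.Empty<string>();
    return from HtmlNode node in nodes
           select node.InnerText.Trim();
}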

For example, this code will dump all 'Test2' data:

    HtmlDocument doc = new HtmlDocument();
    // Load expects a file path or stream; for a string of markup use doc.LoadHtml(...) instead.
    doc.Load(yourHtml);

    foreach (string data in GetData(doc, "Test2"))
    {
        Console.WriteLine(data);
    }
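
One thing to keep in mind with contains(): it matches substrings, so asking for "Test1" would also pick up a row named "Test10". If an exact match is wanted, a variant along these lines might do. This is only a sketch: GetDataExact and XPathExactMatch are hypothetical names, and it assumes the name never contains an apostrophe, which would break the XPath string literal.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathExactMatch
{
    // Hypothetical variant of GetData: compares the whole cell text, with
    // surrounding whitespace stripped by normalize-space, instead of a substring.
    public static IEnumerable<string> GetDataExact(HtmlDocument document, string name)
    {
        // Assumption: name contains no apostrophe.
        string xpath = "//td[@class='name' and normalize-space(.)='" + name +
                       "']/following-sibling::td";

        // SelectNodes returns null when nothing matches, so guard before enumerating.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? Enumerable.Empty<string>()
            : nodes.Select(node => node.InnerText.Trim());
    }
}
```

It is called the same way as GetData, for example GetDataExact(doc, "Test2").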

Solution 3:

Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
                              .SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
    node => node.Descendants("td")
        .ToDictionary(descendant => descendant.Attributes["class"].Value,
                      descendant => descendant.InnerText.Trim())
        ).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];

Here I turn every <tr> into a dictionary, where the class of the <td> is the key and the text is the value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
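
As for abstracting it away, one possible shape is a small helper like the one below. Treat it as a sketch under assumptions rather than a finished implementation: TableReader and ReadRows are hypothetical names, the inline markup is made-up sample data loaded with LoadHtml so the example does not depend on the jsbin page, and it assumes each <td> carries a single class and that row names are unique (ToDictionary throws on duplicate keys).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableReader
{
    // Hypothetical helper: maps each row's "name" cell to a dictionary of
    // <td> class -> trimmed cell text.
    public static Dictionary<string, Dictionary<string, string>> ReadRows(
        HtmlDocument doc, string rowXPath)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(rowXPath);
        if (rows == null)
            return new Dictionary<string, Dictionary<string, string>>();

        return rows
            .Select(row => row.Descendants("td")
                // Skip cells without a class attribute instead of throwing on them.
                .Where(td => td.Attributes["class"] != null)
                .ToDictionary(td => td.Attributes["class"].Value,
                              td => td.InnerText.Trim()))
            // Skip rows that have no "name" cell (header rows, for instance).
            .Where(cells => cells.ContainsKey("name"))
            .ToDictionary(cells => cells["name"]);
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample markup for illustration only.
        doc.LoadHtml(
            "<table id='MyTable'>" +
            "<tr><td class='name'> Test1 </td><td class='data'> Data </td></tr>" +
            "<tr><td class='name'> Test2 </td><td class='data'> Data 2 </td></tr>" +
            "</table>");

        var rows = ReadRows(doc, "//table[@id='MyTable']//tr");
        Console.WriteLine(rows["Test1"]["data"]);
    }
}
```

The two Where filters are where the "more validation" would go; whether to skip malformed rows silently or report them depends on how much you trust the page.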

Solution 4:

Instead of

td.InnerText == "Test1"

try

td.InnerText == " Test1 "

or

td.InnerText.Trim() == "Test1"
