Parsing Html With The Html Agility Pack And Linq
Solution 1:
As for your attempt, you have two issues with your code:
- ChildNodes is weird: it also returns whitespace text nodes, which don't have a class attribute (text nodes can't have attributes, of course).
- As James Walford commented, the spaces around the text are significant, so you probably want to trim them.
With these two corrections, the following works:
var data =
    from tr in doc.DocumentNode.Descendants("tr")
    from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
    where td.InnerText.Trim() == "Test1"
    select tr;
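To make the first issue above concrete, here is a minimal sketch that counts the children of a row with and without the whitespace text nodes. The class name ChildNodesDemo, the method CountChildren and the sample markup are assumptions made up for illustration; they are not from the original question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ChildNodesDemo
{
    // Hypothetical sample markup; the original question's HTML is not reproduced here.
    public const string SampleHtml =
        "<table><tr>\n" +
        "  <td class=\"name\"> Test1 </td>\n" +
        "  <td class=\"data\"> data1 </td>\n" +
        "</tr></table>";

    // Returns (all child nodes, element-only child nodes) for the first <tr>.
    public static Tuple<int, int> CountChildren(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode tr = doc.DocumentNode.SelectSingleNode("//tr");

        // ChildNodes also contains the whitespace between the <td> tags as text nodes.
        int all = tr.ChildNodes.Count;

        // Filtering on NodeType keeps only real elements such as <td>.
        int elements = tr.ChildNodes.Count(n => n.NodeType == HtmlNodeType.Element);

        return Tuple.Create(all, elements);
    }

    public static void Main()
    {
        Tuple<int, int> counts = CountChildren(SampleHtml);
        Console.WriteLine("all children: " + counts.Item1);
        Console.WriteLine("element children: " + counts.Item2);
    }
}
```

Descendants("td"), as used in the query above, sidesteps the problem because it only yields elements with that tag name.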
Solution 2:
Here is the XPath way - hmmm... everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
    return from HtmlNode node in
               document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
           select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
    Console.WriteLine(data);
}
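Two things are worth knowing about the function above: contains() matches substrings, so "Test1" would also match a cell containing "Test10", and SelectNodes returns null rather than an empty collection when nothing matches. The following is a sketch of a stricter variant, not the answerer's code. The name GetDataExact and the sample markup are hypothetical, and it assumes the name contains no single quote.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ExactNameLookup
{
    // Hypothetical variant of GetData: compares the whole trimmed cell text.
    public static IEnumerable<string> GetDataExact(HtmlDocument document, string name)
    {
        // Assumption: `name` contains no single quote (XPath 1.0 cannot escape one).
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//td[@class='name' and normalize-space(text())='" + name + "']/following-sibling::td");

        // SelectNodes yields null, not an empty collection, when there is no match.
        if (nodes == null)
            return Enumerable.Empty<string>();

        return nodes.Select(node => node.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical sample markup with two similar names.
        doc.LoadHtml(
            "<table>" +
            "<tr><td class=\"name\"> Test1 </td><td class=\"data\"> data1 </td></tr>" +
            "<tr><td class=\"name\"> Test10 </td><td class=\"data\"> data10 </td></tr>" +
            "</table>");

        foreach (string data in GetDataExact(doc, "Test1"))
        {
            Console.WriteLine(data);
        }
    }
}
```

normalize-space() trims the surrounding whitespace inside the XPath expression itself, so the comparison is an exact match on the cell text.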
Solution 3:
Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
    .SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
    node => node.Descendants("td")
        .ToDictionary(descendant => descendant.Attributes["class"].Value,
                      descendant => descendant.InnerText.Trim())
    ).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> into a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
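One possible way to "abstract that away" is a small helper that also adds some of the missing validation. This is only a sketch under assumptions: the class RowIndex, the method Build and the inline sample markup are hypothetical, and rows without a td of class "name" (such as header rows) are simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowIndex
{
    // Hypothetical helper: maps each row's "name" cell text to a {class -> text} dictionary.
    public static Dictionary<string, Dictionary<string, string>> Build(HtmlDocument doc, string rowXPath)
    {
        var result = new Dictionary<string, Dictionary<string, string>>();

        // SelectNodes yields null when no row matches the XPath.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(rowXPath);
        if (rows == null)
            return result;

        foreach (HtmlNode row in rows)
        {
            // Cells without a class attribute are ignored; ToDictionary still
            // throws if one row repeats a class.
            Dictionary<string, string> cells = row.Descendants("td")
                .Where(td => td.GetAttributeValue("class", "") != "")
                .ToDictionary(td => td.GetAttributeValue("class", ""),
                              td => td.InnerText.Trim());

            string name;
            if (cells.TryGetValue("name", out name))
                result[name] = cells; // a later row with the same name overwrites an earlier one
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical sample markup standing in for the page loaded above.
        doc.LoadHtml(
            "<table id=\"MyTable\">" +
            "<tr><th>Name</th><th>Data</th></tr>" +
            "<tr><td class=\"name\"> Test1 </td><td class=\"data\"> data1 </td></tr>" +
            "<tr><td class=\"name\"> Test2 </td><td class=\"data\"> data2 </td></tr>" +
            "</table>");

        var data = Build(doc, "//table[@id='MyTable']//tr");
        Console.WriteLine(data["Test1"]["data"]);
    }
}
```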
Solution 4:
Instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"