What is HtmlAgilityPack?
HtmlAgilityPack (or simply Html Agility Pack) is a .NET library (.dll) that provides methods and properties that let C# and VB developers conveniently parse, extract data from, and manipulate HTML documents.
One thing I find very interesting about HtmlAgilityPack is that it can extract data even from a page with bad markup. What is bad markup? In HTML, most elements are written with an opening tag and a closing tag. Even if a closing tag is missing, HtmlAgilityPack can still parse the page and extract the data of that element.
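To see the tolerance for bad markup in isolation, here is a minimal sketch that parses a hard-coded string instead of a live page. The HTML snippet, the class name BadMarkupDemo and the method name GetSpanTexts are made up for illustration; LoadHtml, SelectNodes and InnerText are members of the library. How the parser repairs a missing closing tag may differ between library versions, so the sketch simply prints whatever it finds.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BadMarkupDemo
{
    // Returns the inner text of every <span> in the given HTML.
    public static List<string> GetSpanTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);      // Parses the string; unclosed tags do not throw.

        var texts = new List<string>();
        var spans = doc.DocumentNode.SelectNodes("//span");
        if (spans != null)       // SelectNodes returns null when nothing matches.
        {
            foreach (var span in spans)
                texts.Add(span.InnerText.Trim());
        }
        return texts;
    }

    public static void Main()
    {
        // Hypothetical snippet: the second <span> is never closed.
        string html = "<div><span>First</span><span>Second</div>";
        foreach (var text in GetSpanTexts(html))
            Console.WriteLine(text);
    }
}
```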
HtmlAgilityPack is very easy to use. I'll show you how.
How to install HtmlAgilityPack?
First, let me show you how to install HtmlAgilityPack. It's a library that you need to add to your project.
There are two ways you can install the library.
1) Install HtmlAgilityPack using NuGet.
If you are using .Net 4 or later, you should have access to the NuGet package manager in Visual Studio.
Follow these steps.
a) Create a new website using Visual Studio
b) Open "Solution Explorer", right click the solution and click the Manage NuGet Packages… option.
c) In the NuGet packages window, type HtmlAgilityPack in the search box and click the Install button.
2) In case you don't have access to NuGet, or you could not install the library using the NuGet package, you can download the library directly from its website.
It will download a zip file. Extract the file and copy the library into the bin folder of your project. If you don't find a bin folder, create it in the root directory of your project.
Now, let's see some examples. All examples have code in both C# and VB.
Example 1: Get Metadata of a web page
In this example, we'll extract the metadata that is available in a web page. All you need is the URL (the address) of the web page.
Remember: Metadata in a web page provides information about the page or the HTML document. It is assigned using the <meta> tag.
// C# code-behind
using System;
using HtmlAgilityPack;

public partial class SiteMaster : System.Web.UI.MasterPage
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm";

        HtmlWeb htmlWeb = new HtmlWeb();
        HtmlDocument htmlDoc = htmlWeb.Load(url);      // Load the web page.

        // Parse <meta> tag details of a web page.
        var metaTags = htmlDoc.DocumentNode.SelectNodes("//meta");

        if (metaTags != null)      // SelectNodes returns null when no node matches.
        {
            foreach (var tag in metaTags)
            {
                // Use only the <meta> tags that have both a name and a content attribute.
                if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
                {
                    // "div" refers to a <div runat="server"> element in the page markup.
                    div.InnerHtml = div.InnerHtml + "<br /> " +
                        "<b> Page " + tag.Attributes["name"].Value + " </b>: " +
                        tag.Attributes["content"].Value + "<br />";
                }
            }
        }
    }
}
' VB code-behind
Option Explicit On
Imports HtmlAgilityPack

Partial Class Site
    Inherits System.Web.UI.MasterPage

    Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
        Dim url As String = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm"

        Try
            Dim web As HtmlWeb = New HtmlWeb()
            Dim doc As HtmlDocument = web.Load(url)      ' Load the web page.

            ' Parse <meta> tag details of a web page.
            Dim metaTags = doc.DocumentNode.SelectNodes("//meta")

            If Not IsNothing(metaTags) Then      ' SelectNodes returns Nothing when no node matches.
                For Each tag As HtmlNode In metaTags
                    ' Use only the <meta> tags that have both a name and a content attribute.
                    If Not IsNothing(tag.Attributes("name")) AndAlso Not IsNothing(tag.Attributes("content")) Then
                        divPageDescription.InnerHtml = divPageDescription.InnerHtml & "<br /> " & _
                            "<b> Page " & tag.Attributes("name").Value & " </b>: " & tag.Attributes("content").Value & "<br />"
                    End If
                Next
            End If
        Catch ex As Exception
            ' Handle or log the error here, for example when the page cannot be loaded.
        End Try
    End Sub
End Class
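If you need only one piece of metadata, such as the page description, the XPath expression can do the filtering for you. The following is a small sketch rather than part of the example above: the class name MetaDescriptionDemo, the method name GetDescription and the sample markup are assumptions made for illustration, and the page is read from a string so that it runs without a network call. Note that the predicate matches the value "description" exactly as written.

```csharp
using System;
using HtmlAgilityPack;

public static class MetaDescriptionDemo
{
    // Returns the content of <meta name="description">, or "" if it is absent.
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The predicate [@name='description'] keeps only the matching <meta> tag.
        var node = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");

        // GetAttributeValue returns the supplied default when the attribute is missing.
        return node == null ? "" : node.GetAttributeValue("content", "");
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        string html = "<html><head><meta name=\"description\" content=\"A sample page\"></head><body></body></html>";
        Console.WriteLine(GetDescription(html));
    }
}
```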
Example 2: Get all Images with details on a web page
Web pages may or may not have images. The following example shows how to extract (parse) details about images on a web page.
// C# code-behind
using System;
using HtmlAgilityPack;

public partial class SiteMaster : System.Web.UI.MasterPage
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm";

        HtmlWeb htmlWeb = new HtmlWeb();
        HtmlDocument htmlDoc = htmlWeb.Load(url);      // Load the web page.

        // Parse <img> tag details of a web page. (get image details)
        var imgTags = htmlDoc.DocumentNode.SelectNodes("//img");

        if (imgTags != null)      // SelectNodes returns null when no node matches.
        {
            foreach (var tag in imgTags)
            {
                // Check the attribute itself; reading .Value on a missing attribute throws.
                if (tag.Attributes["src"] != null)
                {
                    // The alt attribute is optional, so fall back to an empty string.
                    div.InnerHtml = div.InnerHtml + "<br /> " +
                        "<b>Image</b>: " + tag.Attributes["src"].Value +
                        " <br/> <b>Alt text</b>: " + tag.GetAttributeValue("alt", "") + "<br />";
                }
            }
        }
    }
}
' VB code-behind
Option Explicit On
Imports HtmlAgilityPack

Partial Class Site
    Inherits System.Web.UI.MasterPage

    Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
        Dim url As String = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm"

        Try
            Dim web As HtmlWeb = New HtmlWeb()
            Dim doc As HtmlDocument = web.Load(url)      ' Load the web page.

            ' Parse <img> tag details of a web page.
            Dim imgTags = doc.DocumentNode.SelectNodes("//img")

            If Not IsNothing(imgTags) Then      ' SelectNodes returns Nothing when no node matches.
                For Each tag As HtmlNode In imgTags
                    If Not IsNothing(tag.Attributes("src")) Then
                        ' The alt attribute is optional, so fall back to an empty string.
                        div.InnerHtml = div.InnerHtml & "<br /> " & _
                            "<b>Image</b>: " & tag.Attributes("src").Value & "<br /> <b>Alt text</b>: " & tag.GetAttributeValue("alt", "") & "<br />"
                    End If
                Next
            End If
        Catch ex As Exception
            ' Handle or log the error here, for example when the page cannot be loaded.
        End Try
    End Sub
End Class
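The src value of an image is often a relative path such as /images/logo.png, which is not usable on its own outside the page. Here is a hedged sketch of one way to turn those values into full addresses with the .NET Uri class. The class name ImageUrlDemo, the method name GetImageUrls, the example.com address and the sample markup are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageUrlDemo
{
    // Returns absolute URLs for every <img> that has a src attribute.
    public static List<string> GetImageUrls(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // [@src] keeps only the <img> elements that carry a src attribute.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null) return urls;

        foreach (var img in images)
        {
            string src = img.GetAttributeValue("src", "");
            Uri absolute;

            // Resolve relative paths against the page address; skip values that cannot be parsed.
            if (Uri.TryCreate(pageUri, src, out absolute))
                urls.Add(absolute.AbsoluteUri);
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical page address and markup.
        var pageUri = new Uri("https://www.example.com/articles/page.htm");
        string html = "<div><img src=\"/images/logo.png\" alt=\"Logo\"><img alt=\"no source\"></div>";

        foreach (var url in GetImageUrls(html, pageUri))
            Console.WriteLine(url);
    }
}
```

The second image in the sample has no src attribute, so the XPath filter leaves it out and only one address is printed.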
That's it.
I have shared two examples in this tutorial, showing how to extract the metadata and the image details of a web page. You can extract other information from a web page in the same way, using the HtmlAgilityPack library in Asp.Net.
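As one illustration of other information you could extract, the same pattern works for the links on a page. This is only a sketch: the class name LinkDemo, the method name GetLinks and the sample markup are hypothetical, and the HTML is read from a string to keep it self-contained.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkDemo
{
    // Returns "text -> href" for every <a> element that has an href attribute.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;      // No matching node.

        foreach (var a in anchors)
            links.Add(a.InnerText.Trim() + " -> " + a.GetAttributeValue("href", ""));
        return links;
    }

    public static void Main()
    {
        // Hypothetical markup: the second anchor has no href, so it is skipped.
        string html = "<div><a href=\"/home\">Home</a> <a name=\"top\">No link</a></div>";
        foreach (var link in GetLinks(html))
            Console.WriteLine(link);
    }
}
```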
Hope you find this information useful.