Parsing websites in C# with Html Agility Pack or AngleSharp

While developing the "new" canonical check feature for elmah.io Uptime Monitoring, I had to parse a website from C# and inspect the DOM. I have used Html Agility Pack in the past, so this was an obvious choice. But I also looked at what has been happening in the space and found that AngleSharp is an excellent alternative. In this blog post, I'll showcase both frameworks to help you get started.


Before we start coding, let's talk about web scraping in general. The techniques presented in this post can be used to extract content from other people's websites. While you may assume that publicly available data is free to consume from code, this is often not the case. Most websites don't allow crawling of their content, and you may be violating copyright laws by doing so. Always make sure that web crawling is permitted and/or reach out to the company behind the website to make sure that scraping their content is allowed. With that out of the way, let's look at some code.

There are numerous tools and services providing web scraping capabilities from C#. With more than 100M downloads from NuGet, HtmlAgilityPack is the most popular one, closely followed by AngleSharp with almost 70M downloads (both numbers have most likely increased since the writing of this post). For this demo, I'll show how to create a web scraping CLI with both frameworks, highlighting some of the main differences.

We'll be building a CLI showing price information about books. The CLI uses a service named Books to Scrape, which is a fictional bookstore that is intended for testing scraping tools.

Start by creating a new console application:

dotnet new console -n Books.Cli

Next, install the HtmlAgilityPack package (an example with AngleSharp is coming up in a few minutes):

dotnet add package HtmlAgilityPack

I'll also use Spectre.Console to make the CLI look pretty:

dotnet add package Spectre.Console

If you are not familiar with Spectre.Console, it's an awesome library that adds colors and widgets to CLIs. For more information, check out this post: Create a colored CLI with System.CommandLine and Spectre.
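
As a quick taste of what Spectre.Console can do, here's a minimal sketch (assuming a using Spectre.Console; directive):

// Renders the text in green using Spectre.Console's markup syntax
AnsiConsole.MarkupLine("[green]Books CLI ready![/]");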

The CLI we'll create in this post will present a list of books available on the front page of Books to Scrape:

Books to Scrape frontpage

When the user selects a book, we'll make another request to the landing page of the book and fetch the price:

A book on Books to Scrape

I know that the price is shown on the front page as well, but fetching it from the book's page illustrates the scenario of parsing and navigating across a website.

Start by creating a new HttpClient and fetching the front page:

var httpClient = new HttpClient
{
    BaseAddress = new Uri("https://books.toscrape.com/")
};
var response = await httpClient.GetAsync("/");
var content = await response.Content.ReadAsStreamAsync();
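
As a side note, GetAsync doesn't throw on non-success status codes like 404 or 500. If you want the scraper to fail fast on errors, you can add a check right after the call (a small addition on my part, not part of the original sample):

// Throws an HttpRequestException for non-2xx responses
response.EnsureSuccessStatusCode();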

HtmlAgilityPack provides a class named HtmlDocument, which acts as a facade over the HTML content. Loading the stream from the HTTP client is easy:

var htmlDocument = new HtmlDocument();
htmlDocument.Load(content);
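
If you don't need control over the HTTP request, HtmlAgilityPack can also download the page for you through its HtmlWeb class. A minimal alternative sketch:

// Alternative: let HtmlAgilityPack fetch and parse the page in one call
var web = new HtmlWeb();
var frontPage = web.Load("https://books.toscrape.com/");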

You can query the DOM using the HtmlDocument, and you can even modify the content (not covered in this post). Queries are written using XPath, which some people love and others hate. There's an extension that allows you to query a document using CSS selectors if you prefer that (more on that in a bit). I'll start by fetching the list of books and adding the title and URL to a dictionary:

var selections = new Dictionary<string, string>();

// Each book on the front page is wrapped in an <article class="product_pod"> element
var books = htmlDocument.DocumentNode.SelectNodes("//article[@class='product_pod']");
foreach (var book in books)
{
    // The first <div> inside the article holds the link to the book's landing page
    var anchor = book.SelectSingleNode("div[1]/a");
    var href = anchor.GetAttributeValue("href", string.Empty);

    // The cover image's alt attribute contains the book's title
    var image = anchor.SelectSingleNode("img[@class='thumbnail']");
    var alt = image.GetAttributeValue("alt", string.Empty);

    selections.Add(alt, href);
}

This is not an in-depth tutorial on XPath syntax. But //article[@class='product_pod'] basically means "Give me all <article> elements with a class attribute with the value product_pod". For each <article> element (a book), we'll select the anchor element to get the URL for the book, as well as the image element to get the title. You will need to inspect the markup of the site you are scraping to get this right. As an alternative, you can select the desired element in the browser's Developer Tools, right-click, and copy the XPath:

Copy XPath

This will produce a rather long XPath expression, which is why a basic understanding of XPath syntax is preferable here.
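
If you'd rather skip XPath entirely, the CSS selector extension mentioned earlier is worth a look. A hedged sketch, assuming the HtmlAgilityPack.CssSelectors.NetCore package is installed and exposes QuerySelector/QuerySelectorAll extension methods:

// Requires a CSS selector extension package like HtmlAgilityPack.CssSelectors.NetCore
var bookNodes = htmlDocument.QuerySelectorAll("article.product_pod");
var firstAnchor = bookNodes[0].QuerySelector("div.image_container > a");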

The next step is to present the user with the list of available books. We can use the Prompt widget from Spectre.Console to present all titles:

var selectedBook = AnsiConsole.Prompt(
    new SelectionPrompt<string>()
        .Title("Pick a book")
        .PageSize(10)
        .MoreChoicesText("[grey](Move up and down to reveal more books)[/]")
        .AddChoices(selections.Keys));

When the user selects a book, the title is returned and stored in the selectedBook variable. We can use that to look up the landing page for that book and inspect it with another XPath expression:

var selectedHref = selections[selectedBook];

response = await httpClient.GetAsync(selectedHref);
content = await response.Content.ReadAsStreamAsync();

htmlDocument = new HtmlDocument();
htmlDocument.Load(content);

// XPath generated with Chrome's Copy XPath feature (brittle, but illustrative)
var price = htmlDocument.DocumentNode.SelectSingleNode(
    "//*[@id=\"content_inner\"]/article/div[1]/div[2]/p[1]");

var table = new Table();
table.AddColumn("Title");
table.AddColumn("Price");
table.AddRow(selectedBook, price.InnerText);
AnsiConsole.Write(table);

Like in the previous example, we use HttpClient to get the content of the URL and HtmlAgilityPack to query for the price element. I have used the Copy XPath feature in Chrome to illustrate what a generated XPath expression looks like. Finally, I'm using the Table widget from Spectre.Console to output the title and price in a table structure. Let's run the code:

Pick book

Sweet! Our very own Books CLI presents random books with imaginary prices.
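
One caveat worth knowing before moving on: HtmlAgilityPack's SelectNodes and SelectSingleNode return null rather than an empty result when nothing matches the expression. A small guard like this (my addition, not part of the original sample) avoids a NullReferenceException if the markup changes:

// SelectSingleNode returns null when the XPath matches nothing
if (price == null)
{
    AnsiConsole.MarkupLine("[red]Could not find a price element on the page[/]");
    return;
}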

To be fair to AngleSharp, let's re-implement the code using that framework instead. Start by installing the NuGet package:

dotnet add package AngleSharp

AngleSharp uses a slightly different approach to fetch the content:

var browsingContext = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
var baseUrl = new Uri("https://books.toscrape.com/");
var document = await browsingContext.OpenAsync(baseUrl.AbsoluteUri);

The content of the URL is loaded into the document variable of type IDocument. This document corresponds to the HtmlDocument provided by HtmlAgilityPack.
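
If you already have the HTML as a string (for example from an existing HttpClient call), AngleSharp can also parse it directly without a browsing context. A minimal sketch, assuming a using AngleSharp.Html.Parser; directive:

// Parse an in-memory HTML string instead of loading a URL
var parser = new HtmlParser();
var parsed = parser.ParseDocument("<html><body><h1>Hello</h1></body></html>");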

Next, we'll parse the list of books:

var selections = new Dictionary<string, string>();

var books = document.DocumentElement.QuerySelectorAll("article.product_pod");
foreach (var book in books)
{
    var anchor = book.QuerySelector("div.image_container > a");
    var href = anchor.GetAttribute("href");

    var image = anchor.QuerySelector("img.thumbnail");
    var alt = image.GetAttribute("alt");

    selections.Add(alt, href);
}

var selectedBook = AnsiConsole.Prompt(
    new SelectionPrompt<string>()
        .Title("Pick a book")
        .PageSize(10)
        .MoreChoicesText("[grey](Move up and down to reveal more books)[/]")
        .AddChoices(selections.Keys));

Notice that the query syntax is different from the previous example. Out of the box, AngleSharp uses CSS selectors rather than XPath. Some people prefer CSS selectors, while others really dig XPath. As with HtmlAgilityPack, there's an extension that lets you work with both query syntaxes if you prefer one over the other.
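
For example, the AngleSharp.XPath package adds XPath querying on top of AngleSharp's DOM. A hedged sketch, assuming that package is installed:

// Requires the AngleSharp.XPath package
var articleNodes = document.Body.SelectNodes("//article[@class='product_pod']");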

The final step is to fetch the landing page of the selected book and present it in the same way as in the previous example:

var selectedHref = selections[selectedBook];

document = await browsingContext.OpenAsync(new Uri(baseUrl, selectedHref).AbsoluteUri);

var price = document.DocumentElement.QuerySelector(
    "#content_inner > article > div.row > div.col-sm-6.product_main > p.price_color");

var table = new Table();
table.AddColumn("Title");
table.AddColumn("Price");
table.AddRow(selectedBook, price.TextContent);
AnsiConsole.Write(table);

Again, I have used the trick of copying the selector through Developer Tools in Chrome. You will need to pick Copy | Copy selector to copy the full CSS query selector.

I hope this helps you get started with scraping websites from C#. But remember to get permission from the author of the website before going bananas and building your very own IMDb clone.

elmah.io: Error logging and Uptime Monitoring for your web apps

This blog post is brought to you by elmah.io. elmah.io is error logging, uptime monitoring, deployment tracking, and service heartbeats for your .NET and JavaScript applications. Stop relying on your users to notify you when something is wrong or digging through hundreds of megabytes of log files spread across servers. With elmah.io, we store all of your log messages, notify you through popular channels like email, Slack, and Microsoft Teams, and help you fix errors fast.

See how we can help you monitor your website for crashes.