
Replace All < And > That Are Not Part Of An Html Tag

I have been trying to work through a RegEx that I could use to replace all < and > text strings, EXCEPT for when those strings are part of an HTML tag. For example: var str

Solution 1:

This is not easy. See the authoritative answer to a related question here.

Regular expressions are not built for this type of parsing. Even tokenizing or DOM parsing can cause problems. The title of your question illustrates the problem:

Replace all < and > that are NOT part of an HTML tag

How can your parser know if < and > is an <AND> tag, or simply two orphan angle brackets around the word and?

An HTML parser is probably your best bet, but how the orphan brackets are handled is key. Also, you would need to look for unmatched tags or illegal tags to catch cases such as the title of your question.

Solution 2:

HTML is notoriously difficult to parse using regular expressions. The HTML specifications are very forgiving, and browser implementations tend to be even more forgiving. As a result, matching something like this with regular expressions alone is almost impossible.

It's far more robust to use a full-blown HTML parser that understands all the special cases to generate a DOM, and then walk through the resulting DOM in code looking for angle brackets in the text nodes.

As you have tagged your question with .NET, I can recommend the HTML Agility Pack for this type of task.
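
As a rough illustration of the parse-then-walk idea, here is a minimal sketch using the HTML Agility Pack (a third-party NuGet package). The class and method names (OrphanBracketEscaper, EscapeTextBrackets) are made up for this example. It assumes the parser hands stray angle brackets back as text nodes, which depends on the input, so check the result against your own data before relying on it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

public static class OrphanBracketEscaper
{
    // Hypothetical helper: escapes < and > only inside text nodes and
    // leaves whatever the parser recognised as markup untouched.
    // Assumption: stray brackets end up in text nodes; verify on real input.
    public static string EscapeTextBrackets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the text nodes first so the tree is not enumerated while it is modified.
        // Text inside <script> and <style> is skipped, since escaping there would change code.
        var textNodes = doc.DocumentNode
            .Descendants()
            .OfType<HtmlTextNode>()
            .Where(n => n.ParentNode == null
                     || (n.ParentNode.Name != "script" && n.ParentNode.Name != "style"))
            .ToList();

        foreach (var node in textNodes)
        {
            // Only the two bracket characters are replaced; '&' is left alone
            // so entities already present in the source are not double-escaped.
            node.Text = node.Text.Replace("<", "&lt;").Replace(">", "&gt;");
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

A call such as OrphanBracketEscaper.EscapeTextBrackets(input) returns the re-serialised markup. Input like "a <b and c> d" will still be read as a tag by any parser, so the ambiguity raised in Solution 1 does not go away.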

Solution 3:

There have been several questions asked regarding how to detect text that is or is not in an HTML tag; you should be able to modify the concept to your needs.

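Basically, you're looking for a < that is not followed by a > anywhere later on the same line, and you want to replace it with the entity form &lt;. This is only a heuristic: a stray < that appears before a real tag on the same line is left alone, and > is not handled at all, so it would need a similar second pass. Try something like: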

var output = Regex.Replace(input, "<(?!.*?[>])", "&lt;");
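
To see where this heuristic holds and where it breaks, the sketch below runs the pattern from this answer on three sample strings, next to a tighter variant. The tighter pattern, the class name BracketRegexDemo, and the sample inputs are my own assumptions for illustration, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketRegexDemo
{
    // Pattern quoted from the answer: a '<' with no '>' anywhere later on the same line.
    public static readonly Regex Original = new Regex(@"<(?!.*?[>])");

    // Hypothetical tighter variant (not from the original answer):
    // a '<' that is not closed by a '>' before the next '<' or '>' appears.
    public static readonly Regex Tighter = new Regex(@"<(?![^<>]*>)");

    public static void Main()
    {
        string[] samples =
        {
            "1 < 2",              // lone bracket, no tags
            "1 < 2 <b>bold</b>",  // lone bracket followed by a real tag
            "x < and > y",        // the ambiguous case from Solution 1
        };

        foreach (var s in samples)
        {
            Console.WriteLine("input:    " + s);
            Console.WriteLine("original: " + Original.Replace(s, "&lt;"));
            Console.WriteLine("tighter:  " + Tighter.Replace(s, "&lt;"));
        }
    }
}
```

Both patterns turn "1 < 2" into "1 &lt; 2". On "1 < 2 <b>bold</b>" the original pattern changes nothing, because a > occurs later on the line, while the tighter one escapes only the first <. Neither touches "x < and > y", since "< and >" is indistinguishable from a tag at the regex level.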
