How to Match HTML Tags with Regex

Extracting or stripping HTML tags from a string is a common requirement when cleaning up user input or processing web scraping results. While you should generally use a DOM parser for complex HTML, a regex is often enough for lightweight stripping of simple, reasonably well-formed markup.

The Pattern Breakdown

The pattern <\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?> matches opening, closing, and self-closing tags, along with their attributes:

<\/?: Matches the opening bracket and an optional forward slash (for closing tags).
\w+: Matches the tag name (e.g., div, p, span).
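((\s+\w+(...)?)+\s*|\s*): The middle block handles any number of attributes, with or without a value (like class="foo" or disabled). Values may be double-quoted, single-quoted, or unquoted.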
\/?>: Matches the optional self-closing slash and the final closing bracket.
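
As a rough illustration of the breakdown above, here is a minimal Python sketch that compiles the pattern and uses it to list and strip tags. The names TAG_RE, find_tags, and strip_tags are hypothetical helpers chosen for this example; they are not part of any library.

```python
import re

# The pattern from this article, compiled once for reuse.
TAG_RE = re.compile(
    r"""<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)


def find_tags(html):
    """Return every full tag match, in document order."""
    # finditer + group(0) avoids findall's behavior of returning
    # tuples of capture groups instead of the whole match.
    return [m.group(0) for m in TAG_RE.finditer(html)]


def strip_tags(html):
    """Remove every substring that the pattern recognizes as a tag."""
    return TAG_RE.sub("", html)


sample = '<p class="intro">Hello <b>world</b>!<br/></p>'
print(find_tags(sample))   # ['<p class="intro">', '<b>', '</b>', '<br/>', '</p>']
print(strip_tags(sample))  # Hello world!
```

Using finditer rather than findall is a deliberate choice here: the pattern contains capture groups, so findall would return the group contents rather than the matched tags.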

Why Regex is Not a Full HTML Parser

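Remember the golden rule of web scraping: you cannot reliably parse HTML with regex alone. HTML is not a regular language, so no single pattern can pair opening tags with their closing tags across arbitrary nesting, and malformed markup will eventually break this one. Use it only for lightweight tag stripping, not as a security sanitizer for untrusted input.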
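
To make the limits concrete, the sketch below shows two inputs that this pattern leaves untouched, a comment and a tag with a hyphenated attribute name (\w+ does not match the hyphen), next to a fallback built on Python's standard html.parser module. Note that html.parser is a tokenizer rather than a full DOM parser, and that strip_tags, TextExtractor, and strip_with_parser are hypothetical names used only for this example.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the breakdown section.
TAG_RE = re.compile(
    r"""<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)


def strip_tags(html):
    """Regex-based stripping; misses anything the pattern cannot match."""
    return TAG_RE.sub("", html)


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_parser(html):
    """Parser-based stripping using only the standard library."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


hyphenated = '<div data-id="5">x</div>'
comment = "a<!-- c -->b"

# The regex removes only </div>; the opening tag survives because
# the attribute name contains a hyphen.
print(strip_tags(hyphenated))         # <div data-id="5">x
print(strip_tags(comment))            # a<!-- c -->b

# The parser handles both inputs.
print(strip_with_parser(hyphenated))  # x
print(strip_with_parser(comment))     # ab
```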
