How to Match HTML Tags with Regex
Extracting or stripping HTML tags from a string is a common requirement when cleaning up user input or processing web scraping results. A DOM parser is the right tool for complex or untrusted HTML, but a regex is often good enough for lightweight stripping of simple, well-formed markup.
The Pattern Breakdown
The pattern <\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?> captures opening, closing, and self-closing tags, along with their attributes:
- <\/?: Matches the opening bracket and an optional forward slash (for closing tags).
- \w+: Matches the tag name (e.g., div, p, span).
- The middle block handles any number of attributes (like class="foo" or disabled).
- \/?>: Matches the optional self-closing slash and the final closing bracket.
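As a minimal sketch of how the pattern above might be used, the following Python snippet compiles it once and applies it for both stripping and extraction. The helper names strip_tags and find_tags are illustrative assumptions, not part of any library.

```python
import re

# The pattern from the breakdown above, compiled once for reuse.
# A raw triple-quoted string lets it contain both quote characters unescaped.
TAG_RE = re.compile(
    r"""<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)


def strip_tags(text: str) -> str:
    """Remove every substring that the pattern recognizes as a tag."""
    return TAG_RE.sub("", text)


def find_tags(text: str) -> list[str]:
    """Return each matched tag in full.

    finditer is used instead of findall because the pattern contains
    capturing groups, and findall would return the groups, not the tags.
    """
    return [m.group(0) for m in TAG_RE.finditer(text)]


sample = '<div class="box"><p>Hello <br/>world</p></div>'
print(strip_tags(sample))  # Hello world
print(find_tags(sample))
```

Compiling the pattern once avoids re-parsing it on every call, which matters if the helpers run over many strings.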
Why Regex is Not a Full HTML Parser
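Remember the golden rule of web scraping: you cannot reliably parse HTML with regex alone. Because HTML is not a regular language, nested structures, comments, script blocks, and malformed markup will eventually defeat this pattern. Reserve it for lightweight stripping of simple markup, and do not rely on it as the only defense when sanitizing untrusted input.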
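To make the limitation concrete, here is a small sketch comparing the pattern with Python's standard-library html.parser on two inputs the regex handles poorly: a tag inside a comment, and a hyphenated attribute name (which \w+ does not match). The TextExtractor class and text_via_parser helper are hypothetical names chosen for illustration.

```python
import re
from html.parser import HTMLParser

TAG_RE = re.compile(
    r"""<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)


class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and comments are ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def text_via_parser(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Case 1: the regex strips tags inside the comment but leaves the comment itself.
commented = "<!-- <b>draft</b> --><p>Live</p>"
print(TAG_RE.sub("", commented))   # <!-- draft -->Live
print(text_via_parser(commented))  # Live

# Case 2: the hyphen in data-id stops the opening tag from matching at all.
hyphenated = '<a data-id="1">link</a>'
print(TAG_RE.sub("", hyphenated))   # <a data-id="1">link
print(text_via_parser(hyphenated))  # link
```

The parser version is slightly longer, but it tracks comment and tag boundaries instead of guessing them from character patterns.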