// Literal — preferred for static patterns
const re1 = /hello/i;
// Constructor — for dynamic patterns
const word = 'hello';
const re2 = new RegExp(word, 'i');
// With flags
/abc/g; // global — find all matches
/abc/i; // case-insensitive
/abc/m; // multiline (^ and $ match line boundaries)
/abc/s; // dotAll (. matches newlines)
/abc/u; // unicode
// Dynamic pattern with special chars — must escape
const userInput = 'price: $10';
// Use an escape function before creating RegExp from user input
const dynamicRe = new RegExp(escapeRegex(userInput));
Regex literals are preferred for static patterns because they are validated at parse time and have better performance. Use the RegExp constructor only when the pattern needs to be built dynamically, and remember to double-escape backslashes in constructor strings since they go through string parsing first.
Why it matters: Choosing between regex literal and constructor is not just style — it affects parse-time validation and performance. Using the wrong one causes hard-to-find bugs when backslashes aren't properly escaped in dynamic patterns.
Real applications: Static validation patterns (email, phone) use regex literals; search features where the user types a search term use the RegExp constructor; URL routers that generate patterns from string templates use the constructor with escaping.
Common mistakes: Forgetting to double-escape backslashes in constructor strings (new RegExp('\d+') vs /\d+/), not escaping user input before passing to RegExp constructor (regex injection vulnerability), and creating new RegExp instances inside loops instead of caching the compiled pattern.
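The double-escaping point can be illustrated with a minimal sketch. The variable names below are illustrative only and do not come from the text above.

```javascript
// Literal: the backslash reaches the regex engine untouched.
const literal = /\d+/;

// Constructor with a single backslash: the string parser drops the
// backslash from '\d', so the pattern becomes "d+" (one or more letter d).
const wrong = new RegExp('\d+');

// Constructor with a doubled backslash: the string holds \d+, as intended.
const right = new RegExp('\\d+');

console.log(literal.source); // \d+
console.log(wrong.source);   // d+
console.log(right.source);   // \d+
console.log(wrong.test('123'), right.test('123')); // false true
```

Printing `.source` is a quick way to check what pattern the engine actually received.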
const re = /(\d{4})-(\d{2})/;
const str = '2024-03 and 2025-06';
re.test(str); // true
re.exec(str); // ["2024-03", "2024", "03", index: 0]
str.match(re); // ["2024-03", "2024", "03"] (first match)
str.match(/\d{4}-\d{2}/g); // ["2024-03", "2025-06"] (all matches)
// matchAll: iterator of all detailed matches
for (const m of str.matchAll(/(\d{4})-(\d{2})/g)) {
  console.log(m[1], m[2]); // "2024","03" then "2025","06"
}
// search() returns the index of the first match
str.search(/\d{4}/); // 0
str.search(/xyz/); // -1 (not found)
// split() with regex
'one, two, three'.split(/,\s*/); // ["one", "two", "three"]
Use test() when you only need a yes/no answer, exec() when you need detailed match info with groups, and matchAll() (ES2020) when you need all matches with full detail. Note that exec() with the g flag is stateful — it advances lastIndex on each call.
Why it matters: Using the wrong method leads to either incomplete results or subtle statefulness bugs. matchAll() returning an iterator instead of an array is a common source of confusion in modern code.
Real applications: Form validation (test()), extracting named capture groups from dates/URLs (exec()), finding all tag names in HTML strings (matchAll()), highlighting all occurrences in a text editor, and data extraction from structured log lines.
Common mistakes: Calling test() with a /g flag regex repeatedly and getting alternating true/false (lastIndex bug), not spreading matchAll() into an array before iterating twice, and using match() without the g flag when expecting all matches (only returns the first).
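As a small sketch of the iterator pitfall mentioned above (variable names are illustrative), the following shows that a `matchAll()` result can be consumed only once, and one way to keep the matches around:

```javascript
const text = '2024-03 and 2025-06';

// matchAll() returns a one-shot iterator, not an array.
const iterator = text.matchAll(/(\d{4})-(\d{2})/g);
const firstPass = [...iterator].map((m) => m[1]);
const secondPass = [...iterator].map((m) => m[1]); // iterator already exhausted

// Materialize once when the matches are needed more than once.
const matches = Array.from(text.matchAll(/(\d{4})-(\d{2})/g));
const years = matches.map((m) => m[1]);
const months = matches.map((m) => m[2]);

console.log(firstPass);  // ["2024", "2025"]
console.log(secondPass); // []
console.log(years, months); // ["2024", "2025"] ["03", "06"]
```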
// Practical email validation
const emailRe = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
emailRe.test('user@example.com'); // true
emailRe.test('a.b+tag@sub.co.uk'); // true
emailRe.test('missing@.com'); // false
emailRe.test('@no-local.com'); // false
// For production, prefer built-in validation
// <input type="email"> handles most cases
// Or use a well-tested library for strict validation
// Other common validations
const phoneRe = /^\+?[1-9]\d{1,14}$/; // E.164 format
const urlRe = /^https?:\/\/[^\s/$.?#].[^\s]*$/; // basic URL
const hexColorRe = /^#([0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})$/;
For production applications, rely on the HTML5 <input type="email"> validation or a well-tested validation library rather than writing your own regex. The true RFC-compliant email regex is extremely complex and impractical for most use cases.
Why it matters: Regex-based email validation is one of the most commonly attempted and commonly broken tasks in web development. Knowing when to use regex and when to delegate to the browser or a library is a mark of practical engineering judgment.
Real applications: Basic syntax validation for user-facing forms, extracting emails from unstructured text (log files, documents), filtering email lists, and building simple parsers for structured string formats like phone numbers and postal codes.
Common mistakes: Writing overly strict email regex that rejects valid addresses (e.g., user+tag@domain.co.uk), writing too-permissive patterns that accept invalid strings, validating format instead of deliverability (a regex can't tell if an email exists), and not anchoring with ^ and $ (matches a substring instead of the whole value).
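The anchoring point can be sketched by using one core pattern two ways. This is a simplified pattern under the same assumptions as the example above, not an RFC-compliant validator, and the names `core`, `validateRe`, and `extractRe` are illustrative.

```javascript
// Same core pattern, used two ways (simplified, not RFC-compliant).
const core = '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}';

const validateRe = new RegExp(`^${core}$`); // anchored: whole value must match
const extractRe = new RegExp(core, 'g');    // unanchored: find inside text

const input = 'contact: user@example.com (work)';

console.log(validateRe.test(input));              // false
console.log(validateRe.test('user@example.com')); // true
console.log(input.match(extractRe));              // ["user@example.com"]
```

Validation wants the anchored form. Extraction from free text wants the unanchored, global form.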
// Character classes
/\d/.test('9'); // true (digit)
/\w+/.test('hello'); // true (word characters)
/\s/.test(' '); // true (whitespace)
/[aeiou]/.test('e'); // true (vowel)
/[^0-9]/.test('a'); // true (NOT a digit)
// Quantifiers
/a{3}/.test('aaa'); // true (exactly 3)
/a{2,4}/.test('aaa'); // true (2 to 4)
/colou?r/.test('color'); // true (u is optional)
/\d+/.exec('abc123'); // ["123"] (one or more digits)
// Greedy vs Lazy quantifiers
'<b>bold</b>'.match(/<.*>/); // ["<b>bold</b>"] greedy
'<b>bold</b>'.match(/<.*?>/); // ["<b>"] lazy (minimal match)
By default, quantifiers are greedy — they match as much as possible. Adding ? after a quantifier makes it lazy (non-greedy), matching as little as possible. The negated character class [^...] matches any character NOT in the set, which is often more precise than lazy quantifiers.
Why it matters: Greedy vs lazy matching is one of the most common sources of regex bugs. Understanding the difference is essential for extracting data from HTML or structured text where you need to match the minimum possible content between delimiters.
Real applications: Extracting HTML tag content with <tag>(.*?)</tag> (lazy) instead of greedy, scraping structured text, matching JSON-like patterns, and building template engines that process {{...}} markers.
Common mistakes: Using greedy .* and accidentally consuming more than intended (matches across multiple tags), forgetting that ? after a quantifier makes it lazy — not optional, and using lazy quantifiers when a negated character class would be cleaner and more efficient.
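A short sketch comparing the three approaches described above on the same input (the sample string is made up for illustration):

```javascript
const html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the last closing tag in the string.
const greedy = html.match(/<b>.*<\/b>/)[0];

// Lazy: .*? stops at the first closing tag.
const lazy = html.match(/<b>(.*?)<\/b>/)[1];

// Negated class: [^<]* cannot cross a "<", so it never spans two tags.
const negated = html.match(/<b>([^<]*)<\/b>/)[1];

console.log(greedy);  // <b>one</b> and <b>two</b>
console.log(lazy);    // one
console.log(negated); // one
```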
// Capturing groups
const match = /(\w+)@(\w+)\.(\w+)/.exec('user@site.com');
// match[1] = "user", match[2] = "site", match[3] = "com"
// Non-capturing group: groups but doesn't capture
const urlMatch = /(?:https?):\/\/(\w+)/.exec('https://example');
// urlMatch[1] = "example" (only one capture)
// Back-reference
/(\w+) \1/.test('hello hello'); // true (\1 = first captured group)
// Alternation in group
/(?:cat|dog)s/.test('cats'); // true
/(?:cat|dog)s/.test('dogs'); // true
// Practical: extract parts of a URL
const urlRe = /^(https?):\/\/([^\/]+)(\/.*)?$/;
const parts = urlRe.exec('https://example.com/path');
// parts[1] = "https", parts[2] = "example.com", parts[3] = "/path"
Use non-capturing groups when you only need grouping for logical structure (like alternation) but do not need to reference the matched text. This keeps the match result array cleaner and slightly improves regex engine performance in complex patterns.
Why it matters: Understanding capturing vs non-capturing groups is essential once you build multi-group patterns. Non-capturing groups keep match arrays from filling with unwanted captures, making the results easier to process.
Real applications: Grouping alternation without polluting capture groups ((?:jpg|png|gif)), building complex validation patterns with logical groupings, and writing named-capture-group patterns where unnamed captures would interfere with group numbering.
Common mistakes: Using capturing groups when non-capturing would suffice (bloats the match array), confusing (?:...) (non-capturing) with (?=...) (lookahead), and not knowing that non-capturing groups still apply quantifiers — (?:ab)+ matches "ababab".
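To make the difference in the match array concrete, here is a minimal sketch. The file name is an arbitrary example.

```javascript
const file = 'photo.png';

// Both patterns accept the same strings; only the captures differ.
const capturing = /^(\w+)\.(jpg|png|gif)$/.exec(file);
const nonCapturing = /^(\w+)\.(?:jpg|png|gif)$/.exec(file);

console.log(capturing.length);    // 3 (full match + 2 groups)
console.log(nonCapturing.length); // 2 (full match + 1 group)
console.log(nonCapturing[1]);     // photo

// Quantifiers still apply to a non-capturing group as a unit.
console.log(/^(?:ab)+$/.test('ababab')); // true
```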
// Positive lookahead — followed by
'100px'.match(/\d+(?=px)/); // ["100"]
// Negative lookahead — NOT followed by
'100em'.match(/\d+(?!px)/); // ["100"]
// Positive lookbehind — preceded by
'$50'.match(/(?<=\$)\d+/); // ["50"]
// Negative lookbehind — NOT preceded by
'€50'.match(/(?<!\$)\d+/); // ["50"]
// Password: at least one digit and one uppercase
/(?=.*\d)(?=.*[A-Z]).{8,}/.test('Pass1234'); // true
// Add commas to numbers using lookahead
'1234567'.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
// "1,234,567"
Lookaheads and lookbehinds are zero-width — they assert a condition at a position but do not consume characters or advance the match position. This makes them perfect for password validation (multiple conditions at the same position) and number formatting. Lookbehind support was added in ES2018.
Why it matters: Lookaheads/lookbehinds enable "contextual matching" — matching a pattern only when it's followed by or preceded by something specific. This is used in countless production validation patterns and text processing tasks.
Real applications: Password strength validation (must contain uppercase, digit, special char), number formatting with thousands separators, context-sensitive replacements (only replace a word when not preceded by another specific word), and parsing tokens with context constraints.
Common mistakes: Using a lookbehind in environments that don't support ES2018 (Safari < 16.4), confusing positive lookahead (?=...) with non-capturing group (?:...), and forgetting that lookaheads and lookbehinds don't consume characters — the main pattern still needs to match the desired text.
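The zero-width behaviour has one consequence that is easy to miss: a negative lookahead can be satisfied by backtracking to a shorter match. The sketch below shows this, assuming the goal is "a number not followed by px". The `notPx` name is illustrative.

```javascript
// A lookahead on its own matches an empty string: it only checks a position.
const onlyAssert = 'abc1'.match(/(?=.*\d)/);
console.log(onlyAssert[0] === '', onlyAssert.index); // true 0

// Gotcha: the engine backtracks until the negative lookahead passes.
console.log('100px'.match(/\d+(?!px)/)[0]); // 10

// Also forbidding a following digit stops the partial match.
const notPx = /\d+(?!\d|px)/;
console.log(notPx.test('100px'));     // false
console.log('100em'.match(notPx)[0]); // 100
```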
// Simple replace
'hello world'.replace(/world/, 'JS'); // "hello JS"
// Global replace
'aabba'.replace(/a/g, 'x'); // "xxbbx"
// Using capture groups
'2024-03-15'.replace(/(\d{4})-(\d{2})-(\d{2})/, '$2/$3/$1');
// "03/15/2024"
// Callback function
'hello'.replace(/./g, (char, i) => {
return i % 2 === 0 ? char.toUpperCase() : char;
}); // "HeLlO"
// replaceAll (ES2021) — no g flag needed
'aabba'.replaceAll('a', 'x'); // "xxbbx"
// Named group references in replacement
'2024-03-15'.replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  '$<month>/$<day>/$<year>'
); // "03/15/2024"
The callback function receives the full match, each captured group, the match offset, and the original string as parameters. Use replaceAll() (ES2021) as a cleaner alternative to replace() with the g flag for simple string replacements.
Why it matters: The replace callback unlocks powerful transformation capabilities — not just substitution but computation on matched text. It's how template engines, code formatters, and text sanitizers work internally.
Real applications: Converting camelCase to kebab-case, escaping HTML entities in user input, transforming markdown syntax to HTML, replacing placeholders in templates with computed values, and normalizing date/phone formats across multiple formats.
Common mistakes: Using replace() without the /g flag and only replacing the first match, returning undefined from the replace callback (inserts "undefined" as the replacement string), calling replaceAll() with a regex that lacks the g flag (it throws a TypeError; a plain string pattern needs no flag), and referencing capture groups in the callback that the pattern never defines.
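A minimal sketch of the callback form, using a hypothetical `{{name}}` placeholder syntax and made-up data:

```javascript
// Hypothetical template and data, for illustration only.
const template = 'Hello {{name}}, you have {{count}} new messages.';
const data = { name: 'Ada', count: 3 };

// Callback arguments: full match, then each capture group, then offset, then input.
const rendered = template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
  // Always return a string; returning undefined would insert "undefined".
  return key in data ? String(data[key]) : match;
});

console.log(rendered); // Hello Ada, you have 3 new messages.

// replaceAll() accepts a regex only if it has the g flag.
let threw = false;
try {
  'aabba'.replaceAll(/a/, 'x');
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true
```

Leaving unknown placeholders untouched (returning `match`) is one possible design choice. Throwing or substituting an empty string are equally valid depending on the use case.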
// g — find all matches, not just the first
'abab'.match(/a/g); // ["a", "a"]
// i — case-insensitive
/hello/i.test('Hello'); // true
// m — ^ and $ match line boundaries
'line1\nline2'.match(/^line/gm); // ["line", "line"]
// s — dot matches newline
/a.b/s.test('a\nb'); // true (without s: false)
// u — proper unicode support
/\u{1F600}/u.test('😀'); // true
// d — match indices
/a(b)/.exec('ab'); // no indices
/a(b)/d.exec('ab'); // includes .indices property
// Combining flags
/pattern/gims; // global, case-insensitive, multiline, dotAll
The u (unicode) flag enables proper handling of Unicode characters beyond the Basic Multilingual Plane (like emoji). The d (hasIndices) flag adds start and end positions for each captured group in the match result. Always use the u flag when working with international text.
Why it matters: Regex without the u flag operates on UTF-16 code units, causing incorrect behavior with characters outside the BMP, such as emoji and the rarer CJK ideographs in the supplementary planes. This is a common internationalization bug.
Real applications: Validating usernames that allow international characters, processing multilingual text, matching emoji in social media content moderation, and building international character-aware word boundaries for text search.
Common mistakes: Not using the /u flag with Unicode patterns (causes emoji to count as 2 characters), writing a regex without the s flag and expecting . to match newlines (it doesn't by default; /s enables dotAll mode), and combining the mutually exclusive u and v flags (the ES2024 v flag supersedes u).
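The following sketch shows what the d and u flags change in practice, assuming an engine that supports the d flag (ES2022). The input strings are arbitrary.

```javascript
// d flag: each match carries [start, end) offsets for the match and its groups.
const m = /a(b)/d.exec('xab');
console.log(m.indices[0]); // [1, 3]
console.log(m.indices[1]); // [2, 3]

// u flag: "." consumes a whole code point instead of one UTF-16 code unit.
console.log('😀'.match(/./g).length);  // 2
console.log('😀'.match(/./gu).length); // 1

// u and v cannot be combined; the constructor rejects the flag string.
let conflict = false;
try {
  new RegExp('a', 'uv');
} catch (err) {
  conflict = err instanceof SyntaxError;
}
console.log(conflict); // true
```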
const dateRe = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = dateRe.exec('2024-03-15');
console.log(match.groups.year); // "2024"
console.log(match.groups.month); // "03"
console.log(match.groups.day); // "15"
// Named groups in replace
'2024-03-15'.replace(dateRe, '$<day>/$<month>/$<year>');
// "15/03/2024"
// Destructuring named groups
const { groups: { year, month, day } } = dateRe.exec('2024-03-15');
console.log(year, month, day); // "2024" "03" "15"
// Named back-reference
/(?<word>\w+) \k<word>/.test('hello hello'); // true
// \k<word> references the named group
Named groups are especially useful for complex patterns where numbered references like $1 and $2 become confusing. They make the regex self-documenting and the code that processes matches easier to understand. Named groups can be combined with destructuring for clean variable extraction.
Why it matters: Named capturing groups transform regex results from cryptic arrays to self-documenting objects. When your pattern has more than 2-3 groups, named groups are essential for maintainability — especially when patterns change and group numbers shift.
Real applications: Parsing date/time strings (year, month, day), extracting URL components (protocol, host, path, query), parsing log line formats, and building template substitution engines with named placeholder replacement.
Common mistakes: Using positional groups (e.g., $1, $2) in complex patterns (breaks when groups are added/reordered), not knowing you can destructure .groups directly, and forgetting that named back-references use \k<name> syntax within the pattern itself.
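As a sketch of named groups applied to log parsing, assuming a hypothetical line format of date, level, and message (the format and the `parseLogLine` helper are not from the text above):

```javascript
// Hypothetical log format: "<date> <LEVEL> <message>"
const logRe = /^(?<date>\d{4}-\d{2}-\d{2}) (?<level>[A-Z]+) (?<message>.*)$/;

function parseLogLine(line) {
  const result = logRe.exec(line);
  // exec() returns null on no match, so guard before destructuring.
  if (!result) return null;
  const { date, level, message } = result.groups;
  return { date, level, message };
}

console.log(parseLogLine('2024-03-15 ERROR Disk full'));
// { date: '2024-03-15', level: 'ERROR', message: 'Disk full' }
console.log(parseLogLine('not a log line')); // null
```

The null guard matters: destructuring `.groups` directly from `exec()` throws a TypeError when the input does not match.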
// Split on multiple delimiters
'one, two; three four'.split(/[,;\s]+/);
// ["one", "two", "three", "four"]
// Split and keep the delimiter (capturing group)
'hello123world456end'.split(/(\d+)/);
// ["hello", "123", "world", "456", "end"]
// Tokenize simple expressions
'3 + 5 * 2 - 1'.split(/\s*([+\-*/])\s*/);
// ["3", "+", "5", "*", "2", "-", "1"]
// Parse CSV line (handles quoted values)
const csvLine = 'name,"city, state",age';
const csvRe = /,(?=(?:[^"]*"[^"]*")*[^"]*$)/;
csvLine.split(csvRe); // ["name", '"city, state"', "age"]
// Limit splits
'a-b-c-d'.split(/-/, 2); // ["a", "b"]
When the regex contains a capturing group, the captured delimiters are included in the result array. This is useful when you need to preserve the separators, like when tokenizing mathematical expressions. Use the second argument of split() to limit the number of result pieces.
Why it matters: Regex-based splitting handles complex real-world data that can't be separated by a simple fixed string. CSV parsing, log tokenization, and expression parsing all require this capability.
Real applications: Parsing CSV with quoted fields, tokenizing arithmetic expressions for evaluation, splitting markdown text while preserving code blocks, processing multi-delimiter data exports (tab/comma/semicolon), and building simple lexers for custom configuration languages.
Common mistakes: Using split(/./) instead of split(/\./) or split('.') when splitting on a literal dot (an unescaped dot in a regex matches any character), not accounting for capturing groups including delimiters in the result array, and using an unlimited split when only the first N parts are needed.
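A brief sketch of the dot pitfall and of how the limit interacts with captured delimiters (the sample strings are arbitrary):

```javascript
const version = '1.20.3';

// Unescaped dot in a regex: every character is a delimiter,
// so only empty strings remain between them.
console.log(version.split(/./));  // ["", "", "", "", "", "", ""]

// Escaped dot, or a plain string separator, splits on the literal ".".
console.log(version.split(/\./)); // ["1", "20", "3"]
console.log(version.split('.'));  // ["1", "20", "3"]

// Captured delimiters count toward the limit.
console.log('a-b-c'.split(/(-)/, 3)); // ["a", "-", "b"]
```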
const re = /\d+/g;
// exec advances lastIndex on each call
re.exec('abc 123 def 456'); // ["123"], lastIndex = 7
re.exec('abc 123 def 456'); // ["456"], lastIndex = 15
re.exec('abc 123 def 456'); // null, lastIndex = 0
// Common bug: reusing regex with g flag
const re2 = /hello/g;
re2.test('hello world'); // true, lastIndex = 5
re2.test('hello world'); // false! starts from index 5
// Fix: reset lastIndex
re2.lastIndex = 0;
re2.test('hello world'); // true again
// Sticky flag (y) — must match at exactly lastIndex
const sticky = /\d+/y;
sticky.lastIndex = 4;
sticky.exec('abc 123'); // ["123"] — match at index 4
sticky.exec('abc 123'); // null — no match at index 7
The y (sticky) flag is stricter than g — it requires the match to occur exactly at lastIndex, not just anywhere after it. Always reset lastIndex to 0 before reusing a global regex, or create a new regex instance each time. Using matchAll() or match() avoids this issue entirely.
Why it matters: The stateful lastIndex bug with reused global regex is one of the most surprising JavaScript gotchas. It causes tests to pass and fail alternately with no apparent reason, and is notoriously hard to debug.
Real applications: Any code that stores regex patterns as module-level constants and reuses them with .test() or .exec() is vulnerable. This affects validation modules, search utilities, and any performance-optimized code that avoids creating new regex instances.
Common mistakes: Declaring const re = /pattern/g at module scope and calling re.test() in a loop (alternating results), not knowing the sticky /y flag has the same issue, and forgetting to reset lastIndex between test calls in unit tests that reuse the same regex instance.
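The alternating-result bug and two possible fixes can be sketched as follows (the variable names are illustrative):

```javascript
const inputs = ['hello', 'hello', 'hello'];

// Shared global regex: lastIndex carries over between calls.
const shared = /hello/g;
const buggy = inputs.map((s) => shared.test(s));
console.log(buggy); // [true, false, true]

// Fix 1: drop the g flag when only a yes/no answer is needed.
const stateless = /hello/;
console.log(inputs.map((s) => stateless.test(s))); // [true, true, true]

// Fix 2: keep g but reset lastIndex before each test.
const fixed = inputs.map((s) => {
  shared.lastIndex = 0;
  return shared.test(s);
});
console.log(fixed); // [true, true, true]
```

Dropping the flag is usually the simpler option, since `test()` never needs g to answer whether a match exists.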
// Special chars need escaping to match literally
/1\.5/.test('1.5'); // true (escaped dot)
/1.5/.test('1X5'); // true (unescaped dot matches any char)
/\$10/.test('$10'); // true (escaped dollar sign)
/\(hello\)/.test('(hello)'); // true
// Escape function for dynamic patterns
function escapeRegex(str) {
  // Prefixes each special regex char with a backslash ($& is the matched char)
  const specials = /[.\*+?^$|(){}[\]\\-]/g;
  return str.replace(specials, '\\$&');
}
// Usage with user input
const userSearch = 'price: $10.00 (USD)';
const escaped = escapeRegex(userSearch);
const re = new RegExp(escaped);
re.test('The price: $10.00 (USD) is final'); // true
// Without escaping, special chars cause errors or wrong matches
// new RegExp('$10.00'); // never matches "$10.00": $ asserts end of input and . matches any char
Always use an escape function when building regex patterns from user input or dynamic strings. Without escaping, characters like . will match any character, $ will match end-of-string, and unmatched parentheses will throw a SyntaxError.
Why it matters: Building regex from user input without escaping allows regex injection: the input can change what the pattern matches, and a crafted pattern can trigger ReDoS (Regular Expression Denial of Service). ReDoS is an OWASP-recognized attack vector for Node.js servers.
Real applications: Search-as-you-type features that highlight matches in text, building dynamic find/replace tools, constructing regex from config files or database values, and any user-facing feature that translates user text into a search pattern.
Common mistakes: Forgetting to escape user input before new RegExp(userInput) (security vulnerability + potential crash), assuming an escape helper is built in everywhere (String.prototype.escapeRegex doesn't exist, and RegExp.escape() is only available in recent engines, so implement or import one otherwise), and escaping the string but forgetting the g flag when replacing all occurrences.
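A minimal sketch of a search-highlighting helper built on an escape function. The `highlight` helper and the bracket markers are hypothetical. The escape function here follows the widely used MDN-style pattern rather than the exact character class shown earlier.

```javascript
// Escape helper: prefix every regex metacharacter with a backslash.
function escapeRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap each literal occurrence of `term` in [brackets].
function highlight(text, term) {
  if (term === '') return text; // an empty pattern would match everywhere
  const re = new RegExp(escapeRegex(term), 'gi');
  return text.replace(re, (hit) => `[${hit}]`);
}

console.log(highlight('Total: $10.00 (USD), not $10X00', '$10.00'));
// Total: [$10.00] (USD), not $10X00
console.log(escapeRegex('a.b*c')); // a\.b\*c
```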
// Password validation — multiple conditions
const passwordRe = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%]).{8,}$/;
passwordRe.test('Str0ng!Pass'); // true
passwordRe.test('weakpass'); // false
// Phone number (international format)
const phoneRe = /^\+?[1-9]\d{1,14}$/;
phoneRe.test('+919876543210'); // true
// Date format (YYYY-MM-DD)
const dateRe = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/;
dateRe.test('2024-03-15'); // true
dateRe.test('2024-13-01'); // false
// IP address
const ipRe = /^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$/;
ipRe.test('192.168.1.1'); // true
ipRe.test('256.1.1.1'); // false
// Username (alphanumeric, 3-16 chars)
const usernameRe = /^[a-zA-Z0-9_]{3,16}$/;
Always use ^ (start) and $ (end) anchors for validation to ensure the entire string matches the pattern, not just a substring. For complex validations, consider combining simple regex checks with JavaScript logic for better readability and maintainability.
Why it matters: Without anchors, a validation regex will happily pass strings that contain the pattern anywhere — a critical security flaw. /\d{5}/ passes "abc12345xyz" as a valid ZIP code. Anchors are the most overlooked regex best practice.
Real applications: ZIP/postal code validation, password strength checking, username format enforcement, credit card number format validation, and any input field with strict format requirements.
Common mistakes: Forgetting ^/$ anchors (validates substring instead of full input), using one large complex regex instead of separate readable checks (hard to maintain and explain errors to users), and not considering edge cases like empty strings, international characters, or whitespace at start/end.
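One way to split a large validation regex into separate readable checks is sketched below. The rule list and the `passwordProblems` helper are illustrative and mirror the password rules from the example above.

```javascript
// Each rule is a small, readable check with its own error message.
const passwordRules = [
  { test: (s) => s.length >= 8, message: 'at least 8 characters' },
  { test: (s) => /[a-z]/.test(s), message: 'a lowercase letter' },
  { test: (s) => /[A-Z]/.test(s), message: 'an uppercase letter' },
  { test: (s) => /\d/.test(s), message: 'a digit' },
  { test: (s) => /[!@#$%]/.test(s), message: 'one of ! @ # $ %' },
];

// Returns the messages for every rule the password fails.
function passwordProblems(password) {
  return passwordRules
    .filter((rule) => !rule.test(password))
    .map((rule) => rule.message);
}

console.log(passwordProblems('Str0ng!Pass')); // []
console.log(passwordProblems('weakpass'));
// ["an uppercase letter", "a digit", "one of ! @ # $ %"]
```

Compared with a single lookahead-heavy pattern, this form can tell the user exactly which requirement is missing.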
// Catastrophic backtracking example
// This regex is very slow on non-matching input:
// /^(a+)+$/.test('aaaaaaaaaaaaaaaaX')
// The engine tries every possible way to split the a's
// Safe alternative — avoid nested quantifiers
/^a+$/.test('aaaaaaaaaaaaaaaaX'); // fast — single quantifier
// Other problematic patterns:
// /^(a|aa)+$/ on "aaaaaaaX"
// /^([a-zA-Z]+)*$/ on "aaaaaaaX"
// Prevention strategies:
// 1. Avoid nested quantifiers: (a+)+ → a+
// 2. Use specific character classes: <.*> → <[^>]*>
// 3. Make patterns more specific
// 4. Use possessive quantifiers in other languages: a++ (not supported in JavaScript)
// Atomic group simulation with lookahead
// (?=(pattern))\1 — captures in lookahead, then matches
/(?=(a+))\1b/.test('aaab'); // works like atomic group
Catastrophic backtracking most commonly occurs with nested quantifiers like (a+)+ or alternation combined with quantifiers. To avoid it, prefer specific character classes over ., avoid nesting quantifiers unnecessarily, and test your regex with both matching and non-matching inputs to verify performance.
Why it matters: ReDoS (Regex Denial of Service) is a real OWASP-listed vulnerability. A single malicious input string can block a Node.js server's event loop for minutes or longer if catastrophic backtracking is triggered. Backtracking has also caused major outages without any attacker: the July 2019 Cloudflare outage was traced to a WAF rule whose regex backtracked catastrophically and exhausted CPU.
Real applications: Any web server that validates user input against regex (URL validation, email validation, markdown parsing) is a potential target. Security-reviewed production systems use tools like safe-regex or vuln-regex-detector to audit patterns.
Common mistakes: Writing email validation regex with exponential backtracking potential, not testing regex performance against adversarial inputs (e.g., 50 'a' characters followed by a non-matching character), and deploying patterns with nested quantifiers like (a+)+ in user-facing endpoints.
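The sketch below checks that the flattened pattern accepts the same inputs as the nested one, and adds a length cap as a cheap extra guard. `MAX_LEN` and `safeTest` are hypothetical names with an assumed limit. The sample inputs are kept short on purpose so the nested pattern stays fast.

```javascript
// Hypothetical guard: cap input length before running any regex on it.
const MAX_LEN = 256; // assumed limit; tune per field

function safeTest(re, input) {
  if (input.length > MAX_LEN) return false; // reject oversized input early
  return re.test(input);
}

// The nested form /^(a+)+$/ accepts the same strings as /^a+$/,
// but the flat form cannot blow up on a failing input.
const flat = /^a+$/;
const nested = /^(a+)+$/;
const samples = ['a', 'aaa', '', 'aaX', 'aaaaaaaaaaX'];

console.log(samples.map((s) => flat.test(s)));
// [true, true, false, false, false]
console.log(samples.every((s) => flat.test(s) === nested.test(s))); // true
console.log(safeTest(flat, 'a'.repeat(1000))); // false
```

A length cap does not make a bad pattern safe, but it bounds the damage while patterns are being audited.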
// Without u flag — broken Unicode handling
/^.$/.test('😀'); // false (emoji is 2 code units)
/^..$/.test('😀'); // true (treated as 2 chars)
// With u flag — correct Unicode handling
/^.$/u.test('😀'); // true (emoji is 1 character)
// Unicode property escapes (requires u flag)
/\p{Letter}/u.test('ñ'); // true (any letter)
/\p{Number}/u.test('①'); // true (any number)
/\p{Emoji}/u.test('🎉'); // true
/\p{Script=Han}/u.test('中'); // true (Chinese characters)
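// Unicode-aware words: \w and \b stay ASCII-only even with the u flag,
// so match word characters with property escapes instead
/[\p{L}\p{N}_]+/gu; // runs of letters, digits and underscore in any script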
// v flag (ES2024) — extends unicode support
/[\p{Letter}&&[^\p{Script=Latin}]]/v; // set intersection
/[\p{Emoji}--[😀]]/v; // set subtraction
Always use the u flag when working with international text or emoji. The v (unicodeSets) flag (ES2024) adds set operations like intersection (&&) and subtraction (--) within character classes, enabling more precise Unicode matching patterns.
Why it matters: Unicode-aware regex is a prerequisite for any application serving international users. Without /u, emoji break string length calculations and character class matching produces wrong results for non-Latin scripts.
Real applications: Username validation that allows international characters, emoji filtering in content moderation, searching across multilingual content, building regex for non-Latin text processing (Arabic, CJK, Cyrillic), and emoji-aware string splitting.
Common mistakes: Not using /u when matching characters above U+FFFF (emoji are matched as two code units without it), using . to match any character including newlines without the /s flag, and confusing the ES2024 /v flag (unicodeSets — for advanced Unicode) with the older /u flag.
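As a sketch of Unicode-aware username validation, assuming a hypothetical rule of 3 to 16 letters, marks, digits, or underscores from any script:

```javascript
// Letters, marks, digits and underscore from any script, 3 to 16 code points.
const usernameRe = /^[\p{L}\p{M}\p{N}_]{3,16}$/u;

console.log(usernameRe.test('josé_99'));  // true
console.log(usernameRe.test('李小龍'));    // true
console.log(usernameRe.test('ab'));       // false (too short)
console.log(usernameRe.test('bad name')); // false (space not allowed)

// \w stays ASCII-only even with the u flag.
console.log(/^\w+$/u.test('josé')); // false

// Counting code points instead of UTF-16 code units.
console.log('😀'.length);      // 2
console.log([...'😀'].length); // 1
```

Including `\p{M}` lets decomposed accents (a base letter plus a combining mark) pass as well as precomposed ones.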
// Extract all URLs from text
const urlRe = /https?:\/\/[^\s<>]+/g;
text.match(urlRe);
// Remove HTML tags
const stripTags = /<[^>]*>/g;
'<p>Hello <b>World</b></p>'.replace(stripTags, '');
// "Hello World"
// Trim whitespace (beyond String.trim)
str.replace(/^\s+|\s+$/g, ''); // trim both ends
str.replace(/\s+/g, ' '); // collapse whitespace
// Extract hashtags
'#hello world #coding'.match(/#\w+/g); // ["#hello", "#coding"]
// camelCase to kebab-case
'camelCaseText'.replace(/([A-Z])/g, '-$1').toLowerCase();
// "camel-case-text"
// Mask sensitive data
'Card: 4532-1234-5678-9012'.replace(/(\d{4}-){3}/, '****-****-****-');
// "Card: ****-****-****-9012"
// Validate hex color
/^#([0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})$/.test('#FF00FF'); // true
When stripping HTML tags, note that regex is not a proper HTML parser: for complex HTML manipulation, use DOMParser or a library. Regex is adequate for stripping simple markup from trusted content, but for untrusted input or security-critical validation, always prefer a dedicated sanitizer or validation library.
Why it matters: Regex-based HTML sanitization is a well-known source of XSS vulnerabilities — clever attackers craft inputs that bypass naive regex sanitizers. Knowing regex limitations prevents dangerous over-reliance on it for security-sensitive processing.
Real applications: Stripping basic HTML from CMS-generated content before displaying in email, extracting plain text from HTML snippets, building simple markdown-to-HTML converters, and pre-processing log data that may contain HTML characters.
Common mistakes: Using regex to sanitize HTML for XSS prevention (use DOMPurify instead), writing overly-complex regex patterns where a dedicated parser (cheerio, DOMParser) would be safer and more maintainable, and treating "it matches in tests" as proof of correctness for edge-case-heavy HTML content.
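A small sketch of where the tag-stripping pattern breaks down. The sample markup is made up for illustration.

```javascript
const stripTags = /<[^>]*>/g;

// Works on simple, well-formed markup.
const simple = '<p>Hello <b>World</b></p>'.replace(stripTags, '');
console.log(simple); // Hello World

// Breaks when an attribute value contains ">": the tag is cut short
// and the rest of the attribute leaks into the output.
const tricky = '<img alt="a > b" src="x.png">ok';
const leaked = tricky.replace(stripTags, '');
console.log(leaked); // ' b" src="x.png">ok'
```

This is the kind of edge case a real parser handles and a single regex does not.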