regex to match HTML <p ...> tag starting with a lowercase letter - Stack Overflow

I am editing an epub file in Sigil and would like to match HTML <p ...> tags when the first character after the tag's closing > is lowercase. I saw some answers on this site that match p tags with attributes, but none that also match a p tag without attributes. I don't know how regular expressions work, so I'm trying to figure out what I need to change to match both.

Examples:

<p class="calibre1">All the while I was...</p>
<p>All the while I was...</p>
<p class="calibre1">all the while I was...</p>
<p>all the while I was...</p>

The regex should match the last 2 tags in the example above.

The code that I have (/<\/?([^p](\s.+?)?|..+?)>[a-z]/) matches only the 3rd, not the 4th tag.

Important: Sigil has no HTML parser, so I have to stick to using the simple search engine which accepts regular expressions.

Asked Nov 15, 2024 at 19:38 by Michael; edited Nov 18, 2024 at 8:20 by Patrick Janser. 5 comments:
  • 3 Don't use regular expressions to process HTML, use an HTML parser. – Barmar Commented Nov 15, 2024 at 20:12
  • 2 It's a Thing That Should Not Be. – zer00ne Commented Nov 15, 2024 at 20:21
  • 2 Don't try to run a regex directly on the HTML, but use the DOM to first get the paragraphs with document.querySelectorAll('p'). Then on each of them, look at the innerText property and test it against /^\p{Ll}/u, to see if it starts with a lowercase letter in any language (using the Unicode flag). The reasons are multiple: A. HTML entities: &eacute; is é and is a lowercase letter. B. Your paragraph could start with an inner tag like <p><strong>shit</strong> starts with a lowercase...</p>. C. Spaces, tabs, new lines, HTML comments before the first letter. – Patrick Janser Commented Nov 15, 2024 at 23:49
  • Sorry, I forgot to mention that I'm editing an epub file using Sigil, which has only regex and no HTML parser. – Michael Commented Nov 16, 2024 at 9:18
  • In Sigil search for <p([^>]*)>((?:\s*|<!--.*?-->|<\s*\w+[^>]*>)*)(\p{Ll}) with the regex options "Dot All" and "Unicode Property" and replace by <p\1>\2\U\3\E to directly convert the lowercase letter to its uppercase version. \U will uppercase the capturing group n°3, which is the lowercase letter. \E stops the uppercase modifier. – Patrick Janser Commented Nov 18, 2024 at 11:46
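
The search-and-replace from the last comment can be tried outside Sigil with a rough Python sketch. This is an adaptation, not the exact Sigil expression: Python's `re` module has no `\p{Ll}` and no `\U...\E` in replacement strings, so group 3 captures any letter and a callback uppercases it only if `str.islower()` is true. The inner `\s*` is also changed to `\s+` so the repeated group never matches an empty string. The names `PATTERN` and `capitalize_paragraph_starts` are made up for this illustration.

```python
import re

# Adapted from the comment's pattern. Assumptions: [^\W\d_] stands in for
# \p{Ll} (it matches any letter, so case is checked in the callback), and
# re.DOTALL plays the role of Sigil's "Dot All" option.
PATTERN = re.compile(
    r"<p([^>]*)>"                           # group 1: attributes of the <p> tag
    r"((?:\s+|<!--.*?-->|<\s*\w+[^>]*>)*)"  # group 2: whitespace, comments, inner opening tags
    r"([^\W\d_])",                          # group 3: first letter of the paragraph
    re.DOTALL,
)


def capitalize_paragraph_starts(html: str) -> str:
    """Uppercase the first letter of each paragraph if it is lowercase."""

    def fix(m: re.Match) -> str:
        attrs, prefix, letter = m.groups()
        if letter.islower():
            letter = letter.upper()  # equivalent of \U\3\E in the Sigil replacement
        return f"<p{attrs}>{prefix}{letter}"

    return PATTERN.sub(fix, html)


print(capitalize_paragraph_starts("<p>all the while I was...</p>"))
# <p>All the while I was...</p>
print(capitalize_paragraph_starts('<p class="calibre1"><strong>all</strong> the while</p>'))
# <p class="calibre1"><strong>All</strong> the while</p>
```

Paragraphs that already start with an uppercase letter pass through unchanged. Like the original, this does not decode HTML entities such as `&eacute;`, so a paragraph starting with an entity is left alone.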

1 Answer

The following regex looks like a good starting place:

  • <p[^>]*?>[a-z]

From there I'm not sure what you want to capture, but it will match both kinds of tag. And yes, in general you should use an HTML parser for this, but for something this simple I don't see why a regex is a problem, provided you know the input. It won't work on arbitrary HTML.
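
As a quick check outside Sigil, the pattern can be run against the four examples from the question. This is only a sketch in Python, on the assumption that Sigil's regex engine treats such a simple pattern the same way. `samples`, `P_LOWER`, and `P_STRICT` are names made up for the illustration. The second pattern is a suggested tightening, not part of the answer above: as written, `<p[^>]*?>` also accepts other tags whose names begin with "p", such as `<pre>`.

```python
import re

# The answer's pattern: "<p", optional attributes, ">", one ASCII lowercase letter.
P_LOWER = re.compile(r"<p[^>]*?>[a-z]")

# Hypothetical stricter variant: attributes must be introduced by whitespace,
# so tag names that merely start with "p" (e.g. <pre>) are rejected.
P_STRICT = re.compile(r"<p(?:\s[^>]*)?>[a-z]")

samples = [
    '<p class="calibre1">All the while I was...</p>',
    '<p>All the while I was...</p>',
    '<p class="calibre1">all the while I was...</p>',
    '<p>all the while I was...</p>',
]

matches = [bool(P_LOWER.search(s)) for s in samples]
strict_matches = [bool(P_STRICT.search(s)) for s in samples]

print(matches)         # [False, False, True, True]
print(strict_matches)  # [False, False, True, True]

# Where the two patterns differ:
print(bool(P_LOWER.search("<pre>code</pre>")))   # True
print(bool(P_STRICT.search("<pre>code</pre>")))  # False
```

Both patterns match only the last two examples, as the question asks. `[a-z]` covers ASCII letters only, so a paragraph starting with a letter like "é" would need a Unicode-aware class instead.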
