Regex Isn't Hard

CharlesW · on July 14, 2023

A less troll-y title rewrite: "Regex Isn't Hard As Long As You're The Only One Writing It And You Barely Use Any Of It".

I think the actual advice is fine, BTW. Tools like regex101.com and now LLMs are super-handy for reading and writing regex as well.

SOLAR_FIELDS · on July 14, 2023

It’s been one of my favorite uses of ChatGPT over the past couple of weeks. Several months ago, my favorite fast, ad-free version of Reddit (i.reddit.com) was removed. I reluctantly switched to Apollo, which, as many know, went away at the end of last month. Some commenter on HN helpfully pointed out that whoever killed i.reddit.com didn’t kill individual pages, just the entry point. In other words, if you append /.i to a Reddit URL, you still get this nice, fast, ad-free mobile website. But when you click through to links it doesn’t work; they take you to the garbage new mobile site.

So with the help of an iOS app called “Redirect Web”, and a clever regex written mostly by ChatGPT, I now have my nice i.reddit.com experience back.
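
The actual regex isn't posted here, so the following is only a rough sketch of the kind of rewrite described, in Python rather than in the app's own rule syntax. The pattern, the name `to_compact`, and the choice to put the suffix at the end of the path (before any query string) are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: group 1 is the Reddit URL up to the end of its path,
# followed by an optional trailing "/" or an existing "/.i", then group 2
# holding any query string or fragment.
REDDIT_URL = re.compile(
    r"^(https?://(?:www\.)?reddit\.com(?:/[^?#]*?)??)(?:/\.i|/)?([?#].*)?$"
)

def to_compact(url: str) -> str:
    """Append "/.i" to the path of a Reddit URL; leave other URLs untouched."""
    return REDDIT_URL.sub(
        lambda m: m.group(1) + "/.i" + (m.group(2) or ""), url
    )

print(to_compact("https://www.reddit.com/r/python/"))
print(to_compact("https://www.reddit.com/r/python/.i"))
print(to_compact("https://example.com/r/python/"))
```

The rewrite is idempotent: a URL that already ends in /.i comes back unchanged, so a redirect rule built this way should not loop on its own output.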

fargle · on July 14, 2023

not less troll-y. that's far more troll-y.

i don't particularly like the way the article is presented on technical grounds. it should stick to the real basics first: . and *

but no, regex is an extremely simple language and a basic underpinning of computer science. if it's "too hard", you've either been taught wrong or have no business writing code of any other sort.

agree on props to regex101 and other similar sites.

witrak · on July 14, 2023

Unfortunately, the article's suggestions for spelling out regex elements as explicit character lists are completely wrong once you consider non-ASCII text. So unless you are matching program text whose string literals and comments are in pure English, using regex this way isn't so simple, especially without the dot symbol.

So in general I would classify the article into the "marketing text" category: not completely false but not telling the truth ;-)
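
A minimal sketch of the non-ASCII problem, assuming Python 3's `re` module (the thread does not name a language) and a made-up sample string:

```python
import re

# Sample text with two precomposed accented letters (ï and é).
text = "na\u00efve caf\u00e9"

# An explicit ASCII-only class stops at the first accented letter.
ascii_words = re.findall(r"[a-zA-Z]+", text)

# On str patterns, Python 3's \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf']
print(unicode_words)  # ['naïve', 'café']
```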

pcthrowaway · on July 14, 2023

I'm confused by this:

    [^ab] means “everything but a or b”
    ...
    . — The dot (.) matches any character, but not always. Sometimes it doesn’t match newlines. In some programming languages it never matches newlines. I’ve gotten bitten too often by the . not behaving like I think it should. It’s best to ignore this entirely

Is the negated class somehow more standardized than the `.`? I mean, if you do [^X] it's typically going to match any character that isn't X except carriage returns and line feeds. `.` works the same way, except it will also match X. And for what it's worth, I can't think of the last time I ran into it somewhere that it also matched newlines.
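
For one data point, here is a quick check in Python's `re` module (picked only as an example engine; other flavors may differ). There the two are not the same: a negated class does match newlines, while `.` skips them unless `re.DOTALL` is set.

```python
import re

text = "a\nb"

# By default, . does not match a newline in Python's re.
dot_default = re.findall(r".", text)

# A negated class matches anything not listed, newline included.
negated = re.findall(r"[^X]", text)

# re.DOTALL makes . match newlines as well.
dot_all = re.findall(r".", text, re.DOTALL)

print(dot_default)  # ['a', 'b']
print(negated)      # ['a', '\n', 'b']
print(dot_all)      # ['a', '\n', 'b']
```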

mmh0000 · on July 14, 2023

Check out this Stack Overflow thread, which provides many answers [1].

The TL;DR is, RegEx behaves slightly differently in each programming language. For example, if and how newlines are matched.

There are three main flavors of RegEx: POSIX basic (BRE), POSIX extended (ERE), and PCRE. Most UNIX tools implement either BRE or ERE. Most programming languages implement PCRE or something close to it. Some programs (/cough vim/) just make up their own regex language as they go.

[1] https://stackoverflow.com/questions/159118/how-do-i-match-an...

pcthrowaway · on July 14, 2023

Right, I understand different regex flavors work differently; what I was questioning was whether the negated character class syntax was any more "predictable" (for OP's concerns) than the `.` matching.

The article recommends using a negated character class to match a character when you know a specific character or set of characters won't be in the match. But it doesn't mention why character classes (and negations) are acceptable to use, while the `.` should be avoided ("because it might have this less predictable behaviour that I'm totally going to avoid talking about with negated classes").

_l4jh · on July 14, 2023

Hi mmh0000 apologies for replying like this but not long ago we had a little conversation regarding TRT. I replied to your last message but it is from a while ago and as HN doesn't alert you if you get a reply I've no idea if you read back through your older replies to see it. I was wondering if you had a bit of time to talk with me in private about TRT? I would be extremely grateful if you can and if so just drop me an email, my username satsyin at gmail will get to me. Thanks :)

mmh0000 · on July 14, 2023

Yes, please email me:

mmh-hn-trt@xn0.org

sir-g · on July 14, 2023

I agree with the title, but almost nobody uses regex the way he described it. If you're going to work with other people's code you'll have to learn \w, \d, and \s.
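
As a small sketch of how those shorthands relate to spelled-out classes, here is a comparison in Python's `re` module (an arbitrary choice of engine). The `re.ASCII` flag is used because, without it, Python's shorthands also match non-ASCII digits, letters, and whitespace; the sample line is made up.

```python
import re

line = "user_42  logged in\tat 09:15"

# With re.ASCII, each shorthand behaves like an explicit ASCII class.
digits = re.findall(r"\d+", line, re.ASCII)
words = re.findall(r"\w+", line, re.ASCII)
spaces = re.findall(r"\s+", line, re.ASCII)

print(digits == re.findall(r"[0-9]+", line))           # True
print(words == re.findall(r"[A-Za-z0-9_]+", line))     # True
print(spaces == re.findall(r"[ \t\n\r\f\v]+", line))   # True
print(words)  # ['user_42', 'logged', 'in', 'at', '09', '15']
```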

cafard · on July 14, 2023

I was looking at the Regular Expressions chapter of Perl Best Practices yesterday, and my recollection is that it runs to about ten pages. Some work with regular expressions isn't hard, but some is complicated, and assumptions may not carry over from language to language. It could be that some level of complexity is a sign that regular expressions are no longer the correct tool.

xwowsersx · on July 14, 2023

> This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this.

is there one too many close brackets ] in there?
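
A quick way to check, sketched in Python's `re` module. The "assumed" pattern below is only a guess at what the article meant (the stray `]` replaced by `?`), not something the article confirms. Python accepts the quoted pattern as written, but treats the extra `]` as a literal character.

```python
import re

quoted = re.compile(r"([0-9][0-9]?[0-9]][.])+")    # exactly as quoted
assumed = re.compile(r"([0-9][0-9]?[0-9]?[.])+")   # guessed intent

def matches(pattern: re.Pattern, s: str) -> bool:
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(s) is not None

print(matches(quoted, "192.168."))    # False: the pattern demands a literal ]
print(matches(quoted, "12]."))        # True: digits, then ], then a dot
print(matches(assumed, "192.168."))   # True
print(matches(assumed, "1.22.333."))  # True
print(matches(assumed, "1234."))      # False: at most three digits per group
```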

pipeline_peak · on July 14, 2023

Regex isn’t hard as long as you use it consistently enough to remember its weird syntax

carlmr · on July 14, 2023

I don't even think the syntax is too weird. It's terse, but it's also small enough to remember most things you need, if you regularly use it to search for stuff.

Regex everywhere isn't what makes Perl hard to read IMHO.

yashvg · on July 14, 2023

Regex isn't hard, it's like abstract art. It looks like ugly nonsense at first, but if you squint your eyes, you'll start to get it.