> Letter: a character from the Unicode Letter category (L)

This definition is insufficient for many scripts, such as Indic scripts. My name “Chris” is written in Telugu as “క్రిస్”: letter ka, sign virama (which suppresses the inherent vowel and, in this instance, also joins the next letter and vowel sign as a conjunct, as part of the same syllable; if you didn’t want that, you’d insert a ZERO WIDTH NON-JOINER and get క్ రి [minus the space; HN turned my ZWNJ into a space :/] instead of క్రి), letter ra, vowel sign i, letter sa, sign virama (which this time just suppresses the inherent vowel). Six code points, of which three are Other_Letter (part of Letter) and three are Nonspacing_Mark (part of Mark, not Letter).
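If you want to check that breakdown yourself, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are mine, and the string is written with escapes so the code points are unambiguous.

```python
import unicodedata

# "Chris" in Telugu: ka, virama, ra, vowel sign i, sa, virama.
name = "\u0C15\u0C4D\u0C30\u0C3F\u0C38\u0C4D"

# Print code point, general category, and character name for each code point.
for ch in name:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# Collect the general categories: Lo is Other_Letter, Mn is Nonspacing_Mark.
categories = [unicodedata.category(ch) for ch in name]
```

A rule that accepts only category L would reject every second code point of this name.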

Unicode’s Letter general category is almost never what you want. To begin with, you should instead use the “Alphabetic” property, which examination of https://www.unicode.org/reports/tr44/#Alphabetic and https://www.unicode.org/reports/tr44/#GC_Values_Table shows to be a superset of Letter, adding the Letter_Number (Nl) general category and the Other_Lowercase, Other_Uppercase and Other_Alphabetic properties.
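To make the gap concrete, here is a small sketch in Python. As far as I know the standard library does not expose the Alphabetic property, so this only shows one character that Letter misses: ROMAN NUMERAL EIGHT, which is Letter_Number (Nl). The helper `is_letter` is a hypothetical name of mine. Python's own `str.isalpha()` is documented as being based on the Letter categories, not on Alphabetic, so it has the same blind spot.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if ch has a general category in the Letter group (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

roman_eight = "\u2167"  # ROMAN NUMERAL EIGHT

category = unicodedata.category(roman_eight)  # Letter_Number, i.e. "Nl"
letter_by_category = is_letter(roman_eight)   # Nl is not in the Letter group
letter_by_isalpha = roman_eight.isalpha()     # isalpha() follows Letter as well
```

Checking the real Alphabetic property needs a library that ships that data, which I have not shown here.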

But actually, even Alphabetic isn’t quite the right tool: some scripts don’t separate words with non-alphabetic characters, and some scripts use non-alphabetic characters in the middle of words (e.g. ZWJ and ZWNJ in Indic texts, or apostrophes in English). For the most correct results, I believe you actually need to get into full text segmentation to find word breaks, as defined in UAX #29 <https://www.unicode.org/reports/tr29/>.
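Here is a rough illustration of the failure mode, not of UAX #29 itself. The function `naive_words` is a hypothetical helper that treats every maximal run of category-L characters as a word, which is roughly what a "letters only" rule amounts to.

```python
import unicodedata
from itertools import groupby

def naive_words(text: str) -> list[str]:
    """Split text into maximal runs of characters whose general category is L*."""
    def is_letter(ch: str) -> bool:
        return unicodedata.category(ch).startswith("L")
    return ["".join(run) for letter, run in groupby(text, key=is_letter) if letter]

# The apostrophe (category Po) cuts the English word in two.
english = naive_words("don't")

# The viramas and the vowel sign (category Mn) cut the Telugu name into three pieces.
telugu = naive_words("\u0C15\u0C4D\u0C30\u0C3F\u0C38\u0C4D")
```

A real implementation would use a library that implements the UAX #29 word boundary rules, rather than category tests like this.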

As usual: language is fiendishly complicated.

The other definitions are a strange mixture. Digit is ASCII-only, which makes sense for due dates, but is inconsistent with the use of Unicode letters elsewhere.

> The tag name MUST only contain letters, digits, or the characters _ or -. It MUST be treated as case-insensitive.

This is somewhat poorly defined, and I’m confident that from such a definition you’ll get multiple incompatible implementations. I would say that you probably want to specify something based on Unicode’s caseless matching; see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G217... which will give you much to read and despair of ever understanding. (The whole two and a bit pages of that subsection are worth reading. And I wouldn’t complain if you decided it was worth reading a lot more of the Unicode spec.)
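As a sketch of what a more precise rule could look like, here is the canonical caseless match from that chapter, as I understand it: normalize to NFD, case fold, then normalize to NFD again. The function names are mine, and I'm assuming Python's `str.casefold()` is an acceptable stand-in for the full case folding the standard describes.

```python
import unicodedata

def canonical_caseless_key(s: str) -> str:
    """Return NFD(casefold(NFD(s))), a key for canonical caseless comparison."""
    def nfd(t: str) -> str:
        return unicodedata.normalize("NFD", t)
    return nfd(nfd(s).casefold())

def tags_match(a: str, b: str) -> bool:
    """Compare two tag names under canonical caseless matching."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)

# Lowercasing alone keeps these apart, because "ß" has no one-character uppercase.
lower_equal = "Straße".lower() == "STRASSE".lower()
fold_equal = tags_match("Straße", "STRASSE")

# Precomposed É versus e followed by COMBINING ACUTE ACCENT.
accent_equal = tags_match("\u00C9", "e\u0301")
```

Two implementations that pick `lower()` and `casefold()` respectively would already disagree on the first pair.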

But in this place, I don’t think you actually want letters/digits/_/- anyway. I think that for tag names you’d do better with using UAX #31 <https://www.unicode.org/reports/tr31/> identifiers of some form. Note its optional medial characters (how you’d add -), its “hashtag identifiers”, and its remarks on case folding.
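For a feel of what that might look like, here is a very rough approximation in Python, and it is not a conforming UAX #31 profile. `is_plausible_tag` is a hypothetical helper of mine. It leans on `str.isidentifier()`, which in CPython is based on the XID_Start and XID_Continue properties, and prefixes an underscore so that every character in a segment only has to be a continue character (which means a segment may begin with a digit). The hyphen is treated as a medial character: allowed between segments, never at the start or end, never doubled.

```python
def is_plausible_tag(tag: str) -> bool:
    """Approximate check: hyphen-separated segments of identifier-continue characters."""
    segments = tag.split("-")
    # An empty segment means a leading, trailing, or doubled hyphen (or an empty tag).
    return all(seg and ("_" + seg).isidentifier() for seg in segments)

examples = {
    "due-date": is_plausible_tag("due-date"),
    "tag_1": is_plausible_tag("tag_1"),
    "-x": is_plausible_tag("-x"),
    "a--b": is_plausible_tag("a--b"),
    "a b": is_plausible_tag("a b"),
    # The Telugu name from earlier: its marks count as identifier-continue characters.
    "telugu": is_plausible_tag("\u0C15\u0C4D\u0C30\u0C3F\u0C38\u0C4D"),
}
```

Case folding and normalization would still need to be layered on top of a check like this.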