
> Letter: a character from the Unicode Letter category (L)

This definition is insufficient for many scripts, such as Indic scripts. My name “Chris” is written in Telugu as “క్రిస్”: letter ka, sign virama, letter ra, vowel sign i, letter sa, sign virama. The first virama suppresses the inherent vowel and, in this instance, also joins the next letter and vowel sign into a conjunct within the same syllable; if you didn’t want that, you’d insert a ZERO WIDTH NON-JOINER and get క్ రి instead of క్రి (minus the space: HN turned my ZWNJ into a space :/). The second virama just suppresses the inherent vowel. That is six code points, of which three are Other_Letter (part of Letter) and three are Nonspacing_Mark (part of Mark, not Letter).
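To make the code point breakdown concrete, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are illustrative only; the escapes spell out the same six code points described above.

```python
import unicodedata

# "Chris" in Telugu, written as escapes so the six code points are unambiguous:
# ka, virama, ra, vowel sign i, sa, virama.
name = "\u0C15\u0C4D\u0C30\u0C3F\u0C38\u0C4D"

# General category of each code point (Lo = Other_Letter, Mn = Nonspacing_Mark).
categories = [unicodedata.category(ch) for ch in name]

for ch, cat in zip(name, categories):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")

# Only the code points whose category falls under Letter (L*).
letters = [ch for ch, cat in zip(name, categories) if cat.startswith("L")]
print(len(name), "code points,", len(letters), "of them in Letter")
```

A tag grammar restricted to Letter would therefore reject half of the code points in this one word.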

Unicode’s Letter general category is almost never what you want. To begin with, you should instead use the “Alphabetic” property, which examination of https://www.unicode.org/reports/tr44/#Alphabetic and https://www.unicode.org/reports/tr44/#GC_Values_Table shows to be a superset of Letter, adding the Letter_Number (Nl) general category and the Other_Lowercase, Other_Uppercase and Other_Alphabetic properties.

But actually, even Alphabetic isn’t quite the right tool: some scripts don’t separate words with non-alphabetic characters, and some scripts use non-alphabetic characters in the middle of words (e.g. ZWJ and ZWNJ in Indic texts, or apostrophes in English). For the most correct results, I believe you actually need to get into full text segmentation to find word breaks, as defined in UAX #29 <https://www.unicode.org/reports/tr29/>.
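As a rough illustration of why category-based splitting falls short, the sketch below treats every maximal run of Letter code points as a "word". The helper names `is_letter` and `letter_runs` are hypothetical, and this is deliberately the naive approach, not UAX #29 segmentation (which needs a dedicated library such as ICU and is not shown here).

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True if the code point's general category is one of the Letter (L*) categories."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Naive 'words': maximal runs of consecutive Letter code points."""
    return ["".join(group) for key, group in groupby(text, key=is_letter) if key]


# The apostrophe is not a letter, so the English word is cut in two.
print(letter_runs("don't"))

# The viramas and the vowel sign are marks, so the Telugu name falls apart
# into three isolated consonants.
print(letter_runs("\u0C15\u0C4D\u0C30\u0C3F\u0C38\u0C4D"))
```

Both inputs are single words to a human reader, yet the letter-run approach splits each of them.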

As usual: language is fiendishly complicated.

The other definitions are a strange mixture. Digit is ASCII-only, which makes sense for due dates, but is inconsistent with the Unicode letters used elsewhere.

> The tag name MUST only contain letters, digits, or the characters _ or -. It MUST be treated as case-insensitive.

This is somewhat poorly defined, and I’m confident that from such a definition you’ll get multiple incompatible implementations. I would say that you probably want to specify something based on Unicode’s caseless matching; see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G217... which will give you much to read and despair of ever understanding. (The whole two and a bit pages of that subsection are worth reading. And I wouldn’t complain if you decided it was worth reading a lot more of the Unicode spec.)
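To show how "case-insensitive" can diverge between implementations, here is a small sketch contrasting plain lowercasing with full case folding plus normalization. The function name `canonical_caseless` is my own; the fold-then-normalize recipe is my reading of the canonical caseless matching definition in the Unicode Standard, so treat it as an assumption rather than a definitive implementation.

```python
import unicodedata


def canonical_caseless(s):
    """Key for canonical caseless comparison: NFD, full case fold, NFD again."""
    return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())


# Lowercasing leaves the sharp s alone; full case folding expands it to "ss".
print("Straße".lower() == "STRASSE".lower())
print(canonical_caseless("Straße") == canonical_caseless("STRASSE"))

# Precomposed É versus "e" followed by a combining acute accent: case folding
# alone is not enough, the normalization step is what makes them compare equal.
print("\u00C9".casefold() == "e\u0301".casefold())
print(canonical_caseless("\u00C9") == canonical_caseless("e\u0301"))
```

An implementation that compares with `lower()` and one that compares with a folded, normalized key will disagree on whether two such tags are the same, which is the kind of incompatibility a vague "case-insensitive" requirement invites.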

But here, I don’t think you actually want letters/digits/_/- anyway. I think that for tag names you’d do better with using UAX #31 <https://www.unicode.org/reports/tr31/> identifiers of some form. Note its optional medial characters (how you’d add -), its “hashtag identifiers”, and its remarks on case folding.
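For a feel of what an identifier-based rule could look like, here is a hedged sketch that leans on Python's `str.isidentifier()`, which checks the XID_Start/XID_Continue properties that Python adopted from UAX #31. The function `is_tag_name` and the choice to treat `-` as the only medial character are assumptions for illustration; this is not the spec's rule, not the UAX #31 hashtag profile, and it does no case folding.

```python
def is_tag_name(tag):
    """Hypothetical tag check: identifier-like parts joined by single hyphens.

    Each hyphen-separated part must be a Python identifier (XID_Start followed
    by XID_Continue characters, with "_" allowed), so "-" can only appear
    between two non-empty parts, never at the start, at the end, or doubled.
    """
    return all(part.isidentifier() for part in tag.split("-"))


print(is_tag_name("due-date"))    # hyphen used medially
print(is_tag_name("due--date"))   # empty part between the hyphens
print(is_tag_name("-due"))        # leading hyphen
print(is_tag_name("\u0C15\u0C4D\u0C30\u0C3F\u0C38\u0C4D"))  # Telugu name, marks included
```

Because nonspacing marks such as the virama are XID_Continue, the Telugu name passes as a whole, which a letters-only rule cannot offer. One known gap in this sketch: a part that starts with a digit, such as `2022`, is rejected, which a hashtag-style profile would likely want to allow.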



Thanks for your detailed notes, I’ll look through your links.

Just to clarify for context, the “letter” definition is only relevant for tags, so outside of #tags (in item descriptions, or group titles) you can use any Unicode character you want.



