# Graphemer: Unicode Character Splitter 🪓

## Introduction

This library continues the work of [Grapheme Splitter](https://github.com/orling/grapheme-splitter) and supports the following Unicode versions:

- Unicode 15 and below `[v1.4.0]`
- Unicode 14 and below `[v1.3.0]`
- Unicode 13 and below `[v1.1.0]`
- Unicode 11 and below `[v1.0.0]` (Unicode 10 supported by `grapheme-splitter`)

In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting a string inadvertently cuts a multi-char letter in half, or when you need the actual number of letters in a string.

For example, emoji characters like "🌷", "🎁", "💩", "😜" and "👍" are represented by two JavaScript characters each (a high surrogate and a low surrogate). That is,

```javascript
'🌷'.length == 2;
```

Combined emoji are even longer:

```javascript
'🏳️‍🌈'.length == 6; // U+1F3F3 U+FE0F U+200D U+1F308
```
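
Counting code points instead of code units helps with single emoji but not with combined ones. The sketch below compares the two counts for the emoji above. The strings are written as escape sequences so the exact code points are unambiguous, and the variable names are purely illustrative.

```javascript
// U+1F337 TULIP, written as an escape so the code point is explicit.
const tulip = '\u{1F337}';

// Rainbow flag: white flag + variation selector-16 + zero width joiner + rainbow.
const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}';

// .length counts UTF-16 code units.
const tulipUnits = tulip.length; // 2
const flagUnits = rainbowFlag.length; // 6

// Spreading a string iterates by code point, which fixes the tulip...
const tulipPoints = [...tulip].length; // 1
// ...but the flag is still four code points, not one user-perceived character.
const flagPoints = [...rainbowFlag].length; // 4

console.log({ tulipUnits, flagUnits, tulipPoints, flagPoints });
```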

What's more, some languages include combining marks: characters that modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Some of these can be represented either as a single character or as a letter followed by a combining mark, with both forms equally valid:

```javascript
var two = '\u006E\u0303'; // ñ, unnormalized two-char n + ◌̃
var one = '\u00F1'; // ñ, normalized single-char

console.log(one != two); // prints 'true'
```
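
For a letter like ñ, where a precomposed code point exists, the two forms can be converted into each other. A minimal sketch using the standard `String.prototype.normalize()` (the variable names are illustrative):

```javascript
// The same letter in decomposed and precomposed form.
const decomposed = '\u006E\u0303'; // n + combining tilde
const precomposed = '\u00F1'; // ñ as a single code point

// NFC composes where a precomposed code point exists.
const composedAgain = decomposed.normalize('NFC');
const equalAfterNfc = composedAgain === precomposed;

// NFD goes the other way and decomposes.
const equalAfterNfd = precomposed.normalize('NFD') === decomposed;

console.log(composedAgain.length, equalAfterNfc, equalAfterNfd); // prints: 1 true true
```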

Unicode normalization, as performed by the popular punycode.js library or ECMAScript 6's `String.prototype.normalize()`, can **sometimes** fix those differences and turn two-char sequences into single characters. But it is **not** enough in all cases. Some languages, like Hindi, make extensive use of combining marks on letters that have no dedicated single-codepoint Unicode sequences, due to the sheer number of possible combinations.

For example, the Hindi word "अनुच्छेद" is composed of 5 letters and 3 combining marks:

अ + न + ु + च + ् + छ + े + द

which is in fact just 5 user-perceived letters:

अ + नु + च् + छे + द

and which Unicode normalization would not combine properly.

There are also unusual letter + combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact composed of 58 JavaScript characters, most of which are combining marks.
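
The limits of normalization can be checked directly. In the sketch below, the Hindi word from above is written as escape sequences (so its eight code points are explicit) and passed through `String.prototype.normalize()`. It assumes only a standard ES2015+ runtime.

```javascript
// "अनुच्छेद" spelled out code point by code point:
// अ, न, ु, च, ्, छ, े, द
const word = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926';

// NFC has no precomposed code points to offer for these
// letter + mark pairs, so the string comes back unchanged.
const nfc = word.normalize('NFC');

console.log(word.length); // 8
console.log(nfc.length); // 8
console.log(nfc === word); // true
```

The string still has 8 JavaScript characters after normalization, although a reader sees 5 letters.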

Enter the `graphemer` library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation of the [Default Grapheme Cluster Boundary](http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table) rules of [UAX #29](http://www.unicode.org/reports/tr29/).

## Installation

Install `graphemer` using the npm command below:

```
$ npm i graphemer
```

## Usage

If you're using [TypeScript](https://www.typescriptlang.org/) or a compiler like [Babel](https://babeljs.io/) (or something like Create React App), things are pretty simple: just import, initialize and use!

```javascript
import Graphemer from 'graphemer';

const splitter = new Graphemer();

// split the string into an array of grapheme clusters (one string each)
const graphemes = splitter.splitGraphemes(string);

// get an iterable iterator over the string's grapheme clusters (one string each)
const graphemeIterator = splitter.iterateGraphemes(string);

// or do this if you just need their number
const graphemeCount = splitter.countGraphemes(string);
```
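
One common use is truncating text without cutting a letter in half. The sketch below is a hypothetical helper (`truncateGraphemes` is not part of graphemer) that accepts any object exposing the `splitGraphemes` method shown above. So that it runs without the package installed, it is demonstrated with a stand-in splitter that only splits by code point; in real use you would pass `new Graphemer()` instead.

```javascript
// Hypothetical helper, not part of graphemer: keep at most `max`
// grapheme clusters of `text`, never cutting a cluster in half.
// `splitter` is any object with a splitGraphemes(string) method,
// e.g. `new Graphemer()`.
function truncateGraphemes(splitter, text, max) {
  return splitter.splitGraphemes(text).slice(0, max).join('');
}

// Stand-in splitter for demonstration only: it splits by code point,
// which is NOT grapheme-aware. Swap in `new Graphemer()` in real code.
const codePointSplitter = {
  splitGraphemes: (text) => Array.from(text),
};

const twoEmoji = '\u{1F337}\u{1F381}'; // tulip + wrapped gift, 4 code units

// Naive slicing by code unit cuts the second surrogate pair in half.
const naive = twoEmoji.slice(0, 3);

// Going through the splitter keeps whole units only.
const safe = truncateGraphemes(codePointSplitter, twoEmoji, 1);

console.log(naive.length, safe === '\u{1F337}'); // prints: 3 true
```

Taking the splitter as a parameter also lets a single `Graphemer` instance be created once and reused across calls.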

If you're using vanilla Node, you can use `require()` instead:

```javascript
const Graphemer = require('graphemer').default;

const splitter = new Graphemer();

const graphemes = splitter.splitGraphemes(string);
```

## Examples

```javascript
import Graphemer from 'graphemer';

const splitter = new Graphemer();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes('abcd'); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes('🌷🎁💩😜👍🏳️‍🌈'); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes('Ĺo͂řȩm̅'); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes('뎌쉐'); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]
```

## TypeScript

Graphemer is built with TypeScript and, of course, includes type declarations.

```typescript
import Graphemer from 'graphemer';

const splitter = new Graphemer();

const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞');
```

## Contributing

See [Contribution Guide](./CONTRIBUTING.md).

## Acknowledgements

This library is a fork of the incredible work done by Orlin Georgiev and Huáng Jùnliàng at https://github.com/orling/grapheme-splitter.

The original library was heavily influenced by Devon Govett's excellent [grapheme-breaker](https://github.com/devongovett/grapheme-breaker) CoffeeScript library.