Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.

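The two guarantees above (every input matches, and each match starts where the previous one ended) mean that looping with `exec` visits every character exactly once, so joining the matched values reproduces the input. The sketch below illustrates this with `toyTokens`, a hypothetical and much simpler stand-in regex (not the real `jsTokens`), and a hypothetical `tokenize` helper, so that it runs without installing anything. Swapping in the real `jsTokens` should behave similarly, but that is an assumption rather than something verified here.

```js
// Hypothetical stand-in for `jsTokens`: whitespace, names, digits, or any
// other single character. Like the real regex, it has the `g` flag.
var toyTokens = /\s+|[A-Za-z_$][\w$]*|\d+|[\s\S]/g

// Hypothetical helper: collects every match of a `g` regex in order.
function tokenize(regex, string) {
  var tokens = []
  var match
  regex.lastIndex = 0 // `g` regexes keep state between calls, so reset it.
  while ((match = regex.exec(string)) !== null) {
    // A regex that can match the empty string (as the README says `jsTokens`
    // does) would not advance `lastIndex`, so stop to avoid looping forever.
    if (match[0] === "") break
    tokens.push(match[0])
  }
  return tokens
}

var source = "var foo=opts.foo;\n"
var tokens = tokenize(toyTokens, source)

console.log(tokens)
// Because each match starts where the previous one ended, nothing is skipped:
console.log(tokens.join("") === source)
```

The zero-length guard matters mostly for empty input, where a match of `""` would otherwise repeat indefinitely.
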
### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.

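The prefix check described above could look like the following sketch. `tokenFlavor` and the flavor names it returns are hypothetical, made up for this example. The token objects are written by hand in the `{type, value}` shape rather than produced by `matchToToken`, so the snippet runs on its own.

```js
// Hypothetical helper: names the flavor of a comment or string token based
// on how its value starts. Other token types have no flavor.
function tokenFlavor(token) {
  if (token.type === "comment") {
    return token.value.slice(0, 2) === "//" ? "single-line" : "multi-line"
  }
  if (token.type === "string") {
    switch (token.value.charAt(0)) {
      case "'": return "single-quoted"
      case '"': return "double-quoted"
      case "`": return "template"
    }
  }
  return null
}

// Hand-written token objects in the `{type, value}` shape described above.
var examples = [
  {type: "comment", value: "// note"},
  {type: "comment", value: "/* note", closed: false},
  {type: "string", value: "`a${b}`", closed: true},
  {type: "name", value: "foo"}
]

examples.forEach(function (token) {
  console.log(token.value, "->", tokenFlavor(token))
})
```
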
Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js


ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.

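To make the end-of-line rule concrete, here is a small sketch using `toyString`, a hypothetical regex for single- and double-quoted strings only (an illustration, not necessarily the pattern `jsTokens` uses). The closing quote is optional, and because `.` does not match line terminators, an unterminated string stops at the end of its line. The `matchString` helper is also hypothetical and assumes its input contains a quote.

```js
// Hypothetical, simplified string matcher. Group 1 is the opening quote and
// group 2 is the closing quote, which is optional.
var toyString = /(['"])(?:(?!\1|\\).|\\(?:\r\n|[\s\S]))*(\1)?/

function matchString(source) {
  var match = toyString.exec(source)
  return {
    value: match[0],
    closed: match[2] !== undefined // Did the optional closing quote match?
  }
}

console.log(matchString('"abc" + rest'))
console.log(matchString('"abc\nrest')) // The newline ends the token.
```
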
Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except unicode spaces which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.

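Here is a sketch of what "one level of nesting" can mean in practice. `toyTemplate` is a hypothetical pattern written for this illustration, and it should not be read as the exact pattern in the `jsTokens` source. Inside `${...}` it tolerates one nested `{...}` pair and nothing deeper. With deeper nesting, the pattern mistakes an inner `}` for the end of the interpolation, and a later `` ` `` then ends the token too early.

```js
// Hypothetical, simplified template string matcher. Inside `${ ... }` it
// accepts plain characters or ONE nested `{ ... }` pair, never deeper.
var toyTemplate = /`(?:[^`\\$]|\\[\s\S]|\$(?!\{)|\$\{(?:[^{}]|\{[^}]*\}?)*\}?)*(`)?/

function matchTemplate(source) {
  return toyTemplate.exec(source)[0]
}

// One level of nesting: the object literal is the only nested `{ ... }`.
var oneLevel = "`${ {a: 1}.a + `x` }`"
// Two levels of nesting: an object literal inside an object literal.
var twoLevels = "`${ {a: {}} + `x` }`"

console.log(matchTemplate(oneLevel) === oneLevel) // Whole string is one token.
console.log(matchTemplate(twoLevels)) // Stops at the backtick before `x`.
```
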
### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section.)

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` lines:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.

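The flag rule can be sketched as follows. `looksLikeRegexLiteral` and its pattern are hypothetical simplifications made up for this example, not the logic `jsTokens` actually uses. After the closing slash, the text must either carry one to six characters from `gmiyus` ending at a word boundary, or not continue with a letter at all.

```js
// Hypothetical, simplified check: a regex body, a closing slash, and then
// either valid flags or no letter at all.
var toyRegexLiteral = /^\/(?:[^\/\\\r\n]|\\.)+\/(?:[gmiyus]{1,6}\b|(?![a-zA-Z]))/

function looksLikeRegexLiteral(text) {
  return toyRegexLiteral.test(text)
}

console.log(looksLikeRegexLiteral("/ 2/g")) // `g` is a valid flag.
console.log(looksLikeRegexLiteral("/ 2/e")) // `e` is not, so: division.
console.log(looksLikeRegexLiteral("/ 2/gx")) // `gx` is not made of flags only.
```
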
Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.

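The first rule in the list can be sketched with a negative lookahead. `toyFlaglessRegex` and `classify` are hypothetical simplifications written for this example, far cruder than the real regex, and they only consider flagless literals. A candidate is rejected when the next token starts like a name, number or string.

```js
// Hypothetical, simplified check for a flagless regex literal that is NOT
// followed (after optional whitespace) by a name, number or string.
var toyFlaglessRegex = /^\/(?:[^\/\\\r\n]|\\.)+\/(?!\s*[\w$'"`])/

function classify(text) {
  return toyFlaglessRegex.test(text) ? "regex" : "division"
}

console.log(classify("/ 2/;")) // Nothing suspicious follows.
console.log(classify("/ 2/ foo")) // A name follows, as in `bar / 2/ foo`.
console.log(classify("/ 2/ 3")) // A number follows.
```
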
Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex literals
  apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html


License
=======

[MIT](LICENSE).