Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.
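
As a minimal sketch of what these guarantees allow: because every match starts where the previous one ended, collecting `match[0]` from repeated `exec` calls gives tokens that join back into the original input. Both `collectTokens` and `standIn` below are hypothetical names. `standIn` is a much simplified substitute so that the snippet runs without the package; with js-tokens installed you would presumably pass `jsTokens` instead.

```js
// Hypothetical helper (not part of js-tokens): collects the text of every
// match of a global regex, starting from the beginning of `input`.
function collectTokens(regex, input) {
  var tokens = []
  var match
  regex.lastIndex = 0
  while ((match = regex.exec(input)) !== null) {
    // An empty match does not advance `lastIndex`, so stop here
    // instead of looping forever.
    if (match[0] === "") break
    tokens.push(match[0])
  }
  return tokens
}

// Simplified stand-in for `jsTokens` (assumption: far less complete than the
// real regex, but some alternative matches at every position of the input).
var standIn = /\s+|[A-Za-z_$][\w$]*|\d+|[^\s\w$]/g

var input = "var foo=opts.foo;\n"
var tokens = collectTokens(standIn, input)

// Matches are contiguous, so joining them reproduces the input.
var roundTrips = tokens.join("") === input
```

Resetting `lastIndex` before the loop matters because a regex with the `g` flag remembers where its previous `exec` call stopped.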

### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.
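
The flavor check described above could look roughly like this. `flavorOf` is a hypothetical helper, not something js-tokens exports, and the labels it returns are made up for illustration. It only assumes a token of the `{type, value}` shape described in this section.

```js
// Hypothetical helper: refines "comment" and "string" tokens by looking at
// how the token text starts. Other token types are returned unchanged.
function flavorOf(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//"
      ? "single-line comment"
      : "multi-line comment"
  }
  if (token.type === "string") {
    switch (value.charAt(0)) {
      case "'": return "single-quoted string"
      case '"': return "double-quoted string"
      case "`": return "template string"
    }
  }
  return token.type
}

var commentFlavor = flavorOf({type: "comment", value: "/* note */", closed: true})
var stringFlavor = flavorOf({type: "string", value: "`a${b}`", closed: true})
```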

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js


ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.
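
One way to get this behavior, shown here only as a sketch, is to make the closing delimiter an optional capture group. The patterns and the `matchDelimited` helper below are hypothetical and much simpler than the real regex; they are not taken from the js-tokens source.

```js
// Illustrative patterns only (assumption: not the ones js-tokens uses).
// The closing delimiter is an optional capture group, so whether it was
// captured tells if the token was closed.
var doubleQuotedString = /"(?:[^"\\\r\n]|\\(?:\r\n|[\s\S]))*(")?/
var multiLineComment = /\/\*(?:[^*]|\*(?!\/))*(\*\/)?/

// Hypothetical helper: returns the matched text and whether it was closed.
function matchDelimited(pattern, source) {
  var match = pattern.exec(source)
  if (match === null) return null
  return {value: match[0], closed: match[1] !== undefined}
}

// The unterminated string stops before the newline.
var openString = matchDelimited(doubleQuotedString, '"abc\nvar x = 1')
var closedString = matchDelimited(doubleQuotedString, '"abc" + 1')

// The unterminated comment runs to the end of the input.
var openComment = matchDelimited(multiLineComment, "/* never closed\nvar x = 1")
```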

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except Unicode spaces, which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.
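
To illustrate what a fixed nesting depth means in practice, here is a sketch of an interpolation pattern that balances braces one level deep. `oneLevelInterpolation` is a hypothetical, simplified pattern and not the one js-tokens uses; it ignores strings, comments and nested templates inside the interpolation.

```js
// Matches `${ ... }` whose contents may contain `{ ... }` pairs, as long as
// those pairs do not contain further braces themselves.
var oneLevelInterpolation = /^\$\{(?:[^{}]|\{[^{}]*\})*\}$/

var flat = oneLevelInterpolation.test("${foo}")
var oneLevel = oneLevelInterpolation.test("${fn({a: 1})}")
var twoLevels = oneLevelInterpolation.test("${fn({a: {b: 1}})}")
```

Each extra level of nesting would need another copy of the inner group spelled out by hand, which is why a regex can only support a fixed depth.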

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section).

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` rows:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).
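
The flag restriction and the forward-looking rules can be sketched with negative lookaheads. The `regexCandidate` pattern and the `looksLikeRegexLiteral` helper below are hypothetical and heavily simplified; they are not the pattern js-tokens uses. The sketch assumes `source` starts at a slash, and it ignores character classes such as `/[/]/`.

```js
// Simplified stand-in, NOT the js-tokens regex. In order:
// 1. a regex body without unescaped slashes or newlines,
// 2. up to six of the allowed flags,
// 3. not followed by a name, number or string on the same line,
// 4. not followed by one of the operators `+`, `*`, `&&` or `==`.
var regexCandidate = /^\/(?:[^\/\\\r\n]|\\.)+\/[gmiyus]{0,6}(?![ \t]*[\w$'"`])(?![ \t]*(?:[+*]|&&|==))/

// Hypothetical helper: true if `source` starts with something that this
// sketch would treat as a regex literal rather than division.
function looksLikeRegexLiteral(source) {
  return regexCandidate.test(source)
}

var atEndOfStatement = looksLikeRegexLiteral("/ 2/g")
var invalidFlag = looksLikeRegexLiteral("/ 2/e")
var followedByName = looksLikeRegexLiteral("/ 2/ b")
var followedByOperator = looksLikeRegexLiteral("/ 2/g + 1")
var methodCall = looksLikeRegexLiteral("/ab+c/.test(s)")
```

Note that `/ 2/g` at the end of a statement is accepted as a regex literal even when it follows `bar`, which is the ambiguity summed up above.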

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex literals
  apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html
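
For reference, here is a small sketch of the three ES2018 regex features mentioned in this section. It only demonstrates the language features and is not planned js-tokens code; every name in it is hypothetical, and the patterns are deliberately naive. It needs an engine with ES2018 regex support.

```js
// Unicode property escapes: a compact (simplified) identifier name pattern.
var identifierName = /^[\p{ID_Start}$_][\p{ID_Continue}$\u200c\u200d]*$/u

// Lookbehind: a slash right after an identifier character, `)` or `]`
// (optionally separated by spaces or tabs) is treated as division here.
// Assumption: deliberately naive; it misclassifies e.g. `return /x/`.
function slashIsDivision(source, index) {
  var divisionSlash = /(?<=[\w$)\]][ \t]*)\//y
  divisionSlash.lastIndex = index
  return divisionSlash.test(source)
}

// Named capture groups: read the token kind by name instead of by index.
var nameOrNumber = /(?<name>[A-Za-z_$][\w$]*)|(?<number>\d+)/
var groups = nameOrNumber.exec("foo").groups

var numberLine = "var number = bar / 2/g"
var regexLine = "var regex = / 2/g"
var firstSlashInNumberLine = slashIsDivision(numberLine, numberLine.indexOf("/"))
var firstSlashInRegexLine = slashIsDivision(regexLine, regexLine.indexOf("/"))
```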


License
=======

[MIT](LICENSE).