chardet [![Build Status](https://travis-ci.org/runk/node-chardet.png)](https://travis-ci.org/runk/node-chardet)
=====

Chardet is a character encoding detection module for Node.js, written in pure JavaScript.
It is based on the ICU project (http://site.icu-project.org/), which uses character
occurrence analysis to determine the most probable encoding.

## Installation

```
npm i chardet
```

## Usage

To return the encoding with the highest confidence:
```javascript
var chardet = require('chardet');
// Buffer.from() builds a buffer from the string's bytes
chardet.detect(Buffer.from('hello there!'));
// or
chardet.detectFile('/path/to/file', function(err, encoding) {});
// or
chardet.detectFileSync('/path/to/file');
```

To return the full list of possible encodings:
```javascript
var chardet = require('chardet');
// Buffer.from() builds a buffer from the string's bytes
chardet.detectAll(Buffer.from('hello there!'));
// or
chardet.detectFileAll('/path/to/file', function(err, encodings) {});
// or
chardet.detectFileAllSync('/path/to/file');

// The returned value is an array of objects sorted by confidence in descending order,
// e.g. [{ confidence: 90, name: 'UTF-8' }, { confidence: 20, name: 'windows-1252', lang: 'fr' }]
```

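
Because the list is sorted by confidence, a caller may want to accept the top candidate only when it is convincing enough. The following is a minimal sketch of that idea; `pickEncoding`, its `minConfidence` threshold, and its `fallback` argument are hypothetical names introduced here for illustration and are not part of chardet.

```javascript
// Hypothetical helper (not part of chardet): choose an encoding name from a
// detectAll()-style result, falling back when the confidence is too low.
function pickEncoding(matches, minConfidence, fallback) {
  // The list is documented as sorted by confidence in descending order,
  // so the first entry is the best candidate.
  var best = Array.isArray(matches) && matches.length > 0 ? matches[0] : null;
  if (best && best.confidence >= minConfidence) {
    return best.name;
  }
  return fallback;
}

// Sample data in the shape shown in the README example.
var sample = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' }
];

console.log(pickEncoding(sample, 50, 'ISO-8859-1'));          // UTF-8
console.log(pickEncoding(sample.slice(1), 50, 'ISO-8859-1')); // ISO-8859-1 (20 < 50)
console.log(pickEncoding([], 50, 'ISO-8859-1'));              // ISO-8859-1 (no candidates)
```

In real use the first argument would be the array returned by `chardet.detectAll(buffer)` or `chardet.detectFileAllSync(path)`.
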
## Working with large data sets

When the data set is large and you want to improve performance at the cost of some accuracy,
you can sample only the first N bytes of the buffer:

```javascript
chardet.detectFile('/path/to/file', { sampleSize: 32 }, function(err, encoding) {});
```

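
To make the idea of sampling concrete, here is a rough sketch that reads only the first N bytes of a file using Node's `fs` module. It illustrates the concept and is not necessarily how chardet implements `sampleSize` internally; `readSample` and the temporary file name are hypothetical.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical helper (not part of chardet): read at most `sampleSize` bytes
// from the start of a file without loading the rest into memory.
function readSample(filePath, sampleSize) {
  var fd = fs.openSync(filePath, 'r');
  try {
    var buffer = Buffer.alloc(sampleSize);
    // Read from position 0; bytesRead is smaller than sampleSize for short files.
    var bytesRead = fs.readSync(fd, buffer, 0, sampleSize, 0);
    return buffer.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd);
  }
}

// Demo with a temporary file.
var tmpFile = path.join(os.tmpdir(), 'chardet-sample-demo-' + process.pid + '.txt');
fs.writeFileSync(tmpFile, 'hello there!');
var head = readSample(tmpFile, 5);
console.log(head.toString('utf8')); // hello
fs.unlinkSync(tmpFile);
```

The resulting buffer could then be passed to `chardet.detect(head)`. A smaller sample means less work but also less evidence for the detector, which is the accuracy tradeoff described above.
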
## Supported Encodings:

* UTF-8
* UTF-16 LE
* UTF-16 BE
* UTF-32 LE
* UTF-32 BE
* ISO-2022-JP
* ISO-2022-KR
* ISO-2022-CN
* Shift-JIS
* Big5
* EUC-JP
* EUC-KR
* GB18030
* ISO-8859-1
* ISO-8859-2
* ISO-8859-5
* ISO-8859-6
* ISO-8859-7
* ISO-8859-8
* ISO-8859-9
* windows-1250
* windows-1251
* windows-1252
* windows-1253
* windows-1254
* windows-1255
* windows-1256
* KOI8-R

Currently only these encodings are supported; more will be added soon.

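
Only a few of the encodings listed above can be decoded directly by Node's `Buffer#toString()`. The sketch below maps a detected name to a built-in Node encoding where one exists. `toNodeEncoding` and `NODE_ENCODINGS` are hypothetical names, and the code assumes the detector returns names spelled like the list entries; it normalizes case, spaces, and hyphens in case the actual strings differ slightly.

```javascript
// Hypothetical helper (not part of chardet): map a detected encoding name to
// an encoding that Buffer#toString() understands, or null if there is none.
var NODE_ENCODINGS = {
  utf8: 'utf8',
  utf16le: 'utf16le',
  iso88591: 'latin1' // Node's 'latin1' is ISO-8859-1
};

function toNodeEncoding(name) {
  // Normalize so 'UTF-16 LE', 'UTF-16LE' and 'utf-16le' all compare equal.
  var key = String(name).toLowerCase().replace(/[^a-z0-9]/g, '');
  return Object.prototype.hasOwnProperty.call(NODE_ENCODINGS, key)
    ? NODE_ENCODINGS[key]
    : null;
}

console.log(toNodeEncoding('UTF-8'));      // utf8
console.log(toNodeEncoding('UTF-16 LE'));  // utf16le
console.log(toNodeEncoding('ISO-8859-1')); // latin1
console.log(toNodeEncoding('Shift-JIS'));  // null
```

A `null` result means the buffer cannot be decoded with `Buffer#toString()` alone and would need a separate conversion step.
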