
Use Buffers when decoding

Alexander Shtuchkin edited this page Jun 5, 2014 · 5 revisions

Decoding a string instead of a Buffer is probably the most common mistake when working with resources in legacy encodings. Why? Let's see.

Problem

This is wrong:

var http = require('http'),
    iconv = require('iconv-lite');

http.get("http://website.com/", function(res) {
  var body = '';
  res.on('data', function(chunk) {
    body += chunk;
  });
  res.on('end', function() {
    var decodedBody = iconv.decode(body, 'win1252');
    console.log(decodedBody);
  });
});

Before iconv.decode is ever called, the original resource has already been (unintentionally) decoded in body += chunk through JavaScript type conversion: appending a Buffer to a string converts the Buffer to a string first. What really happens here is:

  res.on('data', function(chunkBuffer) {
    body += chunkBuffer.toString('utf8');
  });

The same conversion happens behind the scenes if you call res.setEncoding('utf8').

Not only does this double decoding produce wrong results, it also makes the original bytes nearly impossible to restore, because the utf8 conversion is lossy (byte sequences that are not valid UTF-8 are replaced with U+FFFD). So even iconv.decode(new Buffer(body, 'utf8'), 'win1252') will not help.
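The loss can be seen with Node's built-in Buffer alone. The following is a minimal sketch, assuming the word "café" encoded as win1252 (where é is the single byte 0xE9); the variable names are only illustrative.

```javascript
// 'café' encoded as win1252: the é is the single byte 0xE9.
const original = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Implicit conversion, as in `body += chunk`: the bytes are read as UTF-8.
// A lone 0xE9 is not valid UTF-8, so it becomes U+FFFD.
const asString = '' + original;

// Converting the string back to bytes does not bring 0xE9 back.
const roundTripped = Buffer.from(asString, 'utf8');

console.log(original.toString('hex'));     // 636166e9
console.log(roundTripped.toString('hex')); // 636166efbfbd
```

The last three bytes (ef bf bd) are the UTF-8 encoding of U+FFFD. Every invalid byte maps to that same character, so there is nothing left to tell which byte was originally there.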

Solution

Keep the original Buffers and pass them to iconv.decode. Use Buffer.concat() to join the chunks if needed.

In general, keep in mind that all JavaScript strings are already decoded and should not be decoded again.

http.get("http://website.com/", function(res) {
  var chunks = [];
  res.on('data', function(chunk) {
    chunks.push(chunk);
  });
  res.on('end', function() {
    var decodedBody = iconv.decode(Buffer.concat(chunks), 'win1252');
    console.log(decodedBody);
  });
});

// Or, with an iconv-lite version that has streaming support and Node v0.10+,
// you can use the `collect` helper
http.get("http://website.com/", function(res) {
  res.pipe(iconv.decodeStream('win1252')).collect(function(err, decodedBody) {
    if (err) throw err;
    console.log(decodedBody);
  });
});
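The collect-then-decode pattern can also be wrapped in a small helper and tried without a network connection. This is only a sketch under stated assumptions: collectAndDecode and fakeResponse are hypothetical names, an EventEmitter stands in for the HTTP response, and Node's built-in 'latin1' decoding stands in for iconv.decode(buf, 'win1252') (the two agree on the byte used here).

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: gathers raw Buffer chunks and decodes once at the end.
// `decode` is any function (Buffer) -> string, for example
// function (buf) { return iconv.decode(buf, 'win1252'); } with iconv-lite.
function collectAndDecode(source, decode, callback) {
  const chunks = [];
  source.on('data', (chunk) => chunks.push(chunk)); // keep Buffers, no `+=`
  source.on('error', callback);
  source.on('end', () => callback(null, decode(Buffer.concat(chunks))));
}

// Stand-in for an HTTP response: delivers 'café' in two chunks.
const fakeResponse = new EventEmitter();
let result;
collectAndDecode(fakeResponse, (buf) => buf.toString('latin1'), (err, text) => {
  if (err) throw err;
  result = text;
});
fakeResponse.emit('data', Buffer.from([0x63, 0x61]));
fakeResponse.emit('data', Buffer.from([0x66, 0xe9]));
fakeResponse.emit('end');

console.log(result); // café
```

Passing the decoder in as a function keeps the helper independent of any particular encoding library; the only fixed rule is that chunks stay Buffers until the single decode call at the end.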