json.dump() uses ASCII codec encoding (instead of requested UTF-8) when redirecting stdout to a file

Edward Falk :

This tiny Python program:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import json
import sys

x = { "name":u"This doesn't work β" }

json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
print

Generates this output when run at a terminal:

$ ./tester.py
{"name": "This doesn't work β"}

Which is exactly as I would expect. However, if I redirect stdout to a file, it fails:

$ ./tester.py > output.json
Traceback (most recent call last):
  File "./tester.py", line 9, in <module>
    json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b2' in position 19: ordinal not in range(128)

However, a direct print (without json.dump) of a pre-encoded string can be redirected to a file just fine:

 print u"This does work β".encode('utf-8')

It's as if the json package ignores the encoding option if stdout is not a terminal.

How can I get the json package to do what I want?

Edward Falk :

Consolidating all the comments and answers into one final answer:

Note: this answer is for Python 2.7. Python 3 is likely to be different.

The JSON spec says that JSON files are UTF-8 encoded. However, the Python json package plays it safe by default: it writes plain ASCII and escapes any non-ASCII characters in the output.

You can set the ensure_ascii flag to False, in which case the json package generates unicode output instead of str (whenever the data contains non-ASCII characters). In that case, encoding the unicode output is your problem.

There is no way to make the json package generate utf-8 or any other encoding on output. It's either ascii or unicode; take your pick.
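A minimal sketch of that choice (Python 2.7 semantics; the dict and variable names here are my own, not from the question):

```python
# -*- coding: utf-8 -*-
import json

payload = {"name": u"This doesn't work β"}

# Default (ensure_ascii=True): plain-ASCII str, non-ASCII escaped.
ascii_out = json.dumps(payload)
# -> {"name": "This doesn't work \u03b2"}

# ensure_ascii=False: unicode output; encoding it is the caller's job.
unicode_out = json.dumps(payload, ensure_ascii=False)
utf8_bytes = unicode_out.encode('utf-8')  # a UTF-8 byte string, safe to write to any file
```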

The encoding argument was a red herring. That option tells the json package how the input strings are encoded.
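The decode step that the encoding argument performed can be sketched by hand (a sketch only; the argument was removed in Python 3, and `raw` here is my own example value):

```python
# -*- coding: utf-8 -*-
import json

# A byte string containing UTF-8 data -- the kind of input that
# encoding="utf8" told the Python 2 json package how to decode.
raw = b'\xce\xb2'                 # the UTF-8 bytes for β
decoded = raw.decode('utf-8')     # u'β' -- the unicode json works with internally
out = json.dumps({"name": decoded}, ensure_ascii=False)
```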

Here's what finally worked for me:

import codecs  # plus the json and sys imports from the script above

ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)

tl;dr: the real mystery was why it didn't fail when stdout went to the terminal. It turns out that in Python 2, sys.stdout is given an encoding taken from the $LANG environment variable when it is attached to a terminal, so unicode written to it is encoded correctly. When stdout is redirected to a file, no encoding is set, the implicit unicode-to-str conversion falls back to ASCII, and an error is raised at the first non-encodable character.
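That detection can be observed directly (a sketch; the exact values depend on your platform and locale):

```python
import sys

# On Python 2, sys.stdout.encoding is filled in from the locale ($LANG)
# only when stdout is attached to a terminal; when stdout is redirected
# to a file or pipe it is None, and the implicit unicode->str conversion
# falls back to the 'ascii' codec.
effective = getattr(sys.stdout, 'encoding', None) or 'ascii'
redirected = not sys.stdout.isatty()
```

Running a script that prints these two values both at a terminal and with `> out.txt` makes the difference visible.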
