Serhiy Storchaka
2015-08-12 07:36:05 UTC
New submission from Serhiy Storchaka:
While trying to implement a UTF-7 codec in Python, I found some warts in error handling.

1. Non-ASCII bytes.

>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'

The terminating '-' at the end of the string is optional.

>>> b'a+IKw'.decode('utf-7')
'a€'

And sometimes it is optional in the middle of the string (if the following char is not used in BASE64).

>>> b'a+IKw;b'.decode('utf-7')
'a€;b'

But if the following char is not ASCII, it is accepted as well, and this looks like a bug.

>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'

2. Ending lone high surrogate.

Lone surrogates are silently accepted by the utf-7 codec.

>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence
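
To see where the lone surrogate sits in each case, the BASE64 payloads above can be unpacked into UTF-16 code units by hand. This is only a rough sketch that does not touch the utf-7 codec itself; the helper names utf16_units and lone_surrogates are made up for this illustration.

```python
import base64


def utf16_units(b64_text):
    """Unpack the BASE64 part of a shift sequence into 16-bit code units.

    Hypothetical helper: UTF-7 omits the '=' padding, so it is restored
    here before handing the text to the standard base64 decoder.
    """
    padded = b64_text + "=" * (-len(b64_text) % 4)
    raw = base64.b64decode(padded)
    whole = len(raw) - len(raw) % 2  # ignore a trailing odd byte, if any
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, whole, 2)]


def lone_surrogates(units):
    """Return indexes of surrogates that are not part of a high+low pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:  # low surrogate without a high one
            lone.append(i)
        i += 1
    return lone


for payload in ("2OTYNN0h", "2DTdIdjk"):
    units = utf16_units(payload)
    print(payload, [hex(u) for u in units], lone_surrogates(units))
# 2OTYNN0h ['0xd8e4', '0xd834', '0xdd21'] [0]
# 2DTdIdjk ['0xd834', '0xdd21', '0xd8e4'] [2]
```

In the first payload the lone high surrogate comes first and is followed by a complete pair; in the second it is the very last code unit of the sequence, which is the "ending lone high surrogate" case.
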

3. Incorrect shift sequence.

Strange behavior happens when a shift sequence ends with wrong bits.

>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'

The decoder first decodes as many characters as it can, and then passes the whole shift sequence (including already decoded bytes) to the error handler. Not sure this is a bug, but this differs from the common behavior of other decoders.
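
For readers who want the "wrong bits" spelled out: three BASE64 characters carry 18 bits, of which 16 form one UTF-16 code unit and 2 are left over. The sketch below compares 'IKw' (from the valid examples) with 'IKx'. It is a hand-rolled illustration, not the codec's code, and it assumes that "wrong bits" means non-zero leftover bits; the helper name split_shift_sequence is invented here.

```python
import string

# The 64-character BASE64 alphabet, in index order.
B64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def split_shift_sequence(b64_text):
    """Split a BASE64 payload into 16-bit code units plus leftover bits.

    Hypothetical helper: returns (code_units, leftover_bits), where
    leftover_bits is a string of '0'/'1' characters.
    """
    bits = "".join(format(B64_ALPHABET.index(ch), "06b") for ch in b64_text)
    whole = len(bits) - len(bits) % 16
    units = [int(bits[i:i + 16], 2) for i in range(0, whole, 16)]
    return units, bits[whole:]


for payload in ("IKw", "IKx"):
    units, leftover = split_shift_sequence(payload)
    print(payload, [hex(u) for u in units], repr(leftover))
# IKw ['0x20ac'] '00'
# IKx ['0x20ac'] '01'
```

Both payloads yield the same complete code unit, U+20AC, which matches the '€' appearing in the output before the error handler runs; they differ only in the two trailing bits.
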
----------
components: Unicode
messages: 248450
nosy: ezio.melotti, haypo, lemburg, loewis, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Warts in UTF-7 error handling
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6
_______________________________________
Python tracker <***@bugs.python.org>
<http://bugs.python.org/issue24848>
_______________________________________