thats what your (1.3) django says about
entry_word = hashlib.new('md5', entry_word_id).hexdigest()
on my local machine this is okay
also when 'entry_word_id' is non-ASCII there's a UnicodeException, and neither decode nor encode to utf8 helped
:(
The MD5 hash is defined in terms of bytes, not Unicode code points, so if you have a unicode object then you need to encode it to a str (i.e. a byte representation such as UTF-8) before calculating the hash of it. Obviously, for verifying the hash you need to use the same encoding on all systems which use the hash. This is fundamental to the nature of MD5 (and many other hash functions); it isn't a PA-specific or even Python-specific issue.
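To illustrate why the encoding has to match everywhere, here is a minimal sketch that hashes the same text under two different encodings. The word is the one from this thread; the choice of UTF-16 as the second encoding is just an assumption for the comparison.

```python
import hashlib

# The non-ASCII word from this thread, written with escapes so the
# source file itself stays pure ASCII.
word = u"\u0447\u0435\u043b\u043e\u0432\u0435\u043a"

# Same text, two different byte representations.
utf8_digest = hashlib.new("md5", word.encode("utf-8")).hexdigest()
utf16_digest = hashlib.new("md5", word.encode("utf-16-le")).hexdigest()

# The hash only ever sees bytes, so different encodings give different digests.
print(utf8_digest)
print(utf16_digest)
print(utf8_digest != utf16_digest)
```

So if one system hashes the UTF-8 bytes and another hashes some other encoding of the same text, the digests will never match.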
Your one line of code is actually a shortcut for this:
m = hashlib.new("md5")
m.update(entry_word_id)
entry_word = m.hexdigest()
In this case, if entry_word_id is a unicode object then the call to update() first tries to convert it to a str, because the MD5 hash requires bytes. To do this it uses the default ascii encoding, under which standard ASCII characters with values under 128 remain unchanged and any other unicode characters raise a UnicodeEncodeError. If the string happens to all be standard ASCII then everything is OK, but if not then the conversion fails because Python can't guess which encoding you want to use. In this case, you'd get an exception from the update() call something like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
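As a rough sketch, the implicit conversion behaves much like calling encode("ascii") yourself, so you can reproduce the same kind of failure without hashlib involved at all. The variable names here are only for illustration.

```python
# The non-ASCII word from this thread.
word = u"\u0447\u0435\u043b\u043e\u0432\u0435\u043a"

try:
    word.encode("ascii")  # what the implicit conversion effectively attempts
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)  # the 'ascii' codec complains about the non-ASCII characters

# Pure-ASCII text converts without complaint, which is why such
# code can appear to work until the first non-ASCII value shows up.
ascii_ok = u"hello".encode("ascii")
print(failed, ascii_ok)
```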
I suspect in your one-line version this exception is simply being caught by a generic handler within the new() method, which then produces the following error:
ValueError: unsupported hash type md5
Arguably it's a small bug in Python's hashlib library that it hides errors like this, but it's an understandable one and easily worked around by using the longer form.
In any case, you should be able to fix your issue by always using a fixed encoding to convert to bytes:
entry_word = hashlib.new("md5", entry_word_id.encode("utf-8")).hexdigest()
I know you said encoding to UTF-8 didn't help, but this definitely resolved the issue for me when I just reproduced it on a console, so could you try exactly the code above and confirm which error you're still seeing (if any)? It may be that you're getting an error from the encoding rather than from calculating the hash, but it's hard to tell without seeing the exact error.
Also, if you still have problems I'd suggest converting your code to the longer form with a manual call to the update() method as above, at least temporarily while you're tracking problems down. This means the exceptions thrown should be easier to trace. Please post the error you get from this longer form if you still have problems (including the full traceback).
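For debugging, something along these lines keeps the encoding step and the hashing step on separate lines, so a traceback points at whichever one actually failed. This is only a sketch: md5_hex is a hypothetical helper name, and defaulting to UTF-8 is an assumption you'd want to match across all systems that use the hash.

```python
import hashlib


def md5_hex(text, encoding="utf-8"):
    """Hypothetical helper: return the MD5 hex digest of text or bytes."""
    if isinstance(text, bytes):
        data = text                    # already bytes, use as-is
    else:
        data = text.encode(encoding)   # encoding errors surface on this line
    m = hashlib.new("md5")             # algorithm lookup errors surface here
    m.update(data)                     # hashing proper, always given bytes
    return m.hexdigest()


# Text and its UTF-8 bytes hash to the same value.
print(md5_hex(u"\u0447\u0435\u043b\u043e\u0432\u0435\u043a"))
print(md5_hex(u"abc") == md5_hex(b"abc"))
```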
Cartroo, thank you. I found this:
entry_word_id = "%s_%s".encode("utf8") % (word_info.word, word_info.entryId)
entry_word = hashlib.new('md5', entry_word_id).hexdigest()
where value of entry_word_id is :
u'\u0447\u0435\u043b\u043e\u0432\u0435\u043a_00036e54b0f0f679b0cfbbadd94e4d78'
and it leads to ValueError "unsupported hash type md5".
If I write like that (encode inside hashlib's constructor):
entry_word_id = "%s_%s" % (word_info.word, word_info.entryId)
entry_word = hashlib.new('md5', entry_word_id.encode("utf8")).hexdigest()
the page opens fine with no exceptions. To me both variants look equal (but to Python they aren't).
The first example still ends up producing a unicode object because the encode() method is called too early. What that code is doing is encoding the string "%s_%s" into UTF-8, which yields exactly the same string because all those characters are ASCII. Then it uses the string format operator % to substitute in the two values. If either of these is a unicode type then Python upgrades the whole string to unicode, and that's what you end up passing into hashlib.new().
Maybe it's easier to read with extra brackets to make it clear:
entry_word_id = ("%s_%s".encode("utf-8")) % (word_info.word, word_info.entryId)
I think what you were trying to do is this, with the brackets moved, which should work:
entry_word_id = ("%s_%s" % (word_info.word, word_info.entryId)).encode("utf-8")
But personally I think that looks a little ugly. I would prefer:
entry_word_id = "%s_%s" % (word_info.word, word_info.entryId)
entry_word = hashlib.new("md5", entry_word_id.encode("utf-8")).hexdigest()
You can choose whichever version you prefer.