Nov 072013

Update: Thanks to Edward Kmett, this issue was quickly fixed for those of you who depend on Bytes >= 0.13. Cereal was also updated. There is some interesting discussion on this issue on the Haskell Reddit page. In the end I learned about some of the inner workings of serialization in Haskell, and learned to appreciate the extremely helpful and responsive Haskell community that is strongly dedicated to making their ecosystem and tools better in response to feedback.

Recently I’ve been making some minor contributions to Elm, a fantastic, functional language that compiles to Javascript. If you haven’t already seen some of the things that you can create with Elm using a relatively small amount of code, you should spend some time checking out the project’s web site.

Last weekend I did some work on adding a bit more robustness to the Elm compiler’s mechanism of loading interface definitions of compiled programs (1, 2). I’m not going to go into details about the checks that I added since they’re probably not that interesting (hopefully they just work as intended and it makes it easier and more predictable for you to use the Elm compiler!), but I did come across some interesting behavior with regard to Haskell’s serialization of certain data types while I was hacking on this part of the code.

Basically, if you serialize a data structure such as a Hash/Map/Dictionary in a given language, and then re-read that data structure, you’d expect to get back something like the original dictionary, right? This may or may not be the case in Haskell. Here’s a simple example:

import Data.Binary as B
import Data.Map as M

myMap :: M.Map String String
myMap = B.decode b
    where m = [("one", "blue"), ("two", "red"), ("three", "green")]
          b = B.encode m

main = do
  putStrLn $ "The third key is: " ++ show (M.member "three" myMap)

On line 10 above we use M.member, which takes a key and returns a Boolean indicating if that key is in the Map. If you’re new at Haskell you may expect this to return True. Look closely at the structure that is serialized on line 6, though. It’s a List of two-element tuples. We could have turned this into a Map with M.fromList, but we didn’t – the actual structure that we serialize is a List!

Now, when we decode the serialized structure on line 5, what do you think will happen? We don’t specify a type on that line, and if you look at the type of the function B.decode it’s as follows:

λ: :t B.decode
  :: Binary a => Data.ByteString.Lazy.Internal.ByteString -> a

It takes a ByteString, and returns an ‘a’ which, according to the type constraint on the function should be something in the Binary typeclass. Readers of this post familiar with Haskell may notice that Haskell will try to return a Map, specifically with keys and values of type String. This is because of the type signature on line 4.

If you’re still following, now it’s time for the real fun to begin. To summarize the process above, Haskell used type inference to figure out what the type should be that it attempts to deserialize. You may think that you would get a runtime error when Haskell determines that a given ByteString can’t be decoded into a Map since it’s actually a serialized List, but you’d be wrong! Instead it returns a corrupt Map which doesn’t know about all of the keys it contains (for some reason related to the internal representation of the Map it sees some of them, though):

λ: M.member "one" myMap
λ: M.member "two" myMap
λ: M.member "three" myMap

When requested, Haskell happily deserializes into a Map with a broken internal representation without warning but with incorrect behavior. Interestingly, the Map can tell you it’s broken when asked (the function M.valid on the map returns False). In my case though the warning came too late, after I had already spent a bunch of time tracking down a bug in the program.

Although I haven’t spent much time in Haskell, it really surprised me how deserialization can lead to strangely broken data structures. In conjunction with the way that type inference is used to determine the target class to which to deserialize, it lead to an error that not only went uncaught by the compiler, but caused strange behavior at runtime as well. This is in contrast to pretty much all of my other experiences in Haskell, where I’ve seen the language and the libraries really be a great help in detecting trouble before even running the program.

In conclusion, if you’re using Haskell’s serialization on types like a Map (and perhaps others?) think carefully about the assumptions that type inference is making for you in terms of deserialization, and carefully check your work by hand (or better, with automated tests). Please let me know in the comments if others have had this experience, and if you think that this is something that can be addressed to make serialization of data structures like Data.Map more safe through providing safer defaults. Alternatively, let me know if there is something that could have helped me catch this error earlier since I’m new in the language. Thanks!

  3 Responses to “Mind-bending behavior for deserialization in Haskell”

  1. It looks like Data.Map serializes its keys in ascending order, and expects to deserialize them the same way, most likely for performance reasons. There are functions to build a map from an ascending key-value list in O(n), rather than the O(n*log n) of fromList, with the caveat that if the list is not in ascending order the result will be invalid. That is probably why you could see the keys for “one” and “two” but not “three” (which sorts before “two”); “three” was inserted into a space *after* “two”, but the member function expected to find it *before* “two”.

    The result is exactly what you would expect if you think of “decode . encode :: [(a,b)] -> Map a b” as the equivalent of “fromAscList”, though the relationship is probably accidental. I doubt anyone intended for lists of any sort to be deserialized as maps.

    • Thanks Jesse – it seems that is what is occurring. If you check the links at the top of the blog post, there have been some fixes made to the bytes library so that if you do make this mistake of deserializing a list of tuples to a Map, you won’t get a corrupted map in return.

      Thanks for your comment!

  2. I think “decode” should get a big warning since it only works on bytestrings that are created with “encode”. If you read data from disk then you cannot assert that. You have to use decodeOrFail. A special ByteString type annotated with the type it serializes would prevent your bug.

 Leave a Reply



You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>