One important aspect of working with data is serialisation. Basically, this means that objects can be persisted to storage (e.g. the file system, HDFS or S3). Spark supports many file formats. In this tutorial, however, we will look at how to deal with JSON, a very popular file format that is often used with Spark. We start with serialisation in Python.
What is it and how does Python serialization work?
JSON stands for “JavaScript Object Notation” and was originally developed for client-server applications, with JavaScript as its main user. It was built to have less overhead than XML.
First, let’s start with copying objects. Basically, Python knows two ways: shallow (normal) copies and deep copies. The difference is that a shallow copy creates a new object but only copies the references to the objects inside it, so the copy and the original share those nested objects. This is relevant when an object holds other objects (e.g. lists or class instances) as attributes. A deep copy shares no references: every nested object is copied recursively into the new object. This means that you can now use it independently of the original.
To copy an object, you only need to import copy and call its copy or deepcopy function. The following code shows how this works.
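import copy

# Person is the class created earlier in this tutorial
ps1 = Person("Mario", 35)
pss = copy.copy(ps1)      # shallow copy
psd = copy.deepcopy(ps1)  # deep copy

# Rebind the name on the original only
ps1.name = "Meir-Huber"

print(ps1.name)
print(pss.name)
print(psd.name)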
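Both copies keep the old name, because assigning a new string to ps1.name only rebinds the attribute on ps1 itself. The output should be this: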
Meir-Huber
Mario
Mario
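The example above rebinds a plain attribute, so it cannot show where shallow and deep copies actually differ. The following minimal sketch mutates a nested object instead. Note that the Address class, the address attribute and the city names are hypothetical and only added for illustration; they are not part of the Person class from this tutorial.

```python
import copy


class Address:
    """Hypothetical helper class, used only to give Person a nested object."""

    def __init__(self, city):
        self.city = city


class Person:
    """Stand-in for the tutorial's Person class, extended with an address (assumption)."""

    def __init__(self, name, age, address):
        self.name = name
        self.age = age
        self.address = address


original = Person("Mario", 35, Address("Vienna"))
shallow = copy.copy(original)
deep = copy.deepcopy(original)

# Mutate the nested object in place instead of rebinding an attribute.
original.address.city = "Graz"

print(shallow.address is original.address)  # True: the nested object is shared
print(shallow.address.city)                 # Graz
print(deep.address.city)                    # Vienna
```

The shallow copy sees the change because it points to the same Address object, while the deep copy has its own.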
JSON serialization in Python
Now, let’s look at how we can serialise an object with the use of JSON. Basically, you need to import “json”. An object that you want to serialise needs to be serialisable. Python’s built-in types such as dict, list, str, int, float, bool and None are supported out of the box. However, when we want to serialise our own object (e.g. the “Person” class that we have created in this tutorial), “dumps” raises a TypeError unless we provide a custom serialiser. Fortunately, Python lets us access all instance attributes of an object via its “__dict__” dictionary. This means that for simple objects we don’t have to write our own serialiser and can do this via an easy call to “dumps” of “json”:
import json

js = json.dumps(ps1.__dict__)
print(js)
The call above creates a JSON representation of all attributes of the object:
{"name": "Meir-Huber", "age": 35}
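For comparison, here is a minimal sketch of the custom serialiser route mentioned above, using a subclass of json.JSONEncoder. The PersonEncoder name is made up for this example, and the Person class is a stand-in that assumes only a name and an age attribute.

```python
import json


class Person:
    """Stand-in for the tutorial's Person class (assumed to have name and age)."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


class PersonEncoder(json.JSONEncoder):
    """Hypothetical encoder that knows how to serialise Person objects."""

    def default(self, o):
        if isinstance(o, Person):
            return o.__dict__
        # Let the base class raise TypeError for anything else.
        return super().default(o)


ps1 = Person("Meir-Huber", 35)

# Without a custom serialiser, json does not know what to do with a Person.
try:
    json.dumps(ps1)
except TypeError as err:
    print("plain dumps failed:", err)

js = json.dumps(ps1, cls=PersonEncoder)
print(js)  # {"name": "Meir-Huber", "age": 35}
```

For a single simple class, passing “__dict__” directly is shorter. An encoder class may pay off once several places in a program need to serialise the same types.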
We might want to add more information to the JSON string, e.g. the name of the class that the data originally came from. We can do this by passing a custom function to “dumps” via its “default” parameter. “dumps” calls this function for every object it cannot serialise on its own, with the object as the only argument. We therefore pass the original object (the Person) and the function we want to execute. We name this function “make_nice”. In the function, we create a dictionary and add the name of the class as its first entry under the key “obj_name”. We then merge the “__dict__” of the object into the new dictionary and return it.
Finishing the serialization
Another parameter we pass to the “dumps” function is “indent”. It only pretty-prints the output by adding line breaks and indentation, which improves readability. The function and the call look like this:
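def make_nice(obj):
    # Start with the class name, then add all instance attributes.
    # The variable is called "result" so it does not shadow the built-in dict.
    result = {"obj_name": obj.__class__.__name__}
    result.update(obj.__dict__)
    return result

js_pretty = json.dumps(ps1, default=make_nice, indent=3)
print(js_pretty)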
And the result should now look like the following:
{
   "obj_name": "Person",
   "name": "Meir-Huber",
   "age": 35
}
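One detail that is easy to miss: the function passed as “default” is called for every object that “dumps” cannot handle, including objects nested inside other objects. The sketch below illustrates this with a hypothetical Team class that holds a list of Person objects. Team and its attributes are assumptions made up for this example.

```python
import json


class Person:
    """Stand-in for the tutorial's Person class (assumed to have name and age)."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


class Team:
    """Hypothetical container class holding several Person objects."""

    def __init__(self, title, members):
        self.title = title
        self.members = members


def make_nice(obj):
    # Same idea as in the tutorial: class name first, then the attributes.
    result = {"obj_name": obj.__class__.__name__}
    result.update(obj.__dict__)
    return result


team = Team("Data", [Person("Mario", 35)])

# make_nice runs once for the Team and once for each nested Person.
js_team = json.dumps(team, default=make_nice)
print(js_team)
```

Each nested Person ends up with its own “obj_name” entry, so the class information is kept at every level.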
Now we know how to serialise an object to a JSON string. Basically, you can now store this string in a file or in an object on S3. The only thing that we haven’t discussed yet is how to get an object back from a string. To do so, we take the JSON string we created with “dumps” before. Our goal now is to create a Person object from it. This can be done via the “loads” function of the json module. We also define a function that does the conversion and pass it via the “object_hook” parameter. This function has one argument: the dictionary decoded from the JSON object. We access each of its values by key and return the new object.
str_json = "{\"name\": \"Meir-Huber\", \"age\": 35}"

def create(obj):
    # obj is the dictionary decoded from the JSON object
    print(obj)
    return Person(obj["name"], obj["age"])

obj = json.loads(str_json, object_hook=create)
print(obj)
The output should now look like this (the memory address will differ on your machine):
{'name': 'Meir-Huber', 'age': 35}
<__main__.Person object at 0x7fb84831ddd8>
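Storing the JSON in a file works the same way with “dump” and “load”, the file-based counterparts of “dumps” and “loads”. The following is a minimal sketch of a full round trip. The file name person.json and the temporary directory are assumptions chosen for this example, and the Person class is again a stand-in with only a name and an age.

```python
import json
import os
import tempfile


class Person:
    """Stand-in for the tutorial's Person class (assumed to have name and age)."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


def create(obj):
    # obj is the dictionary decoded from the JSON object.
    return Person(obj["name"], obj["age"])


ps1 = Person("Meir-Huber", 35)

# Hypothetical file location inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "person.json")

    # Write the attributes of the object to the file as JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ps1.__dict__, f)

    # Read the file again and turn the JSON back into a Person.
    with open(path, encoding="utf-8") as f:
        restored = json.load(f, object_hook=create)

print(restored.name, restored.age)  # Meir-Huber 35
```

Keep in mind that “object_hook” is called for every JSON object in the input, so a hook like create only fits data in which each object really describes a Person.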
Now we know how to serialise objects to JSON and how to get them back from a string value. In the next tutorial, we will look at how to improve this and make it more dynamic by creating classes dynamically in Python.
If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.