Review: Serious Python

Jul 26, 2020

Serious reading at the driving range.

I just finished reading Serious Python, by Julien Danjou. It was really satisfying to read a technical book cover to cover again, and it was also nice to read a technical book that didn’t require sitting in front of a laptop the whole time.

Serious Python was a really good read for me, because I had heard of most of the topics covered in the book but hadn’t dug into any of them in depth. As I was reading, I could think of a number of previous projects where I could have applied these concepts, and I’ll use many of them in my current and future projects.

The book is well organized; you can either read it cover to cover, or pick the parts you’re most interested in and read just those chapters. I read it cover to cover, skimming just a few parts that are less relevant to my work. You can find a complete table of contents here. I had a lot of takeaways, and even the concepts I won’t use directly leave me with a better understanding of how Python works internally. It also helps me understand some of the code I see in the libraries I use. The parts that will have the most impact on my projects are the sections on unit testing, methods, and optimizing your code.

I liked the profiling section.

I used to keep my books nice and clean, but then I found that marking them up helped me get a lot more out of them. I’ll share some of my takeaways from the sections I marked up most heavily in the book.

Writing better classes and methods

I have long been aware of more advanced concepts related to object-oriented programming, but I’ve gotten a lot of work done using just the basics. As I start to work on more significant data analysis projects though, with larger data sets and more complex analysis, I’m starting to see my code slow down. People used to claim this happens because Python is inherently a slow language, but it almost always means you’re just not using Python efficiently. That certainly applies to me. One clear takeaway from reading this book is how to write better classes and methods that make my intentions clearer to myself and other programmers, and result in more efficient code as well.

Using __slots__

The first takeaway for me is to consider more carefully how my classes will be used, and then choose the most efficient way to structure them. For example, many of my classes will only ever use the attributes I define; they’ll never have new attributes added at run time. One of my projects uses readings from a stream gauge. The gauge measures things like river height and river flow rate. Here’s a simplified version of a class I created, focusing on river height readings and associated timestamps:

class RiverReading:

    def __init__(self, height, ts_reading):
        self.height = height
        self.ts_reading = ts_reading

I’ve written classes like this for a long time now, without spending much time thinking about what’s happening internally when this code is run. Now that my code is slowing down due to the volume of data I’m working with and the complexity of the analysis I’m doing, it’s helpful to know more about what Python is doing internally. One thing I learned from this book is that Python, by default, stores each object’s attributes in a per-object dictionary (__dict__), which is what allows us to add more attributes later. This is the simple flexibility that we love about Python, but when we create a large enough number of objects, the extra memory for all those dictionaries can start to affect performance.

If we know we’re going to use only these attributes, we can tell Python this by defining the __slots__ attribute. This tells Python to build a data structure for just these attributes, without the flexibility it normally includes. Here’s how the RiverReading class would look with __slots__ defined:

class RiverReading:

    __slots__ = ('height', 'ts_reading')

    def __init__(self, height, ts_reading):
        self.height = height
        self.ts_reading = ts_reading

I’m going to try this on my project when I’ve finished this post, and I’m curious to see whether it has any impact on the project’s performance.
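Here’s a minimal sketch of the difference in behavior, not a benchmark. The class names PlainReading and SlottedReading, the flow_rate attribute, and the string timestamps are made up for illustration:

```python
class PlainReading:
    """Regular class: each object carries its own __dict__."""

    def __init__(self, height, ts_reading):
        self.height = height
        self.ts_reading = ts_reading


class SlottedReading:
    """Same class, but restricted to the attributes named in __slots__."""

    __slots__ = ('height', 'ts_reading')

    def __init__(self, height, ts_reading):
        self.height = height
        self.ts_reading = ts_reading


plain = PlainReading(23.25, '2020-07-26T09:00')
slotted = SlottedReading(23.25, '2020-07-26T09:00')

# A regular object accepts new attributes at run time.
plain.flow_rate = 1200

# A slotted object has no per-object dictionary...
has_dict = hasattr(slotted, '__dict__')

# ...so assigning an attribute that isn't in __slots__ fails.
try:
    slotted.flow_rate = 1200
    rejected = False
except AttributeError:
    rejected = True

print(has_dict, rejected)
```

Whether this makes a measurable difference depends on how many objects a program creates, so it’s worth measuring before and after.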

Named tuples

Using __slots__ is good if we know the attributes of a class won’t change, but we need to write a number of custom methods. If we don’t need any methods, we can use an even simpler data structure, named tuples. A named tuple keeps the dot notation that makes class attributes simple to work with, but stores the data even more efficiently. They’re useful if we don’t need methods, and the values assigned to attributes won’t change once an object is created.

Here’s how RiverReading would be written as a named tuple:

from collections import namedtuple

RiverReading = namedtuple('RiverReading', ['height', 'ts_reading'])

You create an object just like you would for a class, and access attributes using dot notation. You can also pass a list of default values for the attributes in a named tuple. The values in a named tuple object can’t be modified, but there’s a _replace() method that lets you create a new object from an existing one, replacing the values of whichever attributes you need to. There’s also an _asdict() method that returns a dictionary representation of the object.
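As a rough sketch of those features, here’s the same named tuple with a default value, assuming Python 3.7 or later for the defaults argument. The string timestamps and the sample values are placeholders:

```python
from collections import namedtuple

# defaults applies to the rightmost fields, so ts_reading defaults to None.
RiverReading = namedtuple(
    'RiverReading', ['height', 'ts_reading'], defaults=[None])

reading = RiverReading(23.25, '2020-07-26T09:00')
height = reading.height  # dot notation, just like a class

# Values can't be modified in place.
try:
    reading.height = 24.0
    mutable = True
except AttributeError:
    mutable = False

# _replace() builds a new object; the original is left untouched.
corrected = reading._replace(height=24.0)

# _asdict() returns a dictionary representation.
as_dict = corrected._asdict()

# The default is used when ts_reading is omitted.
no_timestamp = RiverReading(22.0)

print(corrected, as_dict, no_timestamp)
```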

I don’t know that I’ll use named tuples in my current project, because I need some methods to work with readings. But I know I’ve written regular classes in the past where named tuples would have sufficed, and I’ll keep my eye out for the opportunity to use them in new projects.

Data classes

There’s another option that we should be aware of, if the values assigned to attributes might need to be modified after creation. The dataclass structure is similar to a named tuple, but you can change the value of attributes using dot notation after an object has been created.

Here’s what RiverReading looks like as a dataclass:

import datetime
from dataclasses import dataclass


@dataclass
class RiverReading:
    ts_reading: datetime.datetime
    height: float


ts = datetime.datetime.now()
rr = RiverReading(ts, 23.25)

Data classes require you to declare what type of data each attribute will refer to. The @dataclass decorator automatically generates an __init__() method that creates attributes from the values you pass in when you create an object; you don’t need to manually attach these values to the self object. As with regular classes and namedtuples, you can assign default values for attributes in a dataclass. You can also write custom methods in a dataclass.
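Here’s a small sketch of a default value, a custom method, and changing an attribute after creation. The units field and the height_in_meters() method are hypothetical additions for illustration, not part of my actual project:

```python
import datetime
from dataclasses import dataclass


@dataclass
class RiverReading:
    ts_reading: datetime.datetime
    height: float
    units: str = 'ft'  # hypothetical field with a default value

    def height_in_meters(self):
        """Custom methods work just as they do in a regular class."""
        if self.units == 'm':
            return self.height
        return self.height * 0.3048


rr = RiverReading(datetime.datetime(2020, 7, 26, 9, 0), 23.25)

# Unlike a named tuple, attributes can be changed after creation.
rr.height = 24.0

print(rr, rr.height_in_meters())
```

One thing to keep in mind is that the type annotations are declarations, not checks; Python doesn’t enforce them at run time.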

Data classes weren’t covered in the book, but reading about slots and named tuples reminded me to finally research them, and make sure I’m ready to use them in the next project I work on where they’d be appropriate.

Static methods and class methods

I’ve been aware of static and class methods before, and I’ve used them at times, but I haven’t been entirely clear about how to think about them when designing a class from scratch.

A static method doesn’t need access to any of an object’s attributes, or to the class itself. When this is the case, we should decorate the method definition with @staticmethod. This tells Python that the method doesn’t need a self argument sent each time the method is called. Also, Python doesn’t have to create a bound method for each object the method is called on; it just uses the one function stored on the class, which is slightly more efficient.

A class method needs access to class attributes, but it doesn’t need access to individual objects. A class method should be decorated with @classmethod. These methods automatically receive an argument referencing the class, which is usually labelled cls or klass to avoid a name clash with the keyword class.
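To make the distinction concrete, here’s a sketch using the river reading example. The flood_stage attribute, the feet_to_meters() helper, and the from_csv_row() constructor are all invented for this illustration:

```python
class RiverReading:

    flood_stage = 25.0  # hypothetical class attribute, in feet

    def __init__(self, height, ts_reading):
        self.height = height
        self.ts_reading = ts_reading

    def is_flooding(self):
        """Regular method: needs this object's data, so it takes self."""
        return self.height >= self.flood_stage

    @staticmethod
    def feet_to_meters(feet):
        """Static method: uses neither the object nor the class."""
        return feet * 0.3048

    @classmethod
    def from_csv_row(cls, row):
        """Class method: receives the class and builds a new object from it."""
        ts_reading, height = row.split(',')
        return cls(float(height), ts_reading)


reading = RiverReading.from_csv_row('2020-07-26T09:00,26.5')
print(reading.is_flooding(), RiverReading.feet_to_meters(10))
```

Alternate constructors like from_csv_row() are probably the most common use of class methods; because they call cls() rather than naming the class directly, they keep working in subclasses.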

Optimization and performance

The second big takeaway for me was a more disciplined and informed approach to refactoring. Since I’ve mostly worked on small solo projects for nontechnical users, I’ve gotten away with writing messy code that works. This has been perfectly fine, and I wouldn’t change much going back. Most of these were one-off projects, and there was no need for optimization; my time was better spent working with the results of my code, not the code itself.

Now that I’m writing more complex code and collaborating more with other technical people, I need to pay more attention to what my code looks like and how it performs after I’ve gotten through the exploratory phase of a project. This book was hugely helpful in offering a clear approach to optimization, with a focus on performance. In the past, I would just look for my ugliest and most repetitive code and start refactoring, writing clearer comments and breaking things into more coherent chunks. Sometimes I’d even write a few tests along the way. After reading Serious Python, my approach will be:

Write critical tests, if I haven’t done so already.
Run a profiler on my code to identify bottlenecks.
Focus optimization and refactoring efforts on functions that are run most often, take the most CPU time, and consume the most memory.
Make sure to use efficient data structures, as described earlier.

I was aware of these kinds of tools and approaches, but I never had a pressing need to learn about them earlier. This book offered a great high-level overview that really helps me get started on a more disciplined approach to optimization. I’m going to start with cProfile, and see how far that takes me. I’m really looking forward to using this on one of my current projects, which takes about a minute to run and has code that’s messy enough that I’m embarrassed to show it in its current state.
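As a starting point, here’s a minimal sketch of profiling a piece of code with cProfile and pstats from the standard library. The slow_sum() and analyze() functions are stand-ins for real project code:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive work, so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def analyze():
    return [slow_sum(50_000) for _ in range(20)]


# Collect timing data only for the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
results = analyze()
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative followed by the script name gives a similar report without changing any code.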

Other takeaways

I had many smaller takeaways. Here’s a brief summary of some of these:

I have reasonably good discipline around formatting my code, but I’ve never used a linter or other automated code-checking tools. The book encouraged using tools such as pep8 (since renamed pycodestyle), Pyflakes, Pylint, and flake8. I’ll also check out Black, which wasn’t mentioned but has seen heavy adoption recently.
There was a good summary of how a typical Python project should be set up. My larger projects have a reasonable overall structure, but it was good to see clear recommendations on what experienced Python programmers expect to see in a project structure. This was most helpful for me in thinking about how to organize tests, especially as I start to implement more extensive testing in some of my projects.
Speaking of testing, I have mainly used unittest in the past, but this book has me looking forward to using pytest. In particular, I like the ability to skip some tests and make groups of specific tests. I’m also looking forward to seeing what running tests in parallel does for the performance of my test suites.
The short, one-page description of how generators work was the best explanation I’ve ever read. It’s the first time I was really clear on what the yield keyword does that makes it more efficient than a simple for loop.
I’ve used list comprehensions for years, but never knew the similar syntax for generator expressions.
There was a great section on using any(), all(), map(), and set() to solve a number of common problems in simple, concise ways.
I appreciated the overview of the itertools module, particularly chain(), combinations(), cycle(), groupby(), and permutations().
I’ve written my own code to check if a key exists in a dictionary where the values are lists; if the key exists, add an item to the list. If the key doesn’t exist, add a new key-value pair. I was happy to find that collections.defaultdict already does this.
collections.Counter counts items in a collection, but also finds the most common elements. I’ve written code to do this before.
I’d heard of memoization, but had no idea what it was. It’s the technique of caching the results of a function, so if the function is called again with the same arguments the result is returned immediately, without having to run the function again. This can be a huge performance boost in certain cases. Memoization is implemented in Python using the functools.lru_cache() decorator.
It’s always good to be reminded of PyPy. I’ve never used it, but one of these days I’ll give it a try and see if it makes my code run more efficiently.
There’s a great discussion of multithreading, multiprocessing, and asyncio. I’ve used the first two before, but wasn’t clear on the distinction between these three tools and concepts. I have a better sense now of the purpose and strengths of each, and I can see that for most of my use cases multiprocessing is probably going to have the most impact. Multiprocessing can take advantage of the multiple cores that most of our machines have now.
The with statement, when used to open files, can handle more than one file at a time: you can list several open() calls, separated by commas, in a single with statement. This provides a nice clean way to work with two open files at once, for example for processing data from one file and writing it to a different file. This is cleaner than nested with statements.
When working with SQL, don’t use Python to enforce constraints that are better enforced at the database level. Enforcing constraints in both might feel more secure, but it’s really redundant and inefficient. Instead, focus on using Python to respond to constraint violations such as failed inserts.
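A few of these takeaways are easier to remember with code in front of me. Here’s a short sketch of a generator expression, collections.defaultdict, collections.Counter, functools.lru_cache(), and a with statement that opens two files. All of the data, file names, and the fib() function are made up for illustration:

```python
import os
import tempfile
from collections import Counter, defaultdict
from functools import lru_cache

heights = [21.5, 23.25, 26.0, 24.75]

# Generator expression: like a list comprehension, but without
# building the whole list in memory first.
any_flood = any(h >= 25.0 for h in heights)

# defaultdict: a missing key automatically gets an empty list.
readings_by_day = defaultdict(list)
for day, height in [('mon', 21.5), ('mon', 23.25), ('tue', 26.0)]:
    readings_by_day[day].append(height)

# Counter: counts items and finds the most common ones.
counts = Counter(['low', 'high', 'low', 'normal', 'low'])
most_common = counts.most_common(1)


# lru_cache: repeated calls with the same argument reuse the cached result.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


fib_30 = fib(30)

# One with statement managing two open files at once.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, 'heights_ft.txt')
    dst_path = os.path.join(tmp_dir, 'heights_m.txt')
    with open(src_path, 'w') as f:
        f.write('21.5\n23.25\n')

    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            dst.write(f"{float(line) * 0.3048:.2f}\n")

    with open(dst_path) as f:
        converted = f.read().split()

print(any_flood, dict(readings_by_day), most_common, fib_30, converted)
```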

It’s been a long time since I read a technical book cover to cover, and learned as much as I could from it. As a mostly self-taught programmer, I fell into many of the less disciplined and less efficient approaches that Julien Danjou set out to help people move past. I appreciate his efforts in putting together this fantastic resource.

If you want to work through Serious Python yourself, you can buy it direct from No Starch Press and you’ll get a copy of the ebook with your print copy. You can also find it at Barnes and Noble, and on Amazon. (I do not use affiliate links, and I was not asked to write this review.)

ehmatthes.com

code, politics, and life
