20180415

How does the 'in' keyword work in Python?

A subtitle to this post could be More yak shaving

A few days go I played a bit with a naive implementation of Bloom filters in Python. I wanted to time them against just checking whether a field is in a set/collection. I found something slightly puzzling: it looked like in worked too fast for smaller lists. And I wondered: maybe small lists are special internally, and allow for really fast lookups? Maybe they have some internal index? This raised the question: how does in find stuff in sequences?


Instead of reading the documentation I delved into the cpython source. Skipping the documentation was a bad idea, since it is not only pretty comprehensive but explains everything... if you know what you are looking for. But a dive into this big pile of C was more fun. I was also stupid enough to not Google the answer until I was editing this post... This excellent Stack Overflow answer covers more or less what I explain here.

The in keyword (defined here) is actually a shorthand for running __contains__ from the target object, you probably know this already. Your class will be able to provide in if you add this magic method, you can read about this in the first chapter of Fluent Python. But tracing this inside cpython is a bit cumbersome and got me diving into interesting pieces of the code. By the way, how the Python compiler works is documented here.

First, after we have parsed the code and generated a parse tree from our text, we go to the abstract syntax tree. Converting the string in as part of some node of Python source in this tree into an In here:


And now, what does actually In do? Well, we need to move forward in the compilation chain, and check compile.c:


This is inside a function called cmpop, which is called when we find the COMPARE_OP opcode. This is the opcode we’d see by disassembling anything running in or == or any other comparison operator (all comparison operators are in the same basket, see here). We can follow the route through ceval.c now:


So, we are calling this PySequence_Contains thingy. A bit more grep and we can find it defined in abstract.c:


And now we can see what it does:
  1. Get a pointer to the sequence’s base sequence methods in the C struct slot tp_as_sequence
  2. If there is a sq_contains method pointer there, invoke it and return
  3. Otherwise, use iterative search
But hey, where is my __contains__? The magic happens on new object/class/type definition: typeobject.c The base object all classes extend from looks like the following struct (from object.h):


This means that, technically, all objects have a field in their defining struct for sequences: tp_as_sequence. This is populated when we define a new class (which internally is known as type) in typeobject.c. Slots are populated from what is essentially a dictionary of methods by invoking fixup_slot_dispatchers. This maps the python name __contains__ to the corresponding slot in the struct, sq_contains and defines which function sets it up, slot_sq_contains:


Built-in objects (and likely libraries with C extensions) implement these directly in C, and point its slot to the C method:


Finally, this method looks for a method defined in the class and called __contains__. If it is None (that is, it is defined and is None), object is not a container, that’s it. If it is not defined (hence the null, and this one is actually puzzling... reduces to it not being provided when defining the class... I think), Python falls back to iterating for search using __iter__ (which is what eventually gets called under PySequence_IterSearch). If this is also not valid or available, an error returns, following a chain of -1 in the method lookups.

If you have been paying attention you’ll see that we are actually deferring to iterative search in two places: when defining the slot in sq_contains but also when invoking PySequence_Contains. I’m not 100% sure about why this is the case, and experimenting with the REPL does not get you very far, since you can never be sure if you are hitting PySequence_Contains -> PY_ITERSEARCH_CONTAINS or PySequence_Contains -> sq_contains -> PY_ITERSEARCH_CONTAINS without changing the messages (and I don’t feel like recompiling Cpython). Weirdly, the second case should be faster since it is going straight for the method via the slot without needing an extra method lookup.

As expected, dictionary lookup is fancier. It is a common and known performance improvement in Python to change lists or other sequence-like datatypes for dictionaries, since they show the best performance for most operations. Since internally in Python everything from methods/functions to classes is implemented in one way or another as dictionaries (or reusing the machinery that is built for dictionaries), anything that speeds dictionaries up, speeds the whole of Python code. Of course, dictionary lookup is usually fast no matter the language: hash table lookup in general (leaving aside how collision resolution might be implemented in lookup) is O(1) fast. Note that below we have the macro: #define PyUnicode_CheckExact(op) (Py_TYPE(op) == &PyUnicode_Type)


There even is a specialised method for cases when the hash key is known (not sure of the use case, since magic methods are hardcoded in the object structs, maybe it’s used to optimise tighter loops?).


And if you wonder how key in dict works, it is of course by introducing the method in the struct for sequences. The snippet below is from dictobject.c, as the methods above.


And here finishes my exploration of cpython to figure out how in contains. Not sure about you, but I had a lot of fun.
Written by Ruben Berenguel