Why you should consider Python for your backend

Anirban Mukherjee
5 min read · Oct 15, 2022


Python used to be written off as just a scripting language; today, 21% of Facebook's infrastructure is powered by Python. In this post, let us dive into how Python has grown in popularity on large-traffic websites.

Python was originally built to be an easy-to-learn, easy-to-use language. Over the years, it has found tremendous acceptance in fields like ML computation, DevOps, and now high-traffic backends. Unlike Rust or Go, Python was not built for a singular purpose, but it has always allowed for easy extension and adaptation when you have a specific purpose in mind.

Why adoption on high-traffic websites WAS low

Python as a language (more precisely, the CPython interpreter) has the concept of a Global Interpreter Lock (GIL): a lock that ensures that only 1 thread can execute Python bytecode at a given time, irrespective of how many CPU cores the machine has.

A Python program can fork out multiple processes (the multiprocessing module). Each process gets its own instance of the Python interpreter and thereby its own GIL. Each process can also spawn multiple threads (the threading module), and all threads of the same process are ‘governed’ by the same GIL. The OS is free to schedule those threads across different CPU cores, but because they share a single GIL, only one of them can execute Python bytecode at any instant.

Since the GIL allows only a single thread to run its bytecode at a time, the threads of a process essentially execute sequentially; as far as Python code is concerned, it is as if they all shared a single CPU core. If one thread is stuck on an I/O operation (for example), it can release the GIL and make way for another thread to acquire it, but the execution of Python code still happens sequentially, one thread at a time.
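A minimal sketch of the effect (the loop size is arbitrary and timings will vary by machine): two threads running a CPU-bound function take roughly as long as running it twice in a row, while two processes, each with its own GIL, can actually use two cores.

```python
import time
from threading import Thread
from multiprocessing import Process

def burn_cpu(n: int = 10_000_000) -> None:
    # Pure-Python, CPU-bound loop; it holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i

def run_two(worker_cls) -> float:
    # Start two workers of the given kind and time them to completion.
    workers = [worker_cls(target=burn_cpu) for _ in range(2)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"2 threads:   {run_two(Thread):.2f}s")    # GIL-bound: roughly 2x a single run
    print(f"2 processes: {run_two(Process):.2f}s")   # own GILs: roughly 1x a single run
```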

What else?

Each thread gets its own dedicated memory stack; on top of that, creating and maintaining a thread is a good amount of work for the Python VM. Switching between 2 threads is even more work: every time one thread releases the GIL and another acquires it, the cost of the context switch is significant. In essence, there are 3 things hurting us here (see the sketch after this list):

  1. The amount of memory on the process stack dedicated to each thread.
  2. The cost of switching context between threads.
  3. Inability to use more than 1 CPU core in parallel.
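To put some rough numbers on points 1 and 2 (the exact values are platform-dependent), CPython itself will tell you the per-thread stack size setting and how often it forces a running thread to give up the GIL:

```python
import sys
import threading

# 0 means "use the platform default" stack size for new threads,
# which is typically on the order of megabytes per thread.
print("thread stack size setting:", threading.stack_size())

# How often (in seconds) the interpreter asks a running thread to release
# the GIL so another thread can be scheduled; 0.005s by default.
print("GIL switch interval:", sys.getswitchinterval())
```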

Coroutines and concurrency

With asyncio fully maturing by Python 3.9, coroutines in Python have been gaining traction. A coroutine is essentially a piece of application code that can be switched in and out within the context of the same thread.
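In its smallest form, a coroutine is just a function defined with async def; the await point is where it can be switched out. (A sketch; the sleep stands in for any operation worth waiting on.)

```python
import asyncio

async def fetch_greeting() -> str:
    # At this await, the event loop is free to run other coroutines
    # on the same thread while this one "waits".
    await asyncio.sleep(1)
    return "hello"

print(asyncio.run(fetch_greeting()))
```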

Say you have a thread that handles an incoming HTTP request, needs to update a Redis cache and a DB, and then writes out an HTTP response. In a traditional threading model, the sequence of execution will be:

  1. read in HTTP request
  2. update Redis cache
  3. update DB record
  4. write HTTP response

Even though the updates to Redis and the DB may not depend on each other, the thread still has to do them sequentially. There is simply no other way out.
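As a sketch (the sleeps simulate network latency; update_redis and update_db are stand-ins for real client calls), the blocking version looks like this:

```python
import time

def update_redis(data: dict) -> None:
    # Stand-in for a blocking Redis update.
    time.sleep(0.1)

def update_db(data: dict) -> None:
    # Stand-in for a blocking database write.
    time.sleep(0.1)

def handle_request(data: dict) -> dict:
    update_redis(data)   # waits ~0.1s for the cache
    update_db(data)      # then waits another ~0.1s for the DB
    return {"status": "ok", **data}

print(handle_request({"user": "alice"}))  # total waiting time: ~0.2s
```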

With coroutines, the thread can move on to the next operation whenever it encounters an operation it would otherwise “need to wait on”. After the thread makes the outgoing call to Redis, it does not need to wait for Redis to ACK the update; it can simply move on. The sequence of execution becomes (sketched in code below):

  1. read in HTTP request
  2. send an update request to Redis in coroutine1, and move on
  3. send an update request to DB in coroutine2, and move on
  4. join() back both coroutine1 and coroutine2
  5. write HTTP response
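The same handler rewritten with coroutines (again simulating the I/O with sleeps; asyncio.gather plays the role of the join() in step 4):

```python
import asyncio

async def update_redis(data: dict) -> None:
    # Stand-in for a non-blocking Redis update.
    await asyncio.sleep(0.1)

async def update_db(data: dict) -> None:
    # Stand-in for a non-blocking database write.
    await asyncio.sleep(0.1)

async def handle_request(data: dict) -> dict:
    # Both updates are started together and awaited together, so their waits overlap.
    await asyncio.gather(update_redis(data), update_db(data))
    return {"status": "ok", **data}

print(asyncio.run(handle_request({"user": "alice"})))  # total waiting time: ~0.1s
```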

As you can see, the I/O-bound operations now overlap instead of queuing up behind each other, so the overall handling time drops. This is a classic example of concurrency in design.

Concurrency is the ability to break tasks down so they can execute independently of each other; parallelism is the ability to execute 2 (or more) tasks at the same time.

Concurrency is required for parallelism to be possible, but not necessarily the other way around.

This solves our 2nd problem, the cost of context switching, as there is nothing to be switched on the CPU: nothing for the Python VM to acquire the GIL for, and nothing that the OS has to know about or do!

Coroutines — memory footprint

Coroutines are not a feature of the operating system and do not map directly to any OS functionality. They are a programming-language feature that lets the underlying runtime (the Python VM in this case) switch between tasks (or rather, code sections) without involving the OS at all.

Hence, coroutines do not need their own memory space on the process stack; how their variables and code blocks share memory is entirely up to Python to decide. Python does not create a separate memory stack for a coroutine, so the memory overhead of opening a new coroutine is negligible, which lets programmers open far more coroutines than they could ever open threads.
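A rough way to see this (the count is illustrative; trying to create this many OS threads would exhaust memory on most machines):

```python
import asyncio

async def tiny_task(i: int) -> int:
    await asyncio.sleep(0)   # yield to the event loop; no real work
    return i

async def main() -> None:
    # 50,000 coroutines: each is a small heap object, not an OS thread
    # with its own dedicated stack.
    results = await asyncio.gather(*(tiny_task(i) for i in range(50_000)))
    print("finished", len(results), "coroutines")

asyncio.run(main())
```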

Coroutines — what we gain

With coroutines, a single thread can now finish off independent tasks much faster. If your API server code has lots of I/O-heavy functions (like making HTTP calls, updating DB records and caches), then with coroutines your individual threads (assuming a single thread per HTTP connection) spend far more of their time simply waiting, while tasks that are independent of each other get fired off simultaneously.

Further optimizing context switches

Even though our threads now spend more of their time waiting, in real-world applications they still need to do some computing. And that means contending for the GIL.

Python, as a language, was built on the premise of having C modules pluggable into it, and those modules have the freedom to release and acquire the GIL as they wish. CPython (the reference implementation of Python) by design implements its I/O operations as:

  1. release the GIL
  2. perform the operation: write(), recv(), accept(), etc
  3. acquire the GIL

This means it releases the GIL before doing any I/O and re-acquires it only when it needs to do CPU computation again. This pattern keeps the GIL free for much longer stretches per thread and lets more threads get useful CPU time within a given window.
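You can observe this from pure Python: blocking calls such as socket reads, file writes, or time.sleep release the GIL while they wait, so threads doing I/O overlap even though only one of them can execute bytecode at a time. (A sketch; time.sleep stands in for a real recv() or write().)

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io() -> None:
    # Like recv()/write() in CPython, time.sleep releases the GIL while waiting.
    time.sleep(0.5)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # Ten blocking waits overlap because none of them holds the GIL while waiting.
    for _ in range(10):
        pool.submit(blocking_io)
print(f"10 blocking calls took {time.perf_counter() - start:.2f}s")  # ~0.5s, not ~5s
```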

However, this still does not take away the memory cost of thread context switches. Nor does it make true multithreading possible in Python (leaving aside alternative implementations of the language).

Nonetheless, the concurrency capabilities introduced with the asyncio module, combined with the possibility of writing your own modules in close-to-the-metal C, address the biggest pain points that kept architects from choosing Python for high-traffic backend APIs. Netflix’s (and others') adoption of Python for their media backends shows how Python has evolved into a first-class choice for handling high-volume I/O.

On top of that, language features like generators, rich literals, and sheer ease of learning have definitely made Python a prime choice for backend designers today.

I regularly write about different topics in tech, including career guidance, the latest news, upcoming technologies, and much more. This post was originally published on my blog at anirban-mukherjee.com.
