Lo-Fi Python

Jul 15, 2020

Benefits of Go and Threads in Distributed Systems

Preface

These are my YouTube lecture notes from MIT's Distributed Systems course. Thank you MIT and Professor Morris!

MIT 6.824 Distributed Systems

Lecture 2: RPC and Threads - Feb 7, 2020

Prof. Robert Morris (Spring 2020)

Introduction

Go is a popular programming language choice so my ears perked up when this lecture began. These notes were taken as the professor explains why he teaches his class in Go. He also mentioned he'd be able to teach it with Python or Java. He used C++ years ago.

The beginning of this lecture was a great summary of:

  • key benefits of Golang
  • what threads are and why they're great
  • how Go, threads and async tie together

Go is Good for Distributed Systems

Go is concurrency-friendly. With concurrent threads, you can effectively split a task such as making web requests to a server into many threads, completing them simultaneously.

Golang's Convenient Features and Benefits

Why use threads?

  • I/O Concurrency
  • Multi-core Parallelism
  • Convenience, e.g. "create 10 threads that sleep for a second and then do a little bit of work"

"Threads are the main tool we're using to manage concurrency in programs."

-Prof. Robert Morris

Contrast WithEvent-driven Programming("Asynchronomous")

A single thread, single loop that waits for an event.

Combining Threads and Event Driven Programming

"Create one thread for each procedure call."... "On each of the threads run a stripped down event driven loop. Sort of one event loop per core. That results in parallelism and I/O concurrency."

-Prof. Robert Morris

Postface: Concurrent Python Context

I've rarely if ever used multiple threads in Python. Simply running a single threaded script seems sufficient for most of my tasks. Maybe I could speed up API requests by splitting into threads when making a few hundred thousand requests? Apparently I'm missing out on concurrent threading efficiency gains.

I once experimented with the multiprocessing module's Process class, which worked on Linux but not Windows for me. I ended up taking an simpler, single thread approach instead. I've also heard of using multiprocessing pool objects. There's also the asyncio library concurrent.futures modules to consider. The ProcessPoolExecutor looks promising.

Python also has the queue module. I haven't used it yet but at one point I watched a talk where Raymond Hettinger recommended queue as a good option if you want concurrency in Python.

It seems there are many options available in Python but it's not clear which tools should be deployed and when. And your chosen concurrency strategy may add extra complexity. Handle with care. Or consider learning Go if you want to use threads to scale your distributed system.

Update: Python Concurrency Success

I recently deployed the ThreadPoolExecutor from the concurrent.futures module to efficiently move thousands of files to a new folder. So Python does have fairly accessible alternatives to concurrency. I guess I'll need to try Go sometime to compare!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import shutil
import os

def main():
    """Move files concurrently from the current working directory to a new folder.
    This script is adapted from the Python ThreadPoolExecutor documentation:
    https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.shutdown
    """
    csvs = [f for f in os.listdir(os.getcwd()) if '.csv' in f]
    split_num = len(csvs) / 4 + 1
    file_batches = np.array_split(csvs, split_num)
    # write to local folder named "csvs"
    dst_folder = "/csvs"
    with ThreadPoolExecutor(max_workers=4) as e:
        for i, files in enumerate(file_batches):
            csv_A, csv_B, csv_C, csv_D = files
            e.submit(shutil.move, csv_A, dst_folder)
            e.submit(shutil.move, csv_B, dst_folder)
            e.submit(shutil.move, csv_C, dst_folder)
            e.submit(shutil.move, csv_D, dst_folder)

if __name__ == '__main__':
    main()

Additional Reading

New Case Studies About Google's Use of Go

go.dev