Lo-Fi Python

Jul 15, 2020

Benefits of Go and Threads in Distributed Systems

Preface

These are my YouTube lecture notes from MIT's Distributed Systems course. Thank you MIT and Professor Morris!

MIT 6.824 Distributed Systems

Lecture 2: RPC and Threads - Feb 7, 2020

Prof. Robert Morris (Spring 2020)

Introduction

Go is a popular programming language choice so my ears perked up when this lecture began. These notes were taken as the professor explains why he teaches his class in Go. He also mentioned he'd be able to teach it with Python or Java. He used C++ years ago.

The beginning of this lecture was a great summary of:

  • key benefits of Golang
  • what threads are and why they're great
  • how Go, threads and async tie together

Go is Good for Distributed Systems

Go is concurrency-friendly. With concurrent threads, you can effectively split a task such as making web requests to a server into many threads, completing them simultaneously.
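Since this is a Python blog, here is a minimal Python sketch of that idea rather than Go. It assumes a made-up `fake_request` function that sleeps instead of calling a real server, so the names and timings are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(request_id):
    """Hypothetical stand-in for a web request: wait, then return a result."""
    time.sleep(0.2)  # simulates time spent waiting on the network
    return request_id * 2


def run_sequential(ids):
    # one request at a time; total time grows with the number of requests
    return [fake_request(i) for i in ids]


def run_threaded(ids, workers=8):
    # the waits overlap across threads; map() returns results in input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_request, ids))


if __name__ == "__main__":
    ids = list(range(8))
    start = time.perf_counter()
    run_sequential(ids)
    print(f"sequential: {time.perf_counter() - start:.2f}s")
    start = time.perf_counter()
    run_threaded(ids)
    print(f"threaded:   {time.perf_counter() - start:.2f}s")
```

The threaded version should finish in roughly the time of one simulated request, because the threads spend most of their time waiting rather than computing.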

Golang's Convenient Features and Benefits

Why use threads?

  • I/O Concurrency
  • Multi-core Parallelism
  • Convenience, e.g. "create 10 threads that sleep for a second and then do a little bit of work"
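The convenience example from the list above can be sketched in Python with the standard `threading` module. The function names and the "little bit of work" (squaring a number) are my own placeholders, not something from the lecture.

```python
import threading
import time


def sleep_then_work(n, results, lock, delay):
    time.sleep(delay)  # each thread sleeps first...
    with lock:  # ...then does a little bit of work
        results.append(n * n)


def run_threads(count=10, delay=1.0):
    results = []
    lock = threading.Lock()  # protects the shared results list
    threads = [
        threading.Thread(target=sleep_then_work, args=(n, results, lock, delay))
        for n in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish
    return sorted(results)


if __name__ == "__main__":
    start = time.perf_counter()
    print(run_threads())
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

All ten sleeps overlap, so the whole run should take about one second instead of ten.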

"Threads are the main tool we're using to manage concurrency in programs."

-Prof. Robert Morris

Contrast With Event-driven Programming ("Asynchronous")

A single thread, single loop that waits for an event.
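In Python, the closest standard-library equivalent I know of is `asyncio`. This is a rough sketch, with hypothetical event names and delays, showing several waits handled by one loop on one thread.

```python
import asyncio
import threading


async def handle_event(name, delay, thread_ids):
    # await hands control back to the loop while this "event" is pending
    await asyncio.sleep(delay)
    thread_ids.add(threading.get_ident())
    return f"{name} done"


async def main_loop():
    thread_ids = set()
    # gather() returns results in the order the coroutines were passed in
    results = await asyncio.gather(
        handle_event("a", 0.2, thread_ids),
        handle_event("b", 0.1, thread_ids),
        handle_event("c", 0.3, thread_ids),
    )
    return results, thread_ids


def run():
    return asyncio.run(main_loop())


if __name__ == "__main__":
    results, thread_ids = run()
    print(results)
    print(f"threads used: {len(thread_ids)}")
```

Every handler records the same thread id, which is the point: the concurrency comes from the loop switching between waiting tasks, not from extra threads.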

Combining Threads and Event Driven Programming

"Create one thread for each procedure call."... "On each of the threads run a stripped down event driven loop. Sort of one event loop per core. That results in parallelism and I/O concurrency."
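Here is a loose Python sketch of the "one event loop per thread" shape, assuming each worker thread simply calls `asyncio.run` on its own batch of tasks. The batch and function names are hypothetical. In standard CPython the global interpreter lock limits how much CPU parallelism threads can deliver, so treat this as an illustration of the structure rather than of the performance the professor describes.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


async def handle(task_id):
    await asyncio.sleep(0.1)  # simulated I/O wait
    return task_id, threading.get_ident()


async def event_loop_for(task_ids):
    # all tasks in one batch share this thread's event loop
    return await asyncio.gather(*(handle(t) for t in task_ids))


def run_loop_in_thread(task_ids):
    # asyncio.run() creates a fresh event loop in the calling thread
    return asyncio.run(event_loop_for(task_ids))


def run_all(batches):
    # one worker thread, and therefore one event loop, per batch
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        return list(pool.map(run_loop_in_thread, batches))


if __name__ == "__main__":
    for batch in run_all([[1, 2, 3], [4, 5, 6]]):
        print(batch)
```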

-Prof. Robert Morris

Postface: Concurrent Python Context

I've rarely if ever used multiple threads in Python. Simply running a single threaded script seems sufficient for most of my tasks. Maybe I could speed up API requests by splitting into threads when making a few hundred thousand requests? Apparently I'm missing out on concurrent threading efficiency gains.

I once experimented with the multiprocessing module's Process class, which worked on Linux but not Windows for me. I ended up taking a simpler, single-threaded approach instead. I've also heard of using multiprocessing Pool objects. There are also the asyncio library and the concurrent.futures module to consider. The ProcessPoolExecutor looks promising.
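For reference, a minimal ProcessPoolExecutor sketch might look like the following. The prime-counting function is just a placeholder for CPU-bound work. One common cause of "works on Linux but not Windows" is a missing `if __name__ == "__main__":` guard, since Windows starts worker processes by re-importing the main module. I can't say whether that was my original problem, but the guard is worth including either way.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """Placeholder CPU-bound work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_in_processes(limits, workers=4):
    # each limit is handled in a separate process; results keep input order
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # the guard keeps child processes from re-running this block on import
    print(count_in_processes([10_000, 20_000, 30_000, 40_000]))
```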

Python also has the queue module. I haven't used it yet but at one point I watched a talk where Raymond Hettinger recommended queue as a good option if you want concurrency in Python.
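As a starting point, here is a small sketch of how `queue.Queue` can hand work to a few threads. The worker function, the `None` sentinel convention, and the squaring task are my own illustrative choices, not taken from the talk.

```python
import queue
import threading


def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this thread
            tasks.task_done()
            break
        results.put(item * item)
        tasks.task_done()


def square_all(numbers, num_workers=3):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for n in numbers:
        tasks.put(n)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()  # blocks until every queued item is marked done
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(len(numbers)))


if __name__ == "__main__":
    print(square_all([3, 1, 2]))
```

The queue handles the locking, so the workers never touch a shared list directly.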

It seems there are many options available in Python but it's not clear which tools should be deployed and when. And your chosen concurrency strategy may add extra complexity. Handle with care. Or consider learning Go if you want to use threads to scale your distributed system.

Update: Python Concurrency Success

I recently used the ThreadPoolExecutor from the concurrent.futures module to efficiently move thousands of files to a new folder. So Python does have fairly accessible options for concurrency. I guess I'll need to try Go sometime to compare!

from concurrent.futures import ThreadPoolExecutor
import shutil
import os

def main():
    """Move CSV files concurrently from the current working directory to a new folder.
    This script is adapted from the Python ThreadPoolExecutor documentation:
    https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.shutdown
    """
    csvs = [f for f in os.listdir(os.getcwd()) if f.endswith('.csv')]
    # write to a local folder named "csvs" inside the working directory
    dst_folder = os.path.join(os.getcwd(), "csvs")
    os.makedirs(dst_folder, exist_ok=True)
    with ThreadPoolExecutor(max_workers=4) as e:
        # submit one task per file; the pool runs up to 4 moves at a time
        futures = [e.submit(shutil.move, csv, dst_folder) for csv in csvs]
    # result() re-raises any exception that happened inside a worker thread
    for future in futures:
        future.result()

if __name__ == '__main__':
    main()

Additional Reading

New Case Studies About Google's Use of Go

go.dev

Jun 23, 2020

Free Computer Science Courses and Talks To Absorb

Below you'll find a balanced curriculum of juicy courses and videos that are available for free on the internet. I'll definitely be diving into most of these in the 2nd half of 2020. Stay curious!

University CS Courses For Free

CS50's Web Programming with Python and JavaScript | Harvard University

CS 61-C Great Ideas in Computer Architecture (Machine Structures), Spring 2015 | UC Berkeley

CS 109: Data Science, 2015 | Harvard University

Mathematical Modeling of Football, Fall 2020 | Uppsala Universitet

CS 162 - Operating Systems and Systems Programming, Fall 2013 | UC Berkeley

15-445/645 Intro to Database Systems, Fall 2019 | Carnegie Mellon University

15-721 Advanced Database Systems, Spring 2020 | Carnegie Mellon University

Missing Semester: Shell Tools & Scripting, Spring 2020 | MIT

6.824 Distributed Systems, Spring 2020 | MIT

CSE 373 - Analysis of Algorithms, 2016 | Stony Brook University

CS 4150 Algorithms, Spring 2020 | University of Utah

CS 241 System Programming, Spring 2020 [course wiki] | University of Illinois

CS 6120: Advanced Compilers: The Self-Guided Online Course | Cornell University

Intriguing Coursera Classes

DevOps Culture and Mindset | UC-Davis

Computer Science: Algorithms, Theory, and Machines | Princeton University

Excel Fundamentals for Data Analysis | Macquarie University

Build a Data Science Web App with Streamlit and Python | Guided Project [$10]

Programming Talks & Tutorials

These programming talks piqued my interest, highly recommended.

David Beazley | Built in Super Heroes [YouTube]

Mr. Beazley shows how to use pure Python built-in functions to clean and analyze the City of Chicago's food inspection data. No pandas in this talk, behold the power of the Python standard library. Spoiler: Don't eat at O'Hare airport. He also has a new course, available for free:

David Beazley | Practical Python Programming [Course]

This is not a course for absolute beginners on how to program a computer. It is assumed that you already have programming experience in some other programming language or Python itself.

Sebastian Witowski | Modern Python Developer's Toolkit [YouTube]

An overview covering editing tools and setup from PyCon 2020. Honing your development environment is crucial to being an efficient coder. This example uses VS Code. I use Atom as my primary text editor. The most recommended linters are usually pylint, flake8 or pyflakes.

Jake VanderPlas | Reproducible Data Analysis in Jupyter [YouTube]

This 10-video series is a must-watch for aspiring data scientists and analysts who use Python. It includes a git workflow demonstration, working in Jupyter Notebooks and many other essentials.

Rich Hickey | Hammock Driven Development [YouTube]

Sometimes, the best thing we can do is step away from the keyboard. I really enjoy this speaker's communication style.

Eric J. Ma | Demystifying Deep Learning for Data Scientists [YouTube]

Tutorial-style Python machine learning walk-through from PyCon 2020.

Julie Michelman | Pandas, Pipelines, and Custom Transformers [YouTube]

This video is a deep dive into the world of scikit-learn and machine learning. PyCon and PyData videos usually include some cutting edge tech. Machine learning moves so fast there are always new tools surfacing. But certain libraries like scikit-learn, TensorFlow, Keras and PyTorch have been constant.

Ville Tuuls | A Billion Rows per Second: Metaprogramming Python for Big Data [YouTube]

Make your data dense by tactically re-arranging into efficient structures and compiling it down to lower-level bytes. This details a successful Python / Postgres / Numba / Multicorn big data implementation.

Video & Course Grab Bag

Discover the role of Python in space exploration [course]

Microsoft and NASA made a free course about Python in space! 🤓

Ted Nelson | Computers for Cynics [YouTube]

I find these videos to be an entertaining, thought-provoking take on software history. Recommended by Joe Armstrong, the creator of Erlang.

GNU Typist [Tutorial]

You may be able to teach yourself to type more efficiently with this tutorial. I definitely need to do this. It's worth mentioning, per Rich Hickey: with a proper design phase, you'll spend less time typing in the first place!

Extra Credit: Python Wikipedia Library

import wikipedia [GitHub]