Lo-Fi Python

Feb 14, 2021

So You Want to Learn Python?

Here are a few Python concepts for beginners to explore if you are starting out with the language. In this post, I'll highlight my favorite "must-learn" tools to master that come with your Python installation. Understanding them will make you a more capable Python programmer and problem solver.

  1. Built-in Functions. They are awesome! You can do so much with these. Learn to apply them. You won't regret it! See also: An Intro to Python's Built-in Functions
  2. String methods. Want to capitalize, lowercase or replace characters in text? How about checking whether a string is all digits with str.isdigit()? Get to know Python's string methods. I use these frequently. Also, the pandas string method implementations are great for applying them to tabular data.
  3. Docstrings. I truly enjoy adding docstrings at the beginning of my functions. They add clarity and ease of understanding.
  4. The Mighty Dictionary. Lists and tuples are useful too, but dictionaries are so handy with the ability to store and access key-value pairs.
  5. List Comprehensions. These allow you to perform transformations on lists in one line of code! I love the feeling when I apply a list comprehension that is concise, yet readable.
  6. Lambda Expressions. These can be used to apply a function "on the fly". I love their succinctness. It took me a few years to become comfortable with them. Sometimes it makes sense to use a lambda expression instead of a regular function to transform data.
  7. Date Objects. Wielding date objects and formatting them to your needs is a pivotal Python skill. Once you have it down, it unlocks a lot of automation and scripting abilities when combined with libraries like pathlib, os or glob. For example, you can read file metadata and then execute an action based on the date of the file. I use date.today() a lot to fetch today's date, and timedelta to compare two dates. The datetime module is your friend, so dive in. Must know for custom date formatting: strftime() and strptime(). See also: Time Format Codes
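Here is a quick tour of several items from the list in one small script. It is only a sketch: the sample sentence, the variable names and the days_between() helper are made up for illustration.

```python
from datetime import date, datetime, timedelta


def days_between(start: date, end: date) -> int:
    """Return the number of days from start to end (negative if end is earlier)."""
    return (end - start).days


# Built-in functions and string methods: clean up text, then inspect it.
raw_words = "  Python is FUN and python is handy  ".strip().lower().split()
print(len(raw_words))          # 7
print(sorted(set(raw_words)))  # ['and', 'fun', 'handy', 'is', 'python']
print("2021".isdigit())        # True

# Dictionary: count how often each word appears.
word_counts = {}
for word in raw_words:
    word_counts[word] = word_counts.get(word, 0) + 1

# List comprehension: transform and filter in one line.
long_words = [word.title() for word in word_counts if len(word) > 3]
print(long_words)              # ['Python', 'Handy']

# Lambda expression: sort the (word, count) pairs by count, highest first.
by_count = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
print(by_count[0])             # ('python', 2)

# Date objects: parse a string, shift it by a week, and format it again.
posted = datetime.strptime("2021-02-14", "%Y-%m-%d").date()
week_later = posted + timedelta(days=7)
print(posted.strftime("%b %d, %Y"))      # Feb 14, 2021
print(days_between(posted, week_later))  # 7
print(date.today().isoformat())          # today's date, e.g. YYYY-MM-DD
```

The docstring on days_between() is what help(days_between) displays, which is a nice payoff for the few seconds it takes to write one.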

For tabular data, I often use pd.to_datetime() to convert a series of strings to datetime objects:

# install pandas with this command: python -m pip install pandas
import pandas as pd
events = [
    ["USA Born", "1776-07-04"],
    ["September 11 Attacks", "2001-09-11"],
    ["Biden Inauguration", "2021-01-20"],
]
df = pd.DataFrame(events, columns=["events", "dates"])
# convert a pandas series of strings to datetime objects
df["dates"] = pd.to_datetime(df["dates"])
print(df.dtypes)
print(df.head())

Just the tip of the iceberg...

The amazing part of Python is that its community has developed an astonishing plethora of external libraries which can be installed with pip. Usually I'll learn how to use a new library after googling to find a well-written README on GitHub or helpful documentation. The language also comes with an impressive line-up of baked-in tools and libraries, way beyond what I've mentioned here. But I think this is a great start. Get to know these common Python language features and you'll be surprised how much you can do!

Additional Comprehensive Python Learning Resources

How long did it take you to learn Python?

Practical Python Programming (free course)

Google Python Style Guide

What the f*ck Python!

PySanity

Aug 09, 2020

Pondering Join Algorithms

I'm truly enjoying this Intro to Database Systems course from Carnegie Mellon University. This lecture has some really great breakdowns of common join algorithms. Here are my notes.

Lecture 11 - Join Algorithms (CMU Database Systems / Fall 2019)

Prof. Andy Pavlo, Carnegie Mellon Database Group

[Image: Join Algorithms, screenshot from lecture]

Table Positioning for a Join

"In general, your smaller table should be the 'left' table when joining two tables." The professor demonstrates better performance by making the smaller table the "outer" table in a join.

Block Nested Loop Join [mysql example]

  • "The brute force approach"
  • A good option for joining if you have enough memory to hold a large table.
  • Always pick the smaller table as the outer table.
  • Buffer as much of your outer table in memory as possible to reduce redundant I/O.
  • Loop over the inner table or use an index.
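To make the bullets above concrete, here is a minimal in-memory sketch of a block nested loop join. It is my own illustration, not code from the lecture: tables are plain lists of dicts, block_size stands in for how many outer rows fit in the buffer, and the users/orders data is made up.

```python
def block_nested_loop_join(outer, inner, key, block_size=2):
    """Join two lists of dicts on key, buffering block_size outer rows at a time."""
    results = []
    for start in range(0, len(outer), block_size):
        # Buffer one chunk of the (smaller) outer table in memory.
        block = outer[start:start + block_size]
        # Scan the inner table once per block instead of once per outer row.
        for inner_row in inner:
            for outer_row in block:
                if outer_row[key] == inner_row[key]:
                    results.append({**outer_row, **inner_row})
    return results


# Hypothetical sample data: the smaller table is used as the outer table.
users = [
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Grace"},
    {"user_id": 3, "name": "Edsger"},
]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 3, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
    {"user_id": 4, "item": "desk"},
]

joined = block_nested_loop_join(users, orders, "user_id")
for row in joined:
    print(row["name"], row["item"])
```

With three outer rows and a block size of two, the inner table is scanned twice rather than three times. A bigger buffer means fewer passes over the inner table, which is the point of buffering as much of the outer table as possible.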

Index Nested Loop Join [CS Course definition]

Useful if an index is already available on the join key, or if you can create one to use for the join.
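A rough sketch of the idea, assuming a sorted list of (key, row position) pairs searched with bisect as a stand-in for a real database index. The function names and sample data are hypothetical, not from the lecture.

```python
from bisect import bisect_left


def build_sorted_index(rows, key):
    """Return (key value, row position) pairs sorted by key value."""
    return sorted((row[key], pos) for pos, row in enumerate(rows))


def index_lookup(index, rows, value):
    """Binary-search the index and return every row whose key equals value."""
    i = bisect_left(index, (value, -1))  # first entry with this key, if any
    matches = []
    while i < len(index) and index[i][0] == value:
        matches.append(rows[index[i][1]])
        i += 1
    return matches


def index_nested_loop_join(outer, inner, inner_index, key):
    """For each outer row, probe the inner table's index instead of scanning it."""
    results = []
    for outer_row in outer:
        for inner_row in index_lookup(inner_index, inner, outer_row[key]):
            results.append({**outer_row, **inner_row})
    return results


users = [{"user_id": 1, "name": "Ada"}, {"user_id": 3, "name": "Edsger"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 3, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
]

orders_index = build_sorted_index(orders, "user_id")
joined = index_nested_loop_join(users, orders, orders_index, "user_id")
for row in joined:
    print(row["name"], row["item"])
```

Each outer row costs one binary search plus its matches, instead of a full scan of the inner table.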

Sort-Merge Join [wikipedia]

Useful if one or both tables are sorted on a join key. Maximize sequential I/O.

[Image: Sort-Merge Join, screenshot from lecture]
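Here is a small sort-merge join sketch in Python, assuming both tables are lists of dicts that fit in memory. The names and sample data are my own. It sorts both inputs, then walks them with two cursors, handling runs of duplicate keys.

```python
def sort_merge_join(left, right, key):
    """Join two lists of dicts on key by sorting both, then merging in one pass."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            value = left[i][key]
            # Find the run of right rows that share this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == value:
                j_end += 1
            # Pair every left row in the matching run with that right run.
            while i < len(left) and left[i][key] == value:
                for right_row in right[j:j_end]:
                    results.append({**left[i], **right_row})
                i += 1
            j = j_end
    return results


users = [
    {"user_id": 3, "name": "Edsger"},
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Grace"},
]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 3, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
]

joined = sort_merge_join(users, orders, "user_id")
for row in joined:
    print(row["name"], row["item"])
```

If the inputs are already sorted on the join key, the sorted() calls are the part a database gets to skip, and the merge only moves forward through each table.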

Hash Join

Generally the best performance, especially for large datasets.

  1. Phase #1 Build (Hash Table)
  2. Phase #2 Probe
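The two phases can be sketched in a few lines of Python, assuming both tables are lists of dicts and a defaultdict plays the role of the hash table. The function name and sample data are made up for illustration.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Join two lists of dicts on key with a build phase and a probe phase."""
    # Phase 1: build a hash table on the join key of one table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Phase 2: probe the hash table with each row of the other table.
    results = []
    for probe_row in probe_rows:
        for build_row in table.get(probe_row[key], []):
            results.append({**build_row, **probe_row})
    return results


users = [{"user_id": 1, "name": "Ada"}, {"user_id": 3, "name": "Edsger"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 3, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
    {"user_id": 4, "item": "desk"},
]

joined = hash_join(users, orders, "user_id")
for row in joined:
    print(row["name"], row["item"])
```

Each table is read exactly once, and the output follows the order of the probe table. Building on the smaller table keeps the hash table small.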

Use a Bloom filter to optimize the probe phase. It supports two set operations:

  1. insert a key
  2. lookup a key
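Below is a toy Bloom filter with just those two operations. It is a sketch under my own assumptions: a Python int serves as the bit array, and salted SHA-256 digests stand in for the several hash functions. The sizes are arbitrary.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def lookup(self, key):
        # True means "maybe present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Build phase: insert every join key of the build table into the filter.
bloom = BloomFilter()
for user_id in [1, 3]:
    bloom.insert(user_id)

# Probe phase: skip the hash table lookup when the filter rules a key out.
for user_id in [1, 3, 4]:
    print(user_id, "maybe present" if bloom.lookup(user_id) else "definitely absent")
```

A key that was inserted always comes back as maybe present. A key that was never inserted usually comes back as definitely absent, which lets the probe phase skip that hash table lookup.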

Additional Reading on Bloom Filters

Let's implement a Bloom Filter

Bloom Filters Debunked

Grace Hash Join [wikipedia]

  • "Do hash joins when things don't fit in memory."
  • Use a hash table for each table. Break both tables into buckets with the same hash function on the join key, then do a nested loop join on each pair of matching buckets. If the buckets do not fit in memory, use recursive partitioning until everything fits in memory for the join.
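A simplified sketch of that partitioning idea, assuming in-memory lists stand in for the partitions a database would spill to disk. Recursive partitioning is left out, and the names and sample data are hypothetical.

```python
def partition(rows, key, num_partitions):
    """Split rows into buckets by hashing the join key."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return buckets


def grace_hash_join(outer, inner, key, num_partitions=4):
    """Partition both tables with the same hash, then join bucket by bucket."""
    outer_parts = partition(outer, key, num_partitions)
    inner_parts = partition(inner, key, num_partitions)
    results = []
    # Matching keys always hash to the same bucket number, so only
    # bucket i of the outer table is compared with bucket i of the inner.
    for outer_bucket, inner_bucket in zip(outer_parts, inner_parts):
        for outer_row in outer_bucket:
            for inner_row in inner_bucket:
                if outer_row[key] == inner_row[key]:
                    results.append({**outer_row, **inner_row})
    return results


users = [
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Grace"},
    {"user_id": 3, "name": "Edsger"},
]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 3, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
    {"user_id": 4, "item": "desk"},
]

joined = grace_hash_join(users, orders, "user_id")
for row in joined:
    print(row["name"], row["item"])
```

Only one pair of buckets needs to be in memory at a time, which is why this approach still works when the whole tables do not fit.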

"Split outer relation into partitions based on the hash key."

Prof. Andy Pavlo on Hash Join algorithm

  • Hashing is almost always better than sorting for operator execution.

"No join algorithm works well in all scenarios."

-Prof. Andy Pavlo
