Lo-Fi Python

Dec 31, 2021

Phone Number Cleaning Regex + pandas Series Example

This is a solution I worked out recently to strip phone numbers into a uniform format. To install pandas with pip, enter in command prompt:

python -m pip install pandas

The pandas library has regex built in and it's pretty neat! Behold the power of pandas and a regular expression to do trivial telephone tidying:

strip phone formatting with Python
import pandas as pd
s = pd.Series(data=["(010) 001-1010"], name="Phone", dtype="str")
# remove parentheses, hyphens and spaces with pandas + regex
s = s.str.replace(pat=r"\(|\)|-| ", repl="", regex=True)
print(s)
# resulting number: "0100011010"

Regex is cool.

Grasping what this code is doing feels elegant once you connect the dots... or pipes. The replace is done via the pandas str accessor. In the pat string, the parentheses are escaped with backslashes, and the four alternatives (open parenthesis, close parenthesis, hyphen and space) are separated by pipes "|". The pipes act as an or operator, succinctly chaining multiple characters together for matching and, in this case, replacing each match with nothing. Pretty nifty. If you read the pandas docs, you'll find regex is accessible in different parts of the API. Dive in, it's some of my favorite documentation to snoop. There is so much you can do with pandas. This example shows how its flexible string functions get the job done in a single line.
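To make the "or" behavior concrete, here is a small sketch that runs the same pattern over a few differently formatted numbers and compares it with a character class, which is another common way to match any one of several characters. The extra sample numbers and the variable names are made up for illustration; they are not from the original example.

```python
import re

import pandas as pd

# Hypothetical sample data: the same number in three formats
phones = pd.Series(
    ["(010) 001-1010", "010-001-1010", "010 001 1010"],
    name="Phone",
    dtype="str",
)

# Alternation: match "(" OR ")" OR "-" OR " " and replace with nothing
with_pipes = phones.str.replace(r"\(|\)|-| ", "", regex=True)

# A character class matches any single character listed inside the brackets
with_class = phones.str.replace(r"[()\- ]", "", regex=True)

# The same pattern also works on a plain string with the standard library
plain = re.sub(r"\(|\)|-| ", "", "(010) 001-1010")

print(with_pipes.tolist())
print(with_class.tolist())
print(plain)
```

Both patterns should strip the same characters here. The character class is often a little easier to read once the list of characters grows.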

Further Reading:

pandas.Series documentation

pandas str.replace documentation

Source of the famous “Now you have two problems” quote

Jul 15, 2018

Findstr, RegEx File Searches for Windows

Findstr is the Windows alternative to grep, which runs on Unix-like operating systems. Findstr searches files with regular expressions and is useful for string matching within files and directories. It is one of over 280 command prompt commands. Here's the official Windows Documentation and some Linux vs. Windows Examples.

Update: Microsoft announced that Grep and several other Unix command line tools will be added to Windows 10. This is a new alternative to findstr.

This findstr command returns all lines containing an '@' in a text file.

findstr @ test.txt
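For comparison, a rough Python equivalent of this line filter might look like the sketch below. It only does a plain substring check, which is all that a simple character like '@' needs. The function name lines_containing and the sample file contents are invented for the demo.

```python
import tempfile
from pathlib import Path


def lines_containing(path, needle):
    """Return the lines of a text file that contain needle as a substring."""
    with open(path, "r", encoding="utf-8") as fhand:
        return [line.rstrip("\n") for line in fhand if needle in line]


# Demo on a throwaway test.txt with made-up contents
sample = (
    "alice@example.com signed up\n"
    "no address here\n"
    "contact: bob@example.org\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.txt"
    path.write_text(sample, encoding="utf-8")
    matches = lines_containing(path, "@")

print(matches)
```

Only the first and third lines of the sample contain an '@', so those are the two lines returned.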

I was happy to see Findstr's convenient help menu:

findstr -?

Regular expressions are so powerful. It's nice to have this utility within the command prompt. I am hoping to get to know some of the other 280 command prompt commands.

I've previously explored regex with Python. This Python regex example finds all words in a text file containing '@' symbols:

import re

# read the file to string + regex email search
with open('test.txt', 'r') as fhand:
    string = fhand.read()
    # this regex returns a python list of emails:
    emails = re.findall(r'(\S*@\S+)', string)
    print(emails)

For more command prompt nuggets, check out my more recent post: Exploring Windows Command Line Tools, Batch Files and Remote Desktop Connection.