In the last post I included a code snippet which uses a regular expression to extract the title from a name in the titanic data.

It just occurred to me that a lot of the people reading the blogs of beginner data scientists, if they aren’t hiring managers (Hi, I love you!), are probably also beginner data scientists, and in a lot of cases, probably quite new to programming. Regular expression patterns (sometimes known as regex) are something you should know about.

Regular expressions can do quite complicated text pattern matching which would often take some quite cumbersome code otherwise. They take a bit of learning and getting used to but I’ll give you the briefest introduction so you can at least understand why it’s worth the effort to learn.

If you’ve already written something similar using string indexing functions, you’re not going to take long to convince.

The names in the titanic data look like these:

Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Th…
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry

The regular expression I started using to process them was this: [^,]*,\s([^.]*) which breaks down into 3 sections, explained with the example “Braund, Mr. Owen Harris”.

RegEx Matches Text
[^,]* Anything except a comma, multiple times “Braund”
,\s A comma followed by a space “, “
([^.]*) Anything except a period, multiple times “Mr”

The brackets around the third part mean that it is a match group, which we can use to extract the actual value in our code. If I’d wanted to know the surname as well (maybe to turn the name into “Mr Braund”) I could have placed brackets around the first part and the third part of the pattern and extracted them both.

This pattern needed a slight tweak as it failed on one name in the training data. “Rothes, the Countess. of (Lucy Noel Martha Dye…”. This was returning “the Countess” in the match group instead of “Countess”.

The new pattern is [^,]*,\s(the\s)?([^.]*). A question mark in a regex means zero or one times and the match object returns None when the pattern doesn’t find the optional “the ” in the name. I ignore it anyway and return the second match group which contains the title I’m after.

import re
 
def ExtractTitle(name):
  ptn = "[^,]*,\s(the\s)?([^\.]*)"   
  p = re.compile(ptn, re.IGNORECASE)
  m = p.match(name)

  if m is not None:
    return m.group(2)
  else:
    print("FAILED EXTRACTING TITLE FROM: " + name)
    return None

You should at least now understand how powerful regex patterns can be, and this is probably one of the simplest examples you’re likely to find. If you think about something even a tiny bit more complicated like an email address and the many different ways they can appear, if you try extracting them from larger blocks of text using string indexing functions, you’re going to have a migraine quite quickly.

Luckily, regular expressions have been around since 1951 and appear in some form in most programming languages, even if you have to download a third-party library. Python has a built-in package for them so you just need to import re.

Lastly the web is full of websites which teach you regular expressions and help you build them. RegExr has a great visual editor which lets you edit the pattern with test text underneath so you can see in real time what your changes are doing. There are plenty more like this. Go and play…