17. Strings and Text

17.1. Introduction

Strings are just the data type for text. So far, you’ve used strings but without learning much about the details. Now it’s time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tools you have at your disposal.

This chapter has benefitted from the Python String Cook Book and Jake VanderPlas’ Python Data Science Handbook.

Note that there are more powerful methods for working with strings called regular expressions but these will be covered in a different chapter.

17.2. Creating Strings

We’ve created strings in passing earlier in the book, but didn’t discuss the details. First, you can create a string using either single quotes (') or double quotes ("). It doesn’t matter which you use, but it’s good to be consistent; automatic code formatters tend to prefer ". If you need a quotation mark inside a string, use the other kind for the inner quote, for example ' inside a string delimited by ".

string_one = "This is a string"
string_two = (
    "If I want to include a 'quote' inside a string, I use double quotes on the outside"
)

Strings are of type str:

type(string_one)
str

Strings in Python can be indexed, so we can get certain characters out by using square brackets to say which positions we would like.

var = "banana"
var[:3]
'ban'

The usual slicing tricks that apply to lists work for strings too, i.e. the positions you want to get can be retrieved using the var[start:stop:step] syntax. Here’s an example of getting every other character from the string starting from the 2nd position.

var[1::2]
'aaa'

Note that strings, like tuples such as (1, 2, 3) but unlike lists such as [1, 2, 3], are immutable. This means commands like var[0] = "B" will result in an error. If you want to change a single character, you will have to replace the entire string. In this example, the command to do that would be var = "Banana".
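As a minimal sketch of what immutability means in practice (the variable names assignment_worked and new_var are purely illustrative), the snippet below attempts an in-place assignment, catches the resulting TypeError, and then builds a new string instead:

```python
var = "banana"

# In-place assignment is not allowed because strings are immutable
try:
    var[0] = "B"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Build a new string from a replacement character plus a slice of the old one
new_var = "B" + var[1:]
print(assignment_worked, new_var)
```

The original string is untouched; new_var is a separate object that happens to reuse most of the same characters.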

Like lists, you can find the length of a string using len():

len(var)
6

You can concatenate strings using the + operator:

string_one + ". " + string_two + "."
"This is a string. If I want to include a 'quote' inside a string, I use double quotes on the outside."

Note that we added extra characters so that the phrase made sense. Another way of achieving the same end that scales to many words or phrases more efficiently (if you have them in a list) is:

". ".join([string_one, string_two])
"This is a string. If I want to include a 'quote' inside a string, I use double quotes on the outside"

Three useful string methods to know about are upper(), lower(), and title(). Let’s see what they do:

var = "input TEXT"
var_list = [var.upper(), var.lower(), var.title()]
print(var_list)
['INPUT TEXT', 'input text', 'Input Text']

Note that there are many built-in methods for working with strings in Python; the official Python documentation has a comprehensive list.

Exercise

Reverse the string "gnirts desrever a si sihT" using indexing operations.

While we’re using print(), it has a few tricks. If we have a list, we can print out entries with a given separator:

print(*var_list, sep="; and \n")
INPUT TEXT; and 
input text; and 
Input Text

(We’ll find out more about what ‘\n’ does shortly.) To turn variables of other kinds into strings, use the str() function, for example:

(
    "A boolean is either "
    + str(True)
    + " or "
    + str(False)
    + ", there are only "
    + str(2)
    + " options."
)
'A boolean is either True or False, there are only 2 options.'

In this example two boolean variables and one integer variable were converted to strings. str() generally makes an intelligent guess at how you’d like to convert your non-string type variable into a string type. You can pass a variable or a literal value to str().

17.2.1. f-strings

The example above is quite verbose. Another way of combining strings with variables is via f-strings. A simple f-string looks like this:

variable = 15.32399
print(f"You scored {variable}")
You scored 15.32399

This is similar to calling str on variable and using + for concatenation but much shorter to write. You can add expressions to f-strings too:

print(f"You scored {variable**2}")
You scored 234.8246695201

This also works with functions; after all **2 is just a function with its own special syntax.
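To illustrate, here is a short sketch with a built-in function and a string method called inside the braces. The variable name and the message wording are made up for the example:

```python
variable = 15.32399
name = "ada"

# Any expression can go inside the braces, including function calls...
rounded_message = f"You scored {round(variable, 1)}"

# ...and method calls on other variables
name_message = f"Well done, {name.title()}!"

print(rounded_message)
print(name_message)
```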

In this example, the score number that came out had a lot of (probably) uninteresting decimal places. So how do we polish the printed output? You can pass more information to the f-string to get the output formatted just the way you want. Let’s say we wanted two decimal places and a sign (although you always write + in the formatting, the sign comes out as + or - depending on the value):

print(f"You scored {variable:+.2f}")
You scored +15.32

There is a whole range of formatting options for numbers, as shown in the following table:

| Number | Format | Output | Description |
| --- | --- | --- | --- |
| 15.32347 | {:.2f} | 15.32 | Format float 2 decimal places |
| 15.32347 | {:+.2f} | +15.32 | Format float 2 decimal places with sign |
| -1 | {:+.2f} | -1.00 | Format float 2 decimal places with sign |
| 15.32347 | {:.0f} | 15 | Format float with no decimal places |
| 3 | {:0>2d} | 03 | Pad number with zeros (left padding, width 2) |
| 3 | {:*<4d} | 3*** | Pad number with *’s (right padding, width 4) |
| 13 | {:*<4d} | 13** | Pad number with *’s (right padding, width 4) |
| 1000000 | {:,} | 1,000,000 | Number format with comma separator |
| 0.25 | {:.1%} | 25.0% | Format percentage |
| 1000000000 | {:.2e} | 1.00e+09 | Exponent notation |
| 12 | {:10d} | '        12' | Right aligned (default, width 10) |
| 12 | {:<10d} | '12        ' | Left aligned (width 10) |
| 12 | {:^10d} | '    12    ' | Center aligned (width 10) |

(The quotes in the last three rows are shown only to make the padding spaces visible.)
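A quick way to experiment with these options is the built-in format() function, which applies the same format specification that follows the colon in an f-string. The sketch below runs a handful of the specifications from the table; the list name examples is just for illustration:

```python
# Each tuple is (value, format specification) taken from the table above
examples = [
    (15.32347, ".2f"),
    (-1, "+.2f"),
    (3, "0>2d"),
    (13, "*<4d"),
    (1000000, ","),
    (0.25, ".1%"),
    (1000000000, ".2e"),
    (12, "^10d"),
]

results = {}
for value, spec in examples:
    # format(value, spec) applies the same mini-language as f"{value:{spec}}"
    formatted = format(value, spec)
    results[spec] = formatted
    # repr (via !r) adds quotes so that any padding spaces are visible
    print(f"{value!r:>12} | {spec:<6} | {formatted!r}")
```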

As well as using this page interactively through the Colab and Binder links at the top of the page, or downloading this page and using it on your own computer, you can play around with some of these options over at this link.

17.2.2. Special Characters and How to Escape Strings

Python has a string module that comes with some useful built-in strings and characters. For example

import string

string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

gives you all of the punctuation,

string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

returns all of the basic letters in the ‘ASCII’ encoding (with .ascii_lowercase and .ascii_uppercase variants), and

string.digits
'0123456789'

gives you the numbers from 0 to 9. Finally, though less impressive visually, string.whitespace gives a string containing all of the different (there is more than one!) types of whitespace.
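Because whitespace is invisible when printed, one way to inspect it is with repr(), which shows each character in its escaped form. This is a small sketch; the names whitespace_count and all_whitespace are only illustrative:

```python
import string

# repr() shows the escaped form of each character instead of rendering it
print(repr(string.whitespace))

# Count how many distinct whitespace characters the constant contains
whitespace_count = len(string.whitespace)

# str.isspace() is True when every character in a string is whitespace
all_whitespace = all(ch.isspace() for ch in string.whitespace)
print(whitespace_count, all_whitespace)
```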

There are other special characters around; in fact, we already met the most famous of them: “\n” for new line. To actually print “\n” we have to ‘escape’ the backslash by adding another backslash:

print("Here is a \n new line")
print("Here is an \\n escaped new line ")
Here is a 
 new line
Here is an \n escaped new line 

The table below shows the most important escape commands:

| Code | Result |
| --- | --- |
| \' | Single Quote (useful if using ' for strings) |
| \" | Double Quote (useful if using " for strings) |
| \\ | Backslash |
| \n | New Line |
| \r | Carriage Return |
| \t | Tab |

Here’s a more complicated example:

print("a\tb\nA\tB")
a	b
A	B

17.2.3. Raw Strings

Strings prefixed with r, such as r'…' and r"…", are called raw strings; they treat backslashes \ as literal characters rather than as the start of an escape sequence.

print(r"a\tb\nA\tB")
a\tb\nA\tB
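Raw strings are often handy for text that is full of backslashes, such as Windows-style file paths. The sketch below uses a made-up path to compare the three ways of writing it:

```python
# A hypothetical Windows-style path, written three ways
escaped = "C:\\new_folder\\table.csv"  # backslashes escaped by doubling them
raw = r"C:\new_folder\table.csv"  # raw string: backslashes kept as typed
unescaped = "C:\new_folder\table.csv"  # \n and \t are read as new line and tab

# The escaped and raw versions are the same text
same_text = escaped == raw

# The unescaped version is shorter: two 2-character sequences became 1 character each
length_difference = len(raw) - len(unescaped)
print(same_text, length_difference)
```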

17.3. Cleaning Text

You often want to make changes to the text you’re working with. In this section, we’ll look at the various options to do this.

17.3.1. Replacing sub-strings

A common text task is to replace a substring within a longer string. Let’s say you have a string variable var. You can use var.replace(old_text, new_text) to do this.

"Value is objective".replace("objective", "subjective")
'Value is subjective'

As with any string method, this also works when the text and the substrings are stored in variables:

text = "Value is objective"
old_substr = "objective"
new_substr = "subjective"
text.replace(old_substr, new_substr)
'Value is subjective'

Note that .replace() performs an exact replace and so is case-sensitive.
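The sketch below shows the case-sensitivity in action, along with the optional third argument of replace() that limits the number of replacements. The example sentences are made up:

```python
text = "Value is objective. Objective measures matter."

# Only the lower-case occurrence matches, because replace() is case-sensitive
case_sensitive = text.replace("objective", "subjective")

# One simple way to catch both cases is to chain two replacements
both_cases = text.replace("objective", "subjective").replace("Objective", "Subjective")

# The optional third argument limits how many replacements are made
first_only = "a-b-c-d".replace("-", "+", 1)

print(case_sensitive)
print(both_cases)
print(first_only)
```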

17.3.2. Replacing characters with translate

A character is an individual entry within a string, like the ‘l’ in ‘equilibrium’. You can always count the number of characters in a string variable called var by using len(var). A very fast method for replacing individual characters in a string is str.translate().

Replacing characters is extremely useful in certain situations, most commonly when you wish to remove all punctuation prior to doing other text analysis. You can use the built-in string.punctuation for this.

Let’s see how to use it to remove all of the vowels from some text. With apologies to economist Lisa Cook, we’ll use the abstract from Cook [Coo11] as the text we’ll modify and we’ll first create a dictionary of translations of vowels to nothing, i.e. "".

example_text = "Much recent work has focused on the influence of social capital on innovative outcomes. Little research has been done on disadvantaged groups who were often restricted from participation in social networks that provide information necessary for invention and innovation. Unique new data on African American inventors and patentees between 1843 and 1930 permit an empirical investigation of the relation between social capital and economic outcomes. I find that African Americans used both traditional, i.e., occupation-based, and nontraditional, i.e., civic, networks to maximize inventive output and that laws constraining social-capital formation are most negatively correlated with economically important inventive activity."
vowels = "aeiou"
translation_dict = {x: "" for x in vowels}
translation_dict
{'a': '', 'e': '', 'i': '', 'o': '', 'u': ''}

Now we turn our dictionary into a string translator and apply it to our text:

translator = str.maketrans(translation_dict)
example_text.translate(translator)
'Mch rcnt wrk hs fcsd n th nflnc f scl cptl n nnvtv tcms. Lttl rsrch hs bn dn n dsdvntgd grps wh wr ftn rstrctd frm prtcptn n scl ntwrks tht prvd nfrmtn ncssry fr nvntn nd nnvtn. Unq nw dt n Afrcn Amrcn nvntrs nd ptnts btwn 1843 nd 1930 prmt n mprcl nvstgtn f th rltn btwn scl cptl nd cnmc tcms. I fnd tht Afrcn Amrcns sd bth trdtnl, .., ccptn-bsd, nd nntrdtnl, .., cvc, ntwrks t mxmz nvntv tpt nd tht lws cnstrnng scl-cptl frmtn r mst ngtvly crrltd wth cnmclly mprtnt nvntv ctvty.'
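The same approach can delete punctuation outright. This sketch uses the three-argument form of str.maketrans(), whose third argument lists the characters to delete. The sample sentence is made up for illustration:

```python
import string

sample = "Hello, world! (This is: a test...)"

# The third argument of str.maketrans lists characters to delete
remove_punct = str.maketrans("", "", string.punctuation)

cleaned = sample.translate(remove_punct)
print(cleaned)
```

Deleting characters can run words together (for example across a hyphen), so mapping punctuation to a space is sometimes the better choice.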

Exercise

Use translate to replace all punctuation in the following sentence with spaces: “The well-known story I told at the conferences [about hypocondria] in Boston, New York, Philadelphia,…and Richmond went as follows: It amused people who knew Tommy to hear this; however, it distressed Suzi when Tommy (1982–2019) asked, “How can I find out who yelled, ‘Fire!’ in the theater?” and then didn’t wait to hear Missy give the answer—‘Dick Tracy.’”

Generally, str.translate is very fast at replacing individual characters in strings. But you can also do it using a list comprehension and a join() of the resulting list, like so:

"".join(
    [
        ch
        for ch in "Example. string. with- excess_ [punctuation]/,"
        if ch not in string.punctuation
    ]
)
'Example string with excess punctuation'

17.3.3. Splitting strings

If you want to split a string at a certain position, there are two quick ways to do it. The first is to use indexing methods, which work well if you know at which position you want to split text, e.g.:

"This is a sentence and we will split it at character 18"[:18]
'This is a sentence'

Next up we can use the built-in split() method, which breaks a string apart wherever a given sub-string occurs and returns the pieces as a list:

"This is a sentence. And another sentence. And a third sentence".split(".")
['This is a sentence', ' And another sentence', ' And a third sentence']

Note that the character used to split the string is removed from the resulting list of strings. Let’s see an example with a string used for splitting instead of a single character:

"This is a sentence. And another sentence. And a third sentence".split("sentence")
['This is a ', '. And another ', '. And a third ', '']

A useful extra function to know about is splitlines(), which splits a string at line breaks and returns the split parts as a list.
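A brief sketch of splitlines() on a made-up string that mixes two kinds of line break:

```python
poem = "first line\nsecond line\r\nthird line"

# splitlines() recognises several kinds of line break and drops them
lines = poem.splitlines()

# keepends=True keeps the line-break characters at the end of each piece
lines_with_breaks = poem.splitlines(keepends=True)

print(lines)
print(lines_with_breaks)
```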

17.3.4. count and find

Let’s do some simple counting of words within text using str.count(). Let’s use the first verse of Elizabeth Bishop’s sestina ‘A Miracle for Breakfast’ for our text.

text = "At six o'clock we were waiting for coffee, \n waiting for coffee and the charitable crumb \n that was going to be served from a certain balcony \n --like kings of old, or like a miracle. \n It was still dark. One foot of the sun \n steadied itself on a long ripple in the river."
word = "coffee"
print(f'The word "{word}" appears {text.count(word)} times.')
The word "coffee" appears 2 times.

Meanwhile, find() returns the position where a particular word or character occurs.

text.find(word)
35

We can check this using the number we get and some string indexing:

text[text.find(word) : text.find(word) + len(word)]
'coffee'

But this isn’t the only place where the word ‘coffee’ appears. If we want to find the last occurrence, we can use rfind():

text.rfind(word)
57
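Two further details are worth knowing: find() returns -1 when the sub-string is absent, and it accepts an optional start position. Together these allow a simple loop that collects every occurrence. This is only a sketch on a shortened, made-up version of the text, and the names positions and missing are illustrative:

```python
text = "waiting for coffee, waiting for coffee and the charitable crumb"
word = "coffee"

# find() returns -1 rather than raising an error when nothing is found
missing = text.find("tea")

# Collect every starting position by restarting the search after each hit
positions = []
start = text.find(word)
while start != -1:
    positions.append(start)
    start = text.find(word, start + 1)

print(missing, positions)
```

The first and last entries of positions agree with what find() and rfind() return on the same text.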

17.4. Working with Multiple Strings

We’ve seen how to work with individual strings. But often we want to work with a group of strings, otherwise known as a corpus, that is, a collection of texts. It could be a collection of words, sentences, paragraphs, or some domain-based grouping (e.g. job descriptions). Just like any other Python object, you can put strings into a list (or other iterable).

And, fortunately, many of the methods that we have seen deployed on a single string can be straightforwardly scaled up to hundreds, thousands, or millions of strings using pandas or other tools. This scaling up is achieved via vectorisation, in analogy with going from a single value (a scalar) to multiple values in a list (a vector).

As a very minimal example, here is capitalisation of names vectorised using a list comprehension:

[name.capitalize() for name in ["ada", "adam", "elinor", "grace", "jean"]]
['Ada', 'Adam', 'Elinor', 'Grace', 'Jean']

A pandas series can be used in place of a list. Let’s create the series first:

import pandas as pd

dfs = pd.Series(
    ["ada lovelace", "adam smith", "elinor ostrom", "grace hopper", "jean bartik"],
    dtype="string",
)
dfs
0     ada lovelace
1       adam smith
2    elinor ostrom
3     grace hopper
4      jean bartik
dtype: string

Now we use the syntax series.str.function to change the text series:

dfs.str.title()
0     Ada Lovelace
1       Adam Smith
2    Elinor Ostrom
3     Grace Hopper
4      Jean Bartik
dtype: string

If we had a data frame and not a series, the syntax would change to refer just to the column of interest like so:

df = pd.DataFrame(dfs, columns=["names"])
df["names"].str.title()
0     Ada Lovelace
1       Adam Smith
2    Elinor Ostrom
3     Grace Hopper
4      Jean Bartik
Name: names, dtype: string

The table below shows a non-exhaustive list of the string methods that are available in pandas.

| Function (preceded by .str.) | What it does |
| --- | --- |
| len() | Length of string. |
| lower() | Put string in lower case. |
| upper() | Put string in upper case. |
| capitalize() | Put string in leading upper case. |
| swapcase() | Swap cases in a string. |
| translate() | Returns a copy of the string in which each character has been mapped through a given translation table. |
| ljust() | Left-justify a string by padding on the right (default is to pad with spaces) |
| rjust() | Right-justify a string by padding on the left (default is to pad with spaces) |
| center() | Pad such that string appears in centre (default is to pad with spaces) |
| zfill() | Pad with zeros |
| strip() | Strip out leading and trailing whitespace |
| rstrip() | Strip out trailing whitespace |
| lstrip() | Strip out leading whitespace |
| find() | Return the lowest index in each string where a substring appears |
| split() | Split the string using a passed substring as the delimiter |
| isupper() | Check whether string is upper case |
| isdigit() | Check whether string is composed of digits |
| islower() | Check whether string is lower case |
| startswith() | Check whether string starts with a given sub-string |

Regular expressions can also be scaled up with pandas. The table below shows vectorised regular expression methods.

| Function | What it does |
| --- | --- |
| match() | Call re.match() on each element, returning a boolean. |
| extract() | Call re.search() on each element, returning the groups of the first match as strings. |
| findall() | Call re.findall() on each element |
| replace() | Replace occurrences of pattern with some other string |
| contains() | Call re.search() on each element, returning a boolean |
| count() | Count occurrences of pattern |
| split() | Equivalent to str.split(), but accepts regexes |
| rsplit() | Equivalent to str.rsplit(); splits from the right (the separator is treated as a plain string) |
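As a minimal sketch of three of these methods, assuming pandas is installed, the snippet below applies contains(), count(), and replace() to a small series of names. The variable names and the patterns are chosen purely for illustration:

```python
import pandas as pd

names = pd.Series(["ada lovelace", "adam smith", "elinor ostrom"])

# contains(): True where the pattern is found (here, anchored at the start)
starts_with_ada = names.str.contains(r"^ada")

# count(): number of non-overlapping matches of the pattern in each string
vowel_counts = names.str.count(r"[aeiou]")

# replace(): substitute every match; regex=True treats the pattern as a regex
underscored = names.str.replace(r"\s+", "_", regex=True)

print(starts_with_ada.tolist())
print(vowel_counts.tolist())
print(underscored.tolist())
```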

Let’s see a couple of these in action. First, splitting on a given sub-string:

df["names"].str.split(" ")
0     [ada, lovelace]
1       [adam, smith]
2    [elinor, ostrom]
3     [grace, hopper]
4      [jean, bartik]
Name: names, dtype: object

It’s fairly common to want to split strings and save the results to new columns in your data frame. You can specify a maximum number of splits via the n= keyword argument, and you can get the pieces as separate columns using expand=True:

df["names"].str.split(" ", n=2, expand=True)
        0         1
0     ada  lovelace
1    adam     smith
2  elinor    ostrom
3   grace    hopper
4    jean    bartik

Exercise

Using vectorised operations, create a new column with the index position where the first vowel occurs for each row in the names column.

Here’s an example of using a regex function with pandas:

df["names"].str.extract(r"(\w+)", expand=False)
0       ada
1      adam
2    elinor
3     grace
4      jean
Name: names, dtype: string

There are a few more vectorised string operations that are useful.

| Method | Description |
| --- | --- |
| get() | Index each element |
| slice() | Slice each element |
| slice_replace() | Replace slice in each element with passed value |
| cat() | Concatenate strings |
| repeat() | Repeat values |
| normalize() | Return Unicode form of string |
| pad() | Add whitespace to left, right, or both sides of strings |
| wrap() | Split long strings into lines with length less than a given width |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Extract dummy variables as a data frame |

The get() and slice() methods give access to elements of the lists returned by split(). Here’s an example that combines split() and get():

df["names"].str.split().str.get(-1)
0    lovelace
1       smith
2      ostrom
3      hopper
4      bartik
Name: names, dtype: object

If we have a column with tags split by a symbol, we can use the get_dummies() function to split it out. For example, let’s create a data frame with a single column that mixes subject and nationality tags:

df = pd.DataFrame(
    {
        "names": [
            "ada lovelace",
            "adam smith",
            "elinor ostrom",
            "grace hopper",
            "jean bartik",
        ],
        "tags": ["uk; cs", "uk; econ", "usa; econ", "usa; cs", "usa; cs"],
    }
)
df
           names       tags
0   ada lovelace     uk; cs
1     adam smith   uk; econ
2  elinor ostrom  usa; econ
3   grace hopper    usa; cs
4    jean bartik    usa; cs

If we now use str.get_dummies() and split on "; " (a semicolon followed by a space, so that the resulting column names carry no leading space), we can get a data frame of dummies.

df["tags"].str.get_dummies("; ")
   cs  econ  uk  usa
0   1     0   1    0
1   0     1   1    0
2   0     1   0    1
3   1     0   0    1
4   1     0   0    1

17.5. Reading Text In

17.5.1. Text file

If you have just a plain text file, you can read it in like so:

fname = 'book.txt'
with open(fname, encoding='utf-8') as f:
    text_of_book = f.read()
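If you would rather work line by line, you can iterate over the file object instead of calling read(). The sketch below first writes a small stand-in for 'book.txt' to a temporary folder so that it runs on its own; the file contents are invented for the example:

```python
import tempfile
from pathlib import Path

# Create a small stand-in for 'book.txt' in a temporary folder (made-up content)
tmp_dir = tempfile.mkdtemp()
fname = Path(tmp_dir) / "book.txt"
fname.write_text("Chapter 1\nIt was a dark night.\n", encoding="utf-8")

# Iterating over the file object yields one line at a time
with open(fname, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# pathlib offers a one-line alternative for reading the whole file
whole_text = fname.read_text(encoding="utf-8")

print(lines)
```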

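You can also read a text file into a pandas data frame with one row per line. Recent versions of pandas do not accept the line terminator “\n” as the delimiter in pd.read_csv(), so one option is to split the text into lines first:

with open('book.txt', encoding='utf-8') as f:
    df = pd.DataFrame({"text": f.read().splitlines()})

In the above, each line of the file becomes one row of a single column, here called text, but you could use whatever column name you prefer.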

Exercise

Download the file ‘smith_won.txt’ using this link (use right-click and save as). Then read the text in using pandas.

17.5.2. CSV file

CSV files are already split into rows. By far the easiest way to read in CSV files is using pandas:

df = pd.read_csv('book.csv')

Remember that pandas can read many other file types too.