Introduction to Text#

This chapter covers how to use code to work with text as data, including opening files with text in, changing and cleaning text, and vectorised operations on text.

It has benefitted from the Python String Cook Book and Jake VanderPlas’ Python Data Science Handbook.

Note that regexes are mentioned a few times in this chapter; you’ll find out much more about them in the Regular Expressions, aka regex chapter.

An aside on encodings#

Before we get to the good stuff, we need to talk about string encodings. Whether you’re using code or a text editor (Notepad, Word, Pages, Visual Studio Code), every bit of text that you see on a computer will have an encoding behind the scenes that tells the computer how to display the underlying data. There is no such thing as ‘plain’ text: all text on computers is the result of an encoding. Oftentimes, a computer programme (email reader, Word, whatever) will guess the encoding and show you what it thinks the text should look like. But it doesn’t always know, or get it right: that is what is happening when you get an email or open a file full of weird symbols and question marks. If a computer doesn’t know whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), it simply cannot display it correctly and you get gibberish.

When it comes to encodings, there are just two things to remember: i) you should use UTF-8, the most widely used encoding of Unicode and the de facto international standard; ii) the Windows operating system tends to use either Latin 1 or Windows 1252 but (and this is good news) is moving to UTF-8.

Unicode is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.
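To make the idea of an encoding concrete, here is a minimal sketch (the word "café" is just an illustrative choice). It encodes a string to bytes using UTF-8 and then decodes those bytes with the wrong encoding, which is one way the gibberish described above can arise:

```python
text = "café"

# Each character has a Unicode code point; é is number 233
code_point = ord("é")

# Encoding turns the string into bytes; in UTF-8, é takes two bytes
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# Decoding with the wrong encoding gives gibberish rather than an error
wrong = utf8_bytes.decode("latin-1")
print(wrong)

# Decoding with the right encoding recovers the original text
right = utf8_bytes.decode("utf-8")
print(right)
```

Note that the bytes themselves never changed; only the interpretation of them did.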

Take special care when saving CSV files containing text on a Windows machine using Excel; unless you specify it, the text may not be saved in UTF-8. If your computer and you get confused enough about encodings and re-save a file with the wrong ones, you could lose data.

Hopefully you’ll never have to worry about string encodings. But if you do see weird symbols appearing in your text, at least you’ll know that there’s an encoding problem and will know where to start Googling. You can find a much more in-depth explanation of text encodings here.

Strings#

Note that there are many built-in functions for using strings in Python; you can find a comprehensive list here.

Strings are the basic data type for text in Python. They can be of any length. A string can be signalled by single quote marks or double quote marks like so:

'text'

or

"text"

Style guides tend to prefer the latter but some coders (ahem!) have a bad habit of using the former. We can put this into a variable like so:

var = "banana"

Now, if we check the type of the variable:

type(var)
str

We see that it is str, which is short for string.

Strings in Python can be indexed, so we can get certain characters out by using square brackets to say which positions we would like.

var[:3]
'ban'

The usual slicing tricks that apply to lists work for strings too, i.e. the positions you want to get can be retrieved using the var[start:stop:step] syntax. Here’s an example of getting every other character from the string starting from the 2nd position.

var[1::2]
'aaa'
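A few more slices of the same word may help fix the start:stop:step idea. This is only an illustrative sketch; the variable names are made up for the example:

```python
var = "banana"

first_char = var[0]      # a single position
middle = var[2:5]        # positions 2, 3, and 4 (the stop position is excluded)
last_two = var[-2:]      # negative positions count back from the end
every_third = var[::3]   # positions 0 and 3
print(first_char, middle, last_two, every_third)
```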

Note that strings, like tuples such as (1, 2, 3) but unlike lists such as [1, 2, 3], are immutable. This means commands like var[0] = "B" will result in an error. If you want to change a single character, you will have to replace the entire string. In this example, the command to do that would be var = "Banana".
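The following sketch shows what immutability looks like in practice: the attempted in-place change raises a TypeError, while building a new string from the pieces of the old one works. The changed_in_place flag is only there for illustration:

```python
var = "banana"

try:
    var[0] = "B"  # strings do not support item assignment
    changed_in_place = True
except TypeError as err:
    changed_in_place = False
    print(f"TypeError: {err}")

# Build a new string instead and re-assign the variable name to it
var = "B" + var[1:]
print(var)
```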

As with lists, you can find the length of a string using len():

len(var)
6

The + operator concatenates two or more strings:

second_word = "panther"
first_word = "black"
print(first_word + " " + second_word)
black panther

Note that we added a space so that the phrase made sense. Another way of achieving the same end that scales to many words more efficiently (if you have them in a list) is:

" ".join([first_word, second_word])
'black panther'
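As a small sketch of how join() scales, here it is applied to a longer list, and to a list of numbers. One thing to be aware of is that join() only accepts strings, so other types need converting first (the example lists are invented for illustration):

```python
words = ["the", "black", "panther", "sleeps"]
sentence = " ".join(words)

# join() only accepts strings, so convert other types first
numbers = [1, 2, 3]
csv_row = ",".join(str(n) for n in numbers)

print(sentence)
print(csv_row)
```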

Three useful string methods to know about are upper(), lower(), and title(). Let’s see what they do:

var = "input TEXT"
var_list = [var.upper(), var.lower(), var.title()]
print(var_list)
['INPUT TEXT', 'input text', 'Input Text']

Exercise

Reverse the string "gnirts desrever a si sihT" using indexing operations.

While we’re using print(), it has a few tricks. If we have a list, we can print out entries with a given separator:

print(*var_list, sep="; and \n")
INPUT TEXT; and 
input text; and 
Input Text

(We’ll find out more about what ‘\n’ does shortly.) To turn variables of other kinds into strings, use the str() function, for example

(
    "A boolean is either "
    + str(True)
    + " or "
    + str(False)
    + ", there are only "
    + str(2)
    + " options."
)
'A boolean is either True or False, there are only 2 options.'

In this example two boolean variables and one integer variable were converted to strings. str() generally makes an intelligent guess at how you’d like to convert your non-string type variable into a string type. You can pass a variable or a literal value to str().

f-strings#

The example above is quite verbose. Another way of combining strings with variables is via f-strings. A simple f-string looks like this:

variable = 15.32399
print(f"You scored {variable}")
You scored 15.32399

This is similar to calling str on variable and using + for concatenation but much shorter to write. You can add expressions to f-strings too:

print(f"You scored {variable**2}")
You scored 234.8246695201

This also works with functions; after all **2 is just a function with its own special syntax.

In this example, the score number that came out had a lot of (probably) uninteresting decimal places. So how do we polish the printed output? You can pass more information to the f-string to get the output formatted just the way you want. Let’s say we wanted two decimal places and a sign (although you always write + in the formatting, the sign comes out as + or - depending on the value):

print(f"You scored {variable:+.2f}")
You scored +15.32

There is a whole range of formatting options for numbers, as shown in the following table:

| Number | Format | Output | Description |
| --- | --- | --- | --- |
| 15.32347 | {:.2f} | 15.32 | Format float 2 decimal places |
| 15.32347 | {:+.2f} | +15.32 | Format float 2 decimal places with sign |
| -1 | {:+.2f} | -1.00 | Format float 2 decimal places with sign |
| 15.32347 | {:.0f} | 15 | Format float with no decimal places |
| 3 | {:0>2d} | 03 | Pad number with zeros (left padding, width 2) |
| 3 | {:*<4d} | 3*** | Pad number with *’s (right padding, width 4) |
| 13 | {:*<4d} | 13** | Pad number with *’s (right padding, width 4) |
| 1000000 | {:,} | 1,000,000 | Number format with comma separator |
| 0.25 | {:.1%} | 25.0% | Format percentage |
| 1000000000 | {:.2e} | 1.00e+09 | Exponent notation |
| 12 | {:10d} | '        12' | Right aligned (default, width 10) |
| 12 | {:<10d} | '12        ' | Left aligned (width 10) |
| 12 | {:^10d} | '    12    ' | Center aligned (width 10) |
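These formatting options can be checked in code. Here is a small sketch using the built-in format() function, which takes the same format specification as an f-string (the part after the colon). The list of examples is just a selection chosen for illustration:

```python
examples = [
    (15.32347, "+.2f"),
    (3, "0>2d"),
    (13, "*<4d"),
    (1000000, ","),
    (0.25, ".1%"),
    (12, "^10d"),
]

# Apply each format specification to its number
formatted = [format(value, spec) for value, spec in examples]

# Print with repr so that any padding spaces are visible
for (value, spec), result in zip(examples, formatted):
    print(f"{value!r:>10} | {spec:<6} | {result!r}")
```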

As well as using this page interactively through the Colab and Binder links at the top of the page, or downloading this page and using it on your own computer, you can play around with some of these options over at this link.

Special characters#

Python has a string module that comes with some useful built-in strings and characters. For example

import string

string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

gives you all of the punctuation,

string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

returns all of the basic letters in the ‘ASCII’ encoding (with .ascii_lowercase and .ascii_uppercase variants), and

string.digits
'0123456789'

gives you the numbers from 0 to 9. Finally, though less impressive visually, string.whitespace gives a string containing all of the different (there is more than one!) types of whitespace.
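Because these are ordinary strings, you can test whether a character belongs to one of them with the in keyword. A minimal sketch, with a made-up sample string:

```python
import string

# repr() makes the otherwise invisible whitespace characters visible
print(repr(string.whitespace))

sample = "price:\t10\n"

# Flag every character in the sample that counts as whitespace
whitespace_flags = [ch in string.whitespace for ch in sample]

# Keep only the digits
digits_only = "".join(ch for ch in sample if ch in string.digits)
print(digits_only)
```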

There are other special characters around; in fact, we already met the most famous of them: “\n” for new line. To actually print “\n” we have to ‘escape’ the backward slash by adding another backward slash:

print("Here is a \n new line")
print("Here is an \\n escaped new line ")
Here is a 
 new line
Here is an \n escaped new line 

The table below shows the most important escape commands:

| Code | Result |
| --- | --- |
| \' | Single Quote (useful if using ' for strings) |
| \" | Double Quote (useful if using " for strings) |
| \\ | Backslash |
| \n | New Line |
| \r | Carriage Return |
| \t | Tab |
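Here is a short sketch of some of these escape codes in use (the example strings are invented). Note that an escape code such as \n is a single character once Python has read the string, whereas the escaped version \\n is two characters, a backslash and an n:

```python
single = 'It\'s here'          # escaped single quote inside single quotes
double = "She said \"hi\""     # escaped double quotes inside double quotes
tabbed = "name\tvalue"         # a tab between two words

print(single)
print(double)
print(tabbed)

newline_length = len("\n")     # one character: the new line itself
escaped_length = len("\\n")    # two characters: a backslash and the letter n
print(newline_length, escaped_length)
```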

Methods for Strings#

Let’s end this sub-section on strings with a comprehensive overview of all string methods, courtesy of the excellent rich package.

from rich import inspect

var_of_type_str = "string"
inspect(var_of_type_str, methods=True)
╭───────────────────────────────────────────────── <class 'str'> ─────────────────────────────────────────────────╮
 str(object='') -> str                                                                                           
 str(bytes_or_buffer[, encoding[, errors]]) -> str                                                               
                                                                                                                 
 ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ 
  'string'                                                                                                     
 ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 
                                                                                                                 
   capitalize = def capitalize(): Return a capitalized version of the string.                                    
     casefold = def casefold(): Return a version of the string suitable for caseless comparisons.                
       center = def center(width, fillchar=' ', /): Return a centered string of length width.                    
        count = def count(...) S.count(sub[, start[, end]]) -> int                                               
       encode = def encode(encoding='utf-8', errors='strict'): Encode the string using the codec registered for  
                encoding.                                                                                        
     endswith = def endswith(...) S.endswith(suffix[, start[, end]]) -> bool                                     
   expandtabs = def expandtabs(tabsize=8): Return a copy where all tab characters are expanded using spaces.     
         find = def find(...) S.find(sub[, start[, end]]) -> int                                                 
       format = def format(...) S.format(*args, **kwargs) -> str                                                 
   format_map = def format_map(...) S.format_map(mapping) -> str                                                 
        index = def index(...) S.index(sub[, start[, end]]) -> int                                               
      isalnum = def isalnum(): Return True if the string is an alpha-numeric string, False otherwise.            
      isalpha = def isalpha(): Return True if the string is an alphabetic string, False otherwise.               
      isascii = def isascii(): Return True if all characters in the string are ASCII, False otherwise.           
    isdecimal = def isdecimal(): Return True if the string is a decimal string, False otherwise.                 
      isdigit = def isdigit(): Return True if the string is a digit string, False otherwise.                     
 isidentifier = def isidentifier(): Return True if the string is a valid Python identifier, False otherwise.     
      islower = def islower(): Return True if the string is a lowercase string, False otherwise.                 
    isnumeric = def isnumeric(): Return True if the string is a numeric string, False otherwise.                 
  isprintable = def isprintable(): Return True if the string is printable, False otherwise.                      
      isspace = def isspace(): Return True if the string is a whitespace string, False otherwise.                
      istitle = def istitle(): Return True if the string is a title-cased string, False otherwise.               
      isupper = def isupper(): Return True if the string is an uppercase string, False otherwise.                
         join = def join(iterable, /): Concatenate any number of strings.                                        
        ljust = def ljust(width, fillchar=' ', /): Return a left-justified string of length width.               
        lower = def lower(): Return a copy of the string converted to lowercase.                                 
       lstrip = def lstrip(chars=None, /): Return a copy of the string with leading whitespace removed.          
    maketrans = def maketrans(...) Return a translation table usable for str.translate().                        
    partition = def partition(sep, /): Partition the string into three parts using the given separator.          
 removeprefix = def removeprefix(prefix, /): Return a str with the given prefix string removed if present.       
 removesuffix = def removesuffix(suffix, /): Return a str with the given suffix string removed if present.       
      replace = def replace(old, new, count=-1, /): Return a copy with all occurrences of substring old replaced 
                by new.                                                                                          
        rfind = def rfind(...) S.rfind(sub[, start[, end]]) -> int                                               
       rindex = def rindex(...) S.rindex(sub[, start[, end]]) -> int                                             
        rjust = def rjust(width, fillchar=' ', /): Return a right-justified string of length width.              
   rpartition = def rpartition(sep, /): Partition the string into three parts using the given separator.         
       rsplit = def rsplit(sep=None, maxsplit=-1): Return a list of the substrings in the string, using sep as   
                the separator string.                                                                            
       rstrip = def rstrip(chars=None, /): Return a copy of the string with trailing whitespace removed.         
        split = def split(sep=None, maxsplit=-1): Return a list of the substrings in the string, using sep as    
                the separator string.                                                                            
   splitlines = def splitlines(keepends=False): Return a list of the lines in the string, breaking at line       
                boundaries.                                                                                      
   startswith = def startswith(...) S.startswith(prefix[, start[, end]]) -> bool                                 
        strip = def strip(chars=None, /): Return a copy of the string with leading and trailing whitespace       
                removed.                                                                                         
     swapcase = def swapcase(): Convert uppercase characters to lowercase and lowercase characters to uppercase. 
        title = def title(): Return a version of the string where each word is titlecased.                       
    translate = def translate(table, /): Replace each character in the string using the given translation table. 
        upper = def upper(): Return a copy of the string converted to uppercase.                                 
        zfill = def zfill(width, /): Pad a numeric string with zeros on the left, to fill a field of the given   
                width.                                                                                           
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Cleaning Text#

You often want to make changes to the text you’re working with. In this section, we’ll look at the various options to do this.

Replacing sub-strings#

A common text task is to replace a substring within a longer string. Let’s say you have a string variable var. You can use .replace(old_text, new_text) to do this.

"Value is objective".replace("objective", "subjective")
'Value is subjective'

As with any variable of a specific type (here, string), this would also work with variables:

text = "Value is objective"
old_substr = "objective"
new_substr = "subjective"
text.replace(old_substr, new_substr)
'Value is subjective'

Note that .replace() performs an exact replace and so is case-sensitive.
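A quick sketch of what case-sensitivity means here, using an invented sentence. One simple workaround, if you do not need to preserve the original case, is to lower-case the text first:

```python
text = "Objective value is objective"

# Only the exact, lower case match is replaced
exact = text.replace("objective", "subjective")

# Lower-casing first makes the replacement apply to both occurrences
case_insensitive = text.lower().replace("objective", "subjective")

# The optional third argument limits the number of replacements
first_only = "a-b-c".replace("-", "+", 1)

print(exact)
print(case_insensitive)
print(first_only)
```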

Replacing characters with translate#

A character is an individual entry within a string, like the ‘l’ in ‘equilibrium’. You can always count the number of characters in a string variable called var by using len(var). A very fast method for replacing individual characters in a string is str.translate().

Replacing characters is extremely useful in certain situations, most commonly when you wish to remove all punctuation prior to doing other text analysis. You can use the built-in string.punctuation for this.

Let’s see how to use it to remove all of the lowercase vowels from some text (uppercase vowels are left alone because the translation is case-sensitive). With apologies to economist Lisa Cook, we’ll use the abstract from Cook [2011] as the text we’ll modify and we’ll first create a dictionary of translations of vowels to nothing, i.e. "".

example_text = "Much recent work has focused on the influence of social capital on innovative outcomes. Little research has been done on disadvantaged groups who were often restricted from participation in social networks that provide information necessary for invention and innovation. Unique new data on African American inventors and patentees between 1843 and 1930 permit an empirical investigation of the relation between social capital and economic outcomes. I find that African Americans used both traditional, i.e., occupation-based, and nontraditional, i.e., civic, networks to maximize inventive output and that laws constraining social-capital formation are most negatively correlated with economically important inventive activity."
vowels = "aeiou"
translation_dict = {x: "" for x in vowels}
translation_dict
{'a': '', 'e': '', 'i': '', 'o': '', 'u': ''}

Now we turn our dictionary into a string translator and apply it to our text:

translator = example_text.maketrans(translation_dict)
example_text.translate(translator)
'Mch rcnt wrk hs fcsd n th nflnc f scl cptl n nnvtv tcms. Lttl rsrch hs bn dn n dsdvntgd grps wh wr ftn rstrctd frm prtcptn n scl ntwrks tht prvd nfrmtn ncssry fr nvntn nd nnvtn. Unq nw dt n Afrcn Amrcn nvntrs nd ptnts btwn 1843 nd 1930 prmt n mprcl nvstgtn f th rltn btwn scl cptl nd cnmc tcms. I fnd tht Afrcn Amrcns sd bth trdtnl, .., ccptn-bsd, nd nntrdtnl, .., cvc, ntwrks t mxmz nvntv tpt nd tht lws cnstrnng scl-cptl frmtn r mst ngtvly crrltd wth cnmclly mprtnt nvntv ctvty.'

Exercise

Use translate() to replace all punctuation from the following sentence with spaces: “The well-known story I told at the conferences [about hypocondria] in Boston, New York, Philadelphia,…and Richmond went as follows: It amused people who knew Tommy to hear this; however, it distressed Suzi when Tommy (1982–2019) asked, “How can I find out who yelled, ‘Fire!’ in the theater?” and then didn’t wait to hear Missy give the answer—‘Dick Tracy.’”

Generally, str.translate() is very fast at replacing individual characters in strings. But you can also do it using a list comprehension and a join() of the resulting list, like so:

"".join(
    [
        ch
        for ch in "Example. string. with- excess_ [punctuation]/,"
        if ch not in string.punctuation
    ]
)
'Example string with excess punctuation'
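For comparison, here is a sketch of the same punctuation removal done with translate(). It relies on the three-argument form of str.maketrans(), in which every character in the third argument is mapped to nothing, i.e. deleted:

```python
import string

messy = "Example. string. with- excess_ [punctuation]/,"

# The first two arguments are empty because no characters are being swapped;
# the third lists the characters to delete
remover = str.maketrans("", "", string.punctuation)
clean = messy.translate(remover)
print(clean)
```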

Slugifying#

A special case of string cleaning occurs when you are given text with lots of non-standard characters in, and spaces, and other symbols; and what you want is a clean string suitable for a filename or column heading in a dataframe. Remember that it’s best practice to have filenames that don’t have spaces in. Slugifying is the process of creating the latter from the former and we can use the slugify package (installed under the name python-slugify) to do it.

Here are some examples of slugifying text:

from slugify import slugify

txt = "the quick brown fox jumps over the lazy dog"
slugify(txt, stopwords=["the"])
'quick-brown-fox-jumps-over-lazy-dog'

In this very simple example, the words listed in the stopwords= keyword argument (a list), are removed and spaces are replaced by hyphens. Let’s now see a more complicated example:

slugify("当我的信息改变时... àccêntæd tËXT  ")
'dang-wo-de-xin-xi-gai-bian-shi-accentaed-text'

Slugify converts text to Latin characters, while also removing accents and whitespace of all kinds (the last whitespace character in the example is a tab). There’s also a replacements= keyword argument that will replace specific strings with other strings using a list of lists format, eg replacements=[['old_text', 'new_text']]
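To give a feel for what slugifying involves, here is a rough sketch that uses only the standard library. It is not how the slugify package is implemented, and simple_slug is a hypothetical helper written for this illustration. In particular, it simply drops characters that have no ASCII equivalent (such as Chinese characters) rather than converting them to Latin characters:

```python
import re
import unicodedata


def simple_slug(text):
    """Hypothetical helper: a rough, standard-library-only slug."""
    # Split accented characters into a base letter plus an accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop anything that is not plain ASCII (including the accent marks)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace each run of characters that are not letters or digits with a hyphen
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(simple_slug("Café  Münster (final).csv"))
```

For anything beyond simple cases, the dedicated package is likely to be the better choice.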

Splitting strings#

If you want to split a string at a certain position, there are two quick ways to do it. The first is to use indexing methods, which work well if you know at which position you want to split text, eg

"This is a sentence and we will split it at character 18"[:18]
'This is a sentence'

Next up we can use the built-in split() method, which breaks a string wherever a given sub-string occurs and returns a list of the pieces in between:

"This is a sentence. And another sentence. And a third sentence".split(".")
['This is a sentence', ' And another sentence', ' And a third sentence']

Note that the character used to split the string is removed from the resulting list of strings. Let’s see an example with a string used for splitting instead of a single character:

"This is a sentence. And another sentence. And a third sentence".split("sentence")
['This is a ', '. And another ', '. And a third ', '']

A useful extra function to know about is splitlines(), which splits a string at line breaks and returns the split parts as a list.
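Here is a minimal sketch of splitlines() on an invented string that mixes two kinds of line break. For contrast, it also shows split() called with no arguments, which splits on any run of whitespace:

```python
poem = "At six o'clock\nwe were waiting\r\nfor coffee"

# splitlines() recognises both \n and \r\n as line boundaries
lines = poem.splitlines()

# split() with no arguments splits on any whitespace, including line breaks
words = poem.split()

print(lines)
print(words)
```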

count and find#

Let’s do some simple counting of words within text using str.count(). Let’s use the first verse of Elizabeth Bishop’s sestina ‘A Miracle for Breakfast’ for our text.

text = "At six o'clock we were waiting for coffee, \n waiting for coffee and the charitable crumb \n that was going to be served from a certain balcony \n --like kings of old, or like a miracle. \n It was still dark. One foot of the sun \n steadied itself on a long ripple in the river."
word = "coffee"
print(f'The word "{word}" appears {text.count(word)} times.')
The word "coffee" appears 2 times.

Meanwhile, find() returns the position of the first occurrence of a particular word or character (or -1 if it does not occur at all).

text.find(word)
35

We can check this using the number we get and some string indexing:

text[text.find(word) : text.find(word) + len(word)]
'coffee'

But this isn’t the only place where the word ‘coffee’ appears. If we want to find the last occurrence, it’s

text.rfind(word)
57
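If you want every position rather than just the first or last, one option is to call find() repeatedly, using its optional second argument to say where the search should start. The sketch below does this; find_all is a hypothetical helper written for this example, not a built-in method:

```python
def find_all(text, sub):
    """Hypothetical helper: return the start position of every occurrence."""
    positions = []
    start = text.find(sub)
    while start != -1:  # find() returns -1 once there are no more matches
        positions.append(start)
        # Resume the search just after the start of the previous match
        start = text.find(sub, start + 1)
    return positions


line = "waiting for coffee, waiting for coffee"
coffee_positions = find_all(line, "coffee")
tea_positions = find_all(line, "tea")
print(coffee_positions, tea_positions)
```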

Scaling up from a single string to a corpus#

For this section, it’s useful to be familiar with the pandas package, which is covered in the Data Analysis Quickstart and Working with Data sections. This section will closely follow the treatment by Jake VanderPlas.

We’ve seen how to work with individual strings. But often we want to work with a group of strings, otherwise known as a corpus, that is a collection of texts. It could be a collection of words, sentences, paragraphs, or some domain-based grouping (eg job descriptions).

Fortunately, many of the methods that we have seen deployed on a single string can be straightforwardly scaled up to hundreds, thousands, or millions of strings using pandas or other tools. This scaling up is achieved via vectorisation, in analogy with going from a single value (a scalar) to multiple values in a list (a vector).

As a very minimal example, here is capitalisation of names vectorised using a list comprehension:

[name.capitalize() for name in ["ada", "adam", "elinor", "grace", "jean"]]
['Ada', 'Adam', 'Elinor', 'Grace', 'Jean']

A pandas series can be used in place of a list. Let’s create the series first:

import pandas as pd

dfs = pd.Series(
    ["ada lovelace", "adam smith", "elinor ostrom", "grace hopper", "jean bartik"],
    dtype="string",
)
dfs
0     ada lovelace
1       adam smith
2    elinor ostrom
3     grace hopper
4      jean bartik
dtype: string

Now we use the syntax series.str.function to change the text series:

dfs.str.title()
0     Ada Lovelace
1       Adam Smith
2    Elinor Ostrom
3     Grace Hopper
4      Jean Bartik
dtype: string

If we had a dataframe and not a series, the syntax would change to refer just to the column of interest like so:

df = pd.DataFrame(dfs, columns=["names"])
df["names"].str.title()
0     Ada Lovelace
1       Adam Smith
2    Elinor Ostrom
3     Grace Hopper
4      Jean Bartik
Name: names, dtype: string

The table below shows a non-exhaustive list of the string methods that are available in pandas.

| Function (preceded by .str.) | What it does |
| --- | --- |
| len() | Length of string. |
| lower() | Put string in lower case. |
| upper() | Put string in upper case. |
| capitalize() | Put string in leading upper case. |
| swapcase() | Swap cases in a string. |
| translate() | Returns a copy of the string in which each character has been mapped through a given translation table. |
| ljust() | Left pad a string (default is to pad with spaces) |
| rjust() | Right pad a string (default is to pad with spaces) |
| center() | Pad such that string appears in centre (default is to pad with spaces) |
| zfill() | Pad with zeros |
| strip() | Strip out leading and trailing whitespace |
| rstrip() | Strip out trailing whitespace |
| lstrip() | Strip out leading whitespace |
| find() | Return the lowest index in the data where a substring appears |
| split() | Split the string using a passed substring as the delimiter |
| isupper() | Check whether string is upper case |
| isdigit() | Check whether string is composed of digits |
| islower() | Check whether string is lower case |
| startswith() | Check whether string starts with a given sub-string |

Regular expressions can also be scaled up with pandas. The table below shows the vectorised regular expression methods.

| Function | What it does |
| --- | --- |
| match() | Call re.match() on each element, returning a boolean. |
| extract() | Call re.search() on each element, returning matched groups as strings. |
| findall() | Call re.findall() on each element |
| replace() | Replace occurrences of pattern with some other string |
| contains() | Call re.search() on each element, returning a boolean |
| count() | Count occurrences of pattern |
| split() | Equivalent to str.split(), but accepts regexes |
| rsplit() | Equivalent to str.rsplit(), but accepts regexes |

Let’s see a couple of these in action. First, splitting on a given sub-string:

df["names"].str.split(" ")
0     [ada, lovelace]
1       [adam, smith]
2    [elinor, ostrom]
3     [grace, hopper]
4      [jean, bartik]
Name: names, dtype: object

It’s fairly common that you want to split out strings and save the results to new columns in your dataframe. You can specify a (max) number of splits via the n= kwarg and you can get the results as separate columns using expand=True:

df["names"].str.split(" ", n=2, expand=True)
        0         1
0     ada  lovelace
1    adam     smith
2  elinor    ostrom
3   grace    hopper
4    jean    bartik

Exercise

Using vectorised operations, create a new column with the index position where the first vowel occurs for each row in the names column.

Here’s an example of using a regex function with pandas:

df["names"].str.extract(r"(\w+)", expand=False)
0       ada
1      adam
2    elinor
3     grace
4      jean
Name: names, dtype: string
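Here is a sketch of a few of the other regex-aware methods from the table above, applied to a small, self-contained series (the names series is created just for this example, and the patterns are illustrative choices):

```python
import pandas as pd

names = pd.Series(["ada lovelace", "adam smith", "elinor ostrom"])

# True where the string starts with the whole word "ada"
starts_with_ada = names.str.contains(r"^ada\b")

# Number of lower case vowels in each string
vowel_counts = names.str.count(r"[aeiou]")

# Replace each run of whitespace with an underscore
underscored = names.str.replace(r"\s+", "_", regex=True)

print(starts_with_ada.tolist())
print(vowel_counts.tolist())
print(underscored.tolist())
```

Passing regex=True to replace() makes it explicit that the pattern should be treated as a regular expression rather than as literal text.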

There are a few more vectorised string operations that are useful.

| Method | Description |
| --- | --- |
| get() | Index each element |
| slice() | Slice each element |
| slice_replace() | Replace slice in each element with passed value |
| cat() | Concatenate strings |
| repeat() | Repeat values |
| normalize() | Return Unicode form of string |
| pad() | Add whitespace to left, right, or both sides of strings |
| wrap() | Split long strings into lines with length less than a given width |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Extract dummy variables as a dataframe |

The get() and slice() methods give access to elements of the lists returned by split(). Here’s an example that combines split() and get():

df["names"].str.split().str.get(-1)
0    lovelace
1       smith
2      ostrom
3      hopper
4      bartik
Name: names, dtype: object

We already saw get_dummies() in the Regression chapter, but it’s worth revisiting it here with strings. If we have a column with tags split by a symbol, we can use this function to split it out. For example, let’s create a dataframe with a single column that mixes subject and nationality tags:

df = pd.DataFrame(
    {
        "names": [
            "ada lovelace",
            "adam smith",
            "elinor ostrom",
            "grace hopper",
            "jean bartik",
        ],
        "tags": ["uk; cs", "uk; econ", "usa; econ", "usa; cs", "usa; cs"],
    }
)
df
           names       tags
0   ada lovelace     uk; cs
1     adam smith   uk; econ
2  elinor ostrom  usa; econ
3   grace hopper    usa; cs
4    jean bartik    usa; cs

If we now use str.get_dummies and split on "; " (a semi-colon followed by a space, so that the resulting column names have no leading spaces) we can get a dataframe of dummies.

df["tags"].str.get_dummies("; ")
   cs  econ  uk  usa
0   1     0   1    0
1   0     1   1    0
2   0     1   0    1
3   1     0   0    1
4   1     0   0    1

Reading Text In#

Text file#

If you have just a plain text file, you can read it in like so:

fname = 'book.txt'
with open(fname, encoding='utf-8') as f:
    text_of_book = f.read()
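If you do not have a text file to hand, the following self-contained sketch writes a small one to a temporary folder and reads it back in. The file contents are made up for the example; the point is that specifying encoding="utf-8" on both the write and the read means non-ASCII characters survive the round trip:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    fname = Path(tmp_dir) / "book.txt"

    # Write a tiny file, stating the encoding explicitly
    fname.write_text("Chapter 1\nIt was a café.\n", encoding="utf-8")

    # Read it back in using the same encoding
    with open(fname, encoding="utf-8") as f:
        text_of_book = f.read()

# Split the text into a list with one entry per line
lines = text_of_book.splitlines()
print(lines)
```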

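You can also read a text file into a pandas dataframe, with one row per line of text. Recent versions of pandas raise an error if you pass “\n” as the delimiter to pd.read_csv(), so a simple alternative is to split the text into lines first:

with open('book.txt', encoding='utf-8') as f:
    df = pd.DataFrame({"text": f.read().splitlines()})

In the above, splitlines() breaks the text at each new line, so every line of the file becomes a row in a single column called text (you could use whatever column name you prefer).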

Exercise

Download the file ‘smith_won.txt’ from this book’s github repository using this link (use right-click and save as). Then read the text in using pandas.

CSV file#

CSV files are already split into rows. By far the easiest way to read in csv files is using pandas:

df = pd.read_csv('book.csv')

Remember that pandas can read many other file types too.