Parts-of-speech and lemmas with spaCy
spaCy offers part-of-speech tags (noun, verb, adverb, etc.) and word lemmas, which are standardized forms of related word groups (e.g., the lemma of both "wrote" and "writes" is "write"). See Wikipedia: Lemmatization.
Key statements
import spacy

# Load the language model and parse your document.
# (Download the model first: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp('The carnivorous cactuses ate your hamburger.')

# Print out all parts of speech and lemmas.
for token in doc:
    print('%-14s' * 3 % (token, token.pos_, token.lemma_))

# Output (from spaCy 2.x; v3+ lemmatizes "your" as "your"
# rather than -PRON-, and may tag it PRON rather than ADJ):
#
# The           DET           the
# carnivorous   ADJ           carnivorous
# cactuses      NOUN          cactus
# ate           VERB          eat
# your          ADJ           -PRON-
# hamburger     NOUN          hamburger
# .             PUNCT         .
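The pos_ and lemma_ attributes compose nicely with plain Python. Here is a short sketch (assuming the same model and sentence as above) that pulls out noun lemmas and tallies the coarse part-of-speech labels:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')
doc = nlp('The carnivorous cactuses ate your hamburger.')

# Collect the lemma of each noun in the document.
print([t.lemma_ for t in doc if t.pos_ == 'NOUN'])
# Prints: ['cactus', 'hamburger']

# Tally how often each coarse part-of-speech label occurs.
print(Counter(t.pos_ for t in doc))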
Working example
import spacy


# Set up functions to help produce human-friendly printing.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

def skip_and_print(*args):
    """ Act like print(), but skip a line before printing. """
    print('\n' + str(args[0]), *args[1:])

def print_table(rows, padding=0):
    """ Print `rows` with content-based column widths. """
    col_widths = [
        max(len(str(value)) for value in col) + padding
        for col in zip(*rows)
    ]
    total_width = sum(col_widths) + len(col_widths) - 1
    fmt = ' '.join('%%-%ds' % width for width in col_widths)
    print(fmt % tuple(rows[0]))
    print('~' * total_width)
    for row in rows[1:]:
        print(fmt % tuple(row))
# Load a language model and parse a document.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

nlp = spacy.load('en_core_web_sm')

doc_string = 'Curse your sudden but inevitable betrayal!'
skip_and_print('Original string: "%s"' % doc_string)

doc = nlp(doc_string)
# Lemmatization.
# ~~~~~~~~~~~~~~

# The lemma of a word is a standardized version of it that
# attempts to throw away inflection and capitalization.
# Note that everything is lowercased. In spaCy 2.x, "your"
# also becomes the placeholder lemma -PRON-; spaCy v3+
# simply keeps "your".

skip_and_print('Lemmas in doc:')
print([token.lemma_ for token in doc])
# Lemmas are available as integer values, so there's no need to
# create a string-to-integer map for the token.lemma_ values.
skip_and_print('Lemma integers in doc:')
print([token.lemma for token in doc])
# Many `Token` attributes have both a .thing form and a .thing_
# form (like .lemma / .lemma_ here); the first is an integer for
# use by code; the second is a string for use by humans.
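# The integer forms are hashes into the model's shared
# StringStore, which maps in both directions. A minimal
# sketch using the doc parsed above:
lemma_id = doc[-2].lemma                 # "betrayal" as an integer
assert nlp.vocab.strings[lemma_id] == doc[-2].lemma_
assert nlp.vocab.strings[doc[-2].lemma_] == lemma_id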
# Parts of speech.
# ~~~~~~~~~~~~~~~~
# Each word has both a simple (pos_) and detailed (tag_)
# part-of-speech tag.
skip_and_print('Part-of-speech information for each token:')
rows = [['token', 'pos_', 'tag_']]
for token in doc:
    rows.append([token, token.pos_, token.tag_])
print_table(rows, padding=5)
# The underscore-free .pos and .tag attributes provide integer
# values corresponding to their .pos_ / .tag_ string variants.
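# Because .pos is an integer, you can compare it against the
# constants in spacy.symbols instead of matching strings.
# (A small sketch; output may vary slightly by model version.)
from spacy.symbols import NOUN
skip_and_print('Nouns, found via integer comparison:')
print([token.text for token in doc if token.pos == NOUN])
# Prints, e.g.: ['betrayal']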
# spaCy also has a general top-level explain() function that
# provides explanations of tag strings and other constant-value
# language model strings that it knows about.
skip_and_print('spacy.explain() for the tag "PRP$" of "your":')
print(spacy.explain(doc[1].tag_))
# Prints: pronoun, possessive
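# explain() also appears to cover the coarse pos_ labels:
skip_and_print("spacy.explain() for the pos_ label 'PUNCT':")
print(spacy.explain('PUNCT'))
# Prints: punctuation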
Notes
Learn what spaCy's fine-grained part-of-speech codes (such as JJ, CC, PRP$) mean from spaCy's part-of-speech tag docs; for English they follow the Penn Treebank tag set.
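You can also look these codes up programmatically. Here is a small, self-contained sketch that reuses spacy.explain() from the example above to annotate each token's fine-grained tag:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Curse your sudden but inevitable betrayal!')

# Print each token's fine-grained tag with its explanation.
for token in doc:
    print('%-12s %-5s %s' % (token, token.tag_, spacy.explain(token.tag_)))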