Parts-of-speech and lemmas with spaCy
spaCy offers part-of-speech tags (noun, verb, adverb, etc.) and word lemmas, which are standardized forms of related word groups (e.g., the lemma of both "wrote" and "writes" is "write"). See Wikipedia: Lemmatization.
Key statements
import spacy

# Load the language model and parse your document.
# (Download the model first: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp('The carnivorous cactuses ate your hamburger.')

# Print out all parts of speech and lemmas.
for token in doc:
    print('%-14s' * 3 % (token, token.pos_, token.lemma_))

# Output (from spaCy 2.x; v3+ lemmatizes "your" as "your"
# rather than -PRON-, and may tag it PRON rather than ADJ):
#
# The           DET           the
# carnivorous   ADJ           carnivorous
# cactuses      NOUN          cactus
# ate           VERB          eat
# your          ADJ           -PRON-
# hamburger     NOUN          hamburger
# .             PUNCT         .
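The pos_ and lemma_ attributes compose nicely with plain Python. Here is a short sketch (assuming the same model and sentence as above) that pulls out noun lemmas and tallies the coarse part-of-speech labels:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')
doc = nlp('The carnivorous cactuses ate your hamburger.')

# Collect the lemma of each noun in the document.
print([t.lemma_ for t in doc if t.pos_ == 'NOUN'])
# Prints: ['cactus', 'hamburger']

# Tally how often each coarse part-of-speech label occurs.
print(Counter(t.pos_ for t in doc))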
Working example
import spacy


# Set up functions to help produce human-friendly printing.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

def skip_and_print(*args):
    """ Act like print(), but skip a line before printing. """
    print('\n' + str(args[0]), *args[1:])

def print_table(rows, padding=0):
    """ Print `rows` with content-based column widths. """
    col_widths = [
        max(len(str(value)) for value in col) + padding
        for col in zip(*rows)
    ]
    total_width = sum(col_widths) + len(col_widths) - 1
    fmt = ' '.join('%%-%ds' % width for width in col_widths)
    print(fmt % tuple(rows[0]))
    print('~' * total_width)
    for row in rows[1:]:
        print(fmt % tuple(row))
# Load a language model and parse a document.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

nlp = spacy.load('en_core_web_sm')

doc_string = 'Curse your sudden but inevitable betrayal!'
skip_and_print('Original string: "%s"' % doc_string)

doc = nlp(doc_string)
# Lemmatization.
# ~~~~~~~~~~~~~~

# The lemma of a word is a standardized version of it that
# attempts to throw away inflection and capitalization.
# Note that everything is lowercased. In spaCy 2.x, "your"
# also becomes the placeholder lemma -PRON-; spaCy v3+
# simply keeps "your".

skip_and_print('Lemmas in doc:')
print([token.lemma_ for token in doc])
# Lemmas are available as integer values, so there's no need to
# create a string-to-integer map for the token.lemma_ values.
skip_and_print('Lemma integers in doc:')
print([token.lemma for token in doc])
# Many `Token` attributes have both a .thing form and a .thing_
# form (like .lemma / .lemma_ here); the first is an integer for
# use by code; the second is a string for use by humans.
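# The integer forms are hashes into the model's shared
# StringStore, which maps in both directions. A minimal
# sketch using the doc parsed above:
lemma_id = doc[-2].lemma                 # "betrayal" as an integer
assert nlp.vocab.strings[lemma_id] == doc[-2].lemma_
assert nlp.vocab.strings[doc[-2].lemma_] == lemma_id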
# Parts of speech.
# ~~~~~~~~~~~~~~~~
# Each word has both a simple (pos_) and detailed (tag_)
# part-of-speech tag.
skip_and_print('Part-of-speech information for each token:')
rows = [['token', 'pos_', 'tag_']]
for token in doc:
    rows.append([token, token.pos_, token.tag_])
print_table(rows, padding=5)
# The underscore-free .pos and .tag attributes provide integer
# values corresponding to their .pos_ / .tag_ string variants.
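# Because .pos is an integer, you can compare it against the
# constants in spacy.symbols instead of matching strings.
# (A small sketch; output may vary slightly by model version.)
from spacy.symbols import NOUN
skip_and_print('Nouns, found via integer comparison:')
print([token.text for token in doc if token.pos == NOUN])
# Prints, e.g.: ['betrayal']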
# spaCy also has a general top-level explain() function that
# provides explanations of tag strings and other constant-value
# language model strings that it knows about.
skip_and_print('spacy.explain() for the tag "PRP$" of "your":')
print(spacy.explain(doc[1].tag_))
# Prints: pronoun, possessive
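# explain() also appears to cover the coarse pos_ labels:
skip_and_print("spacy.explain() for the pos_ label 'PUNCT':")
print(spacy.explain('PUNCT'))
# Prints: punctuation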
Notes
Learn what spaCy's fine-grained part-of-speech codes (such as JJ, CC, PRP$) mean from spaCy's part-of-speech tag docs; for English they follow the Penn Treebank tag set.
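You can also look these codes up programmatically. Here is a small, self-contained sketch that reuses spacy.explain() from the example above to annotate each token's fine-grained tag:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Curse your sudden but inevitable betrayal!')

# Print each token's fine-grained tag with its explanation.
for token in doc:
    print('%-12s %-5s %s' % (token, token.tag_, spacy.explain(token.tag_)))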