Dependency trees with spaCy
A dependency tree is a grammatical structure over a sentence or phrase that delineates the dependencies between a word (such as a verb) and the phrases it builds upon (such as that verb's subject and object phrases). See Jurafsky: Dependency Parsing.
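The structure itself is independent of any library. As a minimal sketch, a parsed sentence can be encoded by giving each token the index of its head, with the root pointing at itself; the sentence, arcs, and helper below are illustrative, not spaCy's API:

```python
# Illustrative encoding of a dependency tree: heads[i] is the index
# of token i's head; the root ('lay') points at itself.
words = ['The', 'Inn', 'lay', 'in', 'silence']
heads = [1, 2, 2, 2, 3]  # The->Inn, Inn->lay, lay (root), in->lay, silence->in

def children(i):
    """ Return indices of tokens whose head is token i. """
    return [j for j, h in enumerate(heads) if h == i and j != i]

print(children(2))  # Direct dependents of 'lay': 'Inn' and 'in'.
```

spaCy exposes the same relation through `token.head` and `token.children`; the snippets below use the real API.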
Key statements
# Inputs: document_string (a str)

import spacy

# Load a language model and parse a document.
# (spaCy v3+ requires a full model name such as 'en_core_web_sm';
# older versions accepted the shortcut 'en'.)
nlp = spacy.load('en_core_web_sm')
doc = nlp(document_string)

# Print all noun chunks.
# These are contiguous noun phrases.
for chunk in doc.noun_chunks:
    print(chunk)

# Print the head word of each sentence.
# This is the grammatically most informative word.
for sentence in doc.sents:
    print(sentence.root)

# Print the dependency subtree of each token.
# These are the words operated upon by the token.
# (token.subtree is a generator, so materialize it for printing.)
for token in doc:
    print(token, list(token.subtree))
Working example
import spacy

# Note: Document strings in this example are from the book
# "The Name of the Wind" by Patrick Rothfuss.

# Set up functions to help produce human-friendly printing.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

def skip_and_print(*args):
    """ Act like print(), but skip a line before printing. """
    print('\n' + str(args[0]), *args[1:])

def print_table(rows, padding=0):
    """ Print `rows` with content-based column widths. """
    col_widths = [
        max(len(str(value)) for value in col) + padding
        for col in zip(*rows)
    ]
    total_width = sum(col_widths) + len(col_widths) - 1
    fmt = ' '.join('%%-%ds' % width for width in col_widths)
    print(fmt % tuple(rows[0]))
    print('~' * total_width)
    for row in rows[1:]:
        print(fmt % tuple(row))
# Load a language model and parse a document.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nlp = spacy.load('en_core_web_sm')  # v3+ model name; older spaCy used 'en'.
document_string = """
The Waystone Inn lay in silence,
and it was a silence of three parts.
"""
# Remove starting, ending, and duplicated whitespace characters.
document_string = ' '.join(document_string.split())
skip_and_print('Working with string: "%s"' % document_string)
doc = nlp(document_string)
# An example rendered dependency tree.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This sentence has the following dependency tree:
#
# ┌──▸ The
# │┌─▸ Waystone
# ┌─▸└┴── Inn
# ┌───────┬┼┬───── lay
# │ ││└─▸┌── in
# │ ││ └─▸ silence
# │ │└─────▸ ,
# │ └──────▸ and
# │ ┌─▸ it
# └─▸┌┬────────┴── was
# ││ ┌─▸ a
# │└─▸┌─────┴── silence
# │ └─▸┌───── of
# │ │ ┌─▸ three
# │ └─▸└── parts
# └───────────▸ .
# Find noun chunks
# ~~~~~~~~~~~~~~~~
skip_and_print('All the found noun chunks & some properties:')
rows = [['Chunk', '.root', '.root.dep_', '.root.head']]
for chunk in doc.noun_chunks:
    rows.append([
        chunk,            # A Span object with the full phrase.
        chunk.root,       # The key Token within this phrase.
        chunk.root.dep_,  # The grammatical role of this phrase.
        chunk.root.head   # The grammatical parent Token.
    ])
print_table(rows, padding=4)
# Chunk                .root       .root.dep_     .root.head
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The Waystone Inn     Inn         nsubj          lay
# silence              silence     pobj           in
# it                   it          nsubj          was
# a silence            silence     attr           was
# three parts          parts       pobj           of
# Find the head words of sentences.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
document_string = """
It's the questions we can't answer that teach us the most.
They teach us how to think.
"""
# Remove starting, ending, and duplicated whitespace characters.
document_string = ' '.join(document_string.split())
skip_and_print('Working with string: "%s"' % document_string)
doc = nlp(document_string)
# For each sentence, spaCy identifies the root of the dependency
# tree. You can think of this as the grammatically most
# meaningful word in the sentence.
skip_and_print('Root word of each sentence:')
rows = [['Root', '|', 'Sentence']]
for sentence in doc.sents:
    rows.append([sentence.root, '|', sentence.text])
print_table(rows)
# Root  | Sentence
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 's    | It's the questions we can't answer that teach us ...
# teach | They teach us how to think.
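spaCy marks a sentence's root by making it its own head (the root's `token.head` is the token itself). Under the same kind of head-index encoding used for illustration earlier, finding the root is a one-liner; the words and arcs below are hypothetical, not spaCy output:

```python
# Illustrative sketch: the root is the unique index i with
# heads[i] == i, mirroring spaCy's root-is-its-own-head convention.
words = ['They', 'teach', 'us', 'how', 'to', 'think', '.']
heads = [1, 1, 1, 5, 5, 1, 1]  # Hypothetical arcs for this sentence.

root = next(i for i, h in enumerate(heads) if h == i)
print(words[root])
```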
# Find all the dependent tokens of a given one.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This means finding the words in a sentence being operated on
# by the given input word. Another perspective is to view words
# lower in the dependency tree (that is, being more dependent),
# as being less important to the overall sentence meaning.
skip_and_print('Dependent words (aka subtree) of some tokens:')
rows = [['Token', '|', 'Subtree']]
# Print subtrees for 'teach' in 1st sentence, 'most', and then
# 'teach' in the 2nd sentence (which are tokens 9, 12, and 15).
for token in [doc[9], doc[12], doc[15]]:
    subtree = [
        ('((%s))' if t is token else '%s') % t.text
        for t in token.subtree
    ]
    rows.append([token.text, '|', ' '.join(subtree)])
print_table(rows)
# Token | Subtree
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# teach | that ((teach)) us the most
# most  | the ((most))
# teach | They ((teach)) us how to think .
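What `token.subtree` yields above — a token plus all of its descendants — can be sketched as a transitive closure of the child relation over head indices. The arcs below are hypothetical stand-ins for the second sentence, not spaCy output:

```python
# Illustrative sketch of token.subtree: token i's subtree is i plus
# every token reachable by repeatedly following the child relation.
words = ['They', 'teach', 'us', 'how', 'to', 'think', '.']
heads = [1, 1, 1, 5, 5, 1, 1]  # Hypothetical arcs; root is 'teach'.

def subtree(i):
    """ Return sorted indices of token i and all its descendants. """
    out = {i}
    changed = True
    while changed:
        changed = False
        for j, h in enumerate(heads):
            if h in out and j not in out:
                out.add(j)
                changed = True
    return sorted(out)

print(' '.join(words[j] for j in subtree(5)))  # Subtree of 'think'.
```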
Notes
To help understand dependency trees, you can render spaCy's parse of a sentence with explosion.ai's displaCy visualizer (explosion.ai is the maker of spaCy).
You can also programmatically render spaCy's dependency trees as text using the open-source explacy repo.