Dependency trees with spaCy
A dependency tree is a grammatical structure over a sentence or phrase that delineates the dependencies between a word (such as a verb) and the phrases it builds upon (such as that verb's subject and object phrases). See Jurafsky: Dependency Parsing.
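The structure itself is independent of any library. As a minimal sketch, a parsed sentence can be encoded by giving each token the index of its head, with the root pointing at itself; the sentence, arcs, and helper below are illustrative, not spaCy's API:

```python
# Illustrative encoding of a dependency tree: heads[i] is the index
# of token i's head; the root ('lay') points at itself.
words = ['The', 'Inn', 'lay', 'in', 'silence']
heads = [1, 2, 2, 2, 3]  # The->Inn, Inn->lay, lay (root), in->lay, silence->in

def children(i):
    """ Return indices of tokens whose head is token i. """
    return [j for j, h in enumerate(heads) if h == i and j != i]

print(children(2))  # Direct dependents of 'lay': 'Inn' and 'in'.
```

spaCy exposes the same relation through `token.head` and `token.children`; the snippets below use the real API.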
Key statements
# Inputs: document_string (a str)

import spacy

# Load a language model and parse a document.
# (spaCy v3+ requires a full model name such as 'en_core_web_sm';
# older versions accepted the shortcut 'en'.)
nlp = spacy.load('en_core_web_sm')
doc = nlp(document_string)

# Print all noun chunks.
# These are contiguous noun phrases.
for chunk in doc.noun_chunks:
    print(chunk)

# Print the head word of each sentence.
# This is the grammatically most informative word.
for sentence in doc.sents:
    print(sentence.root)

# Print the dependency subtree of each token.
# These are the words operated upon by the token.
# (token.subtree is a generator, so materialize it for printing.)
for token in doc:
    print(token, list(token.subtree))
Working example
import spacy

# Note: Document strings in this example are from the book
# "The Name of the Wind" by Patrick Rothfuss.

# Set up functions to help produce human-friendly printing.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

def skip_and_print(*args):
    """ Act like print(), but skip a line before printing. """
    print('\n' + str(args[0]), *args[1:])

def print_table(rows, padding=0):
    """ Print `rows` with content-based column widths. """
    col_widths = [
        max(len(str(value)) for value in col) + padding
        for col in zip(*rows)
    ]
    total_width = sum(col_widths) + len(col_widths) - 1
    fmt = ' '.join('%%-%ds' % width for width in col_widths)
    print(fmt % tuple(rows[0]))
    print('~' * total_width)
    for row in rows[1:]:
        print(fmt % tuple(row))
# Load a language model and parse a document.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nlp = spacy.load('en_core_web_sm')  # v3+ model name; older spaCy used 'en'.
document_string = """
The Waystone Inn lay in silence,
and it was a silence of three parts.
"""
# Remove starting, ending, and duplicated whitespace characters.
document_string = ' '.join(document_string.split())
skip_and_print('Working with string: "%s"' % document_string)
doc = nlp(document_string)
# An example rendered dependency tree.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This sentence has the following dependency tree:
#
# ┌──▸ The
# │┌─▸ Waystone
# ┌─▸└┴── Inn
# ┌───────┬┼┬───── lay
# │ ││└─▸┌── in
# │ ││ └─▸ silence
# │ │└─────▸ ,
# │ └──────▸ and
# │ ┌─▸ it
# └─▸┌┬────────┴── was
# ││ ┌─▸ a
# │└─▸┌─────┴── silence
# │ └─▸┌───── of
# │ │ ┌─▸ three
# │ └─▸└── parts
# └───────────▸ .
# Find noun chunks
# ~~~~~~~~~~~~~~~~
skip_and_print('All the found noun chunks & some properties:')
rows = [['Chunk', '.root', '.root.dep_', '.root.head']]
for chunk in doc.noun_chunks:
    rows.append([
        chunk,            # A Span object with the full phrase.
        chunk.root,       # The key Token within this phrase.
        chunk.root.dep_,  # The grammatical role of this phrase.
        chunk.root.head   # The grammatical parent Token.
    ])
print_table(rows, padding=4)
# Chunk                .root       .root.dep_     .root.head
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The Waystone Inn     Inn         nsubj          lay
# silence              silence     pobj           in
# it                   it          nsubj          was
# a silence            silence     attr           was
# three parts          parts       pobj           of
# Find the head words of sentences.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
document_string = """
It's the questions we can't answer that teach us the most.
They teach us how to think.
"""
# Remove starting, ending, and duplicated whitespace characters.
document_string = ' '.join(document_string.split())
skip_and_print('Working with string: "%s"' % document_string)
doc = nlp(document_string)
# For each sentence, spaCy identifies the root of the dependency
# tree. You can think of this as the grammatically most
# meaningful word in the sentence.
skip_and_print('Root word of each sentence:')
rows = [['Root', '|', 'Sentence']]
for sentence in doc.sents:
    rows.append([sentence.root, '|', sentence.text])
print_table(rows)
# Root  | Sentence
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 's    | It's the questions we can't answer that teach us ...
# teach | They teach us how to think.
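spaCy marks a sentence's root by making it its own head (the root's `token.head` is the token itself). Under the same kind of head-index encoding used for illustration earlier, finding the root is a one-liner; the words and arcs below are hypothetical, not spaCy output:

```python
# Illustrative sketch: the root is the unique index i with
# heads[i] == i, mirroring spaCy's root-is-its-own-head convention.
words = ['They', 'teach', 'us', 'how', 'to', 'think', '.']
heads = [1, 1, 1, 5, 5, 1, 1]  # Hypothetical arcs for this sentence.

root = next(i for i, h in enumerate(heads) if h == i)
print(words[root])
```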
# Find all the dependent tokens of a given one.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This means finding the words in a sentence being operated on
# by the given input word. Another perspective is to view words
# lower in the dependency tree (that is, being more dependent),
# as being less important to the overall sentence meaning.
skip_and_print('Dependent words (aka subtree) of some tokens:')
rows = [['Token', '|', 'Subtree']]
# Print subtrees for 'teach' in 1st sentence, 'most', and then
# 'teach' in the 2nd sentence (which are tokens 9, 12, and 15).
for token in [doc[9], doc[12], doc[15]]:
    subtree = [
        ('((%s))' if t is token else '%s') % t.text
        for t in token.subtree
    ]
    rows.append([token.text, '|', ' '.join(subtree)])
print_table(rows)
# Token | Subtree
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# teach | that ((teach)) us the most
# most  | the ((most))
# teach | They ((teach)) us how to think .
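What `token.subtree` yields above — a token plus all of its descendants — can be sketched as a transitive closure of the child relation over head indices. The arcs below are hypothetical stand-ins for the second sentence, not spaCy output:

```python
# Illustrative sketch of token.subtree: token i's subtree is i plus
# every token reachable by repeatedly following the child relation.
words = ['They', 'teach', 'us', 'how', 'to', 'think', '.']
heads = [1, 1, 1, 5, 5, 1, 1]  # Hypothetical arcs; root is 'teach'.

def subtree(i):
    """ Return sorted indices of token i and all its descendants. """
    out = {i}
    changed = True
    while changed:
        changed = False
        for j, h in enumerate(heads):
            if h in out and j not in out:
                out.add(j)
                changed = True
    return sorted(out)

print(' '.join(words[j] for j in subtree(5)))  # Subtree of 'think'.
```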
Notes
To help understand dependency trees, you can render spaCy's parse of a sentence with explosion.ai's displaCy visualizer (explosion.ai is the maker of spaCy).
You can also programmatically render spaCy's dependency trees as text using the open-source explacy repo.