Parsing a Single YAML Doc in Python

Last Updated: 2021-03-15 14:48

Background

For better or for worse (better for me…maybe worse for everyone else), I decided to write my own static site generator to create this blog. One of the things I wanted to emulate was the split YAML/markdown scheme used by Jekyll, e.g.:

---
title: This is my post title!
description: This is a summary description of what the post is all about!
...
# Heading

Article prose.

What about the rest of the stream?

PyYAML makes it easy to load a single YAML doc from a file (yaml.safe_load()) or all of the YAML docs in a multi-doc file (yaml.safe_load_all()). However, it doesn’t have any high-level function that parses off just the first YAML doc and leaves the rest of the data hanging out there in the stream, with the file position sitting right after that doc. (NOTE: You can use Loader.get_data() to load just the first document if you don’t need to retain the file position for subsequent reads!)
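To make that concrete, here is a small sketch. The sample text and variable names are mine, and it assumes an input short enough that PyYAML's reader swallows it in its first read. yaml.safe_load() refuses the mixed file outright, while get_data() hands back the first doc but leaves nothing behind to read:

```python
import io

import yaml

# Hypothetical sample: YAML front matter followed by Markdown.
text = "---\ntitle: Hello\n...\n# Heading\n\nArticle prose.\n"

# safe_load() expects exactly one document in the stream, so it raises here.
try:
    yaml.safe_load(text)
    single_doc_ok = True
except yaml.YAMLError:
    single_doc_ok = False

# get_data() constructs just the next (here: first) document...
stream = io.StringIO(text)
loader = yaml.SafeLoader(stream)
try:
    first = loader.get_data()
finally:
    loader.dispose()

# ...but the loader has already read ahead in the stream, so for an input
# this short there is nothing left for us to read afterwards.
rest = stream.read()

print(single_doc_ok, first, repr(rest))
```

So the data comes back fine either way; it is the stream position that gets lost, which is the part the hack below works around.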

The Hack

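I got around this using the PyYAML events API: pull parse events off a loader one at a time, stop at the first document end event, re-emit the collected events as a standalone doc, and seek the stream to just past the end marker.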
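If you haven't poked at the events API before, this is roughly what the event stream looks like for a tiny front-matter block. It's a minimal sketch using yaml.parse() on a made-up sample string, stopping at the first document end:

```python
import yaml

# Hypothetical sample: one small YAML doc, then some trailing text.
text = "---\ntitle: Hello\n...\nrest of the file\n"

event_names = []
for event in yaml.parse(text, Loader=yaml.SafeLoader):
    event_names.append(type(event).__name__)
    # The first DocumentEndEvent marks the end of the front matter.
    if isinstance(event, yaml.DocumentEndEvent):
        break

for name in event_names:
    print(name)
```

Everything between the DocumentStartEvent and the DocumentEndEvent describes the front matter; nothing after the `...` marker has been parsed yet.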

yml_topdoc.py
"""Parse document-level config from an input file."""

import yaml

def load_first_yaml(stream):
    """
    Read only the first YAML doc from a multi-doc stream, where the later
    docs may not be YAML at all. On return, the stream is positioned just
    past the first doc's end marker.
    """
    loader = yaml.SafeLoader(stream)
    yaml_doc = {}
    parse_events = []
    doc_end = None

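    # Parse the yaml doc, event by event, until document end:
    while True:
        parse_event = loader.get_event()
        # With no document in the input, get_event() returns a
        # StreamEndEvent and then None; stop rather than loop forever.
        if parse_event is None or isinstance(parse_event, yaml.StreamEndEvent):
            parse_events = []
            break
        if isinstance(parse_event, yaml.DocumentEndEvent):
            doc_end = loader.get_mark()
            break
        parse_events.append(parse_event)

    # Emit the parsed events as a doc and load:
    if parse_events:
        yaml_doc = yaml.safe_load(yaml.emit(parse_events))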

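    # Reset the file pointer to just past the document end marker and its
    # newline. NOTE: the mark's index counts decoded characters, so it only
    # equals a byte offset when everything up to the marker is single-byte
    # (e.g. ASCII):
    if doc_end is not None:
        stream.seek(doc_end.index+1)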

    # Return the yaml doc.
    # The stream provided as input now points immediately after the doc end.
    return yaml_doc


if __name__ == '__main__':
    import sys
    import pprint
    with open(sys.argv[1], 'rb') as f:
        yaml_doc = load_first_yaml(f)

        print('\n=== Yaml doc: ===')
        pprint.pprint(yaml_doc, indent=4, compact=False)

        print('\n=== Remainder of file: ===')
        print(f.read())
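The emit-and-reload step in the middle of load_first_yaml() can also be tried on its own. This is a sketch under the same assumptions as the script, using an in-memory string instead of a file and yaml.parse() instead of a hand-driven loader:

```python
import yaml

# Same shape as the example file: front matter, then Markdown.
text = (
    "---\n"
    "field: value\n"
    "object:\n"
    "  stuff:\n"
    "  - 1\n"
    "  - 2\n"
    "  - c\n"
    "...\n"
    "# Markdown text starts here!\n"
)

# Collect every event up to (but not including) the first document end.
events = []
for event in yaml.parse(text, Loader=yaml.SafeLoader):
    if isinstance(event, yaml.DocumentEndEvent):
        break
    events.append(event)

# Turn the events back into YAML text, then load that text the normal way.
yaml_text = yaml.emit(events)
doc = yaml.safe_load(yaml_text)

print(doc)
```

Round-tripping through text is a bit wasteful, but it means the actual object construction is still done by safe_load() rather than by anything hand-rolled.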

Example

test.md
---
# YAML configuration data:
field: value
object:
  stuff:
  - 1
  - 2
  - c
...
# Markdown text starts here!

Text.

Test Run
(ins)(acanaday)-▹ python3 ./yml_topdoc.py ./test.md

=== Yaml doc: ===
{'field': 'value', 'object': {'stuff': [1, 2, 'c']}}

=== Remainder of file: ===
b'# Markdown text starts here!\n\nText.\n'
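One caveat about the seek() in load_first_yaml(): as far as I can tell, the mark's index is counted in decoded characters, not bytes, so the two only line up when the front matter is plain ASCII. A minimal sketch with a made-up sample containing one two-byte UTF-8 character:

```python
import io

import yaml

# Hypothetical front matter with a non-ASCII character ("é" is 2 bytes).
raw = "---\ntitle: café\n...\nBody\n".encode("utf-8")

loader = yaml.SafeLoader(io.BytesIO(raw))
while True:
    event = loader.get_event()
    if isinstance(event, yaml.DocumentEndEvent):
        break

# Position just after the "..." marker, counted in characters.
char_index = loader.get_mark().index

# Convert the character index into the matching byte offset.
byte_offset = len(raw.decode("utf-8")[:char_index].encode("utf-8"))

print(char_index, byte_offset)
print(raw[char_index + 1:])   # seeking by character index lands one byte early
print(raw[byte_offset + 1:])  # seeking by byte offset lands on the body
```

For a binary stream with non-ASCII front matter, converting the index like this before seeking (or opening the file in text mode) would be one way around it.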