In this post, we’ll take a look at a problem that’s easy for humans but surprisingly difficult for XML parsers, namely deciding whether a given XML node represents a list.
Why is this difficult? Glad you asked!
Lists in XML are generally represented as an XML node that contains one or more child elements with the same name. For example, the following XML snippet represents a list of books:
<books>
  <book>The Silmarillion</book>
  <book>The Hobbit</book>
  <book>The Lord of the Rings</book>
</books>
<books> (the ‘root node’) denotes a list containing nodes of type <book>. That’s pretty straightforward, and indeed most XML parsers¹ recognize it as such:
>>> from xmltodict import parse
>>> s = """<books>
... <book>The Silmarillion</book>
... <book>The Hobbit</book>
... <book>The Lord of the Rings</book>
... </books>"""
>>> parse(s)
{'books': {'book': ['The Silmarillion', 'The Hobbit', 'The Lord of the Rings']}}
However, where most (if not all) XML parsers “fail” is when a list contains only a single element:
>>> s = """<books><book>The Silmarillion</book></books>"""
>>> parse(s)
{'books': {'book': 'The Silmarillion'}}
We humans know that <books> represents a list of <book>s, not because we’re perfect XML parsers but because of our knowledge of English, where a noun ending in -s marks a plural, i.e. “more than one”.
The rest of this post describes how XML elements that contain lists can be mapped to Python lists with a little help from Pydantic².
There is, however, an important caveat to this: We are not going to come up with a solution that can parse arbitrary XML but rather we’ll be defining the structure of our XML data beforehand and therefore “tell” our parser which nodes are lists in advance.
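To make the idea of “telling the parser in advance” concrete before bringing in Pydantic, here is a minimal sketch that uses only the standard library. The names LIST_TAGS and to_python are hypothetical and exist only for this illustration; the point is that a tag declared as a list yields a list no matter how many children it has.

```python
from xml.etree.ElementTree import Element, fromstring

# Hypothetical schema knowledge: tags whose children always form a list.
LIST_TAGS = {"books"}


def to_python(node: Element):
    """Convert an element to a str, dict, or list, using LIST_TAGS as a hint."""
    children = list(node)
    if node.tag in LIST_TAGS:
        # Declared as a list: one child or many, the result is a list.
        return [to_python(child) for child in children]
    if not children:
        # Leaf element: return its text content.
        return node.text
    # Anything else becomes a mapping from child tag to converted child.
    return {child.tag: to_python(child) for child in children}


single = fromstring("<books><book>The Silmarillion</book></books>")
several = fromstring("<books><book>The Hobbit</book><book>The Silmarillion</book></books>")

print(to_python(single))   # ['The Silmarillion']
print(to_python(several))  # ['The Hobbit', 'The Silmarillion']
```

The single-element case no longer collapses into a bare string, because the decision is made from the declared structure and not from the number of children found.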
The Bookstore Manager
Imagine you’re working for a company that manages libraries and bookstores. The company requires all of its members to keep us up-to-date on their catalogues by regularly sending us lists of their books and, in order to make matters worse, we’re going to assume that they do so using XML:
<bookstore>
  <name>Bilbo's Bag o' Books</name>
  <books>
    <book><title>The Silmarillion</title></book>
    <book><title>The Hobbit</title></book>
    <book><title>The Lord of the Rings</title></book>
  </books>
</bookstore>
It goes without saying that working with XML directly is beneath us! Instead, we’re going to come up with a little Python script that takes the XML and converts it to Pydantic models that we can use to put our data into some sort of database later on:
from xml.etree.ElementTree import fromstring
from pydantic import BaseModel
from pydantic.utils import GetterDict
from typing import List, Any


class BookGetter(GetterDict):
    def get(self, key: str, default: Any) -> Any:
        return self._obj.find(key).text


class BookStoreBase(BaseModel):
    class Config:
        orm_mode = True
        getter_dict = BookGetter


class Book(BookStoreBase):
    title: str


class BookStore(BookStoreBase):
    name: str
    books: List[Book]
Ok, now let’s try it on our small XML snippet:
s1 = """
<bookstore>
<name>Bilbo's Bag o' Books</name>
<books>
<book><title>The Silmarillion</title></book>
<book><title>The Hobbit</title></book>
<book><title>The Lord of the Rings</title></book>
</books>
</bookstore>
"""
xml = fromstring(s1)
bookstore = BookStore.from_orm(xml)
print(bookstore)
Traceback (most recent call last):
  File "xml_lists/main.py", line 41, in <module>
    bookstore = BookStore.from_orm(xml)
  File "pydantic/main.py", line 579, in pydantic.main.BaseModel.from_orm
pydantic.error_wrappers.ValidationError: 1 validation error for BookStore
books
  value is not a valid list (type=type_error.list)
Uh-oh, that doesn’t look good! What went wrong here? Apparently, Pydantic doesn’t know how to handle XML lists yet, and why would it? Our BookGetter.get(..) method clearly states that elements are to be returned as text, not lists.
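To see why the first attempt hands Pydantic something that is not a list, it helps to look at what ElementTree returns for the two kinds of lookup. The snippet below is a small standalone sketch using only the standard library and a shortened copy of the bookstore XML; the variable names are chosen for this illustration.

```python
from xml.etree.ElementTree import fromstring

store = fromstring("""
<bookstore>
  <name>Bilbo's Bag o' Books</name>
  <books>
    <book><title>The Silmarillion</title></book>
    <book><title>The Hobbit</title></book>
  </books>
</bookstore>
""")

# find() returns the first matching child; .text holds only the text that
# sits directly inside that element, before its first child element.
name_text = store.find("name").text
books_text = store.find("books").text

print(repr(name_text))   # the store name
print(repr(books_text))  # nothing but whitespace: a string, not a list

# findall() always returns a list of elements, even for a single match.
books = store.findall("./books/book")
titles = [book.find("title").text for book in books]
print(titles)
```

For <books>, the text attribute is just the whitespace between the opening tag and the first <book>, which is a plain string and therefore fails list validation.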
Alright, let’s fix this by telling our method to put all children of a <books> node in a list:
class BookGetter(GetterDict):
    def get(self, key: str, default: Any) -> Any:
        if key == "books":
            return self._obj.findall('.//book')
        return self._obj.find(key).text
Try again:
xml = fromstring(s1)
bookstore = BookStore.from_orm(xml)
print(bookstore)
name="Bilbo's Bag o' Books" books=[Book(title='The Silmarillion'), Book(title='The Hobbit'), Book(title='The Lord of the Rings')]
Now look at that!
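One thing the getter glosses over: get receives a default argument but never uses it, and Element.find returns None for a tag that isn’t there, so accessing .text on the result would raise an AttributeError. Below is a minimal sketch of a more forgiving lookup, using only the standard library. The helper text_or_default is hypothetical and not part of Pydantic or ElementTree.

```python
from typing import Any
from xml.etree.ElementTree import Element, fromstring


def text_or_default(node: Element, key: str, default: Any = None) -> Any:
    """Return the text of the first child named `key`, or `default` if absent."""
    child = node.find(key)
    if child is None:
        # No such child element: fall back to the caller's default.
        return default
    return child.text


book = fromstring("<book><title>The Hobbit</title></book>")

print(text_or_default(book, "title"))            # The Hobbit
print(text_or_default(book, "isbn", "unknown"))  # unknown
```

The same None check could presumably replace the final return line of BookGetter.get, so that a missing element is passed back as default and Pydantic decides how to treat the field. Note that GetterDict, orm_mode and from_orm, as used in this post, belong to the Pydantic v1 API.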
Bonus Points: Nested Lists
Let’s take this a little bit further and require our bookstores and libraries to also provide a list of chapters for each book if possible, like so:
<bookstore>
  <name>The Bookworm</name>
  <books>
    <book>
      <title>Quenta Silmarillion</title>
      <chapters>
        <chapter>Of the Beginning of Days</chapter>
        <chapter>Of Aule and Yavanna</chapter>
        <chapter>Of the Coming of the Elves</chapter>
      </chapters>
    </book>
    <book>
      <title>The Red Dragon</title>
      <chapters>
        <chapter>Chapter 1</chapter>
        <chapter>Chapter 2</chapter>
      </chapters>
    </book>
  </books>
</bookstore>
Although this nesting might look complicated, it’s not a big deal for our Pydantic XML parser once we’ve told it how to handle it:
class BookGetter(GetterDict):
    def get(self, key: str, default: Any) -> Any:
        if key == "books":
            return self._obj.findall('.//book')
        if key == "chapters":
            return [chapter.text for chapter in self._obj.findall('.//chapter')]
        return self._obj.find(key).text
... rest of the code ...
xml = fromstring(s1)  # s1 now holds the nested XML shown above
bookstore = BookStore.from_orm(xml)
print(bookstore)
name='The Bookworm' books=[Book(title='Quenta Silmarillion', chapters=['Of the Beginning of Days', 'Of Aule and Yavanna', 'Of the Coming of the Elves']), Book(title='The Red Dragon', chapters=['Chapter 1', 'Chapter 2'])]
Marvellous!
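The reason each book ends up with only its own chapters is that './/chapter' is evaluated relative to the element it is called on. The following standalone sketch, using only the standard library and a shortened version of the XML above, contrasts a search from the root with a search from each <book>; the variable names are chosen for this illustration.

```python
from xml.etree.ElementTree import fromstring

store = fromstring("""
<bookstore>
  <books>
    <book>
      <title>Quenta Silmarillion</title>
      <chapters>
        <chapter>Of the Beginning of Days</chapter>
        <chapter>Of Aule and Yavanna</chapter>
      </chapters>
    </book>
    <book>
      <title>The Red Dragon</title>
      <chapters>
        <chapter>Chapter 1</chapter>
      </chapters>
    </book>
  </books>
</bookstore>
""")

# Searching from the root finds every chapter in the whole document.
all_chapters = [c.text for c in store.findall(".//chapter")]

# Searching from each <book> finds only that book's chapters.
per_book = {
    book.find("title").text: [c.text for c in book.findall(".//chapter")]
    for book in store.findall(".//book")
}

print(len(all_chapters))  # 3
print(per_book)
```

Because the getter is handed one <book> element at a time when the nested Book models are built, the "chapters" branch behaves like the per-book search here.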