Introduction to XML and LXML

XML stands for eXtensible Markup Language, and it’s a base for defining other markup languages. HTML is the most well known XML, being the basis for all webpages. Python has a standard library, called xml, for working with XML files. It does the job, but today we’ll be talking about the lxml library, which is more feature rich. lxmls’s biggest advantages are is its full ipmlementation of XPath and it’s factory functions for creating XML elements.

Element Tree

The goal of lxml is to work with XML using the ElementTree API. Every XML tag is an element, each element contains a name and potentially attributes, child elements, and text. Element tree attributes work like Python dictionaries, and child elements are in a Python list. This allows for standard Python syntax to work seamlessly with etree elements.

Parsing XML to Elements

lxml can read from files or string objects of XML and parse them into etree elements.

 <?xml version="1.0" encoding="UTF-8"?>
 <store>
    <inventory>
        <apples count="5">Golden Delicious</apples>
        <oranges count="0"/>
    </inventory>
    <employees>
        <employee title="writer">SearingFrost</employee>
    </employees>
 </store>

xml_string = '<?xml version="1.0" encoding="UTF-8"?><store><inventory>...</store>'

store = etree.fromstring(xml_string)
# root is the root element, in this case store

Serializing/Writing Elements to XML

XML travels in string format, so the ability to transform lxml elements into strings is vital. Luckily, it is very simple with the tostring function. tostring just reverses the fromstring function.

print(etree.tostring(store, encoding='utf-8', xml_declaration=True, pretty_print=True))
# b'<?xml version=\'1.0\' encoding=\'utf-8\'?><store><inventory>...</store>'

Searching and Modifying XML objects

The most effective way to search for XML Elements is with XPath. XPath is fully implemented in lxml, and if you know what tags you are searching for XPath is the way to go. Give the XPath expression to the element you wish to search from using that element’s xpath method. Note that the xpath method always returns a list of results, even if no elements or only one element is found.

xml_string = '<?xml version="1.0" encoding="UTF-8"?><store><inventory>...</store>'

store = etree.fromstring(xml_string)

print(store.xpath('/store'))
# [<Element store at 0x112bb51c0>]
print(store.xpath('/store/*'))
# [<Element inventory at 0x112bb5100>, <Element employees at 0x112bb52c0>]

Searching can also be accomplished using iterators to go through elements until you find what you need. iterwalk is a class and part of etree that will take an Element and iterate through the Element’s tree. iterwalk’s iterator returns tuples of events and elements. When elements are encountered, they can be mutated as needed.

# Iterating through the store
# returning start and end events
# We can see SearingFrost is changed to NewName inplace
for event, element in etree.iterwalk(store, events=('start', 'end')):
    print(event, element)
    if element.tag == 'SearingFrost':
        element.tag = 'NewName'
# start <Element store at 0x107240700>
# start <Element inventory at 0x1081d0540>
# start <Element apples at 0x1081d6580>
# end <Element apples at 0x1081d6580>
# start <Element oranges at 0x1081d6600>
# end <Element oranges at 0x1081d6600>
# end <Element inventory at 0x1081d0540>
# start <Element employees at 0x1081c0b00>
# start <Element SearingFrost at 0x1081d05c0>
# end <Element NewName at 0x1081d05c0>
# end <Element employees at 0x1081c0b00>
# end <Element store at 0x107240700>

Creating XML objects

lxml also makes it possible to build your own XML trees from scratch. After you start with a root element the XML tree is built mostly with calls to etree.subelement. The subelement factory takes a parent, tag name, an optional attribute dictionary, and extra keyword arguments if necessary. Let’s recreate the store XML document using subelements.

from lxml import etree

store = etree.Element('store')
inventory = etree.SubElement(store, 'inventory')
# Be mindful when adding assigning variables after
# the .text on SubElement call, as it will return 
# the text and not the Element
etree.SubElement(inventory, 'apples', {'count': '5'}).text = 'Golden Delicious'
etree.SubElement(inventory, 'oranges', {'count': '0'})
employees = etree.SubElement(store, 'employees')
etree.SubElement(employees, 'SearingFrost', {'title': 'writer'})

print(etree.tostring(store, encoding='utf-8', xml_declaration=True, pretty_print=True))
# b'<?xml version=\'1.0\' encoding=\'utf-8\'?><store><inventory>...</store>'

Namespaces

Finally, we’re going to dig into namespaces with a fair amount of detail. Namespaces are an important part of XML, and lxml provides full utility to work with them. Namespaces are prefixes for tag names to avoid tag names clashing. Namespaces use URI’s, which are essentially more generic URL’s, in that URI’s do not need to describe how to locate a resource. Namespaces are defined in the root attributes of XML elements that use that namespace (often the root element of an XML document will contain all the namespaces used). Declarations use a special attribute name xmlns:namespace=”URI/of/namespace”. Remember, the URI doesn’t necessarily have to point to anything real, it just needs to be unique so it doesn’t clash with other namespaces.

<root xmlns:searing="https://www.jamesfheath.com/xmlnamespaces" xmlns:frost="https://www.jamesfheath.com/xmlnamespaces">
    <searing:store>store with the searing namespace</searing:store>
    <frost:store>store with the frost namespace</frost:store>
</root>

lxml uses fully qualified namespaces on etree elements and not just prefixes in order to avoid ambiguity and make code easier to write. When the final XML document is seraialized, it simply translates back into the prefix as expected.

lxml QName	Prefix
{https://www.jamesfheath.com/xmlnamespaces}tagname	searing

lxml stores namespaces in dictionaries in the nsmap attribute.

namespace_xml_string = '<root xmlns:searing="https://www.jamesfheath.com/xmlnamespace1" xmlns:frost="https://www.jamesfheath.com/xmlnamespace2"><searing:store>store with the searing namespace</searing:store><frost:store>store with the frost namespace</frost:store></root>'

namespace_xml_root = etree.fromstring(namespace_xml_string)

print(namespace_xml_root.nsmap)
# {'searing': 'https://www.jamesfheath.com/xmlnamespace1',
#  'frost': 'https://www.jamesfheath.com/xmlnamespace2'}

for e in namespace_xml_root:
    print(e.tag)
    print(e.prefix)
# {https://www.jamesfheath.com/xmlnamespace1}store
# searing
# {https://www.jamesfheath.com/xmlnamespace2}store
# frost

To create XML elements with namespaces, you can add the full qualified string as a tag, though this is a lot of typing.

nsmap = {'searing': 'https://www.jamesfheath.com/xmlnamespace1', 'frost': 'https://www.jamesfheath.com/xmlnamespace2'}

root = etree.Element('root', nsmap=nsmap)
searing_store = etree.SubElement(root, '{https://www.jamesfheath.com/xmlnamespace1}store')
frost_store = etree.SubElement(root, '{https://www.jamesfheath.com/xmlnamespace2}store')

print(etree.tostring(root, pretty_print=True))
# <root xmlns:searing="https://www.jamesfheath.com/xmlnamespace1" xmlns:frost="https://www.jamesfheath.com/xmlnamespace2">
#    <searing:store/>
#    <frost:store/>
# </root>

Using etree.QName constructors, it’s easy to build qualified names using your namespace map.

nsmap = {'searing': 'https://www.jamesfheath.com/xmlnamespace1', 'frost': 'https://www.jamesfheath.com/xmlnamespace2'}

root = etree.Element('root', nsmap=nsmap)
searing_store = etree.SubElement(root, etree.QName(nsmap['searing'], 'store'))
frost_store = etree.SubElement(root, etree.QName(nsmap['frost'], 'store'))

print(etree.tostring(root, pretty_print=True))
# <root xmlns:searing="https://www.jamesfheath.com/xmlnamespace1" xmlns:frost="https://www.jamesfheath.com/xmlnamespace2">
#    <searing:store/>
#    <frost:store/>
# </root>

Conclusion

XML is a complicated topic, and there are a lot of details with stylesheets that need to be accounted for when working with complex elements. lxml will give you the tools to operate on XML, and we’ve gone over most of the tools you will need for 90% of you XML work.