lxml
Fast XML and HTML processing library for Python: C-backed bindings to libxml2 and libxslt that provide XML/HTML parsing, XPath 1.0, XSLT 1.0, and validation (DTD, XML Schema, RelaxNG) through the lxml.etree API. Features: etree.parse()/etree.fromstring() for XML; html.parse()/html.fromstring() for HTML; XPath expressions via .xpath(); XSLT transformations; XML Schema and RelaxNG validation; an ElementTree-compatible API; incremental parsing (iterparse for large files); the objectify API for attribute-style XML access; ElementMaker for programmatic XML building; etree.cleanup_namespaces(); and serialization with tostring(..., pretty_print=True).
Score Breakdown
⚙ Agent Friendliness
🔒 Security
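XML parsing library. XXE (XML External Entity) attacks: for untrusted XML, build the parser explicitly with etree.XMLParser(resolve_entities=False, no_network=True) so entity references are not expanded and no network lookups are made. Billion laughs (entity expansion) attack: keep huge_tree=False, which is the default and leaves libxml2's built-in size and depth limits in place; resolve_entities=False also prevents the expansion. XSLT from an untrusted source: a stylesheet can read and write files, access the network, and call any extension functions the application has registered; etree.XSLTAccessControl can restrict file and network access, but the safe rule is to never apply untrusted XSLT.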
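The parser settings above can be sketched as a small helper. This is a minimal illustration, not a complete hardening guide: the helper names make_untrusted_parser and parse_untrusted and the sample payload are hypothetical, and the exact behaviour of entity handling can vary with the installed lxml and libxml2 versions.

```python
from lxml import etree


def make_untrusted_parser():
    """Return a parser configured conservatively for untrusted XML (hypothetical helper)."""
    return etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded (XXE mitigation)
        no_network=True,         # forbid network access while parsing
        huge_tree=False,         # keep libxml2's built-in size/depth limits
        load_dtd=False,          # do not load external DTDs
    )


def parse_untrusted(data: bytes):
    """Parse bytes with a fresh hardened parser and return the root element."""
    return etree.fromstring(data, parser=make_untrusted_parser())


# Sample payload that tries to pull a local file in through an external entity.
payload = b"""<?xml version="1.0"?>
<!DOCTYPE r [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
<r>&xxe;</r>"""

root = parse_untrusted(payload)
serialized = etree.tostring(root).decode()
```

With entity resolution disabled, the serialized result should not contain the contents of the referenced file.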
⚡ Reliability
Best When
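Fast XML/HTML processing with XPath and XSLT: lxml is typically faster than stdlib ElementTree (the often-quoted 5-50x range depends heavily on workload and Python version) and provides full XPath 1.0, XSLT 1.0, and schema validation, whereas the stdlib supports only a limited XPath subset and has no XSLT or schema validation.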
Avoid When
Simple XML without XPath (use stdlib xml.etree.ElementTree), JSON data, or environments where the C extension cannot be installed or compiled.
Use Cases
- • Agent XML parsing: from lxml import etree; tree = etree.parse('data.xml'); root = tree.getroot(); items = root.xpath('//item[@status="active"]/name/text()'). The agent parses XML and extracts data via XPath; lxml implements full XPath 1.0, which is more powerful than ElementTree's path subset; a node-set query returns a list of matched nodes or strings
- • Agent HTML scraping: from lxml import html; tree = html.fromstring(html_content); links = tree.xpath('//a[@href]/@href'); titles = tree.cssselect('.product h2'). The agent parses HTML with lxml for fast XPath-based extraction; html.fromstring() tolerates malformed HTML; .cssselect() adds CSS selector support but requires the separate cssselect package to be installed
- • Agent large XML streaming: context = etree.iterparse('large.xml', events=('end',), tag='Record'); for event, elem in context: process(elem); elem.clear(). The agent processes multi-GB XML files without loading them fully into memory; iterparse is event-driven, and elem.clear() releases each record's content once it has been processed
- • Agent XML validation: schema = etree.XMLSchema(etree.parse('schema.xsd')); doc = etree.parse('data.xml'); if not schema.validate(doc): errors = schema.error_log; handle_errors(errors). The agent validates XML against an XSD schema; error_log provides detailed validation errors with line numbers
- • Agent XML generation: root = etree.Element('root'); child = etree.SubElement(root, 'item', id='1'); child.text = 'content'; xml_bytes = etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='UTF-8'). The agent builds XML documents programmatically and serializes them to bytes
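The use cases above can be combined into one runnable sketch that works on in-memory data instead of files. The sample XML, the XSD, and the helper name stream_names are assumptions made for illustration; only the lxml calls themselves come from the list above, plus the common sibling-pruning idiom for iterparse.

```python
import io
from lxml import etree

XML = b"""<catalog>
  <item status="active"><name>alpha</name></item>
  <item status="retired"><name>beta</name></item>
  <item status="active"><name>gamma</name></item>
</catalog>"""

# 1. Parse and query with XPath; a node-set query returns a list.
root = etree.fromstring(XML)
active_names = root.xpath('//item[@status="active"]/name/text()')


# 2. Stream the same data with iterparse (hypothetical helper).
def stream_names(source):
    names = []
    for _event, elem in etree.iterparse(source, events=("end",), tag="item"):
        names.append(elem.findtext("name"))
        elem.clear()  # drop the element's children, text, and attributes
        # Remove already-processed siblings so the partial tree stays small.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return names


streamed = stream_names(io.BytesIO(XML))

# 3. Build a document and serialize it to bytes.
out = etree.Element("root")
child = etree.SubElement(out, "item", id="1")
child.text = "content"
xml_bytes = etree.tostring(
    out, pretty_print=True, xml_declaration=True, encoding="UTF-8"
)

# 4. Validate the generated document against a small, made-up XSD.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute name="id" type="xs:integer" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
schema = etree.XMLSchema(etree.fromstring(XSD))
is_valid = schema.validate(out)
errors = [str(entry) for entry in schema.error_log]  # empty when valid
```

For real files, pass a path or an open binary file to etree.parse() and etree.iterparse() in place of the in-memory bytes used here.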
Not For
- • Simple XML with stdlib — for basic XML tasks, stdlib xml.etree.ElementTree is sufficient without the C dependency
- • JSON data — lxml is XML/HTML only; for JSON use stdlib json or orjson
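- • Environments where the C extension fails: lxml ships as a compiled C extension; on platforms without a matching prebuilt wheel (some minimal or musl-based environments such as Alpine can be affected) it must be built from source against libxml2 and libxslt; where that is not possible, fall back to stdlib xml.etree.ElementTree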
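For code that has to run where lxml may be unavailable, a common pattern is to try lxml first and fall back to the stdlib. The sketch below assumes the code only uses the ElementTree-compatible subset of the API (no .xpath(), XSLT, or validation); the BACKEND name is hypothetical.

```python
try:
    from lxml import etree  # preferred when the C extension is installed
    BACKEND = "lxml"
except ImportError:
    import xml.etree.ElementTree as etree  # pure-stdlib fallback
    BACKEND = "stdlib"

# Only ElementTree-compatible calls are used below, so both backends work.
root = etree.fromstring("<root><item id='1'>content</item></root>")
first_item_text = root.findtext("item")
item_ids = [el.get("id") for el in root.iter("item")]
```

Anything lxml-specific, such as root.xpath(...), would need its own guard on BACKEND.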
Interface
Authentication
No auth — XML/HTML parsing library.
Pricing
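lxml is distributed under the BSD 3-Clause license. The GPL mentioned in the project's license notes applies only to a bundled test-runner script, not to the library itself. Free for all use.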
Agent Metadata
Known Gotchas
- ⚠ XPath returns a list, not a single value: root.xpath('//item/text()') returns a list such as ['val1', 'val2']; root.xpath('//item[1]/text()') still returns a list, with one element per match; agent code: index with [0] after checking the list is non-empty, or wrap the expression in string(), e.g. xpath('string(//item[1]/text())'), to get a string directly (empty string when nothing matches); xpath('count(//item)') returns a float
- ⚠ lxml elements have tail text: for <a>text<b/>tail</a>, elem.text is 'text' (before the first child) and elem[0].tail is 'tail' (after child b, but still inside parent a); agent code parsing mixed-content XML: use itertext() to collect all text in document order; for HTML parsed with lxml.html use text_content() (get_text() is a BeautifulSoup method, not lxml); to do it by hand, concatenate elem.text with each child's text and tail
- ⚠ Namespace handling in XPath: for <root xmlns:ns='http://example.com'><ns:item/></root>, xpath('//ns:item') requires a namespace map: root.xpath('//ns:item', namespaces={'ns': 'http://example.com'}); agent code with namespaced XML: always pass a namespace map to xpath(); Clark notation ({uri}tag) is not valid inside XPath expressions, but it works with find(), findall(), and iter(), e.g. root.findall('.//{http://example.com}item')
- ⚠ iterparse memory management: iterparse yields (event, elem) pairs, but every processed element stays attached to its parent, so the partial tree keeps growing unless it is pruned; agent code: after processing elem, call elem.clear(), then remove already-processed siblings with while elem.getprevious() is not None: del elem.getparent()[0]; del elem alone only drops the local name and frees nothing; without pruning, iterparse on a large file still ends up holding the whole tree in memory
- ⚠ from lxml import etree vs import lxml.etree: both work; from lxml import etree is conventional; lxml.html is a separate module: from lxml import html; html.fromstring() uses libxml2's forgiving HTML parser, whereas etree.fromstring() is a strict XML parser and raises XMLSyntaxError on malformed input; agent code: use html.fromstring() for HTML (returns HtmlElement) and etree.fromstring() for strict XML
- ⚠ html.fromstring() vs html.document_fromstring(): html.fromstring() inspects its input, so html.fromstring('<p>text</p>') returns just the <p> HtmlElement rather than a full document, while a complete page yields the <html> root; html.document_fromstring() always returns the full-document <html> HtmlElement, adding <html> and <body> if needed; agent code parsing complete HTML pages: use document_fromstring(); for fragments: use fromstring() and work with the returned element directly
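Several of the gotchas above are easiest to see side by side. The following is a small sketch on in-memory strings; the variable names are arbitrary, and the sample documents are the ones used in the list.

```python
from lxml import etree, html

# XPath result types: node-set -> list, string() -> str, count() -> float.
root = etree.fromstring("<root><item>val1</item><item>val2</item></root>")
texts = root.xpath("//item/text()")
first = root.xpath("string(//item[1])")
count = root.xpath("count(//item)")

# Tail text: 'tail' belongs to child b's .tail, not to a.text.
a = etree.fromstring("<a>text<b/>tail</a>")
head, tail = a.text, a[0].tail
all_text = "".join(a.itertext())  # text and tails in document order

# Namespaces: prefix map for xpath(), Clark notation for findall().
ns_root = etree.fromstring(
    '<root xmlns:ns="http://example.com"><ns:item/></root>'
)
NS = {"ns": "http://example.com"}
via_xpath = ns_root.xpath("//ns:item", namespaces=NS)
via_clark = ns_root.findall(".//{http://example.com}item")

# HTML: fragment element vs. full document.
frag = html.fromstring("<p>text</p>")
doc = html.document_fromstring("<p>text</p>")
doc_text = doc.text_content()
```

Here frag is the p element itself, while doc is the html root that wraps it.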
Alternatives
Scores are editorial opinions as of 2026-03-06.