Nokogiri
HTML and XML parsing library for Ruby: the standard Ruby library for parsing, querying, and modifying HTML/XML documents. Nokogiri wraps libxml2 and libgumbo (its HTML5 parser) for fast, standards-compliant parsing. Key APIs: Nokogiri::HTML5(html_string) for HTML5 parsing, Nokogiri::XML(xml_string) for XML, CSS selectors (doc.css('div.agent-card')), XPath (doc.xpath('//agent[@status="active"]')), text extraction (.text), attribute access (.attr('href')), and document modification. Common uses include web scraping agent tools, parsing HTML email content, processing RSS/Atom feeds, extracting agent knowledge from web pages, and handling XML API responses.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
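Nokogiri parses HTML but does not sanitize it on its own; Nokogiri::HTML::DocumentFragment.parse only builds a tree that you would still have to filter yourself. For safe HTML rendering, use Loofah (built on Nokogiri) or the Rails sanitize helper (ActionView::Helpers::SanitizeHelper). Parsing attacker-controlled XML can trigger XXE (XML External Entity) attacks when entity substitution or DTD loading is enabled. Nokogiri's default XML parse options leave entity substitution (NOENT) off and block network access (NONET); keep those defaults, and never turn on NOENT or DTDLOAD, for untrusted agent-scraped XML.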
⚡ Reliability
Best When
You need to parse HTML or XML in Ruby (web scraping agent tools, XML API processing, feed parsing, or HTML email extraction). Nokogiri is the standard, production-ready parser for both.
Avoid When
Your content is JSON, you need JavaScript rendering (use a headless browser), or you only need a simple pattern match on a string (a regex may be enough).
Use Cases
- • Web scraping tool for agent knowledge extraction: Nokogiri::HTML5(Faraday.get(url).body).css('article.content').map(&:text) extracts relevant content from web pages for an agent knowledge base
- • Parse XML API responses in agent integrations — Nokogiri::XML(soap_response).xpath('//AgentData').map { |n| n.text } for agent services consuming legacy XML-based APIs
- • Extract agent-relevant data from HTML email — parse HTML email body with Nokogiri and extract order details, confirmation numbers for agent inbox processing workflows
- • Process RSS/Atom feeds for agent content: Nokogiri::XML(feed_response).css('item').map { |i| {title: i.css('title').text, content: i.css('description').text} } for agent news/content tools (RSS 2.0 shown; Atom feeds use entry elements with summary or content children)
- • Sanitize user-generated HTML before agent processing: Nokogiri parses the HTML into a tree, and a sanitizer built on it (such as Loofah) removes scripts and other dangerous elements from agent input
Not For
- • Simple string extraction — if you just need to extract a regex pattern from HTML, use Ruby regex instead of parsing; Nokogiri is for structured document traversal
- • JSON APIs — Nokogiri parses HTML/XML, not JSON; use Ruby's built-in JSON.parse for JSON agent API responses
- • JavaScript-rendered pages — Nokogiri parses static HTML; use Ferrum or Capybara with headless Chrome for JavaScript-rendered agent target pages
Interface
Authentication
Document parsing library — no auth. Network requests for agent scraping handled by separate HTTP client (Net::HTTP, Faraday, HTTParty).
Pricing
Nokogiri is MIT licensed, maintained by Mike Dalessio. Free for all use.
Agent Metadata
Known Gotchas
- ⚠ CSS vs XPath syntax — Nokogiri supports both; doc.css('div.agent') uses CSS selector; doc.xpath('//div[@class="agent"]') uses XPath; CSS is cleaner for HTML scraping, XPath is more powerful for complex conditions; agent scraping tools should pick one consistently
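- ⚠ HTML parsing is lenient, XML parsing recovers by default: Nokogiri::HTML5(malformed_html) repairs malformed HTML silently; Nokogiri::XML(malformed_xml) also recovers by default and records problems in doc.errors instead of raising, so the resulting tree may not match what you expect; pass a block with config.strict to raise Nokogiri::XML::SyntaxError on malformed input; validate agent XML sources if XML structure is business-critical
- ⚠ Encoding issues with non-UTF-8 content: Nokogiri may misparse agent web content served in an encoding other than UTF-8 (for example Shift_JIS or Windows-1256 pages); the script itself is not the problem, since Japanese or Arabic text encoded as UTF-8 parses fine; determine the source charset (Content-Type header or meta tag) and transcode before parsing, e.g. body.force_encoding('Shift_JIS').encode('UTF-8', invalid: :replace, undef: :replace)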
- ⚠ NodeSet vs Node vs String return types — doc.css('.title') returns NodeSet; doc.css('.title').first returns Node; .text on Node returns string; calling .text on NodeSet concatenates all text without separators; agent scraping code must handle return type correctly
- ⚠ Nested element text includes descendant text — .text on parent element includes all nested child text; doc.css('article').text returns ALL article text including headers, links, captions; use .children.select { |n| n.text? }.map(&:text).join for direct text only
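- ⚠ Native extension compilation: Nokogiri ships precompiled native gems for major platforms (Linux, macOS, Windows), so no compiler is needed there; on uncommon platforms or custom systems, gem install nokogiri falls back to compiling from source, which needs a C toolchain; the libxml2-dev and libxslt-dev system packages are typically needed only when building against system libraries (--use-system-libraries) rather than the bundled copies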
Alternatives
Scores are editorial opinions as of 2026-03-06.