charset-normalizer
Modern character-encoding detection library for Python that detects the encoding of byte sequences using statistical analysis. Features: from_bytes()/from_path()/from_fp() for detection; a CharsetMatches result object exposing the best match and alternatives; an encoding property with the detected encoding name; a per-result confidence score; a legacy normalize() conversion helper (2.x only); a CLI tool (normalizer) for command-line detection; a mypyc-compiled extension for speed; multibyte encoding support (CJK, Arabic, Hebrew); chaos scoring for garbled text; and a chardet-compatible detect() interface. Drop-in replacement for chardet with better accuracy and active maintenance.
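A minimal detection round-trip, assuming charset-normalizer is installed (the sample string is illustrative):

```python
from charset_normalizer import from_bytes

# Bytes whose encoding is unknown to the caller (UTF-8 here, for the demo).
raw = "Déjà vu: naïve café, jalapeño, smörgåsbord.".encode("utf-8")

best = from_bytes(raw).best()  # highest-confidence match, or None
if best is not None:
    text = str(best)           # decode using the detected encoding
    print(best.encoding, round(best.chaos, 3))
```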
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Pure text processing library with no network calls. Processing untrusted bytes: detection is safe (read-only analysis). normalize() writes files — validate output path to prevent path traversal. Decoded text from untrusted sources should be sanitized before use in SQL/HTML/shell contexts.
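One way to guard file-writing conversion steps against traversal, sketched with a hypothetical safe_output_path helper (not part of charset-normalizer):

```python
from pathlib import Path

def safe_output_path(base_dir: str, name: str) -> Path:
    """Resolve name under base_dir, rejecting paths that escape it."""
    base = Path(base_dir).resolve()
    target = (base / name).resolve()
    # A path is inside base_dir only if base appears among its parents.
    if base not in target.parents and target != base:
        raise ValueError(f"path escapes {base}: {name}")
    return target
```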
⚡ Reliability
Best When
Detecting character encoding of text files and API responses with unknown encoding — charset-normalizer provides better accuracy than chardet with active maintenance and faster C extension.
Avoid When
Binary file type detection (use python-magic), guaranteed-accurate encoding detection (add BOM at creation time instead), or real-time streaming text.
Use Cases
- • Agent file encoding detection — from charset_normalizer import from_path; results = from_path('unknown.txt'); best = results.best(); print(best.encoding, best.chaos) — detect encoding; agent reads files with unknown encoding; from_path detects encoding from file bytes; best() returns highest-confidence match; then: with open(path, encoding=best.encoding) as f
- • Agent bytes to string — from charset_normalizer import from_bytes; raw_bytes = some_api_response.content; result = from_bytes(raw_bytes).best(); if result: text = str(result) — bytes decode; agent processes API responses or scraped content with unknown encoding; str(result) decodes using detected encoding; returns None if no encoding detected
- • Agent encoding normalization — from charset_normalizer import from_path; best = from_path('old_file.txt').best(); open('old_file.utf8.txt', 'w', encoding='utf-8').write(str(best)) — convert encoding; agent converts legacy files to UTF-8 by detecting the encoding and rewriting the decoded text; the legacy normalize() helper (which wrote a converted copy to disk) was deprecated in 2.1 and removed in 3.0, so detect-then-rewrite is the portable approach
- • Agent bulk file processing — from charset_normalizer import from_bytes; for file_bytes in batch: result = from_bytes(file_bytes); encoding = result.best().encoding if result.best() else 'utf-8'; text = file_bytes.decode(encoding, errors='replace') — batch encoding detection; agent processes mixed-encoding document collections; fallback to UTF-8 with errors='replace' for safety
- • Agent requests integration — import requests; from charset_normalizer import from_bytes; response = requests.get(url); detected = from_bytes(response.content).best(); text = response.content.decode(detected.encoding if detected else 'utf-8') — web scraping; agent correctly decodes web pages regardless of Content-Type header encoding declaration accuracy
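The fallback pattern in the bullets above can be wrapped in one helper; decode_safely is a hypothetical name for illustration:

```python
from charset_normalizer import from_bytes

def decode_safely(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode bytes of unknown encoding, falling back to UTF-8 with
    replacement characters when detection fails."""
    best = from_bytes(raw).best()
    if best is None:                           # empty/binary/ambiguous input
        return raw.decode(fallback, errors="replace")
    return str(best)                           # decode via detected encoding

print(decode_safely("Ein längerer Text mit Umlauten: ä, ö, ü, ß.".encode("utf-8")))
```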
Not For
- • Binary file detection — charset-normalizer detects text encodings not binary vs text; use python-magic for binary type detection
- • Guaranteed accuracy — encoding detection is probabilistic; short texts and mixed-encoding documents may be misdetected; validate results
- • Real-time streaming — from_bytes requires the full byte sequence up front; for streaming text, use BOM sniffing or the HTTP Content-Type header
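For the streaming case, BOM sniffing can be done with the standard library alone; this helper is illustrative, not charset-normalizer API:

```python
import codecs
from typing import Optional

# UTF-32 BOMs must be checked before UTF-16, since the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(prefix: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    return None  # no BOM: fall back to statistical detection
```

Buffering only the first few bytes of a stream is enough for this check; statistical detection remains the fallback when no BOM is present.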
Interface
Authentication
No auth — local text processing library.
Pricing
charset-normalizer is MIT licensed. Free for all use.
Agent Metadata
Known Gotchas
- ⚠ results.best() can return None — from_bytes(data).best() returns None when no encoding is detected (empty bytes, binary data, or too ambiguous); calling result.encoding on None raises AttributeError, so agent code must guard: result = from_bytes(data).best(); if result is None: handle_failure()
- ⚠ Two interfaces: chardet-compatible and native — the chardet-compatible interface is from charset_normalizer import detect; detect(bytes_data) returns a dict with encoding and confidence keys; the native interface is from charset_normalizer import from_bytes; both work, but the native interface exposes more information, including alternative matches and the chaos score
- ⚠ Small byte sequences detect unreliably — detection on fewer than ~100 bytes is unreliable, and UTF-8 containing only ASCII characters is often reported as ASCII; pass as much of the file as is available; for streaming, buffer the first 4096 bytes before detecting; treat confidence below 0.5 as uncertain
- ⚠ chaos score indicates garbled text — result.chaos is a float from 0.0 to 1.0 measuring the proportion of suspicious characters; chaos above 0.1 suggests a misdetected encoding or garbled data, so agent code should try alternative matches or flag the result for human review; chaos of 0.0 means clean, well-formed text in the detected encoding
- ⚠ normalize() writes a file and is gone in 3.x — the legacy charset_normalizer.normalize(path) utility (2.x only) wrote a converted file to the same directory rather than returning a string, and was removed in charset-normalizer 3.0; agent code needing conversion should read the file → from_bytes() → str(result) → write with the target encoding
- ⚠ requests uses charset-normalizer automatically — since requests 2.26, response.text already decodes via charset-normalizer (or chardet when installed); agent code using requests should use response.text directly; only use charset-normalizer manually when response.content needs custom processing or response.encoding is wrong
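The None-guard and chaos-threshold gotchas above combine naturally into one defensive helper; detect_or_flag is a hypothetical name for illustration:

```python
from charset_normalizer import from_bytes

def detect_or_flag(raw: bytes, chaos_limit: float = 0.1):
    """Return (encoding, text), or (None, None) when detection fails
    or the best match looks garbled."""
    best = from_bytes(raw).best()
    if best is None:                 # best() can return None: guard first
        return None, None
    if best.chaos > chaos_limit:     # high chaos: likely misdetected/garbled
        return None, None
    return best.encoding, str(best)

enc, text = detect_or_flag("Ceci est un texte français parfaitement normal.".encode("utf-8"))
```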
Full Evaluation Report
Detailed scoring breakdown, competitive positioning, security analysis, and improvement recommendations for charset-normalizer.
Scores are editorial opinions as of 2026-03-06.