ftfy
Fix text encoding issues in Python — automatically repairs mojibake (garbled text produced by decoding bytes with the wrong encoding), normalizes Unicode, and fixes common text problems. ftfy features: fix_text() for general repair, fix_encoding() for encoding-only fixes, fix_and_explain() for diagnostic output, explain_unicode() for per-character analysis, plus lower-level helpers in ftfy.fixes — remove_control_chars() for control-character removal, uncurl_quotes() for smart/curly-quote normalization, fix_surrogates() for surrogate-pair repair, fix_latin_ligatures() for ligature expansion — and Unicode normalization (NFC/NFKC). Handles UTF-8 decoded as Windows-1252, Latin-1 mojibake, and other real-world encoding disasters.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Text-processing library with no network calls. No security concerns when processing untrusted text — fix_text() is purely transformative. Output may contain different characters than the input (that is the point), so if the fixed text feeds security-sensitive checks, validate it after fixing, not before.
⚡ Reliability
Best When
Cleaning real-world text data with encoding problems — ftfy automatically detects and repairs the most common Unicode encoding issues in scraped text, database dumps, and document processing.
Avoid When
Text is already clean UTF-8 (ftfy is safe on clean text but adds overhead), language-level normalization (use spacy), or non-text binary data.
Use Cases
- • Agent text cleaning pipeline — import ftfy; clean = ftfy.fix_text(scraped_text) — fixes mojibake and normalizes Unicode; an agent web-scraping pipeline cleans text before NLP processing; fix_text() handles: 'CafÃ©' → 'Café', ‘smart’ quotes → 'smart' quotes, control characters
- • Agent encoding diagnosis — text, explanation = ftfy.fix_and_explain(garbled_text); print(explanation) — understand what was fixed; an agent debugging encoding issues sees exactly which transformations were applied; the explanation is a list of steps, e.g. [('encode', 'sloppy-windows-1252'), ('decode', 'utf-8')]
- • Agent database text repair — cursor.execute('SELECT content FROM posts'); for row in cursor: fixed = ftfy.fix_text(row['content']); cursor2.execute('UPDATE posts SET content=%s WHERE id=%s', (fixed, row['id'])) — batch repair; agent fixes historical data with encoding issues; fix_text() is safe to call on already-correct text
- • Agent quote normalization — from ftfy import fix_text; normalized = fix_text(user_input, unescape_html=True, uncurl_quotes=True) — normalize smart quotes (the first option was named fix_entities in ftfy 5.x); an agent NLP pipeline converts curly quotes to straight quotes; Word/Office copy-paste introduces curly quotes that break tokenizers
- • Agent character analysis — from ftfy import explain_unicode; explain_unicode('Cafẽ') — prints details of each character; agent debugging unexpected Unicode; shows codepoint, name, category, script for each character; useful for understanding what went wrong with encoding
Not For
- • Language detection — ftfy fixes encoding, not language; for language detection use langdetect
- • Translation — ftfy normalizes encoding, not language; for translation use deep-translator or googletrans
- • Heavy NLP preprocessing — ftfy handles encoding fixes; for full text normalization (stemming, stopwords) use spacy or nltk
Interface
Authentication
No auth — local text processing library.
Pricing
ftfy is MIT licensed. Free for all use.
Agent Metadata
Known Gotchas
- ⚠ fix_text() requires str, not bytes — ftfy.fix_text(b'caf\xc3\xa9') raises an error telling you to decode first; agent code reading bytes from a file must decode before fixing: text = content.decode('utf-8', errors='replace'); if the encoding is truly unknown, content.decode('latin-1') never fails, and ftfy can often repair the resulting mojibake
- ⚠ fix_text() may change text unexpectedly — fix_text() applies heuristics; rare cases where intended text is misidentified as mojibake; agent code processing technical content with unusual Unicode may see unwanted changes; use fix_text(text, fix_encoding=False) to disable encoding fixes while keeping normalization
- ⚠ Not an encoding detector for bytes — ftfy operates on already-decoded str values and uses heuristics to spot text that was decoded with the wrong codec; it cannot inspect raw bytes; agent code uncertain about the source encoding should run chardet or charset-normalizer on the bytes, decode properly, then apply ftfy
- ⚠ explain_unicode() is diagnostic only — ftfy.explain_unicode('...') prints to stdout; no return value; agent code wanting explanation for automated processing should use: text, explanation = ftfy.fix_and_explain(text) — explanation is list of operation tuples
- ⚠ Line-ending behavior differs by version — in ftfy 6.x, fix_text() normalizes \r\n, \r, and Unicode line/paragraph separators to \n by default (fix_line_breaks=True; pass fix_line_breaks=False to preserve originals); on ftfy 5.x an agent NLP pipeline must normalize separately: text = text.replace('\r\n', '\n').replace('\r', '\n')
- ⚠ ftfy 6.x changed the API — 6.x removed fix_text_segment() (use fix_text(), which behaves the same) and renamed the fix_entities option to unescape_html; agent code upgrading from ftfy 5.x must update call sites; check version: import ftfy; ftfy.__version__
Alternatives
Full Evaluation Report
Comprehensive deep-dive: security analysis, reliability audit, agent experience review, cost modeling, competitive positioning, and improvement roadmap for ftfy.
AI-powered analysis · PDF + markdown · Delivered within 30 minutes
Package Brief
Quick verdict, integration guide, cost projections, gotchas with workarounds, and alternatives comparison.
Delivered within 10 minutes
Score Monitoring
Get alerted when this package's AF, security, or reliability scores change significantly. Stay ahead of regressions.
Continuous monitoring
Scores are editorial opinions as of 2026-03-06.