String#unicode_normalize
str.unicode_normalize(form=:nfc)
Ruby strings can represent the same character in multiple ways. The letter “é” exists as a single codepoint (\u00E9) or as “e” followed by a combining acute accent (\u0065\u0301). These look identical but have different byte representations. Without normalization, string comparisons fail even when the content is visually the same.
String#unicode_normalize converts a string to a canonical Unicode representation so you can compare and store strings reliably.
Normalization Forms
Unicode defines four normalization forms. Ruby defaults to :nfc (Canonical Composition):
| Form | Name | What it does |
|---|---|---|
| :nfc | Canonical Composition | Decomposes, then recomposes into precomposed characters. The recommended default. |
| :nfd | Canonical Decomposition | Splits characters into base + combining marks. |
| :nfkc | Compatibility Composition | Like NFC, but first replaces compatibility characters (ligatures, fullwidth forms, superscripts) with their plain equivalents. |
| :nfkd | Compatibility Decomposition | Like NFD, but also decomposes compatibility characters. |
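Use :nfkc when you need to collapse compatibility variants such as ligatures (the single codepoint ﬁ, U+FB01, becomes the two letters fi), fullwidth letters (Ａ becomes A), or historical characters like the long s (ſ becomes s).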
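To make the difference between the four forms concrete, the sketch below normalizes one sample string each way and lists the resulting codepoints. The sample string and the variable names are illustrative choices, not part of the API.

```ruby
# Sample: precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01).
sample = "\u00E9\uFB01"

# Normalize with each form and record the codepoints of the result.
results = %i[nfc nfd nfkc nfkd].to_h do |form|
  normalized = sample.unicode_normalize(form)
  [form, normalized.codepoints.map { |cp| format("U+%04X", cp) }]
end

results.each { |form, cps| puts "#{form}: #{cps.join(' ')}" }
# nfc: U+00E9 U+FB01
# nfd: U+0065 U+0301 U+FB01
# nfkc: U+00E9 U+0066 U+0069
# nfkd: U+0065 U+0301 U+0066 U+0069
```

Only the two compatibility forms touch the ligature, and only the two decomposition forms split the accent from its base letter.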
Signatures
str.unicode_normalize(form=:nfc) # => new string
str.unicode_normalized?(form=:nfc) # => true or false
Both methods accept a symbol specifying the normalization form. The default is :nfc.
Basic Normalization
Two byte-different strings become equal after normalization:
a = "\u00E9" # é as single codepoint
b = "\u0065\u0301" # e + combining acute accent
a == b # => false (different bytes)
a.unicode_normalize == b.unicode_normalize # => true (same after NFC)
unicode_normalize does not modify the receiver; it returns the normalized result. Use unicode_normalize! to normalize a string in place.
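A short sketch of the difference between the non-destructive method and its in-place counterpart, unicode_normalize!. The variable names are illustrative.

```ruby
text = +"e\u0301"                    # mutable string: "e" + combining acute accent
copy = text.unicode_normalize        # non-destructive: returns the NFC result

before = [text.length, copy.length]  # => [2, 1]  (the receiver is untouched)

text.unicode_normalize!              # destructive: replaces the receiver's contents
after = text.length                  # => 1
```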
Check If Already Normalized
Use unicode_normalized? to test whether a string is already in a given form:
normalized = "Café".unicode_normalize(:nfc)
normalized.unicode_normalized?(:nfc) # => true
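normalized.unicode_normalized?(:nfkc) # => true (no compatibility characters, so this NFC text is also NFKC)
normalized.unicode_normalized?(:nfd) # => false (é is stored precomposed)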
This is useful before expensive operations or when you want to avoid unnecessary allocations.
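One way to use the check is a small guard that only normalizes when needed. The helper name ensure_normalized is hypothetical, and whether the extra check pays off depends on how often your input is already normalized, since the check itself scans the string.

```ruby
# Hypothetical helper: return the string untouched when it is already in the
# requested form, otherwise return a normalized copy.
def ensure_normalized(str, form = :nfc)
  str.unicode_normalized?(form) ? str : str.unicode_normalize(form)
end

already    = "Caf\u00E9"    # precomposed é, already NFC
needs_work = "Cafe\u0301"   # e + combining acute accent

ensure_normalized(already).equal?(already)   # => true (same object, no copy made)
ensure_normalized(needs_work) == already     # => true (normalized copy)
```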
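Generate URL Slugs with NFKD
Normalization alone does not remove accents: NFKC keeps “ö” as a single precomposed codepoint. To build ASCII-safe slugs from internationalized titles, decompose with :nfkd so that each accent becomes a separate combining mark, then delete the marks:
title = "Mötley Crüe"
slug = title.unicode_normalize(:nfkd)
  .gsub(/\p{Mn}/, "")        # drop combining marks (here, the umlauts)
  .downcase
  .gsub(/[^a-z0-9]+/, "-")   # collapse everything else into hyphens
  .gsub(/\A-|-\z/, "")       # trim leading and trailing hyphens
# => "motley-crue"
Compare the same title without normalization. With the title in its usual precomposed form, the accented letters fall outside a-z and are replaced by hyphens:
"Mötley Crüe".downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
# => "m-tley-cr-e"
Decomposing first separates each base letter from its accent, so removing the marks leaves the plain ASCII letter. NFKD also expands compatibility characters such as ligatures, which keeps slugs predictable.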
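The slug logic can be wrapped in a reusable method, sketched below. It assumes accents should be dropped by decomposing with :nfkd and deleting combining marks (\p{Mn}); slugify is a hypothetical name, not a built-in.

```ruby
# Hypothetical helper: build an ASCII-only slug from a title.
def slugify(title)
  title.unicode_normalize(:nfkd)   # split accents off, expand ligatures
       .gsub(/\p{Mn}/, "")         # delete the combining marks
       .downcase
       .gsub(/[^a-z0-9]+/, "-")    # replace every other run of characters with a hyphen
       .gsub(/\A-|-\z/, "")        # trim leading and trailing hyphens
end

slugify("Mötley Crüe")               # => "motley-crue"
slugify("Crème Brûlée \uFB01nale")   # => "creme-brulee-finale" (\uFB01 is the ﬁ ligature)
```

Characters that have no decomposition, such as ø or ß, are not transliterated by this approach; the final character filter turns them into hyphens.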
Error Handling
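unicode_normalize raises Encoding::CompatibilityError if the string's encoding is not a Unicode encoding, for example ISO-8859-1 or ASCII-8BIT (binary). Convert such strings to UTF-8 with encode first. A string containing invalid byte sequences is a separate problem; check valid_encoding? before normalizing. ASCII-only text in a Unicode or US-ASCII encoding is identical in every normalization form and needs no normalization.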
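The sketch below shows the error for a string in a non-Unicode encoding and one way to recover, by transcoding to UTF-8 before normalizing. The variable names are illustrative.

```ruby
latin1 = "café".encode(Encoding::ISO_8859_1)   # same text, non-Unicode encoding

begin
  latin1.unicode_normalize
rescue Encoding::CompatibilityError => e
  error_class = e.class                        # => Encoding::CompatibilityError
end

# Transcode to UTF-8 first, then normalize.
utf8 = latin1.encode(Encoding::UTF_8).unicode_normalize(:nfc)
utf8 == "caf\u00E9"                            # => true
```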
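Freezing a string does not change normalization behavior. unicode_normalize never modifies its receiver, so it works on frozen strings, while unicode_normalize! raises FrozenError. Do not rely on the result being a distinct object: when nothing needs to change, the method may return the receiver itself, for example for strings whose encoding is US-ASCII.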
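A minimal sketch of how a frozen string behaves with both the non-destructive method and the in-place unicode_normalize!. The variable names are illustrative.

```ruby
frozen = "e\u0301".freeze

result = frozen.unicode_normalize   # fine: the receiver is only read
result == "\u00E9"                  # => true

begin
  frozen.unicode_normalize!         # the in-place variant must modify the receiver
rescue FrozenError => e
  bang_error = e.class              # => FrozenError
end
```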
See Also
- String#encode: Convert strings between character encodings
- String#gsub!: Replace patterns in strings
- String#downcase: Convert strings to lowercase