String#unicode_normalize

str.unicode_normalize(form=:nfc)

Returns String · Added in v2.2 · Updated May 30, 2026 · String Methods


Ruby strings can represent the same character in multiple ways. The letter “é” exists as a single codepoint (\u00E9) or as “e” followed by a combining acute accent (\u0065\u0301). These look identical but have different byte representations. Without normalization, string comparisons fail even when the content is visually the same.

String#unicode_normalize converts a string to a canonical Unicode representation so you can compare and store strings reliably.

Normalization Forms

Unicode defines four normalization forms. Ruby defaults to :nfc (Canonical Composition):

Form	Name	What it does
`:nfc`	Canonical Composition	Decomposes then recomposes. The recommended default.
`:nfd`	Canonical Decomposition	Splits characters into base + combining marks.
`:nfkc`	Compatibility Composition	Compatibility decomposition, then canonical composition. Folds ligatures, width variants, and similar presentation forms.
`:nfkd`	Compatibility Decomposition	Full decomposition including compatibility forms.

Use :nfkc when you need to collapse visual variants like ligatures (ﬁ → fi) or historical characters.
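
To see how the four forms differ, the following sketch prints the code points each one produces for a single sample string. The sample text and the helper name codepoints_by_form are illustrative assumptions, not part of Ruby's API.

```ruby
# Hypothetical sample: "fiancé" written with the fi ligature (U+FB01)
# and a precomposed é (U+00E9).
sample = "\uFB01anc\u00E9"

# Hypothetical helper: maps each form to the code points it produces,
# rendered as "U+XXXX" strings.
def codepoints_by_form(str)
  %i[nfc nfd nfkc nfkd].map do |form|
    normalized = str.unicode_normalize(form)
    [form, normalized.codepoints.map { |cp| format("U+%04X", cp) }]
  end.to_h
end

codepoints_by_form(sample).each do |form, cps|
  puts "#{form}: #{cps.join(' ')}"
end
```

NFC leaves the sample unchanged (5 code points). NFD splits only the é (6), NFKC expands only the ligature (6), and NFKD does both (7).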

Signatures

str.unicode_normalize(form=:nfc)   # => new string
str.unicode_normalized?(form=:nfc) # => true or false

Both methods accept a symbol naming the normalization form. The default is :nfc because it is widely supported across text editors, databases, and web standards. NFC uses precomposed characters wherever Unicode defines them, which usually gives the most compact canonically equivalent form, and it leaves compatibility characters such as ligatures untouched.

When you are unsure which form to pick, start with NFC. It handles the common case of visually identical strings that differ at the byte level. Ruby's other string methods (==, hash, length) do not normalize for you; they operate on the code points as stored, so normalize explicitly before comparing.

Basic Normalization

Two byte-different strings become equal after normalization:

a = "\u00E9"        # é as single codepoint
b = "\u0065\u0301"  # e + combining acute accent

a == b                              # => false (different bytes)
a.unicode_normalize == b.unicode_normalize  # => true (same after NFC)

unicode_normalize returns the normalized result and never modifies the receiver, so you can call it without worrying about side effects elsewhere in the program. Use the return value, and do not rely on whether it is a distinct object from the receiver.

When two strings need to be compared for visual equality but might differ in their underlying byte representation, normalization is the right first step. After both strings are in the same form, ordinary comparison operators work correctly.
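
The comparison step can be wrapped in a small helper. The sketch below is one way to do it; the name same_text? and its form: keyword are assumptions made for illustration.

```ruby
composed   = "\u00E9"   # é as a single code point
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# Hypothetical helper: compares two strings after bringing both
# to the same normalization form.
def same_text?(a, b, form: :nfc)
  a.unicode_normalize(form) == b.unicode_normalize(form)
end

p composed.bytes                     # => [195, 169]
p decomposed.bytes                   # => [101, 204, 129]
p composed == decomposed             # => false
p same_text?(composed, decomposed)   # => true
```

Normalizing both sides inside the helper avoids the easy mistake of normalizing only one operand.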

Check if already normalized

Use unicode_normalized? to test whether a string is already in a given form:

normalized = "Café".unicode_normalize(:nfc)
normalized.unicode_normalized?(:nfc)   # => true
normalized.unicode_normalized?(:nfkc)  # => true (nothing for NFKC to fold)

ligature = "\uFB01le"                  # "file" written with the ﬁ ligature
ligature.unicode_normalized?(:nfc)     # => true
ligature.unicode_normalized?(:nfkc)    # => false (NFKC expands the ligature)

This lets you skip normalization for strings that are already in the target form, which helps in loops or batch jobs that see the same strings repeatedly. The check still has to scan the string, so what it saves is the allocation of a new string rather than all of the work.

The distinction between NFC and NFKC matters here. A string can be NFC-normalized but still contain compatibility characters that NFKC would collapse, so the check is form-specific and should match the normalization you plan to apply.
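
As a rough sketch of the check-first pattern in batch code, the helper below normalizes only the strings that need it and passes the rest through untouched. The name normalize_all is hypothetical, and whether the check pays off depends on how many inputs are already normalized.

```ruby
# Hypothetical helper: normalizes each string only when it is not
# already in the requested form, reusing the original object otherwise.
def normalize_all(strings, form: :nfc)
  strings.map do |str|
    str.unicode_normalized?(form) ? str : str.unicode_normalize(form)
  end
end

already_nfc = "Caf\u00E9"   # precomposed é
needs_work  = "Cafe\u0301"  # e + combining acute accent

result = normalize_all([already_nfc, needs_work])
p result[0].equal?(already_nfc)  # => true, same object reused
p result[1] == already_nfc       # => true, now composed
```

The form passed to unicode_normalized? is the same one passed to unicode_normalize, which keeps the check and the conversion in agreement.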

Generate URL Slugs with NFKD

Normalization alone does not remove accents. NFKC folds compatibility characters such as ligatures, but it keeps "ö" as a single composed letter. To build ASCII slugs from internationalized titles, decompose with :nfkd so that each accent becomes a separate combining mark, then delete the marks:

title = "Mötley Crüe"
slug = title.unicode_normalize(:nfkd)
         .gsub(/\p{Mn}/, "")          # drop the combining marks left by decomposition
         .downcase
         .gsub(/[^a-z0-9]+/, "-")
         .gsub(/\A-|-\z/, "")
# => "motley-crue"

NFKD splits each umlaut into a base letter plus U+0308 (combining diaeresis), and the \p{Mn} pattern removes the combining marks before the slug transformation runs. The resulting URL uses only basic Latin characters, which is safer for file systems and systems that expect ASCII-only identifiers.

Compare the same title without the decomposition step. The accented letters fall outside a-z, so the cleanup pattern replaces them with hyphens:

"Mötley Crüe".downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
# => "m-tley-cr-e"

Decomposing and stripping the marks before the cleanup step gives predictable, ASCII-safe slugs. Skipping it does not preserve the umlauts: the accented letters are silently dropped, which leaves a mangled slug.

This pattern is common in content management systems and web frameworks that need to turn user-facing titles into machine-friendly identifiers. Normalizing first and then cleaning up the remaining characters produces slugs that are both readable and safe.
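
A reusable version of the slug idea might look like the sketch below. The method name slugify is hypothetical, and the sketch assumes accents should be dropped entirely, which requires decomposing the string and removing the combining marks, since normalization by itself keeps them.

```ruby
# Hypothetical helper: builds an ASCII-only slug from a title.
def slugify(title)
  title.unicode_normalize(:nfkd)   # split accents off, expand ligatures
       .gsub(/\p{Mn}/, "")         # remove the combining marks
       .downcase
       .gsub(/[^a-z0-9]+/, "-")    # collapse everything else to hyphens
       .gsub(/\A-|-\z/, "")        # trim a leading or trailing hyphen
end

p slugify("Mötley Crüe")          # => "motley-crue"
p slugify("  Crème brûlée!  ")    # => "creme-brulee"
p slugify("\uFB01nal draft")      # => "final-draft"
```

Letters that have no Unicode decomposition, such as "ø", are not mapped to ASCII by this approach and end up replaced by a hyphen. Handling them needs a separate transliteration table.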

Comparing strings across sources

When strings come from different sources (a file, a database, and a form submission), normalization makes comparison reliable:

from_file = File.read("name.txt", encoding: "UTF-8").strip
from_db   = "Café"
# These might be byte-different despite looking the same
from_file.unicode_normalize == from_db.unicode_normalize  # => true when both spell "Café"

The normalization step is cheap compared to the cost of incorrect comparisons, so it is worth applying early whenever strings from multiple origins need to match.
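
One way to apply normalization early is to normalize keys once when building a lookup table, and again on each incoming query. The data and the helper names build_index and lookup below are made up for illustration.

```ruby
# Hypothetical data: the stored key uses a precomposed é.
ids_by_name = { "Caf\u00E9" => 42 }

# Hypothetical helper: normalizes every key once, up front.
def build_index(hash, form: :nfc)
  hash.transform_keys { |key| key.unicode_normalize(form) }
end

# Hypothetical helper: normalizes the query the same way before lookup.
def lookup(index, name, form: :nfc)
  index[name.unicode_normalize(form)]
end

index = build_index(ids_by_name)
p ids_by_name["Cafe\u0301"]     # => nil, the raw lookup misses
p lookup(index, "Cafe\u0301")   # => 42
```

Hash lookups compare strings exactly as stored, so both the keys and the queries have to pass through the same form.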

Error Handling

unicode_normalize raises Encoding::CompatibilityError when the receiver's encoding is not a Unicode encoding, for example ISO-8859-1 or binary. The check is on the string's encoding, not on its individual bytes. An unknown form symbol raises ArgumentError. Strings that contain only ASCII characters are already normalized in all four forms, so their content is unaffected.

Frozen strings, including frozen string literals, normalize like any other string because the method never mutates the receiver. Whether the result is a fresh object is an implementation detail. Ruby may hand back the receiver itself in some cases, such as US-ASCII-encoded strings, so do not rely on object identity or on the frozen state of the return value.

Invalid byte sequences are a separate problem from Encoding::CompatibilityError. A UTF-8 string that contains malformed bytes has an acceptable encoding but cannot be normalized reliably, and the call typically fails with an ArgumentError instead. Strings from user input, files, or network data should be checked with valid_encoding? (or repaired with scrub) before normalization is attempted.
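
A defensive wrapper could combine the validity check with a rescue for non-Unicode encodings. This is a minimal sketch: the name safe_normalize and the choice to return nil on failure are assumptions, and real code may prefer to transcode or raise instead.

```ruby
# Hypothetical helper: returns the normalized string, or nil when the
# input has invalid bytes or is not in a Unicode encoding.
def safe_normalize(str, form: :nfc)
  return nil unless str.valid_encoding?

  str.unicode_normalize(form)
rescue Encoding::CompatibilityError
  nil
end

latin1 = "\xE9".dup.force_encoding(Encoding::ISO_8859_1)  # valid, but not Unicode
broken = "\xFF".dup.force_encoding(Encoding::UTF_8)       # malformed UTF-8

p safe_normalize("e\u0301")  # => "é"
p safe_normalize(latin1)     # => nil
p safe_normalize(broken)     # => nil
```

For the Latin-1 case, transcoding with encode("UTF-8") before normalizing is usually the better fix when the source encoding is known.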