
String#valid_encoding?

str.valid_encoding? → true or false

What valid_encoding? Checks

String#valid_encoding? returns true if the string’s byte sequence is valid for its currently assigned encoding. It returns false if the bytes form an invalid sequence.

The method checks the string against its current encoding, not against some intended or assumed encoding. It never raises an exception and never modifies the string.

str = "hello"
str.encoding        # => #<Encoding:UTF-8>
str.valid_encoding? # => true
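
The read-only, non-raising behavior can be checked directly. The sketch below freezes a malformed string so that any attempt to modify it would raise FrozenError; the variable names are illustrative only.

```ruby
# "caf" followed by a lone UTF-8 lead byte, frozen so any mutation would raise.
bad = "caf\xC3".dup.force_encoding(Encoding::UTF_8).freeze
bytes_before = bad.bytes

result = bad.valid_encoding?   # no exception, even though the bytes are malformed

puts result                      # false
puts bad.bytes == bytes_before   # true: the bytes are untouched
puts bad.encoding                # UTF-8: the encoding label is untouched
```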

Valid and Invalid Byte Sequences

UTF-8 strings with properly formed characters return true. This includes ASCII text, accented characters, and CJK characters.

"hello".valid_encoding?          # => true
"héllo".valid_encoding?          # => true
"日本語".valid_encoding?         # => true

Invalid byte sequences return false. This happens when a multi-byte character is truncated, or when bytes that are invalid in UTF-8 appear in the string.

# \xc2 starts a 2-byte UTF-8 character but has no continuation byte
"\xc2".force_encoding("UTF-8").valid_encoding?       # => false

# \xff and \xfe can never appear anywhere in well-formed UTF-8
"bad\xff\xfefood".force_encoding("UTF-8").valid_encoding?  # => false

The full sequence "\xc2\xa1" (which represents the character ¡) is valid UTF-8:

"\xc2\xa1".force_encoding("UTF-8").valid_encoding?  # => true

But if you truncate it by removing the continuation byte \xa1, validity fails:

"\xc2".force_encoding("UTF-8").valid_encoding?     # => false

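In practice, truncation usually comes from cutting a string at a byte offset rather than from hand-written escapes. A minimal sketch using String#byteslice, assuming the source file is UTF-8 (Ruby's default); the variable names are illustrative.

```ruby
word = "héllo"   # in UTF-8, "é" occupies two bytes (0xC3 0xA9) at byte offsets 1 and 2

cut_mid_char    = word.byteslice(0, 2)   # "h" plus only the first byte of "é"
cut_on_boundary = word.byteslice(0, 3)   # "h" plus both bytes of "é"

puts cut_mid_char.valid_encoding?      # false: the character was cut in half
puts cut_on_boundary.valid_encoding?   # true: the cut fell on a character boundary
```

byteslice keeps the receiver's encoding, so valid_encoding? is a quick way to confirm that a byte-level cut landed on a character boundary.
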
The ASCII-8BIT Gotcha

Strings tagged with ASCII-8BIT (also called BINARY) always return true, whatever bytes they contain, because every byte sequence is valid in that encoding. This can be misleading when you’re dealing with binary data, or with text whose real encoding you don’t know.

raw = "\xde\xad\xbe\xef".b  # .b returns a copy tagged ASCII-8BIT
raw.valid_encoding?          # => true  (any byte sequence is valid in BINARY)

raw.encoding                 # => #<Encoding:ASCII-8BIT>

A true result from an ASCII-8BIT string does not mean the data is valid UTF-8. It only means the string is tagged BINARY, an encoding in which no byte sequence is invalid. Check encoding alongside valid_encoding? to know what you’re actually dealing with.
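
To ask whether a string's bytes would be valid UTF-8 regardless of its current tag, one option is to re-tag a copy and check that. This is a sketch; valid_utf8_bytes? is a hypothetical helper name, not part of Ruby.

```ruby
# Hypothetical helper: would these bytes be well-formed if read as UTF-8?
# dup keeps the caller's string and its encoding label untouched.
def valid_utf8_bytes?(str)
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

raw  = "\xde\xad\xbe\xef".b   # arbitrary binary data
text = "caf\xC3\xA9".b        # the UTF-8 bytes of "café", tagged BINARY

puts raw.valid_encoding?       # true: valid as BINARY
puts valid_utf8_bytes?(raw)    # false: \xbe cannot start a UTF-8 character
puts valid_utf8_bytes?(text)   # true: well-formed UTF-8 under a BINARY label
```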

How It Differs from encode and scrub

These three methods handle invalid byte sequences differently:

Method            Behavior on invalid bytes
valid_encoding?   Returns false, no exception
encode            Raises Encoding::InvalidByteSequenceError when transcoding to another encoding
scrub             Returns a copy with invalid bytes replaced

# valid_encoding?: no exception
"\xc2".force_encoding("UTF-8").valid_encoding?  # => false

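# encode: raises when transcoding to a different encoding
# (encode("UTF-8") on a string already tagged UTF-8 is a no-op and does not raise)
"\xc2".force_encoding("UTF-8").encode("UTF-16")
# raises Encoding::InvalidByteSequenceError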

# scrub — replaces invalid bytes
"\xc2".force_encoding("UTF-8").scrub  # => "�"

Use valid_encoding? as a diagnostic check before deciding whether to encode or scrub. It tells you the problem exists without throwing you into exception handling.
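
One way to structure that decision is a small helper that checks first and only then chooses between rejecting and repairing. This is a sketch; clean_or_reject and its strict: keyword are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper: return valid strings untouched, otherwise either
# reject (strict mode) or repair with scrub.
def clean_or_reject(str, strict: false)
  return str if str.valid_encoding?

  raise ArgumentError, "invalid byte sequence for #{str.encoding}" if strict

  str.scrub("?")   # returns a copy with each invalid sequence replaced by "?"
end

puts clean_or_reject("ok")        # ok
puts clean_or_reject("a\xffb")    # a?b

begin
  clean_or_reject("a\xffb", strict: true)
rescue ArgumentError => e
  puts e.message                  # invalid byte sequence for UTF-8
end
```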

Practical Examples

Validate input before processing

def ensure_utf8(str)
  unless str.valid_encoding?
    str = str.encode("UTF-8", invalid: :replace, replace: "?")
  end
  str
end

ensure_utf8("Hello")                        # => "Hello"
ensure_utf8("\xc2".force_encoding("UTF-8")) # => "?"

Filter a collection by encoding validity

strings = ["hello", "\xff Invalid".force_encoding("UTF-8"), "日本"]
strings.select(&:valid_encoding?)  # => ["hello", "日本"]
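
If the rejected strings matter too, for logging or later repair, Enumerable#partition splits the collection in a single pass. A short sketch with illustrative variable names:

```ruby
strings = ["hello", "\xff Invalid".dup.force_encoding("UTF-8"), "日本"]

# partition returns two arrays: elements where the block is true, then the rest.
valid, invalid = strings.partition(&:valid_encoding?)

puts valid.size              # 2
puts invalid.size            # 1
p invalid.first.bytes.first  # 255 (the stray \xff byte)
```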

Common Mistakes

Assuming true means UTF-8. A string in ASCII-8BIT encoding returns true no matter what bytes it contains. The string is valid for BINARY, not necessarily for UTF-8. Always check encoding if you need to know which encoding is in use.

Using valid_encoding? after force_encoding without understanding what it checks. force_encoding only changes the encoding label — it does not change the bytes. valid_encoding? then checks whether those bytes are valid for the new label. It cannot detect that you originally intended a different encoding.

valid_utf8_bytes = "\xc2\xa1"
valid_utf8_bytes.force_encoding("ASCII-8BIT").valid_encoding?  # => true
valid_utf8_bytes.force_encoding("UTF-8").valid_encoding?        # => true

Both return true because the bytes are genuinely valid in both encodings. But if those bytes were actually meant to be interpreted as, say, Windows-1252, the UTF-8 label gives you a false sense of correctness.
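
The effect is easy to see by decoding the same two bytes under both labels. The sketch below assumes a UTF-8 source file and uses Ruby's built-in Encoding::Windows_1252 constant; the variable names are illustrative.

```ruby
bytes = "\xc2\xa1".b   # two raw bytes with no text encoding assumed yet

as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_cp1252 = bytes.dup.force_encoding(Encoding::Windows_1252)

puts as_utf8.valid_encoding?     # true
puts as_cp1252.valid_encoding?   # true

puts as_utf8.length                      # 1: one character, "¡"
puts as_cp1252.length                    # 2: two characters
puts as_cp1252.encode(Encoding::UTF_8)   # Â¡
```

Both labels pass the check, yet they describe different text. Only knowledge of where the bytes came from can settle which one is right.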

See Also