String#encoding

Basic Inspection

Every Ruby string carries an encoding with it. The encoding method tells you which one:

str = "Hello"
str.encoding
# => #<Encoding:UTF-8>

String literals in your source code inherit the script encoding, which defaults to UTF-8 in Ruby 2.0+. A string you create with String.new gets ASCII-8BIT encoding by default:

empty = String.new
empty.encoding
# => #<Encoding:ASCII-8BIT>
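
If the ASCII-8BIT default is not what you want, String.new also accepts an encoding: keyword (available since Ruby 2.3, as far as we know). A short sketch; the variable names are illustrative only:

```ruby
# String.new with no arguments yields an empty binary (ASCII-8BIT) string.
binary_buffer = String.new

# Passing encoding: sets the label up front.
utf8_buffer = String.new(encoding: Encoding::UTF_8)

# Passing a string copies that string's encoding along with its content.
# A "\u..." escape always produces a UTF-8 string.
copied = String.new("\u00E9")

binary_is_binary = binary_buffer.encoding == Encoding::ASCII_8BIT
utf8_is_utf8     = utf8_buffer.encoding == Encoding::UTF_8
copy_is_utf8     = copied.encoding == Encoding::UTF_8
```

This matters mostly when you build up a string incrementally and append text to it later.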

The encoding determines how Ruby interprets the string’s bytes. Two strings can contain exactly the same bytes and still compare as unequal when their encodings differ:

a = "é"
b = "é".force_encoding("ISO-8859-1")

a == b
# => false
a.encoding
# => #<Encoding:UTF-8>
b.encoding
# => #<Encoding:ISO-8859-1>
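
One way to see what is going on is to compare the raw bytes with the character counts. A minimal sketch using only core String methods; the variable names are illustrative and not part of the example above:

```ruby
utf8   = "\u00E9"                               # "é", stored as bytes C3 A9
latin1 = utf8.dup.force_encoding("ISO-8859-1")  # same bytes, different label

same_bytes   = utf8.bytes == latin1.bytes  # true, both are [195, 169]
utf8_chars   = utf8.length                 # 1: the two bytes form one character
latin1_chars = latin1.length               # 2: each byte is its own character
strings_equal = utf8 == latin1             # false: encodings are not comparable
```

The bytes match, but Ruby reads them as one character in UTF-8 and as two characters in ISO-8859-1, so the strings do not compare as equal.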

Transcoding with encode

The encode method returns a new string whose bytes have been converted to a different encoding:

utf8_string = "Résumé"
utf8_string.encoding
# => #<Encoding:UTF-8>

iso_string = utf8_string.encode("ISO-8859-1")
iso_string.encoding
# => #<Encoding:ISO-8859-1>

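encode transcodes: it rewrites the bytes so that the same characters are represented in the target encoding. UTF-8 uses two or more bytes for each non-ASCII character, while ISO-8859-1 uses exactly one byte per character, so the é in "Résumé" is stored as C3 A9 in UTF-8 and as the single byte E9 in ISO-8859-1. The characters stay the same; only the bytes and the encoding label change.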
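
The effect is easy to observe by comparing byte counts before and after transcoding. A small sketch, with illustrative variable names:

```ruby
utf8   = "R\u00E9sum\u00E9"          # "Résumé" as a UTF-8 string
latin1 = utf8.encode("ISO-8859-1")   # new string, bytes converted

utf8_bytes   = utf8.bytesize     # 8: four ASCII bytes plus 2 bytes per é
latin1_bytes = latin1.bytesize   # 6: one byte per character
same_length  = utf8.length == latin1.length   # true, both have 6 characters

# Transcoding back recovers the original UTF-8 string.
round_trip = latin1.encode("UTF-8") == utf8   # true
```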

Handling Invalid or Undefined Characters

When transcoding fails because a character can’t be represented in the target encoding, encode raises an exception by default. You can control this behavior with keyword arguments:

# Replace characters that can't be encoded
str = "R\u00E9sum\u00E9"  # "Résumé" as UTF-8 codepoints
str.encode("ASCII", invalid: :replace, undef: :replace, replace: "?")
# => "R?sum?"

The available options are:

  • invalid: what to do with byte sequences that are invalid in the source encoding (:replace substitutes the replacement string)
  • undef: what to do with characters that have no equivalent in the target encoding (:replace substitutes the replacement string)
  • replace: the replacement string to use (defaults to "?", or to U+FFFD when the target is a Unicode encoding)

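Without these options, a character that has no equivalent in the target encoding raises Encoding::UndefinedConversionError, and a byte sequence that is invalid in the source encoding raises Encoding::InvalidByteSequenceError.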
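
To see which exception belongs to which failure, you can rescue both and inspect the class. The helper below is a hypothetical illustration (its name is not part of Ruby); it returns the error class, or nil when the conversion succeeds:

```ruby
# Hypothetical helper: reports which transcoding error, if any, encode raises.
def conversion_error_for(text, target)
  text.encode(target)
  nil
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError => e
  e.class
end

# "é" exists in UTF-8 but has no US-ASCII equivalent.
undefined_error = conversion_error_for("R\u00E9sum\u00E9", "US-ASCII")

# 0xFF can never appear in valid UTF-8, so the source bytes are invalid.
broken = "abc\xFF".dup.force_encoding("UTF-8")
invalid_error = conversion_error_for(broken, "ISO-8859-1")

# Plain ASCII converts cleanly.
no_error = conversion_error_for("plain", "US-ASCII")
```

Both classes inherit from EncodingError, so a single rescue EncodingError is enough when you do not need to tell them apart.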

Re-labeling with force_encoding

force_encoding does something different. It changes the encoding label without touching the bytes:

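raw = "\xC3\xA9".b             # two bytes with a binary (ASCII-8BIT) label
raw.force_encoding("UTF-8")    # relabel in place; the bytes are untouched
raw.encoding
# => #<Encoding:UTF-8>
raw
# => "é"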

The bytes stay the same. Ruby just starts interpreting them as UTF-8. This is useful when you know the actual encoding of some data but the string carries the wrong label, for example UTF-8 text that was read from a socket in binary mode.

The danger is that force_encoding can create invalid strings:

raw = "\xFF".force_encoding("UTF-8")
raw.valid_encoding?
# => false
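
A pattern that follows from this: when bytes arrive with the wrong label, attach their true encoding with force_encoding first, then convert with encode. The sketch below assumes a single byte 0xE9 that is known to be ISO-8859-1 text; the variable names are illustrative:

```ruby
# One byte, 0xE9, which is "é" in ISO-8859-1. String#b gives it a binary label.
latin1_bytes = "\xE9".b

# Relabeling as UTF-8 leaves an invalid string, because a lone 0xE9
# is not a complete UTF-8 sequence.
mislabeled    = latin1_bytes.dup.force_encoding("UTF-8")
mislabeled_ok = mislabeled.valid_encoding?   # false

# Attaching the true label first lets encode convert the bytes correctly.
text       = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
text_ok    = text.valid_encoding?            # true
text_bytes = text.bytes                      # [195, 169], i.e. C3 A9
```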

Checking Validity

Use valid_encoding? to check whether a string’s bytes are valid for its encoding:

valid = "hello"
valid.valid_encoding?
# => true

invalid = "\xFF".force_encoding("UTF-8")
invalid.valid_encoding?
# => false

This matters when you read data from external sources like files, databases, or network sockets. Always validate before processing.
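
As a sketch of what validating before processing might look like, the hypothetical helper below relabels incoming bytes as UTF-8 and falls back to String#scrub when they turn out to be invalid. The helper name and the choice of "?" as the replacement are assumptions made for illustration; rejecting the input outright may be the better choice in some applications.

```ruby
# Hypothetical helper: treat incoming bytes as UTF-8, repairing invalid input.
def utf8_or_scrubbed(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # String#scrub replaces each invalid byte sequence with the given string.
  text.scrub("?")
end

clean    = utf8_or_scrubbed("caf\xC3\xA9".b)   # valid UTF-8, returned as "café"
repaired = utf8_or_scrubbed("abc\xFF".b)       # invalid byte replaced: "abc?"
```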

Encoding Compatibility

When you try to concatenate strings with incompatible encodings, Ruby raises an error:

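a = "\u00A9".force_encoding("ASCII-8BIT")  # non-ASCII bytes, binary label
b = "héllo"                                # non-ASCII characters, UTF-8

a + b
# raises Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

Ruby considers two strings compatible when they can be combined without ambiguity: they share an encoding, or both encodings are ASCII-compatible and at least one of the strings contains only ASCII characters. ASCII-8BIT (binary data) therefore clashes with a text encoding only when both strings contain non-ASCII bytes. Had b been the ASCII-only "hello", the concatenation above would have succeeded.

Use Encoding.compatible? to check before concatenating. It returns the encoding the result would have, or nil when the strings cannot be combined:

a = "\xC0".force_encoding("ASCII-8BIT")
b = "héllo"

Encoding.compatible?(a, b)
# => nil (not compatible)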
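
The following sketch shows how the outcome depends on the content of the strings rather than on their labels alone. The three sample strings and their names are illustrative:

```ruby
binary     = "\xC0".b      # binary string with a non-ASCII byte
ascii_text = "hello"       # text containing only ASCII characters
accented   = "h\u00E9llo"  # UTF-8 text with a non-ASCII character

# An ASCII-only string can be combined with binary data.
with_ascii = Encoding.compatible?(binary, ascii_text)  # Encoding::ASCII_8BIT
joined     = binary + ascii_text                       # works, result is binary

# Two strings that both contain non-ASCII bytes cannot be combined.
with_accented = Encoding.compatible?(binary, accented)  # nil

# Guarding a concatenation with the check avoids the exception.
safe_result = Encoding.compatible?(binary, accented) ? binary + accented : nil
```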

Common Encoding Constants

Ruby provides named constants for common encodings:

Encoding::UTF_8
Encoding::ASCII_8BIT
Encoding::ISO_8859_1
Encoding::Windows_1252

You can also look up encodings by name:

Encoding.find("UTF-8")
# => #<Encoding:UTF-8>
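
Encoding.find matches names case-insensitively, accepts aliases, and raises ArgumentError for names it does not know. A brief sketch; known_encoding? is a hypothetical helper, not a Ruby method:

```ruby
lowercase = Encoding.find("utf-8")    # same object as Encoding::UTF_8
binary    = Encoding.find("BINARY")   # alias for ASCII-8BIT
cp1252    = Encoding.find("CP1252")   # alias for Windows-1252

# Hypothetical helper: true when Ruby recognizes the encoding name.
def known_encoding?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

known   = known_encoding?("ISO-8859-1")        # true
unknown = known_encoding?("no-such-encoding")  # false
```

Methods such as encode and force_encoding accept either the constant or the name, so "text".encode(Encoding::ISO_8859_1) and "text".encode("ISO-8859-1") are equivalent.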

See Also

  • String#prepend — prepends strings to the beginning of another string
  • String#encode — transcodes a string to a different encoding (bytes are converted)
