Test your fundamental knowledge of Unicode, ASCII, UTF-8, UTF-16, string length measurement, slicing, and the differences between bytes and characters. This quiz covers practical issues and concepts anyone working with string encoding should understand to avoid common mistakes.
Which encoding can only represent basic English letters and symbols using exactly one byte per character?
Explanation: ASCII stands for American Standard Code for Information Interchange and uses exactly one byte per character, supporting basic English letters, digits, and common symbols. UTF-8 and UTF-16 can represent a much wider range of characters and use a variable number of bytes. Unicode refers to the universal character set concept itself, not a specific encoding.
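As a quick illustration, the following Python sketch (the sample strings are arbitrary) shows ASCII using one byte per character and refusing anything outside its repertoire:

```python
text = "Hello!"
encoded = text.encode("ascii")
print(len(text), len(encoded))   # 6 6 -- one byte per character

try:
    "é".encode("ascii")          # 'é' is outside the ASCII repertoire
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err)
```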
In UTF-8, how many bytes can a single character require at most?
Explanation: In UTF-8, a single character can be represented using one to four bytes depending on the code point. Many basic characters only use one byte, but some, especially those outside the Basic Multilingual Plane, require up to four. Options 1 and 2 are insufficient, and while 8 is larger, UTF-8 never uses more than 4 bytes per character.
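A quick Python check makes the 1-to-4-byte range concrete (the sample characters are just illustrative picks from different Unicode ranges):

```python
# Byte count of each character when encoded as UTF-8.
for ch in ["A", "é", "你", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A 1
# é 2
# 你 3
# 😀 4
```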
If the string 'café' is encoded in UTF-8, why might its byte length differ from its character length?
Explanation: In UTF-8, accented characters like 'é' require more bytes than basic ASCII characters since they are not part of the standard ASCII set. UTF-8 is compatible with ASCII for characters 0–127, disproving option 2. Option 1 is wrong as UTF-8 is a variable-length encoding, and option 4 is not true, as length depends on the actual characters.
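For example, in Python the mismatch is easy to observe:

```python
s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes, because 'é' needs 2 bytes in UTF-8
```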
What is the minimum number of bytes required to encode any character in UTF-16?
Explanation: UTF-16 encodes most characters with 2 bytes (16 bits). Some characters, usually those outside the Basic Multilingual Plane, require 4 bytes (two 16-bit code units), but never less than 2. Options 1 and 3 are not correct, and while 4 bytes is possible for certain characters, the minimum is 2.
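A short Python sketch shows both cases; 'utf-16-le' is used here so no byte order mark is added to the byte counts:

```python
print(len("A".encode("utf-16-le")))   # 2 bytes: even ASCII letters take one 16-bit unit
print(len("你".encode("utf-16-le")))  # 2 bytes: a BMP character
print(len("😀".encode("utf-16-le")))  # 4 bytes: a surrogate pair outside the BMP
```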
What does Unicode primarily define in the context of computer text processing?
Explanation: Unicode primarily defines a character set, which is a mapping of unique code points to characters from various languages and scripts. It does not define how these are represented as binary (which is the job of encodings like UTF-8 or UTF-16). Bit rate and text editor are unrelated to Unicode's role.
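The separation between a code point and its encoded bytes can be seen in Python, where ord() gives the Unicode code point and encode() produces encoding-specific bytes (the character here is just an example):

```python
ch = "é"
print(hex(ord(ch)))            # 0xe9 -- the Unicode code point U+00E9
print(ch.encode("utf-8"))      # b'\xc3\xa9' -- UTF-8's byte representation
print(ch.encode("utf-16-le"))  # b'\xe9\x00' -- UTF-16 (little-endian) representation
```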
Which of the following statements is true when comparing ASCII and Unicode?
Explanation: Unicode was developed to support a vast array of languages and symbols worldwide, while ASCII is limited to English letters and basic symbols. ASCII cannot represent emojis, disproving option 1. ASCII uses fixed-length encoding, and Unicode’s size depends on encoding choice, so option 4 is incorrect.
Why is UTF-8 backward compatible with ASCII?
Explanation: In UTF-8, ASCII characters are represented by the same one-byte values as in basic ASCII encoding, allowing files and systems using ASCII to be compatible with UTF-8. UTF-8, however, encodes far more than ASCII. Option 1 is misleading, and options 3 and 4 reverse the relationship.
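One way to see this in Python: ASCII-only text produces identical bytes under both encodings, so either decoder accepts it:

```python
text = "plain ASCII text"
print(text.encode("ascii") == text.encode("utf-8"))  # True: same one-byte values
print(text.encode("ascii").decode("utf-8"))          # decodes cleanly as UTF-8
```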
What potential issue may arise when slicing a UTF-8 encoded byte string in the middle of a multi-byte character?
Explanation: Slicing in the middle of a multi-byte character can break the sequence, resulting in invalid or corrupted text that cannot be properly decoded. The operation does not automatically convert encodings or remove whitespace. It does not always work fine, as the encoding structure may be disrupted.
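Here is a sketch of that failure mode in Python, cutting the two-byte sequence for 'é' in half:

```python
data = "café".encode("utf-8")  # b'caf\xc3\xa9' (5 bytes)
broken = data[:4]              # slice falls between the two bytes of 'é'

try:
    print(broken.decode("utf-8"))
except UnicodeDecodeError as err:
    print("invalid UTF-8 sequence:", err)
```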
In text encoding, what is the difference between a byte string and a character string?
Explanation: A byte string is a sequence of raw bytes which may represent encoded text, while a character string directly represents human-readable characters. Option 1 is incorrect, as confusion often occurs between the two. Options 3 and 4 are unrelated to the distinction.
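Python makes the distinction explicit with its separate bytes and str types (the sample data is arbitrary):

```python
raw = b"hi \xe2\x9c\x93"      # bytes: raw encoded data ("hi ✓" in UTF-8)
text = raw.decode("utf-8")    # str: human-readable characters
print(type(raw), len(raw))    # <class 'bytes'> 6
print(type(text), len(text))  # <class 'str'> 4
```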
If '你好' is encoded in UTF-8, how many characters and bytes does it contain?
Explanation: '你好' consists of two Chinese characters, but each takes 3 bytes in UTF-8, totaling 6 bytes. It is not 2 bytes (each character takes 3 bytes, not 1), there are not 6 characters (just 2), and '6 characters, 6 bytes' is likewise inaccurate.
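Checking this directly in Python:

```python
s = "你好"
print(len(s))                  # 2 characters
print(len(s.encode("utf-8")))  # 6 bytes: 3 bytes per character
```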
What is the range of values for standard ASCII characters?
Explanation: Standard ASCII uses 7 bits to represent characters, resulting in codes from 0 to 127. 0–255 refers to extended byte values, and neither a range ending at 100 nor the 128–255 range matches standard ASCII. 0–127 is the correct answer.
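A minimal Python check of the boundary: byte values up to 127 decode as ASCII, anything above does not:

```python
print(bytes([65]).decode("ascii"))         # 'A' -- 65 is within 0-127
print(repr(bytes([127]).decode("ascii")))  # '\x7f' -- 127 is the highest ASCII code

try:
    bytes([200]).decode("ascii")           # 200 falls outside the 7-bit range
except UnicodeDecodeError as err:
    print("not ASCII:", err)
```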
Why are surrogate pairs used in UTF-16 encoding?
Explanation: UTF-16 uses surrogate pairs—a pair of 16-bit code units—to encode characters whose code points don't fit into a single 16-bit value. ASCII characters do not use surrogate pairs, and the technique does not serve compression or correction. Shortening length or removing invalid bytes is not their purpose.
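For example, the emoji '😀' (U+1F600, outside the BMP) needs two 16-bit code units in UTF-16, which a quick Python check confirms:

```python
ch = "😀"               # code point U+1F600, above U+FFFF
units = ch.encode("utf-16-le")
print(len(units) // 2)  # 2 code units -- a surrogate pair
print(units.hex())      # '3dd800de': high surrogate D83D, low surrogate DE00 (little-endian)
```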
What is the safest way to slice a string containing multi-byte characters?
Explanation: Slicing at character boundaries ensures that no character is broken in the middle, preserving string integrity. Random byte position or slicing by fixed byte length risks breaking multi-byte encodings. Slicing by word count is not relevant to encoding boundaries.
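A sketch of the safe approach in Python: decode first, slice the character string, and re-encode only if bytes are needed (the sample text is arbitrary):

```python
data = "naïve café".encode("utf-8")

text = data.decode("utf-8")        # work with characters, not raw bytes
first_five = text[:5]              # always lands on a character boundary
print(first_five)                  # 'naïve'
print(first_five.encode("utf-8"))  # re-encode after slicing if bytes are required
```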
What is the role of the Byte Order Mark (BOM) in UTF-16 encoded text?
Explanation: The BOM (Byte Order Mark) is used in UTF-16 to indicate whether the bytes are stored in big-endian or little-endian order. It does not identify language, signal end of file, or itself convert bytes to characters. Its primary purpose is decoding correctness across systems.
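In Python, for instance, the codecs module exposes both BOM variants, and the generic 'utf-16' codec prepends one automatically:

```python
import codecs

print(codecs.BOM_UTF16_LE)      # b'\xff\xfe' marks little-endian data
print(codecs.BOM_UTF16_BE)      # b'\xfe\xff' marks big-endian data

encoded = "A".encode("utf-16")  # the generic codec writes a BOM first
print(encoded.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
```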
What typically happens if you decode UTF-8 bytes as if they were ASCII?
Explanation: ASCII characters in UTF-8 occupy the same byte values, so they display properly; however, non-ASCII characters will render incorrectly or as gibberish. Not all text displays normally, and decoding does not convert to UTF-16 or affect file size directly.
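A Python sketch of the mismatch; errors='replace' is used here because a strict ASCII decoder would simply raise, and latin-1 shows the classic one-byte-per-character gibberish:

```python
data = "café".encode("utf-8")                  # b'caf\xc3\xa9'

print(data.decode("ascii", errors="replace"))  # 'caf��' -- each invalid byte becomes U+FFFD
print(data.decode("latin-1"))                  # 'cafÃ©' -- mojibake from a one-byte decoder
```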
If you want to count the number of characters in a UTF-8 encoded string, what should you do?
Explanation: To accurately determine the number of characters (not bytes), you must first decode the UTF-8 byte sequence into a character string, then count the characters. Counting raw bytes can be misleading, and dividing bytes by two or counting vowels does not provide the total character count.
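In Python, that means decoding the bytes to a character string before calling len() (the sample string is arbitrary):

```python
data = "Grüße 🌍".encode("utf-8")

print(len(data))                  # 12 bytes -- not the character count
print(len(data.decode("utf-8")))  # 7 characters
```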