Test your fundamental knowledge of Unicode, ASCII, UTF-8, UTF-16, string length measurement, slicing, and the differences between bytes and characters. This quiz covers practical issues and concepts anyone working with string encoding should understand to avoid common mistakes.
Which encoding can only represent basic English letters and symbols using exactly one byte per character?
Explanation: ASCII stands for American Standard Code for Information Interchange and uses exactly one byte per character, supporting basic English letters, digits, and common symbols. UTF-8 and UTF-16 can represent a much wider range of characters and use a variable number of bytes. Unicode refers to the universal character set concept itself, not a specific encoding.
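As a quick illustration, the following Python sketch (the sample strings are arbitrary) shows ASCII using one byte per character and refusing anything outside its repertoire:

```python
text = "Hello!"
encoded = text.encode("ascii")
print(len(text), len(encoded))   # 6 6 -- one byte per character

try:
    "é".encode("ascii")          # 'é' is outside the ASCII repertoire
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err)
```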
In UTF-8, how many bytes can a single character require at most?
Explanation: In UTF-8, a single character can be represented using one to four bytes depending on the code point. Many basic characters only use one byte, but some, especially those outside the Basic Multilingual Plane, require up to four. Options 1 and 2 are insufficient, and while 8 is larger, UTF-8 never uses more than 4 bytes per character.
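A quick Python check makes the 1-to-4-byte range concrete (the sample characters are just illustrative picks from different Unicode ranges):

```python
# Byte count of each character when encoded as UTF-8.
for ch in ["A", "é", "你", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A 1
# é 2
# 你 3
# 😀 4
```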
If the string 'café' is encoded in UTF-8, why might its byte length differ from its character length?
Explanation: In UTF-8, accented characters like 'é' require more bytes than basic ASCII characters since they are not part of the standard ASCII set. UTF-8 is compatible with ASCII for characters 0–127, disproving option 2. Option 1 is wrong as UTF-8 is a variable-length encoding, and option 4 is not true, as length depends on the actual characters.
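For example, in Python the mismatch is easy to observe:

```python
s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes, because 'é' needs 2 bytes in UTF-8
```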
What is the minimum number of bytes required to encode any character in UTF-16?
Explanation: UTF-16 encodes most characters with 2 bytes (16 bits). Some characters, usually those outside the Basic Multilingual Plane, require 4 bytes (two 16-bit code units), but never less than 2. Options 1 and 3 are not correct, and while 4 bytes is possible for certain characters, the minimum is 2.
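A short Python sketch shows both cases; 'utf-16-le' is used here so no byte order mark is added to the byte counts:

```python
print(len("A".encode("utf-16-le")))   # 2 bytes: even ASCII letters take one 16-bit unit
print(len("你".encode("utf-16-le")))  # 2 bytes: a BMP character
print(len("😀".encode("utf-16-le")))  # 4 bytes: a surrogate pair outside the BMP
```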
What does Unicode primarily define in the context of computer text processing?
Explanation: Unicode primarily defines a character set, which is a mapping of unique code points to characters from various languages and scripts. It does not define how these are represented as binary (which is the job of encodings like UTF-8 or UTF-16). Bit rate and text editor are unrelated to Unicode's role.
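The separation between a code point and its encoded bytes can be seen in Python, where ord() gives the Unicode code point and encode() produces encoding-specific bytes (the character here is just an example):

```python
ch = "é"
print(hex(ord(ch)))            # 0xe9 -- the Unicode code point U+00E9
print(ch.encode("utf-8"))      # b'\xc3\xa9' -- UTF-8's byte representation
print(ch.encode("utf-16-le"))  # b'\xe9\x00' -- UTF-16 (little-endian) representation
```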
Which of the following statements is true when comparing ASCII and Unicode?
Explanation: Unicode was developed to support a vast array of languages and symbols worldwide, while ASCII is limited to English letters and basic symbols. ASCII cannot represent emojis, disproving option 1. ASCII uses fixed-length encoding, and Unicode’s size depends on encoding choice, so option 4 is incorrect.
Why is UTF-8 backward compatible with ASCII?
Explanation: In UTF-8, ASCII characters are represented by the same one-byte values as in basic ASCII encoding, allowing files and systems using ASCII to be compatible with UTF-8. UTF-8, however, encodes far more than ASCII. Option 1 is misleading, and options 3 and 4 reverse the relationship.
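One way to see this in Python: ASCII-only text produces identical bytes under both encodings, so either decoder accepts it:

```python
text = "plain ASCII text"
print(text.encode("ascii") == text.encode("utf-8"))  # True: same one-byte values
print(text.encode("ascii").decode("utf-8"))          # decodes cleanly as UTF-8
```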
What potential issue may arise when slicing a UTF-8 encoded byte string in the middle of a multi-byte character?
Explanation: Slicing in the middle of a multi-byte character can break the sequence, resulting in invalid or corrupted text that cannot be properly decoded. The operation does not automatically convert encodings or remove whitespace. It does not always work fine, as the encoding structure may be disrupted.
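Here is a sketch of that failure mode in Python, cutting the two-byte sequence for 'é' in half:

```python
data = "café".encode("utf-8")  # b'caf\xc3\xa9' (5 bytes)
broken = data[:4]              # slice falls between the two bytes of 'é'

try:
    print(broken.decode("utf-8"))
except UnicodeDecodeError as err:
    print("invalid UTF-8 sequence:", err)
```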
In text encoding, what is the difference between a byte string and a character string?
Explanation: A byte string is a sequence of raw bytes which may represent encoded text, while a character string directly represents human-readable characters. Option 1 is incorrect, as confusion often occurs between the two. Options 3 and 4 are unrelated to the distinction.
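Python makes the distinction explicit with its separate bytes and str types (the sample data is arbitrary):

```python
raw = b"hi \xe2\x9c\x93"      # bytes: raw encoded data ("hi ✓" in UTF-8)
text = raw.decode("utf-8")    # str: human-readable characters
print(type(raw), len(raw))    # <class 'bytes'> 6
print(type(text), len(text))  # <class 'str'> 4
```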
If '你好' is encoded in UTF-8, how many characters and bytes does it contain?
Explanation: '你好' consists of two Chinese characters, but each takes 3 bytes in UTF-8, totaling 6 bytes. It is not 2 bytes (each character takes 3 bytes, not 1), there are not 6 characters (just 2), and '6 characters, 6 bytes' is likewise inaccurate.
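Checking this directly in Python:

```python
s = "你好"
print(len(s))                  # 2 characters
print(len(s.encode("utf-8")))  # 6 bytes: 3 bytes per character
```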
What is the range of values for standard ASCII characters?
Explanation: Standard ASCII uses 7 bits to represent characters, resulting in codes from 0 to 127. 0–255 refers to extended byte values, and neither a range ending at 100 nor the 128–255 range matches standard ASCII. 0–127 is the correct answer.
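A minimal Python check of the boundary: byte values up to 127 decode as ASCII, anything above does not:

```python
print(bytes([65]).decode("ascii"))         # 'A' -- 65 is within 0-127
print(repr(bytes([127]).decode("ascii")))  # '\x7f' -- 127 is the highest ASCII code

try:
    bytes([200]).decode("ascii")           # 200 falls outside the 7-bit range
except UnicodeDecodeError as err:
    print("not ASCII:", err)
```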
Why are surrogate pairs used in UTF-16 encoding?
Explanation: UTF-16 uses surrogate pairs—a pair of 16-bit code units—to encode characters whose code points don't fit into a single 16-bit value. ASCII characters do not use surrogate pairs, and the technique does not serve compression or correction. Shortening length or removing invalid bytes is not their purpose.
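For example, the emoji '😀' (U+1F600, outside the BMP) needs two 16-bit code units in UTF-16, which a quick Python check confirms:

```python
ch = "😀"               # code point U+1F600, above U+FFFF
units = ch.encode("utf-16-le")
print(len(units) // 2)  # 2 code units -- a surrogate pair
print(units.hex())      # '3dd800de': high surrogate D83D, low surrogate DE00 (little-endian)
```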
What is the safest way to slice a string containing multi-byte characters?
Explanation: Slicing at character boundaries ensures that no character is broken in the middle, preserving string integrity. Random byte position or slicing by fixed byte length risks breaking multi-byte encodings. Slicing by word count is not relevant to encoding boundaries.
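A sketch of the safe approach in Python: decode first, slice the character string, and re-encode only if bytes are needed (the sample text is arbitrary):

```python
data = "naïve café".encode("utf-8")

text = data.decode("utf-8")        # work with characters, not raw bytes
first_five = text[:5]              # always lands on a character boundary
print(first_five)                  # 'naïve'
print(first_five.encode("utf-8"))  # re-encode after slicing if bytes are required
```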
What is the role of the Byte Order Mark (BOM) in UTF-16 encoded text?
Explanation: The BOM (Byte Order Mark) is used in UTF-16 to indicate whether the bytes are stored in big-endian or little-endian order. It does not identify language, signal end of file, or itself convert bytes to characters. Its primary purpose is decoding correctness across systems.
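In Python, for instance, the codecs module exposes both BOM variants, and the generic 'utf-16' codec prepends one automatically:

```python
import codecs

print(codecs.BOM_UTF16_LE)      # b'\xff\xfe' marks little-endian data
print(codecs.BOM_UTF16_BE)      # b'\xfe\xff' marks big-endian data

encoded = "A".encode("utf-16")  # the generic codec writes a BOM first
print(encoded.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
```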
What typically happens if you decode UTF-8 bytes as if they were ASCII?
Explanation: ASCII characters in UTF-8 occupy the same byte values, so they display properly; however, non-ASCII characters will render incorrectly or as gibberish. Not all text displays normally, and decoding does not convert to UTF-16 or affect file size directly.
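A Python sketch of the mismatch; errors='replace' is used here because a strict ASCII decoder would simply raise, and latin-1 shows the classic one-byte-per-character gibberish:

```python
data = "café".encode("utf-8")                  # b'caf\xc3\xa9'

print(data.decode("ascii", errors="replace"))  # 'caf��' -- each invalid byte becomes U+FFFD
print(data.decode("latin-1"))                  # 'cafÃ©' -- mojibake from a one-byte decoder
```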
If you want to count the number of characters in a UTF-8 encoded string, what should you do?
Explanation: To accurately determine the number of characters (not bytes), you must first decode the UTF-8 byte sequence into a character string, then count the characters. Counting raw bytes can be misleading, and dividing bytes by two or counting vowels does not provide the total character count.
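In Python, that means decoding the bytes to a character string before calling len() (the sample string is arbitrary):

```python
data = "Grüße 🌍".encode("utf-8")

print(len(data))                  # 12 bytes -- not the character count
print(len(data.decode("utf-8")))  # 7 characters
```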