Character codes — ASCII and Unicode
Storing text as numbers
- A computer stores no letters — only numbers.
- Each character is given a numeric code point by a character set.
- Two character sets matter: ASCII and Unicode.
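The character-to-number mapping above can be seen directly in most languages. A minimal sketch in Python, assuming its built-in `ord` and `chr` functions; the sample string and variable names are illustrative only:

```python
# Map each character to its numeric code point, then back again.
text = "Hi!"
code_points = [ord(ch) for ch in text]            # character -> number
restored = "".join(chr(n) for n in code_points)   # number -> character

print(code_points)  # [72, 105, 33]
print(restored)     # Hi!
```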
ASCII
- ASCII uses 7 bits → $128$ code points: basic Latin letters, digits, punctuation, and control codes.
- Extended ASCII uses 8 bits → $256$ code points; the lower 128 match ASCII, the upper 128 vary by region.
- It works for English, but cannot cover other scripts.
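The limits described above can be illustrated with a short sketch. It assumes Python's standard `latin-1` and `cp1251` codecs as two examples of regional 8-bit code pages; the variable names are illustrative:

```python
# 7 bits give 2**7 code points (0..127); 8 bits give 2**8 (0..255).
ascii_size = 2 ** 7
extended_size = 2 ** 8

# The same upper-half byte means different characters in different
# regional code pages.
raw = bytes([0xE9])
western = raw.decode("latin-1")   # Western European code page
cyrillic = raw.decode("cp1251")   # Cyrillic code page

# A lower-half byte (here 0x41, "A") decodes identically in both.
same_lower = bytes([0x41]).decode("latin-1") == bytes([0x41]).decode("cp1251")

# A character outside the 128 ASCII code points cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")      # "\u00e9" is the letter e with an acute accent
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(ascii_size, extended_size)   # 128 256
print(western, cyrillic)           # two different characters from one byte
print(same_lower, fits_in_ascii)   # True False
```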
How many different code points does 7-bit ASCII have?
7 bits give $2^7 = 128$ code points.
A character set such as ASCII defines:
A character set maps each character to a number (its code point), which is what the computer actually stores.
Unicode
- Unicode is a universal character set covering almost every script, plus symbols and emoji.
- It is stored using an encoding:
  - UTF-8: 1 to 4 bytes per character, and ASCII-compatible (the first 128 match ASCII).
  - UTF-16: 2 or 4 bytes per character.
  - UTF-32: a fixed 4 bytes per character.
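The byte counts listed above can be checked with a small sketch. It assumes Python's built-in `str.encode`; the `-le` codec variants are chosen here so that no byte-order mark is added to the count, and the sample characters are illustrative:

```python
# Bytes needed per character in each Unicode encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji

sizes = {}
for ch in samples:
    sizes[ch] = (
        len(ch.encode("utf-8")),      # 1 to 4 bytes
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )

for ch, (u8, u16, u32) in sizes.items():
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```

Only the emoji, which lies outside the first 65,536 code points, needs 4 bytes in UTF-16.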
Which is true of UTF-8?
UTF-8 is a variable-length Unicode encoding (1–4 bytes); its first 128 code points match ASCII, so plain ASCII text is valid UTF-8.
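ASCII compatibility can be verified in a few lines. A minimal sketch, assuming Python's built-in codecs; the sample text is illustrative:

```python
# Plain ASCII text produces identical bytes under ASCII and UTF-8.
text = "Hello, world"
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

identical = as_ascii == as_utf8

# Bytes written as ASCII can therefore be read back as UTF-8 unchanged.
round_trip = as_ascii.decode("utf-8")

print(identical)    # True
print(round_trip)   # Hello, world
```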
Why Unicode beats ASCII
- It represents far more characters: nearly every script plus symbols and emoji, not just basic English.
- Files are portable (no code-page confusion) and can mix many languages in one document.
- Trade-off: file size depends on the encoding. English-only text is the same size in UTF-8 as in ASCII, but about twice as large in UTF-16 and four times as large in UTF-32.
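How much larger a Unicode file is depends on which encoding is used. The following sketch compares the encoded size of one English sentence, assuming Python's built-in codecs (the `-le` variants omit the byte-order mark); the sentence and variable names are illustrative:

```python
# Encoded size of the same English-only text in four encodings.
text = "The quick brown fox"

size_ascii = len(text.encode("ascii"))
size_utf8 = len(text.encode("utf-8"))
size_utf16 = len(text.encode("utf-16-le"))
size_utf32 = len(text.encode("utf-32-le"))

print(size_ascii, size_utf8, size_utf16, size_utf32)
```

For text made only of ASCII characters, UTF-8 costs nothing extra, while UTF-16 doubles and UTF-32 quadruples the byte count.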
A key advantage of Unicode over ASCII is that it:
Unicode covers nearly every writing system plus symbols and emoji — far beyond ASCII's basic English set.
For English-only text, a UTF-16 or UTF-32 file is larger than the same text in ASCII.
UTF-16 uses at least 2 bytes per character and UTF-32 always uses 4, so plain English text takes more space than in ASCII. UTF-8 stores each ASCII character in 1 byte, so it is the same size. Extra bytes are the trade-off for universal coverage.
You've got it
- text is stored as numeric code points set by a character set
- ASCII = 7 bits (128 code points); extended ASCII = 8 bits (256)
- Unicode covers nearly every script + emoji; UTF-8 is variable-length and ASCII-compatible
- Unicode = more characters + portable + multilingual, but larger for plain English in UTF-16 and UTF-32