Unicode

R. Alexander Miłowski

milowski@ischool.berkeley.edu

School of Information, UC Berkeley

What is a character?

From Unicode 6.2:

Characters are the abstract representations of the smallest components of written language that have semantic value.

Unicode uses characters, not glyphs!

A character has a code point (a number) as a representation.

Glyphs are representations of characters when they are rendered or displayed.

Fonts are collections of glyphs.
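
A minimal Python sketch of the character / code point relationship, using only the standard library (`ord`, `chr`, `unicodedata.name`). The helper name `describe` is made up for illustration. Nothing here touches glyphs: how the character is drawn is up to the font.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))        # character -> code point and official name
print(describe("\u0142"))   # the 'ł' in Miłowski
print(chr(0x2192))          # code point -> character
```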

Some Examples

AAAA U+0041 Latin capital letter A
→⇒⟹⇨☞⇰➲➽ Arrows (U+2190-U+21FF)
∏∑∉∰∫⨸⨂∞∪∩ Mathematical Operators (U+2200-U+22FF)
Miłowski my name

Arabic Alphabet

ابجدهوزحطيكلمنسعفصقرشتثخذضظغ

ا ب ج د ه و ز ح ط ي ك ل م ن س ع ف ص ق ر ش ت ث خ ذ ض ظ غ

Unicode Code Page Charts

See the reference charts.

Each chart has:

Examples: Greek or Cuneiform Numbers and Punctuation

How do you use Unicode?

Many editors will let you just insert any character directly!

A few methods:

  1. Cut-n-paste from character views (e.g. Mac OS X Character Viewer)
  2. Escaped values in constant strings (e.g. \u2192 for →)
  3. Character references in markup (e.g. &#x2192; for →)
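
The three methods above can be checked against each other in Python. This is a sketch, assuming Python 3 with a UTF-8 source file; `html.unescape` from the standard library stands in for a markup parser resolving character references, and the variable names are illustrative.

```python
import html

arrow = "→"  # 1. character pasted directly into the source

forms = [
    "\u2192",                   # 2. escape by code point
    "\N{RIGHTWARDS ARROW}",     # 2. escape by character name
    html.unescape("&#x2192;"),  # 3. hexadecimal character reference
    html.unescape("&#8594;"),   # 3. decimal character reference
    html.unescape("&rarr;"),    # 3. named HTML entity
]

# Every spelling produces the same single character.
print(all(form == arrow for form in forms))
```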

How do you transport Unicode?

byte != character

1,114,112 code points (U+0000 to U+10FFFF). The first 65,536 form the Basic Multilingual Plane and fit in 16 bits. The full range needs 21 bits, so a fixed-width representation really needs 32 bits ...
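
The numbers above can be verified with a few lines of Python. This is a sketch; the constant names are illustrative, and U+12400 (a cuneiform numeric sign) is picked only as an example of a character outside the Basic Multilingual Plane.

```python
import sys

TOTAL_CODE_POINTS = 0x10FFFF + 1  # U+0000 through U+10FFFF
BMP_SIZE = 0xFFFF + 1             # U+0000 through U+FFFF

print(TOTAL_CODE_POINTS)               # total size of the code space
print(BMP_SIZE)                        # size of the Basic Multilingual Plane
print(sys.maxunicode == 0x10FFFF)      # Python's largest code point
print((0x10FFFF).bit_length())         # bits needed for the largest code point

beyond_bmp = chr(0x12400)              # a character outside the BMP
print(ord(beyond_bmp) > 0xFFFF)        # does not fit in 16 bits
```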

Unicode characters are encoded into a byte sequence:

UTF-8 is the default encoding of the Web.
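
To see that byte != character, the same string can be encoded several ways and the byte counts compared. This is a sketch using Python's built-in codecs; the sample text and variable names are illustrative, and the little-endian variants are chosen so that no byte order mark is added to the counts.

```python
text = "Miłowski →"

# Number of characters (code points) in the string.
print(len(text))

# Number of bytes the same string occupies in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# In UTF-8, ASCII letters take one byte each; 'ł' takes two and '→' takes three.
print("ł".encode("utf-8"), "→".encode("utf-8"))
```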

Many systems get this wrong by default, probably including the operating system of everyone in this class.
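
A sketch of what "getting it wrong" looks like: UTF-8 bytes decoded with a different default. Windows-1252 (`cp1252`) is assumed here purely as an example of a wrong default; the actual default depends on the system and its settings.

```python
name = "Miłowski"
raw = name.encode("utf-8")       # bytes as they would be stored or sent

garbled = raw.decode("cp1252")   # reader wrongly assumes Windows-1252
correct = raw.decode("utf-8")    # reader uses the encoding the writer used

print(garbled)   # the two bytes of 'ł' show up as two unrelated characters
print(correct)
```

Stating the encoding explicitly, for example `open(path, encoding="utf-8")`, avoids depending on whatever the system default happens to be.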