
007b More Strings: Unicode

Since the first computers, there has been a need to encode text in a form that machines can store and humans can read. One of the first standards was ASCII, which defines only 128 characters, enough for English but little else. As computers spread across the globe, many more languages and characters had to be supported. Work on a standard called Unicode began in the late 1980s, and the non-profit Unicode Consortium was incorporated in 1991 to maintain it. Its purpose is to encode every language in modern use. The standard has room for more than a million code points (1,114,112) and covers not only living languages but many historical alphabets and syllabaries as well.

ASCII is a 7-bit code, and each character is stored in a single byte (eight bits) of memory. Unicode has far more characters than one byte can represent, so its encodings use more than one byte for most characters. This is why many of PHP's Unicode-aware functions are prefixed with "mb", which stands for "multi-byte."
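
To make the "more than one byte" point concrete, here is a small sketch that uses strlen(), which counts bytes, on single characters of increasing code point. The variable names are only illustrative, and the characters are written as escape sequences so the result does not depend on how the file is saved.

```php
<?php
// Each entry is ONE character; strlen() reports how many bytes UTF-8 needs for it.
$samples = [
    "a",          // U+0061, plain ASCII
    "\u{e9}",     // U+00E9, "é"
    "\u{8bed}",   // U+8BED, "语"
    "\u{1f600}",  // U+1F600, an emoji
];

$byteCounts = [];
foreach ($samples as $char) {
    $byteCounts[] = strlen($char);
    echo $char, " uses ", strlen($char), " byte(s)\n";
}
// $byteCounts is [1, 2, 3, 4]
```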

Unicode itself assigns a number, called a code point, to each character; an encoding defines how those numbers are stored as bytes. The most common Unicode encodings are UTF-8, UTF-16, and UTF-32 (UTF stands for Unicode Transformation Format). There are also many encodings besides ASCII and the Unicode ones, and PHP supports a large number of them. For now, we'll focus on Unicode.
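
As a rough illustration of how the same character is stored differently in each encoding, the sketch below converts one character with mb_convert_encoding() and prints the resulting bytes in hexadecimal. It assumes the mbstring extension is enabled; the big-endian variants (UTF-16BE, UTF-32BE) are chosen here only so that no byte order mark is involved.

```php
<?php
$utf8  = "\u{8bed}"; // "语", stored as UTF-8 by the escape syntax
$utf16 = mb_convert_encoding($utf8, "UTF-16BE", "UTF-8");
$utf32 = mb_convert_encoding($utf8, "UTF-32BE", "UTF-8");

// Same character, three different byte sequences
echo bin2hex($utf8), "\n";  // e8afad   (3 bytes)
echo bin2hex($utf16), "\n"; // 8bed     (2 bytes)
echo bin2hex($utf32), "\n"; // 00008bed (4 bytes)
```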

Unicode codepoint escape syntax

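A Unicode character can be written in a double-quoted string (or heredoc) as an escape sequence containing its code point in hexadecimal. PHP stores the character as UTF-8. This syntax is available since PHP 7.0.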

<?php
echo "\u{aa}"; // "ª"
echo "\u{000aa}"; // "ª" (same as above as the leading zeroes are optional)
echo "\u{9999}"; // "香"
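
One detail worth trying for yourself: the escape is only processed in double-quoted strings and heredocs. The short sketch below (variable names are just for illustration) compares the two kinds of quotes and shows the UTF-8 bytes that the escape produces.

```php
<?php
$single = '\u{9999}'; // single quotes: no escape processing, 8 literal characters
$double = "\u{9999}"; // double quotes: one character, "香", stored as UTF-8

echo strlen($single), "\n";  // 8
echo strlen($double), "\n";  // 3
echo bin2hex($double), "\n"; // e9a699
```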

The following are several common multi-byte functions to give an idea of how they work.
This is not a complete list.

mb_ord() and mb_chr()

The mb_ord() function returns the Unicode code point of the first character of a string. The mb_chr() function does the reverse and returns the character for a given code point. The single-byte ord() and chr() functions work on individual bytes, so they give the same results only for ASCII characters; for anything else they see just one byte of a multi-byte character. The same applies to all the multi-byte functions: use the mb_* version whenever a string may contain non-ASCII characters, and pass an encoding that matches how the string is actually stored.

<?php
// ascii: "language theory";
// unicode: "语言处理";

echo ord("l"); // 108
echo mb_ord("l"); // 108
echo ord("语"); // 232 (only the first of the three UTF-8 bytes)
echo mb_ord("语"); // 35821
echo mb_ord("语", "UTF-16"); // 59567 (wrong: the UTF-8 bytes are misread as UTF-16)

echo chr(108); // "l"
echo mb_chr(108); // "l"
echo chr(232); // � (a lone byte, which is not valid UTF-8)
echo mb_chr(35821); // "语"

// Not expected? 59567 came from misreading UTF-8 bytes as UTF-16 above,
// so this returns a UTF-16 string that is not valid UTF-8 when displayed
echo mb_chr(59567, "UTF-16"); // �
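
The UTF-16 results above look wrong because the string was never actually converted to UTF-16. As a sketch of one way to keep the encoding argument and the data in agreement, the example below converts the string first with mb_convert_encoding() and then asks for the code point. It assumes the mbstring extension is enabled; the variable names are illustrative.

```php
<?php
$char = "\u{8bed}"; // "语" as UTF-8

// Convert the bytes first, then tell mb_ord() which encoding they are in
$asUtf16   = mb_convert_encoding($char, "UTF-16BE", "UTF-8");
$codePoint = mb_ord($asUtf16, "UTF-16BE"); // 35821, same as mb_ord($char, "UTF-8")

// Going back: build the character as UTF-8 from the code point
$roundTrip = mb_chr($codePoint, "UTF-8");
var_dump($roundTrip === $char); // bool(true)
```

The code point is the same in every Unicode encoding; only the bytes used to store it differ.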

mb_substr()

The mb_substr() function returns part of a string. Unlike substr(), it counts the start position and length in characters rather than bytes.

<?php
// mb_substr(string, start, length, encoding);
// ascii: "language theory";
// unicode: "语言处理";

echo substr("language theory", 2, 1); // "n"
echo mb_substr("language theory", 2, 1); // "n"

echo substr("语言处理", 2, 1); // �
echo mb_substr("语言处理", 2, 1); // "处"
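
A common use for mb_substr() is shortening text without cutting a character in half. The helper below is a hypothetical example (the name preview() is not part of PHP) and assumes the input is UTF-8 and that this file is saved as UTF-8.

```php
<?php
// Hypothetical helper: keep at most $max characters and append "..." if text was cut
function preview(string $text, int $max): string
{
    if (mb_strlen($text, "UTF-8") <= $max) {
        return $text; // short enough, return unchanged
    }
    return mb_substr($text, 0, $max, "UTF-8") . "...";
}

echo preview("语言处理", 2), "\n";        // 语言...
echo preview("language theory", 8), "\n"; // language...
echo preview("abc", 5), "\n";             // abc
```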

mb_str_split()

The mb_str_split() function (available since PHP 7.4) takes a string and returns an array of its characters.

<?php
// mb_str_split(string, length_of_each_array_item = 1, encoding);
// ascii: "language theory"
// unicode: "语言处理"

print_r(str_split("language theory"));
print_r(mb_str_split("language theory"));
/* Both produce same result:
Array
(
    [0] => l
    [1] => a
    [2] => n
    [3] => g
    [4] => u
    [5] => a
    [6] => g
    [7] => e
    [8] =>  
    [9] => t
    [10] => h
    [11] => e
    [12] => o
    [13] => r
    [14] => y
)
*/

print_r(str_split("语言处理"));
/*
Array
(
    [0] => �
    [1] => �
    [2] => �
    [3] => �
    [4] => �
    [5] => �
    [6] => �
    [7] => �
    [8] => �
    [9] => �
    [10] => �
    [11] => �
)
*/
print_r(mb_str_split("语言处理"));
/*
Array
(
    [0] => 语
    [1] => 言
    [2] => 处
    [3] => 理
)
*/
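
Two further things to try with mb_str_split(): the optional length argument, and using the resulting array to reverse a string safely. This is only a sketch and assumes the file is saved as UTF-8; strrev() is included to show what goes wrong when bytes are reversed instead of characters.

```php
<?php
// Chunks of two characters each
$pairs = mb_str_split("语言处理", 2, "UTF-8");
print_r($pairs); // [0] => 语言, [1] => 处理

// Reverse character by character
$reversed = implode("", array_reverse(mb_str_split("语言处理", 1, "UTF-8")));
echo $reversed, "\n"; // 理处言语

// strrev() reverses bytes, which scrambles multi-byte characters
$broken = strrev("语言处理");
var_dump(mb_check_encoding($broken, "UTF-8")); // bool(false)
```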

mb_strlen()

The mb_strlen() function returns the number of characters in a string, whereas strlen() returns the number of bytes.

<?php
// mb_strlen(string, encoding);
// ascii: "language theory"
// unicode: "语言处理"

echo strlen("language theory"); // 15
echo mb_strlen("language theory"); // 15

echo strlen("语言处理"); // 12
echo mb_strlen("语言处理"); // 4
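
It may help to know that "characters" here means Unicode code points, which is not always what a reader sees as one character. The sketch below compares "é" written as a single code point with the same letter written as "e" followed by a combining accent; the variable names are illustrative.

```php
<?php
$composed   = "\u{e9}";   // "é" as one code point (U+00E9)
$decomposed = "e\u{301}"; // "e" plus combining acute accent (U+0301)

echo strlen($composed), " ", mb_strlen($composed, "UTF-8"), "\n";     // 2 1
echo strlen($decomposed), " ", mb_strlen($decomposed, "UTF-8"), "\n"; // 3 2
```

Both strings usually display the same way, but mb_strlen() reports different lengths because it counts code points.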

mb_strpos()

The mb_strpos() function returns the zero-based position, counted in characters, of the first occurrence of one string within another. If the string is not found, it returns false.

<?php
// mb_strpos(haystack, needle, offset, encoding);
// ascii: "language theory"
// unicode: "语言处理"
echo strpos("language theory", "gua"); // 3
echo mb_strpos("language theory", "gua"); // 3

echo strpos("语言处理", "处"); // 6
echo mb_strpos("语言处理", "处"); // 2
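
Because a match at the start of the string gives position 0, which PHP treats as falsy, the result should be compared with false using ===. The sketch below shows that check and then feeds the position into mb_substr(). It assumes the file is saved as UTF-8; the variable names are illustrative.

```php
<?php
$text = "语言处理";

$first   = mb_strpos($text, "语", 0, "UTF-8"); // 0 (found at the very start)
$missing = mb_strpos($text, "x", 0, "UTF-8");  // false (not found)

// Use a strict comparison so that position 0 is not mistaken for "not found"
if ($first !== false) {
    echo "found at position $first\n";
}

// A character position from mb_strpos() can be passed straight to mb_substr()
$pos  = mb_strpos($text, "处", 0, "UTF-8");     // 2
$tail = mb_substr($text, $pos, null, "UTF-8"); // "处理"
echo $tail, "\n";
```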

Resources

Challenges

Emoji Fun

Find a list online of the code points for emoji. Then see if you can output some emoji using the Unicode codepoint escape syntax.

Exploring mb_* functions

Visit the list of PHP's multibyte string functions. Choose at least three that were not listed on this page and try them out.

Create your own function

Create your own function that accepts a UTF-8 string as input and returns the same word but with the capitalization inverted.