UTF-8, a transformation format of ISO 10646

نویسنده

  • Francois Yergeau
چکیده

ISO/IEC 10646-1 defines a multi-octet character set called the Universal Character Set (UCS) which encompasses most of the world’s writing systems. Multi-octet characters, however, are not compatible with many current applications and protocols, and this has led to the development of a few so-called UCS transformation formats (UTF), each with different characteristics. UTF-8, the object of this memo, has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo updates and replaces RFC 2044, in particular addressing the question of versions of the relevant standards.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Alis Technologies UTF - 16 , an encoding of ISO 10646

The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly define a coded character set (CCS), hereafter referred to as Unicode, which encompasses most of the world’s writing systems [WORKSHOP]. UTF-16, the object of this specification, is a way to encode Unicode characters that has the characteristics of encoding the vast majority of currently-defined characters in exactly two octet...

متن کامل

Internet Mail Consortium

The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly define a character set (hereafter referred to as Unicode) which encompasses most of the world’s writing systems. UTF-16, the object of this specification, is an encoding scheme of this character set that has the characteristics of encoding the vast majority of currently-defined characters in exactly two octets and of being ab...

متن کامل

UTF-7 - A Mail-Safe Transformation Format of Unicode

This memo defines an Experimental Protocol for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited. Abstract The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E) jointly define a 16 bit character set (hereafter referred to as Unicode) which encompasses most of the world's writing systems. However, Internet mail (S...

متن کامل

An Upper-Bound on Information Contained Within a Tweet

While tweets (and this paper) are limited to 140 characters, not all characters are created equal. This paper explores abuses of character encoding schemes to maximize the number of bits that can be conveyed by a tweet. In particular, since Twitter supports Unicode, we examine how we can abuse UTF8. For example, while people equate a Unicode codepoint with a character, some can be combined to f...

متن کامل

Using Unicode with MIME

The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E) jointly define a 16 bit character set (hereafter referred to as Unicode) which encompasses most of the world’s writing systems. However, Internet mail (STD 11, RFC 822) currently supports only 7-bit US ASCII as a character set. MIME (RFC 1521 and RFC 1522) extends Internet mail to support different media types and character sets, an...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • RFC

دوره 3629  شماره 

صفحات  -

تاریخ انتشار 1998