
Character Encoding


Last updated 3 years ago

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.


ASCII

ASCII (American Standard Code for Information Interchange) is one of the most common character encoding schemes. It encodes 128 specified characters as 7-bit integers (0 to 127). The specified characters include both printable characters (e.g. digits and letters) and non-printable control characters (e.g. tab and line feed).

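Later on, "extended ASCII" encodings were defined. The term covers a family of encodings (for example ISO 8859-1, also known as Latin-1) rather than a single standard. Each of them extends the original scheme by using 8 bits per character, keeping the 128 ASCII characters and adding up to 128 more, for 256 values in total.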
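
As a minimal sketch of the two paragraphs above, the following Python snippet maps a few characters to their ASCII codes and shows what happens with a character outside the 7-bit range. Latin-1 is used here only as one illustrative 8-bit extension; the sample strings are arbitrary.

```python
# Map characters to their numeric ASCII codes.
text = "A9\t"
codes = [ord(ch) for ch in text]
print(codes)  # [65, 57, 9]

# Every ASCII code fits in 7 bits (0..127).
assert all(code < 128 for code in codes)

# Encoding as ASCII yields exactly one byte per character.
ascii_bytes = text.encode("ascii")
print(len(ascii_bytes))  # 3

# U+00E9 (e with acute accent) is outside ASCII, so the ASCII codec rejects it.
try:
    "\u00e9".encode("ascii")
except UnicodeEncodeError:
    ascii_ok = False
else:
    ascii_ok = True

# Latin-1, one of several 8-bit extensions, stores it in a single byte.
latin1_bytes = "\u00e9".encode("latin-1")
print(ascii_ok, latin1_bytes)  # False b'\xe9'
```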

Unicode

Unicode itself is NOT a character encoding scheme. Unicode is a standard that maps characters to numbers called "code points", which can then be encoded into binary form by different character encoding schemes (such as UTF-8).

A code point is written as a leading "U+" followed by a hexadecimal value from 0000 to 10FFFF. The code points are divided into 17 planes (numbered 0 to 16, or 0x0 to 0x10 in hex), each containing 65,536 (0x10000) code points.

  U+0000 -   U+FFFF
 U+10000 -  U+1FFFF
         .
         .
         .
 U+F0000 -  U+FFFFF
U+100000 - U+10FFFF
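
The plane layout above can be sketched in Python as follows. The helper name `describe` is hypothetical and only for illustration; it relies on the fact that each plane spans 0x10000 code points, so the plane number is the code point shifted right by 16 bits.

```python
def describe(ch):
    """Return the U+ notation and plane number of a single character."""
    cp = ord(ch)
    # Each plane holds 0x10000 code points, so dropping the lowest
    # 16 bits of the code point leaves the plane number.
    return f"U+{cp:04X}", cp >> 16

print(describe("A"))           # ('U+0041', 0)
print(describe("\u20ac"))      # ('U+20AC', 0)   euro sign
print(describe("\U0001F600"))  # ('U+1F600', 1)  an emoji in plane 1

# The highest code point, U+10FFFF, lies in plane 16 (0x10),
# which is why there are 17 planes in total.
print(0x10FFFF >> 16)  # 16
```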

UTF-8

UTF-8 (Unicode Transformation Format – 8-bit) is the most common encoding on the web. It is a variable-length encoding: each Unicode code point is encoded as 1 to 4 bytes, with lower code points taking fewer bytes.
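
To illustrate the variable length, here is a small Python sketch that encodes one sample character for each byte length. The four sample characters are arbitrary choices, and `utf8_length` is a hypothetical helper name.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# One sample per byte length: Latin letter, accented letter, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_length(ch), encoded.hex(" "))

# Prints:
# U+0041 1 41
# U+00E9 2 c3 a9
# U+20AC 3 e2 82 ac
# U+1F600 4 f0 9f 98 80
```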

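Since the first 128 Unicode code points are the ASCII characters, and UTF-8 encodes each of them as a single byte with the same value, any ASCII text is also valid UTF-8 text. The reverse does not hold: UTF-8 text containing non-ASCII characters is not valid ASCII.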
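
This compatibility can be checked with a short Python sketch. The sample strings are arbitrary; the point is only that ASCII bytes decode unchanged as UTF-8, while the opposite direction can fail.

```python
text = "Hello, world!\n"

# For pure ASCII text, both encodings produce identical bytes.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Bytes written as ASCII can be read back as UTF-8 without change.
round_trip = ascii_bytes.decode("utf-8")

# The reverse fails once a non-ASCII character is present.
non_ascii = "caf\u00e9".encode("utf-8")
try:
    non_ascii.decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False

print(same_bytes, round_trip == text, decodes_as_ascii)  # True True False
```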

Wikipedia