The Architecture of Kanji: Components, Positions, and Composition Rules

hbaristr · 7 min read

Kanji as a Compression Algorithm

As of Unicode 15.1, the standard encodes 97,680 CJK unified ideographs across the main block and its extensions. The KanjiJump decomposition project reduces the 3,500 most important Japanese kanji to just 281 atomic components -- 200 if you merge positional variants. That is a compression ratio of roughly 12:1, or about 17:1 with variants merged. The Cihai dictionary identifies 675 primitive components across 16,339 characters; a 2009 Chinese national standard narrows the inventory to 514 for common use.
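The ratios follow directly from the counts quoted above. A minimal sketch of the arithmetic (the dictionary keys are just labels chosen for this example):

```python
# (characters covered, atomic components), as quoted in the paragraph above.
inventories = {
    "KanjiJump (all variants)": (3500, 281),
    "KanjiJump (variants merged)": (3500, 200),
    "Cihai": (16339, 675),
}


def ratio(characters: int, components: int) -> float:
    """Characters described per atomic component."""
    return characters / components


for name, (chars, comps) in inventories.items():
    print(f"{name}: {ratio(chars, comps):.1f}:1")
```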

This is not metaphor. Kanji are a combinatorial writing system with a formal grammar, positional constraints, and phonetic encoding -- properties that Unicode has literally codified into twelve composition operators.

The Twelve Composition Operators: Unicode's Kanji Grammar

Unicode's Ideographic Description Characters (IDCs) block spans U+2FF0--U+2FFF; the original twelve operators occupy U+2FF0--U+2FFB. They are prefix operators that describe how components combine into characters, and together they form a context-free grammar for glyph structure:

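| Symbol | Code Point | Name | Example | Decomposition |
|---|---|---|---|---|
| ⿰ | U+2FF0 | Left to right | 相 | ⿰木目 |
| ⿱ | U+2FF1 | Above to below | 杏 | ⿱木口 |
| ⿲ | U+2FF2 | Left-middle-right | 衍 | ⿲彳氵亍 |
| ⿳ | U+2FF3 | Above-middle-below | 京 | ⿳亠口小 |
| ⿴ | U+2FF4 | Full surround | 回 | ⿴囗口 |
| ⿵ | U+2FF5 | Surround from above | 凰 | ⿵几皇 |
| ⿶ | U+2FF6 | Surround from below | 凶 | ⿶凵㐅 |
| ⿷ | U+2FF7 | Surround from left | 匠 | ⿷匚斤 |
| ⿸ | U+2FF8 | Above-left surround | 病 | ⿸疒丙 |
| ⿹ | U+2FF9 | Above-right surround | 戒 | ⿹戈廾 |
| ⿺ | U+2FFA | Below-left surround | 超 | ⿺走召 |
| ⿻ | U+2FFB | Overlaid | 巫 | ⿻工从 |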

Figure: Worked decomposition examples. Each character is rewritten as an IDC operator (drawn as a dashed box in the image) followed by its component characters: 字 as ⿱宀子, 匠 as ⿷匚斤, 京 as ⿳亠口小, 米 as ⿻八木. Source: Wikimedia Commons.

Ten of the twelve are binary operators (two operands); ⿲ and ⿳ are ternary (three). Unicode 15.1 added four more to the block (U+2FFC--U+2FFF) for left-open surround, bottom-right surround, horizontal reflection, and rotation, the last two of which take a single operand. That brings the block's total to 16, but the original twelve handle the vast majority of characters.
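Because every operator is a prefix operator with a fixed number of operands, a sequence can be parsed with a few lines of recursive descent. The following is a minimal sketch covering only the original twelve operators; the names `ARITY`, `Node`, and `parse_ids` are made up for this example, and real IDS data may contain extras (entity references, placeholder characters) that it does not handle.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Arity of the twelve original operators: all binary except ⿲ and ⿳.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # ⿲ left-middle-right
ARITY["\u2FF3"] = 3  # ⿳ above-middle-below


@dataclass
class Node:
    op: str                 # the IDC, e.g. "⿰"
    children: List["Tree"]  # operands, in reading order


Tree = Union[str, Node]     # a leaf is a single component character


def _parse(ids: str, i: int) -> Tuple[Tree, int]:
    """Parse one operand starting at index i; return (tree, next index)."""
    if i >= len(ids):
        raise ValueError("IDS ended before all operands were supplied")
    ch = ids[i]
    if ch not in ARITY:
        return ch, i + 1    # plain component: a leaf
    children = []
    j = i + 1
    for _ in range(ARITY[ch]):
        child, j = _parse(ids, j)
        children.append(child)
    return Node(ch, children), j


def parse_ids(ids: str) -> Tree:
    """Parse a complete Ideographic Description Sequence into a tree."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError(f"trailing characters after position {end}")
    return tree
```

For example, `parse_ids("⿰言⿱五口")` (a nested description of 語) returns a ⿰ node whose second child is itself a ⿱ node, while a truncated sequence such as `"⿰木"` raises `ValueError`.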

Figure: The full Ideographic Description Characters Unicode block (U+2FF0--U+2FFF), shown as sixteen code-point cells, including the four operators added in Unicode 15.1 (U+2FFC--U+2FFF). Source: Wikimedia Commons.

The CHISE project and the cjkvi-ids database on GitHub have applied IDS decomposition to over 75,000 CJK ideographs, producing a machine-readable structural atlas of the entire character space. The distribution is heavily skewed: Gao and Kao (2002) found that over 60% of high-frequency characters use ⿰ (left-right) and roughly 20% use ⿱ (top-bottom), with the remainder divided among enclosure and overlay patterns. Left-right dominance reflects the phono-semantic architecture, which typically places the semantic radical on the left and the phonetic component on the right.
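A distribution like this can be measured by tallying the first character of each description, since the outermost operator always comes first in prefix notation. Here is a rough sketch over a hand-picked sample; `SAMPLE` and `top_level_shares` are illustrative names, and eight characters chosen by hand will not reproduce the published percentages. A real run would load a full IDS database instead.

```python
from collections import Counter

# All sixteen code points of the IDC block, U+2FF0 through U+2FFF.
IDCS = {chr(cp) for cp in range(0x2FF0, 0x3000)}

# Hand-picked illustrative sample: character -> IDS.
SAMPLE = {
    "相": "⿰木目",
    "語": "⿰言吾",
    "休": "⿰亻木",
    "杏": "⿱木口",
    "字": "⿱宀子",
    "病": "⿸疒丙",
    "超": "⿺走召",
    "回": "⿴囗口",
}


def top_level_shares(ids_by_char):
    """Share of characters per outermost operator ('atomic' if none)."""
    counts = Counter(
        ids[0] if ids and ids[0] in IDCS else "atomic"
        for ids in ids_by_char.values()
    )
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


shares = top_level_shares(SAMPLE)
for op, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{op}: {share:.1%}")
```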

The Seven Positional Slots

Japanese pedagogy formalizes component placement into seven named positions (部首の位置). These are not arbitrary labels -- they are structural constraints that determine which shape variant a component takes and which slot it occupies.

| Position | Japanese Reading | Location | Example components |
|---|---|---|---|
| Hen | へん | Left side | 氵, 亻, 扌 |
| Tsukuri | つくり | Right side | 刂, 攵 (in 教), 頁 (in 頭) |
| Kanmuri | かんむり | Top (crown) | 艹, 宀, 雨 |
| Ashi | あし | Bottom (legs) | 灬, 心, 皿 |
| Tare | たれ | Top-left drape | 广, 疒, 尸 |
| Nyou | にょう | Bottom-left wrap | 辶, 廴, 之 |
| Kamae | かまえ | Full/partial surround | 門, 囗, 行 |

The hen (left) and tsukuri (right) positions dominate, accounting for over 60% of all component placements -- a direct consequence of the ⿰ operator's prevalence. Among the 2,136 joyo kanji, just 6 radicals account for 25% of all characters, and 50 radicals cover 75%. Nearly all appear in the hen or kanmuri slots. Many radicals are constrained to a single position: 氵 is always hen, 刂 is always tsukuri, 艹 is always kanmuri. When a component moves position, it often changes shape: 水 becomes 氵 on the left, 心 becomes 忄 on the left but stays 心 on the bottom, and 火 becomes 灬 at the bottom.
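The link between operators, slots, and shape variants can be written down as two lookup tables. This is a sketch under assumptions: only the ⿰ to hen/tsukuri pairing is stated in the text, the other operator-to-slot pairings in `SLOTS` are a plausible reading of the position table, and `VARIANTS` takes 水, 心, and 火 as the base forms of 氵, 忄, and 灬. All names are hypothetical.

```python
# Slot names for (first operand, second operand) of each operator.
# None means the pedagogy has no dedicated name for that operand.
# Assumption: pairings other than ⿰ are inferred, not taken from a standard.
SLOTS = {
    "⿰": ("hen", "tsukuri"),
    "⿱": ("kanmuri", "ashi"),
    "⿸": ("tare", None),
    "⿺": ("nyou", None),
    "⿴": ("kamae", None),
    "⿵": ("kamae", None),
    "⿶": ("kamae", None),
    "⿷": ("kamae", None),
    "⿹": ("kamae", None),
}

# (base form, slot) -> shape the component takes in that slot.
VARIANTS = {
    ("水", "hen"): "氵",
    ("心", "hen"): "忄",
    ("心", "ashi"): "心",
    ("火", "ashi"): "灬",
}


def slot_of(operator: str, operand_index: int):
    """Named slot for operand 0 or 1 of an operator, or None."""
    return SLOTS.get(operator, (None, None))[operand_index]


def shape_in_slot(base: str, slot: str) -> str:
    """Variant shape for a base component in a slot; unchanged if unlisted."""
    return VARIANTS.get((base, slot), base)


print(slot_of("⿰", 0), shape_in_slot("水", slot_of("⿰", 0)))
```

A table like `VARIANTS` is deliberately a default rather than a rule: a fuller version would need per-character exceptions for components that keep their base shape in a slot where they usually change.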

Phonetic Components: The Sound Encoding Layer

Approximately 67--82% of kanji are phono-semantic compounds (形声文字), depending on the analysis. The phonetic component (声符 seifu) encodes the on'yomi while the semantic radical signals the meaning domain. The EDRDG project catalogs 150 phonetic components; KanjiJump documents 808 sound components across the broader set, noting that 74% of the 3,500 most important kanji either include or serve as a sound component.

Reliability varies dramatically. Some phonetic series achieve 100% consistency. Others degrade through centuries of sound change between Old Chinese and modern Japanese on'yomi. The ten most productive phonetic components:

| Component | On'yomi | Derivatives | Reliability | Example Series |
|---|---|---|---|---|
|  |  | ~30 | Medium | 匙 |
|  | シャ | ~23 | Medium |  |
|  | セイ | ~21 | Medium |  |
|  | ホウ | ~20 | Medium |  |
|  | サイ | ~19 | Medium |  |
|  |  | ~17 | High |  |
| 圭 | ケイ | ~17 | High | 畦, 桂, 蛙 |
|  | ホウ | ~16 | High |  |
|  | ハク | ~15 | High |  |
|  | カク | ~15 | High |  |

High reliability: >80% of derivatives share the predicted on'yomi. Medium: 50--80%. Data from EDRDG, The Kanji Code, and KanjiJump.
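The High and Medium labels are simple thresholds on the share of derivatives that carry the predicted reading. A minimal sketch follows; the function name is made up, the "Low" label for shares under 50% is an assumption the article does not define, and the reading lists in the usage lines are placeholders rather than real series data.

```python
from typing import Iterable, Tuple


def reliability(predicted: str, readings: Iterable[str]) -> Tuple[float, str]:
    """Share of derivative readings equal to the predicted on'yomi, plus a label."""
    readings = list(readings)
    if not readings:
        raise ValueError("a series needs at least one derivative")
    share = sum(r == predicted for r in readings) / len(readings)
    if share > 0.80:
        label = "High"
    elif share >= 0.50:
        label = "Medium"
    else:
        label = "Low"  # assumption: the article defines only High and Medium
    return share, label


# Placeholder inputs: one reading per derivative in a hypothetical series.
print(reliability("ホウ", ["ホウ"] * 5))
print(reliability("セイ", ["セイ", "セイ", "セイ", "ジョウ", "ショウ"]))
```

Real kanji often have several on'yomi, so a fuller version would have to decide whether any matching reading counts or only the primary one; that choice alone can move a series across a threshold.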

The "perfect series" are the highest-leverage components for learners. 票 (ヒョウ) generates 12 derivatives, including 瓢 and 剽, all reading ヒョウ with zero exceptions. 冓 (コウ) yields 10 derivatives, all reading コウ. 包 (ホウ) gives 6, including 胞, all ホウ. With these 100%-reliable series, learning a single component lets you predict the on'yomi of every character in the family on sight.
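In code, a perfect series is nothing more than a lookup from component to reading. The sketch below takes 票, 冓, and 包 as the components behind the ヒョウ, コウ, and ホウ series; `PERFECT_SERIES` and `predict_onyomi` are names invented for this example, and the component lists passed in are assumed to come from some decomposition source.

```python
from typing import Iterable, Optional

# Components whose derivatives all share one on'yomi, per the text above.
PERFECT_SERIES = {
    "票": "ヒョウ",
    "冓": "コウ",
    "包": "ホウ",
}


def predict_onyomi(components: Iterable[str]) -> Optional[str]:
    """Return the reading of the first perfect-series component, if any."""
    for component in components:
        if component in PERFECT_SERIES:
            return PERFECT_SERIES[component]
    return None  # no perfect-series component: make no prediction


print(predict_onyomi(["票", "瓜"]))   # components of 瓢
print(predict_onyomi(["月", "包"]))   # components of 胞
print(predict_onyomi(["木", "目"]))   # components of 相
```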

Computational Decomposition: CHISE and the IDS Tree

The CHISE project (Character Information Service Environment), based at Kyoto University, maintains an IDS decomposition database serialized in RDF and queryable via SPARQL. Each character is represented as a tree of composition operators and atomic components -- essentially an abstract syntax tree for glyphs. The cjk-decomp project provides decomposition data for 75,000 ideographs, identifying approximately 10,000 intermediate composite components between the atomic primitives and the final characters.
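The distinction between atomic primitives and intermediate composites shows up as soon as a one-level decomposition table is expanded recursively. This sketch uses a tiny hand-written table, not the CHISE or cjk-decomp format; `DECOMP`, `leaves`, and `intermediates` are illustrative names, any character absent from the table is treated as atomic, and the table is assumed to be acyclic.

```python
# One-level decompositions: character -> (operator, operands).
DECOMP = {
    "語": ("⿰", ["言", "吾"]),
    "吾": ("⿱", ["五", "口"]),
    "照": ("⿱", ["昭", "灬"]),
    "昭": ("⿰", ["日", "召"]),
    "召": ("⿱", ["刀", "口"]),
}


def leaves(char, table):
    """Atomic components of char, left to right, after full expansion."""
    if char not in table:
        return [char]
    _, parts = table[char]
    result = []
    for part in parts:
        result.extend(leaves(part, table))
    return result


def intermediates(char, table):
    """Composite sub-components met on the way down, outermost first."""
    found = []
    for part in table.get(char, ("", []))[1]:
        if part in table:
            found.append(part)
            found.extend(intermediates(part, table))
    return found


print(leaves("照", DECOMP))
print(intermediates("照", DECOMP))
```

Here 照 expands to four leaves by way of two intermediate composites, 昭 and 召. How deep the expansion goes depends entirely on which characters the table chooses to decompose further.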

This hierarchy mirrors how compilers represent expressions: terminals (atomic strokes and components), non-terminals (composite sub-components), and production rules (the IDS operators). The implication is that kanji are not 50,000 independent symbols. They are 50,000 strings generated by a grammar with roughly 300--700 terminals and 12 production rules. Viewed this way, the writing system is less a dictionary than a codebase -- and component analysis is the decompiler.

We've made our own decompiler: the Kanji Atlas renders the full component graph for the 2,136 joyo characters. Atlas Grade 1 is the place to see the principle in action -- every Grade 1 character is decomposed into its kanji, radicals, and graphemes.

References
