The Architecture of Kanji: Components, Positions, and Composition Rules

Kanji as a Compression Algorithm

The CJK Unified Ideographs blocks in Unicode encode 97,680 characters. The KanjiJump decomposition project reduces the 3,500 most important Japanese kanji to just 281 atomic components -- 200 if you merge positional variants. That is a compression ratio of roughly 12:1, from a component alphabet of only a few hundred shapes. The Cihai dictionary identifies 675 primitive components across 16,339 characters; a 2009 Chinese national standard narrows the set to 514 for common use.
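The ratio is simple division, and it is worth seeing what merging positional variants does to it. A minimal sketch using only the figures quoted above (the function name `ratio` is just for illustration):

```python
# Figures quoted in the paragraph above (KanjiJump).
KANJI_COVERED = 3500
ATOMIC_COMPONENTS = 281
MERGED_COMPONENTS = 200  # after merging positional variants


def ratio(characters: int, components: int) -> float:
    """Characters described per component in the inventory."""
    return characters / components


print(f"{ratio(KANJI_COVERED, ATOMIC_COMPONENTS):.1f}:1")  # 12.5:1
print(f"{ratio(KANJI_COVERED, MERGED_COMPONENTS):.1f}:1")  # 17.5:1
```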

This is not metaphor. Kanji are a combinatorial writing system with a formal grammar, positional constraints, and phonetic encoding -- properties that Unicode has literally codified into twelve composition operators.

The Twelve Composition Operators: Unicode's Kanji Grammar

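The first twelve code points of Unicode's Ideographic Description Characters block, U+2FF0--U+2FFB, define the original Ideographic Description Characters (IDCs) -- prefix operators that describe how components combine into characters. Together with the components they take as operands, they form a context-free grammar for glyph structure: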

Symbol	Code Point	Name	Example	Decomposition
⿰	U+2FF0	Left to right	相	⿰木目
⿱	U+2FF1	Above to below	杏	⿱木口
⿲	U+2FF2	Left-middle-right	衍	⿲彳氵亍
⿳	U+2FF3	Above-middle-below	京	⿳亠口小
⿴	U+2FF4	Full surround	回	⿴囗口
⿵	U+2FF5	Surround from above	凰	⿵几皇
⿶	U+2FF6	Surround from below	凶	⿶凵㐅
⿷	U+2FF7	Surround from left	匠	⿷匚斤
⿸	U+2FF8	Above-left surround	病	⿸疒丙
⿹	U+2FF9	Above-right surround	戒	⿹戈廾
⿺	U+2FFA	Below-left surround	超	⿺走召
⿻	U+2FFB	Overlaid	巫	⿻工从
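The table can be regenerated from the Unicode Character Database that ships with Python. This is a small sketch using the standard `unicodedata` module; it prints the formal Unicode names, which are worded slightly differently from the informal names in the table (the helper name `idc_table` is made up for this example, and `str.removeprefix` needs Python 3.9 or later).

```python
import unicodedata

FIRST, LAST = 0x2FF0, 0x2FFB  # the original twelve operators
PREFIX = "IDEOGRAPHIC DESCRIPTION CHARACTER "


def idc_table():
    """Return (symbol, code point, formal name) for each original IDC."""
    rows = []
    for cp in range(FIRST, LAST + 1):
        ch = chr(cp)
        name = unicodedata.name(ch).removeprefix(PREFIX)
        rows.append((ch, f"U+{cp:04X}", name))
    return rows


for row in idc_table():
    print(*row, sep="\t")
```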

Figure: worked Ideographic Description Sequences. 字 decomposes as ⿱宀子, 匠 as ⿷匚斤, 京 as ⿳亠口小, 米 as ⿻八木; each character is rewritten as an IDC operator followed by its component characters. Source: Wikimedia Commons.

Ten of the twelve are binary operators (two operands); ⿲ and ⿳ are ternary (three). Unicode 15.1 added four more (U+2FFC--U+2FFF): surround from right, surround from lower right, horizontal reflection, and rotation, the last two of which take a single operand. That brings the block's total to 16, but the original twelve handle the vast majority of characters.
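Because each operator has a fixed number of operands, a prefix-notation sequence can be parsed with a few lines of recursive descent. The following is a minimal sketch, not a reference implementation: it covers only the original twelve operators, and the names `Node` and `parse_ids` are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

BINARY = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")  # two operands each
TERNARY = set("⿲⿳")              # three operands each


@dataclass
class Node:
    op: str
    children: List["Tree"]


Tree = Union[str, Node]  # a leaf is a single component character


def _parse(ids: str, pos: int) -> Tuple[Tree, int]:
    """Parse one subtree starting at pos; return it and the next position."""
    if pos >= len(ids):
        raise ValueError("IDS ended before every operand was supplied")
    ch = ids[pos]
    pos += 1
    if ch in BINARY:
        arity = 2
    elif ch in TERNARY:
        arity = 3
    else:
        return ch, pos  # not an operator, so it is a component
    children = []
    for _ in range(arity):
        child, pos = _parse(ids, pos)
        children.append(child)
    return Node(ch, children), pos


def parse_ids(ids: str) -> Tree:
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


print(parse_ids("⿳亠口小"))
```

Operators nest, so a sequence such as ⿱⿰木目心 (想, with 相 on top of 心) parses into a two-level tree.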

Figure: the full Ideographic Description Characters Unicode block (U+2FF0--U+2FFF), sixteen code points including the four operators added in Unicode 15.1 (U+2FFC--U+2FFF). Source: Wikimedia Commons.

The CHISE project and the cjkvi-ids database on GitHub have applied IDS decomposition to over 75,000 CJK ideographs, producing a machine-readable structural atlas of the entire character space. The distribution is heavily skewed: Gao and Kao (2002) found that over 60% of high-frequency characters use ⿰ (left-right) and roughly 20% use ⿱ (top-bottom), with the remainder divided among enclosure and overlay patterns. Left-right dominance reflects the phono-semantic architecture: typically a semantic radical on the left and a phonetic component on the right.
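Measuring that skew only requires looking at the first character of each sequence, since the top-level operator always comes first in prefix notation. Here is a rough sketch over a hand-picked sample of eight characters from this article; the sample is far too small to reproduce the published percentages, and loading a real IDS database is left out because its file layout is not described here. `top_level_shares` is a name invented for this example.

```python
from collections import Counter

# Hand-picked sample, NOT a frequency-weighted corpus.
SAMPLE_IDS = {
    "相": "⿰木目",
    "海": "⿰氵毎",
    "休": "⿰亻木",
    "持": "⿰扌寺",
    "杏": "⿱木口",
    "字": "⿱宀子",
    "回": "⿴囗口",
    "病": "⿸疒丙",
}


def top_level_shares(ids_by_char):
    """Share of characters whose outermost operator is each IDC."""
    counts = Counter(ids[0] for ids in ids_by_char.values())
    total = sum(counts.values())
    return {op: n / total for op, n in counts.most_common()}


for op, share in top_level_shares(SAMPLE_IDS).items():
    print(op, f"{share:.0%}")
```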

The Seven Positional Slots

Japanese pedagogy formalizes component placement into seven named positions (部首の位置). These are not arbitrary labels -- they are structural constraints that determine which shape variant a component takes and which slot it occupies.

Position	Japanese	Reading	Location	Examples
Hen	偏	へん	Left side	氵 in 海, 亻 in 休, 扌 in 持
Tsukuri	旁	つくり	Right side	刂 in 判, 攵 in 教, 頁 in 頭
Kanmuri	冠	かんむり	Top (crown)	艹 in 花, 宀 in 安, 雨 in 雲
Ashi	脚	あし	Bottom (legs)	灬 in 然, 心 in 思, 皿 in 盤
Tare	垂	たれ	Top-left drape	广 in 店, 疒 in 病, 尸 in 届
Nyou	繞	にょう	Bottom-left wrap	辶 in 道, 廴 in 建, 走 in 起
Kamae	構	かまえ	Full/partial surround	門 in 間, 囗 in 国, 行 in 術

The hen (left) and tsukuri (right) positions dominate, accounting for over 60% of all component placements -- a direct consequence of the ⿰ operator's prevalence. Among the 2,136 joyo kanji, just 6 radicals account for 25% of all characters, and 50 radicals cover 75%. Nearly all of these radicals appear in the hen or kanmuri slots. Many radicals are constrained to a single position: 氵 is always hen, 刂 is always tsukuri, 艹 is always kanmuri. When a component moves position, it often changes shape: 水 becomes 氵 on the left, 心 becomes 忄 on the left but stays 心 on the bottom, and 火 becomes 灬 at the bottom.
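These constraints fit naturally into two lookup tables: one for shape variants keyed by (component, slot), one for radicals pinned to a single slot. The sketch below encodes only the examples named in this section. The function names, and the fallback of returning the base shape when no variant is recorded, are assumptions made for illustration.

```python
# (base component, slot) -> shape used in that slot, from the text above.
VARIANTS = {
    ("水", "hen"): "氵",
    ("心", "hen"): "忄",
    ("心", "ashi"): "心",
    ("火", "ashi"): "灬",
}

# Radicals the text describes as confined to one slot.
FIXED_SLOT = {"氵": "hen", "刂": "tsukuri", "艹": "kanmuri"}


def shape_in_slot(base: str, slot: str) -> str:
    """Shape a component takes in a slot (assumed unchanged if unlisted)."""
    return VARIANTS.get((base, slot), base)


def allowed(component: str, slot: str) -> bool:
    """False only when the component is pinned to a different slot."""
    fixed = FIXED_SLOT.get(component)
    return fixed is None or fixed == slot


print(shape_in_slot("水", "hen"), allowed("氵", "ashi"))
```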

Phonetic Components: The Sound Encoding Layer

Approximately 67--82% of kanji are phono-semantic compounds (形声文字), depending on the analysis. The phonetic component (声符 seifu) encodes the on'yomi while the semantic radical signals the meaning domain. The EDRDG project catalogs 150 phonetic components; KanjiJump documents 808 sound components across the broader set, noting that 74% of the 3,500 most important kanji either include or serve as a sound component.

Reliability varies dramatically. Some phonetic series achieve 100% consistency. Others degrade through centuries of sound change between Old Chinese and modern Japanese on'yomi. The ten most productive phonetic components:

Component	On'yomi	Derivatives	Reliability	Example Series
匕	ヒ	~30	Medium	比, 匙, 旨, 尼, 北
者	シャ	~23	Medium	暑, 署, 諸, 緒, 都
生	セイ	~21	Medium	性, 星, 姓, 牲, 産
勹	ホウ	~20	Medium	包, 抱, 泡, 砲, 飽
隹	スイ	~19	Medium	推, 維, 雄, 集, 準
可	カ	~17	High	何, 河, 荷, 歌, 苛
圭	ケイ	~17	High	掛, 畦, 桂, 蛙, 街
方	ホウ	~16	High	放, 防, 紡, 坊, 芳
白	ハク	~15	High	伯, 拍, 泊, 迫, 舶
各	カク	~15	High	格, 閣, 額, 客, 略

High reliability: >80% of derivatives share the predicted on'yomi. Medium: 50--80%. Data from EDRDG, The Kanji Code, and KanjiJump.

The "perfect series" are the highest-leverage components for learners. 票 (ヒョウ) generates 12 derivatives -- 標, 漂, 瓢, 剽, and others -- all reading ヒョウ with zero exceptions. 冓 (コウ) yields 10 derivatives (including 構, 溝, 講, 購), all reading コウ. 包 (ホウ) gives 6 (including 抱, 泡, 砲, 胞, 飽), all ホウ. These 100%-reliable series mean that learning a single component lets you predict the on'yomi of every character in the family on sight.
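The reliability grades come down to one fraction: how many derivatives carry the on'yomi that the component predicts. A minimal sketch follows, using the thresholds stated under the table and the 包 series as listed above. The strict string match, the "Low" label for anything under 50%, and the function names are assumptions of this example; the cited sources may count near-matches such as voiced variants differently.

```python
def reliability(predicted: str, readings: dict) -> float:
    """Fraction of derivatives whose on'yomi equals the predicted one."""
    hits = sum(1 for r in readings.values() if r == predicted)
    return hits / len(readings)


def grade(share: float) -> str:
    if share > 0.8:
        return "High"
    if share >= 0.5:
        return "Medium"
    return "Low"  # label assumed; the article defines only High and Medium


# The 包 series as given in the text: every derivative reads ホウ.
HOU_SERIES = {"抱": "ホウ", "泡": "ホウ", "砲": "ホウ", "胞": "ホウ", "飽": "ホウ"}

share = reliability("ホウ", HOU_SERIES)
print(f"{share:.0%}", grade(share))  # 100% High
```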

Computational Decomposition: CHISE and the IDS Tree

The CHISE project (Character Information Service Environment), based at Kyoto University, maintains an IDS decomposition database serialized in RDF and queryable via SPARQL. Each character is represented as a tree of composition operators and atomic components -- essentially an abstract syntax tree for glyphs. The cjk-decomp project provides decomposition data for 75,000 ideographs, identifying approximately 10,000 intermediate composite components between the atomic primitives and the final characters.

This hierarchy mirrors how compilers represent expressions: terminals (atomic strokes and components), non-terminals (composite sub-components), and production rules (the IDS operators). The implication is that kanji are not 50,000 independent symbols. They are 50,000 strings generated by a grammar with roughly 300--700 terminals and 12 production rules. Viewed this way, the writing system is less a dictionary than a codebase -- and component analysis is the decompiler.
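The terminal/non-terminal split can be made concrete by expanding a character until only atomic components remain. The sketch below uses a toy two-entry table in which 相 plays the role of an intermediate composite inside 想. Real databases are far larger and formatted differently, and the names `DECOMP`, `expand`, and `terminals` are invented here.

```python
# The original twelve operators, U+2FF0 through U+2FFB.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Toy production rules: character -> one-level IDS.
DECOMP = {
    "想": "⿱相心",
    "相": "⿰木目",
}


def expand(ch: str) -> str:
    """Rewrite ch recursively until no symbol has a rule left."""
    ids = DECOMP.get(ch)
    if ids is None or ids == ch:  # no rule, or a self-referential entry
        return ch
    return "".join(expand(c) for c in ids)


def terminals(ch: str) -> list:
    """Atomic components of ch, in reading order of the expanded IDS."""
    return [c for c in expand(ch) if c not in IDC]


print(expand("想"))     # ⿱⿰木目心
print(terminals("想"))  # ['木', '目', '心']
```

The sketch assumes the rule table has no cycles beyond the self-referential case it checks for; a production version would need a proper guard.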

We've made our own decompiler: the Kanji Atlas renders the full component graph for the 2,136 joyo characters. Atlas Grade 1 is the place to see the principle in action -- every Grade 1 character is decomposed into its kanji, radicals, and graphemes.

References

Unicode Consortium. "Ideographic Description Characters." The Unicode Standard, Chapter 18.
CHISE Project. chise.org
cjkvi-ids. IDS Data for CJK Unified Ideographs. github.com/cjkvi/cjkvi-ids
amake/cjk-decomp. Decomposition data for 75,000 CJK ideographs. github.com/amake/cjk-decomp
KanjiJump. "The 281 Atomic Kanji Components." kanjijump.com
Gao, D.G. & Kao, H.S.R. (2002). Chinese character structure analysis. Acta Psychologica Sinica.
EDRDG. Kanji Phonetic Components. edrdg.org
Millen, A. The Kanji Code. thekanjicode.com
Wikipedia. "Ideographic Description Characters," "Chinese Character Components," "List of Kanji Radicals by Frequency."