The Architecture of Kanji: Components, Positions, and Composition Rules
Kanji as a Compression Algorithm
The CJK Unified Ideographs block in Unicode encodes 97,680 characters. The KanjiJump decomposition project reduces the 3,500 most important Japanese kanji to just 281 atomic components -- 200 if you merge positional variants. That is a compression ratio of roughly 12:1 from a component alphabet smaller than the set of English Scrabble tiles. The Cihai dictionary identifies 675 primitive components across 16,339 characters; a 2009 Chinese national standard narrows it to 514 for common use.
This is not metaphor. Kanji are a combinatorial writing system with a formal grammar, positional constraints, and phonetic encoding -- properties that Unicode has literally codified into twelve composition operators.
The Twelve Composition Operators: Unicode's Kanji Grammar
Unicode block U+2FF0--U+2FFB defines Ideographic Description Characters (IDCs) -- prefix operators that describe how components combine into characters. These form a context-free grammar for glyph structure:
| Symbol | Code Point | Name | Example | Decomposition |
|---|---|---|---|---|
| ⿰ | U+2FF0 | Left to right | 相 | ⿰木目 |
| ⿱ | U+2FF1 | Above to below | 杏 | ⿱木口 |
| ⿲ | U+2FF2 | Left-middle-right | 衍 | ⿲彳氵亍 |
| ⿳ | U+2FF3 | Above-middle-below | 京 | ⿳亠口小 |
| ⿴ | U+2FF4 | Full surround | 回 | ⿴囗口 |
| ⿵ | U+2FF5 | Surround from above | 凰 | ⿵几皇 |
| ⿶ | U+2FF6 | Surround from below | 凶 | ⿶凵㐅 |
| ⿷ | U+2FF7 | Surround from left | 匠 | ⿷匚斤 |
| ⿸ | U+2FF8 | Above-left surround | 病 | ⿸疒丙 |
| ⿹ | U+2FF9 | Above-right surround | 戒 | ⿹戈廾 |
| ⿺ | U+2FFA | Below-left surround | 超 | ⿺走召 |
| ⿻ | U+2FFB | Overlaid | 巫 | ⿻工从 |

Worked decomposition examples: each character on the left is rewritten as an IDC operator (the dashed box) followed by its component characters. Source: Wikimedia Commons.
Ten of the twelve are binary operators (two operands); ⿲ and ⿳ are ternary (three). Unicode 15.1 added four more (U+2FFC--U+2FFF) for left-open surround, bottom-right surround, horizontal reflection, and rotation -- bringing the total to 16. But the original twelve handle the vast majority of characters.

The full Ideographic Description Characters Unicode block (U+2FF0--U+2FFF), including the four operators added in Unicode 15.1 (U+2FFC--U+2FFF). Source: Wikimedia Commons.
The CHISE project and the cjkvi-ids database on GitHub have applied IDS decomposition to over 75,000 CJK ideographs, producing a machine-readable structural atlas of the entire character space. The distribution is heavily skewed: Gao and Kao (2002) found that over 60% of high-frequency characters use ⿰ (left-right), roughly 20% use ⿱ (top-bottom), and the remaining 20% divide among enclosure and overlay patterns. Left-right dominance reflects the phono-semantic architecture: semantic radical on the left, phonetic component on the right.
The Seven Positional Slots
Japanese pedagogy formalizes component placement into seven named positions (部首の位置). These are not arbitrary labels -- they are structural constraints that determine which shape variant a component takes and which slot it occupies.
| Position | Japanese | Reading | Location | Examples |
|---|---|---|---|---|
| Hen | 偏 | へん | Left side | 氵 in 海, 亻 in 休, 扌 in 持 |
| Tsukuri | 旁 | つくり | Right side | 刂 in 判, 攵 in 教, 頁 in 頭 |
| Kanmuri | 冠 | かんむり | Top (crown) | 艹 in 花, 宀 in 安, 雨 in 雲 |
| Ashi | 脚 | あし | Bottom (legs) | 灬 in 然, 心 in 思, 皿 in 盤 |
| Tare | 垂 | たれ | Top-left drape | 广 in 店, 疒 in 病, 尸 in 届 |
| Nyou | 繞 | にょう | Bottom-left wrap | 辶 in 道, 廴 in 建, 之 in 芝 |
| Kamae | 構 | かまえ | Full/partial surround | 門 in 間, 囗 in 国, 行 in 術 |
The hen (left) and tsukuri (right) positions dominate, accounting for over 60% of all component placements -- a direct consequence of the ⿰ operator's prevalence. Among the 2,136 joyo kanji, just 6 radicals account for 25% of all characters, and 50 radicals cover 75%. Nearly all appear in the hen or kanmuri slots. Many radicals are constrained to a single position: 氵 is always hen, 刂 is always tsukuri, 艹 is always kanmuri. When a component moves position, it changes shape -- 水 becomes 氵 on the left, 心 becomes 忄 on the left but stays 心 on the bottom, 火 becomes 灬 at the bottom.
Phonetic Components: The Sound Encoding Layer
Approximately 67--82% of kanji are phono-semantic compounds (形声文字), depending on the analysis. The phonetic component (声符 seifu) encodes the on'yomi while the semantic radical signals the meaning domain. The EDRDG project catalogs 150 phonetic components; KanjiJump documents 808 sound components across the broader set, noting that 74% of the 3,500 most important kanji either include or serve as a sound component.
Reliability varies dramatically. Some phonetic series achieve 100% consistency. Others degrade through centuries of sound change between Old Chinese and modern Japanese on'yomi. The ten most productive phonetic components:
| Component | On'yomi | Derivatives | Reliability | Example Series |
|---|---|---|---|---|
| 匕 | ヒ | ~30 | Medium | 比, 匙, 旨, 尼, 北 |
| 者 | シャ | ~23 | Medium | 暑, 署, 諸, 緒, 都 |
| 生 | セイ | ~21 | Medium | 性, 星, 姓, 牲, 産 |
| 勹 | ホウ | ~20 | Medium | 包, 抱, 泡, 砲, 飽 |
| 隹 | サイ | ~19 | Medium | 推, 維, 雄, 集, 準 |
| 可 | カ | ~17 | High | 何, 河, 荷, 歌, 苛 |
| 圭 | ケイ | ~17 | High | 掛, 畦, 桂, 蛙, 街 |
| 方 | ホウ | ~16 | High | 放, 防, 紡, 坊, 芳 |
| 白 | ハク | ~15 | High | 伯, 拍, 泊, 迫, 舶 |
| 各 | カク | ~15 | High | 格, 閣, 額, 客, 略 |
High reliability: >80% of derivatives share the predicted on'yomi. Medium: 50--80%. Data from EDRDG, The Kanji Code, and KanjiJump.
The "perfect series" are the highest-leverage components for learners. 票 (ヒョウ) generates 12 derivatives -- 標, 漂, 瓢, 剽, and others -- all reading ヒョウ with zero exceptions. 冓 (コウ) yields 10 derivatives (構, 溝, 講, 購), all reading コウ. 包 (ホウ) gives 6 (抱, 泡, 砲, 胞, 飽), all ホウ. These 100%-reliable series mean that learning a single component lets you predict the on'yomi of every character in the family on sight.
Computational Decomposition: CHISE and the IDS Tree
The CHISE project (Character Processing Based on a Huge Structured Environment), based at Kyoto University, maintains an IDS decomposition database serialized in RDF and queryable via SPARQL. Each character is represented as a tree of composition operators and atomic components -- essentially an abstract syntax tree for glyphs. The cjk-decomp project provides decomposition data for 75,000 ideographs, identifying approximately 10,000 intermediate composite components between the atomic primitives and the final characters.
This hierarchy mirrors how compilers represent expressions: terminals (atomic strokes and components), non-terminals (composite sub-components), and production rules (the IDS operators). The implication is that kanji are not 50,000 independent symbols. They are 50,000 strings generated by a grammar with roughly 300--700 terminals and 12 production rules. Viewed this way, the writing system is less a dictionary than a codebase -- and component analysis is the decompiler.
We've made our own decompiler: the Kanji Atlas renders the full component graph for the 2,136 joyo characters. Atlas Grade 1 is the place to see the principle in action — every Grade 1 character is decomposed into its kanji, radicals, and graphemes.
References
- Unicode Consortium. "Ideographic Description Characters." The Unicode Standard, Chapter 18.
- CHISE Project. chise.org
- cjkvi-ids. IDS Data for CJK Unified Ideographs. github.com/cjkvi/cjkvi-ids
- amake/cjk-decomp. Decomposition data for 75,000 CJK ideographs. github.com/amake/cjk-decomp
- KanjiJump. "The 281 Atomic Kanji Components." kanjijump.com
- Gao, D.G. & Kao, H.S.R. (2002). Chinese character structure analysis. Acta Psychologica Sinica.
- EDRDG. Kanji Phonetic Components. edrdg.org
- Millen, A. The Kanji Code. thekanjicode.com
- Wikipedia. "Ideographic Description Characters," "Chinese Character Components," "List of Kanji Radicals by Frequency."