May 4, 2026

Allwinner H5 -- the aarch64 MMU, my analysis

The following document is from ARM and is probably the best general reference:

The place to start is the TCR register. I am currently only working on understanding MMU setup for EL2, and have not looked at how this register might vary for EL1.

TCR.T0SZ

This is a 6 bit field (so a value from 0-63). It gives the number of upper bits that will be ignored in virtual addresses. If the value is 32, then the upper 32 bits are ignored and a VA has only 32 bits.

The document says that the first level table (level 0) is omitted if the VA is restricted to 39 bits. My case sets T0SZ to 32 (0x20) and apparently they mean the first level is omitted if the VA is <= 39, as it clearly is omitted in my case.
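
The arithmetic behind this can be sketched in a few lines of C. This is only my own illustration: the function names are made up, and the range check is looser than what the architecture actually permits for T0SZ. It assumes a 4K granule, so 12 offset bits and 9 index bits per level.

```c
#include <assert.h>
#include <stdint.h>

/* Number of significant VA bits for a given TCR.T0SZ value. */
static unsigned va_bits_from_t0sz(unsigned t0sz)
{
    assert(t0sz <= 63);                 /* T0SZ is a 6 bit field */
    return 64 - t0sz;
}

/*
 * First translation level that is actually walked with a 4K granule.
 * Each level resolves 9 bits above the 12 bit page offset, so we count
 * how many levels are needed to cover the VA and drop the rest.
 * (Hypothetical helper, not taken from U-Boot.)
 */
static unsigned start_level_4k(unsigned t0sz)
{
    unsigned va_bits = va_bits_from_t0sz(t0sz);
    unsigned levels;

    assert(va_bits > 12 && va_bits <= 48);
    levels = (va_bits - 12 + 8) / 9;    /* divide by 9, rounding up */
    return 4 - levels;
}
```

With T0SZ = 32 this gives a 32 bit VA and a starting level of 1. A 39 bit VA (T0SZ = 25) also starts at level 1, while a 40 bit VA needs level 0.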

TCR.TG0

This is a 2 bit field giving the "granule size" (i.e. the page size). 3 values are possible:

00 -- 4K
01 -- 64K
10 -- 16K

TCR.PS

This is a 3 bit field giving the physical address size. The 52 bit case is special and gets tangled up with values in the ID_AA64MMFR0_EL1 register. The best thing is to ignore this and take the view that addresses are limited to 48 bits.

With the H5 chip I will only be working with 32 bit physical addresses.
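
As a small aid, here is a sketch of how the PS encoding maps to an address width. The table values are from my reading of the ARM architecture documentation, and the function name is my own invention.

```c
#include <assert.h>

/*
 * Physical address width in bits for a TCR.PS encoding.
 * Returns 0 for the reserved encoding 7.
 * The 52 bit case (encoding 6) is only valid on cores that
 * advertise it, which is the part this write-up ignores.
 */
static unsigned pa_bits_from_ps(unsigned ps)
{
    static const unsigned char bits[8] = { 32, 36, 40, 42, 44, 48, 52, 0 };

    assert(ps <= 7);                    /* PS is a 3 bit field */
    return bits[ps];
}
```

For the H5 with 32 bit physical addresses, PS = 0 is the encoding of interest.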

Table layout

There are 3 schemes depending entirely on the granule size chosen. The basic change is how many bits in the VA are set aside for each level.

VA with 4K granule --

      _______________________________________________
     |       |       |       |       |       |       |
     |   0   |  Lv0  |  Lv1  |  Lv2  |  Lv3  |  off  |
     |_______|_______|_______|_______|_______|_______|
       63-48   47-39   38-30   29-21   20-12   11-00

VA with 16K granule, note that only a single bit exists for level 0, which is thus a table with 2 entries.

      _______________________________________________
     |       |       |       |       |       |       |
     |   0   |  Lv0  |  Lv1  |  Lv2  |  Lv3  |  off  |
     |_______|_______|_______|_______|_______|_______|
       63-48    47     46-36   35-25   24-14   13-00

VA with 64K granule, here we have only 6 bits for level 1, which is thus a table with 64 entries.

      _______________________________________________
     |               |       |       |       |       |
     |       0       |  Lv1  |  Lv2  |  Lv3  |  off  |
     |_______________|_______|_______|_______|_______|
           63-48       47-42   41-29   28-16   15-00

In all cases upper levels may be eliminated by the choice of T0SZ.
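
The three diagrams follow one rule: table entries are 8 bytes and a table fills one granule, so each level indexes (granule bits - 3) bits of the VA. Here is a sketch of that rule in C, assuming a 48 bit VA limit. The function name and the way the granule is passed (as a shift count) are my own choices.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extract the table index for one level from a virtual address.
 * granule_shift is 12 (4K), 14 (16K) or 16 (64K); level is 0..3.
 * The top level is truncated at bit 47, which is why the 16K
 * level 0 index is 1 bit and the 64K level 1 index is 6 bits.
 */
static unsigned va_index(uint64_t va, unsigned granule_shift, unsigned level)
{
    unsigned idx_bits, shift, top;

    assert(granule_shift == 12 || granule_shift == 14 || granule_shift == 16);
    assert(level <= 3);

    idx_bits = granule_shift - 3;               /* 8 byte entries per granule */
    shift = granule_shift + (3 - level) * idx_bits;
    assert(shift < 48);                         /* e.g. no level 0 with 64K */

    top = shift + idx_bits;                     /* one past the highest index bit */
    if (top > 48)
        top = 48;
    return (unsigned)((va >> shift) & ((1ULL << (top - shift)) - 1));
}
```

For example, with a 4K granule the address 0x40000000 has a level 1 index of 1, since bits 38:30 hold the value 1.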

Consider the general case with a 4K granule and a 48 bit VA. The level 0 table has 512 entries.
Each L0 table entry controls 512G and points to a L1 table.
Each L1 table entry points to either a 1G block or an L2 table.
Each L2 table entry points to either a 2M block or an L3 table.
Each L3 table entry points to a 4K page.

Now consider my H5 set up with T0SZ = 32 and a 4K granule.
The level 0 table would be addressed by bits 47:39 and is out of the game.
The full level 1 table with 512 entries is used, even though 32 bits will only be able to address the first 4 entries.
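
The sizes quoted above, and the "first 4 entries" figure, fall out of simple shifts. A sketch, with function names of my own and a 4K granule assumed throughout:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of address space covered by one entry at a given level (4K granule). */
static uint64_t entry_span_4k(unsigned level)
{
    assert(level <= 3);
    return 1ULL << (12 + 9 * (3 - level));
}

/*
 * How many entries of a single table at `level` can be reached when the
 * VA is limited to va_bits. A table never has more than 512 entries.
 */
static uint64_t reachable_entries_4k(unsigned va_bits, unsigned level)
{
    uint64_t span = entry_span_4k(level);
    uint64_t n;

    assert(va_bits >= 12 && va_bits <= 48);
    n = ((1ULL << va_bits) + span - 1) / span;  /* round up */
    return n > 512 ? 512 : n;
}
```

entry_span_4k(1) is 1G, and reachable_entries_4k(32, 1) is 4, matching the H5 case where a 32 bit VA can only select the first 4 level 1 entries.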

What does a PTE look like

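We now know the general structure of the tables, but we have not yet talked about the table entries themselves. Unlike 32 bit ARM, all the table entries are 64 bits. The low 2 bits indicate what type of entry it is. There are 3 options (not 4, stick with me): invalid (bit 0 clear, so both 00 and 10), block (01), and table (11). The documentation confuses me with regard to the 11 case. It shows this as both a table descriptor with only upper attributes, and as a table entry (holding a block address) with both upper and lower attributes. As far as I can tell, the meaning of 11 depends on the level: at levels 0 through 2 it is a table descriptor, and at level 3 it is a page descriptor that holds an output address and both sets of attributes.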

A level 0 entry can only be a descriptor giving the address of a level 1 table.
A level 3 entry can only be a page descriptor (laid out like a block descriptor); the tree goes no deeper.

Block entries have upper and lower attributes.
Table descriptors have only upper attributes.

Upper attributes have only 2 bits of interest: UXN and PXN. These are unprivileged and privileged eXecute never. I just ignore these bits and leave them zero.

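Lower attributes are:

AttrIndx (bits 4:2) -- index into the MAIR register
NS (bit 5) -- non-secure
AP (bits 7:6) -- access permissions
SH (bits 9:8) -- shareability
AF (bit 10) -- access flag
nG (bit 11) -- not global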

When bits are set in a table descriptor, they override any settings at lower levels.

The MAIR register

This is a 64 bit register with 8 fields of 8 bits each. A 0-7 index elsewhere (the AttrIndx field of a block or page entry) selects one of these fields, each of which gives a memory type and cacheability specification.

This looks to me like they ran out of space in the 64 bit PTE and used a 3 bit index to point to an 8 bit field.

A look at U-boot code

The value loaded into the MAIR register is given in mmu.h as follows:
/*
 * Memory types
 */
#define MT_DEVICE_NGNRNE    0
#define MT_DEVICE_NGNRE     1
#define MT_DEVICE_GRE       2
#define MT_NORMAL_NC        3
#define MT_NORMAL           4

#define MEMORY_ATTRIBUTES   ((0x00 << (MT_DEVICE_NGNRNE * 8)) | \
                (0x04 << (MT_DEVICE_NGNRE * 8))   | \
                (0x0c << (MT_DEVICE_GRE * 8))     | \
                (0x44 << (MT_NORMAL_NC * 8))      | \
                (UL(0xff) << (MT_NORMAL * 8)))
So, apparently only 5 of the 8 possible "memory types" are used.
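
To see what actually lands in the register, the macro can be evaluated and picked apart one field at a time. The sketch below repeats the U-Boot definitions (with UINT64_C in place of UL so it builds anywhere) and adds a lookup helper, mair_attr, which is my own name and not something from U-Boot.

```c
#include <assert.h>
#include <stdint.h>

#define MT_DEVICE_NGNRNE    0
#define MT_DEVICE_NGNRE     1
#define MT_DEVICE_GRE       2
#define MT_NORMAL_NC        3
#define MT_NORMAL           4

/* Same layout as the U-Boot macro, one 8 bit attribute per memory type. */
#define MEMORY_ATTRIBUTES \
    ((UINT64_C(0x00) << (MT_DEVICE_NGNRNE * 8)) | \
     (UINT64_C(0x04) << (MT_DEVICE_NGNRE * 8))  | \
     (UINT64_C(0x0c) << (MT_DEVICE_GRE * 8))    | \
     (UINT64_C(0x44) << (MT_NORMAL_NC * 8))     | \
     (UINT64_C(0xff) << (MT_NORMAL * 8)))

/* Return the 8 bit attribute that a 3 bit index selects from a MAIR value. */
static unsigned mair_attr(uint64_t mair, unsigned index)
{
    assert(index <= 7);
    return (unsigned)((mair >> (index * 8)) & 0xff);
}
```

The resulting value is 0xff440c0400. Index 4 (MT_NORMAL) selects 0xff, and the three unused indexes 5 through 7 select 0x00, which is the same encoding as Device-nGnRnE.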

In my case, with the H5 chip a table with only 2 block entries (each of 1G) does the job. The first entry maps the 1G address area that starts at 0 and holds IO registers. The second entry maps the 1G address area that starts at 0x40000000 and holds DRAM.
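
A sketch of how two such entries could be put together follows. The macro and function names are hypothetical (U-Boot has its own PTE_ macros, which I am not quoting here), and the bit positions are the lower attribute fields as I understand them: memory type index in bits 4:2, AP in bits 7:6, SH in bits 9:8 and the access flag in bit 10.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_TYPE_BLOCK      0x1ULL                      /* bits 1:0 = 01 */
#define PTE_ATTRINDX(i)     ((uint64_t)(i) << 2)        /* MAIR index, bits 4:2 */
#define PTE_AP(ap)          ((uint64_t)(ap) << 6)       /* access perms, bits 7:6 */
#define PTE_SH(sh)          ((uint64_t)(sh) << 8)       /* shareability, bits 9:8 */
#define PTE_AF              (1ULL << 10)                /* access flag */

/*
 * Build a level 1 block entry mapping 1G at physical address pa.
 * AP is left at 00 and the execute-never bits are left at zero.
 */
static uint64_t make_block_1g(uint64_t pa, unsigned attr_index, unsigned sh)
{
    assert((pa & ((1ULL << 30) - 1)) == 0);     /* must be 1G aligned */
    assert(attr_index <= 7 && sh <= 3);
    return pa | PTE_TYPE_BLOCK | PTE_ATTRINDX(attr_index) |
           PTE_AP(0) | PTE_SH(sh) | PTE_AF;
}
```

Calling make_block_1g(0, 0, 0) gives 0x401 for the IO area, and make_block_1g(0x40000000, 4, 3) gives 0x40000711 for DRAM.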

I dump these two entries as hex and see:

PTE-7fff0000 0000000000000401
PTE-7fff0008 0000000040000711
I use the fancy pretty-print routines I found in U-boot to dump these and I see:
[0x00000000000000 - 0x00000040000000]     |  Block | RWX        | Device-nGnRnE | Non-shareable
[0x00000040000000 - 0x00000080000000]     |  Block | RWX        | Normal        | Inner-shareable
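There are really only 2 differences between the two descriptors (apart from the block address): the memory type index is 0 (Device-nGnRnE) in the first and 4 (Normal) in the second, and SH is 00 (non-shareable) in the first and 11 (inner shareable) in the second. In both cases AP is 00 (so RWX).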
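
The two hex values can also be decoded by hand. Below is a minimal decoder sketch. The struct and function are my own and much cruder than the U-Boot pretty-printer, and the field positions are the lower attribute layout as I understand it.

```c
#include <assert.h>
#include <stdint.h>

struct pte_fields {
    unsigned type;          /* bits 1:0, 1 = block, 3 = table or page */
    unsigned attr_index;    /* bits 4:2, index into MAIR */
    unsigned ap;            /* bits 7:6, access permissions */
    unsigned sh;            /* bits 9:8, 0 = non-shareable, 3 = inner shareable */
    unsigned af;            /* bit 10, access flag */
    uint64_t addr;          /* output address, bits 47:12 */
};

/* Split a 64 bit entry into the fields discussed in the text. */
static struct pte_fields decode_pte(uint64_t pte)
{
    struct pte_fields f;

    f.type       = (unsigned)(pte & 0x3);
    f.attr_index = (unsigned)((pte >> 2) & 0x7);
    f.ap         = (unsigned)((pte >> 6) & 0x3);
    f.sh         = (unsigned)((pte >> 8) & 0x3);
    f.af         = (unsigned)((pte >> 10) & 0x1);
    f.addr       = pte & 0x0000FFFFFFFFF000ULL;
    return f;
}
```

Decoding 0x0000000040000711 gives type 1, memory type index 4, AP 0, SH 3, AF 1 and address 0x40000000, which agrees with the pretty-printed line for DRAM.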

The whole business of Inner/Outer shareability is murky and not explained clearly anywhere. At least not anywhere that I have found yet, and I have looked pretty hard. Setting SH to 00 (non-shareable) for IO registers certainly makes sense.

It is not clear what inner shareable might mean. The "big idea" involves the fact that the Cortex-A53 is a 4 core cluster and all 4 cores share the L2 cache. What happens when one core writes to memory? The D cache for that core would get updated (but perhaps not yet memory). Other cores might have values for that memory location cached that now need to be invalidated. The cluster has a "snoop unit" that handles this. The shareability bits indicate (among other things perhaps) whether the snoop unit should get involved.

Note that the translation tables themselves can be placed in cacheable memory. This will speed up address translation, and there are fields in the TCR that control cacheability.
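
To tie the TCR fields together, here is a sketch that assembles a TCR_EL2 value from the pieces discussed in this write-up. The field positions (T0SZ in bits 5:0, IRGN0 in 9:8, ORGN0 in 11:10, SH0 in 13:12, TG0 in 15:14, PS in 18:16, with bits 23 and 31 reserved as 1) are from my reading of the ARM documentation and should be checked against it. The macro and function names are my own.

```c
#include <assert.h>
#include <stdint.h>

#define TCR_T0SZ(x)     ((uint64_t)(x) << 0)    /* bits 5:0 */
#define TCR_IRGN0(x)    ((uint64_t)(x) << 8)    /* inner cacheability of table walks */
#define TCR_ORGN0(x)    ((uint64_t)(x) << 10)   /* outer cacheability of table walks */
#define TCR_SH0(x)      ((uint64_t)(x) << 12)   /* shareability of table walks */
#define TCR_TG0(x)      ((uint64_t)(x) << 14)   /* granule size */
#define TCR_PS(x)       ((uint64_t)(x) << 16)   /* physical address size */
#define TCR_EL2_RES1    ((1ULL << 31) | (1ULL << 23))

/*
 * Compose a TCR_EL2 value. rgn is used for both IRGN0 and ORGN0
 * (0 = non-cacheable, 1 = write-back write-allocate).
 */
static uint64_t tcr_el2_value(unsigned t0sz, unsigned tg0, unsigned ps,
                              unsigned rgn, unsigned sh)
{
    assert(t0sz <= 63 && tg0 <= 3 && ps <= 7 && rgn <= 3 && sh <= 3);
    return TCR_EL2_RES1 | TCR_T0SZ(t0sz) | TCR_TG0(tg0) | TCR_PS(ps) |
           TCR_IRGN0(rgn) | TCR_ORGN0(rgn) | TCR_SH0(sh);
}
```

With T0SZ = 32, a 4K granule (TG0 = 0), 32 bit physical addresses (PS = 0), write-back cacheable table walks and inner shareable, this comes out as 0x80803520.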



Tom's electronics pages / tom@mmto.org