Learn more about the Ethereum Virtual Machine

Solidity provides many high-level language abstractions, but these features make it difficult to understand what is going on when a program is run. I read the Solidity documentation, but there are still a few basic issues that I don't understand.

What is the difference between string, bytes32, byte[], bytes?

Where to use which type?
What happens when converting string to bytes? Can it be converted to byte[]?
How much do they cost to store?

How does the EVM store mappings?

Why can't a mapping be deleted?
Can there be a mapping of mappings? (Yes, but how does it work?)
Why is there a storage mapping, but no memory mapping?

What does the compiled contract look like to the EVM?

How is the contract created?
What exactly is a constructor?
What is a fallback function?

I feel that learning how the Ethereum Virtual Machine (EVM) works underneath a high-level language like Solidity is a good investment for several reasons:

Solidity is not the last language. Better EVM languages are coming. (Please?)
The EVM is a database engine. To understand how smart contracts work in any EVM language, you have to understand how data is organized, stored, and manipulated.
Know-how to be a contributor. It's still early days for Ethereum's toolchain, and understanding the EVM helps you build awesome tools for yourself and others to use.
Intellectual challenge. The EVM gives you a very good excuse to play at the intersection of cryptography, data structures, and programming language design.

In this series of articles, I will disassemble a simple Solidity contract to show you how it runs in EVM bytecode.

Outline of the articles I hope to research and write:

Basic understanding of EVM bytecode
How different types (maps, arrays) are represented
What happens when a new contract is created
What happens when a method is called
How the ABI bridges different EVM languages

My ultimate goal is to understand a compiled Solidity contract as a whole. Let's start by reading some basic EVM bytecode.

The EVM instruction set would be a helpful reference.

A simple contract

Our first contract has a constructor and a state variable:

// c1.sol
pragma solidity ^0.4.11;
contract C {
    uint256 a;
    function C() {
      a = 1;
    }
}

Use solc to compile this contract:

$ solc --bin --asm c1.sol
======= c1.sol:C =======
EVM assembly:
    /* "c1.sol":26:94  contract C {... */
  mstore(0x40, 0x60)
    /* "c1.sol":59:92  function C() {... */
  jumpi(tag_1, iszero(callvalue))
  0x0
  dup1
  revert
tag_1:
tag_2:
    /* "c1.sol":84:85  1 */
  0x1
    /* "c1.sol":80:81  a */
  0x0
    /* "c1.sol":80:85  a = 1 */
  dup2
  swap1
  sstore
  pop
    /* "c1.sol":59:92  function C() {... */
tag_3:
    /* "c1.sol":26:94  contract C {... */
tag_4:
  dataSize(sub_0)
  dup1
  dataOffset(sub_0)
  0x0
  codecopy
  0x0
  return
stop
sub_0: assembly {
        /* "c1.sol":26:94  contract C {... */
      mstore(0x40, 0x60)
    tag_1:
      0x0
      dup1
      revert
auxdata: 0xa165627a7a72305820af3193f6fd31031a0e0d2de1ad2c27352b1ce081b4f3c92b5650ca4dd542bb770029
}
Binary:
60606040523415600e57600080fd5b5b60016000819055505b5b60368060266000396000f30060606040525b600080fd00a165627a7a72305820af3193f6fd31031a0e0d2de1ad2c27352b1ce081b4f3c92b5650ca4dd542bb770029

The string of numbers 6060604052... is the bytecode that the EVM actually runs.

Little by little

About half of the compiled assembly above is boilerplate that is present in most Solidity programs. We'll look at that later. For now, let's look at the part unique to this contract, the simple storage variable assignment:

a = 1

The bytecode representing this assignment is 6001600081905550. Let's break it down into one instruction per line:

60 01
60 00
81
90
55
50

The EVM is essentially a loop that executes each instruction from top to bottom. Let's annotate the assembly code (indented under the label tag_2) with the corresponding bytecode to better see how they relate:

tag_2:
  // 60 01
  0x1
  // 60 00
  0x0
  // 81
  dup2
  // 90
  swap1
  // 55
  sstore
  // 50
  pop

Note that 0x1 in the assembly code is actually shorthand for push(0x1). This instruction pushes the value 1 onto the stack.
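Splitting bytecode into instructions can be automated. Here is a minimal sketch in Python (my own illustration, not a tool from the article) that names only the opcodes appearing in this article. The function name disassemble and the OPCODES table are hypothetical helpers; the opcode values themselves are the standard EVM ones. The key detail is that push1 through push32 (0x60 to 0x7f) are followed by 1 to 32 bytes of immediate data.

```python
# Minimal EVM disassembler sketch. Only the opcodes that appear in this
# article are named; anything else is shown as a raw hex byte.
OPCODES = {
    0x01: "ADD", 0x02: "MUL", 0x03: "SUB", 0x0A: "EXP",
    0x16: "AND", 0x17: "OR", 0x19: "NOT",
    0x50: "POP", 0x54: "SLOAD", 0x55: "SSTORE",
}

def disassemble(hex_code):
    """Return one mnemonic string per instruction in the bytecode."""
    code = bytes.fromhex(hex_code)
    out = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:            # PUSH1..PUSH32
            n = op - 0x5F                 # number of immediate bytes
            arg = code[pc:pc + n]
            pc += n
            out.append(f"PUSH{n} 0x{arg.hex()}")
        elif 0x80 <= op <= 0x8F:          # DUP1..DUP16
            out.append(f"DUP{op - 0x7F}")
        elif 0x90 <= op <= 0x9F:          # SWAP1..SWAP16
            out.append(f"SWAP{op - 0x8F}")
        else:
            out.append(OPCODES.get(op, f"0x{op:02x}"))
    return out

for line in disassemble("6001600081905550"):
    print(line)
```

Running it on 6001600081905550 prints PUSH1 0x01, PUSH1 0x00, DUP2, SWAP1, SSTORE and POP, one per line.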

It's still hard to see what's going on just by staring at it, but don't worry, it's easy to simulate the EVM line by line.

Simulating the EVM

The EVM is a stack machine. Instructions may take values on the stack as arguments, and may push values onto the stack as results. Consider the add operation.

Suppose there are two values on the stack:

[1 2]

When the EVM sees add, it adds the top 2 items on the stack and pushes the answer back onto the stack. The result is:

[3]
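As a small sketch of that step in Python: the list stands for the stack, with the top at index 0, and op_add is a hypothetical helper name. The reduction modulo 2^256 is not mentioned above; it is there because EVM words are 256 bits wide, so sums wrap around.

```python
WORD_MODULUS = 2 ** 256  # EVM words are 256 bits wide

def op_add(stack):
    """Pop the top two items and push their sum (top of stack is index 0)."""
    a = stack.pop(0)
    b = stack.pop(0)
    stack.insert(0, (a + b) % WORD_MODULUS)
    return stack

print(op_add([1, 2]))  # [3]
```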

From here on, we use [] to denote the stack:

// empty stack
stack: []
// stack with 3 items; the top item is 3 and the bottom item is 1
stack: [3 2 1]

And {} to denote contract storage:

// empty storage
store: {}
// the value 0x1 is stored at position 0x0
store: { 0x0 => 0x1 }

Now let's look at the real bytecode. We will simulate the byte sequence 6001600081905550 as the EVM would, and print out the machine state after each instruction:

// 60 01: push 1 onto the stack
0x1
  stack: [0x1]
// 60 00: push 0 onto the stack
0x0
  stack: [0x0 0x1]
// 81: duplicate the second item on the stack
dup2
  stack: [0x1 0x0 0x1]
// 90: swap the top two items on the stack
swap1
  stack: [0x0 0x1 0x1]
// 55: store the value 0x01 at position 0x0
// this instruction consumes the top two items on the stack
sstore
  stack: [0x1]
  store: { 0x0 => 0x1 }
// 50: pop (discard the top item)
pop
  stack: []
  store: { 0x0 => 0x1 }

In the end, the stack is empty and there is one item in storage.
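The same walk-through can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: it handles just push1, dupN, swapN, sstore and pop, ignores gas entirely, and the function name run is my own. As in the trace above, the top of the stack is the leftmost item.

```python
def run(hex_code, storage=None):
    """Simulate a tiny subset of the EVM; the top of the stack is index 0."""
    code = bytes.fromhex(hex_code)
    stack = []
    storage = {} if storage is None else dict(storage)
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == 0x60:                      # PUSH1: push the next byte
            stack.insert(0, code[pc])
            pc += 1
        elif 0x80 <= op <= 0x8F:            # DUPn: copy the n-th item to the top
            stack.insert(0, stack[op - 0x80])
        elif 0x90 <= op <= 0x9F:            # SWAPn: swap the top with item n+1
            n = op - 0x8F
            stack[0], stack[n] = stack[n], stack[0]
        elif op == 0x55:                    # SSTORE: pop the key, then the value
            key = stack.pop(0)
            value = stack.pop(0)
            storage[key] = value
        elif op == 0x50:                    # POP: discard the top item
            stack.pop(0)
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
        print(f"0x{op:02x}  stack: {stack}  store: {storage}")
    return stack, storage

stack, storage = run("6001600081905550")
```

The final line printed shows an empty stack and the store {0: 1}, matching the hand simulation.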

It's worth noting that Solidity has decided to keep the state variable uint256 a at position 0x0. Other languages can choose to store state variables somewhere else entirely.

In pseudocode, what the EVM does for the byte sequence 6001600081905550 is essentially:

// a = 1
sstore(0x0, 0x1)

If you look closely, you will find that dup2, swap1, and pop are redundant, and the assembly code could be simpler:

0x1
0x0
sstore

If you simulate these 3 instructions, you'll find that they produce the same machine state:

stack: []
store: { 0x0 => 0x1 }

Two storage variables

Let's add an additional storage variable of the same type:

// c2.sol
pragma solidity ^0.4.11;
contract C {
    uint256 a;
    uint256 b;
    function C() {
      a = 1;
      b = 2;
    }
}

Compile it, and focus on tag_2:

$ solc --bin --asm c2.sol
// ... earlier code omitted
tag_2:
    /* "c2.sol":99:100  1 */
  0x1
    /* "c2.sol":95:96  a */
  0x0
    /* "c2.sol":95:100  a = 1 */
  dup2
  swap1
  sstore
  pop
    /* "c2.sol":112:113  2 */
  0x2
    /* "c2.sol":108:109  b */
  0x1
    /* "c2.sol":108:113  b = 2 */
  dup2
  swap1
  sstore
  pop

In pseudocode, the assembly does this:

// a = 1
sstore(0x0, 0x1)
// b = 2
sstore(0x1, 0x2)

We can see that the two storage variables are laid out one after the other: a is at position 0x0 and b is at position 0x1.

Storage packing

Each storage slot can store 32 bytes. It would be wasteful to use all 32 bytes if a variable only needs 16 bytes. Solidity optimizes for storage efficiency: when possible, it packs two smaller data types into one storage slot.
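To make the packing rule concrete, here is a simplified model in Python. It is a sketch under my own assumptions, not Solidity's actual implementation: it only handles fixed-size value types declared in order, and the names layout and SLOT_SIZE are hypothetical. Each variable gets a slot number and a byte offset counted from the low-order end of the slot; a variable that does not fit in the space left starts a new slot.

```python
SLOT_SIZE = 32  # bytes per storage slot

def layout(variables):
    """Assign (slot, byte_offset) to each (name, size_in_bytes), in order."""
    result = {}
    slot, offset = 0, 0
    for name, size in variables:
        if offset + size > SLOT_SIZE:   # does not fit: start a new slot
            slot, offset = slot + 1, 0
        result[name] = (slot, offset)
        offset += size
    return result

print(layout([("a", 32), ("b", 32)]))  # {'a': (0, 0), 'b': (1, 0)}
print(layout([("a", 16), ("b", 16)]))  # {'a': (0, 0), 'b': (0, 16)}
```

Two 32-byte variables land in slots 0 and 1, while two 16-byte variables share slot 0.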

Let's change a and b to be 16-byte variables:

pragma solidity ^0.4.11;
contract C {
    uint128 a;
    uint128 b;
    function C() {
      a = 1;
      b = 2;
    }
}

Compile this contract:

$ solc --bin --asm c3.sol

The resulting assembly code is now a bit more complex:

tag_2:
  // a = 1
  0x1
  0x0
  dup1
  0x100
  exp
  dup2
  sload
  dup2
  0xffffffffffffffffffffffffffffffff
  mul
  not
  and
  swap1
  dup4
  0xffffffffffffffffffffffffffffffff
  and
  mul
  or
  swap1
  sstore
  pop
  // b = 2
  0x2
  0x0
  0x10
  0x100
  exp
  dup2
  sload
  dup2
  0xffffffffffffffffffffffffffffffff
  mul
  not
  and
  swap1
  dup4
  0xffffffffffffffffffffffffffffffff
  and
  mul
  or
  swap1
  sstore
  pop

The assembly code above packs the two variables into one storage location (0x0), like this:

[         b         ][         a         ]
[16 bytes / 128 bits][16 bytes / 128 bits]
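The unoptimized assembly does the same read-modify-write for each variable: load the slot, clear the 16 bytes being written, then or in the new value shifted to its byte offset. A rough Python equivalent follows; store_packed is a hypothetical name, and the extra WORD mask is only there because Python integers are unbounded while EVM words are 256 bits.

```python
MASK_128 = (1 << 128) - 1   # 0xffffffffffffffffffffffffffffffff
WORD = (1 << 256) - 1       # keeps Python integers within 256 bits

def store_packed(old_word, value, byte_offset):
    """Write a 16-byte value into a 32-byte word at the given byte offset."""
    shift = 0x100 ** byte_offset                        # exp(0x100, offset)
    cleared = old_word & (~(MASK_128 * shift) & WORD)   # and(not(mask * shift), old)
    return cleared | ((value & MASK_128) * shift)       # or in the shifted value

word = store_packed(0, 1, 0)         # a = 1 at byte offset 0x0
word = store_packed(word, 2, 0x10)   # b = 2 at byte offset 0x10
assert word == (2 << 128) | 1        # b in the upper half, a in the lower half
print(hex(word))
```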

The reason for packing is that storage operations are by far the most expensive:

sstore costs 20000 gas for the first write to a new position
sstore costs 5000 gas for subsequent writes to an existing position
sload costs 500 gas
Most instructions cost 3-10 gas

By using the same storage location, Solidity pays 5000 gas to store the second variable instead of 20000 gas, saving 15000 gas.
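The arithmetic can be checked with a short sketch. It uses the sstore prices quoted above, assumes every slot starts out empty, and counts only sstore (not sload or the cheap stack instructions); storage_write_cost is a hypothetical helper name.

```python
SSTORE_NEW = 20000       # first write to a new position (figure from this article)
SSTORE_EXISTING = 5000   # later write to an existing position

def storage_write_cost(slots_written):
    """Sum sstore costs for a sequence of writes, given as slot numbers."""
    seen = set()
    total = 0
    for slot in slots_written:
        total += SSTORE_EXISTING if slot in seen else SSTORE_NEW
        seen.add(slot)
    return total

unpacked = storage_write_cost([0, 1])  # a and b in separate slots
packed = storage_write_cost([0, 0])    # a and b share slot 0
print(unpacked, packed, unpacked - packed)  # 40000 25000 15000
```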

More optimization

Instead of storing a and b with two separate sstore instructions, it should be possible to pack the two 128-bit numbers together in memory and then store them with a single sstore, saving an additional 5000 gas.

You can ask Solidity to make this optimization by adding the --optimize flag:

$ solc --bin --asm --optimize c3.sol

The resulting assembly code uses just one sload instruction and one sstore instruction:

tag_2:
    /* "c3.sol":95:96  a */
  0x0
    /* "c3.sol":95:100  a = 1 */
  dup1
  sload
    /* "c3.sol":108:113  b = 2 */
  0x200000000000000000000000000000000
  not(sub(exp(0x2, 0x80), 0x1))
    /* "c3.sol":95:100  a = 1 */
  swap1
  swap2
  and
    /* "c3.sol":99:100  1 */
  0x1
    /* "c3.sol":95:100  a = 1 */
  or
  sub(exp(0x2, 0x80), 0x1)
    /* "c3.sol":108:113  b = 2 */
  and
  or
  swap1
  sstore

The bytecode is:

600080547002000000000000000000000000000000006001608060020a03199091166001176001608060020a0316179055

Parse the bytecode into one instruction per line:

// push 0x0
60 00
// dup1
80
// sload
54
// push17: push the next 17 bytes onto the stack as a 32-byte word
70 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
/* not(sub(exp(0x2, 0x80), 0x1)) */
// push 0x1
60 01
// push 0x80 (128)
60 80
// push 0x02
60 02
// exp
0a
// sub
03
// not
19
// swap1
90
// swap2
91
// and
16
// push 0x1
60 01
// or
17
/* sub(exp(0x2, 0x80), 0x1) */
// push 0x1
60 01
// push 0x80
60 80
// push 0x02
60 02
// exp
0a
// sub
03
// and
16
// or
17
// swap1
90
// sstore
55

There are 4 magic numbers used in the assembly code above:

0x1 (16 bytes), placed in the lower 16 bytes

// represented in the bytecode as 0x01
16:32 0x00000000000000000000000000000000
00:16 0x00000000000000000000000000000001

0x2 (16 bytes), placed in the upper 16 bytes

// represented in the bytecode as 0x200000000000000000000000000000000
16:32 0x00000000000000000000000000000002
00:16 0x00000000000000000000000000000000

not(sub(exp(0x2, 0x80), 0x1))

// bitmask for the upper 16 bytes
16:32 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
00:16 0x00000000000000000000000000000000

sub(exp(0x2, 0x80), 0x1)

// bitmask for the lower 16 bytes
16:32 0x00000000000000000000000000000000
00:16 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The code does some bit shuffling with these values to arrive at the desired result:

16:32 0x00000000000000000000000000000002 
00:16 0x00000000000000000000000000000001

Finally, this 32-byte value is stored at position 0x0.
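Here is a sketch of the same computation in Python, following the order of the optimized instructions. The variable names are mine, and the extra WORD mask again only compensates for Python's unbounded integers.

```python
WORD = (1 << 256) - 1

low_mask = 2 ** 0x80 - 1        # sub(exp(0x2, 0x80), 0x1)
high_mask = ~low_mask & WORD    # not(sub(exp(0x2, 0x80), 0x1))
b_shifted = 2 << 128            # the embedded constant: 0x2 in the upper 16 bytes

def optimized_store(old_word):
    """Follow the optimized assembly step by step for slot 0x0."""
    word = (old_word & high_mask) | 0x1   # and with the upper mask, then or 0x1
    word = word & low_mask                # and with the lower mask
    return word | b_shifted               # or with the shifted 0x2

result = optimized_store(0)
print(f"16:32 0x{result >> 128:032x}")
print(f"00:16 0x{result & low_mask:032x}")
```

With an empty slot as input, the upper half comes out as 2 and the lower half as 1.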

Gas usage

600080547002000000000000000000000000000000006001608060020a03199091166001176001608060020a0316179055

Notice that 0x200000000000000000000000000000000 is embedded in the bytecode. The compiler could instead have chosen to compute the value with the instructions exp(0x2, 0x81), which results in a shorter bytecode sequence.

But it turns out that 0x200000000000000000000000000000000 is cheaper than exp(0x2, 0x81). Let's look at the gas costs involved:

4 gas per zero byte of data or code for a transaction
68 gas per non-zero byte of data or code for a transaction

Let's calculate the gas cost of the two representations:

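The bytecode for 0x200000000000000000000000000000000 contains a lot of zeros, which are cheap. Counting bytes, the push17 opcode (0x70) and 0x02 are non-zero, followed by 16 zero bytes:
(2 * 68) + (16 * 4) = 200
The bytecode 608160020a is shorter, but has no zero bytes:
5 * 68 = 340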

The longer bytecode sequence has lots of zeros, so it is actually cheaper!
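A small sketch to check the comparison, counting actual bytes (including the push17 opcode itself) at the per-byte prices quoted above. It looks only at the cost of carrying the bytes in a transaction and leaves out the gas spent executing exp; data_gas is a hypothetical helper name.

```python
ZERO_BYTE_GAS = 4       # per zero byte of transaction data (from this article)
NONZERO_BYTE_GAS = 68   # per non-zero byte of transaction data

def data_gas(hex_code):
    """Gas charged for carrying these bytes as transaction data."""
    code = bytes.fromhex(hex_code)
    zeros = code.count(0)
    return zeros * ZERO_BYTE_GAS + (len(code) - zeros) * NONZERO_BYTE_GAS

push17 = "70" + "02" + "00" * 16   # push17 0x200000000000000000000000000000000
exp_form = "608160020a"            # push 0x81, push 0x02, exp
print(data_gas(push17), data_gas(exp_form))  # 200 340
```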

Summary

An EVM compiler doesn't actually optimize for bytecode size, speed, or memory efficiency. Instead, it optimizes for gas usage, which indirectly encourages the kinds of computation that the Ethereum blockchain can perform efficiently.

We also saw some peculiar things about the EVM:

The EVM is a 256-bit machine. It is most natural to process data in 32-byte chunks.
Persistent storage is quite expensive.
The Solidity compiler makes optimization choices that reduce gas usage.

Gas costs are set somewhat arbitrarily and may change in the future. When the costs change, compilers will make different optimization choices.

Translator: Xu Li
Original article: Diving Into The Ethereum VM Part One