Introduction to LLVM IR: An Overview of the LLVM Architecture

2024-04-16



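LLVM has largely taken over the role that C used to play as the compilation target for modern language implementations. We can compile our own language's source code to LLVM intermediate representation (LLVM IR), and LLVM then optimizes that IR and compiles it to a binary for the target platform.

LLVM's advantages map directly onto the three problems discussed earlier:

LLVM's back ends support many platforms, so we do not need to worry about the CPU or the operating system (runtime libraries excepted).
LLVM's optimizations are of high quality; we only need to compile our code to LLVM IR and LLVM performs the optimizations.
LLVM IR is itself close to assembly language, and it also offers many ways to customize behavior at the ABI level.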
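To make this division of labor concrete, here is a minimal sketch of a toy front end in Python. It assumes a hypothetical source language that consists only of integer constants combined with +, - and *, borrows Python's ast module as the parser, and emits textual LLVM IR for a main function that returns the value. The function name emit_ir and the %tN value names are made up for this illustration.

```python
import ast

# Map Python AST operator nodes to LLVM IR integer instructions.
OPCODES = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul"}


def emit_ir(expr: str) -> str:
    """Compile an integer expression into textual LLVM IR (toy front end)."""
    lines = []

    def lower(node) -> str:
        # Integer constants are used directly as i32 operands.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return str(node.value)
        # A binary operation becomes one instruction with a fresh result name.
        if isinstance(node, ast.BinOp) and type(node.op) in OPCODES:
            lhs = lower(node.left)
            rhs = lower(node.right)
            name = f"%t{len(lines)}"
            lines.append(f"  {name} = {OPCODES[type(node.op)]} i32 {lhs}, {rhs}")
            return name
        raise ValueError("unsupported expression")

    result = lower(ast.parse(expr, mode="eval").body)
    body = "\n".join(lines + [f"  ret i32 {result}"])
    return f"define i32 @main() {{\n{body}\n}}\n"


print(emit_ir("1 + 2 * 3"))
```

Everything after this point (optimization, instruction selection, the choice of CPU) is left to LLVM, which is the sense in which the front end no longer has to care about the target platform.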

We use the C compiler Clang as an example and start with the simplest possible C program, test.c:

#include<stdio.h>
int main()
{
        printf("This is a llvm test page!");
        return 0;
}

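Compile it with clang:

clang test.c -o test

Compilation pipeline

AST (abstract syntax tree)

First, the compiler front end preprocesses the source code and performs syntactic and semantic analysis, producing an abstract syntax tree (AST):

clang -Xclang -ast-dump -fsyntax-only test.c

The dump is long, but we only need the final part, which is the subtree for main: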

-FunctionDecl 0x191d430 <test.c:2:1, line:6:1> line:2:5 main 'int ()'    # function declaration (with body)
  `-CompoundStmt 0x191d648 <line:3:1, line:6:1>    # compound statement, i.e. the { } block
    |-CallExpr 0x191d5c0 <line:4:2, col:36> 'int'    # function call
    | |-ImplicitCastExpr 0x191d5a8 <col:2> 'int (*)(const char *, ...)' <FunctionToPointerDecay>    # implicit cast: function to function pointer
    | | `-DeclRefExpr 0x191d4d0 <col:2> 'int (const char *, ...)' Function 0x190d338 'printf' 'int (const char *, ...)'    # reference to the declared function printf
    | `-ImplicitCastExpr 0x191d600 <col:9> 'const char *' <NoOp>
    |   `-ImplicitCastExpr 0x191d5e8 <col:9> 'char *' <ArrayToPointerDecay>
    |     `-StringLiteral 0x191d528 <col:9> 'char [26]' lvalue "This is a llvm test page!"    # string literal of type char[26]
    `-ReturnStmt 0x191d638 <line:5:2, col:9>    # return statement
      `-IntegerLiteral 0x191d618 <col:9> 'int' 0    # integer literal

The source has been parsed into a function whose body is a compound statement containing the printf() call and the return statement.
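The shape of that tree can be mirrored in a few lines of Python. This is only an illustrative model: the nested tuples and the helper node_kinds are assumptions of this sketch, not Clang data structures, and only the node kinds from the dump above are kept.

```python
# Each node is (kind, [children]); the kinds are copied from the AST dump.
MAIN_AST = (
    "FunctionDecl", [
        ("CompoundStmt", [
            ("CallExpr", [
                ("ImplicitCastExpr", [("DeclRefExpr", [])]),
                ("ImplicitCastExpr", [
                    ("ImplicitCastExpr", [("StringLiteral", [])]),
                ]),
            ]),
            ("ReturnStmt", [("IntegerLiteral", [])]),
        ]),
    ],
)


def node_kinds(node, depth=0):
    """Yield (depth, kind) pairs in depth-first, pre-order traversal."""
    kind, children = node
    yield depth, kind
    for child in children:
        yield from node_kinds(child, depth + 1)


# Print the tree with two spaces of indentation per nesting level.
for depth, kind in node_kinds(MAIN_AST):
    print("  " * depth + kind)
```

Code generation walks a tree like this in much the same way, emitting IR for each node it visits.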

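IR (intermediate representation)

The second step is to generate LLVM IR from the in-memory AST.

What LLVM IR does the AST turn into?

clang -S -emit-llvm test.c

This command produces a file named test.ll with the following contents: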

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.str = private unnamed_addr constant [26 x i8] c"This is a llvm test page!\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  store i32 0, i32* %1, align 4
  %2 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str, i64 0, i64 0))
  ret i32 0
}

declare dso_local i32 @printf(i8*, ...) #1

attributes #0 = { noinline nounwind optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"uwtable", i32 1}
!2 = !{i32 7, !"frame-pointer", i32 2}
!3 = !{!"Deepin clang version 13.0.1-+rc3-1~exp1"}

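Focus on the definition of main, which shows roughly what the program does: it reserves a stack slot for the return value and stores 0 in it, calls printf with a pointer to the first character of @.str, and returns 0.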

define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  store i32 0, i32* %1, align 4
  %2 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str, i64 0, i64 0))
  ret i32 0
}

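LLVM bitcode consists of two parts: a bitstream container format, and the encoding of LLVM IR into that bitstream. The assembler llvm-as converts textual LLVM IR to bitcode:

llvm-as test.ll -o test.bc

Conversely, the disassembler llvm-dis converts bitcode back to textual LLVM IR:

llvm-dis test.bc -o test.ll

In fact, the lli tool can execute a program in bitcode form directly, using LLVM's just-in-time (JIT) compiler.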
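One way to tell the two on-disk forms apart is the file header: a raw bitcode file starts with the four magic bytes 'B', 'C', 0xC0, 0xDE, while a .ll file is plain text. The helper below is a small sketch. The function name is_raw_bitcode is made up here, and it deliberately ignores the optional wrapper header that some toolchains put in front of the bitcode.

```python
import os
import tempfile

BITCODE_MAGIC = b"BC\xc0\xde"  # first four bytes of a raw LLVM bitcode file


def is_raw_bitcode(path: str) -> bool:
    """Return True if the file begins with the LLVM bitcode magic number."""
    with open(path, "rb") as f:
        return f.read(4) == BITCODE_MAGIC


# Demonstrate on two throwaway files rather than on real compiler output.
with tempfile.TemporaryDirectory() as tmp:
    fake_bc = os.path.join(tmp, "fake.bc")
    fake_ll = os.path.join(tmp, "fake.ll")
    with open(fake_bc, "wb") as f:
        f.write(BITCODE_MAGIC + b"\x00" * 12)
    with open(fake_ll, "w") as f:
        f.write("; ModuleID = 'test.c'\n")
    print(is_raw_bitcode(fake_bc), is_raw_bitcode(fake_ll))
```

Pointing the same function at a real test.bc and test.ll produced by the commands above should distinguish them in the same way.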

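Optimizing the IR

After reading the IR, LLVM optimizes it. The opt tool applies the passes for the requested optimization level to the input LLVM IR:

opt test.ll -S -O3

Alternatively, clang can emit already-optimized IR directly:

clang -S -emit-llvm -O3 test.c

Note that the test.ll generated earlier was built without optimization, so main carries the optnone attribute and opt leaves it unchanged. Judging by its attributes (for example "frame-pointer"="none"), the output below comes from the clang -O3 command. To get a roughly similar result from opt, generate its input with clang -S -emit-llvm -O0 -Xclang -disable-O0-optnone test.c.

The optimized output is: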

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.str = private unnamed_addr constant [26 x i8] c"This is a llvm test page!\00", align 1

; Function Attrs: nofree nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #0 {
  %1 = tail call i32 (i8*, ...) @printf(i8* nonnull dereferenceable(1) getelementptr inbounds ([26 x i8], [26 x i8]* @.str, i64 0, i64 0))
  ret i32 0
}

; Function Attrs: nofree nounwind
declare dso_local noundef i32 @printf(i8* nocapture noundef readonly, ...) local_unnamed_addr #1

attributes #0 = { nofree nounwind uwtable "frame-pointer"="none" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { nofree nounwind "frame-pointer"="none" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"uwtable", i32 1}
!2 = !{!"Deepin clang version 13.0.1-+rc3-1~exp1"}

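The function body is noticeably shorter after optimization: the alloca and store for the unused return-value slot are gone, and the call to printf is now marked as a tail call.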

define dso_local i32 @main() local_unnamed_addr #0 {
  %1 = tail call i32 (i8*, ...) @printf(i8* nonnull dereferenceable(1) getelementptr inbounds ([26 x i8], [26 x i8]* @.str, i64 0, i64 0))
  ret i32 0
}
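The disappearance of the alloca and the store can be imitated with a toy pass. The sketch below is not how LLVM implements its passes. It works on a made-up tuple representation (result, opcode, operands) and applies a single simplified rule: a stack slot that is only ever stored to, and never loaded or otherwise used, can be removed together with its stores.

```python
# Toy instruction: (result_name_or_None, opcode, list_of_operand_names)
UNOPTIMIZED_MAIN = [
    ("%1", "alloca", []),
    (None, "store", ["0", "%1"]),          # store i32 0, i32* %1
    ("%2", "call", ["@printf", "@.str"]),
    (None, "ret", ["0"]),
]


def remove_write_only_slots(instructions):
    """Drop allocas whose only uses are as the address operand of a store."""
    slots = {res for res, op, _ in instructions if op == "alloca"}
    for _, op, operands in instructions:
        for index, name in enumerate(operands):
            is_store_address = op == "store" and index == 1
            if name in slots and not is_store_address:
                slots.discard(name)  # read or escaping use: keep the slot
    return [
        (res, op, operands)
        for res, op, operands in instructions
        if not (op == "alloca" and res in slots)
        and not (op == "store" and operands[1] in slots)
    ]


for inst in remove_write_only_slots(UNOPTIMIZED_MAIN):
    print(inst)
```

Only the call and the ret survive, matching the optimized main. LLVM numbers unnamed values sequentially when it prints IR, which is why the call result appears as %1 rather than %2 in the optimized output.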

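Generating assembly

The last LLVM step is generating assembly from LLVM IR, which is done by llc:

llc test.ll

llc, the LLVM static compiler, also accepts bitcode and can be told which architecture to target:

llc -march=x86-64 test.bc -o test.s

clang can generate the assembly as well, with essentially the same result:

clang -S test.bc -o test.s -fomit-frame-pointer

The generated test.s looks like this: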

.text
        .file   "test.c"
        .globl  main                            # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        pushq   %rax
        .cfi_def_cfa_offset 16
        movl    $.L.str, %edi
        xorl    %eax, %eax
        callq   printf
        xorl    %eax, %eax
        popq    %rcx
        .cfi_def_cfa_offset 8
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function
        .type   .L.str,@object                  # @.str
        .section        .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "This is a llvm test page!"
        .size   .L.str, 26

        .ident  "Deepin clang version 13.0.1-+rc3-1~exp1"
        .section        ".note.GNU-stack","",@progbits

With the assembly in hand, the system assembler and linker produce the final executable.

LLVM

So the overall LLVM compilation flow is:

C source code
AST
LLVM IR (optimized by opt)
optimized LLVM IR (compiled by llc)
assembly code
assembler + linker
executable
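The stages can also be written down as the command lines that drive them. The sketch below only builds the argument lists and prints them. Nothing is executed, so it runs without LLVM installed. The intermediate file name test.opt.ll and the helper name pipeline_commands are choices made for this example, and -Xclang -disable-O0-optnone is added so that the unoptimized IR is not marked optnone and opt can actually transform it.

```python
import shlex


def pipeline_commands(stem: str):
    """Return the command for each stage, from C source to executable."""
    return [
        # Front end: C source -> textual LLVM IR (the AST is built internally).
        ["clang", "-S", "-emit-llvm", "-O0", "-Xclang", "-disable-O0-optnone",
         f"{stem}.c", "-o", f"{stem}.ll"],
        # Optimizer: LLVM IR -> optimized LLVM IR.
        ["opt", "-S", "-O3", f"{stem}.ll", "-o", f"{stem}.opt.ll"],
        # Code generator: LLVM IR -> target assembly.
        ["llc", f"{stem}.opt.ll", "-o", f"{stem}.s"],
        # Assembler + linker, driven through clang.
        ["clang", f"{stem}.s", "-o", stem],
    ]


for argv in pipeline_commands("test"):
    print(shlex.join(argv))
```

To actually run the stages, each list can be passed to subprocess.run(argv, check=True) on a machine where clang, opt and llc are on the PATH.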

LLVM IR

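The term LLVM IR refers to three equivalent forms of the same thing:

in-memory LLVM IR
LLVM IR in bitcode form
LLVM IR in human-readable (textual) form

In-memory LLVM IR is the form compiler authors deal with most often, and it is also the most fundamental one. While processing the abstract syntax tree in memory, the compiler generates the corresponding LLVM IR for each node, and that is exactly the job of a compiler front end. A front end can be written in many languages, and LLVM provides bindings for a number of them, but LLVM itself is written in C++, so C++ is used as the example here.

LLVM's C++ interface provides many header files under the llvm/IR directory, such as llvm/IR/Instructions.h, and we can use the many classes they declare, such as Value, Function and ReturnInst, to do our work. In other words, we do not need to turn the AST into strings such as ret i32 0. Instead, we turn the AST into instances of the IR classes that LLVM provides and hand them to LLVM in memory for further processing.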
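The difference between building objects and pasting strings can be illustrated without LLVM itself. The Python classes below are a toy stand-in invented for this sketch: ReturnInst and Function here are plain dataclasses, not the llvm:: classes of the same name. The front end constructs objects, later stages inspect them directly, and text is produced only when someone asks for it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReturnInst:
    """Toy model of a `ret <type> <value>` instruction."""
    type: str
    value: int

    def __str__(self) -> str:
        return f"ret {self.type} {self.value}"


@dataclass
class Function:
    """Toy model of a function holding a flat list of instructions."""
    name: str
    return_type: str
    body: List[ReturnInst] = field(default_factory=list)

    def __str__(self) -> str:
        lines = [f"define {self.return_type} @{self.name}() {{"]
        lines += [f"  {inst}" for inst in self.body]
        lines.append("}")
        return "\n".join(lines)


# "Front end": build objects instead of concatenating strings.
main = Function("main", "i32")
main.body.append(ReturnInst("i32", 0))

# A later stage can query the objects without parsing any text.
assert isinstance(main.body[-1], ReturnInst) and main.body[-1].value == 0

# Serializing to the readable form is a separate, optional step.
print(main)
```

The real API follows the same idea on a much larger scale, with the textual and bitcode forms acting as serializations of the in-memory objects.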

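The bitcode and textual forms are two ways of persisting the in-memory LLVM IR. Bitcode is a binary sequence in a defined format, while textual LLVM IR is human-readable code in a defined syntax. To generate textual LLVM IR (test.ll) from C:

clang -S -emit-llvm test.c

To generate bitcode (test.bc) from C:

clang -c -emit-llvm test.c

To convert the textual test.ll into bitcode test.bc:

llvm-as test.ll

To convert the bitcode test.bc back into textual test.ll:

llvm-dis test.bc

To link two bitcode files into a single bitcode file and run the resulting program with lli:

llvm-link test1.bc test2.bc -o test.bc
lli test.bc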


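Copyright notice: Unless otherwise stated, all articles on this blog are copyrighted by the author. Please credit the source when reposting.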