Ifds_graph_reachability.tex

\documentclass{article}

\usepackage{ctex}
\usepackage{tikz}
\usetikzlibrary{cd}

\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{stmaryrd}

\usepackage[linesnumbered,ruled,vlined]{algorithm2e}

%\usepackage{unicode-math}

\usepackage{hyperref} %url
\hypersetup{
    colorlinks=true,
    linkcolor=blue,
    filecolor=magenta,      
    urlcolor=cyan,
    pdftitle={Overleaf Example},
    pdfpagemode=FullScreen,
    }


\usepackage[textwidth=18cm]{geometry} % 设置页宽=18

\usepackage{blindtext}
\usepackage{bm}
\parindent=0pt
\setlength{\parindent}{2em} 
\usepackage{indentfirst}


\usepackage{xcolor}
\usepackage{titlesec}
\titleformat{\section}[block]{\color{blue}\Large\bfseries\filcenter}{}{1em}{}
\titleformat{\subsection}[hang]{\color{red}\Large\bfseries}{}{0em}{}
%\setcounter{secnumdepth}{1} %section 序号

\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{example}[theorem]{Example}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{exercise}{Exercise}[section]
\newtheorem{annotation}[theorem]{Annotation}

\newcommand*{\xfunc}[4]{{#2}\colon{#3}{#1}{#4}}
\newcommand*{\func}[3]{\xfunc{\to}{#1}{#2}{#3}}

\newcommand\Set[2]{\{\,#1\mid#2\,\}} %集合
\newcommand\SET[2]{\Set{#1}{\text{#2}}} %

\newcommand{\redt}[1]{\textcolor{red}{#1}}
\newcommand{\bluet}[1]{\textcolor{blue}{#1}}

\begin{document}
\title{IFDS: Dataflow Analysis via Graph Reachability}
\author{枫聆}
\maketitle
\tableofcontents

\newpage
\section{Definitions}


\begin{annotation}
\rm 在数据流分析中的“精确”一词，实际上等价于“meet over all vaild path”. 
\begin{itemize}
	\item 在过程内分析(intraprocedural)中，一条“vaild path”就是指从某个procedure的CFG上从entry node到特定的点这样一条路径，同时这样的路径也可以称为same level vaild path.  
	\item 在过程间分析(interprocedural)中，一条“vaild path”就是指当从main function开始，且某个procedure结束之后返回调用它的procedure，直到某个特定程序点的这样一条路径.  
\end{itemize}
上述东西没有什么新意，就是在图上分析时候，需要确定一类程序潜在可能的执行路径，而我们只分析它们. 这样vaild path如果更形式化一点，我们可以可以用CFL来描述. 
\end{annotation}

\begin{definition}
\rm 数据流分析中的可能会出现所有不同的数据值组成的集合$D$(underlying set)称为dataflow facts. 对于可能分析得到的结果是dataflow facts的一个子集，通常我们把所有可能得到的结果记为$2^D$. 
\end{definition}

\begin{definition}
\rm 数据流的值可以表示成位向量(bit-vectors)，其中每个bit可以表示一个具体的dataflow fact, 且可以每个传递函数可以用相应的位运算来表示，这样的一类数据流分析问题我们称之为\redt{locally separable problems}. i.e. reaching-definitions, available expressions, live variables. 
\end{definition}

\begin{annotation}
\rm \redt{怎么理解"separable"?} separable对应的是逻辑位运算过程不同位bit是不会相互影响的，可以将dataflow facts划分成独立的space，每个space里面的dataflow facts在propogate的过程是封闭的. 例如在reaching-definitions中两个不同变量定义的作用域是不会相互影响的.   
\end{annotation}


\begin{definition}
\rm 若dataflow facts $D$是一个有限(finite)集合，且每一个transfer function $\func{f}{2^D}{2^D}$都是可分配的(distributive，即满足join or meet semilattice homomorphism)，这样一类过程间(interprocedural)数据流分析问题称为\redt{interprocedural, finite, distributive, subset problems}，或简称为\redt{IFDS problems}.
\end{definition}

\begin{definition}
\rm 设程序$P$的每个procedure $p$对应图表示为$G_p=(N_p, E_p)$，其中$N_p$表示$p$上所有(atomic) statements，$E_p$表示$p$的控制流. $G_p$中结点种类分为
\begin{itemize}
	\item 唯一的start node $s_p$;
	\item 唯一的exit node $e_p$;
	\item 过程调用结点 call node，$G_p$中的所有call nodes构成的集合记为$\text{Call}_p$;
	\item 过程调用的返回位置结点 return-site node，即紧跟在call node后面，$G_p$中的所有return-site nodes构成的集合记为$\text{Ret}_p$
	\item 其余的结点与通常flowgraph上结点保持一致. 
\end{itemize}
特别地，$G_p$上每一对call node $c$和return-site node $r$直接有一条$c$到$r$的有向边，称为\redt{call-to-return-side edge}. 设$G^*=(N^*,E^*)$，$G^*$由所有procedures图表示$G_1,G_2,\cdots,
G_p,\cdots$和两类特殊的边构成
\begin{itemize}
	\item \redt{call-to-start edge}: 从$G_{p_1}$中call node到对应$G_{p_2}$的start node $s_p$的有向边.
	\item \redt{exit-to-return-side edge}: 从$G_{p_2}$的exit node $e_{p_2}$到对应$G_{p_1}$的return-side node. 
\end{itemize}
称$G^*$为\redt{supergraph}.
\end{definition}


\begin{annotation}
\rm 理解supergraph只需要了解几个特殊的点
\begin{itemize}
	\item return-site这类结点可能是抽象出来的，在一定程度上有利于对call node的细化分析.
	\item 放置call-to-return-side edge的目的是用于过程内分析，i.e. local variables.
	\item 一个procedure执行返回会抽象到exit node上统一返回，即exit-to-return-side的作用.
\end{itemize} 
\end{annotation}

\begin{definition}
\rm 设$\func{f}{2^D}{2^D}$某个IFDS problem中一个distributive transfer function，它的一个\redt{关系表示}(representation relation) $R_f \subseteq (D \cup \{0\}) \times (D \cup \{0\})$为
$$
\begin{aligned}
R_f = &~~~ \{0,0\} \\
&\cup \Set{(0,y)}{y \in f(\emptyset)} \\ 
&\cup  \Set{(x,y)}{y \in f(\{x\})~\text{and}~y \notin f(\emptyset)}.
\end{aligned}
$$
其中$0$表示$\emptyset$. 
\end{definition}

\begin{definition}
\rm 定义一个IFDS problem的实例为$\text{IP}=(G^*,D,F,M,\sqcap)$，其中$G^*=(N^*,E^*)$为对应程序$P$的supergraph表示，$D$表示dataflow facts，$F$为distributive transfer function set，$M$为$E^* \to F$的映射，$\sqcap$对应的meet operator. 
\end{definition}

\begin{annotation}
\rm 注意上述representation是当$f$可分配满足
$$
f(X_1 \vee X_2) = f(X_1) \vee f(X_2)
$$
其中$\vee$为set上的union操作. 那么当$\vee$为set上intersection如何定义representation呢? 这个问题非常有趣，此时$f$的可分配的operator记为$\wedge$，文章中作者提到这样的问题可以转化为一个与当前问题等价的一个新问题，且这个新问题是union可分配的，i.e. 如果“must-be-X”是一个intersection可分配的问题，那么与它等价“may-not-be-X”问题就是一个union可分配的问题. 但是为什么呢？ 文中没有解释，我们来做几步推导. 假设“must-be-X”的$f$是一个intersection可分配，那么“may-not-be-X”的$f^c$应为
$$
f^c(X) = D - f(X). 
$$
那么
$$
\begin{aligned}
f^c(X \wedge Y) &= D-f(X \wedge Y)\\ 
&= [D-f(X)] \vee [D-f(Y)] \\
&= f^c(X) \vee f^c(X)
\end{aligned}
$$
真奇怪，显然没有得到想要的结果. 那么就是$f^c(X)$的定义出问题了，我们仔细思考一下原问题和新问题本质没有区别，只是在每个结点的对应值用补的形式表示，两个问题同一状态相同结点的值也是互补的，于是$f^c$的正确定义应当如下:
$$
f^c(X)= D - f(D-X)
$$
那么
$$
\begin{aligned}
f^c(X \vee Y) &= D-f[D-(X \vee Y)] \\ 
&= D-f[(D-X) \wedge (D-Y)] \\
&= D-f(D-X) \wedge f(D-Y) \\
&= [D-f(D-X)] \vee [D-f(D-Y)] \\
&= f^c(X) \vee f^c(Y) 
\end{aligned}
$$
这样确实可以得到一个和原问题等价的union可分配的问题. 
\end{annotation}


\begin{definition}
\rm 给定关系$R \subseteq (D \cup \{0\}) \times (D \cup \{0\})$，则其唯一确定函数$\func{\llbracket R \rrbracket}{2^D}{2^D}$为
$$
\llbracket R \rrbracket(X) = \Set{y}{\forall x \in X, (x,y) \in R} \cup \Set{y}{(0,y) \in R } - \{0\}
$$
\end{definition}

\begin{theorem}\label{representation correctness}
\rm 给定distributive transfer function $f$ (set union)，设$f$的关系表示为$R_f$，则
$$
\llbracket R_f \rrbracket = f
$$
\end{theorem}

\begin{proof}
\rm 上述representation correctness取决于$f$满足对任意的$X = \{x_1,x_2,\cdots,x_n\} \subseteq D$有
$$
f(X) = f(\{x_1\} \vee \{x_2\} \vee \cdots \vee \{x_n\}) = f(\{x_1\}) \vee f(\{x_2\}) \vee \cdots f(\{x_n\})
$$ 
\end{proof}


\begin{proposition}
\rm 将$R$可以看做一个$2(D+1)$结点的图$G$，任意一个pair $(x,y) \in R$作为$G$上一条从$x$到$y$的有向边，则$G$最多$D^2 + D +1$条边.
\end{proposition}

\begin{proof}
注意是不存在$(x,0) \in R$，所以不是$(D+1)^2$(但是原文认为是$(D+1)^2$).
\end{proof}


\begin{definition}
\rm 给定两个关系$R_1,R_2 \subseteq S \times S$，定义它们的\redt{复合关系}(composition)$R_2 \circ R_1$为
$$
R_2 \circ R_1 = \Set{(x,y) \in S \times S}{\text{if}~ z \in S, (x,z) \in R_1~\text{and}~(z,y) \in R_2}.
$$
\end{definition}

\begin{theorem}
\rm 给定两个distributive transfer function $f,g$ (set union)，设$f,g$的关系表示分别为$R_f,R_g$，则
$$
\llbracket R_g \circ R_f \rrbracket = g \circ f. 
$$
\end{theorem}

\begin{proof}
\rm 对任意的$X \subseteq D$，设$y \in \llbracket R_g \circ R_f \rrbracket(X)$，我们来证明$y \in g \circ f(X)$. 若$(0,y) \in R_g \circ R_f$，那么存在$(0,z) \in R_f$和$(z,y) \in R_g$，显然此时$z = 0$，则$y \in g(\emptyset)$，于是
$$
g\circ f(X) = g[\emptyset \vee f(X)] = g(\emptyset) \vee g\circ f(X). 
$$ 
因此$y \in g\circ f(X)$; 若$(x,y) \in R_g \circ R_f$，其中$x \in X$，那么存在$(x,z) \in R_f$和$(z,y) \in R_g$，显然$z \neq 0$且$z \in f(x)$和$y \in g(z)$，于是
$$
\begin{aligned}
g \circ f(X) &= g \circ f[\{x\} \vee (X - \{x\})]\\ &
= g[f(x) \vee f(X - {x})]\\ 
&= g \{z  \vee [f(x) - \{z\}] \vee  (X - \{x\})\} \\
&= g(z) \vee g\{[f(x) - \{z\}] \vee  (X - \{x\})\}
\end{aligned}
$$
因此$y \in g \circ f(X)$. 同理可证，设$y \in g \circ f(X)$，则有$y \in \llbracket R_g \circ R_f \rrbracket(X)$. 
\end{proof}


\begin{corollary}
\rm 给定一组distributive transfer function $f_1,f_2,\cdots,f_n$，设$f_1,f_2,\cdots,f_n$对应的关系表示分别为$R_1,R_2,\cdots,R_n$则
$$
f_n \circ f_{n-1} \circ \cdots \circ f_2 \circ f_1 = \llbracket R_n \circ R_{n-1} \circ \cdots \circ R_2 \circ R_1 \rrbracket.
$$
\end{corollary}

\begin{definition}
\rm 给定一个IDFS问题实例$\text{IP}=(G^*,D,F,M,\sqcap)$. 定义\redt{exploded supergraph} $G^\otimes = (N^\otimes,E^\otimes)$如下
$$
\begin{aligned}
&N^\otimes = N^* \times (D \cup \{0\}) \\
&E^\otimes = \Set{\left<m,d_1\right> \to \left<n,d_2\right>}{(m,n) \in E^*~\text{and}~ (d_1,d_2) \in R_{(m,n)}}
\end{aligned} 
$$
\end{definition}

\begin{annotation}
\rm exploded supergraph就是supergraph的基础上，将每个结点扩展成了$D+1$个结点，同时把每条边用其对应的transfer function relations扩展成了$|R_f|$条边. 
\end{annotation}

\begin{annotation}
\rm 将vaild path推广至到exploded supergraph上，称其为realizable path.
\end{annotation}

\begin{theorem}
\rm 给定一个IDFS问题实例$\text{IP}=(G^*,D,F,M,\cup)$的exploded supergraph表示$G^\otimes = (N^\otimes,E^\otimes)$，则对任意的结点$n \in N^*$，有$d \in \text{MVP}_n$(meet-over-all-path-solution)当且仅当存在一条从$\left<s_m, 0\right>$到$\left<n,d\right>$的realizable path. 
\end{theorem}

\begin{annotation}
\rm 上述定理就是我们的最终目标，将类IFDS问题可以转换称graph reachability问题. 注意这个定理的correctness取决于当前IDFS是一个set union IFDS问题，其他细节是比较显然的. 
\end{annotation}

\newpage
\section{Algorithm and Complexity}


\begin{definition}
\rm 给定一个IDFS问题实例$\text{IP}=(G^*,D,F,M,\cup)$，定义任意起始结点是$\left<s_p,0\right>$的same level realized path称为\redt{path edge}. 定义任意$n \in \text{CallNodes}$，以$\left<n,d_1\right>$为起点和以$\left<\text{ReturnSite}(n),d_2\right>$为终点且其前缀路径包含$\left<n,d_1\right> \to \left<s_{\text{Called}(n)},d_3\right>$的same level realizable path称为\redt{summary path}. 
\end{definition}

%\IncMargin{1em}
\begin{algorithm}
\label{mian_algo}
\caption{Tabulation Algorithm}
\SetKwFunction{ForwardTabulateSLRPs}{ForwardTabulateSLRPs}
\SetKwFunction{Propagate}{Propagate}
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\DontPrintSemicolon
\Input{A exploded supergraph $G^\otimes$ of the instance of IFDS problem } 
\Output{The dataflow values $X_i$ of every node $i$}
\Begin{
		Let $(N^\otimes, E^\otimes)= G^\otimes$\;
	 	PathEdge = $\{\left<s_{main},0\right> \to \left<s_{main},0\right>\}$\;
	 	WorkList = $\{\left<s_{main},0\right> \to \left<s_{main},0\right>\}$\;
	 	SummaryEdge = $\emptyset$  \;
		\ForwardTabulateSLRPs{} \;
	 	\For{\rm each $n \in N^*$}{
	 		$X_n = \Set{d_2 \in D}{\text{if}~d_1 \in D \cup \{0\}~\text{and}~\left<s_{\text{profOf}(n)},d_1\right> \to \left<n,d_2\right> \in \text{PathEdge}}$ \;
	 	}
	 }	
\end{algorithm}
%\DecMargin{1em}

\begin{procedure}
\label{pr}
\caption{Propagate($e$)}
\Begin{
	\If{\rm $e \notin \text{PathEdge}$}{
		Insert $e$ into PathEdge\;
		Insert $e$ into WorkList\;
	}
}
\end{procedure}
  
\begin{procedure}
\label{ft}
\caption{ForwardTabulateSLRPs()}
\Begin{
	\While{\rm WorkList $\neq \emptyset$}{
		Select and remove an edge $\left<s_p,d_1\right> \to \left<n,d_2\right>$ \;
		\If{$n \in \text{Call}_p$}{
			\For{\rm each $d_3$ such that $\left<n,d_2\right> \to \left<s_{\text{calledProc}(n)},d_3\right>$}{
				\Propagate{\rm $\left<s_{\text{calledProc}(n)},d_3\right> \to \left<s_{\text{calledProc}(n)},d_3\right>$} \;
			}
			\For{\rm each $d_3$ such that $\left<n,d_2\right> \to \left<{\text{returnSide}(n)},d_3\right> \in E^\otimes \cup \text{SummaryEdge} $}{
				\Propagate{\rm $\left<s_p,d_1\right> \to \left<{\text{returnSide}(n)},d_3\right>$} \;
			}					
		}
		\ElseIf{$n = e_p$}{
			\For{\rm each $c \in \text{callers}(p)$}{
				\For{\rm each $d_4,d_5$ such that $\left<c ,d_4\right> \to \left<s_p,d_1\right>$ and $\left<n,d_2\right> \to \left<\text{returnSide},d_5\right>$}{
					\If{\rm $\left<c ,d_4\right>  \to \left<\text{returnSide},d_5\right> \notin \text{SummaryEdge}$}{
						Insert $\left<c ,d_4\right>  \to \left<\text{returnSide}(c),d_5\right>$ into SummaryEdge \;
						\For{\rm each $d_3$ such that $\left<s_{\text{profOf}(c)},d_3\right> \to \left<c ,d_4\right>$}{
							\Propagate{\rm $\left<s_{\text{profOf}(c)},d_3\right> \to \left<\text{returnSide}(c),d_5\right>$ }
						} 
					}
					
				}
			}		
		}
		\Else{
			\For{\rm each $\left<n,d_2\right> \to \left<m,d_3\right>) \in E^\otimes$}{
				\Propagate{\rm $\left<s_p,d_1\right> \to \left<m,d_3\right>)$}\;			
			}
		}
	}
}
\end{procedure}

\newpage

\begin{annotation}
\rm 整个算法的脉络还是比较清晰的，理解定义path edge和 summary edge就是为该算法的dp性质设计的. 一些可能比较有趣的点
\begin{itemize}
	\item FowardTabulateSLRPs第10行这里可能有一些冗余，原文这里calllers函数的设计是找到所有调用该procedure的调用者，如果我们只需要分析特定程序潜在可能的运行的路径，那么可能存在一些callers根本不会被运行，这些calllers理论上也没必要分析，所以这里的callers函数可以设计成一个动态函数，用于记录已经被分析过caller. 
\end{itemize}
\end{annotation}

\begin{theorem}
\rm Tabulation algorithm的时间复杂度为$O(ED^3)$.
\end{theorem}

\begin{proof}
\rm Tabulation algorithm \ref{mian_algo} 的运行时间包括ForwardTablutateSLRPs函数的运行时间和第7-8行的运行时间. 其中第7-8行的运行时间很容易计算为$O(ND^2)$，其中还$N < E$因此$O(ED^2)$. 

ForwardTablutateSLRPs函数 \ref{ft} 的运行时间，你要仔细分析是比较麻烦的，巧妙一点就是可以看做从$D+1$结点开始在$G^\otimes$上的$D+1$次广搜，一次广搜的时间复杂度是和图的规模成线性关系的即$O(ND^2+ED^2)$，还是由于$N < E$，即有$O(ED^2)$，因此总的时间复杂度为$O(ED^3)$. 
\end{proof}

\begin{thebibliography}{9}
\bibitem{lfds}
Precise InterproceduralDataflow Analysis via Graph Reachability.
\end{thebibliography}

\end{document}