Adding Memory Regions to IR #231
Conversation
Memory sections in the initial memory/read-only memory that are not actually accessed in the program don't need to be given their own region; it's fine if they just use the default.
Which other branch are you talking about?
I am talking about the yousif-vsa branch where I made changes to MRA. This issue has been resolved in my latest commit here, though, so no need to worry about that anymore.
@l-kent In that case can we look into merging the changes please?
Yes, I'm going to review this next.
The last commit should fix the issue with over-approximating to "abort".
for (elem <- memorySegment) {
  elem match {
    case mem: MemorySection =>
      val regions = mmm.findDataObject(mem.address)
This doesn't work at present. MemorySection.address is just the address of the start of the memory section - for example, .plt, .got and .data are all MemorySections, and their contents are just listed as bytes, which is very crude. We want to extract the bytes relevant to any regions that the memory region analysis has determined are accessed, which requires much more work than just attempting to match the addresses like this.
Currently, all this method does is remove some data sections entirely, even if they contain addresses that we care about.
What is the relationship between those small 8-byte addresses and the memory section? Are they loaded from the relf file?
The MemorySections are loaded from the .adt or .gts file and each represents a single section of the binary - such as the .data section. The address of a MemorySection is the address of the start of the section, and each byte in the array of bytes is located at a subsequent address. So MemorySection.bytes[0] is the 8-bit value stored at address MemorySection.address, MemorySection.bytes[1] is the 8-bit value stored at address MemorySection.address + 1, and so on.
I will handle this part though, since this relates to fixing the Boogie output.
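To make that layout concrete, here is a minimal sketch of how a region's initial bytes could be pulled out of a MemorySection under that addressing scheme. The field names (address, bytes) follow the description above, and the helper regionBytes is purely illustrative, not the project's actual API.

```scala
// Sketch only: MemorySection.bytes(0) lives at MemorySection.address,
// bytes(1) at address + 1, and so on, so a region's initial contents are a
// slice of the section's byte array when the region lies inside the section.
case class MemorySectionSketch(address: BigInt, bytes: IndexedSeq[Byte])

def regionBytes(section: MemorySectionSketch, regionStart: BigInt, regionSize: Int): Option[IndexedSeq[Byte]] = {
  val offset = regionStart - section.address
  if (offset >= 0 && offset + regionSize <= section.bytes.length)
    Some(section.bytes.slice(offset.toInt, offset.toInt + regionSize))
  else
    None // region not (fully) contained in this section
}
```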
Currently, the initial memory pre/post-conditions are not generated with the identified regions. An example of this is the test correct/basic_function_call_reader/clang_pic. The following snippets are from the output currently produced for this test:
In the second snippet, the address 69568 (65536 + 4032) is loaded from the region
The rely() procedures created from the user's specification do not include the new regions. An example of this is the test correct/basicassign_gamma0/clang. The following snippets are all from the output produced for it:
In the third snippet, the address 69684 (69632 + 52) is loaded from the region
The test correct/initialisation/gcc demonstrates three issues - not using identified regions in the generated requires/ensures clauses from the user specifications, incorrectly removing relevant sections of initialised memory, and not using the identified regions in initialised memory requires/ensures clauses. I've already discussed the last of these, so I'll detail the first two for this test. On main, the following snippet is produced for this test:
These requires clauses are all removed from the output when using this branch's memory region analysis. However, the addresses 69648, 69656, 69664, 69668 and 69652 are all accessed, so we want to keep the initialisations for 69648, 69656 and 69664 around. We want the regions in the requires clause to match as well, so that would require splitting up the initialisations further - the region y would need the address 69652 initialised with a 32-bit access, and the region x would need the address 69648 initialised with a 32-bit access, instead of the 64-bit initialisation covering both as-is.
The output also includes the following ensures clauses in the procedure main():
The issue is that mem is not replaced with the identified regions for these, but we want it to be.
Currently, the memory regions for pointers and pointers-to-pointers are combined. This doesn't cause any issues at present, but it is unnecessary. An example of this is the output for the test case correct/basic_arrays_read/clang_pic:
The important context here is that the global variable … Both accesses are given the region …
This PR causes some of the IndirectCallsTests to fail - for some of those tests, the indirect calls get incorrectly resolved to the global variable … Indirect call resolution also fails for the correct/jumptable/clang (including variants) tests and for the correct/functionpointer/gcc_O2 and /clang_O2 tests. The correct/jumptable/clang test is a particularly interesting example, and stack region identification is currently incorrect for it too. These are all the stack accesses produced in main() (in order):
Each stack region is associated with the following accesses:
stack_15: R31 + 12 (32-bit access)
…
It is incorrect that R31 + 16 (128-bit access) and R31 + 24 (64-bit access) are in different regions, despite being overlapping accesses - the analysis needs to consider the size of accesses as well as their address. While the way the other regions are combined does not cause problems, it is imprecise and it would be ideal to fix it so that distinct accesses have distinct regions.
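As a rough illustration (not project code) of the size-aware check the analysis would need: two stack accesses overlap exactly when their byte ranges intersect, which comparing start offsets alone cannot detect.

```scala
// Sketch only: an access is (byte offset from the stack pointer, access size in bits).
def overlaps(offsetA: BigInt, bitsA: Int, offsetB: BigInt, bitsB: Int): Boolean = {
  val endA = offsetA + bitsA / 8 // exclusive end of the first access
  val endB = offsetB + bitsB / 8 // exclusive end of the second access
  offsetA < endB && offsetB < endA
}

// The overlapping pair from the jumptable example above:
// overlaps(16, 128, 24, 64) == true, so R31 + 16 and R31 + 24 must share a region.
```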
Another issue is the regions in assertions, specifically the assertions that check if writes to memory are secure. These are updates to the global variables z and x, respectively, from the test correct/secret_write/clang, with annotations:
There are a number of issues we need to address here. Most straightforwardly, there is still a reference to … The more difficult issue is that we need to redesign this secure update check to take into account that not all global variables share the same region now, which means many checks can be removed. We want to change the output so that it now looks something like this:
The differences are:
It may also make sense to split the L function so that each region has its own L function? I'm not sure if this matters at all either way though.
I think that should be pretty much everything. I realise this is a lot. I think the most important issues to focus on at first are the ones in this comment: #231 (comment) since those are issues with the analysis itself, rather than just more work that needs to be done with the analysis results.
private def returnRegion(region: StackRegion): StackRegion = {
  uf.find(region.asInstanceOf[MemoryRegion]).asInstanceOf[StackRegion]
}

private def returnRegion(region: DataRegion): DataRegion = {
  uf.find(region.asInstanceOf[MemoryRegion]).asInstanceOf[DataRegion]
}

private def returnRegion(region: HeapRegion): HeapRegion = {
  uf.find(region.asInstanceOf[MemoryRegion]).asInstanceOf[HeapRegion]
}
If all the region subtypes are treated completely separately as this implies, shouldn't each region subtype have its own UnionFind to enforce this?
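A minimal sketch of that suggestion, assuming a small generic union-find (written out here for illustration; the project presumably has its own): with one instance per subtype, the asInstanceOf casts disappear and merging regions of different subtypes becomes a compile-time error.

```scala
import scala.collection.mutable

// Sketch only: a tiny generic union-find with path compression.
class UnionFindSketch[A] {
  private val parent = mutable.Map[A, A]()

  def find(x: A): A = {
    val p = parent.getOrElse(x, x)
    if (p == x) x
    else {
      val root = find(p)
      parent(x) = root // path compression
      root
    }
  }

  def union(a: A, b: A): Unit = parent(find(a)) = find(b)
}

// One instance per region subtype; returnRegion then needs no casts:
//   val stackUF = new UnionFindSketch[StackRegion]
//   def returnRegion(region: StackRegion): StackRegion = stackUF.find(region)
```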
Another issue: the stack identification doesn't seem to work correctly when heap pointers are involved. This is from test/correct/malloc_with_local/gcc/malloc_with_local.bir, with annotations:
The corresponding Boogie produced is the following (with annotations and cleaned up a little):
The big issue is that accesses to the heap are incorrectly treated as accesses to the stack regions in which the heap pointers are stored. Every heap allocation should have its own region and should not be incorrectly identified as the stack.
I pushed up the working things
# Conflicts:
#   src/main/scala/analysis/InterprocSteensgaardAnalysis.scala
#   src/main/scala/analysis/MemoryModelMap.scala
#   src/main/scala/ir/Expr.scala
#   src/main/scala/util/RunUtils.scala
For correct/basic_lock_read/gcc:BAP, there is an inconsistency between the GRA result and the MMM. The GRA correctly gives a region with start address 69652 (0x11014) and size 4 to accesses to that address, but this region does not appear in the MMM's dataMap. I haven't yet identified why. This was an issue before your most recent changes but I had previously missed it.
For correct/initialisation/clang_pic:BAP, there are some region size issues in relation to array accesses as previously discussed, but there is another error in the analysis as well. These are the relevant parts of the .bir file:
The access at 0000036b is incorrectly given a region with a size of 8 instead of 4 (it only has a 32-bit access) - this is the same region size issue with array accesses. The access at 00000379 is incorrectly given a region with size 8 and start address 0x10FCC (69580). It seems to have gotten this by doing 0x10000 + 0xFC8 + 4, but the fact that 0x10FC8 (69576) is loaded from memory and the result has 4 added to it has been missed. The correct address is 0x11044 (69700) as mem[0x10FC8] == 0x11040 (69696). This was probably an issue before your most recent changes, just overshadowed by other issues you've now fixed.
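Spelling out the two derivations as plain arithmetic (values taken from the paragraph above, not project code):

```scala
// Incorrect derivation: the loaded pointer is treated as a plain constant offset.
val incorrect = BigInt(0x10000) + 0xFC8 + 4 // 0x10FCC = 69580

// Correct derivation: 0x10FC8 is an address whose *contents* are the pointer.
val loadedPointer = BigInt(0x11040)         // mem[0x10FC8] == 0x11040 (69696)
val correct = loadedPointer + 4             // 0x11044 = 69700
```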
For correct/secret_write/clang:BAP, there is a new error introduced by the most recent changes. The GRA correctly gives accesses to address 0x11034 (69684, which is the address of z) a region, but this region is not included in the MMM's dataMap.
For correct/functionpointer/clang:GTIRB, there is an issue with the stack regions. This is an excerpt of the output:
The accesses to regions stack_20 and stack_8 both have the same address (R29 - 8 == R31 + 24) but have different regions. This is incorrect. This was an issue before your most recent changes but I had missed it. edit: This seems to only happen some of the time, so it may be some weird iteration-order issue with iterating over a Set or something?
src/main/scala/util/RunUtils.scala
@@ -537,15 +558,18 @@ object RunUtils {
     var iteration = 1
     var modified: Boolean = true
     val analysisResult = mutable.ArrayBuffer[StaticAnalysisContext]()
-    while (modified) {
+    while (modified || analysisResult.size < 2) {
This isn't a good way to make use of the VSA results, as we don't need to run every analysis twice. The way the analyses feed into each other could be significantly improved - having a circular dependency between them really doesn't seem ideal.
private var DataMemory, HeapMemory, StackMemory = TreeMap[BigInt, Array[Byte]]()

// Store operation: store BigInt value at a BigInt address
def store(address: BigInt, value: BigInt, memoryType: MemoryType): Unit = {
  val byteArray = value.toByteArray
  memoryType match
    case MemoryType.Data => DataMemory += (address -> byteArray)
    case MemoryType.Heap => HeapMemory += (address -> byteArray)
    case MemoryType.Stack => StackMemory += (address -> byteArray)
}

// Load operation: load from a BigInt address with a specific size
def load(address: BigInt, size: Int, memoryType: MemoryType): BigInt = {
  val memory = memoryType match
    case MemoryType.Data => DataMemory
    case MemoryType.Heap => HeapMemory
    case MemoryType.Stack => StackMemory
  // Find the memory block that contains the starting address
  val floorEntry = memory.rangeTo(address).lastOption

  floorEntry match {
    case Some((startAddress, byteArray)) =>
      val offset = (address - startAddress).toInt // Offset within the byte array
      // If the load exceeds the stored data, we need to handle padding with zeros
      if (offset >= byteArray.length) {
        BigInt(0)
      } else {
        // Calculate how much data we can retrieve
        val availableSize = byteArray.length - offset
        // Slice the available data, and if requested size exceeds, append zeros
        val result = byteArray.slice(offset, offset + size)
        val paddedResult = if (size > availableSize) {
          result ++ Array.fill(size - availableSize)(0.toByte) // Padding with zeros
        } else {
          result
        }
        BigInt(1, paddedResult) // Convert the byte array back to BigInt
      }
    case None =>
      // If no memory is stored at the requested address, return zero
      BigInt(0) // TODO: may need to be something else
  }
}
Is any of this actually used for anything?
override def toString: String = s"Data($regionIdentifier, $start, $size, ($relfContent))"
def end: BigInt = start + size - 1
val relfContent: mutable.Set[String] = mutable.Set[String]()
val isPointerTo: Option[DataRegion] = None
What is this for?
 * 3. starts between regions (start, end) and (value + size) >= end
 * 4. starts between regions (start, end) and (value + size) < end (add both regions ie. next region)
 */
def findStackPartialAccessesOnly(value: BigInt, size: BigInt): Set[StackRegion] = {
Is this used for anything?
  matchingRegions.toSet.map(returnRegion)
}

def getRegionsWithSize(size: BigInt, function: String, negateCondition: Boolean = false): Set[MemoryRegion] = {
Is this used for anything?
// Assume heap region size is at least 1 TODO: must approximate size of heap
val negB = 1
val (name, start) = nextMallocCount(negB)
val newHeapRegion = HeapRegion(name, start, negB, IRWalk.procedure(n))
addReturnHeap(directCall, newHeapRegion)
s
You don't need to approximate the size of a heap allocation if you don't know it. The size of each heap allocation is not important except for detecting out-of-bounds accesses (which is not the most important). Each heap allocation needs to be its own separate address space, with the R0 returned from malloc being like the stack pointer for that heap.
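A minimal sketch of that idea (all names here are hypothetical): key each heap region by its allocation call site and attach no size to it, so offsets within the allocation are resolved relative to the R0 that malloc returned, much like stack offsets are resolved relative to the stack pointer.

```scala
import scala.collection.mutable

// Sketch only: Call stands in for whatever IR node represents a malloc call
// (e.g. the directCall used above).
class HeapRegionAllocator[Call] {
  final case class HeapRegionSketch(name: String, allocationSite: Call)

  private var mallocCount = 0
  private val heapRegions = mutable.Map[Call, HeapRegionSketch]()

  // Every allocation site gets exactly one region with no size attached;
  // the R0 returned by that call acts as the base pointer of the region.
  def regionForAllocation(call: Call): HeapRegionSketch =
    heapRegions.getOrElseUpdate(call, {
      mallocCount += 1
      HeapRegionSketch(s"malloc_$mallocCount", call)
    })
}
```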
src/main/scala/util/RunUtils.scala
    })

    Logger.debug("[!] Running MMM")
    val mmm = MemoryModelMap()
-   mmm.convertMemoryRegions(mraResult, mergedSubroutines, globalOffsets, mraSolver.procedureToSharedRegions)
+   mmm.convertMemoryRegions(mraSolver.procedureToStackRegions, mraSolver.procedureToHeapRegions, mraSolver.mergeRegions, mraResult, mraSolver.procedureToSharedRegions, graSolver.getDataMap, graResult)
mraSolver.mergeRegions is always empty, isn't it?
def canCoerceIntoDataRegion(bitVecLiteral: BitVecLiteral, size: Int): Option[DataRegion] = {
  mmm.findDataObject(bitVecLiteral.value)
}
Why does this have an unused size parameter?
  tableAddress
}

def tryCoerceIntoData(exp: Expr, n: Command, subAccess: BigInt, loadOp: Boolean = false): Set[DataRegion] = {
I don't really understand how this method works, can you explain it? I don't really follow why BinaryExprs are such a special case for it?
@@ -60,11 +60,6 @@ class InterprocSteensgaardAnalysis(

  def getMemoryRegionContents: Map[MemoryRegion, Set[BitVecLiteral | MemoryRegion]] = memoryRegionContents.map((k, v) => k -> v.toSet).toMap
This is always just empty now isn't it?
private def dataPoolMaster(offset: BigInt, size: BigInt): Option[DataRegion] = {
  assert(size >= 0)
  if (dataMap.contains(offset)) {
    if (dataMap(offset).size < (size.toDouble / 8).ceil.toInt) {
      dataMap(offset) = DataRegion(dataMap(offset).regionIdentifier, offset, (size.toDouble / 8).ceil.toInt)
      Some(dataMap(offset))
    } else {
      Some(dataMap(offset))
    }
  } else {
    dataMap(offset) = DataRegion(nextDataCount(), offset, (size.toDouble / 8).ceil.toInt)
    Some(dataMap(offset))
  }
}
Why does this need to return an Option when it never returns None?
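A sketch of the simplification this implies (reusing the names from the snippet above; intended to preserve behaviour, not a tested patch): return DataRegion directly and compute the byte size once.

```scala
// Sketch only: since a region is always produced, no Option is needed.
private def dataPoolMaster(offset: BigInt, size: BigInt): DataRegion = {
  assert(size >= 0)
  val sizeBytes = (size.toDouble / 8).ceil.toInt
  val region = dataMap.get(offset) match {
    case Some(existing) if existing.size >= sizeBytes => existing // already wide enough
    case Some(existing) => DataRegion(existing.regionIdentifier, offset, sizeBytes) // widen to the larger access
    case None => DataRegion(nextDataCount(), offset, sizeBytes) // fresh region
  }
  dataMap(offset) = region
  region
}
```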
Memory Regions in Boogie Output
I'm going to merge this into main and file issues for remaining bugs now.
The --memory-regions flag can be used to enable the regions output.
Before merging there are two main issues:
1. Some initial regions of a function are ignored (they are not collected by MRA) and thus are not renamed. This may have been fixed in the MRA file of the other branch, but that has other dependencies that need merging.
2. Some Data section regions that are in initial memory and read-only memory are not found. (Those are not loaded by the ReadELFLoader, which causes this issue as MMM relies on results from the loader.) Notable mentions for the "arrays" example:
[WARN] No region found for memory section 0 [[email protected]:297]
[WARN] No region found for memory section 0 [[email protected]:297]
[WARN] No region found for memory section 0 [[email protected]:297]
[WARN] No region found for memory section 0 [[email protected]:297]
[WARN] No region found for memory section 568 [[email protected]:297]
[WARN] No region found for memory section 596 [[email protected]:297]
[WARN] No region found for memory section 632 [[email protected]:297]
[WARN] No region found for memory section 664 [[email protected]:297]
[WARN] No region found for memory section 696 [[email protected]:297]
[WARN] No region found for memory section 960 [[email protected]:297]
[WARN] No region found for memory section 1148 [[email protected]:297]
[WARN] No region found for memory section 1176 [[email protected]:297]
[WARN] No region found for memory section 1240 [[email protected]:297]
[WARN] No region found for memory section 1528 [[email protected]:297]
Anything bigger than these offsets is resolved. These offsets don't seem to be loaded with externalFunctions or globalOffsets.