Tuesday, 24 December 2013

On the way to deterministic binary (gcc) output

In some projects (actually, it should be the case in general) it is necessary to prove that two builds from the same input (source + configuration) generate the same output. This property is useful because it allows one to compare the binary output of a compilation/linking step. If there is no difference at all, one can be sure that the source code has not changed either and that the behaviour of the software is unchanged (as long as one trusts the compiler). It also allows one to prove that changes in the build infrastructure/system don't change the output of a build, that an archiving concept works, etc.

There are several aspects that should be considered on the way to deterministic binary output:
  1. absolute paths that are compiled into the binary (mainly for debugging)
  2. random decisions made by the compiler, e.g. which optimization to use, which path to choose, or how to mangle a specific function in an anonymous namespace. Of course, this has no influence on the functional properties of the code; the binaries are (or at least should be ;-) ) always functionally equivalent.
  3. timestamps and UUIDs in object files, libraries, etc.
  4. timestamps and dates generated by the __DATE__, __TIME__ and __TIMESTAMP__ macros

An example where the first point comes into play is the __FILE__ macro, which is often used for debugging purposes. How this macro gets expanded differs from compiler to compiler. For example, Microsoft's C++ compiler has an /FC flag that controls whether the macro expands to an absolute or a relative path. Of course, the question of absolute versus relative paths only matters if you have multiple build machines with different workspace locations, or if you care about workspace information being delivered to your customer. You can easily check whether the binary contains such path information by searching it for the workspace path.

strings binary.out | grep workspace

If the output contains full paths to your source files, you have to take care of this problem. I have found two solutions:
  • Using compiler switches to make sure that paths are relative
  • Making sure that the build environments/workspaces are located at the same absolute path regardless of the actual build machine (Jenkins master or slave)
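
As a rough sketch of the first option: gcc offers -fdebug-prefix-map, and GCC 8 or newer also offers -fmacro-prefix-map, which rewrite a path prefix in debug information and in __FILE__ expansions respectively. The workspace location, the variable names and the leaks_path helper below are made up for illustration and are not part of the original setup.

```shell
#!/bin/sh
# Hypothetical workspace location; adjust it to your build machine.
WORKSPACE=${WORKSPACE:-/home/jenkins/workspace/project}

# -fdebug-prefix-map rewrites paths in debug info; -fmacro-prefix-map
# (GCC 8 or newer) rewrites the paths that __FILE__ expands to.
PREFIX_FLAGS="-fdebug-prefix-map=$WORKSPACE=. -fmacro-prefix-map=$WORKSPACE=."

# leaks_path FILE PATH: returns 0 (true) if FILE still contains PATH.
# grep -a treats the binary as text, -F matches the path literally.
leaks_path() {
    grep -a -F -q -- "$2" "$1"
}

# Possible use in a build script (assumes gcc is installed):
#   gcc $PREFIX_FLAGS -g -c src/main.c -o main.o
#   leaks_path main.o "$WORKSPACE" && echo "workspace path still embedded"
```

The helper does the same job as the strings/grep check above, but returns an exit status, so a build job could fail automatically when a path leaks.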
The next point is quite interesting. The average programmer would expect that, given a specific piece of code and a set of rules for the compiler and linker, the outcome is always the same. This is (normally) true for the functional behaviour of the code. However, it is not necessarily true when comparing two binaries at byte level. You can easily compare two binaries by using the cmp tool:


cmp -b -l b1 b2
 
This lists every differing byte together with its position and its value in b1 and in b2.
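
For an automated check, the cmp output can be reduced to a single number. The following is a minimal sketch; count_diff_bytes is a made-up helper name, and the two small files only stand in for real binaries.

```shell
#!/bin/sh
# count_diff_bytes FILE1 FILE2: prints the number of differing byte positions.
# cmp -l prints one line per differing byte. cmp exits non-zero when the
# files differ, so its exit status is deliberately ignored here.
count_diff_bytes() {
    { cmp -l "$1" "$2" 2>/dev/null || true; } | wc -l | tr -d ' '
}

# Demo with two small stand-in files (not real binaries).
cmp_demo_dir=$(mktemp -d)
printf 'hello' > "$cmp_demo_dir/b1"
printf 'hallo' > "$cmp_demo_dir/b2"
count_diff_bytes "$cmp_demo_dir/b1" "$cmp_demo_dir/b2"
```

A result of 0 means the two files are byte-identical, provided they have the same length.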
The binary differences have several causes. One example is how gcc mangles functions in anonymous namespaces: part of the mangled name is produced by a random number generator. If you have taken care of the first point and your object files still differ, you can use a special gcc parameter. The -frandom-seed=<string> option allows one to specify a string that gcc uses in place of random numbers. The documentation for this option tells us...

       -frandom-seed=string
           This option provides a seed that GCC uses when it would otherwise
           use random numbers.  It is used to generate certain symbol names
           that have to be different in every compiled file.  It is also used
           to place unique stamps in coverage data files and the object files
           that produce them.  You can use the -frandom-seed option to produce
           reproducibly identical object files.

           The string should be different for every file you compile.

That means that we have to provide a distinct but reproducible string for each file we compile. I found one solution to this problem in a blog post by Jörg Förstner. He suggested using the md5 hash of the source file as input to -frandom-seed. This is sufficient, as the hash differs between source files and, conversely, yields the same seed as long as the source hasn't changed. He suggested using the following compile command...

           $(CC) -frandom-seed=$(shell md5sum $< | sed 's/\(.*\) .*/\1/') $(CCFLAGS) -c $< -o $@

The seed is constructed by calculating the md5sum of the source code file (e.g. test.cpp).

          md5sum $<
          b61f78373a5b404a027c533b9ca6280f  test.cpp

This result is piped into sed (sed 's/\(.*\) .*/\1/') to cut away the filename part behind the actual md5 sum.
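
Outside of a Makefile, the same seed can be computed in a plain shell script. This is a minimal sketch that uses cut instead of sed to drop the file name; random_seed_for and the temporary test.cpp are made up for illustration.

```shell
#!/bin/sh
# random_seed_for FILE: prints the MD5 hex digest of FILE without the file name.
random_seed_for() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Stand-in source file so the sketch runs on its own.
seed_demo_dir=$(mktemp -d)
printf 'int main() { return 0; }\n' > "$seed_demo_dir/test.cpp"

seed=$(random_seed_for "$seed_demo_dir/test.cpp")
echo "-frandom-seed=$seed"

# A compile step could then look like this (assumes g++ is installed):
#   g++ -frandom-seed="$seed" -c test.cpp -o test.o
```

The seed stays the same until the file content changes, which is exactly the property needed for -frandom-seed.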

The problem described by the third point (timestamps, UUIDs) arises in the archiving/linking step. For example, when building static libraries/archives from object files with the ar tool, ar also inserts timestamps, user and group IDs and other data that change from build to build. You can easily try this out by executing ar twice and comparing the generated output. However, for the ar tool there is a simple solution to this problem: ar comes with the D modifier, which switches ar into deterministic mode. The documentation for D tells us...

       D   Operate in deterministic mode.  When adding files and the archive
           index use zero for UIDs, GIDs, timestamps, and use consistent file
           modes for all files.  When this option is used, if ar is used with
           identical options and identical input files, multiple runs will
           create identical output files regardless of the input files'
           owners, groups, file modes, or modification times. 

My command line for building a static library looks like:
      ar Drvs <output> <input> 
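
To convince yourself that the deterministic mode works, the following sketch archives the same file twice with different modification times and compares the results. The file names are made up, a plain text file stands in for a real object file, and the check is skipped if ar is not installed.

```shell
#!/bin/sh
# Archive the same member twice with different modification times and
# check that deterministic mode (D) produces byte-identical archives.
if command -v ar >/dev/null 2>&1; then
    ar_demo_dir=$(mktemp -d)
    printf 'stand-in for an object file\n' > "$ar_demo_dir/member.o"

    # First archive: member dated 2014-01-09 00:00.
    touch -t 201401090000 "$ar_demo_dir/member.o"
    ar Drc "$ar_demo_dir/first.a" "$ar_demo_dir/member.o"

    # Second archive: same content, member dated one year later.
    touch -t 201501090000 "$ar_demo_dir/member.o"
    ar Drc "$ar_demo_dir/second.a" "$ar_demo_dir/member.o"

    if cmp -s "$ar_demo_dir/first.a" "$ar_demo_dir/second.a"; then
        result=identical
    else
        result=different
    fi
else
    result=skipped   # ar (binutils) is not installed
fi
echo "$result"
```

Dropping the D modifier from both calls should make the archives differ, because ar then records the modification time of the member.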

In some cases, for example when using a cross-compiler toolchain, you cannot easily change the binutils version to get an ar that supports the deterministic option. This was the motivation for someone to write a tool that wipes the timestamps from the generated archive files. You can find this tool on GitHub at the following URL: https://github.com/nh2/ar-timestamp-wiper/tree/master. If you use CMake as part of your build system, you can hook the tool into the finish step of the archive generation.
          SET(CMAKE_C_ARCHIVE_FINISH "ar-timestamp-wiper <TARGET>")
          SET(CMAKE_CXX_ARCHIVE_FINISH ${CMAKE_C_ARCHIVE_FINISH})

The last point (timestamps and dates introduced by macros like __DATE__, __TIME__ and __TIMESTAMP__) can be addressed by specifying a deterministic/known value for the corresponding build. I know of at least two ways to do this. Both work in general, but sometimes one approach is easier to use than the other.
  1. faketime/libfaketime
  2. overriding the macros by compiler defines  
The first approach works by running the build step/executable under faketime. faketime uses the LD_PRELOAD mechanism to intercept time-related library calls and report a specific time.
          apt-get install faketime
          faketime '2014-01-09 00:00:00' /usr/bin/date
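The second approach works by adding, for example, -D__DATE__='"Jan  9 2014"' -D__TIME__='"12:00:00"' to your build step. Both macros have to expand to C string literals, so the double quotes must reach the compiler. You also have to make sure that the values follow the formats the compiler itself produces: "Mmm dd yyyy" for __DATE__ (with the day padded by a space if it is less than 10) and "hh:mm:ss" for __TIME__. GCC warns when built-in macros are redefined; -Wno-builtin-macro-redefined silences this warning.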
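
Getting the quoting and the padding right by hand is error-prone, so the flags can be assembled in a small script. This is a minimal sketch; the variable names are made up, and the commented gcc call assumes a source file named version.c.

```shell
#!/bin/sh
# __DATE__ expands to "Mmm dd yyyy" with the day padded by a space when it
# is below 10; __TIME__ expands to "hh:mm:ss".
build_date=$(printf '%s %2d %d' Jan 9 2014)
build_time='12:00:00'

# The inner double quotes must reach the compiler, because both macros
# have to expand to C string literals.
date_flag="-D__DATE__=\"$build_date\""
time_flag="-D__TIME__=\"$build_time\""

printf '%s\n' "$date_flag" "$time_flag"

# Possible use (assumes gcc is installed and version.c exists):
#   gcc "$date_flag" "$time_flag" -Wno-builtin-macro-redefined -c version.c
```

Passing each flag as its own quoted word keeps the two spaces in the date intact, which a single unquoted flags variable would not.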

 References:

  • http://cmake.3232098.n2.nabble.com/How-to-calculate-a-value-quot-on-the-fly-quot-for-use-with-gcc-compiler-option-td3277077.html
  • http://stackoverflow.com/questions/14653874/deterministic-binary-output-with-g
  • https://wiki.debian.org/ReproducibleBuilds

Comments:

  1. Ha, this is funny. I am working on improving deterministic compilation for GHC, so I decided to read your blog post to learn a bit from it, and then I see you using my ar-timestamp-wiper :D

  2. 5th case is use of -flto gcc option