Code Aware Services

Code Aware Services (CAS) is a set of tools for extracting information from a (especially large) source code trees. It consists of Build Awareness Service (BAS) and Function/Type database (FTDB). BAS is a tool for extracting information how particular S/W image is created from ongoing builds. FTDB transforms predefined source code information (like information about functions and types) into easily accessible format (like JSON) which can be used by a number of applications.

Introduction

Imagine you need to fix a S/W issue in a large S/W product (like Mobile Phone). This kind of project usually consists of a large number of S/W modules (like kernel, modem, system libraries and applications) communicating with each other. You need to make an general overview of the source code structure to find interesting source files and identify dependencies between modules. You also need detailed information how the S/W was built (i.e. exact compilation switches for specific source files) to be able to locate S/W errors that depend on specific configuration.

BAS works by tracking the full S/W image build (using specialized Linux kernel module) and storing information about all opened files and executed processes. This is later post-processed to extract detailed compilation information from compilation processes. All that information is stored in database (JSON file) which can be accessed freely after the build.

Examples of potential uses of BAS are:

  • extracting the list of all files from entire source tree which were used to create the final S/W image (which can be much smaller number (like ~10% in case of Linux kernel build) than the entire source repository) which can radically improve code search tools;

  • generating compilation database (i.e. JSON file with all compilation switches used to build the final S/W image) which can be later used by other tools operating on source code, like static analyzers, source code transformation tools, custom source processors etc.;

  • generating custom makefiles that can incrementally rebuild specifically selected parts of the source tree which can speed-up the change/build/test iteration during S/W development

  • generating code description files to various IDE’s (like Eclipse IDE) which can significantly improve development process for user space applications

  • analyzing build related information to locate S/W errors

and more.

Now imagine that you want to write a tool that operates on source code, i.e. custom static analyzer which looks for specific erroneous patterns in the code in the large scale and informs the developer about potential issues to focus on. Writing such tool requires a parsed source code representation to be available but writing parsers (especially for complex languages like C++) is extremely difficult. FTDB is a tool that uses existing libraries (i.e. libclang library) to parse the source code and store selected information from the parsed representation in an easily accessible format (i.e. JSON database). Such database can be later used by applications (i.e. Python programs) which can implement a set of distinct functionalities that takes source code information as an input.

Together BAS with FTDB forms a tightly related toolset which was collectively named CAS (Code Aware Services).

Details

CAS is split into three separate functionalities (stored in three separate source directories):

  • build tracer (tracer directory)

This is the Linux kernel module for tracing specific syscalls executed during build. The module file produced is execve_trace.ko. More details about tracer operation can be found in readme file.

  • BAS tools (bas directory)

This is a set of user space tools used to process and operate on the trace file produced by the build tracer. More details can be found in readme file.

This is the clang processor used to parse source files (currently only C sources are supported) to extract relevant information from the parsed representation and store it in simple JSON database file. More details can be found in readme file.

How to build

First setup the build environment and download sources:

sudo apt install git cmake llvm clang libclang-dev python3-dev gcc-9-plugin-dev build-essential linux-headers-$(uname -r) python2.7-numpy python-futures flex bison libssl-dev
git clone https://github.com/Samsung/CAS.git && cd CAS
export CAS_DIR=$(pwd)

You can build all components using any decent compiler however for FTDB processor to work you’ll need llvm and clang in version 10. In standard Ubuntu 20.04 distribution the above command will install this version by default. In case of other OS arrangements please make sure that the llvm-config --version returns 10.0.0.

First build the tracer:

(cd ${CAS_DIR}/tracer && make && sudo make modules_install)

Some kernel configurations may require modules to be signed. To do so, follow the instructions in kernel module signing docs.

After the tracer module is installed setup the tracing wrapper:

(cd ${CAS_DIR} && sudo ./etrace_install.sh)

This will modify your sudoers file to allow for using the tracer kernel module by any user in the system.

Now build rest of the tools in the CAS suite:

mkdir -p ${CAS_DIR}/build_release && cd ${CAS_DIR}/build_release
# This will configure to build with standard system compiler 'cc'
cmake -DCMAKE_BUILD_TYPE=Release ..
# Use this if you want to use clang explicitly
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
make -j16 && make install

How to use

etrace – this is the wrapper script that runs the command (most probably build) under the tracer.

export PATH=${CAS_DIR}:$PATH
# Run the COMMAND and save tracing log file (.nfsdb) in the current directory
$ etrace <COMMAND>
# Run the COMMAND and save tracing log file (.nfsdb) in the WORK_DIR directory
$ etrace -w <WORK_DIR> <COMMAND>
# Run the COMMAND and save tracing log file (.nfsdb) in the current directory (also keep kernel log file along the way)
$ etrace -l <COMMAND>
# Run the COMMAND and save tracing log file (.nfsdb) in the WORK_DIR directory (also keep kernel log file along the way)
$ etrace -w <WORK_DIR> -l <COMMAND>

All the above commands also produce .nfsdb.stats file. These are the statistics from the underlying ftrace infrastructure. Most common usage is to check for possible dropped events (meaning the underlying kernel buffer should be larger):

$ cat .nfsdb.stats | grep dropped

The -l option saves the kernel log file (dmesg) during execution to the .nfsdb.klog file. This can be used for debugging of the tracer kernel module.

Beware, whole kernel log file is saved (even events before the execution of the tracer). To get rid of those use sudo dmesg -C before running the tracer.

etrace_parser – this will parse the tracing log file (.nfsdb) and produce the build information in the form of simple JSON file (for the description of JSON format please refer to the readme file).

To parse the tracing log file simply execute the following:

${CAS_DIR}/etrace_parser <path to .nfsdb>

The result is written into .nfsdb.json file.

Final step is to post-process the .nfsdb.json file and include additional information regarding compilations and other.

python3 ${CAS_DIR}/bas/postprocess.py <path to .nfsdb.json>

There are times when the post-processed JSON file is really large (in case of full Android builds for latest system version it can easily take few GB of size). Reading such JSON can be very time consuming and take a few minutes just to parse it. There is option to create a specialized cached version of the JSON file strictly dedicated for reading only.

python3 ${CAS_DIR}/bas/postprocess.py --create-nfsdb-cache <path to .nfsdb.json>

This will produce the .nfsdb.json.img file with in-memory representation of the parsed JSON data which can be read back into memory almost instantly.

Examples

Let’s see some quick introduction through several simple examples. Let’s build our CAS suite again but this time with the tracer enabled.

cd ${CAS_DIR}/build_release
make clean
etrace make -j16 && make install
${CAS_DIR}/etrace_parser .nfsdb
export PYTHONPATH=${CAS_DIR}
python3 ${CAS_DIR}/bas/postprocess.py .nfsdb.json
python3 ${CAS_DIR}/bas/postprocess.py --create-nfsdb-cache .nfsdb.json

Now strike up some editor and write-up the following Python code:

#!/usr/bin/env python3

import sys
import libetrace

nfsdb = libetrace.nfsdb()
nfsdb.load(sys.argv[1],quiet=True)

# List all linked modules
L = [x for x in nfsdb if x.is_linking()]
print ("# Linked modules")
for x in L:
	print ("  %s"%(x.linked_file))
print()

# List compiled files
C = [x for x in nfsdb if x.has_compilations()]
print ("# Compiled files")
for x in C:
	for u in x.compilation_info.files:
		print ("  %s"%(u))
print()

# List all compilation commands
print ("# Compilation commands")
for x in C:
	print ("$ %s\n"%(" ".join(x.argv)))

Or try existing example:

${CAS_DIR}/examples/etrace_show_info .nfsdb.json.img

This should give you a list of all linked modules, compiled files and compilation commands.

We can generate FTDB database for all C compiled files.

export CLANG_PROC=${CAS_DIR}/clang-proc/clang-proc
${CAS_DIR}/examples/extract-cas-info-for-ftdb .nfsdb.json.img
${CAS_DIR}/clang-proc/create_json_db -P $CLANG_PROC
${CAS_DIR}/tests/ftdb_cache_test --only-ftdb-create db.json

First command extracts the compilation database (i.e. list of all compilation commands used to compile the program which is later used by the clang processor). Second command creates the FTDB database (db.json file in our case). Finally we create the cached version of the FTDB database (you can perfectly use the original database db.json file if that suits you however as soon will be shown this JSON file can also get really large very quickly for complex modules).

Some simple information regarding the FTDB database:

${CAS_DIR}/examples/ftdb_show_info db.json.img 

Ok, let’s see some more serious example now. First clone Linux kernel source for AOSP version 12 (make sure you have a few GB of space available).

cd ${CAS_DIR}
examples/clone-avdkernel.sh

Sources should be downloaded to avdkernel directory. To build it you’ll need clang at least in version 11.

sudo apt-get install clang-11 llvm-11 lld-11
cd avdkernel
etrace ${CAS_DIR}/examples/build-avdkernel.sh
${CAS_DIR}/etrace_parser .nfsdb
python3 ${CAS_DIR}/bas/postprocess.py --integrated-clang-compilers="/usr/lib/llvm-11/bin/clang" .nfsdb.json
python3 ${CAS_DIR}/bas/postprocess.py --create-nfsdb-cache .nfsdb.json

We will now extract information required to create the FTDB database for the Linux kernel (and all accompanying modules). This time we will get some more information as we will build specialized version of FTDB suited for the automated off-target generation (please see AoT project for more details).

${CAS_DIR}/examples/extract-avdkernel-info-for-ftdb .nfsdb.json.img

You should see confirmation that three files were created along the way. Apart from compilation database we also extract compilation dependency map (i.e. list of compiled files for every module of interest) and reverse dependency map (i.e. for every source file we get information which module used that file).

created compilation database file (compile_commands.json)
created compilation dependency map file (cdm.json)
created reverse dependency map file (rdm.json)

Now you’re ready for FTDB generation.

${CAS_DIR}/clang-proc/create_json_db -P ${CLANG_PROC} -p fops -o fops.json
${CAS_DIR}/clang-proc/create_json_db -p db -o db.json -P $CLANG_PROC -F fops.json -m android_common_kernel -V "12.0.0_r15-eng" -mr ${CAS_DIR}/clang-proc/vmlinux-macro_replacement.json -A -DD ${CAS_DIR}/clang-proc/vmlinux-additional_defs.json -cdm cdm.json -j4

First we create some intermediate file, fops database. For every global variable of structure type (i.e. struct file_operations) that contains some function pointer members (i.e. ioctl) which are initialized statically we grab the function names that initialize them. Second invocation crates the final db.json file (with some additions helpful for the AoT project). If you have infinite amounts of RAM you can skip the -j4 option and it’ll run on full speed. Otherwise I suggest to keep it that way.

The JSON database reached a few GB of size. Now it’s fully justified to use the cache.

${CAS_DIR}/tests/ftdb_cache_test --only-ftdb-create db.json
${CAS_DIR}/examples/ftdb_show_info db.json.img 

Now the FTDB database can be used for further analysis and application development.

GitHub

View Github