A Survey of Most Common Errors in Linux Kernel
Sergey Staroletov
Polzunov Altai State Technical University,
Lenin avenue 46, Barnaul, 656038, Russia
Email: serg [email protected]
Abstract: The paper presents a technique for analysing the most frequent bugs in the Linux Kernel by examining commit messages and diffs with a Git library. It identifies the most relevant error messages, the most buggy files and the most buggy lines of code. Some results of experiments on real Linux Kernel Git repositories are given.

Keywords: Linux Kernel, Git, Bugs, Algorithm, Tool, Levenshtein Distance

I. INTRODUCTION

The Linux Kernel [1] is a very good example of how people can collaborate to solve difficult problems. The Kernel was initially written by Linus Torvalds, and now there are more than 16,000 contributors (this value can be obtained by counting the committers of the main Linux repository). A kernel is a set of functions and data structures that do the low-level system work for an operating system, in particular memory management, process scheduling, device support, networking, etc. The Kernel is written in C and uses its own C library.

One of the main advantages of releasing open source software is free feedback and error reports. A lot of users test the upcoming kernel and report bugs. There is no warranty for the software; people use it at their own risk, test it for free and provide feedback. The software engineers can collect the reports, find the problems and fix the software; later it can be released as a stable version, because it has already been tested by thousands or millions of users, and can then be used in commercial systems.

Software development teams now use version control systems (VCS); for Linux Kernel development the tool Git [2] is used. Because a VCS easily tracks any change of the project state, including new features, improvements and bug fixes, it is possible to track the fixes and build statistics for analysis. This paper is devoted to that task.

II. GIT AND JGIT

Git [2] is the second well-known project by Linus Torvalds after the Kernel. To develop the Kernel he needed a VCS that was distributed and, at the same time, local, so that it could be used offline to track changes, create patches, view diffs (differences between commits) and merge requests from other developers, for example when flying from Europe to America or staying in a quiet place without Internet access. In 2005 no such system existed, so he decided to create one [3]. Git is now used by a large number of software companies that want to share code within a team, track changes and control the development process.

A commit is the action by which a software developer records the changes that he or she has made. The developer takes the code from a common repository, works on it and, after reviewing the changes and deciding that the requested progress has been made, saves the changes in the repository by making a commit. Each commit comes with a natural-language message describing the changes.

JGit [4] is an open source library written in Java for working with Git repositories and walking their internal commit trees. It was chosen for the current work because of its good documentation and the large number of code snippets available on the Internet. JGit provides access to commits and their diffs, so we can write our own analysis code in Java while iterating over the desired Git nodes.

III. FORMULATION OF THE PROBLEMS

Suppose we have a local Git repository of the Kernel or of Kernel-related sources (the approach can be applied to any Git repository). Our goal is to:

1) Find the most frequent errors. This can be reformulated as: find the most common commit messages of fixes, and then as: find the most relevant commit messages of fixes, which are not necessarily identical but similar.
2) Find the most buggy sources. This can be reformulated as: find the files that were changed most often by commits with a "fix" message.
3) Find the most buggy lines of code. This can be reformulated as: find the lines in the most buggy source files that were changed most often by fixing commits.

Solving the first problem lets us learn the most common classes of errors in the Kernel's C code. Developers and teachers should be aware of them in order to learn the C techniques and kernel programming practices that avoid them.

Solving the second problem shows us the most unstable portions of the Kernel, which may be very hard to write correctly at the first attempt because of their difficulty. It could also mean that the component corresponding to a file is under active development and fixing because of its significance.

Solving the third problem can be useful for analysing the errors with respect to the source code.

IV. FINDING THE CLOSEST MESSAGES

It is easy to find identical messages in the list of messages, but to find the closest messages we have to use special algorithms that compute a similarity metric for two given strings.
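As an illustration of such a similarity metric, the following minimal Java sketch computes the Levenshtein edit distance (the metric adopted in Section IV) and uses it to pick the closest known message by a linear scan. It is not the implementation referenced as [6]; the class name ClosestMessage and the method names levenshtein and closest are hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;

final class ClosestMessage {
    /** Levenshtein distance via dynamic programming, keeping only two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // the finished row becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the known message with the minimal distance, or null if the list is empty. */
    static String closest(String message, List<String> known) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : known) {
            int d = levenshtein(message, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("memory leak", "typos", "coding style");
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(closest("memory leaks", known));   // prints memory leak
    }
}
```

Each call to closest compares the new message with every known message, so the cost of one lookup grows linearly with the size of the list.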
The strings here are the parts of commit messages that start at "Fix" or "Fixed:" and run to the end of the line or to the end of the string, whichever comes first. The markers "Fix" and "Fixed" were chosen after examining the existing commits in the Linux Kernel repository.

For this work, we use the Levenshtein distance [5] algorithm and the implementation of it from [6].

The Levenshtein distance is a measure of the distance between two strings. To find the string in a list that is nearest to a given string, we must calculate the Levenshtein distance between the given string and every string in the list and return the string with the minimal distance.

It is neither an elegant nor a fast solution (every lookup scans the whole list of known messages), but it works.

V. FINDING THE MOST FREQUENT ERRORS

The proposed algorithm is as follows:

algorithm FindMostFrequentErrors (repository, count)
  for branch ∈ Branches (repository)
    for commit ∈ Commits (repository, branch, path)
      commitMessage ← Message (commit)
      if commitMessage contains "Fix"
        minDst ← ∞
        for otherMessage ∈ messages
          minDst ← min (minDst, LevenshteinDistance (commitMessage, otherMessage))
        endfor
        closestMessage ← (message ∈ messages | LevenshteinDistance (commitMessage, message) == minDst)
        messages[] ← commitMessage
        mapMsgRelevance [closestMessage] ++
        mapMsgRelevance [commitMessage] ++
        FindMostBuggySources (commit)
      endif
    endfor
  endfor
  return GetFirst (count, SortMap (mapMsgRelevance))

Here repository is a JGit object that abstracts a locally cloned Git repository; count is the maximum number of errors to report; Branches returns all the branches of the given Git repository; Commits returns all the commits of the given repository and branch under the given starting path (we can start not from the root of the repository but from some folder within it); Message returns the commit message of the given commit object; LevenshteinDistance calculates the Levenshtein distance between two given strings; messages is a list of already known commit messages; mapMsgRelevance is a hash table that maps a message to its relevance value in the commits; FindMostBuggySources is an algorithm that solves the most buggy sources problem and is described in the next section; GetFirst is a bounded iterative algorithm that returns the first count values from the given collection; SortMap is an algorithm that sorts a hash table by value in descending order.

VI. FINDING THE MOST BUGGY SOURCES

This algorithm is invoked from the previous one, so that the possibly huge number of commits is traversed only once. The proposed algorithm is as follows:

algorithm FindMostBuggySources (commit, count)
  diffs ← DiffsScan (commit, Parent (commit))
  for diff ∈ diffs
    fileName ← ChangedFileInDiff (diff)
    for edit ∈ Edits (diff)
      mapFileNameChanges [fileName] ++
      FindMostBuggySourceLines (fileName, edit)
    endfor
  endfor
  return GetFirst (count, SortMap (mapFileNameChanges))

Here commit is a JGit object representing the commit; count is the maximum number of sources to report, specified at program startup; DiffsScan is an algorithm, part of the JGit library, that retrieves the diff objects describing the differences between two commits; Parent is an algorithm that returns the parent of the given commit object, that is, the state of the repository before it was changed by the given commit; ChangedFileInDiff is an algorithm that returns the changed filename from a given diff object; mapFileNameChanges is a hash table that maps a changed filename to its count of changes; FindMostBuggySourceLines is an algorithm that solves the most buggy lines of code problem and is described in the next section; GetFirst is a bounded iterative algorithm that returns the first count values from the given collection; SortMap is an algorithm that sorts a hash table by value in descending order. Note that the return statement is shown here for consistency; it actually runs after FindMostFrequentErrors ends.

VII. FINDING THE MOST BUGGY SOURCE LINES

This algorithm is invoked from the previous one. The proposed algorithm is as follows:

algorithm FindMostBuggySourceLines (fileName, edit, count)
  for line ← FirstInsertLine (edit) ... LastInsertLine (edit)
    mapFileChanges [fileName][line] ++
  endfor
  return for fileName ∈ MostBuggySources
    GetFirstLines (count, SortMapByChanges (mapFileChanges [fileName]))
  endfor
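The counting steps of Sections VI and VII can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the author's program: it takes the file name and the inserted line range of each edit as plain arguments instead of reading them from JGit objects, it assumes the range is inclusive, and the names BuggyCounters, recordEdit and top are hypothetical. The method top plays the role of SortMap followed by GetFirst.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BuggyCounters {
    // file name -> number of edits seen in fixing commits
    final Map<String, Integer> fileChanges = new HashMap<>();
    // file name -> (line number -> number of fixing edits that inserted this line)
    final Map<String, Map<Integer, Integer>> lineChanges = new HashMap<>();

    /** Records one edit that inserted lines firstLine..lastLine (inclusive) in fileName. */
    void recordEdit(String fileName, int firstLine, int lastLine) {
        fileChanges.merge(fileName, 1, Integer::sum);
        Map<Integer, Integer> lines =
                lineChanges.computeIfAbsent(fileName, key -> new HashMap<>());
        for (int line = firstLine; line <= lastLine; line++) {
            lines.merge(line, 1, Integer::sum);
        }
    }

    /** Returns the keys with the largest counts, at most 'count' of them, in descending order. */
    static <K> List<K> top(Map<K, Integer> map, int count) {
        List<Map.Entry<K, Integer>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.<K, Integer>comparingByValue(Comparator.reverseOrder()));
        List<K> result = new ArrayList<>();
        for (int i = 0; i < Math.min(count, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        BuggyCounters counters = new BuggyCounters();
        counters.recordEdit("a.c", 10, 12); // a fix inserting lines 10..12
        counters.recordEdit("a.c", 12, 12); // a later fix touching line 12 again
        counters.recordEdit("b.c", 1, 1);
        System.out.println(top(counters.fileChanges, 1));            // prints [a.c]
        System.out.println(top(counters.lineChanges.get("a.c"), 1)); // prints [12]
    }
}
```

Keeping one nested map per file means that the per-line table grows only for files that fixing commits actually touch.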
In the FindMostBuggySourceLines algorithm, fileName is the name of the file whose buggy lines are counted; count is the maximum number of buggy lines to report per source, specified at program startup; edit is a JGit object representing an inserted edit in the source file (edits are passed in from the FindMostBuggySources run); FirstInsertLine and LastInsertLine are algorithms from the JGit library that extract the first and the last inserted line from the edit object; mapFileChanges is a two-key hash table that maps a filename and a line in the file to a count of changes; GetFirstLines is a bounded iterative algorithm that returns the first count lines from the given two-key hash table; SortMapByChanges sorts the hash table entries of the desired filename by the count of changes. Note that the return statement is shown for consistency; it actually runs after the FindMostFrequentErrors and FindMostBuggySources algorithms end.

Creating a hash table keyed by changed line could be costly in memory, but we are working not with ordinary commits but with fixing commits only, where the number of inserted lines cannot be very large.

VIII. EXPERIMENTS AND ANALYSIS

During the R&D work, all the given algorithms were implemented as a Java program. The algorithms are neither well optimized nor parallelized or distributed (this is a possible subject of further work), so the analysis of the current master Linux Kernel repository [7] (more than 600,000 commits) takes about 10 hours on an ordinary Intel Core i5 laptop. This analysis gives us the overall statistics.

Because the developed software lets the user choose the repository or the initial path within the repository, it is more practical to analyze the different related repositories in [8] or parts of the main repository (memory management, networking and so on). Analysing a repository of 20,000 commits takes about 10 minutes.

Here are some results of analyzing the whole main Linux repository, the Bluetooth protocol stack for Linux (Bluez) [9], the memory management part [10], the kernel scheduling part and the networking part of the main repository.

The results were corrected to merge some duplicates (plurals, punctuation, etc.).

The main repository. Most common error fixes. The format here is: fixing commit/relevance.

it/135
typos/102
the checkpatch.pl issues:/102
some error handling/78
this/70
that/53
dtc warnings/49
checkpatch warning:/45
module autoload/44
modular build/44
unaligned accesses in VC code/29
camel case/29
missing interrupts/28

The most buggy files and lines: filename/fixes count

drivers/gpu/drm/i915/intel_display.c/9165
Line:3166, changes -> 7
drivers/staging/skein/threefish_block.c/6481
Line:1048, changes -> 3
drivers/gpu/drm/i915/i915_gem.c/5390
Line:1213, changes -> 7
drivers/gpu/drm/i915/intel_ringbuffer.c/4396
Line:2363, changes -> 7
drivers/gpu/drm/i915/intel_pm.c/3960
Line:2276, changes -> 9
/dev/null/3911
Line:0, changes -> 3911
include/linux/mlx5/mlx5_ifc.h/3745
Line:922, changes -> 4
drivers/gpu/drm/i915/intel_lrc.c/3547
Line:774, changes -> 7
drivers/gpu/drm/i915/i915_debugfs.c/3381
Line:5092, changes -> 7
drivers/gpu/drm/i915/i915_gem_gtt.c/3206
Line:2615, changes -> 5
drivers/scsi/scsi_debug.c/2619
Line:4882, changes -> 5
drivers/net/ethernet/mellanox/mlx5/core/en_main.c/2373
Line:297, changes -> 6
drivers/gpu/drm/i915/i915_drv.h/2252
Line:3244, changes -> 8
drivers/net/dsa/mv88e6xxx.c/2247
Line:1238, changes -> 5
arch/powerpc/xmon/ppc-opc.c/2138
Line:2964, changes -> 2
drivers/gpu/drm/i915/intel_dp.c/2060
Line:1227, changes -> 6
drivers/media/pci/saa7134/saa7134-cards.c/1982
Line:5713, changes -> 2

Comments

The overall analysis tells us that the most relevant issues are not code issues but integration issues.

First, we can observe non-informative entries (it, this, that), which come from commit messages such as "Fixed this". They spoil the statistics, and it is strange to see them in a project of this level. "Typos" is in a similar category.

Checkpatch warnings and issues are the most frequent. Checkpatch.pl [11] is a script that checks code against the Linux Kernel coding guideline, which is given in [12]. It is very strange that developers who want to create a patch for the Kernel ignore that document. The Kernel consists of a large number of mostly C files, and if each of its 16,000 developers used a personal coding style, the project would be a big mess. Before starting to code for any project, every developer should ask for the project's code style.

Error handling is a way to improve system stability and performance after the principal logic has been implemented. The developer can first write code without paying attention to special cases. Later, while the module or kernel part is being tested, some errors will occur and the code must be fixed. Error handling can be a set of special ifs and gotos to the end of the code, where the buffers are freed and a negative error code is returned.
DTC warnings are about device trees [13]. The core reason for the existence of the Device Tree in Linux is to provide a way to describe non-discoverable hardware. This information was previously hard-coded in the source code. To deal with a device tree, a developer must know low-level information from chipset and motherboard vendors, so testing it is not simple.

Module autoload is the process of loading a module at boot time. Developers typically test loadable Kernel modules by installing them into a live kernel, but at load time a module could conflict with other modules in the Kernel, so this is an integration issue.

Modular build is probably the process of creating correct makefiles so that the module is built together with the kernel. It is a process of integrating the module into the whole kernel and the kernel config.

Unaligned memory access occurs when the developer tries to read N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0) [14]. It can cause a performance penalty on some architectures or even a processor exception. VC means PCI Express bus virtual channel.

Camel case is a code style issue. It usually means naming variables and functions as one long word in which each sub-word starts with a capital letter and no separators are used, like SomeFunction. The fixes mean that the developer initially followed a code style other than the Kernel's (which uses names like "some_function") or did not follow any style.

Missing interrupts is a specific OS kernel or driver issue. A driver can set up and start an I/O operation and then wait for an interrupt indicating completion. If that interrupt never shows up, the driver can end up waiting for a very long time [15]. We think this issue resembles a problem with waiting on a semaphore and could be detected with static checking methods.

The overall diagram of the most common error fixes is given in Figure 1. If we remove typos and uninformative fixes and group the results, we get Figure 2.

Fig. 1. Top error fixes after analysis of the main repository of the Linux Kernel

Fig. 2. The most common error fix types

If we look into the most buggy files (files with a large count of fixing commits), we find the strange /dev/null path (possibly a bug of Git or JGit, although Git diffs also use /dev/null as the placeholder path for the missing side of an added or deleted file), a lot of fixes for Intel GPU drivers, the Skein hash function driver, the SCSI driver, fixes for the Philips SAA7134 TV card, fixes for the Mellanox Ethernet adapter, etc. (see Figure 3).

Fig. 3. Top types of buggy Linux drivers

We see that the Intel display and GPU drivers have the most numerous fixes in the Linux Kernel. It could mean that the drivers are completely unstable: we see lots of fixes even in the interface .h file, which makes no sense. But it also means that Intel supports its drivers well. It is no secret that a large amount of free Linux code is written during working hours by developers from hardware and software corporations (Intel, Oracle, IBM, ...). They are paid to write patches for the Linux kernel or drivers; the companies have the software tested by the significant number of free Linux users to find possible bugs and then use the well-tested software in their own projects, create commercial products and sell them to vendors.

There are a lot of fixes in the Skein hash function [16] implementation. Skein was a finalist in the competition to become the SHA-3 function, but not the winner. It is a cryptographic function, so there were some attacks on it, and performance matters, so there are a lot of fixes here.

Bluez. It is a part of Linux (the Linux Bluetooth protocol stack). It was selected because it has a moderate number of commits (about 23,000) and is independent of most Linux components.

Most common error fixes: fixing commit/relevance

memory leaks/153
coding style/60
typos/47
possible invalid reads/25
double free/24
setting connecting state/24
not reseting sink source/22
whitespace issues/15
not handling notification/21
missing file/21
use after free/20
passing wrong error code/20
possible NULL pointer dereference/19
warnings/18
sending command responses/18
invalid reference counting/18
incorrect error check/17
includes for gobex.h header/17
not setting scope properly/16
device found tests/16
memory leak in gap/16

The most buggy files and lines: filename/fixes count

Filename: src/adapter.c/935
Line:422, changes -> 2
Line:2280, changes -> 2
Filename: src/device.c/753
Line:1972, changes -> 2
Filename: obexd/plugins/phonebook-tracker.c/515
Filename: android/gatt.c/386
Filename: obexd/client/session.c/309
Filename: audio/headset.c/251
Filename: tools/mgmt-tester.c/250
Filename: audio/a2dp.c/238
Line:246, changes -> 2
Filename: audio/avdtp.c/230
Filename: lib/sdp.c/212
Filename: android/bluetooth.c/195

Comments

We see a lot of types of C-related errors here (see Figure 4). The biggest problem for Bluez is memory leaks, but there are also invalid reads, double frees, use after free and NULL pointer dereferences (all of them pointer-related). We now know that even in a Bluetooth implementation memory leaks are more frequent than the other problems, but it is still easy to run into pointer trouble.

Fig. 4. Pointer-related errors in Bluez

We can see emails from well-known companies and a careless missing space. Why not follow the code guidelines? Why not review the code before committing?

There are also some warning fixes. Why commit code with warnings to the Linux Kernel repository? Especially in an OS kernel, a compiler warning can easily turn into an error at run time, and an error in a kernel part or a kernel module may cost a lot.

The overall diagram of the most common error types in Bluez is given in Figure 5.

Fig. 5. Top error types in Bluez

Memory management. It is a very important part of the Kernel that allocates and manages memory pages. The code for analysis is taken from the /mm path in the Linux Kernel Git repository.

Most common error fixes: fixing commit/relevance
IX. CONCLUSION

The described algorithms can be used to study and analyze the most frequent errors in the Linux Kernel. People learn well from troubles, and existing errors give real material to learn from.

During the research, software for the analysis was developed; its sources can be freely downloaded from the Git repository [18].

Some Linux Kernel repositories have been analysed using the software.

The results of the study show that, on the scale of the whole Linux Kernel, errors of integration and of development process organization play a huge role. There are also many fixes of C language related errors, but they do not make up a significant share of the total number of fixes.

We found that the most frequently fixed part is the Intel drivers. They could be unstable, or well tested and fixed; we cannot say which.

We also see a lot of fixes of typos, uninformative commits and code style issues; these fixes show that developers should pay attention to the Linux code style guide and must review their changes before committing.

The research can be continued by looking at the source code of the diffs to understand the large number of refixes (fixes of the same line of a file) that we found.