
Controller Area Network Intrusion Detection Dataset

This document introduces the ROAD CAN Intrusion Dataset, which provides the first publicly available dataset containing real, advanced attacks on a vehicle's CAN network. It begins with a survey of existing public CAN intrusion detection datasets, categorizing them and identifying limitations, particularly the lack of realistic labeled attacks. It then presents the ROAD dataset, collected on an automotive dynamometer, as the first to address this need by including real, sophisticated attacks to aid the evaluation and development of CAN intrusion detection systems.


ROAD: The Real ORNL Automotive Dynamometer

Controller Area Network Intrusion Detection Dataset


With a comprehensive CAN IDS dataset survey & guide

MIKI E. VERMA¹, MICHAEL D. IANNACONE¹, ROBERT A. BRIDGES¹, SAMUEL C. HOLLIFIELD¹, BILL KAY², AND FRANK L. COMBS³

¹ Cyber Resilience & Intelligence Division, Oak Ridge National Laboratory, Oak Ridge, TN (e-mail: {vermake, iannaconemd, bridgesra, hollifieldsc}@ornl.gov)
² Computer Science & Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN (e-mail: [email protected])
³ Electrical & Electronics Systems Research Division, Oak Ridge National Laboratory, Oak Ridge, TN (e-mail: [email protected])
Corresponding author: Miki E. Verma (e-mail: [email protected]).
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.
arXiv:2012.14600v1 [cs.CR] 29 Dec 2020

The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States
Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this
manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to
these results of federally sponsored research in accordance with the DOE Public Access Plan
(http://energy.gov/downloads/doe-public-access-plan).

ABSTRACT
The Controller Area Network (CAN) protocol is ubiquitous in modern vehicles, but the protocol lacks
many important security properties, such as message authentication. To address these insecurities, a rapidly
growing field of research has emerged that seeks to detect tampering, anomalies, or attacks on these
networks; this field has developed a wide variety of novel approaches and algorithms to address these
problems. One major impediment to the progression of this CAN anomaly detection and intrusion detection
system (IDS) research area is the lack of high-fidelity datasets with realistic labeled attacks, without which
it is difficult to evaluate, compare, and validate these proposed approaches. In this work we present the first
comprehensive survey of publicly available CAN intrusion datasets. Based on a thorough analysis of the
data and documentation, for each dataset we provide a detailed description and enumerate the drawbacks,
benefits, and suggested use cases. Our analysis is aimed at guiding researchers in finding appropriate datasets
for testing a CAN IDS. We present the Real ORNL Automotive Dynamometer (ROAD) CAN Intrusion
Dataset, providing the first dataset with real, advanced attacks to the existing collection of open datasets.

INDEX TERMS Controller Area Network (CAN), Intrusion Detection, Dataset, Machine Learning, Vehicle
Security, Benchmark

I. INTRODUCTION

Modern vehicles are increasingly drive-by-wire, relying on continual communication among small computers called electronic control units (ECUs). Nearly ubiquitous in modern vehicles, controller area networks (CANs) facilitate the data exchange among ECUs by providing a common network with a standard protocol. While it is lightweight and reliable, the CAN standard has well-known security flaws, such as a lack of message authentication. Furthermore, intra-vehicle CANs are increasingly exposed, often directly by mandatory on-board diagnostics II (OBD-II) ports and potentially indirectly or remotely through a variety of vehicle interfaces, including USB ports and various wireless communications.

Consequently, there has been a significant increase in the attention given to CAN vulnerability in the form of research [1–8], as well as many proposed CAN intrusion detection systems (IDSs). Four recent in-vehicle IDS surveys show the growth and progression of this field [9–12]. Categorizations offered by these surveys illustrate the current barriers to CAN IDS progress. Note that these surveys (even when combined) are not comprehensive but provide a representative sample of the described distributions.

CAN IDS methods generally fall into five major categories:

Rule/Specification-Based: Uses rules / whitelisting [13, 14]
Physical Side-Channel: Uses physical layer attributes (e.g., voltage) [15, 16]
Frequency/Timing-Based: Regards the timing of each frame arbitration ID and/or the sequential nature of IDs [17–19]
Payload-Based: Uses a black-box approach that considers the data frame as a string of bits, without recovering the signals these bits represent [20–22]
Signal-Based: Requires first decoding raw data field bits into constituent signals, and uses time series of signal values as inputs [18, 23–25]

The two most recent surveys by Wu et al. [9] and Lokman et al. [10] demonstrate both the increasing pace and asymmetric growth of the field in terms of the five categories described above. A quick meta-analysis of the union of the 30 papers surveyed (up to 2018) yields the following distributions by year: before 2015 (4), 2015 (5), 2016 (6), 2017 (9), 2018 (6); by category: rule/specification-based (3), side-channel (4), frequency/timing-based (14), payload-based (6), signal-based (3).

While the number of publications in the CAN security domain, especially in IDS research, has grown appreciably in the past few years, IDS research is significantly hindered by two major issues: (1) obfuscated CAN messages (not the focus of this work) and (2) lack of quality, publicly available, real CAN data with advanced attacks present (the focus of this work).

The asymmetric growth in the field, in particular the disproportionate number of publications on methods that are timing/frequency-based (and to a lesser extent payload-based) as compared to signal-based, is a direct result of the obfuscated CAN messages issue. Original equipment manufacturers (OEMs; e.g., Subaru, Ford) of passenger vehicles keep their proprietary encodings of signals in the CAN data fields secret and vary the encodings across models. Consequently, though researchers can easily add a node to monitor and send CAN messages on most vehicles, the data is not understandable. Thus, most have focused on methods that do not require knowledge of signal encodings. While a few researchers have paired with OEMs or done some manual reverse engineering to obtain and develop IDSs based on the de-obfuscated CAN signals, these developments are not vehicle agnostic. Notably, the research community is beginning to address the CAN signal reverse engineering problem (see Verma et al. [26] for a survey of these works), in large part to facilitate CAN IDS research, but more generally to enable a wide variety of downstream automotive technologies (e.g., [27]). While the obfuscated signal problem is not the problem addressed in this paper, it is necessary context for the second issue, to which we make a contribution.

PROBLEM ADDRESSED

We now turn to the second barrier, which is that it is difficult to obtain CAN data with real high-fidelity labeled attacks. Such data is unavailable for three reasons. First, CAN data with real attacks are costly to produce, with the exception of fabrication (simple message injection) attacks. Facilitated by open-source (e.g., SocketCAN/CANutils [28]) and proprietary software (e.g., CANalyzer¹, VehicleSpy²) and OBD-II access to many vehicle CANs, collecting ambient CAN data from real vehicles is relatively straightforward, as is collection while sending extra messages; thus fabrication attacks are common. For more subtle attacks, researchers must have a dedicated modern vehicle for study, a relatively costly hardware investment when purchase price, maintenance, insurance, etc. are taken into account. Discovery and execution of more subtle, physically verifiable CAN attacks require ample research time and effort, and (thanks in part to issue (1) above) CAN attacks/vulnerability analyses are usually a per-vehicle endeavor.

¹ https://www.vector.com/int/en/products/products-a-z/software/canalyzer/
² https://intrepidcs.com/products/software/vehicle-spy/

Second, producing realistic CAN attack data carries inherent risks to the passengers, bystanders, and to the vehicle itself. Ideally, dynamometers allowing driving in a safe and controlled laboratory environment are used, but such facilities are large investments and are usually outside the researchers’ control. Furthermore, risks of permanent damage loom (e.g., “bricking” a vehicle’s ECU), and successful implementation of cyber attacks with finesse requires per-vehicle research efforts in themselves.

Third, disclosure of sensitive information is an inhibitor. OEMs consider their CAN encodings intellectual property. Additionally, responsible vulnerability disclosure may be necessary if new attacks are discovered, which at a minimum pauses release of data. Further, releasing data with targeted attacks may be viewed unfavorably by OEMs, resulting in lawsuits if not handled responsibly.

To our knowledge, there are currently only six publicly available vehicle CAN datasets with labeled attacks (see Table 1). Likely due to the inherent difficulties in producing real CAN attack data described above, all of these datasets contain either real fabrication (simple message injection) attacks or simulated attacks (created by manipulating CAN data post collection). Both methods have significant limitations when supporting CAN IDS development. Fabrication attacks are generally simple to detect with timing-based methods and are thus limited in scope. Due to the complex dynamics of the broadcast CAN protocol, the simulated CAN attacks ignore aberrations in message timing, content, and presence that naturally occur, and therefore change data quality in unknown ways. Physical verification of the effect of the attack on the vehicle is not possible with pure simulation. In short, there is no publicly available, real CAN data with labeled attacks that is of sufficient quality to permit assessment of many CAN IDS methods.

CONSEQUENCES

A result of the difficulty of obtaining sufficient CAN data with attacks is simply that CAN IDSs are often not evaluated on real CAN data with real attacks. A survey by Loukas et al. [11] classifies 17 surveyed automotive CAN IDS papers by the evaluation method: “analytical” (theoretical only, no evaluation on data), “simulation” (evaluated on simulated CAN data or real CAN data with simulated attacks), and “experimental” (evaluated on real CAN attacks). Note our classification language is slightly different from that in the survey by Loukas et al., as we consider IDSs evaluated on real attacks (even if not in situ) to be “experimental.” The distribution of the surveyed papers is: analytical (3), simulated (8), experimental (6). This illustrates the first consequence:
relatively few IDSs are evaluated on CAN data with real attacks.

A second consequence to the community is that CAN IDS works are not comparable, or at least not compared. Rajbahadur et al. [12] survey an even larger set of papers (with a much wider scope of “Anomaly Detection for Connected Vehicle Cybersecurity”), finding that

  “Much of the research is performed on simulated data (37 out of the 65 surveyed papers)... much of the research does not evaluate the newly proposed techniques against a baseline (only 4 out of the 65 surveyed papers do so), which may lead to results that are difficult to quantify.”

This reinforces the findings of Loukas et al. regarding simulated data, but also articulates a second issue that is even larger in magnitude: very few CAN IDSs are evaluated against a baseline. Our experience with the CAN IDS literature is that authors are contributing from a wide variety of backgrounds. While this milieu provides a diverse set of approaches (a benefit), the area suffers by lacking a uniform body of knowledge, and the lack of depth seems to inhibit the steady development of ideas and systematic, quantifiable progress. To again quote Rajbahadur et al.,

  “The varied use and scattered publication of anomaly detection [for connected vehicle cybersecurity] research has given rise to a sprawling literature with many gaps and concerns... we urge researchers to address these identified shortcomings.”

To summarize, quantifiable comparison across competing and complementary IDS methods is currently not possible. Standardized datasets are necessary for head-to-head comparisons and for replicability (or better, reproducibility). To continue to progress in an empirically verified and scientific manner, the CAN IDS research community needs to produce and adopt a publicly shareable collection of CAN datasets with labeled attacks. This sentiment was reiterated and acted on by Hanselmann et al. [23] in their recent CAN IDS work:

  “To the best of our knowledge, there is no standard data set for comparing methods. We try to close this gap by evaluating our model on both real and synthetic data, and we make the synthetic data publicly available. We hope that this simplifies the work of future researchers to compare their work with a baseline.”

Finally, we find that IDSs are often evaluated against inappropriate test data. For example, IDSs promising detection of advanced, subtle attacks are tested only on CAN data with exceptionally noisy attacks, or (another example) works use attacks that disrupt timing, then ignore timing in evaluation to test payload-based detection. In order to not disparage other IDS works, we cite our own insufficient evaluation of CAN IDS ideas as examples [29, 30]. The consequence is that many promising IDS methods, which are excessive for the easily detected attacks in the data available, are never truly evaluated on the more advanced attacks they target.

CONTRIBUTIONS

We add to these cries for a more systematic progression of CAN IDS research and to the request to also use appropriate test datasets by offering the following contributions.

We provide a comprehensive (to the best of our knowledge) survey of publicly available CAN datasets that contain labeled attacks. Our survey includes simulated attacks and frameworks for manufacturing attacks in post-processing. We itemize these datasets, their download links, and citations in tables to provide easy reference.

After performing quality analysis investigations on both the data and documentation presented in each previously released CAN dataset, we provide a detailed description of the data, a discussion to illuminate the benefits and drawbacks of each dataset, and recommendations for appropriate use of each dataset when developing a CAN IDS.

Our foremost contribution is a real CAN dataset collected from a passenger vehicle with a variety of physically verified CAN attacks, and ample training data with no attacks. This dataset provides a fuzzing fabrication attack, many targeted fabrication attacks that are maximally stealthy (manipulating only the necessary portions of the data field and sending a single manipulated message per ambient message of the same ID), and two advanced attacks that include no fabricated (injected) messages. For each targeted injection attack, we also include an augmented CAN capture by deleting the targeted ambient message to simulate a masquerade attack. This is the highest fidelity dataset with attacks that vary dramatically in difficulty to detect, so as to allow appropriate testing and head-to-head comparisons of the wide variety of proposed CAN IDS methods.

Section II provides necessary background on the CAN protocol and vehicle attack terminology; Section III comprises the survey, analysis, and discussion of all previous CAN attack datasets; Section IV introduces our new CAN attack dataset. We hope that our contributions help facilitate advancement in the structure and impact of this growing field.

II. CAN DATA AND SECURITY

A. CAN PROTOCOL

CAN is a message-based protocol standard [32] that defines the first two Open Systems Interconnection (OSI) layers (physical and data link). Using this protocol, ECUs (e.g., Powertrain Control Module [PCM], Antilock Braking System [ABS]) continually broadcast data frames with information relating to the current state of the vehicle. A standard CAN data frame (or packet), depicted in Fig. 1, contains several fields, of which two are relevant for the scope of this paper: the 11-bit Arbitration ID and the 64-bit Data Field.

FIGURE 1: CAN data frame [31]: The two primary components are the Arbitration ID used for message identification and arbitration (prioritizing messages) and the Data Field, containing up to 8 bytes of message contents.

The Arbitration ID, or simply ID, is the message header that identifies the frame and is used for arbitration, the process by which frames are prioritized when multiple ECUs concurrently transmit: the lower the ID, the higher the priority. The RTR bit is an indicator of a remote frame. Any
ECU can request the data on an ID by sending the ID and the RTR bit indicating the request. This remote frame would be immediately followed by a response with the requested ID and data. The Data Field contains the actual message contents of up to 8 bytes, where each distinct piece of information carried in the message is called a signal. CAN frames with the same ID encode the same set of signals in the same format and are usually sent with a fixed frequency to relay updated signal values. In general, each ECU is assigned a set of IDs that only it transmits. For example, the PCM may transmit ID 0x102 containing engine RPM, vehicle speed, and odometer signals every 0.05 s, and ID 0x45D with signals encoding the angle of the gas and brake pedals every 0.01 s.

The CAN standard also defines a robust error handling mechanism that is designed to prevent erroneous messages from being propagated or faulty nodes from disrupting communications. For example, if two nodes attempt to concurrently transmit different messages with the same ID, both nodes will transmit their frame until they send opposing bits simultaneously, at which point one will incur an error. If a node’s error count gets too high, it will enter a “bus off” mode, meaning it cannot read or transmit messages on the bus until it is reset. See previous works [7, 33] for more details on CAN error handling.

B. CAN ATTACKS

While lightweight, CAN lacks encryption and authentication, and is therefore vulnerable to exploitation. There have been a number of successful attacks on vehicular CANs published in the past several years, some remote, and some requiring physical access. Koscher et al. [1] provide a comprehensive overview of CAN-based ECU vulnerabilities. Their exploration involves applications that facilitate CAN communication, such as the Unified Diagnostic Services (UDS), a standardized set of commands which can change the state of a targeted ECU or directly read and write to memory addresses. Instead of focusing on these types of applications, our attack data focuses on inherent vulnerabilities found within the CAN protocol.

The attacks surveyed here begin with the assumption of a compromised node on the bus. Cho & Shin [34] provide a well-defined adversary and attack model with terminology that is widely used. A weakly compromised ECU is a node that an adversary is able to silence, suspending any message transmission, while a fully compromised (referred to below as strongly compromised) ECU is a node over which the adversary has complete control, with the ability to send fabricated messages and access the node’s memory. Note that the method of connecting to a vehicle’s CAN via the OBD-II port is considered a fully compromised ECU in this model.

Using this terminology, Cho & Shin introduce the following three general categories of attacks, which are in turn useful for describing attack sophistication and the types of IDSs that would be able to detect them.

1) Fabrication Attacks

A fabrication attack uses a strongly compromised ECU to inject messages with malicious IDs and Data Fields. The majority of the attacks in the CAN IDS literature fall into this category, including the following:

DoS Attack: Messages with ID 0x000 and an arbitrary payload are injected at a high frequency. Since ID 0x000 always wins arbitration and is not usually issued by legitimate ECUs, flooding the bus with these high priority messages prevents legitimate messages from being transmitted. This results in a host of unusual effects, such as flashing dash indicators, intermittent accelerator/steering control, and even full vehicle shutdown.

Fuzzing Attack: Messages with random IDs and arbitrary payloads are injected at a high frequency. The effect is similar to that of the DoS attack: the bus becomes occupied with mostly injected messages, displacing real messages. In addition, unlike the DoS attack, injected messages may have an ID that appears in normal traffic, so receiver nodes expecting messages with these IDs will read and use the information in the malicious payload, causing a wide variety of unexpected results. There are two slight variations of this attack: some researchers inject only IDs that appear during normal traffic (e.g., [35]), while others inject arbitrary random IDs (e.g., [36]).

Targeted ID Attack: Messages are injected with a specific target ID and manipulated data field. When only the bits in a specific signal (that is, a select part of the 64-bit data field) are modified, we refer to this as targeting a signal, rather than an ID.

Fabrication attacks are characterized by the inherent problem of message confliction, described by Miller & Valasek [6]:

  “The biggest problem with CAN message injection is that, while attackers can inject arbitrary messages onto the bus, the original sender of the message (i.e., the legitimate ECU) is still sending legitimate messages...The result of the ECU continuously sending messages along side our attack messages is message confliction. From the perspective of the receiving ECU, inconsistent messages are received (and it must) decide what to do with this conflicting information.”

In general, ECUs will regard the last seen data frame on a given ID; thus, to effectively overwrite legitimate messages with the target ID, the injected frames must occur on the bus very soon after the true frame. Not all data frames (especially injected frames) are regarded independently and acted upon; simply reverse engineering a signal to inform targeted injections will often not result in the desired response, or any response, from the vehicle. Miller & Valasek [6] provide potential techniques for side-stepping message confliction, but the desired effect is the same: de-conflicting ambient and fabricated data frames by suspending the ambient messages.

The first two fabrication attacks described (DoS and fuzzing) require almost no understanding of or reconnaissance on the target vehicle, nor do they allow for finesse in
execution. On the other hand, the targeted ID attack can be many occasions, in essence running a suspension attack on
more sophisticated. Targeting manipulation of specific func- ourselves!
tionality requires knowledge of at least one of the IDs’ signals This previous research has shown that masquerade at-
and requires the data field designed to have a particular effect tacks are indeed possible, but they require enormous hacking
based on the given ID’s signal definitions. expertise and thus far more in-depth per-vehicle research.
Furthermore, targeted ID attacks can, similar to the first Further, white-hat CAN hackers and CAN intrusion detec-
two attacks, be accomplished by flooding the bus, simply tion research communities are working independently with
meaning that messages are sent at a very high frequency, seemingly different skill sets toward a common goal. Thus,
although this is blatant and easy to detect. Research hackers no real CAN masquerade attacks are publicly available, and
Miller & Valesek used this tactic to successfully attack a the evidence from the CAN IDS research community is that
Toyota Prius, injecting fabricated collision prevention system defensive researchers do not have the skills or resources to
messages at a high frequency, causing the ABS to engage the create such advanced attacks, yet they need to access to attack
brakes [5]. data in order to create a robust IDS.
The most stealthy targeted ID attack is a flam at-
tack—immediately after each target ID’s legitimate message, Timing Transparent vs. Timing Opaque
an injected message is sent (with the same ID but manip- Unsurprisingly, these attack categories map to intrusion de-
ulated data), so that the true message state is not physically tection techniques that have matured alongside offensive
realized before the spoofed message alters the car to the target developments. Fabrication and suspension attacks are ac-
state. Injected frames and true target ID frames are in one-to- curately detectable by frequency-based IDSs, which regard
one correspondence. This type of attack was pioneered by the timing of each ID and/or the sequential nature of IDs.
Hoppe et al. [3], who essentially disabled a car’s warning Masquerade attacks require more sophisticated methods: for
lights by sending a “lights off” frame immediately after any example, attempting to identify the sending ECU [16, 34],
legitimate frame’s “lights on” message was sent, resulting luring an added node to reveal itself [35], or inspecting the
in the lights appearing continually off. Provided the attacker data field [20, 23, 30].
can reverse engineer the target ID’s payload, only the bits We define a Timing Transparent (T.T.) attack to be any that
involved in the targeted signal need to be manipulated. is hypothetically detectable using a frequency-based method,
2) Suspension Attacks
specifically, a fabrication attack, detectable by unusually fast
message timing or the appearance of new IDs, or a suspen-
An adversary mounting a suspension attack needs a weakly
sion attack, detectable from unusually slow or disappearance
compromised ECU, preventing it from transmitting some or
of usually present IDs.
all messages. For example, an adversary could suspend all
An attack that is Timing Opaque (T.O.), on the other hand,
messages on a particular safety-critical ID, thus disrupting
is defined as an attack that does not disrupt normal timing
other systems that rely on this constantly updated data.
or ID distributions, and thus would not be detected with a
3) Masquerade Attacks frequency-based IDS. Instead, a more sophisticated method,
Finally, the most sophisticated category, masquerade attacks, such as a payload-based detector that uses the data field, or
involve an adversary first suspending messages of a specific perhaps a side-channel method monitoring the physical layer,
ID from a weakly compromised target ECU, and then using a would be needed. In short, developing a comprehensive and
strongly compromised ECU to inject spoofed messages with robust IDS requires hardening (and therefore testing) against
this ID at a realistic frequency, thus masquerading as the timing opaque attacks. A masquerade attack is the primary
target ECU. Using this more advanced strategy, a targeted example, but other attacks that may alter the overall state
ID attack can be carried out without message confliction, of the vehicle (e.g., the “accelerator attack” in the ORNL
thus allowing for a more stealthy attack. Miller & Valasek’s dataset, Sec. IV-A3), may also be included in this category.
infamous remote Jeep Cherokee hack [6] employed a mas-
querade attack. Unlike the Prius [5], the Cherokee ABS III. PREVIOUS DATASETS
system dealt with message confliction by simply turning off Here we itemize the publicly available CAN datasets with
the collision prevention system, and thus they were unable to attacks, including descriptions of the attacks present and
mount a similar fabrication attack; instead, they had to first whether they are real or simulated, as well as dataset benefits
suspend legitimate messages (in addition to a few other steps) and drawbacks. We examined each dataset, and attempted to
in order to mount an attack. verify the accuracy of documentation. Refer to Table 1.
In another example, Cho & Shin cleverly use a strongly compromised ECU in order to weakly compromise a target ECU by causing it to go into bus off mode, at which point they run a masquerade attack [7]. Interestingly, if an attacker is not careful when mounting a fabrication attack, this same mechanism can result in the attacker's own strongly compromised ECU getting bussed off. In fact, we have done this on

TABLE 1: Open CAN IDS datasets. All datasets include ambient data for the given vehicle(s) in addition to attack data. "T.T." and "T.O." refer to Timing Transparent (alters normal timing characteristics) and Timing Opaque (does not alter normal timing characteristics), respectively.

Dataset | Org. | Year | Dataset URL | Real Attacks | # Attack Types (T.T.) | # Attack Types (T.O.)
CAN Intrusion (OTIDS) [35] | HCRL | 2017 | http://ocslab.hksecurity.net/Dataset/CAN-intrusion-dataset | Yes | 2 | 1 (note 1)
Survival Analysis for Automobile IDS [37] | HCRL | 2018 | http://ocslab.hksecurity.net/Datasets/survival-ids | Yes | 3 | 0
Car Hacking for Intrusion Detection [36, 38] | HCRL | 2018 | http://ocslab.hksecurity.net/Datasets/CAN-intrusion-dataset | Yes | 4 | 0
SynCAN [23] (note 2) | Bosch | 2019 | https://github.com/etas/SynCAN | No | 2 | 3
Automotive CAN Bus Intrusion v2 [39] | TU Eindhoven | 2019 | https://doi.org/10.4121/uuid:b74b4928-c377-4585-9432-2004dfa20a5d | No | 6 (note 3) | 1
Can Log Infector (note 4) | CrySyS Lab | 2020 | https://www.crysys.hu/research/vehicle-security/ | No | 0 | 7
ORNL Dynamometer CAN Intrusion | ORNL | 2020 | https://0xsam.com/road/ | Yes | 5 | 5

Note 1: The sole advanced attack, the "Impersonation Attack," may not actually be useful for researchers developing an IDS (see description in Sec. III).
Note 2: Dataset does not contain any real raw CAN data; all data is synthetic (not just the attacks), and it contains decoded signal values rather than 64-bit data fields.
Note 3: Due to the rather crude method of creating synthetic injections (see description in Sec. III), which involves modifying timestamps, classification is less clear.
Note 4: Dataset does not technically include attack data; instead, it includes several ambient captures and a Python script for programmatically modifying ambient logs to create different types of simulated attacks.

HCRL CAN INTRUSION DATASET (OTIDS)
Lee et al. at the Hacking & Countermeasures Research Lab (HCRL) published this dataset as a supplement to their CAN IDS paper [35], which presented OTIDS: Offset Ratio and Time Interval based Intrusion Detection System, a novel IDS based on the timing of remote frame responses. The general
method was to transmit remote frame requests for a given ID, time how long it took an ECU to respond to the request, and test whether this delay was anomalous, the idea being that a compromised ECU, being controlled by an adversary, would respond with an unusual delay. The published dataset contains artifacts of this method, specifically, the remote frames (which are labeled as such and thus easy to remove in preprocessing), and legitimate as well as spoofed responses (less easy to identify and remove). The documentation of the dataset on the website and in the paper, their figures describing the dataset, and the data itself have discrepancies. We provide our assessment as to what is actually contained in this dataset.
Attacks: Real. Includes DoS and fuzzing fabrication attacks. According to the authors, the dataset also includes a masquerade attack, which they call an "impersonation attack," but it does not seem as though the target node was actually suspended, and the only message injections appear to be the spoofed remote frame responses. Thus, while this may be a useful simulation for testing whether their remote-frame–based IDS would detect a masquerade attack, it would not be useful for other IDSs that do not leverage this aspect of the protocol. Rough estimates of injection intervals are given.
Benefits: The fuzzing attack provided in this dataset is the slightly stealthier version that involves only spoofing IDs that appear in normal traffic. This is the only example of this kind of fuzzing attack in an open dataset. This is also the only open dataset with remote frames and responses.
Drawbacks: First and foremost, the injected messages are not labeled, and the documentation on the injection intervals is unclear and possibly incorrect. The authors indicate that in the DoS attack all 0x00 messages are injections (these take place during the entire capture), and that the fuzzy and impersonation attacks start after ∼250s. However, our analysis indicates that these attacks take place during the entire capture. Furthermore, the ∼250s figure disagrees with their paper, which depicts an injected message during the fuzzing attack at 0.1565s, well before 250s. Second, as explained above, the "impersonation attack," while characterized as a masquerade attack in their paper, does not seem to be a true masquerade attack, since the message transmission by the legitimate node does not appear to have been suspended. While it is possible that we misunderstood their documentation, our confusion on the matter and the various discrepancies we found are a testament to poor documentation. Finally, the presence of remote frame requests and responses results in small timing changes that are not usually present in ambient traffic, and may be problematic for testing and training a frequency-based detector. Overall, unless it is used for leveraging remote frames for an IDS, this dataset is not recommended.

HCRL SURVIVAL ANALYSIS DATASET FOR AUTOMOBILE IDS
Alongside their frequency-based CAN IDS paper, Han et al. [37] at HCRL published their dataset composed of three different vehicles (Hyundai Sonata, Kia Soul, Chevrolet Spark). On each car, they collected ambient data and ran three different types of injection attacks that caused the vehicles to malfunction. Note that a different version of this dataset is also published for an IDS challenge (http://ocslab.hksecurity.net/Datasets/datachallenge2019/car), which we do not include in our survey.
Attacks: Real. Includes all three flooding fabrication attack types: DoS (they call this "flooding"), fuzzing, and targeted ID (flooding delivery). Each attack capture is 25–100 seconds long, and contains between 1 and 4 five-second injection intervals. Each injected message is labeled.
Benefits: This is the only dataset that contains real attacks on multiple vehicles; furthermore, the same set of attacks is repeated multiple times on each one. One of the values of training and testing a frequency-based IDS on multiple vehicles is illustrated in Fig. 2, depicting a fuzzing attack mounted on four different vehicles (top three plots from this dataset). This illustrates how the bus load (% of time the bus is occupied with a message) can differ dramatically across vehicles, demonstrating that a frequency-based IDS must be adaptable to such differences. Additionally, authors
confirm that these attacks have a real effect on the vehicle. This dataset would be a good choice for training and testing a vehicle-agnostic, frequency-based detector.
Drawbacks: All of the attacks are basic and could be detected with a very simple frequency-based detector. Even among flooding fabrication attacks, these are particularly un-stealthy examples. See the top three plots of Fig. 2: the frequency of the entire bus is significantly disrupted by the exceptionally high injection frequency, which is not the case for the ORNL fuzzing attack (bottom plot of Fig. 2). The targeted ID attack, which they call the "malfunction" attack, is done somewhat blindly: there is no indication of what the function of the target IDs is, and the injected payloads (data fields) are chosen by either cycling through random values (on the Soul) or a single random value (on the Spark and Sonata). As for the ambient captures, only 60–90s of data are provided per vehicle, which is likely not sufficient for robust training, let alone for testing false positive rates. Finally, the ambient data and attack data are in differently formatted CSVs, which is undesirable.

FIGURE 2: Time between messages during fuzzing attacks on four different vehicles, with top three plots from each vehicle in the HCRL Survival Analysis Dataset, and the bottom plot from the ORNL dataset. While the injections (in red) result in a significant disruption in the overall message timing in the HCRL dataset, the fuzzing attack in the ORNL dataset does not, and would therefore be slightly more difficult to detect using a frequency-based IDS. This also illustrates that the bus load and overall message frequency distribution vary widely across vehicles.

HCRL CAR HACKING DATASET FOR INTRUSION DETECTION
The Car Hacking dataset is the most recent dataset released by HCRL and was used by these researchers to train and test two deep learning IDSs [36, 38]. The dataset includes ∼500s of ambient driving data, and four different attacks.
Attacks: Real. Includes all three flooding fabrication attack types: DoS, fuzzing, and two targeted ID attacks (flooding delivery) in which they spoof the drive gear and the RPM gauge. Each capture is upwards of 40 minutes and contains 300 attack intervals lasting 3–5 seconds. Each injected message is labeled.
Benefits: The attack captures are very long and contain a large number of instances per attack. Unlike the other two HCRL datasets, the function of the IDs in the targeted ID attacks is provided (drive gear and RPM gauge). This dataset seems to be the most widely used dataset in the CAN IDS research community, and thus clearly fills a very important niche. (Unfortunately, many recent publications seem clearly to be using this dataset without citation.)
Drawbacks: All the attack captures contain a significant artifact of data collection that may pose a problem for researchers using this data, particularly since it is not noted in the documentation. At the conclusion of each attack (soon after the 300th injection), there is a large gap during which it appears no messages are being transmitted. This is depicted for the DoS attack in Fig. 3, where there is a gap of 22s. In the other three captures containing attacks, this gap is much longer, on the order of 3000s. Berger et al. [40] also noted these jumps in timestamps. Having dealt with this issue in our own data collection efforts, we hypothesize that this message transmission gap arises from the bus going into a "stand-by" mode due to vehicle inactivity once the researchers stop injecting messages. We suggest that researchers using this dataset, particularly for a frequency-based IDS, trim the attack captures to right before the gap, at the 2328s, 2466s, 1949s, and 1952s marks for the DoS, Fuzzing, Gear, and RPM datasets, respectively.
While this relatively simple fix renders the dataset usable for testing, the hypothesized source of the issue, namely that the car wasn't being driven during the attacks (which we verified by decoding the signals), poses a problem that cannot be solved in post-processing. As the car is being driven in the ambient data, the test data fundamentally differs from the training data outside of the injections, making it an unsuitable test set. Given these issues, this dataset does not seem like a good choice, even for testing a simple detector. Similar to the HCRL survival dataset, these attacks are particularly un-stealthy with respect to disrupting overall bus timing (see Fig. 3). Finally, ambient and attack data are in different formats (fixed width format and CSV).

FIGURE 3: HCRL Car Hacking Dataset contains unintentional artifacts of data collection; in particular, in each of the four attack datasets, right after conclusion of the attack, there is a prolonged period during which no messages appear on the bus. This depicts the end of the DoS dataset, starting from the last four injection intervals (red), followed by ∼53s of ambient traffic (blue), and a ∼22s transmission gap before ambient messages resume again. Note that the first point after messages resume (with a ∆t ≈ 22.4s) has been omitted for scale. We hypothesize that this gap is due to the CAN bus going into a "stand-by" mode due to inactivity, that is, the vehicle is not being operated and no messages are being injected.

BOSCH SYNCAN
Hanselmann et al. [23] at Bosch GmbH (the company that created CAN), constructed a signal-translated, synthetic CAN dataset that they used for training and testing their CAN IDS, "CANet", and, noting the lack of a standard, sufficient
CAN benchmark dataset, Hanselmann et al. published theirs for the research community to use. Unlike all others, this dataset contains timestamped signal values rather than the raw binary data fields; thus, it is the only available dataset for evaluating a signal-based model.
Attacks: Simulated. Includes the following attack types: fabrication targeted ID (flooding delivery), suspension, and masquerade. Includes one capture of each attack, each containing a hundred 2–4s attack intervals. Each simulated intrusion signal is labeled.
Benefits: This is the sole signal-based dataset, which is a particularly impactful contribution since the encodings of signals into most vehicles' CAN data are unknown. It is clear from the treatment in the paper that Hanselmann et al. have true expertise in CAN protocol and data. This dataset contains the most nuanced masquerade attacks (e.g., signal values slowly drifting from a real to a target value, or replayed from normal traffic) currently available (including our dataset). Further, this is the only dataset (other than ours) that contains attacks targeting a single signal, rather than the full 64-bit data field. This allows for testing a very advanced, signal-based IDS.
Drawbacks: Synthetic data is clearly an imperfect proxy for real data; Hanselmann et al. train and test their IDS on the synthetic and real data, and remark that the former is "somewhat 'cleaner' than in the real case." As a separate issue, simulated attacks are inherently problematic since their effect on a vehicle cannot be verified. Additionally, the authors claim that all ambient data should be used for training but do not provide any additional ambient data for testing; thus, it would be difficult to test an IDS's false positive rate, an attribute of the utmost importance. Finally, since the authors do not provide a version of the data with CAN signals packed into frames, many CAN detectors, which require CAN data in the usual format (time-stamped IDs with 64-bit data fields), could not use this data.

TU EINDHOVEN LAB AUTOMOTIVE CAN BUS INTRUSION DATASET
This dataset was published by researchers in the Department of Mathematics and Computer Science at Eindhoven University of Technology (TU/e), who collected data from three different vehicles/CANs: an Opel Astra; a Renault Clio; and a CAN testbed, which consists of a VW instrument cluster, two Arduino boards (programmed to be a legitimate and a strongly compromised ECU, respectively) and a joystick programmed to replicate the throttle that sends messages used by the speedometer in the instrument cluster. They simulate a set of attacks on each CAN, and for all but one attack, simply augmented the recorded data in post-processing by doing the following: for fabrication attacks, they "added packets manually and adjusted timestamps accordingly"; for suspension attacks, they deleted particular frames; and for masquerade attacks, they replaced the data field of particular frames. The dataset also includes one real attack on their CAN testbed, a targeted ID fabrication attack. The joystick is used to send messages during normal traffic, and during the attack, messages with this ID are injected by the compromised ECU.
Attacks: Simulated. Includes the capability to create seven kinds of masquerade attacks. Specifically, given a start point, a target ID, and a set of contiguous bytes in the data field to target, the original value of each byte can be either replaced (by a constant value, random value, or an increasing/decreasing sequence over each frame) or incremented (by a specified value, or an increasing/decreasing sequence over each frame). The modified copy, containing the simulated injections, does not have attacks labeled, but as all modification parameters are passed by the user, these could presumably be determined.
Benefits: This dataset includes the only diagnostic protocol attack publicly available, and the only suspension attack (simulated) in real CAN data. (Recall that SynCAN contains suspension attacks but is signal data from a simulated CAN.) The same set of attacks is available for testing on multiple vehicles/CANs.
Drawbacks: First and foremost, adjusting timestamps in post-processing alters data and diminishes fidelity in a critical way: message timing on the bus is dependent on each ID's frequency and priority through the arbitration process. Thus, changing message timings risks creation of a synthetic dataset that is not realistic. Secondly, many attacks are unrealistic. For the DoS attack, they simply overwrite 10s worth of frames, which is not how real DoS attacks appear, and many attacks are far too short (e.g., 10 messages) with the injections often too dispersed to affect vehicle functionality in practice. With respect to the prototype, it is unclear how they generated ambient traffic (e.g., was it recorded and replayed from another car?), which clearly affects fidelity, and such a testbed is an imperfect proxy for a real vehicle. Finally, attack labels are in an unstructured text file, so there is no way of programmatically reading which packets were injected and when.

CRYSYS LAB CAN-LOG-INFECTOR & AMBIENT CAN TRACES
The Laboratory of Cryptography and System Security (CrySyS Lab) at Budapest University of Technology and Economics published an ambient CAN dataset (along with GPS data) from a comprehensive set of driving scenarios. They pair this data with their open-source CAN Log Infector tool (https://github.com/CrySyS/can-log-infector), which is used to manipulate data contents of the CAN log traces to simulate an attack. Using this tool, a variety of different masquerade attacks can be created in ambient CAN data by modifying only the data field of a specified target ID.
Attacks: Simulated. Includes the capability to create seven kinds of masquerade attacks. Specifically, given a start point, a target ID, and a set of contiguous bytes in the data field to target, the original value of each byte can either be replaced (by a constant value, random value, or
an increasing/decreasing sequence over each frame) or incremented (by a specified value, or an increasing/decreasing sequence over each frame). The modified copy, containing the simulated injections, does not have attacks labeled, but as all modification parameters are passed by the user, these could presumably be determined.
Benefits: While these authors do not provide any CAN data with attacks, they provide their framework for simulating a wide variety of masquerade attacks; this facilitates the creation of unlimited masquerade attacks (for example, combining different payload manipulation techniques simultaneously on different IDs) in any CAN data. Furthermore, this software is open source, and can be easily extended to add new attacks. Additionally, this is the only dataset with real CAN data that does not have only "injections" of a constant target value over time, and is the only dataset (other than ours) that allows for modifying only part of a data field, and thus enables targeting signals. Finally, other than ours, this is the only dataset furnished with descriptions of the driver's actions during ambient captures, which is highly valuable for training and testing an IDS.
Drawbacks: As attacks are added in post-processing, there is no guarantee that these attacks would actually affect vehicle function. Moreover, these attacks are completely blind to the function and signal mapping of particular target IDs and to the meaning of the values being modified. There are also logical problems with CAN-Log-Infector's implementation, most notably that whole bytes must be changed and all selected bytes must change uniformly. Since CAN-Log-Infector can only change each of the eight bytes in the data field, signals that do not exactly fill a set of bytes cannot be solely targeted (recall that payloads are composed of several signals of varying lengths and positions whose bits often cross byte boundaries). Furthermore, incrementing whole bytes means multi-byte signals will vary in a highly discontinuous manner. Finally, the method for specifying the injection interval is rather irksome: rather than specifying a starting timestamp, the user passes "a value between 0 and 1 (indicating) the ratio when the attack should start regarding the full length of the capture," and the attack end point cannot be specified.

IV. ORNL DATASET
The ORNL dataset (https://0xsam.com/road/, DOI: 10.13139/ORNLNCCS/1728694) consists of 33 attack captures totalling about 30 minutes, and 12 ambient captures containing about 3 hours of ambient data. These are enumerated in Table 2; attacks are described in more detail in Sec. IV-A; important syntactic metadata appears in Sec. IV-B, and further descriptions are provided in the documentation published with our dataset.
We collected CAN data using SocketCAN [28] software on a Linux computer with a Kvaser Leaf Light V2 connecting to the OBD-II port. All of the data is from a single vehicle, the make/model of which we do not disclose. The published data has been obfuscated in a way that maintains the anonymity of the vehicle, while preserving all important aspects of the data for an IDS (see Sec. IV-C). During all of the attacks, the vehicle was on a dynamometer, and was actively being driven. Ambient data was collected both on the dynamometer and on roads, while performing a variety of normal and sometimes unusual but benign driving activities (e.g., unbuckled seatbelt or opened door while driving).

A. ATTACKS
Attack captures are detailed below in order of expected difficulty to detect.

1) Fuzzing Attack
We mounted the less stealthy version of the fuzzing attack, injecting frames with random IDs (cycling in order from 0x000 to 0x255) with maximum payloads (0xFFFFFFFFFFFFFFFF) every .005s (as opposed to only injecting IDs seen in ambient data). Many physical effects of this attack were observed: accelerator pedal is impotent, dash and lights activated, seat positions move, etc. By injecting messages with maximal payload, we prevent incidental ECU bus-off.

2) Targeted ID Fabrication & Masquerade Attacks
We performed targeted ID fabrication attacks using the flam delivery, meaning a message is injected immediately after a legitimate message with the target ID is seen. As discussed in Section II-B, the flam technique allows for dynamic injection; that is, the legitimate ID message is read, only the bits corresponding to the target signal are modified with malicious values, and then this spoofed message is injected. When only part of the message is modified, we refer to this as targeting a signal, rather than an ID. Designing these attacks required reverse engineering of signals for this vehicle, which we completed using the CAN-D [26] signal reverse engineering algorithm and manually verifying the results. The targeted ID fabrication attacks and masquerade attacks are as follows:
• Correlated Signal: The single ID message communicating the four wheels' speeds (each is a two-byte signal) is injected with four false wheel speed values that are all pairwise very different. This effectively kills the car: it rolls to a stop and inhibits the driver from effecting acceleration, usually until the car is restarted.
• Max Speedometer: The speedometer signal (one byte) is targeted. We modify this signal value to be the maximum (0xFF), causing the speedometer to falsely display a maximum value.
• Max Engine Coolant Temperature: We target the engine coolant signal (one byte), modifying the signal value to be the maximum (0xFF). The physical effect is that an "engine coolant too high" warning light on the dash illuminates.
• Reverse Light: A binary (one bit) signal communicating the state of the reverse lights (on/off) is targeted.
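The signal-targeted flam delivery described above (read the legitimate frame, overwrite only the target signal's bits, inject the spoofed copy right after it) can be illustrated offline on a list of frames. The following is a minimal sketch under stated assumptions, not the authors' tooling: the function names `spoof_signal` and `flam_inject`, the big-endian bit numbering (bit 0 is the most significant bit of the first byte), and the example ID and bit offsets are all hypothetical, since the paper does not disclose signal positions.

```python
from typing import Iterable, Iterator, Tuple

# A frame is modeled as (arbitration ID, data field); timestamps are omitted.
Frame = Tuple[int, bytes]


def spoof_signal(payload: bytes, start_bit: int, length: int, value: int) -> bytes:
    """Return a copy of payload with one signal overwritten.

    Assumption: bits are numbered from the most significant bit of byte 0,
    and the signal occupies `length` consecutive bits from `start_bit`.
    """
    total_bits = len(payload) * 8
    if start_bit < 0 or length <= 0 or start_bit + length > total_bits:
        raise ValueError("signal does not fit inside the data field")
    if not 0 <= value < (1 << length):
        raise ValueError("value does not fit in the signal width")
    as_int = int.from_bytes(payload, "big")
    shift = total_bits - start_bit - length
    mask = ((1 << length) - 1) << shift
    # Clear the signal's bits, then write the malicious value; all other
    # bits of the legitimate payload are left untouched.
    spoofed = (as_int & ~mask) | (value << shift)
    return spoofed.to_bytes(len(payload), "big")


def flam_inject(frames: Iterable[Frame], target_id: int,
                start_bit: int, length: int, value: int) -> Iterator[Frame]:
    """Yield the traffic with a spoofed frame right after each target frame."""
    for can_id, data in frames:
        yield can_id, data
        if can_id == target_id:
            yield can_id, spoof_signal(data, start_bit, length, value)


if __name__ == "__main__":
    # Placeholder traffic: the ID and offsets are illustrative only.
    traffic = [(0x5E1, bytes.fromhex("893FE0070A000080")), (0x123, bytes(8))]
    for can_id, data in flam_inject(traffic, 0x5E1, start_bit=8, length=8, value=0xFF):
        print(f"{can_id:03X}#{data.hex().upper()}")
```

Working on the whole data field as a single integer lets the same helper handle a one-bit signal (such as a reverse light flag) or a signal that crosses byte boundaries, which is the case a byte-granular tool cannot target on its own.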
We perform two slight variations of the Reverse Light attack, where we manipulate the value to off (on), while the car is in Reverse (Drive), respectively. The effect is that the reverse lights do not reflect what gear the car is using.
For all of these targeted ID attacks, we provide two versions of the same CAN data captures: the original fabrication attack, and a version slightly modified in post-processing to simulate a masquerade attack. Refer to Table 2, which itemizes the fabrication/masquerade pairs. A subset of the fabrication/masquerade pairs is visualized in Table 3.

TABLE 2: Logs in ROAD CAN Intrusion Detection Dataset

Capture | Modified | # Logs
Accelerator Attack (In Drive) | No | 2
Accelerator Attack (In Reverse) | No | 2
Correlated Signal Fabrication Attack | No | 3
Correlated Signal Masquerade Attack | Yes | 3
Fuzzing Attack | No | 3
Max Engine Coolant Temp Fabrication Attack | No | 1
Max Engine Coolant Temp Masquerade Attack | Yes | 1
Max Speedometer Fabrication Attack | No | 3
Max Speedometer Masquerade Attack | Yes | 3
Reverse Light Off Fabrication Attack | No | 3
Reverse Light Off Masquerade Attack | Yes | 3
Reverse Light On Fabrication Attack | No | 3
Reverse Light On Masquerade Attack | Yes | 3
Dynamometer Various Ambient | No | 10
Road Various Ambient | No | 2

The fabrication attack versions are the original, unaltered captures, including both the legitimate target ID frames and the injected frames. Because these are real, physically verified attacks with the minimally occurring injected frames (due to the flam delivery), they provide perhaps the best (i.e., most stealthy/most difficult to detect), current, public data for testing frequency-based IDSs.
The masquerade attack versions provide a more advanced version of the attack, in which we remove the legitimate target ID frames preceding each injected frame, simulating a masquerade attack. In effect, this removes message conflict in the data, and makes it appear as though only the spoofed messages are present during the injection interval. With these masquerade datasets, frequency-based approaches will almost certainly fail to provide accurate detection. It is important to note that while the masquerade aspect is simulated through post-processing, this means of alteration avoids problematic issues with synthetic data. Namely, the effect of the attack on the vehicle was physically verified; every message appearing in the data was actually seen by the car in the order it appears in the data; and no aspect of CAN protocol was violated. As discussed in the introduction, there are no publicly available, real CAN data captures with real masquerade attacks, and the extreme hacking skill required to implement such an attack on a real vehicle seems to be preventing CAN IDS researchers from implementing such an attack. This provides the highest fidelity alternative possible.

3) Accelerator Attacks
The "accelerator attack" is an advanced attack that does not fit into the general framework of injection attacks introduced in Sec. II-B. This attack exploits a vulnerability particular to the vehicle make/model that puts the ECUs into a compromised state. We have responsibly disclosed this vulnerability to the OEM, and will not disclose details of how to implement this attack.
We do not include the CAN data during the exploit. After the exploit, the vehicle is in a state in which the driver has less control, as follows:
• When put into drive gear, the vehicle accelerates to a fixed speed and then holds this speed (regardless of accelerator pedal position or cruise control setting).
• In reverse, the vehicle accelerates to a (different) fixed speed and holds this speed (regardless of accelerator pedal position or cruise control setting).
• Cruise control is disabled.
• Touching the brake pedal results in the acceleration ceasing and the brakes engaging normally.
• When the brake is released, the vehicle commences to accelerate as described above.
The Accelerator Attack captures have no injected messages, but simply record the CAN data when the vehicle is in this state. Discrepancies between the driver inputs (e.g., pressing the accelerator pedal) and the vehicle's actions are present.

B. SYNTACTIC DESCRIPTION
All of the CAN data files are logged using the standard can-utils [41] candump format:

    (1569510697.667343) can0 5E1#893FE0070A000080

Here 1569510697.667343 is the Unix timestamp, can0 is the channel, 5E1 is the ID (hex), and 893FE0070A000080 is the data field (hex).
Note that all data fields in these logs contain the full 8 bytes, which we padded with zeros if necessary. The channel is always can0, so this column can be dropped. We provide metadata (in JSON format) for each capture, including a general description of driving activities, the length of the capture in seconds, and whether or not the car was on the dynamometer. For attack captures, we also include whether the capture was modified (i.e., masquerade attacks), the injection ID and data field, and the interval of injection (start, end) corresponding to the time of the first/last injected message in elapsed seconds. Importantly, we do not label individual messages as attack/normal, because the software we used to collect did not have that capability. However, with injection ID, data, and intervals, these can be labeled in post-processing fairly easily. Examples of the provided metadata are shown in Fig. 4.
We use a wildcard character "X" in the injection_data_str field when a signal was targeted (only some, not all bytes in the message are manipulated) to indicate that the byte in the given position was not modified in the injection. Similarly, "X" appears in the injection_id field to indicate that no particular ID was targeted, which is only the case in the
fuzzing attack. For the accelerator attack, the injection_id and injection_data_str are null, and the injection interval is just the start and end time of the capture. (This will be included in the full documentation.)

    correlated_signal_attack_1: {
        description: "start from driving; accelerate; start injecting; car rolls to stop; stop injecting; accelerate"
        elapsed_sec: 33.101852,
        injection_id: "0x6e0",
        injection_data_str: "595945450000FFFF",
        injection_interval: [9.191851, 30.050109],
        modified: False,
        on_dyno: True }

    ambient_dyno_drive_basic_short: {
        description: "start from park; basic drive activities (e.g., drive; accelerate; brake; reverse; ect.)"
        elapsed_sec: 444.75061,
        on_dyno: True }

FIGURE 4: Snippet of metadata provided for each capture, with an example of an attack entry (first) and an ambient entry (second).

C. OBFUSCATION
While other public CAN datasets provide information on the make, model, and year of the vehicles attacked, it would be irresponsible, given our previous disclosure, to release such information. Furthermore, we have taken steps to obfuscate the CAN data in such a way as to preserve the characteristics necessary for CAN IDS development, while ideally preventing users from knowing the make, model, and year of the vehicle. Below we itemize the augmentations performed on the data to preserve anonymity:
• Absolute timestamps may be all shifted by a scalar, but relative times are preserved.
• Messages from particular arbitration IDs that were deemed unimportant were replaced with the "filler message" FFF#0000000000000000 (ID#Data in hex). Relative timestamps are still preserved for these messages.
• Messages on reserved IDs (greater than 0x700: e.g., diagnostic messages) have been removed.
• Arbitration IDs have been anonymized in such a way that arbitration order/priority is not preserved. There is a one-to-one mapping between the original and the anonymized IDs for a given vehicle (not including the "filler messages" under ID 0xFFF). For example, if ID 0x10 is converted to ID 0x821 in an anonymized log, every anonymized log for the given vehicle will have ID 0x10 converted to ID 0x821.
• Data fields have been scrambled in such a way that signals have been preserved, and fields are scrambled in a consistent way for each ID/Vehicle. For example, if the first byte is moved to the end of the field for ID 0x10 on a given vehicle, it will be shifted this way in all messages from ID 0x10 in all logs from that vehicle.

V. DISCUSSION
By design, our dataset contains attacks requiring detectors of varying types and sophistication. The desired alteration of vehicle functionality was physically verified for all attacks included.
To illustrate characteristics of the data, refer to Table 3, which includes three different visualizations of six attacks presented roughly in order of difficulty to detect (using a non-timing–based IDS).
• The Correlated Signal attacks break the relationship between all the usually correlated wheel speed signals in the message, and cause signal discontinuities as well.
• The Max Speedometer attacks cause a signal discontinuity for the target signal without affecting other portions of the message.
• The Reverse Light Off attack breaks relationships of CAN signals.
Details and takeaways are documented in the caption.
Ambient dynamometer driving includes activities (e.g., reverse, drive, accelerate, brake) and can be used for training detectors and/or testing for false positives.
The fuzzing attack should be very apparent to most detectors, and should be accurately detectable with timing/frequency approaches. As discussed above, by using the flam technique (one injected message right after each ambient message, and only necessary bytes in the data field manipulated), our dataset provides the stealthiest possible fabrication attacks. This is the hardest attack we expect a timing/frequency-based detector to accurately detect. These fabrication attacks enhance the quality in terms of stealth and realism over what is currently available.
Next, each fabrication attack is accompanied by a masquerade version where the ambient messages in 1:1 correspondence with the injected messages are synthetically removed after the capture. As discussed previously, while Miller & Valasek have exhibited masquerade attacks on real vehicles (even remotely!), the hacking expertise and time required has stymied any defensive researchers from acquiring such data. Our dataset provides the best possible alternative: real data with real attacks that are physically verified and that seem able to reproduce a bona fide masquerade attack. These should be T.O. attacks, not detectable by timing-based means. We believe that while these attacks are T.O., they are likely detectable by intra-signal (or per-signal) models, that is, detectors modeling each signal's time series.
Finally, the Accelerator Attacks are unique examples of CAN data from a functioning vehicle with only the vehicle's ECUs transmitting messages, but in a compromised manner. There is no disruption of the timing of the CAN messages, thus this attack should be T.O. This attack permits adequate testing of advanced IDSs that must rely on some understanding of the payloads and their correlations with each other.
Having data from a wide variety of real vehicles in actual road driving conditions with simple to very sophisticated attacks is of course ideal; thus, many limitations to this dataset are indeed present. Notably, we only present data from a single vehicle. Secondly, the dataset includes only dynamometer data, which is well known to cause subtle differences from actual road driving. Thirdly, we provide attack intervals rather than labeling each message. Finally, our mas-
captures in our dataset. The three attack types in Table 3 are querade attacks were the best possible versions we (as non-

TABLE 3: Depiction of six of the targeted ID & masquerade attacks in the ORNL dataset, showing both the Timing Transparent (fabrication, with message confliction) and Timing Opaque (masquerade, without message confliction) versions for three attack types. The x-axis of all plots is elapsed time (s), and the red dashed lines demarcate the attack interval. The three main columns visualize different aspects of each of the six attacks. Message Timing: Inter-message arrival time (ms) between all messages is shown in the top All Messages subplot, and between only the target ID messages in the bottom Target ID Messages (Near Attack Start) subplot, which zooms in from 15 s before to 20 s after the attack start. Blue dots/red x’s indicate legitimate/injected messages. Compare the six All Messages (top) subplots with Fig. 2 to see that overall bus timing is nearly undisturbed, whereas previous attacks are blatant. The six Target ID Messages (bottom) subplots illustrate that the fabrication attacks (using flam injection delivery) cause unusually short inter-message times for the target ID, while masquerade attacks do not cause perceptible timing changes. Target ID Data Field: Time series of 64-bit binary data during the time period near the attack start (black denotes 1s, white denotes 0s). If only part of the message was altered (i.e., one target signal), the section of altered bits is delimited with red solid lines. Through visual inspection, fabrication attacks are more obvious due to message confliction, and both fabrication and masquerade attacks are more noticeable when the entire message is targeted (e.g., Correlated Signal Attack), rather than just a single signal. Target ID Signals: The time series of signals in the target ID message are depicted, annotated with signal names and bit ranges, which are made boldface for target signals (note that not all non-target signals in the message may be shown). Notice that the Max Speedometer and Reverse Light Off attacks target different signals in the same ID. While even the masquerade versions of the first two attack types are somewhat visually identifiable at the signal level due to discontinuities and extreme values, the Reverse Light Off Attack targeting a 1-bit signal is difficult to discern without understanding more complex signal relationships or examining signals in other messages.
[Table 3 plots not reproduced. Columns: Message Timing; Target ID Data Field (Near Attack Start); Target ID (Select) Signals. Rows: Correlated Signal Attack, Max Speedometer Attack, and Reverse Light Off Attack, each shown in a Fabrication and a Masquerade version.]
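The timing contrast described in the Table 3 caption can be illustrated with a minimal Python sketch. It is not the procedure used to build the dataset: the `Frame` record, its `injected` label, the 0.5 ms injection gap, and the 100 ms period of the target ID are all assumptions made for illustration. The sketch injects one frame right after each ambient target-ID frame (flam delivery), derives a masquerade version by removing the ambient frame paired with each injected frame, and compares inter-message times for the target ID.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    t: float         # timestamp in seconds
    can_id: int      # arbitration ID
    data: bytes      # payload
    injected: bool   # hypothetical ground-truth label, not a field of the dataset


def flam_fabrication(ambient: List[Frame], target_id: int, payload: bytes,
                     start: float, gap: float = 0.0005) -> List[Frame]:
    """Inject one frame right after each ambient target-ID frame from `start` on."""
    out: List[Frame] = []
    for f in ambient:
        out.append(f)
        if f.can_id == target_id and f.t >= start:
            out.append(Frame(f.t + gap, target_id, payload, True))
    return out


def to_masquerade(fabricated: List[Frame], target_id: int) -> List[Frame]:
    """Remove each ambient target-ID frame that is immediately followed by
    its injected counterpart (assumes the log is ordered as produced above)."""
    out: List[Frame] = []
    for i, f in enumerate(fabricated):
        nxt = fabricated[i + 1] if i + 1 < len(fabricated) else None
        paired = (not f.injected and f.can_id == target_id
                  and nxt is not None and nxt.injected
                  and nxt.can_id == target_id)
        if not paired:
            out.append(f)
    return out


def inter_arrival_ms(frames: List[Frame], target_id: int) -> List[float]:
    """Inter-message times (ms) between consecutive frames of one arbitration ID."""
    ts = [f.t for f in frames if f.can_id == target_id]
    return [(b - a) * 1000.0 for a, b in zip(ts, ts[1:])]


# Toy ambient log: ID 0x10 every 100 ms, ID 0x20 offset by 50 ms (assumed values).
ambient: List[Frame] = []
for i in range(6):
    ambient.append(Frame(i * 0.1, 0x10, bytes(8), False))
    ambient.append(Frame(i * 0.1 + 0.05, 0x20, bytes(8), False))

fabricated = flam_fabrication(ambient, 0x10, b"\xff" * 8, start=0.2)
masquerade = to_masquerade(fabricated, 0x10)

fab_gaps = inter_arrival_ms(fabricated, 0x10)
masq_gaps = inter_arrival_ms(masquerade, 0x10)
print("fabrication min gap (ms):", min(fab_gaps))
print("masquerade  min gap (ms):", min(masq_gaps))
```

In this toy log the fabrication version produces sub-millisecond gaps for the target ID, while the masquerade version stays close to the nominal period, which is consistent with the expectation that timing-based detectors will miss it.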

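The anonymization steps itemized before Section V (timestamp shift, filler messages, removal of reserved IDs, ID remapping, and consistent data-field scrambling) could be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' tooling: the tuple-based `Frame` type, the helper names `build_id_map` and `anonymize`, the random choice of new IDs, and the use of a per-ID byte rotation as the scrambling rule are all hypothetical. The text only gives "first byte moved to the end" as an example of scrambling.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

Frame = Tuple[float, int, bytes]   # (timestamp in s, arbitration ID, data field)

FILLER_ID = 0xFFF                  # filler message FFF#0000000000000000
FILLER_DATA = bytes(8)
RESERVED_ABOVE = 0x700             # IDs greater than this are removed


def build_id_map(ids: Iterable[int], seed: int) -> Dict[int, int]:
    """One-to-one ID mapping, fixed per vehicle by `seed`. New IDs are drawn
    at random, so arbitration order is generally not preserved (assumed scheme)."""
    unique_ids = sorted(set(ids))
    rng = random.Random(seed)
    new_ids = rng.sample(range(FILLER_ID), len(unique_ids))  # never yields 0xFFF
    return dict(zip(unique_ids, new_ids))


def anonymize(log: List[Frame], unimportant: Set[int], id_map: Dict[int, int],
              rotation: Dict[int, int], shift: float) -> List[Frame]:
    """Apply the itemized steps to one log; `rotation` gives the per-ID byte
    rotation used here as a stand-in for the undisclosed scrambling rule."""
    out: List[Frame] = []
    for t, can_id, data in log:
        if can_id > RESERVED_ABOVE:
            continue                                   # reserved/diagnostic IDs removed
        if can_id in unimportant:
            out.append((t + shift, FILLER_ID, FILLER_DATA))  # keeps relative timing
            continue
        k = rotation.get(can_id, 0) % len(data) if data else 0
        out.append((t + shift, id_map[can_id], data[k:] + data[:k]))
    return out


# Toy log with made-up payloads; 0x10 -> 0x821 is the example mapping from the text.
log: List[Frame] = [
    (10.00, 0x10, bytes.fromhex("0102030405060708")),
    (10.01, 0x7DF, bytes.fromhex("02010D0000000000")),
    (10.02, 0x33, bytes.fromhex("AABBCCDDEEFF0011")),
    (10.10, 0x10, bytes.fromhex("1112131415161718")),
]
anon = anonymize(log, unimportant={0x33}, id_map={0x10: 0x821},
                 rotation={0x10: 1}, shift=-10.0)
for t, can_id, data in anon:
    print(f"{t:.3f} {can_id:03X}#{data.hex().upper()}")
```

Reusing the same `id_map` and `rotation` for every log of a vehicle is what makes the mapping consistent across logs, as the list requires.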
professionals in offensive security) could provide; as such, they rely on a small amount of simulation.
Nearly all detectors in the literature that hold the promise of detecting T.O. attacks have yet to be tested on an appropriate dataset. The previous two types of attacks provide the first such real data for these detectors to be validated. It is our hope that this dataset, and hopefully others to follow, will allow for better comparison and validation of these proposed detectors, and perhaps contribute to other advances in CAN bus security.

ACKNOWLEDGMENT
Special thanks to Suzanne Parete-Koon and Ross Miller for assistance in posting the dataset online. Thanks to Pablo Moriano, Stacy Prowell, & John Baston for helping us polish this document. Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy.

REFERENCES
[1] K. Koscher et al., “Experimental security analysis of a modern automobile,” in 2010 IEEE S&P. IEEE, 2010.
[2] S. Checkoway, D. McCoy, B. Kantor, D. Anderson, H. Shacham, S. Savage, K. Koscher, A. Czeskis, F. Roesner, and T. Kohno, “Comprehensive experimental analyses of automotive attack surfaces,” in Proceedings of the 20th USENIX conference on Security, ser. SEC’11. USENIX Association, Aug 2011.
[3] T. Hoppe, S. Kiltz, and J. Dittmann, “Security threats to automotive CAN networks practical examples and selected short-term countermeasures,” Reliability Engineering & System Safety, vol. 96, Jan 2011.
[4] S. Woo, H. J. Jo, and D. H. Lee, “A practical wireless attack on the connected car and security protocol for in-vehicle CAN,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–14, Sep 2014.
[5] C. Valasek and C. Miller, “Remote exploitation of an unaltered passenger vehicle,” p. 91, Aug 2015.
[6] C. Valasek and D. C. Miller, “CAN message injection,” p. 29, Jun 2016.
[7] K.-T. Cho and K. G. Shin, “Error handling of in-vehicle networks makes them vulnerable,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1044–1055.
[8] T. K. S. Lab, Experimental Security Research of Tesla Autopilot, Mar 2019.
[9] W. Wu, R. Li, G. Xie, J. An, Y. Bai, J. Zhou, and K. Li, “A survey of intrusion detection for in-vehicle networks,” 2019.
[10] S.-F. Lokman, A. T. Othman, and M.-H. Abu-Bakar, “Intrusion detection system for automotive controller area network (CAN) bus system: a review,” EURASIP Journal on Wireless Communications and Networking, vol. 2019, no. 1, p. 184, Jul 2019.
[11] G. Loukas, E. Karapistoli, E. Panaousis, P. Sarigiannidis, A. Bezemskij, and T. Vuong, “A taxonomy and survey of cyber-physical intrusion detection approaches for vehicles,” Ad Hoc Networks 2019, vol. 84, p. 27, Mar 2019.
[12] G. K. Rajbahadur, A. J. Malton, A. Walenstein, and A. E. Hassan, “A survey of anomaly detection for connected vehicle cybersecurity and safety,” p. 6, 2018.
[13] U. E. Larson, D. K. Nilsson, and E. Jonsson, “An approach to specification-based attack detection for in-vehicle networks,” in 2008 IEEE Intelligent Vehicles Symposium. IEEE, Jun 2008, pp. 220–225. [Online]. Available: http://ieeexplore.ieee.org/document/4621263/
[14] N. Salman and M. Bresch, “Design and implementation of an intrusion detection system (IDS) for in-vehicle networks,” Ph.D. dissertation, Chalmers University of Technology, 2017. [Online]. Available: http://publications.lib.chalmers.se/records/fulltext/251871/251871.pdf
[15] K.-T. Cho and K. G. Shin, “Viden: Attacker identification on in-vehicle networks,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17. ACM, 2017, pp. 1109–1123. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134001
[16] W. Choi, K. Joo, H. J. Jo, M. C. Park, and D. H. Lee, “Voltageids: Low-level communication characteristics for automotive intrusion detection system,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 8, pp. 2114–2129, Aug 2018.
[17] M. R. Moore, R. A. Bridges, F. L. Combs, M. S. Starr, and S. J. Prowell, “Modeling inter-signal arrival times for accurate detection of CAN bus signal injection attacks: a data-driven approach to in-vehicle intrusion detection,” in Proceedings of the 12th Annual Conference on Cyber and Information Security Research - CISRC ’17. ACM Press, 2017, pp. 1–4. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3064814.3064816
[18] A. Tomlinson, J. Bryans, S. A. Shaikh, and H. K. Kalutarage, “Detection of automotive CAN cyber-attacks by identifying packet timing anomalies in time windows,” Jun 2018, pp. 231–238. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8416254
[19] Y. Hamada, M. Inoue, H. Ueda, Y. Miyashita, and Y. Hata, “Anomaly-based intrusion detection using the density estimation of reception cycle periods for in-vehicle networks,” SAE International Journal of Transportation Cybersecurity and Privacy, vol. 1, no. 1, pp. 39–56, May 2018.
[20] A. Taylor, S. Leblanc, and N. Japkowicz, “Anomaly detection in automobile control network data with long short-term memory networks,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, Oct 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7796898/
[21] M.-J. Kang and J.-W. Kang, “Intrusion detection system using deep neural network for in-vehicle network security,” PLOS ONE, vol. 11, no. 6, Jun 2016.
[22] M. Marchetti, D. Stabili, A. Guido, and M. Colajanni, “Evaluation of anomaly detection for in-vehicle networks through information-theoretic algorithms,” in 2016 IEEE 2nd International Forum on Research and Technologies for Society and Industry Leveraging a better tomorrow (RTSI). IEEE, Sep 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7740627/
[23] M. Hanselmann et al., “CANet: An unsupervised intrusion detection system for high dimensional CAN bus data,” IEEE Access, vol. 8, 2020.
[24] S. Nair Narayanan, S. Mittal, and A. Joshi, “Obd_securealert: An anomaly detection system for vehicles,” May 2016. [Online]. Available: https://ieeexplore.ieee.org/document/7501710
[25] A. R. Wasicek, M. D. Pese, and A. Weimerskirch, “Context-aware intrusion detection in automotive control systems,” p. 14, 2017.
[26] M. E. Verma, R. A. Bridges, J. J. Sosnowski, S. C. Hollifield, and M. D. Iannacone, “CAN-D: A modular four-step pipeline for comprehensively decoding controller area network data,” arXiv:2006.05993 [cs, eess], Jun 2020, arXiv: 2006.05993. [Online]. Available: http://arxiv.org/abs/2006.05993

[27] “Comma AI,” https://comma.ai/.
[28] “SocketCAN,” https://python-can.readthedocs.io/en/master/interfaces/socketcan.html.
[29] Z. Tyree, R. A. Bridges, F. L. Combs, and M. R. Moore, “Exploiting the shape of CAN data for in-vehicle intrusion detection,” arXiv:1808.10840 [cs], Aug 2018, arXiv: 1808.10840. [Online]. Available: http://arxiv.org/abs/1808.10840
[30] K. Pawelec, R. A. Bridges, and F. L. Combs, “Towards a CAN IDS based on a neural network data field predictor,” in Proceedings of the ACM Workshop on Automotive Cybersecurity, ser. AutoSec ’19. ACM, 2019, pp. 31–34, event-place: Richardson, Texas, USA. [Online]. Available: http://doi.acm.org/10.1145/3309171.3309180
[31] “Automotive buses,” https://training.dewesoft.com/online/course/automotive-buses-can-measurement.
[32] R. Bosch et al., “CAN specification version 2.0,” Robert Bosch GmbH, Postfach, vol. 300240, p. 72, 1991.
[33] W. Voss, A Comprehensible Guide to Controller Area Network, 2008.
[34] K.-T. Cho and K. G. Shin, “Fingerprinting electronic control units for vehicle intrusion detection,” in 25th USENIX Security Symposium (USENIX Security 16), 2016, pp. 911–927.
[35] H. Lee et al., “OTIDS: A novel intrusion detection system for in-vehicle network by using remote frame,” in PST. IEEE, 2017.
[36] E. Seo, H. M. Song, and H. K. Kim, “Gids: Gan based intrusion detection system for in-vehicle network,” in 2018 16th Annual Conference on Privacy, Security and Trust (PST), Aug 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8514157
[37] M. L. Han, B. I. Kwak, and H. K. Kim, “Anomaly intrusion detection method for vehicular networks based on survival analysis,” Vehicular Communications, vol. 14, Oct 2018.
[38] H. M. Song, J. Woo, and H. K. Kim, “In-vehicle network intrusion detection using deep convolutional neural network,” Vehicular Communications, vol. 21, Jan 2020.
[39] G. Dupont, A. Lekidis, J. den Hartog, and S. Etalle, “Automotive controller area network (CAN) bus intrusion dataset v2,” 2019. [Online]. Available: https://data.4tu.nl/repository/uuid:b74b4928-c377-4585-9432-2004dfa20a5d
[40] I. Berger, R. Rieke, M. Kolomeets, A. Chechulin, and I. Kotenko, Comparative Study of Machine Learning Methods for In-Vehicle Intrusion Detection. Springer International Publishing, Jan 2019, vol. 11387, pp. 85–101. [Online]. Available: http://link.springer.com/10.1007/978-3-030-12786-2_6
[41] “can-utils,” https://github.com/linux-can/can-utils.
