I, for one, welcome our new Cyber Overlords!
An introduction to the use of data science in cybersecurity
By Tiago Henriques, Filipa Rodrigues, Florentino Bexiga, Ana Barbosa
Agenda
WHO ARE WE?
MACHINE LEARNING AND CYBERSECURITY
IMAGE WORKFLOW
IMAGE ANALYSIS IN DETAIL
DATA VISUALISATION
Tiago is the CEO and Data Necromancer at
BinaryEdge; however, he gets to meddle in the
intersection of data science and cybersecurity
by providing his team with lovely problems that
they solve on a daily basis.
Tiago Henriques
Presenter
Florentino is the Data MacGyver at
BinaryEdge. On a daily basis he needs to
deploy infrastructure used to analyse big
and realtime data. When not doing that, he
can be found creating models to analyse
data. Give him an orange, he’ll give you a
skynet. Why an orange you ask? He’s
hungry and likes oranges, there!
Florentino Bexiga
Presenter
Filipa is the Data Diva at BinaryEdge. She
dances the Macarena with numbers to get
them to tell her all their dirty secrets.
Filipa Rodrigues
Presenter
Ana is the Data Ferret at BinaryEdge.
She is small and hides between the 110th
and 111th characters of the ASCII code to
see and show data from the unique
perspective of someone who can’t reach
the box of cookies stored on top of the
capital 'I'.
Ana Barbosa
Presenter
Earlier today
BinaryEdge
HACKING SKILLS
SECURITY DOMAIN EXPERTISE
STATISTICS KNOWLEDGE
MACHINE LEARNING
TRADITIONAL RESEARCH
DANGER ZONE!
DATA SCIENCE
Source: Data-Driven Security: Analysis, Visualization and Dashboards (adapted)
How we got here...
200 port scan of the entire internet / month
1,400,000,000 scanning events / month *
746,000 torrents monitored and increasing
1,362,225,600 torrent events / month
* at a minimum
Worldwide distribution of IPs running services
Number of IPs found:
<= 100
100 < #found <= 1,000
1,000 < #found <= 10,000
10,000 < #found <= 100,000
100,000 < #found < 1,000,000
>= 1,000,000
Map IPv4 addresses to Hilbert curves
% of coverage
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
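The slide does not show how the mapping is done, so the following is only a minimal sketch of one common approach: treat the leading bits of an IPv4 address as a distance along a Hilbert curve and convert that distance to grid coordinates. The function names `d2xy` and `ipv4_to_cell`, and the choice of a 256 x 256 grid (one cell per /16 block), are assumptions made for illustration.

```python
import ipaddress


def d2xy(side, d):
    """Convert distance d along a Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive distances stay adjacent.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ipv4_to_cell(address, order=8):
    """Map an IPv4 address to a cell of a 2**order x 2**order Hilbert grid.

    Assumption: the top 2 * order bits of the address select the cell,
    so order=8 gives one cell per /16 block.
    """
    value = int(ipaddress.IPv4Address(address))
    d = value >> (32 - 2 * order)
    return d2xy(1 << order, d)
```

Because neighbouring distances land in neighbouring cells, adjacent address blocks stay close together on the map, which is what makes a coverage heat map over this layout readable.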
Data Science & Machine Learning
How many IP addresses did job X have vs. job Y?
What is the average duration of the scans?
Can we extract more from all the screenshots we get?
Can we have a more optimized job distribution?
We can only identify X% of services because we’re
using static signatures. Can we do better?
Can we find similar images?
MULTIPLE WILD QUESTIONS APPEAR...
...ONE COMMON ANSWER: DATA SCIENCE & MACHINE LEARNING
Data Science & Machine Learning
DATA SCIENCE: INITIAL ANALYSIS AND CLEAN UP, EXPLORATORY DATA ANALYSIS, DATA VISUALISATION, KNOWLEDGE DISCOVERY
MACHINE LEARNING: CLASSIFICATION, CLUSTERING, SIMILARITY MATCHING, REGRESSION, IDENTIFICATION
Problems and Limitations of Machine Learning in Cybersecurity
Lots of adversarial scenarios – attacks on the classifiers go against the foundations of machine learning
Prediction – scenarios and data are too volatile, and there are not enough proper sources of data
Lack of data, in both quantity and quality, to train models
Good use cases
PATTERN DETECTION/OUTLIER DETECTION (IDS/IPS): if a computer is hacked, certain behaviors will change; if data is constantly monitored and fed into a system, the hack could be detected
ANTIVIRUS: further work needs to be done, but this will allow antivirus to move from a static, signature-based system to a much improved dynamic, learning-based system
ANTI-SPAM
SMARTER FUZZERS
SOURCE CODE ANALYSIS: detection of vulnerable patterns during development
INTERNAL ATTACKERS: sentiment analysis applied to emails, tweets and social networks of employees
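The outlier detection use case (monitored behaviour that changes after a compromise) can be illustrated with a very small sketch. The slides do not name a technique, so this assumes a robust modified z-score based on the median absolute deviation; `flag_outliers`, the 3.5 threshold and the sample metric are illustrative choices, not the presenters' method.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single extreme reading does not hide itself by inflating
    the spread estimate.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # No spread at all: anything different from the median is unusual.
        return [i for i, v in enumerate(values) if v != median]
    return [
        i
        for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]


# Hypothetical metric: outbound connections per minute for one host.
connections_per_minute = [120, 118, 125, 122, 119, 121, 950, 123]
suspicious = flag_outliers(connections_per_minute)
```

In practice the adversarial problem noted earlier still applies: an attacker who knows the baseline can stay under the threshold, so a sketch like this is a starting point rather than a detector.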
Data points (binaryedge.io, 2016): metadata, files, people, photos, family & friends, behaviour, social, search, company, registration, ip address, url address, news, forums, sub-reddits, internal, external, phone, email, linked urls, likes, topics, BGP, AS, whois, AS membership, AS peer, list of IPs, shared infrastructure, co-hosted sites, contact, geolocation, office locations, social networks, phone, portscan, dns, torrents, domains, AXFR, MX records, screenshots, web, services, http, https, webserver, framework, headers, cookies, certificate, configuration, authorities, entities, SMB, VNC, RDP, users, apps, files, peers, torrent name, OCR, SW, banners, image classifier, vulnerabilities
Torrent Correlation
China or Military
Data correlation
Turkish IP
I FOR ONE WELCOME OUR NEW CYBER OVERLORDS! AN INTRODUCTION TO THE USE OF MACHINE LEARNING IN CYBERSECURITY
DEMO
At PixelsCamp
Microservices (REST API)
PORT, WORD, TAG, FACE, COUNTRY, LOGO, IP
Scan
SCAN GENERATES EVENTS
DOES IT GENERATE A SCREENSHOT?
YES: STORE THE IMAGE FILE ON THE CLOUD, THEN GENERATE A NOTIFICATION THAT A NEW IMAGE WAS UPLOADED
NO: FINISH
Image Workflow
Stages: INITIALIZER, FILTER, LOGO DETECTION, FACE DETECTION, OPTICAL CHARACTER RECOGNITION (OCR)
Image Workflow: INITIALIZER
PULL MESSAGE FROM QUEUE
IS THERE A NEW IMAGE?
YES: DECRYPT AND STORE IMAGE METADATA ON A DATABASE, THEN GENERATE IMAGE SIGNATURE FOR SIMILARITY COMPARISON
NO: FINISH
MESSAGE QUEUE
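The slides do not say which signature is generated for the similarity comparison. As one possible illustration, the sketch below assumes an average hash: shrink the image to an 8 x 8 grid of block averages, set one bit per cell depending on whether it is brighter than the mean, and compare two signatures by Hamming distance. The names `average_hash` and `hamming_distance` are hypothetical, and the image is taken as a plain 2D list of grayscale values to keep the example dependency-free.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit average hash of a grayscale image.

    pixels is a list of rows of brightness values; the image must be at
    least size x size pixels so that every block contains a pixel.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(size):
        y0, y1 = row * height // size, (row + 1) * height // size
        for col in range(size):
            x0, x1 = col * width // size, (col + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the image mean.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two signatures."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two signatures suggests visually similar screenshots; the cut-off for "similar" would have to be tuned on real data.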
Image Workflow: FILTER
PULL MESSAGE FROM QUEUE
PERFORM SIMPLE ENTROPY FILTERING: DOES THE IMAGE HAVE ANY INFORMATION?
YES: MESSAGE QUEUE
NO: FINISH
Image Workflow: LOGO DETECTION, FACE DETECTION, OCR
PULL MESSAGE FROM QUEUE
ENHANCE IMAGE WITH APPLICATION OF SOME FILTERS
RUN FACE AND LOGO DETECTION AND OCR ALGORITHMS
STORE RESULTS IN DATABASE
PERFORM ADDITIONAL ACTIONS WITH THE RESULTS
Image Workflow
[{"BreachDate": "2013-10-04",
  "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"],
  "Title": "Adobe",
  "IsActive": true,
  "Description": "In October 2013, 153 million Adobe accounts were breached with each containing an internal ID, username, email, <em>encrypted</em> password and a password hint in plain text. The password cryptography was poorly done and <a href=\"http://stricture-group.com/files/adobe-top100.txt\" target=\"_blank\">many were quickly resolved back to plain text</a>. The unencrypted hints also <a href=\"http://www.troyhunt.com/2013/11/adobe-credentials-and-serious.html\" target=\"_blank\">disclosed much about the passwords</a> adding further to the risk that hundreds of millions of Adobe customers already faced.",
  "Domain": "adobe.com",
  "AddedDate": "2013-12-04T00:00:00Z",
  "PwnCount": 152445165,
  "IsRetired": false,
  "IsVerified": true,
  "LogoType": "svg",
  "IsSensitive": false,
  "Name": "Adobe"}]
Email
DataLeak API
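As a rough illustration of the "additional actions" step, a response shaped like the record above could be reduced to the few fields an analyst cares about. This is only a sketch: how the email address reaches the DataLeak API is not shown on the slide, the sample payload is a trimmed copy of the record, and `summarise_breaches` is a hypothetical helper.

```python
import json

# Trimmed copy of the breach record shown on the slide.
SAMPLE_RESPONSE = """
[{"Name": "Adobe", "Domain": "adobe.com", "BreachDate": "2013-10-04",
  "PwnCount": 152445165, "IsVerified": true,
  "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"]}]
"""


def summarise_breaches(payload):
    """Parse a JSON list of breach records into short summaries."""
    summaries = []
    for breach in json.loads(payload):
        classes = breach.get("DataClasses", [])
        summaries.append({
            "name": breach["Name"],
            "date": breach["BreachDate"],
            "accounts": breach["PwnCount"],
            # Flag breaches in which passwords were part of the leaked data.
            "passwords_exposed": "Passwords" in classes,
        })
    return summaries


summary = summarise_breaches(SAMPLE_RESPONSE)
```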
Filter: Shannon’s Entropy
Entropy = 0.00 bits
Entropy ~ 0.03 bits
Entropy ~ 2.13 bits
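Shannon's entropy of an image can be computed from the frequencies of its pixel values, H = sum over values of p * log2(1/p). The sketch below assumes the image is already flattened to a list of pixel values; the function names and the 0.5-bit cut-off in `has_information` are assumptions for illustration, since the slide does not state the threshold used.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Return the Shannon entropy, in bits, of a flat list of pixel values."""
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    # p * log2(1 / p) summed over every distinct pixel value.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def has_information(pixels, threshold=0.5):
    """Keep an image only if its entropy is above a (hypothetical) threshold."""
    return shannon_entropy(pixels) > threshold
```

A single-colour screenshot has an entropy of exactly 0 bits and an almost blank one stays close to 0, so a simple threshold is enough to drop images that carry no information before the more expensive detection stages run.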
Data Visualization
EXPLORATION, REPRESENTATION, DETAILS, TOOLS, FINISHING UP
“a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients.”
Andy Kirk, “Data Visualization: a successful design process”
Exploration
DATA TYPE: Hierarchical, Relational, Temporal, Spatial, Categorical
RELEVANCE: What is the most interesting? What is most important? Audience’s profile. What is the most relevant information in the context?
FILTER: Show all values or just a few? Define periods? Define a threshold?
Representation
Experimentation is important
Conceive ideas
Storyboarding
Do multiple iterations
Prototype
Test
A design can be used in the future
[Chart] How many open ports does an IP have? Number of IPs with X open ports (axes: port, Number of IPs)
69,543,915; 25,436,974; 7,008,108; 3,475,472; 1,287,446; 1,043,331;
951,629; 854,817; 789,515; 759,115; 490,290; 288,885;
266,827; 257,105; 219,025; 198,898; 186,286; 141,474
MARKS: Points, Areas, Lines
ATTRIBUTES: Position, Connections/Patterns, Size/Color
REPRESENT RECORDS
EMPHASIZE THE MOST IMPORTANT ASPECTS OF THE DATA
[Chart] Distribution of IP addresses running encrypted and unencrypted services
HTTP (on port 80): 51,467,779 IPs running HTTP services
HTTPS (on port 443): 28,671,263 IPs running HTTPS services
HTTP & HTTPS: 16,519,503 IPs running both HTTP and HTTPS services
PRECISION IN DESIGN: Geometric calculations, Truncated axis, Scales
MAKE IT UNDERSTANDABLE: Reference lines, Markers
MAKE IT APPEALING: Minimise the clutter; Priority: preserve function
[Chart] Top 10 Web Servers for the Web: most common web servers found on port 80 (axis: 0 to 12 millions)
Apache httpd: 11,493,552
AkamaiGHost: 8,361,080
Microsoft IIS httpd: 4,843,769
nginx: 3,860,883
lighttpd: 2,031,741
Huawei HG532e ADSL modem http admin: 1,539,629
Microsoft HTTPAPI httpd: 952,300
Technicolor DSL modem http admin: 699,202
Mbedthis-Appweb: 694,393
micro_httpd: 678,657
Consider different design solutions
DATA TYPE / CONDITION: Hierarchical, Relational, Temporal, Spatial, Categorical
CVSS SCORES (SEVERITY): LOW 0.0 to 4.0, MEDIUM 4.0 to 7.0, HIGH 7.0 to 10.0
CVSS: Common Vulnerability Scoring System
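The severity scale on the slide has cut points at 4.0 and 7.0. A minimal sketch of that banding follows; the slide does not say which band the boundary values belong to, so this assumes the usual CVSS v2 convention in which 4.0 counts as MEDIUM and 7.0 as HIGH. The function name `cvss_severity` is illustrative.

```python
def cvss_severity(score):
    """Map a CVSS base score (0.0 to 10.0) to LOW, MEDIUM or HIGH."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS scores range from 0.0 to 10.0, got {score}")
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    return "HIGH"
```

Turning the continuous score into three ordered categories is what lets the chart use a small, consistent colour scale for severity.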
CVE: Common Vulnerabilities and Exposures
Identifier, Number, References, Description
Email Protocols
Overview of protocols used for email, according to encryption used
SERVICE   COUNT
POP3      4,572,161
POP3S     3,742,289
SMTP      3,531,071
SMTPS     2,971,159
IMAP      4,131,737
IMAPS     3,703,364
UNENCRYPTED (POP3, SMTP, IMAP): 12,234,969
ENCRYPTED (POP3S, SMTPS, IMAPS): 10,416,812
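The two totals on this slide are the sums of the per-protocol counts, which can be checked with a few lines. The dictionary below copies the six counts from the slide; `totals_by_encryption` and the grouping set are illustrative names.

```python
# Per-protocol counts as shown on the slide.
COUNTS = {
    "POP3": 4_572_161,
    "POP3S": 3_742_289,
    "SMTP": 3_531_071,
    "SMTPS": 2_971_159,
    "IMAP": 4_131_737,
    "IMAPS": 3_703_364,
}
ENCRYPTED = {"POP3S", "SMTPS", "IMAPS"}


def totals_by_encryption(counts):
    """Sum the counts into encrypted and unencrypted totals."""
    totals = {"encrypted": 0, "unencrypted": 0}
    for service, count in counts.items():
        key = "encrypted" if service in ENCRYPTED else "unencrypted"
        totals[key] += count
    return totals


totals = totals_by_encryption(COUNTS)
encrypted_share = totals["encrypted"] / sum(totals.values())
```

The sums reproduce the 10,416,812 encrypted and 12,234,969 unencrypted totals, so a little under half of the email services counted here used the encrypted variant.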
Big Data Technologies
Changes in amount of data exposed without security (MongoDB, Memcached, Redis; scale marker: 2 TB)
Aug 2015: 644.3 TB total (619.8 TB, 13.2 TB, 11.3 TB)
Jan 2016: 724.7 TB total (710.9 TB, 12.0 TB, 1.8 TB)
July 2016: 627.7 TB total (598.7 TB, 27.5 TB, 1.5 TB)
Heartbleed
Countries with the highest number of IPs vulnerable to Heartbleed
United States: 23,649
China: 6,790
Germany: 6,382
France: 5,622
Russia: 5,264
Republic of Korea: 4,564
United Kingdom: 3,459
Netherlands: 2,779
Italy: 2,508
Japan: 2,484
VNC wordcloud: login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu
Details
ANNOTATION: Titles and subtitles, Labels, Legends
TYPOGRAPHY: Use fonts that are easy to read; don’t use fonts that are considered sloppy
[Chart] SSH Banners: most common SSH banners found (axes: banner, count)
Banners, as listed: SSH-2.0-OpenSSH_5.3, SSH-2.0-OpenSSH_6.6.1p1, SSH-2.0-OpenSSH_6.6.1, SSH-2.0-OpenSSH_4.3, SSH-2.0-OpenSSH_6.0p1, SSH-2.0-OpenSSH_6.7p1, SSH-2.0-dropbear_2014.63, SSH-2.0-OpenSSH_5.5p1, SSH-2.0-ROSSSH, SSH-2.0-OpenSSH_5.9p1
Counts, as listed: 202,361; 352,978; 436,700; 449,570; 462,616; 537,667; 555,779; 604,579; 1,501,749; 2,632,270
COLOR: Legibility, Functional purpose, Salience, Consistency, Color blindness
COMPOSITION: Chart size/orientation, Alignments
[Chart] SSH Key Lengths: most common key lengths found
Key length   count
1024         5,068,711
2048         3,740,593
1040         641,719
1032         186,070
4096         13,845
512          9,064
2056         7,830
2064         6,265
1016         6,212
768          4,755
Tools
BALANCE
Automation
Programming language to create plots
Fine tuning in Illustrator (make it better for the audience)
Hand-editing process
Human error
Originality
Automated analysis
Illustrator (or other tool) to create visualization solution
Human error
Finishing Up
DOCUMENT EVERY STEP OF THE PROCESS: Calculations, Choices of visualisations, Choices of data points
REVIEW EVERYTHING: What could have been done differently? What could be better?
TAKE CONSTRUCTIVE FEEDBACK: Even if it means starting over
A visualization can be used in the future
INTERNET SECURITY EXPOSURE 2016
BinaryEdge.io
Be Ready. Be Safe. Be Secure.
ise.binaryedge.io
THE SCIENCE BEHIND THE DATA
CREATED BY BINARYEDGE






I FOR ONE WELCOME OUR NEW CYBER OVERLORDS! AN INTRODUCTION TO THE USE OF MACHINE LEARNING IN CYBERSECURITY

  • 1. By Tiago Henriques, Filipa Rodrigues Florentino Bexiga, Ana Barbosa I, for one, welcome our new Cyber Overlords! An introduction to the use of data science in cybersecurity
  • 2. WHO ARE WE? MACHINE LEARNING AND CYBERSECURITY IMAGE WORKFLOW IMAGE ANALYSIS IN DETAIL DATA VISUALISATION Agenda
  • 3. Tiago is the CEO and Data necromancer at BinaryEdge however he gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis. Tiago Henriques Presenter
  • 4. Florentino is the Data MacGyver at BinaryEdge. On a daily basis he needs to deploy infrastructure used to analyse big and realtime data. When not doing that, he can be found creating models to analyse data. Give him an orange, he’ll give you a skynet. Why an orange you ask? He’s hungry and likes oranges, there! Florentino Bexiga Presenter
  • 5. Filipa is the Data Diva at BinaryEdge, she dances the macarena with numbers to get them to tell her all their dirty secret. Filipa Rodrigues Presenter
  • 6. Ana is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ascii code to see and show data in that unique perspective of someone who can’t reach the box of cookies stored on top of the capitol 'I' Ana Barbosa Presenter
  • 9. How we got here.... 200 port scan of the entire internet/ month 1,400,000,000 scanning events/ month * 746,000 torrents monitored and increasing 1,362,225,600 torrent events/ month * at a minimum
  • 10. Worldwide distribution of IPs running services <= 100 Number of IPs found >= 1,000,000 100,000 < #found < 1,000,000 10,000 < #found <= 100,000 1,000 < #found <= 10,000 100 < #found <= 1,000
  • 11. Map IPv4 addresses to Hilbert curves % of coverage 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%
  • 12. Data Science & Machine Learning How many IP addresses did job X had vs. job Y? What is the average duration of the scans? Can we extract more from all the screenshots we get? Can we have a more optimized job distribution? We can only identify X% of services because we’re using static signatures, can we do better? Can we find similar images? MULTIPLE WILD QUESTIONS APPEAR... ...ONE COMMON ANSWER DATA SCIENCE & MACHINE LEARNING
  • 13. Data Science & Machine Learning DATA SCIENCE MACHINE LEARNING INITIAL ANALYSIS AND CLEAN UP EXPLORATORY DATA ANALYSIS DATA VISUALISATION KNOWLEDGE DISCOVERY CLASSIFICATION CLUSTERING SIMILARITY MATCHING REGRESSION IDENTIFICATION
  • 14. Problems and Limitations of Machine Learning in CyberSecurity Lots of adversarial scenarios – Attacks to the classifiers, goes against the foundation of machine learning Prediction – Scenarios and data too volatile, not enough proper sources of data Lack of data in quantity and quality to train models
  • 15. Good use cases further work needs to be done, but will allow to move antivirus from a static/ signature based system into a much improved dynamic/ learning based system If a computer is hacked certain behaviors will change, if constant data is being monitored and fed into a system the hack could be detected detection of vulnerable patterns during development sentiment analysis applied to emails, tweets, social networks of employees PATTERN DETECTION/OUTLIER DETECTION (IDS/IPS) ANTIVIRUS ANTI-SPAM SMARTER FUZZERS SOURCE CODE ANALYSIS INTERNAL ATTACKERS
  • 16. metadata files people photos family&friends behaviour social search company registration ip address url address news forums sub-reddits internal external phone email linked urls likes topics BGP AS whois AS membership AS peer list of IPs shared infrastructure co-hosted sites contact geolocation office locations social networks phone portscan dns torrents binaryedge.io2016 domains AXFR MX records screenshots web services http https webserver framework headers cookies certificate configuration authorities entities SMB VNC RDP users appsfiles peers torrent name OCR SW banners image classifier vulnerabilities data points
  • 22. DEMO
  • 25. metadata files people photos family&friends behaviour social search company registration ip address url address news forums sub-reddits internal external phone email linked urls likes topics BGP AS whois AS membership AS peer list of IPs shared infrastructure co-hosted sites contact geolocation office locations social networks phone portscan dns torrents binaryedge.io2016 domains AXFR MX records screenshots web services http https webserver framework headers cookies certificate configuration authorities entities SMB VNC RDP users appsfiles peers torrent name OCR SW banners image classifier vulnerabilities data points
  • 26. Microservices (REST API) MICROSERVICES (REST API) PORT WORD TAG FACECOUNTRY LOGO IP
  • 27. Scan SCAN GENERATES EVENTS DOES IT GENERATE A SCREENSHOT? STORE THE IMAGE FILE ON THE CLOUD YES NO GENERATE A NOTIFICATION THAT NEW IMAGE WAS UPLOADED FINISH
  • 28. Image Workflow INITIALIZER FILTER LOGO DETECTION FACE DETECTION OPTICAL CHARACTER RECOGNITION (OCR)
  • 29. INITIALIZER FILTER LOGO DETECTION FACE DETECTION OPTICAL CHARACTER RECOGNITION (OCR) Image Workflow PULL MESSAGE FROM QUEUE IS THERE A NEW IMAGE? DECRYPT AND STORE IMAGE METADATA ON A DATABASE YES NO GENERATE IMAGE SIGNATURE FOR SIMILARITY COMPARISON FINISH MESSAGE QUEUE
  • 30. Image Workflow PULL MESSAGE FROM QUEUE DOES THE IMAGE HAVE ANY INFORMATION? PERFORM SIMPLE ENTROPY FILTERING YES NO FINISH MESSAGED QUEUE INITIALIZER FILTER LOGO DETECTION FACE DETECTION OPTICAL CHARACTER RECOGNITION (OCR)
  • 31. PULL MESSAGE FROM QUEUE ENHANCE IMAGE WITH APPLICATION OF SOME FILTERS RUN FACE AND LOGO DETECTION AND OCR ALGORITHMS STORE RESULTS IN DATABASE PERFORM ADDITIONAL ACTIONS WITH THE RESULTS Image Workflow INITIALIZER FILTER LOGO DETECTION FACE DETECTION OPTICAL CHARACTER RECOGNITION (OCR)
  • 32. Image Workflow [{"BreachDate": "2013-10-04", "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"], "Title": "Adobe", "IsAc- tive": true, "Description": "In October 2013, 153 million Adobe accounts were breached with each containing an internal ID, username, email, <em>encrypted</em> password and a password hint in plain text. The password cryptography was poorly done and <a href="http://stric- ture-group.com/files/adobe-top100.txt" target="_blank">many were quickly resolved back to plain text</a>. The unencrypted hints also <a href="http://www.troyhunt.com/2013/11/adobe-creden- tials-and-serious.html" target="_blank">disclosed much about the passwords</a> adding further to the risk that hundreds of millions of Adobe customers already faced.", "Domain": "adobe.com", "Added- Date": "2013-12-04T00:00:00Z", "PwnCount": 152445165, "IsRetired": false, "IsVerified": true, "LogoType": "svg", "IsSensitive": false, "Name": "Adobe"}] Email DataLeak API
  • 33. Image WorkflowImage Workflow INITIALIZER FILTER LOGO DETECTION FACE DETECTION OPTICAL CHARACTER RECOGNITION (OCR)
  • 34. Shannon’s Entropy Entropy = 0.00 bits Entropy ~ 0.03 bits Entropy ~ 2.13 bits Filter
  • 35. Data Visualization EXPLORATION REPRESENTATION DETAILS FINISHING UPTOOLS “a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients.” Andy Kirk, “Data Visualization: a successful design process”
  • 36. EXPLORATION REPRESENTATION DETAILS TOOLS FINISHING UP DATA TYPE RELEVANCE FILTER What is the most interesting? What is most important? Audience’s Profile What is the most relevant information in the context? Show all values or just a few? Define periods? Define a threshold? Hierarchical Relational Temporal Spatial Categorical Exploration Data Visualization
  • 37. Representation Experimentation is important Conceive ideas Storyboarding Do multipe iterations Prototype Test design can be used in the future Data VisualizationEXPLORATION REPRESENTATION DETAILS TOOLS FINISHING UP 69,543,915 25,436,974 7,008,108 3,475,472 1,287,446 1,043,331 951,629 854,817 789,515 759,115 490,290 288,885 266,827 257,105 219,025 198,898 186,286 141,474 HowmanyopenportsdoesanIPhave? NumberofIPswithXopenportsport NumberofIPs
  • 38. Representation EXPLORATION REPRESENTATION DETAILS TOOLS FINISHING UP Distribution of IP addresses running encrypted and unencrypted services MARKS Points Areas Lines ATTTRIBUTES Position Connections/ Patterns Size/ Color REPRESENT RECORDS EMPHASIZE THE MOST IMPORTANT ASPECTS OF THE DATA on port 443 on port 80 51,467,779 HTTP 28,671,263 IPs running HTTP services IPs running HTTPS services 16,519,503IPs running both HTTP and HTTPS services HTTP & HTTPS HTTPS Data Visualization
  • 39. Data Visualization Representation PRECISION IN DESIGN Geometric Calculations Truncated axis Scales MAKE IT UNDERSTANDABLE Reference lines Markers MAKE IT APPEALING Minimise the clutter Priority: preserve function Top 10Web Servers for theWeb Most common web servers found on port 80 Apache httpd AkamaiGHost Micorosft IIS httpd nginx lighttpd Huawei HG532e ADSL modem http admin Microsoft HTTPAPI httpd Technicolor DSL modem http admin Mbedthis-Appweb micro_httpd 2 4 6 80 10 12 millions 11,493,552 8,361,080 4,843,769 3,860,883 2,031,741 1,539,629 952,300 699,202 694,393 678,657 EXPLORATION REPRESENTATION DETAILS TOOLS FINISHING UP
  • 40. Data Visualization, Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Example: CVSS (Common Vulnerability Scoring System) scores by severity on a 0.0 to 10.0 scale: LOW from 0.0 to 4.0, MEDIUM from 4.0 to 7.0, HIGH from 7.0 to 10.0.
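The severity bands on this slide can be written as a small lookup. The slide only shows the breakpoints 4.0 and 7.0; assigning a score that sits exactly on a breakpoint to the higher band follows the usual CVSS v2 ranges (low 0.0 to 3.9, medium 4.0 to 6.9, high 7.0 to 10.0) and is an assumption here, as is the function name.

```python
def cvss_severity(score):
    """Map a CVSS base score (0.0 to 10.0) to LOW, MEDIUM or HIGH."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    return "HIGH"


for s in (0.0, 3.9, 4.0, 6.9, 7.0, 10.0):
    print(s, cvss_severity(s))
```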
  • 41. Data Visualization, Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Example: CVE (Common Vulnerabilities and Exposures) entry: CVE identifier number, references, description.
  • 42. Data Visualization, Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Chart: “Email Protocols”, overview of protocols used for email, according to encryption used (service: count). UNENCRYPTED: POP3 4,572,161; SMTP 3,531,071; IMAP 4,131,737; total 12,234,969. ENCRYPTED: POP3S 3,742,289; SMTPS 2,971,159; IMAPS 3,703,364; total 10,416,812.
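The encrypted versus unencrypted split on this slide can be turned into shares for a part-to-whole chart. The sketch assumes the two large figures are the group totals in the order the labels appear (encrypted 10,416,812, unencrypted 12,234,969); that reading comes from the flattened slide text and is not confirmed by the talk.

```python
# Group totals from the slide; the label-to-number mapping is an ASSUMPTION.
encrypted = 10_416_812
unencrypted = 12_234_969

total = encrypted + unencrypted
encrypted_share = encrypted / total
unencrypted_share = unencrypted / total

print(f"Encrypted:   {encrypted_share:.1%}")
print(f"Unencrypted: {unencrypted_share:.1%}")
```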
  • 43. Data Visualization, Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Chart: “Big Data Technologies”, changes in amount of data exposed without security. Series: MongoDB, Memcached, Redis. Dates: Aug 2015, Jan 2016, July 2016. Values: 2 TB; 644.3 TB; 724.7 TB; 627.7 TB; 13.2 TB; 11.3 TB; 710.9 TB; 12.0 TB; 598.7 TB; 27.5 TB; 1.5 TB; 1.8 TB; 619.8 TB.
  • 44. Data Visualization, Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Map: “Heartbleed”, countries with the highest number of IPs vulnerable to Heartbleed: United States 23,649; China 6,790; Germany 6,382; France 5,622; Russia 5,264; Republic of Korea 4,564; United Kingdom 3,459; Netherlands 2,779; Italy 2,508; Japan 2,484.
  • 45. Data Visualization. VNC wordcloud; words shown: login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu.
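A wordcloud like this one is driven by word frequencies. A minimal sketch of that counting step follows; it assumes the input is plain text already extracted from VNC screens, and the sample strings, the function name and the minimum word length are all illustrative choices rather than the presenters' pipeline.

```python
from collections import Counter
import re


def word_frequencies(texts, min_length=4):
    """Count lowercase alphanumeric words, skipping very short ones."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts


# Hypothetical screen text, for illustration only.
texts = [
    "Press Ctrl Alt Delete to login",
    "Ubuntu Linux kernel login",
    "Welcome to Windows Server 2016 login",
]
freq = word_frequencies(texts)
print(freq.most_common(3))
```

The resulting counts can then be mapped to font size, so the most frequent word is drawn largest.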
  • 46. Data Visualization, Details. ANNOTATION: titles and subtitles, labels, legends. TYPOGRAPHY: use fonts that are easy to read; don’t use fonts that are considered sloppy. Chart: “SSH Banners”, most common SSH banners found (axes: count, banner). Banners: SSH-2.0-OpenSSH_5.3, SSH-2.0-OpenSSH_6.6.1p1, SSH-2.0-OpenSSH_6.6.1, SSH-2.0-OpenSSH_4.3, SSH-2.0-OpenSSH_6.0p1, SSH-2.0-OpenSSH_6.7p1, SSH-2.0-dropbear_2014.63, SSH-2.0-OpenSSH_5.5p1, SSH-2.0-ROSSSH, SSH-2.0-OpenSSH_5.9p1. Counts: 202,361; 352,978; 436,700; 449,570; 462,616; 537,667; 555,779; 604,579; 1,501,749; 2,632,270.
  • 47. Data Visualization, Details. ANNOTATION: titles and subtitles, labels, legends. TYPOGRAPHY: use fonts that are easy to read; don’t use fonts that are considered sloppy. Chart: the same SSH banner data with harder-to-read labels. Banners: SSH-2.0-OpenSSH_5.3, SSH-2.0-OpenSSH_6.6.1p1, SSH-2.0-OpenSSH_6.6.1, SSH-2.0-OpenSSH_4.3, SSH-2.0-OpenSSH_6.0p1, SSH-2.0-OpenSSH_6.7p1, SSH-2.0-dropbear_2014.63, SSH-2.0-OpenSSH_5.5p1, SSH-2.0-ROSSSH, SSH-2.0-OpenSSH_5.9p1. Counts: 202,361; 352,978; 436,700; 449,570; 462,616; 537,667; 555,779; 604,579; 1,501,749; 2,632,270.
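Shorter, cleaner labels for a chart like this can be derived by splitting each banner into its parts. SSH identification strings have the form SSH-protoversion-softwareversion; splitting the software part on the first underscore into a product and a version is a convention assumed here for labelling, and the function name is hypothetical.

```python
from collections import Counter


def parse_ssh_banner(banner):
    """Split 'SSH-2.0-OpenSSH_5.3' into (protocol, product, version)."""
    prefix, proto, software = banner.strip().split("-", 2)
    if prefix != "SSH":
        raise ValueError(f"not an SSH banner: {banner!r}")
    software = software.split(" ", 1)[0]  # drop optional trailing comments
    product, _, version = software.partition("_")
    return proto, product, version


# The ten banners listed on the slide.
banners = [
    "SSH-2.0-OpenSSH_5.3", "SSH-2.0-OpenSSH_6.6.1p1", "SSH-2.0-OpenSSH_6.6.1",
    "SSH-2.0-OpenSSH_4.3", "SSH-2.0-OpenSSH_6.0p1", "SSH-2.0-OpenSSH_6.7p1",
    "SSH-2.0-dropbear_2014.63", "SSH-2.0-OpenSSH_5.5p1", "SSH-2.0-ROSSSH",
    "SSH-2.0-OpenSSH_5.9p1",
]
by_product = Counter(parse_ssh_banner(b)[1] for b in banners)
print(by_product)
```

Grouping by product gives a coarser view that may suit a legend, while the version string stays available for detailed labels.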
  • 48. Data Visualization, Details. COLOR: legibility, functional purpose, salience, consistency, color blindness. COMPOSITION: chart size/orientation, alignments. Chart: “SSH Key Lengths”, most common key lengths found (key length: count): 1040: 641,719; 1032: 186,070; 4096: 13,845; 1024: 5,068,711; 2048: 3,740,593; 512: 9,064; 2056: 7,830; 2064: 6,265; 1016: 6,212; 768: 4,755.
  • 49. Data Visualization, Tools. BALANCE between automation and hand-editing. Automation: a programming language to create plots, automated analysis, then fine-tuning in Illustrator (make it better for the audience); risk: human error. Hand-editing process: Illustrator (or another tool) to create the visualization solution; allows originality; risk: human error.
  • 50. Data Visualization, Finishing up. DOCUMENT EVERY STEP OF THE PROCESS: calculations, choices of visualisations, choices of data points. REVIEW EVERYTHING: What could have been done differently? What could be better? TAKE CONSTRUCTIVE FEEDBACK, even if it means starting over. A visualization can be used in the future.
  • 52. THE SCIENCE BEHIND THE DATA CREATED BY BINARYEDGE