BNAIC 2015 November 5-6, 2015
AMIDST Toolbox
A Java library for Analysis of MassIve Data Streams using
Probabilistic Graphical Models
FP7 European research project
http://amidst.eu
Anders L. Madsen, Andres R. Masegosa, Ana M. Martinez,
Hanen Borchani, Thomas D. Nielsen, Helge Langseth,
Antonio Salmeron, Dario Ramos-Lopez.
Outline
1. Overview of AMIDST Toolbox
o Why are data streams important?
o Why PGMs for analyzing data streams?
o Scalable inference (and learning)
o Roadmap for coming releases
2. Live Demo: Modeling concept drift in financial data.
o Handling data streams.
o Defining Bayesian networks with hidden variables.
o Inference and learning in Bayesian networks.
Scope
Part	I
Data Streams everywhere
• Unbounded	flows	of	data	are	generated	daily:	
• Social	Networks
• Network	Monitoring
• Financial/Banking	industry
• ….
Data Stream Processing
• Processing	data	streams	is	challenging:
– Do	not	fit	in	main	memory
– Continuous	Model	updating	
– Continuous	Model	Inference
– Concept	Drift
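The memory constraint above means each observation can be seen once and then discarded. A minimal sketch of that one-pass style, using Welford's running mean and variance; the class name RunningStats and the sample values are illustrative assumptions and are not part of the AMIDST API:

```java
import java.util.stream.DoubleStream;

// Hypothetical helper (not an AMIDST class): one-pass statistics in O(1) memory.
public class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean

    // Welford's update: fold one observation into the summary, then forget it.
    public void update(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    public long count() { return n; }
    public double mean() { return mean; }
    // Unbiased sample variance; defined as 0 until two observations are seen.
    public double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        // Stand-in for an unbounded stream: values are consumed one by one.
        DoubleStream.of(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0).forEach(stats::update);
        System.out.println("n=" + stats.count()
                + " mean=" + stats.mean()
                + " variance=" + stats.variance());
    }
}
```

The same pattern (a fixed-size summary updated per observation) is what makes continuous model updating feasible when the data cannot be stored.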
Processing Massive Data Streams
• Everything	has	to	scale:
• Scalable	Computing	infrastructure
• Scalable	Models/Inference/Learning
AMIDST Toolbox
• Scalable	framework	for	data	stream	processing.
• Based	on	Probabilistic	Graphical	Models.
• Unique	project	for	data	stream	mining	using	PGMs.
• Open	source	project	(Apache	Software	License	2.0).
AMIDST EU Project
§ This	toolbox	aims	to	deal	with	real,	complex	and	massive	data	streams.
§ Applied	to	real	use-cases	of	AMIDST’s	industrial	partners.
Toolbox Web Page
http://amidst.github.io/toolbox/
Why	PGMs	for	
data	stream	processing?
Part	II
Why Graphical Models?
§ Let’s	look	at	the	following	simple	example:	
§ Stream	of	sensor	measurements	about	temperature and	
smoke presence	in	a	given	geographical	area.
§ Monitor	the	stream	to	detect	the	presence	of	a	fire (event	
detection	problem)
Why Graphical Models?
§ Cast the problem as an anomaly detection problem (outliers).
§ Streaming K-Means (widely used in industry).
[Figure label: Anomaly]
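The black-box baseline mentioned above can be sketched as sequential (online) k-means with a distance threshold. This is a minimal illustration under stated assumptions: the class name, the initial centroids, the threshold and the (temperature, smoke) readings are all invented for the example, and it is not the implementation used in any particular industrial tool.

```java
import java.util.Arrays;

// Hypothetical sketch: online k-means that flags far-away points as anomalies.
public class StreamingKMeans {
    private final double[][] centroids; // one row per cluster
    private final long[] counts;        // weight of each centroid so far
    private final double threshold;     // distance beyond which a point is flagged

    public StreamingKMeans(double[][] initialCentroids, double threshold) {
        this.centroids = new double[initialCentroids.length][];
        for (int k = 0; k < initialCentroids.length; k++) {
            this.centroids[k] = initialCentroids[k].clone();
        }
        this.counts = new long[initialCentroids.length];
        Arrays.fill(this.counts, 1L); // each initial centroid counts as one point
        this.threshold = threshold;
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return Math.sqrt(sum);
    }

    // Returns true if x is an anomaly; otherwise moves the nearest centroid toward x.
    public boolean observe(double[] x) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            double d = distance(x, centroids[k]);
            if (d < bestDist) {
                bestDist = d;
                best = k;
            }
        }
        if (bestDist > threshold) {
            return true; // anomalies do not update the model
        }
        counts[best]++;
        for (int j = 0; j < x.length; j++) {
            centroids[best][j] += (x[j] - centroids[best][j]) / counts[best];
        }
        return false;
    }

    public static void main(String[] args) {
        // Readings are (temperature, smoke indicator); all values are made up.
        StreamingKMeans model =
                new StreamingKMeans(new double[][] {{20.0, 0.0}, {25.0, 0.0}}, 10.0);
        double[][] stream = {{21.0, 0.0}, {24.0, 0.0}, {22.0, 0.0}, {80.0, 1.0}};
        for (double[] x : stream) {
            System.out.println(Arrays.toString(x) + " anomaly=" + model.observe(x));
        }
    }
}
```

Note that the number of clusters, the initial centroids and the threshold are all hyper-parameters that must be tuned, and the model cannot say why a point was flagged.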
Why Graphical Models for
analyzing Data Streams?
§ Many data stream models are black-box models:
§ Pros:
§ No need to understand the problem.
§ Cons:
§ Many hyper-parameters to tune.
§ Blackbox models can rarely explain what they learned.
[Figure: Stream, Blackbox Model, Predictions]
Why Graphical Models?
§ Bayesian Networks:
§ Openbox models
§ Encode prior knowledge.
§ Continuous and discrete variables (CLG networks).
§ Example:
[Figure: Bayesian network with nodes Fire, Temp, Smoke and sensor nodes T1, T2, T3, S1]
p(Fire=true|t1,t2,t3,s1)
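To make the query above concrete, here is a minimal hand-coded sketch that evaluates it exactly. It assumes the structure Fire -> {Temp, Smoke}, Temp -> {T1, T2, T3} and Smoke -> S1, with Gaussian temperatures and binary smoke variables. The structure is read off the node names, every numeric parameter is an invented assumption, and the code does not use the AMIDST API.

```java
// Hypothetical sketch: exact p(Fire=true | t1, t2, t3, s1) in a tiny CLG network.
public class FireNetwork {
    // All parameters below are illustrative assumptions, not values from the talk.
    static final double PRIOR_FIRE = 0.01;
    static final double[] TEMP_MEAN = {20.0, 60.0}; // E[Temp | Fire=false], E[Temp | Fire=true]
    static final double TEMP_VAR = 25.0;            // Var[Temp | Fire]
    static final double SENSOR_VAR = 4.0;           // Var[Ti | Temp]
    static final double[] P_SMOKE = {0.02, 0.90};   // p(Smoke=true | Fire)
    static final double[] P_S1 = {0.05, 0.95};      // p(S1=true | Smoke)

    static double gaussianPdf(double x, double mean, double var) {
        double d = x - mean;
        return Math.exp(-d * d / (2.0 * var)) / Math.sqrt(2.0 * Math.PI * var);
    }

    // p(t1..tn | Fire=f): integrates out the latent Temp one reading at a time.
    static double tempLikelihood(double[] readings, int f) {
        double m = TEMP_MEAN[f];
        double v = TEMP_VAR;
        double lik = 1.0;
        for (double t : readings) {
            lik *= gaussianPdf(t, m, v + SENSOR_VAR); // predictive density of this reading
            double gain = v / (v + SENSOR_VAR);
            m += gain * (t - m);                      // posterior over Temp given readings so far
            v *= (1.0 - gain);
        }
        return lik;
    }

    // p(s1 | Fire=f): sums out the latent Smoke variable.
    static double smokeLikelihood(boolean s1, int f) {
        double pS1True = P_SMOKE[f] * P_S1[1] + (1.0 - P_SMOKE[f]) * P_S1[0];
        return s1 ? pS1True : 1.0 - pS1True;
    }

    public static double posteriorFire(double[] readings, boolean s1) {
        double[] prior = {1.0 - PRIOR_FIRE, PRIOR_FIRE};
        double[] joint = new double[2];
        for (int f = 0; f < 2; f++) {
            joint[f] = prior[f] * tempLikelihood(readings, f) * smokeLikelihood(s1, f);
        }
        return joint[1] / (joint[0] + joint[1]);
    }

    public static void main(String[] args) {
        System.out.println("hot + smoke: " + posteriorFire(new double[] {58, 61, 59}, true));
        System.out.println("cool, clear: " + posteriorFire(new double[] {19, 21, 20}, false));
    }
}
```

Each factor in the product corresponds to one part of the graph, which is what makes the model "open box": the prior, the sensor noise and the smoke detector reliability can each be inspected or set from domain knowledge.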
Why Graphical Models?
Stream Predictions
Openbox Models
Blackbox Inference	Engine
(multi-core	parallelization)
Inference	Engine
Part	III
Inference Engine
§ Querying	the	model
§ p(Fire=true|t1,t2,t3,s1,season)
§ E(Temperature|smoke=true).
§ Learning	from	data		(using	a	Bayesian	approach):
§ Bayesian	framework	naturally	deals	with	data	streams.
§ Prior	is	updated	in	the	light	of	new	data.
$$p(\theta \mid d_1, \ldots, d_n, d_{n+1}) \propto p(d_{n+1} \mid \theta)\, p(\theta \mid d_1, \ldots, d_n)$$
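The recursive update above has a closed form in conjugate models. A minimal sketch for the simplest case, assuming Bernoulli observations with a Beta prior on the parameter; the class name and the choice of model are illustrative assumptions, not taken from the toolbox:

```java
// Hypothetical sketch: the posterior after n points is the prior for point n+1.
public class BetaBernoulliStream {
    private double alpha; // pseudo-count of "true" observations
    private double beta;  // pseudo-count of "false" observations

    public BetaBernoulliStream(double priorAlpha, double priorBeta) {
        this.alpha = priorAlpha;
        this.beta = priorBeta;
    }

    // One step of the recursion: multiply the current posterior by p(d | theta).
    // For a Beta posterior this only increments one of the two counts.
    public void update(boolean d) {
        if (d) {
            alpha += 1.0;
        } else {
            beta += 1.0;
        }
    }

    public double posteriorMean() {
        return alpha / (alpha + beta);
    }

    public static void main(String[] args) {
        BetaBernoulliStream model = new BetaBernoulliStream(1.0, 1.0); // uniform prior
        boolean[] stream = {true, true, false};
        for (boolean d : stream) {
            model.update(d); // no past data is stored, only the two counts
            System.out.println("posterior mean = " + model.posteriorMean());
        }
    }
}
```

The state kept between observations has fixed size, which is why the Bayesian formulation fits unbounded streams.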
Querying the model
§ Parallel	Monte	Carlo	Inference	[Salmeron et	al.	CAEPIA	2015]
§ Exploit	Multi-Core	(powered	by	Java	8)
§ Variational Message Passing [Winn et al. JMLR 2005]
§ Deterministic	approximation
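The multi-core Monte Carlo idea can be sketched with Java 8 parallel streams. The example below is a minimal importance sampler (likelihood weighting) for p(Fire=true | t) in a two-variable model, Fire -> Temp with one noisy reading t of Temp. The model, its parameters and the class name are assumptions made for illustration; this is not the algorithm of Salmeron et al. nor AMIDST code. An exact answer is included for comparison.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Hypothetical sketch: likelihood weighting, parallelized over samples.
public class ParallelImportanceSampling {
    // Illustrative parameters (assumptions).
    static final double PRIOR_FIRE = 0.1;
    static final double[] TEMP_MEAN = {20.0, 60.0}; // E[Temp | Fire=false/true]
    static final double TEMP_VAR = 100.0;
    static final double SENSOR_VAR = 25.0;

    static double gaussianPdf(double x, double mean, double var) {
        double d = x - mean;
        return Math.exp(-d * d / (2.0 * var)) / Math.sqrt(2.0 * Math.PI * var);
    }

    public static double estimatePosteriorFire(double reading, int nSamples) {
        double[] sums = IntStream.range(0, nSamples).parallel()
                .mapToObj(i -> {
                    // Sample (Fire, Temp) from the prior on the worker thread.
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    boolean fire = rnd.nextDouble() < PRIOR_FIRE;
                    double temp = TEMP_MEAN[fire ? 1 : 0]
                            + Math.sqrt(TEMP_VAR) * rnd.nextGaussian();
                    // Weight the sample by the likelihood of the evidence.
                    double w = gaussianPdf(reading, temp, SENSOR_VAR);
                    return new double[] {fire ? w : 0.0, w};
                })
                // Associative, non-mutating reduction, safe for parallel execution.
                .reduce(new double[] {0.0, 0.0},
                        (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
        return sums[0] / sums[1];
    }

    // Closed form for this tiny model: marginally, t | Fire is Gaussian.
    public static double exactPosteriorFire(double reading) {
        double l0 = (1.0 - PRIOR_FIRE) * gaussianPdf(reading, TEMP_MEAN[0], TEMP_VAR + SENSOR_VAR);
        double l1 = PRIOR_FIRE * gaussianPdf(reading, TEMP_MEAN[1], TEMP_VAR + SENSOR_VAR);
        return l1 / (l0 + l1);
    }

    public static void main(String[] args) {
        System.out.println("estimate: " + estimatePosteriorFire(45.0, 200_000));
        System.out.println("exact:    " + exactPosteriorFire(45.0));
    }
}
```

Samples are independent, so the only coordination between cores is the final sum of weights. The estimate is stochastic and varies slightly between runs.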
Learning from data streams
§ Bayesian	approach:
§ Learning	as	an	inference	problem.
§ Powered	by	VMP.
[Figure: plate diagram with nodes α, θ, Z and x; plate over i = 1 … N]
§ Plate notation!
Learning from data streams
§ Parallel Streaming Variational Bayes [Broderick et al. NIPS 2013]
§ Powered	by	Variational	Message	Passing.
§ Multi-core	processing	(using	Java	8).
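The combination step of streaming variational Bayes can be sketched as follows: each mini-batch produces a local posterior starting from the same prior, and the changes in natural parameters are summed. The sketch assumes a Gaussian mean with known observation variance, where the local posterior is available in closed form, so no VMP is run here; the class name and all values are illustrative assumptions rather than AMIDST code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: parallel posterior combination over mini-batches.
public class StreamingVbSketch {
    static final double NOISE_VAR = 1.0; // known observation variance (assumption)

    // Natural parameters of a Gaussian over the unknown mean:
    // {precision * mean, precision}.
    public static double[] localPosterior(double[] prior, double[] batch) {
        double sum = Arrays.stream(batch).sum();
        return new double[] {
            prior[0] + sum / NOISE_VAR,
            prior[1] + batch.length / NOISE_VAR
        };
    }

    // posterior = prior + sum over batches of (localPosterior - prior).
    public static double[] combine(double[] prior, List<double[]> batches) {
        double[] delta = batches.parallelStream()
                .map(b -> {
                    double[] p = localPosterior(prior, b);
                    return new double[] {p[0] - prior[0], p[1] - prior[1]};
                })
                .reduce(new double[] {0.0, 0.0},
                        (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
        return new double[] {prior[0] + delta[0], prior[1] + delta[1]};
    }

    public static void main(String[] args) {
        double[] prior = {0.0, 0.01}; // mean 0, variance 100
        List<double[]> batches = Arrays.asList(
                new double[] {1.0, 2.0},
                new double[] {3.0},
                new double[] {4.0, 5.0, 6.0});
        double[] post = combine(prior, batches);
        System.out.println("posterior mean = " + post[0] / post[1]
                + ", variance = " + 1.0 / post[1]);
    }
}
```

In this conjugate case the combined result equals the posterior obtained by processing all data at once. In non-conjugate models the local posteriors are approximations, which is where an approximate inference engine such as VMP comes in.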
Links to other open software
§ MoaLink
§ MOA	is	a	state-of-the-art	tool	for	data	stream	mining.
§ Using	AMIDST	models	within	MOA	GUI!
§ Great	for	evaluation	&	comparison.
Links to other open software
§ HuginLink
§ Hugin is commercial software for PGMs and influence diagrams.
§ Model conversion.
§ Hugin inference	engine	can	be	used	within	AMIDST.
Road	Map
Part IV
Dynamic Bayesian Networks
(release 1.1)
§ Encode	temporal	knowledge
§ Naturally	fits	with	data	streams
[Figure: dynamic Bayesian network with nodes Fire(t), Temp(t), Smoke(t), T1(t), T2(t), T3(t), S1(t) and previous-slice nodes Fire(t-1), Temp(t-1)]
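Why a dynamic model fits a stream can be seen in a minimal filtering sketch: the belief over Fire(t) is obtained from the belief over Fire(t-1) and the newest observation only. The example reduces the network to a two-state chain Fire(t-1) -> Fire(t) with a single binary smoke alarm per step; the transition and sensor probabilities and the class name are assumptions, and the code does not use the AMIDST API.

```java
// Hypothetical sketch: forward filtering in a two-state dynamic model.
public class FireFilter {
    // Illustrative parameters (assumptions).
    static final double[][] TRANSITION = {
        {0.99, 0.01}, // p(Fire(t) | Fire(t-1)=false)
        {0.10, 0.90}  // p(Fire(t) | Fire(t-1)=true)
    };
    static final double[] P_ALARM = {0.05, 0.90}; // p(S1(t)=true | Fire(t))

    private double[] belief = {0.99, 0.01}; // p(Fire(0))

    // Consumes one observation and returns p(Fire(t)=true | observations so far).
    public double step(boolean alarm) {
        double[] next = new double[2];
        for (int j = 0; j < 2; j++) {
            // Predict: push the previous belief through the transition model.
            double predicted = belief[0] * TRANSITION[0][j] + belief[1] * TRANSITION[1][j];
            // Update: weight by the likelihood of the new observation.
            double likelihood = alarm ? P_ALARM[j] : 1.0 - P_ALARM[j];
            next[j] = predicted * likelihood;
        }
        double z = next[0] + next[1];
        belief = new double[] {next[0] / z, next[1] / z};
        return belief[1];
    }

    public static void main(String[] args) {
        FireFilter filter = new FireFilter();
        boolean[] alarms = {false, true, true, true};
        for (boolean a : alarms) {
            System.out.println("alarm=" + a + " p(fire)=" + filter.step(a));
        }
    }
}
```

A single alarm raises the belief only moderately, while repeated alarms accumulate through the temporal link, something a static per-observation model cannot express.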
Distributed Stream Processing
(release 1.1)
§ RLink
§ Invoke	AMIDST	Inference	engine	within	R.	
§ Preliminary	functionality	recently	presented.
Distributed Stream Processing
(release 2.0)
§ FlinkLink
§ Apache	Flink:	Open	source	platform	for	distributed	stream	processing.
§ Handling	Massive	Data	Streams.
Open Source project
§ We’re	open	to	your	contributions!!	;)
Hosted on GitHub
§ Download:
:> git clone https://github.com/amidst/toolbox.git
§ Compile:
:> ./compile.sh
§ Run:
:> ./run.sh <class-name>
Please “star” our project!
(if you like it)
Any questions
before the live demo?
Live	Demo
Tracking	concept	drift	in	
Financial	data	with	AMIDST
Borchani et	al.	Modeling	Concept	Drift:	A	Probabilistic	Graphical	Model	Based	Approach.	IDA	2015.
Demo Code Available on GitHub
eu.amidst.bnaic2015.examples.BCC
Financial Data
§ Provided by BCC (Spanish regional bank).
§ Consists of monthly aggregated information
§ Active	clients	between	18	and	65	years	old.
§ Data	between	April	2007	and	March	2014.
§ 11	variables
§ Income,	total	credit,	expenses,	etc.
§ Each	client	is	classified	as:
§ defaulter/non-defaulter	in	following	12	months.
Financial Data
§ Hypothesis:
§ Does the Spanish financial crisis have an impact on bank customers?
§ Look	at	the	evolution	of	regional	unemployment	rate.
Data Preprocessing/Visualization
§ Visualize	the	evolution	of	the	monthly	aggregated	data:
§ Data	does	not	fit	in	main	memory!
Model Building
§ We	use	a	simple	Naïve	Bayes	model:
§ With	a	global	hidden	variable	to	track	concept	drift.
[Figure: Naïve Bayes model with nodes D, A1, A2, …, A11 and hidden variable H]
Model Building
§ We also use plate notation
§ “H” is designed to capture concept drift
[Figure: plate diagram with nodes D, A1, A2, …, A11, hidden variables H(t-1) and H(t), and parameters θ; plate over i = 1 … M]
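How a hidden variable linked across time steps can follow drift can be illustrated with a much simpler stand-in: a one-dimensional Kalman filter in which H(t) is a random walk and each time step contributes one noisy summary observation. This swaps in a Kalman filter for the VMP-based learning described in the talk, and the class name, variances and data are assumptions made for illustration only.

```java
// Hypothetical sketch: tracking a drifting hidden level H(t) with a 1-D Kalman filter.
public class DriftTracker {
    private double mean = 0.0;     // current estimate of H(t)
    private double var = 1.0;      // uncertainty of that estimate
    private final double driftVar; // variance of H(t) - H(t-1) (assumption)
    private final double noiseVar; // observation noise variance (assumption)

    public DriftTracker(double driftVar, double noiseVar) {
        this.driftVar = driftVar;
        this.noiseVar = noiseVar;
    }

    // Processes one time step and returns the updated estimate of H(t).
    public double update(double observation) {
        var += driftVar;                    // predict: H(t) = H(t-1) + drift
        double gain = var / (var + noiseVar);
        mean += gain * (observation - mean); // correct with the new observation
        var *= (1.0 - gain);
        return mean;
    }

    public static void main(String[] args) {
        DriftTracker tracker = new DriftTracker(0.1, 1.0);
        // Made-up monthly summaries: stable around 0, then a shift to 5.
        for (int month = 0; month < 40; month++) {
            double obs = month < 20 ? 0.0 : 5.0;
            System.out.println("month " + month + " H=" + tracker.update(obs));
        }
    }
}
```

Plotting the estimate of H over time gives a drift curve: flat while the data distribution is stable, moving when it changes.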
Tracking concept drift
References
§ Masegosa et al. AMIDST: Analysis of Massive Data Streams using Probabilistic Graphical Models. Submitted to JMLR. 2015.
§ Borchani et al. Modeling Concept Drift: A Probabilistic Graphical Model Based Approach. IDA 2015.
§ Masegosa et al. Probabilistic graphical models on multi-core CPUs using Java 8. Submitted to IEEE Computational Intelligence Magazine, Special Issue on Computational Intelligence Software. 2015.
§ Salmeron et al. Parallel importance sampling in conditional linear Gaussian networks. In Proceedings of the Conferencia de la Asociación Española para la Inteligencia Artificial, volume in press, 2015.
§ Winn et al. Variational message passing. Journal of Machine Learning Research, 6:661–694, 2005.
§ Broderick et al. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.
Any questions?
http://amidst.github.io/toolbox/

Amidst demo (BNAIC 2015)