SQL++ FOR BIG DATA
Same Language, More Power
Matthew D. Groves
2
SQL, for the win
https://insights.stackoverflow.com/survey/2019
01/
02/
03/
04/
SQL & Relational
NoSQL
Analytics & Reporting
Summary & More Resources
AGENDA
SQL & RELATIONAL
1
5
• Relational
• E.F. Codd invented the relational model
• Alpha
• SQL
• Created by Don Chamberlin & Raymond Boyce
• Designed to be English-friendly
• "SQL" and "relational" are now synonyms
Relational and SQL
6
• Impedance mismatch
• Scaling
• Inflexibility
Criticisms of SQL/relational
7
Impedance mismatch
ID Username DateCreated
1 mgroves 2019-06-13
2 agroves 2019-06-14
. . .
. . .
CartID Item Price Qty
1 hat 12.99 1
1 socks 11.99 1
2 t-shirt 15.99 1
. . . .
. . . .
public class ShoppingCart
{
public int Id;
public string Username;
public List<Items> Items;
}
ShoppingCart
ShoppingCartItems
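
The gap between the two tables and the single ShoppingCart object can be sketched roughly as follows. This is only an illustration in Python (the class on the slide is C#); the row values come from the tables above, while the helper name build_carts and the dictionary shapes are hypothetical. It is a simplified picture of the work an OR/M does for you, not any particular tool's behavior.

```python
from collections import defaultdict

# Rows as they sit in the two relational tables on the slide.
shopping_cart_rows = [
    (1, "mgroves", "2019-06-13"),
    (2, "agroves", "2019-06-14"),
]
shopping_cart_item_rows = [
    (1, "hat", 12.99, 1),
    (1, "socks", 11.99, 1),
    (2, "t-shirt", 15.99, 1),
]


def build_carts(cart_rows, item_rows):
    """Reassemble flat rows into one nested object per cart (hypothetical helper)."""
    items_by_cart = defaultdict(list)
    for cart_id, item, price, qty in item_rows:
        items_by_cart[cart_id].append({"item": item, "price": price, "qty": qty})
    return [
        {"id": cart_id, "username": username, "items": items_by_cart[cart_id]}
        for cart_id, username, _date_created in cart_rows
    ]


carts = build_carts(shopping_cart_rows, shopping_cart_item_rows)
row_count = len(shopping_cart_rows) + len(shopping_cart_item_rows)
print(row_count, "rows ->", len(carts), "carts")
```

Five stored rows collapse into two application objects, which is the mismatch the slide is pointing at.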
8
Scaling
Vertical Horizontal
9
Inflexibility
Billing
Connections
Purchases
Contacts
Customer
10
• A relational database may be…
Disclaimer!
NOSQL / SQL++
2
12
JSON data is NoSQL data
13
Example 1
{
"callsign": "UNITED",
"country": "United States",
"name": "United Airlines",
"type": "airline"
}
document key: airline_5209
14
Example 2
document key: route_55758
{
"airline": "UA",
"airlineid": "airline_5209",
"destinationairport": "ORD",
"distance": 1050.394306634423,
"equipment": "ER4 ERJ",
"schedule": [
{ "day": 0, "flight": "UA479", "utc": "15:05:00" },
{ "day": 1, "flight": "UA842", "utc": "02:27:00" },
{ "day": 1, "flight": "UA252", "utc": "03:00:00" },
// ... etc ...
],
"sourceairport": "CMH",
"stops": 0,
"type": "route"
}
15
• Get by key
• Set by key
• Delete by key
• Other "operational" query
NoSQL basic operations
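
The three key-based operations can be pictured with an ordinary Python dict standing in for the document store. This is a sketch of the idea only, not any database's client API; the variable names are made up for illustration, and the document is Example 1 from earlier.

```python
# A plain dict stands in for the document store: string key -> JSON-like value.
store = {}

# Set by key
store["airline_5209"] = {
    "callsign": "UNITED",
    "country": "United States",
    "name": "United Airlines",
    "type": "airline",
}

# Get by key
airline = store.get("airline_5209")

# Delete by key
del store["airline_5209"]
missing = store.get("airline_5209")  # None once the document is gone
```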
16
Problems:
1. Large amounts of data
2. Queries against the data could impact operations
What about reporting and analytics?
ANALYTICS & REPORTING
3
18
Operational vs Analytics vs Operational Analytics
19
Fewer queries
Operational Analytics workload
Ad hoc
Could be complex
Performance is nice-to-have
20
How are operational analytics done?
21
¯\_(ツ)_/¯
Answer 1
22
Answer 2: Export to relational
Data
ETL
SQL
23
Answer 3: Hadoop?
https://medium.com/@ylashin/big-data-using-hdinsight-a-journey-in-the-zoo-ecosystem-c78b913a5ed9
24
Answer 4: SQL++
25
SQL Example
ID foo bar baz
1 matt groves qux
2 ali groves notqux
3 emma groves notqux
mytable
SELECT foo, bar
FROM mytable
WHERE baz = 'qux'
26
SQL++ Example
key: 1
{
"foo" : "matt",
"bar" : "groves",
"baz" : "qux"
}
key: 2
{
"foo" : "ali",
"bar" : "groves",
"baz" : "notqux"
}
key: 3
{
"foo" : "emma",
"bar" : "groves",
"baz" : "notqux"
}
mybucket
SELECT foo, bar
FROM mybucket
WHERE baz = 'qux'
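
To make the parallel concrete, here is a rough Python model of what that query asks for over the three documents. It only mirrors the semantics (filter, then project); it is not how a query engine executes anything, and the variable names are illustrative.

```python
# The three documents from the slide, keyed the same way.
mybucket = {
    "1": {"foo": "matt", "bar": "groves", "baz": "qux"},
    "2": {"foo": "ali", "bar": "groves", "baz": "notqux"},
    "3": {"foo": "emma", "bar": "groves", "baz": "notqux"},
}

result = [
    {"foo": doc["foo"], "bar": doc["bar"]}  # SELECT foo, bar
    for doc in mybucket.values()            # FROM mybucket
    if doc.get("baz") == "qux"              # WHERE baz = 'qux'
]
print(result)  # [{'foo': 'matt', 'bar': 'groves'}]
```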
27
SQL++ Research Project
28
• JOIN
• UNION
• aggregation / GROUP BY
• SELECT
• LET
• LIMIT
• ORDER BY
• etc…
SQL++ is backwards compatible
29
SQL++ has superpowers
30
Superpower: Nested Objects
key 1
{
"name" : "matt",
"address" : {
"street" : "White Rd",
"city" : "Grove City",
"state" : "OH"
}
}
key 2
{
"name" : "emma",
"address" : {
"street" : "High St",
"city" : "Columbus",
"state" : "OH"
}
}
SELECT address.city
FROM myusers
myusers
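
The dotted syntax amounts to walking into the nested object one field at a time. A minimal Python sketch of that walk, assuming the two documents above; get_path is a hypothetical helper, and the "city" key in the output is an assumption about how an engine would label the column.

```python
myusers = {
    "1": {"name": "matt",
          "address": {"street": "White Rd", "city": "Grove City", "state": "OH"}},
    "2": {"name": "emma",
          "address": {"street": "High St", "city": "Columbus", "state": "OH"}},
}


def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.city'; None if any step is absent."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# SELECT address.city FROM myusers
cities = [{"city": get_path(u, "address.city")} for u in myusers.values()]
print(cities)  # [{'city': 'Grove City'}, {'city': 'Columbus'}]
```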
31
Superpower: arrays
key 1
{
"name" : "matt",
"favoriteFoods" : [
"pizza",
"cheesecake",
"donuts"
]
}
key 2
{
"name" : "emma",
"favoriteFoods" : [
"donuts",
"Lucky Charms",
"chicken"
]
}
SELECT favoriteFoods[1]
FROM myusers
myusers
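
Square brackets pick one element out of the array. The sketch below assumes zero-based indexing (as in JSON-oriented engines such as Couchbase), so favoriteFoods[1] is the second item. element_at is a hypothetical helper that returns None where a real engine would report a missing value.

```python
myusers = {
    "1": {"name": "matt", "favoriteFoods": ["pizza", "cheesecake", "donuts"]},
    "2": {"name": "emma", "favoriteFoods": ["donuts", "Lucky Charms", "chicken"]},
}


def element_at(array, index):
    """Return array[index], or None when the index is out of range."""
    return array[index] if 0 <= index < len(array) else None


# SELECT favoriteFoods[1] FROM myusers
second_favorites = [element_at(u.get("favoriteFoods", []), 1) for u in myusers.values()]
print(second_favorites)  # ['cheesecake', 'Lucky Charms']
```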
32
Superpower: UNNEST
key 1
{
"name" : "matt",
"favoriteFoods" : [
"pizza",
"cheesecake",
"donuts"
]
}
SELECT food, u.name
FROM myusers u
UNNEST u.favoriteFoods food;
myusers
[
{
"food": "pizza",
"name": "matt"
},
{
"food": "cheesecake",
"name": "matt"
},
{
"food": "donuts",
"name": "matt"
}
]
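
UNNEST behaves like a nested loop: for each document, emit one row per array element, joined back to its parent. A small Python sketch of that idea, using the single document from this slide (variable names are illustrative):

```python
myusers = {
    "1": {"name": "matt", "favoriteFoods": ["pizza", "cheesecake", "donuts"]},
}

# SELECT food, u.name FROM myusers u UNNEST u.favoriteFoods food
rows = [
    {"food": food, "name": u["name"]}
    for u in myusers.values()               # outer loop: each document
    for food in u.get("favoriteFoods", [])  # inner loop: each array element
]
print(rows)
```

The printed rows match the result shown on the slide: one row per favorite food, each carrying the parent's name.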
33
Superpower: Quantification
key 1
{
"name" : "matt",
"favoriteFoods" : [
"pizza",
"cheesecake",
"donuts"
]
}
key 2
{
"name" : "emma",
"favoriteFoods" : [
"donuts",
"Lucky Charms",
"chicken"
]
}
SELECT u.name
FROM myusers u
WHERE ANY f
IN u.favoriteFoods
SATISFIES f == 'pizza'
END;
myusers
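
ANY ... SATISFIES is an existence test over the array, which maps closely onto Python's built-in any(). A sketch over the two documents above, with illustrative variable names:

```python
myusers = {
    "1": {"name": "matt", "favoriteFoods": ["pizza", "cheesecake", "donuts"]},
    "2": {"name": "emma", "favoriteFoods": ["donuts", "Lucky Charms", "chicken"]},
}

# SELECT u.name FROM myusers u
# WHERE ANY f IN u.favoriteFoods SATISFIES f == 'pizza' END
pizza_fans = [
    {"name": u["name"]}
    for u in myusers.values()
    if any(f == "pizza" for f in u.get("favoriteFoods", []))
]
print(pizza_fans)  # [{'name': 'matt'}]
```

Swapping any() for all() gives the "all items satisfy" flavor of quantification.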
34
• Couchbase
• AsterixDB
• Apache Drill
• Others coming soon?
SQL++ Implementations
35
Implementation 1: Couchbase
SQL++
36
Implementation 2: AsterixDB
37
Implementation 3: Apache Drill
SUMMARY
4
39
No
NoSQL doesn't mean NoSQL anymore
++SQL
40
SQL++ is SQL with JSON Superpowers
41
Minimize your ETL, maximize your SQL skills
ETL
👎
SQL
👍
42
• E.F. Codd original research paper
• http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_relational_model.pdf
• The Free Lunch is Over
• http://www.gotw.ca/publications/concurrency-ddj.htm
• Original SEQUEL paper
• https://dl.acm.org/citation.cfm?id=811515
Resources: SQL/scaling
43
• UCSD
• http://forward.ucsd.edu/sqlpp.html
• The SQL++ Query Language
• https://arxiv.org/abs/1405.3631
Resources: UCSD Research
44
• Book: SQL++ for SQL Users
• Amazon: https://www.amazon.com/SQL-Users-Tutorial-Don-Chamberlin/dp/0692184503/
• Free PDF: https://resources.couchbase.com/sql_tutorial
• Videos
• NoSQL and SQL++, two sides of the same coin:
https://www.youtube.com/watch?v=KGKiSyJa0-k
• Tech Panel on Query Language Evolution:
https://www.youtube.com/watch?v=LAlDe1w7wxc
Resources: Don Chamberlin
45
• @mgroves
• twitch.tv/matthewdgroves
• forums.couchbase.com
• Find me after this session!
Resources: Me!

Intro to SQL++ - Detroit Tech Watch - June 2019

Editor's Notes

  • #3 Show that SQL is popular, per the Stack Overflow survey 2019. About the same as it was last year, in the 55-60% range. Popular doesn't necessarily equal good, of course, but if you look at the top 3, they are all in the "lingua franca" category. SQL rules data. https://insights.stackoverflow.com/survey/2019
  • #6 E.F. Codd did a lot of great theoretical work and research, including the invention of the relational model. There is an interesting quote from his original paper that describes one of the fundamental tradeoffs between relational and non-relational data, which we'll explore today. After his initial paper, he designed a language called "Alpha", which was never implemented but was influential.
  • #8 In the database we have 5 pieces of data stored, for what is actually 2 shopping carts as they exist in the application. We have tools to attempt to deal with this, mainly OR/Ms. And they mostly do a good job… mostly.
  • #9 The easiest way to scale a relational database is vertical. But this can get expensive and eventually hit a ceiling (The Free Lunch is Over). Horizontal scaling can be cheaper and can scale bigger, but is difficult to do with relational.
  • #10 Rise of agile methodologies: "we value responding to change over following a plan". Schema changes: take a simple change such as moving the "credit card number" field from customer to a new "billing" table with a foreign key. That's a simple example, but even that, with a large enough database, could have a huge impact. The more complex the schema change and the bigger the database, the more impact it has, which means the more expensive/risky this change will be.
  • #11 I'm not here to convince you that relational is dead! A relational database may be fine if: you are working with small data sets (for some definition of small); you are working with simple/rarely changing data structures (for some definition of simple/rarely); you aren't feeling performance/scaling pain (yet). But don't turn off your mind yet. You aren't facing these problems now, but you may face them in the future.
  • #12 So what if it's not fine?
  • #13 Isolated pieces of data: "documents". They can be sharded / split between any number of nodes. (For some reason, when I think of "shards" I think of the crystals that Superman has in the Fortress of Solitude.)
  • #14 This is a simple example. Flat data: you could easily imagine this as a row in a table. Notice the document KEY. A document database is basically a key/value store. The value is the JSON and the key is some string. This may look slightly different from database to database, but they all have a key somewhere.
  • #15 More complex example. The 'schedule' element in relational would be at least one separate table with foreign keys. It's all domestic data here. No mismatch, easy to scale, no joining required. No schema to follow, so I could add other fields TO JUST THIS ONE DOCUMENT if necessary. Don't ALWAYS normalize: notice the 'airlineid' field.
  • #16 Other operational queries: map/reduce; Mongo has a JavaScript-like query language; Couchbase uses SQL for operational queries.
  • #17 Suppose your database is used for the backend of an ecommerce site. Everything is humming along nicely: customers are adding items to shopping carts, making purchases, and browsing the catalog with well-known, well-indexed queries. Suddenly I come along trying to create a report. I run a complicated or ad hoc query that I don't have proper indexing, sizing, or tuning for. And my query impacts customers: it slows them down or, worse, causes timeouts.
  • #19 Define these terms. Talk more about the differences later, and when to use each one. Operational: the moment-to-moment data operations and queries that your website needs to function in order to serve customers. Analytical: the operations and queries that you need to serve customers in the extreme long run and extreme history (data science, etc.). Operational analytics: sits between them, closer to real time, perhaps analyzing only the last 6 months or maybe even the last hour of data (dashboards, reporting, trend analysis).
  • #20 Far fewer analysts than customers (hopefully?). Queries are more ad hoc in nature. Queries might be VERY complex. Performance is still nice, but less important.
  • #21 There are 4 methods that I'm aware of. I have experience with most of these.
  • #22 I dunno? We don't really have a plan for this; we don't think about it. We have a bunch of Access databases? We copy the operational data when we want to? Or just link to it directly and hope no one screws it up?
  • #23 Export it to a relational database and use SQL. You have to create/maintain or buy an ETL. Impedance mismatch (again!). Size/performance.
  • #24 Hadoop is designed for massive scale, not massive speed. It's analytics, but it's not operational analytics. Using Hadoop and the Hadoop ecosystem is a whole other topic. This may be too big of a hammer or too slow of a hammer for operational analytics. Answer 3: Hadoop or something. It's still an ETL problem (Kafka, Sqoop, Flume, etc.). How do we actually create queries? Pig, Hive, Spark, etc. Designed for petabytes+. Two types of analytics: this is the data lake, analyzing data from the entire history of the company. https://medium.com/@ylashin/big-data-using-hdinsight-a-journey-in-the-zoo-ecosystem-c78b913a5ed9
  • #25 You already know how to write SQL. Designed to work with richly structured data. Minimal or no ETL required. This is the cover of a book, and notice the author.
  • #27 As Don Chamberlin says, JSON kinda looks like tables "if you squint hard enough".
  • #28 SQL++ was a research project from UCSD in 2015: https://arxiv.org/abs/1405.3631 Couchbase's N1QL (operational) is the first implementation of this research paper.
  • #29 The language itself is backwards compatible. The underlying data is different: it's not tables and rows, it's collections of JSON documents.
  • #30 SQL is made for flat relational data. SQL++ takes it a step further to deal with structured data, and therefore it has some superpowers.
  • #31 In JSON you can have nested objects: objects within objects, like address here. How do I select that, project that, etc.? The answer is: dotted syntax.
  • #32 Addressing arrays with square brackets
  • #33 We may want to flatten that array in order to filter on the values. Consider that "favoriteFoods" in relational would be a separate table. In JSON it's not, but we might want to do an "intra-document" join. UNNEST will flatten out the array and basically join each array value to its parent document.
  • #34 Quantification means that I want to perform some filtering of an array, to see if any or all items in an array satisfy some criteria. For instance, I want to find all users who have "pizza" as a favorite food.
  • #36 Analytics. Workload isolation. "Shadow copy" created with two commands. It technically IS an ETL, but it is real time, it's created with two simple commands, and it's otherwise completely automated. I'll show you a demo of this later. Workload isolation, read only.
  • #37 "Big data management system". Data ingestion (ETL) with a variety of built-in adapters (local filesystem, HDFS, socket, Twitter, RSS), and it's extensible. Couchbase is essentially using a customized version of AsterixDB under the hood.
  • #38 No ETL required. Seems to access data directly ("in-place analytics"), which could be a workload isolation problem (operational vs analytics). Can connect to a wide variety of databases.
  • #40 They say you only remember 3 things from any presentation, so here they are
  • #43 Codd research paper - http://db.dobo.sk/wp-content/uploads/2015/11/Codd_1970_A_relational_model.pdf (may not be a good link in the long run, but it's free) - The Free Lunch is Over - http://www.gotw.ca/publications/concurrency-ddj.htm - SEQUEL paper - https://dl.acm.org/citation.cfm?id=811515 (I couldn't find a free copy)
  • #44 - http://forward.ucsd.edu/sqlpp.html (SQL++ is part of the FORWARD project) - https://arxiv.org/abs/1405.3631 (paper hosted on arXiv, which is operated by Cornell)
  • #45 - book - https://www.amazon.com/SQL-Users-Tutorial-Don-Chamberlin/dp/0692184503/ - free pdf: https://resources.couchbase.com/sql_tutorial - videos - https://www.youtube.com/watch?v=KGKiSyJa0-k - https://www.youtube.com/watch?v=LAlDe1w7wxc
  • #46 If anything looks interesting to you, or you have questions or feedback, come talk to me afterwards. I want to hear from you! My boss says I have to listen to you, it's my job. So now's your chance :)