Lecture 6_ Document Databases, Data Formats
Document Databases
Lecture 6 of NoSQL Databases (PA195)
2. Document databases
○ Structured text data - Hierarchical tree data structures
■ typically JSON, XML
3. Column-family stores
○ Rows that have many columns associated with a row key
■ can be written as JSON
Part 1: Document Data Types
Data Formats
● Binary Data (previous lecture)
○ often, we want to store objects (class instances)
○ objects can be binary serialized (marshalled)
■ and kept in a key-value store
○ there are several popular serialization formats
■ Protocol Buffers, Apache Thrift
source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
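The binary-serialization idea above can be sketched as follows. This is a minimal illustration, not Protocol Buffers or Thrift: the `User` class, its byte layout, and the dict used as a stand-in key-value store are all assumptions made for this example.

```python
import struct


class User:
    """Hypothetical class whose instances we want to persist."""

    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age

    def to_bytes(self) -> bytes:
        raw = self.name.encode("utf-8")
        # Assumed layout: uint16 name length, name bytes, uint8 age (little-endian).
        return struct.pack(f"<H{len(raw)}sB", len(raw), raw, self.age)

    @classmethod
    def from_bytes(cls, data: bytes) -> "User":
        (name_len,) = struct.unpack_from("<H", data, 0)
        raw, age = struct.unpack_from(f"<{name_len}sB", data, 2)
        return cls(raw.decode("utf-8"), age)


# A plain dict stands in for a key-value store.
store = {}

# Serialize (marshal) the object and keep the bytes under a key.
store["user:1"] = User("Alice", 30).to_bytes()

# The store only sees opaque bytes; reading requires deserializing the whole value.
restored = User.from_bytes(store["user:1"])
```

The store cannot look inside the value, so it cannot answer a question such as "all users older than 25" without the application decoding every value itself.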
JSON Properties
● There is no way to write comments in JSON
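○ Comments were part of the original format, but they were removed: they were being used to carry parsing directives, which would have hurt interoperability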
source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
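A small check of the no-comments rule, using Python's standard `json` module. The sample documents and the `try_parse` helper are made up for this sketch.

```python
import json

valid = '{"name": "Alice", "age": 30}'
with_comment = '{"name": "Alice" /* primary user */, "age": 30}'


def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


parsed_valid = try_parse(valid)            # parses to a dict
parsed_comment = try_parse(with_comment)   # rejected: comments are not JSON
```

A common workaround is to put the note into an ordinary field (for example a key named `"_comment"`); that is a convention of individual applications, not part of JSON.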
Document with JSON Schema
source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
Part 2: Document Databases
Document Databases: Fundamentals
● Basic concept of data: Document
● Documents are self-describing pieces of data
○ Hierarchical tree data structures
○ Nested associative arrays (maps), collections, scalars
○ XML, JSON (JavaScript Object Notation), BSON, …
● Documents in a collection should be “similar”
○ Their schema can differ
● Often: Documents are stored as the values of key-value pairs
○ Key-value stores where the values are examinable
○ Building search indexes on various keys/fields
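What "examinable values" means can be sketched with plain Python dicts. The collection contents, the dotted-path syntax, and the helpers `get_path` and `find` are assumptions for illustration, not the API of any particular database.

```python
import json

# Keys map to documents; the documents are "similar" but their schemas differ.
collection = {
    "u1": json.loads('{"name": "Alice", "address": {"city": "Brno"}}'),
    "u2": json.loads('{"name": "Bob", "emails": ["bob@example.com"]}'),
    "u3": json.loads('{"name": "Carol", "address": {"city": "Praha"}, "age": 41}'),
}


def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; return None if it is absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(coll, path, value):
    """Return the keys of all documents whose field at `path` equals `value`."""
    return sorted(key for key, doc in coll.items() if get_path(doc, path) == value)


brno_keys = find(collection, "address.city", "Brno")
```

Unlike a plain key-value store, the query looks inside the values; documents that lack the field (such as `u2`) are simply skipped.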
Why Document Databases
● XML and JSON are popular for data exchange
○ Recently mainly JSON
● Data stored in document DB can be used directly
MS Azure DocumentDB
row document
rowid _id
● each collection:
○ belongs to a “database”
http://www.mongodb.org/
Documents
● Use JSON for API communication
● Internally: BSON
○ Binary representation of JSON
○ For storage and inter-server communication
http://www.mongodb.org/
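To give a feel for what "binary representation of JSON" means, here is a hand-written encoder for a tiny subset of BSON (strings, booleans, 32-bit integers, nested documents). It follows the byte layout of the published BSON specification as far as this subset goes; the function names are made up, and a real application would use a driver's BSON library instead.

```python
import struct


def encode_element(name: str, value) -> bytes:
    """Encode one field as: type byte, NUL-terminated name, payload."""
    key = name.encode("utf-8") + b"\x00"
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # Type 0x02: int32 byte length (including the trailing NUL), then the bytes.
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return b"\x08" + key + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        # Type 0x10: 32-bit integer only; larger values are out of scope here.
        return b"\x10" + key + struct.pack("<i", value)
    if isinstance(value, dict):
        return b"\x03" + key + encode_document(value)
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def encode_document(doc: dict) -> bytes:
    """Encode a document as: int32 total size, elements, terminating NUL."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body


encoded = encode_document({"hello": "world"})
```

Because every document and string carries its length up front, a reader can skip over fields it does not need without parsing them, which is harder with textual JSON.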
Schema: Embedded Docs (2)
● Denormalized schema
● Main advantage:
Manipulate related data in a single operation
● Use this schema when:
○ One-to-one relationships: one doc “contains” the other
○ One-to-many: when each child document belongs to exactly one parent document
● Disadvantages:
○ Documents may grow significantly over time
○ Impacts both read/write performance
■ Document must be relocated on disk if its size exceeds allocated space
■ May lead to data fragmentation on the disk
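The trade-off of embedding can be sketched with a plain Python dict standing in for a stored document. The field names and the use of JSON text length as a size measure are assumptions for illustration only.

```python
import json

# One user document with an embedded one-to-one part (contact)
# and an embedded one-to-many part (orders).
user = {
    "_id": "u1",
    "name": "Alice",
    "contact": {"email": "alice@example.com", "city": "Brno"},
    "orders": [],
}

size_before = len(json.dumps(user))

# Every new order is appended to the same parent document...
for i in range(100):
    user["orders"].append({"order_id": i, "item": "book"})

# ...so a single read returns the user together with all related data,
# but the document keeps growing.
size_after = len(json.dumps(user))
growth_factor = size_after / size_before
```

The single-document read is the advantage; the steadily increasing size is what can force the relocation and fragmentation mentioned above.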
Schema: References
● Links/references from one document to another
● Normalization of the schema
http://www.mongodb.org/
Schema: References (2)
● More flexibility than embedding
● Use references:
○ When embedding would result in duplication of data
■ and only an insignificant boost in read performance
○ To represent more complex many-to-many relationships
○ To model large hierarchical data sets
● Disadvantages:
○ Can require more roundtrips to the server
■ Documents are accessed one by one
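The extra round trips caused by references can be made visible with a small in-memory stand-in. The `Collection` class, its read counter, and the sample data are hypothetical; they only imitate fetching documents one by one.

```python
import copy


class Collection:
    """In-memory stand-in for a collection that counts simulated round trips."""

    def __init__(self):
        self.docs = {}
        self.reads = 0

    def insert(self, doc):
        self.docs[doc["_id"]] = copy.deepcopy(doc)

    def find_one(self, _id):
        self.reads += 1  # each lookup counts as one trip to the server
        return copy.deepcopy(self.docs.get(_id))


publishers = Collection()
books = Collection()

# The publisher is stored once and referenced by id, so nothing is duplicated.
publishers.insert({"_id": "p1", "name": "Example Press", "city": "Brno"})
books.insert({"_id": "b1", "title": "Book One", "publisher_id": "p1"})
books.insert({"_id": "b2", "title": "Book Two", "publisher_id": "p1"})


def book_with_publisher(book_id):
    """Resolve the reference in the application: two reads instead of one."""
    book = books.find_one(book_id)
    publisher = publishers.find_one(book["publisher_id"])
    return {**book, "publisher": publisher}


result = book_with_publisher("b1")
round_trips = books.reads + publishers.reads
```

With embedding, the same result would need one read, but the publisher data would be copied into every book document.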
Part 2.2: MongoDB - Indexes
Indexes
● Indexes are the key to MongoDB performance
○ Without indexes, MongoDB must scan every document in a collection to select matching documents
● Indexes store some fields in an easily accessible form
○ An index stores the values of a specific field (or fields), ordered by value
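The difference between a full scan and an ordered index can be sketched in a few lines. This is a simplified model using a sorted Python list and binary search; it is not how MongoDB implements its indexes, and the data and helper names are invented for the example.

```python
import bisect

# 1000 small documents with a numeric field to search on.
docs = [{"_id": i, "score": (i * 37) % 101} for i in range(1000)]


def scan(documents, field, value):
    """Without an index: examine every document in the collection."""
    examined = 0
    hits = []
    for doc in documents:
        examined += 1
        if doc.get(field) == value:
            hits.append(doc["_id"])
    return hits, examined


def build_index(documents, field):
    """Index = (field value, _id) pairs kept in order of the field value."""
    return sorted((doc[field], doc["_id"]) for doc in documents if field in doc)


def index_lookup(index, value):
    """Binary-search to the first entry for `value`, then read the matching run."""
    pos = bisect.bisect_left(index, (value,))
    hits = []
    while pos < len(index) and index[pos][0] == value:
        hits.append(index[pos][1])
        pos += 1
    return hits


score_index = build_index(docs, "score")
scan_hits, examined = scan(docs, "score", 5)
index_hits = index_lookup(score_index, 5)
```

Both approaches return the same documents, but the scan touches all 1000 of them while the index lookup jumps directly to the matching entries. Because the entries are ordered, the same structure also supports range queries.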
http://www.bsonspec.org/
Data Replication
● Master/slave replication
● Replica set = group of instances that host the same data set
○ primary (master) – handles all write operations
○ secondaries (slaves) – apply operations from the primary so that they have the same data set
Replication: Read & Write
● Write operation:
1. Write operation is applied on the primary
2. Operation is recorded to primary’s oplog (operation log)
3. Secondaries replicate the oplog + apply the operations to their data sets
● Read: All replica set members can accept reads
○ By default, application directs its reads to the primary
■ Guarantees the latest version of a document
■ Decreases read throughput
○ Read preference mode can be set
■ See below
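The write path and the effect of reading from a secondary can be simulated with a toy model. The `Node` and `ReplicaSet` classes are assumptions for this sketch: replication is triggered by hand, whereas real secondaries replicate continuously in the background.

```python
class Node:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of oplog entries this node has applied


class ReplicaSet:
    def __init__(self, n_secondaries):
        self.primary = Node()
        self.secondaries = [Node() for _ in range(n_secondaries)]
        self.oplog = []

    def write(self, key, value):
        self.primary.data[key] = value       # 1. apply on the primary
        self.oplog.append((key, value))      # 2. record in the primary's oplog
        self.primary.applied = len(self.oplog)

    def replicate(self, secondary):
        # 3. the secondary fetches the entries it has not seen and applies them
        for key, value in self.oplog[secondary.applied:]:
            secondary.data[key] = value
        secondary.applied = len(self.oplog)

    def read(self, key, node=None):
        # By default, reads go to the primary.
        if node is None:
            node = self.primary
        return node.data.get(key)


rs = ReplicaSet(n_secondaries=2)
rs.write("x", 1)
for s in rs.secondaries:
    rs.replicate(s)

rs.write("x", 2)                                   # not yet replicated
from_primary = rs.read("x")                        # latest value
from_secondary = rs.read("x", rs.secondaries[0])   # stale value

rs.replicate(rs.secondaries[0])
after_catch_up = rs.read("x", rs.secondaries[0])
```

Reading from the primary always returns the latest write; reading from a secondary spreads the load but may return an older version until that secondary has caught up with the oplog.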
Replication: Read Modes
● Data balancing:
○ Background chunk migration
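The idea of balancing data by migrating chunks can be sketched as a simple loop. This is an illustrative model only: the shard names, the chunk labels, the threshold, and the "move from the fullest to the emptiest shard" policy are assumptions and do not reproduce MongoDB's actual balancer.

```python
def balance(shards, threshold=1):
    """Move chunks one at a time until shard sizes differ by at most `threshold`.

    `shards` maps a shard name to the list of chunks it holds and is modified
    in place. Returns the list of migrations as (chunk, source, target) tuples.
    """
    migrations = []
    while True:
        most = max(shards, key=lambda name: len(shards[name]))
        least = min(shards, key=lambda name: len(shards[name]))
        if len(shards[most]) - len(shards[least]) <= threshold:
            return migrations
        chunk = shards[most].pop()
        shards[least].append(chunk)
        migrations.append((chunk, most, least))


shards = {
    "shardA": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "shardB": ["c7"],
    "shardC": ["c8", "c9"],
}
migrations = balance(shards)
sizes = sorted(len(chunks) for chunks in shards.values())
```

Each migration moves a whole chunk, so data is redistributed without splitting individual documents.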
Sharding: Components
● MongoDB runs in a cluster of different node types:
● Shards – store the data
○ Each shard is a replica set
■ Can be a single node