Lecture 6: Document Databases, Data Formats

The document provides an overview of NoSQL databases, focusing on document databases such as MongoDB, which utilize JSON and BSON formats for data storage. It covers key concepts including data types, schema design, indexing, and the internal workings of MongoDB, including replication, sharding, and transactions. The lecture emphasizes the advantages of document databases in terms of flexibility and performance compared to traditional relational databases.
NoSQL Databases

Document Databases
Lecture 6 of NoSQL Databases (PA195)

David Novak, FI, Masaryk University, Brno


http://disa.fi.muni.cz/david-novak/teaching/nosql-databases-2018/
Agenda
● Text (Document) Data Types
○ JSON: JavaScript Object Notation

● Document Databases: MongoDB


○ Database schema: Design
○ Using MongoDB: Updates, Queries, Indexes
○ Behind the scenes
■ BSON format, Distribution, Replication, Transactions, ...
NoSQL Databases and Data Types
1. Key-value stores:
○ Can store any (text or binary) data
■ often, if using JSON data, additional functionality is available

2. Document databases
○ Structured text data - Hierarchical tree data structures
■ typically JSON, XML

3. Column-family stores
○ Rows that have many columns associated with a row key
■ can be written as JSON
Part 1: Document Data Types
Data Formats
● Binary Data (previous lecture)
○ often, we want to store objects (class instances)
○ objects can be binary serialized (marshalled)
■ and kept in a key-value store
○ there are several popular serialization formats
■ Protocol Buffers, Apache Thrift

● Semi-Structured Text Data


○ JSON, BSON (Binary JSON)
■ JSON is currently the number one data format used on the Web
○ XML: eXtensible Markup Language
○ RDF: Resource Description Framework
JSON: Basic Information
● Text-based open standard for data interchange
○ Serializing and transmitting structured data
● JSON = JavaScript Object Notation
○ Originally specified by Douglas Crockford in 2001
○ Derived from JavaScript scripting language
○ Uses conventions of the C-family of languages
● Filename: *.json
● Internet media (MIME) type: application/json
● Language independent
http://www.json.org
JSON: Example

source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
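As a minimal sketch of the notation (a made-up document, not the figure from the book), the snippet below parses a small JSON text with Python's standard json module. All field names and values are purely illustrative.

```python
import json

# A hypothetical JSON document: an object with scalar, nested-object and array values.
text = """
{
  "name": "Alice",
  "age": 30,
  "address": {"city": "Brno", "zip": "60200"},
  "phones": ["+420 111 222 333", "+420 444 555 666"],
  "active": true,
  "nickname": null
}
"""

# JSON object -> dict, array -> list, true -> True, null -> None
doc = json.loads(text)

city = doc["address"]["city"]          # access to a nested field
phone_count = len(doc["phones"])       # arrays keep their order and length

# Serializing and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(doc)) == doc
```

The same text could be read by a parser in any other language, which is the sense in which JSON is language independent.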
JSON Properties
● There is no way to write comments in JSON
○ Comments were originally allowed, but they were removed to protect interoperability (they were being misused to carry parsing directives)

● No way to specify the precision/size of numbers
○ It depends on the parser and the programming language
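The dependence of number precision on the parser can be seen with Python's standard json module. This is only a sketch: the sample text is made up, and the conversion to float merely imitates a parser that stores every number as a 64-bit double.

```python
import json
from decimal import Decimal

text = '{"big": 12345678901234567890, "price": 0.1}'

# Default Python parser: integers become arbitrary-precision int,
# numbers with a fraction become 64-bit floats.
default = json.loads(text)

# Same text, but fractional numbers are parsed into exact decimals instead.
exact = json.loads(text, parse_float=Decimal)

# Imitation of a parser that keeps every number as a 64-bit double:
# a 20-digit integer no longer fits and its low digits are lost.
as_double = float(default["big"])
digits_lost = int(as_double) != default["big"]
```

The JSON text is identical in all three cases; only the choice made by the reader of the data differs.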

● There exists a standard “JSON Schema”


○ A way to specify the schema of the data
○ Field names, field types, required/optional fields, etc.
○ JSON Schema is written in JSON, of course
■ see example below
JSON Schema: Example

source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
Document with JSON Schema

source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
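To give a rough idea of what validating a document against a schema involves, here is a sketch that handles a deliberately tiny subset of JSON Schema (only the `type`, `properties` and `required` keywords). The `validate` helper and the sample schema are hypothetical and are not the example from the book; a real project would use a complete validator library.

```python
import json

# JSON Schema type names mapped to Python types (tiny subset, illustration only).
TYPES = {"object": dict, "array": list, "string": str,
         "integer": int, "number": (int, float), "boolean": bool}

def validate(value, schema):
    """Return a list of error messages; an empty list means the value conforms."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if not isinstance(value, TYPES[expected]) or bool_as_number:
            return [f"expected {expected}, got {type(value).__name__}"]
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"missing required field '{name}'")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors += [f"{name}: {e}" for e in validate(value[name], sub_schema)]
    return errors

# The schema is itself a JSON document.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age":  {"type": "integer"}
  },
  "required": ["name"]
}
""")

ok = validate({"name": "Alice", "age": 30}, schema)
bad = validate({"age": "thirty"}, schema)
```

Here `ok` is an empty list, while `bad` reports both the missing required field and the wrongly typed `age`.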
Part 2: Document Databases
Document Databases: Fundamentals
● Basic concept of data: Document
● Documents are self-describing pieces of data
○ Hierarchical tree data structures
○ Nested associative arrays (maps), collections, scalars
○ XML, JSON (JavaScript Object Notation), BSON, …
● Documents in a collection should be “similar”
○ Their schema can differ
● Often: documents are stored as the values in a key-value store
○ Key-value stores where the values are examinable
○ Building search indexes on various keys/fields
Why Document Databases
● XML and JSON are popular for data exchange
○ Recently mainly JSON
● Data stored in document DB can be used directly

● Databases often store objects from memory


○ Using RDBMS, we must do Object Relational Mapping (ORM)
■ ORM is relatively demanding
○ JSON is much closer to structure of memory objects
■ It was originally for JavaScript objects
■ Object Document Mapping (ODM) is faster
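The closeness of JSON to in-memory objects can be illustrated with a small sketch of an object-document mapping. The `Customer` and `Address` classes and both helper functions are made up for this example; in a relational schema the same object would typically be spread over several tables.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Address:
    city: str
    zip: str

@dataclass
class Customer:
    name: str
    address: Address
    orders: List[str] = field(default_factory=list)

def to_document(customer: Customer) -> str:
    """Object -> JSON document: the nesting of the object is kept as it is."""
    return json.dumps(asdict(customer))

def from_document(text: str) -> Customer:
    """JSON document -> object: the nested Address is rebuilt explicitly."""
    raw = json.loads(text)
    return Customer(name=raw["name"],
                    address=Address(**raw["address"]),
                    orders=raw["orders"])

alice = Customer("Alice", Address("Brno", "60200"), ["o-1", "o-2"])
document = to_document(alice)
restored = from_document(document)
```

The whole object, including its nested address and its list of orders, travels as one document and comes back equal to the original.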
Document Databases: Representatives

[Figure: logos of representative document databases, e.g. MS Azure DocumentDB]

Ranked list: http://db-engines.com/en/ranking/document+store


Part 2.1: MongoDB - Basics & Querying
MongoDB
● Initial release: 2009
○ Written in C++
○ Open-source
○ Cross-platform
● JSON documents
● Basic features:
○ High performance – many indexes
○ High availability – replication + eventual consistency + automatic failover
○ Automatic scaling – automatic sharding across the cluster
○ MapReduce support
http://www.mongodb.org/
MongoDB: Terminology
RDBMS              MongoDB
database instance  MongoDB instance
schema             database
table              collection
row                document
rowid              _id

● each JSON document:
○ belongs to a collection
○ has a field _id
■ unique within the collection
● each collection:
○ belongs to a “database”

http://www.mongodb.org/
Documents
● Use JSON for API communication
● Internally: BSON
○ Binary representation of JSON
○ For storage and inter-server communication

● A document has a maximum size: 16 MB (in BSON)
○ The limit keeps a single document from using too much RAM
○ GridFS can divide larger files into chunks
Document Fields
● Every document must have field _id
○ Used as a primary key
○ Unique within the collection
○ Immutable
○ Any type other than an array
○ Can be generated automatically

● Restrictions on field names:


○ The field names cannot start with the $ character
■ Reserved for operators
○ The field names cannot contain the . character
■ Reserved for accessing sub-fields
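The two field-name restrictions above can be checked on the client side before a document is sent to the server. The following is a sketch only: `invalid_field_names` is a hypothetical helper, not part of any MongoDB driver, and for brevity it descends into subdocuments but not into arrays.

```python
def invalid_field_names(document, path=""):
    """Return dotted paths of field names that break the two rules above."""
    problems = []
    for name, value in document.items():
        full = f"{path}.{name}" if path else name
        if name.startswith("$") or "." in name:
            problems.append(full)
        if isinstance(value, dict):            # check subdocuments recursively
            problems += invalid_field_names(value, full)
    return problems

doc = {
    "_id": 1,
    "name": "Alice",
    "$set": "looks like an operator",
    "address": {"zip.code": "60200", "city": "Brno"},
}
found = invalid_field_names(doc)
```

The name `$set` would collide with an update operator, and `zip.code` would be ambiguous with the dot notation used to reach the field `code` inside a subdocument `zip`.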
Database Schema
● Documents have flexible schema
○ Collections do not enforce specific data structure
○ In practice, documents in a collection are similar

● Key decision of data modeling:


○ References vs. embedded documents

○ In other words: Where to draw lines between aggregates


■ Structure of data
■ Relationships between data
Schema: Embedded Docs
● Related data in a single document structure
○ Documents can have subdocuments (in a field or array)

http://www.mongodb.org/
Schema: Embedded Docs (2)
● Denormalized schema
● Main advantage: manipulate related data in a single operation
● Use this schema when:
○ One-to-one relationships: one doc “contains” the other
○ One-to-many: if children docs have one parent document
● Disadvantages:
○ Documents may grow significantly over time
○ Impacts both read/write performance
■ Document must be relocated on disk if its size exceeds allocated space
■ May lead to data fragmentation on the disk
Schema: References
● Links/references from one document to another
● Normalization of the schema

http://www.mongodb.org/
Schema: References (2)
● More flexibility than embedding
● Use references:
○ When embedding would result in duplication of data
■ and only insignificant boost of read performance
○ To represent more complex many-to-many relationships
○ To model large hierarchical data sets

● Disadvantages:
○ Can require more roundtrips to the server
■ Documents are accessed one by one
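The trade-off between embedding and references can be sketched with plain Python dictionaries. Everything here is an assumption made for illustration: collections are modelled as dicts keyed by `_id`, and `find_by_id` stands in for one query sent to the server.

```python
# Embedded (denormalized): the addresses live inside the user document.
user_embedded = {
    "_id": "u1",
    "name": "Alice",
    "addresses": [{"city": "Brno"}, {"city": "Praha"}],
}

# Referenced (normalized): the user document holds only the _id values.
users = {"u1": {"_id": "u1", "name": "Alice", "address_ids": ["a1", "a2"]}}
addresses = {"a1": {"_id": "a1", "city": "Brno"},
             "a2": {"_id": "a2", "city": "Praha"}}

lookups = 0  # counts simulated roundtrips to the server

def find_by_id(collection, _id):
    """Stand-in for one query by _id against a collection."""
    global lookups
    lookups += 1
    return collection[_id]

def cities_embedded(user):
    # The document already contains everything; no further lookups are needed.
    return [a["city"] for a in user["addresses"]]

def cities_referenced(user_id):
    # One lookup for the user, then one more per referenced address.
    user = find_by_id(users, user_id)
    return [find_by_id(addresses, a)["city"] for a in user["address_ids"]]

embedded_result = cities_embedded(user_embedded)
referenced_result = cities_referenced("u1")
```

Both variants return the same cities, but the referenced one needed three lookups. In exchange, an address shared by many users would be stored only once.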
Part 2.2: MongoDB - Indexes
Indexes
● Indexes are the key to MongoDB performance
○ Without indexes, MongoDB must scan every document in a collection to select matching documents
● Indexes store some fields in an easily accessible form
○ An index stores the values of one or more fields, ordered by value

● Defined per collection


● Purpose:
○ To speed up common queries
○ To optimize performance of other specific operations
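The difference between a collection scan and an ordered index can be sketched as follows. This is an illustration under simplifying assumptions, not MongoDB's implementation: the index is a sorted Python list searched with bisect, whereas MongoDB uses a B-tree, and the sample collection is made up.

```python
from bisect import bisect_left, bisect_right

collection = [
    {"_id": 1, "name": "Alice", "age": 31},
    {"_id": 2, "name": "Bob",   "age": 25},
    {"_id": 3, "name": "Carol", "age": 31},
    {"_id": 4, "name": "Dave",  "age": 47},
]

def scan(field, value):
    """No index: every document in the collection is examined."""
    return [d["_id"] for d in collection if d.get(field) == value]

def build_index(field):
    """Index = field values ordered by value, each paired with the document _id."""
    entries = sorted((d[field], d["_id"]) for d in collection if field in d)
    keys = [value for value, _ in entries]
    ids = [_id for _, _id in entries]
    return keys, ids

def range_lookup(index, low, high):
    """Binary search for both ends of the range, then read one contiguous slice."""
    keys, ids = index
    return ids[bisect_left(keys, low):bisect_right(keys, high)]

age_index = build_index("age")
same_as_scan = range_lookup(age_index, 31, 31) == scan("age", 31)
older_than_30 = range_lookup(age_index, 30, 50)
```

Because the entries are ordered, equality and range queries touch only the matching part of the index instead of the whole collection.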
Index Types
● Default: _id
○ Exists by default
■ If applications do not specify _id, it is created.
○ Unique
● Single Field
○ User-defined indexes on a single field of a document
● Compound
○ User-defined indexes on multiple fields
● Multikey index
○ To index the content stored in arrays
○ Creates separate index entry for each array element
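The idea of a multikey index, one entry per array element, can be sketched like this. The `posts` collection and the `build_multikey_index` helper are hypothetical, and a plain dictionary is used in place of the real index structure.

```python
from collections import defaultdict

posts = [
    {"_id": 1, "title": "Intro to JSON",   "tags": ["json", "formats"]},
    {"_id": 2, "title": "MongoDB indexes", "tags": ["mongodb", "indexes"]},
    {"_id": 3, "title": "BSON in MongoDB", "tags": ["mongodb", "formats"]},
]

def build_multikey_index(documents, field):
    """Create one index entry per array element, pointing back to the document."""
    index = defaultdict(list)
    for doc in documents:
        values = doc.get(field, [])
        if not isinstance(values, list):     # a scalar yields a single entry
            values = [values]
        for value in values:
            index[value].append(doc["_id"])
    return index

tag_index = build_multikey_index(posts, "tags")
entry_count = sum(len(ids) for ids in tag_index.values())
```

Three documents with two tags each produce six index entries, so a query for a single tag finds its documents without looking inside every array.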
Index Types (3)
● Ordered Index
○ B-Tree (see above)
● Hash Indexes
○ Index the hash of a field's value for fast, O(1) lookups
■ Only equality matches
● Geospatial Index
○ 2d indexes = use planar geometry when returning results
■ For data representing points on a two-dimensional plane
○ 2dsphere indexes = spherical (Earth-like) geometry
■ For data representing longitude, latitude
● Text Indexes
○ Searching for string content in a collection
Part 2.3: MongoDB - Behind the Scenes
MongoDB: Behind the Scenes
● BSON format
● Distribution models
○ Replication
○ Sharding
○ Balancing
● MapReduce
● Transactions
● Journaling
BSON (Binary JSON) Format
● Binary-encoded serialization of JSON documents
○ Representation of documents, arrays, JSON simple data types + other types (e.g., date)

http://www.bsonspec.org/
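To make the binary encoding more concrete, here is a minimal sketch of a BSON encoder that covers only three element types (string, 32-bit integer, embedded document). The function names are made up for this example; real applications would rely on the encoder shipped with a MongoDB driver.

```python
import struct

def encode_cstring(text):
    """BSON cstring: UTF-8 bytes followed by a zero byte."""
    return text.encode("utf-8") + b"\x00"

def encode_element(name, value):
    """Encode one field as: type byte, field name (cstring), payload."""
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # type 0x02: int32 byte length (including the trailing zero), then the bytes
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        # type 0x10: 32-bit little-endian signed integer (larger values are not handled)
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    if isinstance(value, dict):
        # type 0x03: embedded document
        return b"\x03" + encode_cstring(name) + encode_document(value)
    raise TypeError(f"unsupported type: {type(value).__name__}")

def encode_document(doc):
    """Document: int32 total size, the elements, and a terminating zero byte."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

encoded = encode_document({"hello": "world"})
declared_size = struct.unpack("<i", encoded[:4])[0]
```

The length prefixes are what distinguish BSON from JSON text: a reader can skip a field or a whole subdocument without parsing its content.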
Data Replication
● Master/slave replication
● Replica set = group of instances that host the same data set
○ primary (master) – handles all write operations
○ secondaries (slaves) – apply operations from the primary so that they have the same data set
Replication: Read & Write
● Write operation:
1. Write operation is applied on the primary
2. Operation is recorded to primary’s oplog (operation log)
3. Secondaries replicate the oplog + apply the operations to their data sets
● Read: All replica set members can accept reads
○ By default, an application directs its reads to the primary
■ Guarantees the latest version of a document
■ Decreases read throughput
○ Read preference mode can be set
■ See below
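The three write steps and the effect of reading from a secondary can be sketched with a toy in-memory model. The classes and the operation format are assumptions made for this illustration; real secondaries pull oplog entries from the primary asynchronously over the network.

```python
class Member:
    """One replica set member holding a copy of the data set."""
    def __init__(self):
        self.data = {}
        self.applied = 0              # number of oplog entries applied so far

    def apply(self, op):
        kind, key, value = op
        if kind == "set":
            self.data[key] = value
        elif kind == "delete":
            self.data.pop(key, None)
        self.applied += 1

class Primary(Member):
    def __init__(self):
        super().__init__()
        self.oplog = []               # ordered log of all write operations

    def write(self, op):
        self.apply(op)                # 1. apply the write on the primary
        self.oplog.append(op)         # 2. record it in the oplog

class Secondary(Member):
    def sync_from(self, primary):
        # 3. fetch the oplog entries not seen yet and replay them in order
        for op in primary.oplog[self.applied:]:
            self.apply(op)

primary, secondary = Primary(), Secondary()
primary.write(("set", "x", 1))
stale_read = secondary.data.get("x")  # the secondary has not replicated yet
primary.write(("set", "y", 2))
primary.write(("delete", "x", None))
secondary.sync_from(primary)
```

Before the synchronization, a read from the secondary misses the value of `x`, which is why only reads from the primary guarantee the latest version of a document.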
Replication: Read Modes

Read Preference Mode   Description
primary                operations read from the primary of the replica set
primaryPreferred       operations read from the primary, but if it is unavailable, operations read from secondary members
secondary              operations read from the secondary members
secondaryPreferred     operations read from secondary members, but if none is available, operations read from the primary
nearest                operations read from the nearest member (= shortest ping time) of the replica set
Replica Set Elections
● If the primary becomes unavailable, an election determines a new primary
○ Elections need some time
○ No primary => no writes
Replica Set: CAP
● Let us have three nodes in the replica set
○ Let’s say that the master is disconnected from the other two
■ The distributed system is partitioned
○ The master finds out that it is alone
■ Specifically, that it can communicate with fewer than half of the nodes
■ And it steps down from being master (handles just reads)
○ The other two slaves “think” that the master failed
■ Because they form a partition with more than half of the nodes
■ And elect a new master
● In the case of just two nodes in the RS
○ Both partitions will become read-only
■ A similar case can occur with any even number of nodes in the RS
○ Therefore, we can always add an arbiter node to an even RS
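The majority rule behind this scenario can be written down in a few lines. This is a sketch of the counting argument only; `has_majority` and `partition_roles` are hypothetical helpers and do not model the election protocol itself.

```python
def has_majority(reachable, total):
    """A partition may have a primary only if it sees more than half of all members."""
    return reachable > total // 2

def partition_roles(total, partition_sizes):
    """For each partition (given by its size), report whether it can accept writes."""
    if sum(partition_sizes) != total:
        raise ValueError("partition sizes must add up to the replica set size")
    return ["writable" if has_majority(size, total) else "read-only"
            for size in partition_sizes]

three_nodes = partition_roles(3, [1, 2])   # old primary alone vs. the other two
two_nodes = partition_roles(2, [1, 1])     # nobody sees more than half
four_nodes = partition_roles(4, [2, 2])    # an even split of an even replica set
five_nodes = partition_roles(5, [2, 3])    # e.g. four data nodes plus an arbiter
```

With an odd number of voting members, any split into two partitions leaves exactly one side with a majority, which is the reason for adding an arbiter to an even replica set.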
Sharding
● MongoDB enables collection partitioning (sharding)
Collection Partitioning
● MongoDB partitions a collection’s data by the shard key
○ Indexed field(s) that exist in each document in the collection
■ Immutable
○ Divided into chunks, distributed across shards
■ Range-based partitioning
■ Hash-based partitioning
○ When a chunk grows beyond the size limit, it is split
■ Metadata change, no data migration

● Data balancing:
○ Background chunk migration
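Range-based and hash-based partitioning, and a chunk split as a pure metadata change, can be sketched as follows. The sketch makes several simplifying assumptions: there is one chunk per shard, the split points are arbitrary, and the hash-based variant takes an MD5 hash modulo the number of shards, whereas MongoDB divides the range of hashed values into chunks.

```python
import hashlib
from bisect import bisect_right

SHARDS = ["shard0", "shard1", "shard2"]

# Split points of the shard key; they define the chunks [..h), [h..p), [p..).
BOUNDS = ["h", "p"]

def range_shard(key):
    """Range-based: find the chunk whose key range contains the shard key."""
    return SHARDS[bisect_right(BOUNDS, key)]

def hash_shard(key):
    """Hash-based: partition by a hash of the shard key instead of the key itself."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

def split_chunk(bounds, chunk_index, middle_key):
    """Splitting a chunk only adds a split point to the metadata; no document moves."""
    low = bounds[chunk_index - 1] if chunk_index > 0 else None
    high = bounds[chunk_index] if chunk_index < len(bounds) else None
    if (low is not None and middle_key <= low) or (high is not None and middle_key >= high):
        raise ValueError("split point must lie inside the chunk")
    return bounds[:chunk_index] + [middle_key] + bounds[chunk_index:]

names = ["alice", "bob", "harry", "mallory", "peter", "zoe"]
by_range = {n: range_shard(n) for n in names}
by_hash = {n: hash_shard(n) for n in names}
new_bounds = split_chunk(BOUNDS, 1, "l")    # split the chunk [h..p) at "l"
```

Range-based partitioning keeps neighbouring keys on the same shard, which suits range queries, while hashing spreads them out. After a split, the balancer may later migrate one of the resulting chunks to another shard in the background.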
Sharding: Components
● MongoDB runs in a cluster of different node types:
● Shards – store the data
○ Each shard is a replica set
■ Can be a single node

● Query routers – interface with client applications


○ Direct operations to the relevant shard(s)
■ + return the result to the client
○ More than one => to divide the client request load
● Config servers – store the cluster’s metadata
○ Mapping of the cluster’s data set to the shards
○ Recommended number: 3
Sharding: Diagram
Journaling
● Write operations are applied in memory and written to a journal before they are applied to the data files (on disk)
○ To restore consistent state after a hard shutdown
○ Can be switched on/off
● Journal directory – holds journal files
● Journal file = write-ahead redo logs
○ Append only file
○ Deleted when all the writes are durable
○ When size > 1GB of data, MongoDB creates a new file
■ The size can be modified
● Clean shutdown removes all journal files
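The write-ahead principle can be sketched with a toy store. This is not MongoDB's journal format: `JournaledStore` is a hypothetical class, the journal is a text file with one JSON entry per line, and the dictionary `data` stands in for state that has not yet reached the data files.

```python
import json
import os
import tempfile

class JournaledStore:
    """Toy store: every write is appended to a journal before it is applied."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}                  # stands in for state not yet in the data files
        self._replay()                  # recover after a hard shutdown

    def _replay(self):
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as journal:
            for line in journal:        # redo the logged writes in order
                key, value = json.loads(line)
                self.data[key] = value

    def put(self, key, value):
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps([key, value]) + "\n")   # 1. append to the journal
            journal.flush()
            os.fsync(journal.fileno())                       # make the entry durable
        self.data[key] = value                               # 2. apply the write

    def clean_shutdown(self):
        # All writes are assumed durable in the data files, so the journal can go.
        if os.path.exists(self.journal_path):
            os.remove(self.journal_path)

path = os.path.join(tempfile.mkdtemp(), "journal.log")
store = JournaledStore(path)
store.put("x", 1)
store.put("y", 2)
del store                               # simulate a crash: in-memory state is lost
recovered = JournaledStore(path)        # replaying the journal restores the writes
```

Since the journal is append-only, each write costs one sequential disk write, and recovery is a single pass over the file.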
Transactions
● Write ops: atomic at the level of single document
○ Including nested documents
○ Sufficient for many cases, but not all
○ When a write operation modifies multiple documents, other operations may interleave
● Transactions:
○ Isolation of a write operation that updates multiple documents
○ Two-phase commit
References
● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
● P. J. Sadalage, M. Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 2012. 192 p.
● RNDr. Irena Holubová, Ph.D. MFF UK course NDBI040: Big Data Management and NoSQL Databases.
● MongoDB Manual: http://docs.mongodb.org/manual/
