Cosmos DB
This course covers the architecture and configuration of Azure Cosmos DB, and how to
work with it at cloud scale in the NoSQL space.
Disclaimer: The Azure Portal interface changes continuously with updates. The
videos may differ from the current interface, but the core functionality remains the
same.
Cosmos DB is a schema-less NoSQL database that supports SQL-based queries (with a
few limitations).
It provides storage across multiple Azure geographic regions, with elastic and
independent throughput scaling.
Key Features
Cosmos DB provides:
Horizontal scaling
Latency guarantees
High availability
SLAs
Cosmos DB can provide near-real-time response times while handling massive volumes
of reads and writes at a global scale for web, mobile, gaming, and IoT
applications.
Move over to the next cards to learn how these features can be leveraged according
to our requirements.
Global Distribution
Global distribution, also called turnkey global distribution, places data close to
customers across multiple Azure regions in order to keep latency as low as
possible.
Requests are always sent to the nearest data center through multi-homing APIs,
without any configuration changes.
Once the write regions and read regions are set up, the APIs handle the rest.
Multi-Model Support
An atom-record-sequence (ARS) based data model underlies Cosmos DB, providing
native support for multiple data models such as document, key-value, graph, table,
and column-family.
Currently, the APIs for the following data models are supported with SDKs in
multiple languages:
SQL API
MongoDB API
Cassandra API
Gremlin API
Table API
Additional data models and APIs are planned for the future.
You will learn about these APIs and their usage as you move further into the
course.
All features are supported by every available data center around the world.
Schema-less Design
The application schema can be iterated rapidly without having to manage the database
schema or indexes.
Availability
Guaranteed 99.99% availability SLA for all single region database accounts.
Dynamic region priority settings and failure simulation in one or more regions, with
zero-data-loss guarantees, help test the end-to-end availability of the application
(beyond just the database).
Application Flexibility
Guaranteed end-to-end latency, at the 99th percentile, of under 10 ms for reads and
under 15 ms for indexed writes, for a typical 1 KB item within the same Azure
region, with a median latency under 5 ms.
Consistency ranges from strong, SQL-like consistency to relaxed, NoSQL-like eventual
consistency; the level can be chosen, and paid for, as per requirements.
Cost Management
As Per Microsoft Claims:
What's Next?
With this overview on the key features of Cosmos DB, move over to the next topics
to get started by working with Cosmos DB as a business solution.
Resources
In Azure Cosmos DB, databases are containers for various collections used by each
API.
Resource Hierarchy
Watch the following video to get started with Cosmos DB and learn about the different
resources, their underlying hierarchy, and how they are provisioned.
Types of Resources
System resources have a fixed schema: database accounts, databases, collections,
users, permissions, stored procedures, triggers, and UDFs.
In Cosmos DB, all the resources are represented and managed as standard-compliant
JSON.
Resource Properties
_rid: Hierarchical, unique, and system-generated resource identifier.
Encapsulation
id is a user-defined string (up to 256 characters) that names a resource. It can be
system generated for documents, and it is unique within the scope of a specific
parent resource.
_rid is a system-generated, hierarchical resource identifier (RID) for each resource.
The entire resource hierarchy, along with its internal representation, is encoded in
the RID, which enforces referential integrity in a distributed manner.
The RID is unique within a database account and is used internally for efficient
routing, without requiring cross-partition lookups.
The _rid and _self values represent a resource in canonical and alternate forms.
Both the id and the _rid properties are supported by the REST APIs for resource
addressing and request routing.
/dbs/{dbName}/colls/{collName}/docs/{docId}
/dbs/{dbName}/users/{userId}/permissions/{permissionId}
Here, /dbs is mandatory and is prefixed with the API endpoint of the database
account. Path segments in the format above are appended to address a specific
resource.
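As a minimal sketch of the addressing scheme above, the following Python helpers
assemble id-based resource paths and prefix them with an account endpoint. The
function names and the example endpoint are hypothetical and used only for
illustration; real requests also need authorization headers, which are omitted.

```python
from urllib.parse import quote


def _join(segments):
    # URL-encode every segment so ids containing spaces or slashes stay addressable.
    return "/" + "/".join(quote(s, safe="") for s in segments)


def document_path(db_name: str, coll_name: str, doc_id: str) -> str:
    """Build /dbs/{dbName}/colls/{collName}/docs/{docId}."""
    return _join(["dbs", db_name, "colls", coll_name, "docs", doc_id])


def permission_path(db_name: str, user_id: str, permission_id: str) -> str:
    """Build /dbs/{dbName}/users/{userId}/permissions/{permissionId}."""
    return _join(["dbs", db_name, "users", user_id, "permissions", permission_id])


def resource_url(endpoint: str, path: str) -> str:
    """Prefix a resource path with the account endpoint (hypothetical value)."""
    return endpoint.rstrip("/") + path
```

For example, resource_url("https://example-account.documents.azure.com/",
document_path("store", "orders", "42")) yields the full address of one document.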
Account Management
The Azure portal is commonly used to manage database accounts, and multiple accounts
can be created under an Azure subscription with admin access.
REST APIs and client SDKs can also be used to manage accounts.
Databases
According to Microsoft,
Databases are elastic by default, with SSD-backed document storage (from GBs to
petabytes) and provisioned throughput.
Their resource model is similar to that of other Cosmos DB resources, so CRUD and
enumeration are straightforward using REST APIs or client SDKs.
Strong consistency is guaranteed for reading or querying the metadata of a database
resource.
Collections
By Microsoft Definition,
Multi-document Transactions
Azure Cosmos DB implicitly wraps the JavaScript-based stored procedures and
triggers within an ambient ACID transaction. An exception in JavaScript execution
aborts the transaction.
It allows:
UDFs
Application logic can be written to run directly within a transaction inside of the
database engine.
UDFs can only be used for queries but not for data manipulation.
UDFs, triggers, and stored procedures registered with a collection are precompiled,
stored as byte code, and can be managed through REST APIs.
Documents
You can insert, replace, delete, read, enumerate, and query arbitrary JSON
documents in a collection (maximum document size: 2 MB).
Relationships between documents do not need to be codified through distinguishing
properties or special annotations. The relational and hierarchical query operators
of the Cosmos DB SQL syntax make this possible.
Database consistency policy defines the read consistency level of documents and can
be overridden on a per-request basis.
Media
Binary blobs/media can be stored with Azure Cosmos DB (up to 2 GB per account) or in
a remote media store.
Read consistency for querying attachments follows the indexing mode of the
collection; for the "consistent" mode, it follows the account's consistency policy.
Users
A Cosmos DB user is a logical namespace for grouping a set of permissions.
Permissions for a user correspond to the access control over various collections,
documents, attachments, etc.
Permissions
From an access control perspective, database accounts, databases, users, and
permissions are considered administrative resources, which require administrative
permissions and the master key.
Collections, documents, attachments, stored procedures, triggers, and UDFs, on the
other hand, are scoped under a given database and are considered application
resources, which require a resource key.
The master key is required to create a permission resource under a user, which in
turn generates a resource key.
REST APIs and client SDKs support permission management.
Strong consistency is provided for reading or querying the metadata of a
permission.
Global Distribution
Cosmos DB currently supports writes to a single region but reads from multiple
regions.
Consistency
Azure Cosmos DB provides five levels of consistency. Choosing a consistency strategy
is a trade-off between performance and data confidence.
Throughput
The throughput of Azure Cosmos DB is elastic and scalable, and throughput billing
depends on request units (RUs).
A request unit accounts for resources such as CPU, memory, and I/O that are required
to perform the common CRUD operations.
The required RUs can be approximated from the number of reads and writes needed per
second, and the throughput configured accordingly. The throughput can also be raised
for the peak times of application usage.
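The approximation described above can be sketched as simple arithmetic. The
per-operation costs below (about 1 RU for a 1 KB read and about 5 RU for a 1 KB
write) and the headroom percentage are illustrative assumptions, not figures from
this course; actual costs depend on item size, indexing, and consistency level and
should be measured.

```python
import math


def estimate_rus(reads_per_sec: int, writes_per_sec: int,
                 read_cost: float = 1.0, write_cost: float = 5.0,
                 headroom_pct: int = 20) -> int:
    """Rough RU/s estimate, padded for peaks and rounded up to a multiple of 100.

    read_cost and write_cost are ASSUMED per-operation RU charges.
    """
    raw = reads_per_sec * read_cost + writes_per_sec * write_cost
    padded = raw * (100 + headroom_pct) / 100
    return int(math.ceil(padded / 100) * 100)
```

With these assumed costs, 500 reads and 100 writes per second come to 1000 RU/s
before headroom, or 1200 RU/s with 20% added for peak times.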
Partitioning
Partition Strategy and partition key are the main elements of partitioning Cosmos
DB.
Each collection has its own partition key defining the distribution of data in
physical partitions underneath.
There can be any number of values in a partition key and any number of partitions
under the hood.
The partition keys are spread over ranges and are managed by the database engine.
Partition usage metrics can be analyzed to prevent overutilization and throttling.
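To illustrate how a partition key value determines where data lands, here is a
minimal sketch of hash-based placement. The hash function (SHA-256 modulo a
partition count) is an assumption chosen for illustration only; the database engine
manages its own key ranges and its actual scheme is not described in this course.

```python
import hashlib
from collections import Counter


def physical_partition(partition_key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition index (illustrative hash only)."""
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def partition_load(key_values, partition_count: int) -> Counter:
    """Count items per partition; a skewed count hints at a hot partition."""
    return Counter(physical_partition(k, partition_count) for k in key_values)
```

Because every item with the same key value maps to the same partition, a key with
few distinct values concentrates load, which is the overutilization the metrics are
meant to reveal.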
Indexing
Cosmos DB supports automatic indexing. All properties and entities ingested into the
database, in any model, are fully indexed while maintaining read and write
efficiency.
The indexing policy, set per collection, further tunes indexing for performance.
Lazy indexing, when selected, builds the index as a background task using leftover
request units on your container.
Security
Cosmos DB provides network security, encryption and authorization.
Network Security:
IP Firewalls
Whitelisting secure/trusted URLs.
Encryption:
Time to Live: An expiration time for documents such as logs and machine-generated
data can be configured as a TTL per collection and overridden on a per-document
basis, in order to purge data that is no longer needed from Cosmos DB.
Unique Keys: Developers can enforce data integrity by declaring one of the
properties in the collection as a unique key. For example, in a social application,
a property such as email ID can be declared as a unique key in order to avoid
duplication.
Backup and Restore: Automatic backups are taken every 4 hours, and the latest 2
copies are retained. If a collection is deleted, its backup is retained for 30
days. Online backups do not consume any request units, and a complete restore can be
made to another account so that the user can check the data for corruption.
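The Time to Live rule above (a per-collection default that a document can override)
can be sketched as follows. The property names `ttl` and `_ts` (last-modified time
in seconds) and the convention that -1 means "never expire" are assumptions made
for this sketch rather than details stated in this course.

```python
from typing import Optional


def is_expired(item: dict, default_ttl: Optional[int], now: int) -> bool:
    """Decide whether an item should be purged at time `now` (epoch seconds).

    Assumed semantics: no collection-level TTL disables expiry entirely;
    an item-level `ttl` overrides the collection default; -1 means never expire.
    """
    if default_ttl is None:
        return False  # TTL not enabled on the collection
    ttl = item.get("ttl", default_ttl)
    if ttl == -1:
        return False
    return now - item["_ts"] >= ttl
```

For instance, a log document last modified at t=1000 in a collection with a 60-second
default would be treated as expired from t=1060 onward, unless it carries its own
`ttl` value.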
Account Management
To Delete a Database Account:
az cosmosdb delete --name <account-name> --resource-group <resource-group>
To retrieve details of database account:
To Change Failover priority:
az cosmosdb failover-priority-change
    --failover-policies <region1>=<priority1> <region2>=<priority2>
    --name <account-name>
    --resource-group <resource-group>
To list access keys/read-only access keys:
az cosmosdb list-keys --name <account-name> --resource-group <resource-group>
az cosmosdb list-read-only-keys --name <account-name> --resource-group <resource-group>
To regenerate access keys:
az cosmosdb regenerate-key
    --key-kind {primary, primaryReadonly, secondary, secondaryReadonly}
    --name <account-name>
    --resource-group <resource-group>
List the accounts under the user/resource group:
az cosmosdb list [--resource-group]
To list connection strings for Database Account:
az cosmosdb list-connection-strings --name <account-name> --resource-group <resource-group>
Database Management
CRUD operations can be performed on databases using the Azure CLI. All the commands
share a similar syntax, with minor changes.
Collection Management
Collection Management commands also follow a similar pattern as the Database
commands with minor changes between themselves.
To Create/Update a Collection:
az cosmosdb collection create/update --collection-name <collection-name>
    --db-name <database-name>
    [--default-ttl]
    [--indexing-policy]
    [--key]
    [--name]
    [--partition-key-path]
    [--resource-group-name]
    [--throughput]
    [--url-connection]
Note: [--partition-key-path] is not available with update.
SQL API
Azure Cosmos DB evolved from Azure DocumentDB, and the SQL API is the current name
of what was formerly the DocumentDB API.
The SQL API allows interaction with your NoSQL database using JavaScript and SQL.
Features
SQL API in Cosmos DB offers:
Multi-region replication.
Automatic Indexing.
Querying
The query language is similar to SQL, and queries can be based on document
identifiers, properties, complex objects, or the existence of specific properties.
Queries can be run without knowing or enforcing a specific schema on each
document.
Stored procedures, triggers, and user-defined functions (UDFs) within the database
can be written in JavaScript.
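To make schema-free querying concrete, here is a small in-memory sketch of filtering
documents by the existence of a property and by a nested property value. It does
not call Cosmos DB; the sample documents and helper names are hypothetical, and the
query strings in the comments are assumed equivalents in SQL API syntax.

```python
def where_defined(documents, name):
    """Keep documents that have the property at all.

    Assumed SQL API equivalent: SELECT * FROM c WHERE IS_DEFINED(c.email)
    """
    return [d for d in documents if name in d]


def where_equals(documents, dotted_path, value):
    """Keep documents whose (possibly nested) property equals `value`.

    Assumed SQL API equivalent: SELECT * FROM c WHERE c.address.city = "Pune"
    """
    def lookup(doc):
        current = doc
        for part in dotted_path.split("."):
            if not isinstance(current, dict) or part not in current:
                return None  # property missing: the document simply does not match
            current = current[part]
        return current
    return [d for d in documents if lookup(d) == value]


# Documents in one collection need not share a schema.
SAMPLE = [
    {"id": "1", "email": "a@example.com", "address": {"city": "Pune"}},
    {"id": "2", "address": {"city": "Oslo"}},
    {"id": "3", "email": "c@example.com"},
]
```

Documents that lack the queried property are skipped rather than causing an error,
which is what allows one collection to hold differently shaped documents.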
Where is it used?
The SQL API is commonly used in the gaming, IoT, and retail industries. Watch the
video to learn more about its use cases.
Data Modelling
SQL API stores data in NoSQL form (JSON Documents).
The rule of thumb for modelling is to remember that the database is distributed, not
a single-machine SQL database.
The data model must leverage JSON's ability to store arrays and objects.
Document Management
Portal: The Data Explorer is a tool embedded within the Azure Cosmos DB blade in
the Azure Portal that allows you to view, modify and add documents to your SQL API
collections.
SDK: SDKs are available for many languages, such as Python, Java, .NET, and Node.js.
REST API: JSON documents are managed through a well-defined hierarchy of database
resources and are addressable using a unique URI.
Data Migration Tool: The open-source data migration tool allows you to import data
into a SQL API collection from various sources including MongoDB, SQL Server, Table
Storage, Amazon DynamoDB, HBase and other DocumentDB API collections.
SQL Querying
Watch the video to learn how to use SQL querying with SQL API.
Programmability Features
JavaScript functions can be written and stored as collection-level resources to run
logic in the database. They can be:
Stored Procedures: Run a JavaScript function and return the result.
Pre/Post Triggers: Actions that run before or after an operation on a document; the
CRUD operations on which they run can be specified.
User-defined Functions (UDF): Extend the SQL query language available within the
SQL API by implementing custom functions.
More on Programmability
Watch the following video to learn more about the programmability of SQL API.
Advanced Features
SQL/DocumentDB API supports advanced features such as Change Feed and Geo-spatial
Data.
Geo-spatial data is represented in the GeoJSON format for querying and processing,
and can be Point data for coordinates or Polygon data for boundaries.
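As a brief sketch of the two GeoJSON shapes mentioned above, the helpers below build
a Point and a Polygon as plain Python dictionaries. GeoJSON orders coordinates as
longitude first, then latitude, and a polygon ring must be closed; the helper names
and sample coordinates are made up for illustration.

```python
import json


def point(longitude: float, latitude: float) -> dict:
    """GeoJSON Point: coordinates are [longitude, latitude]."""
    return {"type": "Point", "coordinates": [longitude, latitude]}


def polygon(ring) -> dict:
    """GeoJSON Polygon with one outer ring.

    The ring needs at least four positions and must end where it starts.
    """
    ring = [list(p) for p in ring]
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("ring must be closed and have at least 4 positions")
    return {"type": "Polygon", "coordinates": [ring]}


# A document can carry the shape as an ordinary JSON property.
store = {"id": "store-1", "location": point(-122.12, 47.66)}
store_json = json.dumps(store)
```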
Watch the following video to learn about the Change Feed in SQL API of Azure Cosmos
DB.
MongoDB API
Support for the MongoDB wire protocol allows Cosmos DB databases to be used as the
data store for apps written for MongoDB, with existing drivers, expertise, tools,
and code.
Features
Apart from features like automatic indexing, replication, and elastic scalability,
Cosmos DB's tunable consistency works in tandem with MongoDB's consistency levels:
Cosmos DB         | MongoDB
Eventual          | Eventual
Consistent Prefix | Eventual with consistent order
Session           | Eventual with consistent order
Bounded Staleness | Strong
Strong            | Strong
Where is it Used?
Watch the video to learn where the MongoDB API can be used.
Query Support
The MongoDB API supports a comprehensive list of query language constructs.
These include database commands, aggregate commands, operators, and several
additional operators.
Note: The MongoDB API does not support the $eval and $where operators or the
cursor.sort() method.
User and role management: Users and roles are not yet supported. Instead, role-based
access control (RBAC) and read-write and read-only passwords/keys are available,
and the keys can be obtained through the Azure portal (Connection String page).
Write Concern: The write concern specified by client code is ignored. All writes are
automatically quorum writes by default.
Using Tools
Watch the following video to learn how existing MongoDB expertise can be used along
with the MongoDB API in a Cosmos DB implementation.
GraphDB and Gremlin
Gremlin API of Cosmos DB enables us to store and operate on graph data.
Multiple models, like document and graph, can be used within the same
containers/databases.
When a document container is used to store graph data along with documents,
productivity improves because SQL queries over JSON and Gremlin queries for graphs
can run over the same data.
Features
Basic features of Azure Cosmos DB such as elastic throughput, replication,
availability, tunable consistency, and automatic indexing are all available with the
Gremlin API.
Fast queries and traversals: The Gremlin query language is fully supported, and
automatic indexing speeds up traversals through heterogeneous vertices.
Where is it Used?
Watch the following video to learn where Gremlin API/Graph DB can be used.
Modelling Data
Data Modelling is similar to that of any Graph DB.
The entities/objects are known as vertices and the relationships between them are
known as edges.
Labels can be used on edges and vertices to describe their type/group them.
Note: For more on graph data modelling, please go through the Neo4j course.
Neo4j is compatible with Apache TinkerPop, and its query language, Cypher, is
similar to Gremlin.
Azure Cosmos DB uses the JSON format when returning results from Gremlin
operations.
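The vertex/edge/label model described above can be sketched with plain Python data
structures. This is an in-memory illustration, not the Gremlin API: the sample
graph and the `out` helper are hypothetical, and the Gremlin traversal shown in the
comment is given as an assumed equivalent.

```python
# Vertices are entities; each has an id, a label (its type), and properties.
VERTICES = {
    "thomas": {"label": "person", "properties": {"age": 44}},
    "mary": {"label": "person", "properties": {"age": 39}},
    "laptop": {"label": "device", "properties": {"os": "linux"}},
}

# Edges are relationships; each has a label and points from outV to inV.
EDGES = [
    {"label": "knows", "outV": "thomas", "inV": "mary"},
    {"label": "uses", "outV": "thomas", "inV": "laptop"},
    {"label": "uses", "outV": "mary", "inV": "laptop"},
]


def out(vertex_id: str, edge_label: str) -> list:
    """Follow outgoing edges with the given label.

    Assumed Gremlin equivalent: g.V('thomas').out('knows')
    """
    return [e["inV"] for e in EDGES
            if e["outV"] == vertex_id and e["label"] == edge_label]


def by_label(label: str) -> list:
    """Group vertices by label, e.g. all 'person' vertices."""
    return sorted(v for v, data in VERTICES.items() if data["label"] == label)
```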
Traversals
Watch the following video to learn traversing graphs using Gremlin Query Language.
Cassandra API
Cassandra is a column-based NoSQL database that is queried using the Cassandra Query
Language (CQL).
Applications written for Apache Cassandra can use the Cassandra API provided by
Azure Cosmos DB to leverage basic Cosmos DB features such as global distribution and
elastic throughput.
Cassandra API is currently in preview and can be used similarly as the MongoDB API.
CQLv4-compatible applications with Apache-licensed drivers can communicate with the
Azure Cosmos DB Cassandra API.
The code base may require trivial changes to work with the wire-protocol-level
compatibility, via tools and SDKs.
Interaction is possible through Cassandra Query Language based tools (like CQLSH)
and Cassandra client drivers.
The use cases are similar to those of a Cassandra database, and include messaging,
social media, and email applications.
CQL can be used after connecting to the Cosmos DB over the CQL Shell.
CQL shell can also be used in the Azure Portal.
The Cassandra API also supports all the tools available for the Cassandra database.
However, a few limitations apply to queries involving joins, which may be lifted
once the API comes out of its current preview mode.
Additional Features
Unlike MongoDB, Cassandra in its licensed distribution is built for performance
rather than developer friendliness.
With Cosmos DB and the Cassandra API, database consistency and replication are set
up through the Cosmos DB configuration, making it easy for developers to configure
and deploy apps.
Table API
Azure Cosmos DB provides the Table API for applications that are written for Azure
Table storage.
Applications can migrate from Azure Table storage to Azure Cosmos DB by using the
Table API with no code changes.
Client SDKs are available for .NET, Java, Python, and Node.js.
Features
The Table API provides all the premium features of Azure Cosmos DB, such as global
distribution, consistency, and throughput.
The features specific to the Table API are Secondary indexing, custom indexes and
Query Optimisation.
Automatic and Complete Indexing on all properties without index management enables
fast querying over the Table Storage.
Where is it Used?
Watch the following video to learn about the scenarios where Table API can be used.
Modelling Table Data
Entities within the Table API are represented as a collection of key-value pairs.
An entity has a partition key and a row key, and its properties are denoted as
key/value pairs.
Partition Key: Similar to a shard key and represents how the data in a table is
partitioned or spread across shards.
Row Key: Unique identifier for an entity within a single partition. Two entities
can have the same row key as long as they are in different partitions.
A composite of both the row key and partition key is used to uniquely identify and
reference an entity.
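The composite-key rule above can be sketched with a dictionary keyed by the
(partition key, row key) pair. This is an in-memory illustration only; the `Table`
class and the sample entities are hypothetical and not part of any SDK.

```python
class Table:
    """Toy table where (PartitionKey, RowKey) uniquely identifies an entity."""

    def __init__(self):
        self._entities = {}

    def insert(self, entity: dict) -> None:
        key = (entity["PartitionKey"], entity["RowKey"])
        if key in self._entities:
            raise KeyError(f"entity {key} already exists")
        self._entities[key] = entity

    def get(self, partition_key: str, row_key: str) -> dict:
        return self._entities[(partition_key, row_key)]

    def partition(self, partition_key: str) -> list:
        """All entities stored under one partition key."""
        return [e for (pk, _), e in self._entities.items() if pk == partition_key]


table = Table()
# The same row key is allowed twice because the partitions differ.
table.insert({"PartitionKey": "Dollars", "RowKey": "001", "Amount": 10})
table.insert({"PartitionKey": "Euros", "RowKey": "001", "Amount": 12})
```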
CRUD Operations
Transactional functionality in Storage Tables is similar to that of the traditional
Create, Read, Update, Delete (CRUD) in other data sources.
This model enables dedicated client libraries or REST API endpoints with HTTP
methods to access and modify entities.
https://[account].table.core.windows.net/[table]
Querying
OData syntax along with HTTP GET method is used for querying Table Storage.
Typically in REST API services, the GET endpoint can only return either a specific
item or all items.
But, OData offers a query syntax allowing you to search for a single item or set of
items from a collection using GET requests.
Query parameters let you filter, paginate, or project the results of your
requests.
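As a rough sketch of how such query parameters are attached to a GET request, the
helper below builds a query URL with $filter, $select, and $top. The account and
table names are placeholders, the helper itself is hypothetical, and the `()`
suffix on the table name is an assumption about the entity-query URL form;
authentication headers are omitted.

```python
from urllib.parse import quote, urlencode


def build_query_url(account: str, table: str, filter_expr: str = None,
                    select: list = None, top: int = None) -> str:
    """Assemble an OData-style query URL for a table (illustrative only)."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr         # filter: which entities to return
    if select:
        params["$select"] = ",".join(select)    # project: which properties to return
    if top is not None:
        params["$top"] = str(top)               # paginate: how many entities per page
    base = f"https://{account}.table.core.windows.net/{table}()"
    if not params:
        return base
    # Keep $, commas, and quotes readable; encode spaces as %20.
    return base + "?" + urlencode(params, quote_via=quote, safe="$,'")


url = build_query_url("acct", "rates",
                      filter_expr="PartitionKey eq 'Dollars'",
                      select=["RowKey", "Amount"], top=5)
```

Leaving out all three parameters returns the plain table URL, which corresponds to
the "all items" case of a conventional GET endpoint.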
OData Querying
Watch the following video to learn more about OData Querying.
Azure Search
Azure Search is a managed search engine offered by Azure, similar to Lucene and
Solr, that allows you to index data from various data sources and then provides a
search engine over the indexed data.
Faceted search
Pagination
Geospatial search
Suggestions and Spell-check
Ranking
Hit Highlighting
Azure Search has its own built-in query syntax and is compatible with the Lucene
query syntax when searching documents. It supports custom linguistic analyzers and
more than 50 languages.
Watch the following video to learn about the integration of CosmosDB and Azure
Search.
Apache Spark
Apache Spark (open-source parallel processing framework) boosts the performance of
big-data analytic applications by supporting in-memory processing.
Apache Spark with CosmosDB helps in IoT scenarios requiring real-time response.
It can also be used to improve machine learning models in real time, leading to
better predictions.
Cosmos DB and Apache Spark can be connected, for faster predictions and real-time
data analysis, using the Apache Spark Connector.
Azure Functions
Azure Functions is a serverless compute service that lets you run code on demand
without explicitly provisioning or managing infrastructure.
Data Processing
Integrating systems
IoT
Building simple APIs
Building microservices
Hands-on scenario
Your company wants to migrate its NoSQL data from on-premises to Azure Cosmos DB.
However, before the migration, your manager wants you to test the Azure Cosmos DB
functionality using sample data and observe the performance, security, etc.
i) Create Cosmos DB: API: Core (SQL), Location: East US, Capacity mode: Provisioned
throughput, Apply Free Tier Discount: Apply, Account Type: Non-Production,
Availability Zones: Disable.
ii) Create a container with the partition key 'currency', and then create six new
items in the container (refer to the sample values provided in the following table).
iii) Perform query operations: sort the items in descending order based on their
timestamp, and filter the items based on the currency 'Dollars'.
Use the credentials given in the hands-on to log in to the Azure Portal. Create a
new resource group and use the same resource group for all the resources. Choose the
username, password, and service names yourself. Use the 'Edit' option to write a
query. After completing the hands-on, delete all the resources created in it.
So Far
Over the course, you have learned: