What is a NoSQL database?
The term NoSQL, short for “not only SQL,” refers to non-relational databases that store data in a
non-tabular format rather than in the fixed rows and columns of relational tables. NoSQL
databases use a flexible schema model and support a variety of data models, such as
documents, key-value pairs, wide columns, and graphs.
Features of NoSQL DB:
• Schema Flexibility: No predefined schema; allows dynamic and unstructured data storage.
• Scalability: Horizontal scaling (adding more servers) is easier than in traditional relational
databases.
• Data Models: Supports various models like key-value, document, column-family, and graph
databases.
• High Performance: Optimized for specific use cases like large-scale read and write
operations.
• Distributed Architecture: Data is often distributed across multiple servers, enhancing fault
tolerance and availability.
• Eventual Consistency: Many NoSQL systems favor availability over strict consistency during
network partitions (CAP theorem), so replicas may briefly return stale data.
• Replication and Sharding: Built-in mechanisms for data replication and partitioning.
• Flexible Transactions: Some NoSQL databases offer limited transactional guarantees (for
example, atomicity for a single record only) instead of full ACID compliance.
• Big Data Compatibility: Ideal for processing large volumes of data.
• Open Source Options: Many NoSQL databases are open source (e.g., MongoDB, Cassandra).
• High Availability: Automatic failover and backup mechanisms.
• API-Based Access: Commonly accessed through REST APIs or proprietary protocols.
Types:
1. Document-Based Database (Ex – MongoDB, CouchDB)
A document-based database is a non-relational database. Instead of storing data in rows and
columns (tables), it stores data as documents, typically in JSON, BSON, or XML format.
Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications, which means less translation is required to use the data in application code. Fields
inside a document can be indexed, so queries on those fields do not have to scan every document.
A collection is a group of documents that usually have similar contents.
Key features of documents database:
• Flexible schema: Documents in the same collection need not share the same schema.
• Faster creation and maintenance: Documents are easy to create, and they need little
maintenance once created.
• No foreign keys: The database does not enforce relationships between documents, so
documents can be independent of one another and no foreign key is required.
• Open formats: Documents are built using open formats such as XML and JSON.
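The flexible schema and field indexing described above can be sketched in plain Python. This is only an illustration, not how any particular document database is implemented: the collection is modeled as a list of dicts, and the names `students` and `build_index` are assumptions made up for the example.

```python
from collections import defaultdict

# A "collection" is modeled here as a plain list of dict documents.
# The documents deliberately have different fields (flexible schema).
students = [
    {"_id": 1, "name": "Asha", "city": "Pune", "skills": ["python", "sql"]},
    {"_id": 2, "name": "Ravi", "city": "Delhi"},
    {"_id": 3, "name": "Meera", "city": "Pune", "address": {"pin": "411001"}},
]

def build_index(collection, field):
    """Map each value of `field` to the _ids of the documents that contain it."""
    index = defaultdict(list)
    for doc in collection:
        if field in doc:  # documents without the field are simply skipped
            index[doc[field]].append(doc["_id"])
    return index

city_index = build_index(students, "city")
pune_ids = city_index["Pune"]  # lookup by indexed value, no scan of all documents
```

Once the index exists, finding all students in a city is a single dictionary lookup instead of a pass over every document.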
2. Key-Value Stores (Ex – Redis, Amazon DynamoDB)
A key-value store is the simplest form of NoSQL database. Every data element is stored as a
key-value pair, and a value is retrieved by the unique key assigned to it. Values can be simple
data types like strings and numbers, or complex objects. A key-value store resembles a relational
table with only two columns: the key and the value.
Key features of the key-value store:
• Simplicity: The key is the only access path, so the data model and API stay small.
• Scalability: Designed for horizontal scaling and distributed storage.
• Speed: Direct key lookups are fast, which makes these stores well suited to caching and
real-time applications.
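As a rough sketch of the key-value idea, the class below wraps a Python dict behind put/get/delete operations. The class name `KeyValueStore` and the sample keys are hypothetical. Real stores such as Redis add persistence, expiry, and networking, none of which is shown here.

```python
class KeyValueStore:
    """In-memory sketch: every value is reachable only through its key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)  # deleting a missing key is a no-op

store = KeyValueStore()
store.put("session:42", {"user": "asha", "cart": ["book", "pen"]})  # complex value
store.put("page_views", 1024)                                       # simple value
session = store.get("session:42")
store.delete("page_views")
```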
3. Column-Oriented Databases (Ex – Apache Cassandra, HBase)
A column-oriented database is a non-relational database that stores data by column instead of
by row. When we run analytics on a small number of columns, we can read those columns directly
without loading the unwanted data into memory. Columnar databases are designed for efficient
reads over large amounts of data. Note that Cassandra and HBase are more precisely wide-column
(column-family) stores: they group related columns into families and allow each row to hold a
different set of columns.
Key features of Column-Oriented Databases
• High Scalability: Supports distributed data processing.
• Compression: Columnar storage enables efficient data compression.
• Faster Query Performance: Best for analytical queries.
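The difference between row and column layout, and why columns compress well, can be illustrated with a small sketch. The sample data and the helper `run_length_encode` are assumptions for the example. Real columnar engines use on-disk formats and more elaborate encodings.

```python
from itertools import groupby

# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "north", "amount": 120},
    {"id": 2, "region": "south", "amount": 80},
    {"id": 3, "region": "north", "amount": 200},
]

# Column layout: one list per column, values kept in row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column reads only that column's list.
total_amount = sum(columns["amount"])

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Values of one column are similar, so sorted runs compress well.
encoded_region = run_length_encode(sorted(columns["region"]))
```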
4. Graph-Based Databases (Ex – Amazon Neptune, Neo4j)
Graph-based databases focus on the relationships between elements. They store data as nodes,
and the connections between nodes are called edges, links, or relationships. This makes them
ideal for complex relationship-based queries.
• Data is represented as nodes (objects) and edges (connections).
• Fast graph traversal algorithms help retrieve relationships quickly.
• Used in scenarios where relationships are as important as the data itself.
Key features of Graph Database
• Relationship-Centric Storage: Perfect for social networks, fraud detection, recommendation
engines.
• Real-Time Query Processing: Relationships are stored directly, so traversal queries avoid
expensive joins and typically return quickly.
• Schema Flexibility: Easily adapts to evolving relationship structures.
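A graph traversal of the kind mentioned above can be sketched with an adjacency list and a breadth-first search. The graph `follows` and the function `reachable_within` are made up for illustration. Graph databases expose this through their own query languages rather than hand-written loops.

```python
from collections import deque

# Nodes are people; each edge is a "follows" relationship (adjacency list).
follows = {
    "asha": ["ravi", "meera"],
    "ravi": ["kiran"],
    "meera": ["kiran"],
    "kiran": [],
}

def reachable_within(graph, start, max_hops):
    """Return nodes reachable from `start` in at most `max_hops` edges (BFS order)."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, hops + 1))
    return found

one_hop = reachable_within(follows, "asha", 1)
two_hops = reachable_within(follows, "asha", 2)
```

A "friends of friends" style recommendation is essentially the two-hop result minus the one-hop result.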
Aggregate data model
In NoSQL databases, aggregate data models are designed to store and retrieve related sets of data
that are often grouped together for efficient processing. Instead of using traditional relational
database schemas with tables and foreign keys, aggregate models in NoSQL databases focus on
encapsulating related entities into a single, self-contained unit, which is called an "aggregate."
This approach promotes data denormalization and typically results in faster read operations and
easier scaling, at the cost of more complexity in handling updates and potential data duplication.
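As a small illustration of an aggregate, the sketch below stores an order together with its line items and a copy of the customer's name. The field names and the `order_total` helper are assumptions for the example only.

```python
# One order aggregate: the customer snapshot and line items travel with the order.
order = {
    "order_id": "o-1001",
    "customer": {"id": "c-7", "name": "Asha"},  # duplicated from customer data
    "items": [
        {"sku": "pen", "qty": 3, "unit_price": 10},
        {"sku": "book", "qty": 1, "unit_price": 250},
    ],
}

def order_total(aggregate):
    """Compute the total from the aggregate alone, with no join."""
    return sum(item["qty"] * item["unit_price"] for item in aggregate["items"])

total = order_total(order)
```

The read needs only one unit of data, but if the customer is renamed, every order holding a copy of the name must be updated. This is the update cost mentioned above.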
Schema Less DB
In the context of NoSQL databases, a schema-less database refers to a type of database that does not
require or enforce a predefined schema for the data being stored. In other words, there is no strict
structure or format that the data must adhere to when being inserted into the database.
• Flexible Data Model:
o No fixed schema required before data insertion.
o Each record (or document) can have a different structure.
• Dynamic Fields:
o New fields can be added without altering existing data.
o Fields can vary across different records in the same collection.
• No Fixed Schema:
o Unlike traditional relational databases where the structure of data (tables, columns,
data types) must be defined beforehand, NoSQL schema-less databases allow data
to be stored without any predefined schema. The structure can be different for each
entry.
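The points above can be seen in a short sketch where two records in the same collection have different fields. The sample records are invented for illustration. The main consequence for application code is that readers must tolerate missing fields.

```python
records = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]},
]

# A new field is added to one record without touching the other.
records[0]["last_login"] = "2024-01-15"

# Readers must handle missing fields, since nothing enforces them.
emails = [record.get("email", "unknown") for record in records]

# The union of field names shows how structure varies across records.
all_fields = sorted({field for record in records for field in record})
```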
Types of Schema less Databases
1. Document Stores – Store semi-structured data as JSON or BSON (e.g., MongoDB).
2. Key-Value Stores – Map unique keys to values, from simple strings to complex objects (e.g., Redis).
3. Column-Family Stores – Store data in flexible column-based structures (e.g., Cassandra).
4. Graph Databases – Represent relationships using nodes and edges (e.g., Neo4j).
5. Time-Series Databases – Manage time-stamped data efficiently (e.g., InfluxDB).
6. Object-Oriented Databases – Store data as objects with attributes and methods (e.g., db4o).
Materialized View
A materialized view in NoSQL is a precomputed, stored query result that is updated periodically or
on demand. It is a snapshot of a query's result, stored separately from the original data, that
makes frequent access patterns faster. In simple terms, a materialized view is the stored
result of a computation (such as an aggregation or complex filtering) that makes querying data more
efficient.
Working
o Pre-computation: A materialized view stores the result of a query (like aggregation or filtering) so
that you do not have to compute it again each time.
o Faster Reads: The stored result helps reduce the computation time for frequent or expensive
queries.
o Data Refresh: The materialized view may be refreshed automatically or manually when the
underlying data changes. How and when it is updated depends on the NoSQL database being
used.
Characteristics
1. Optimized Read Operations: By storing the result of complex queries, it allows for faster read
operations, especially when dealing with large datasets.
2. Data Consistency: A materialized view is a snapshot of the data at a certain point in time, so it
can become stale when the underlying data changes. Some NoSQL databases refresh views
automatically, while others require a manual refresh.
3. Space Usage: Storing materialized views requires additional storage as they keep copies of data.
4. Use Case: Materialized views are used in NoSQL databases when you need to run complex queries
frequently, like filtering, aggregations, or joining data.
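The pre-compute, read, and refresh cycle can be sketched as follows. This is a simplified model with a full manual refresh. The names `orders` and `refresh_sales_by_region` are hypothetical, and real databases decide for themselves how and when a view is refreshed.

```python
from collections import defaultdict

orders = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 200},
]

def refresh_sales_by_region(base_rows):
    """Recompute the view from scratch (a full, manual refresh)."""
    view = defaultdict(int)
    for row in base_rows:
        view[row["region"]] += row["amount"]
    return dict(view)

sales_by_region = refresh_sales_by_region(orders)  # pre-computation
north_before = sales_by_region["north"]            # cheap read from the view

orders.append({"region": "north", "amount": 50})   # underlying data changes
stale_north = sales_by_region["north"]             # view still holds the old total
sales_by_region = refresh_sales_by_region(orders)  # refresh on demand
north_after = sales_by_region["north"]
```

The stale read between the append and the refresh is the consistency trade-off, and the extra dictionary is the additional storage.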
MongoDB
MongoDB is a widely used NoSQL database that stores data in a document-oriented format. It is a
non-relational database, meaning it doesn't follow the traditional table structure (like SQL databases)
and instead organizes data in documents that are stored in collections. MongoDB is designed to be
flexible, scalable, and high-performing, making it suitable for modern applications that need fast
read/write operations, large datasets, and evolving data structures.
Features
Document-Oriented Storage: In MongoDB, data is stored as documents in a format known as BSON
(Binary JSON), which is similar to JSON. Each document is a set of key-value pairs (fields and values),
and it can contain arrays, nested documents, or complex structures.
This document structure allows for more flexibility compared to the rigid row-column structure of SQL
databases.
Collections: MongoDB organizes documents into collections, which are roughly equivalent to tables in
a relational database. However, unlike tables in SQL databases, collections are schema-less by default,
meaning each document in the collection can have different fields and data types.
Schema-less: MongoDB is schema-less, meaning it does not require a predefined structure for
documents in a collection. This allows for rapid changes in the structure of your data without requiring
database migrations, making it highly flexible when working with evolving or unstructured data.
Scalability: One of MongoDB's core strengths is its ability to scale horizontally. It supports sharding,
which distributes data across multiple machines or servers to handle large-scale
applications. This makes MongoDB a good fit for applications that require high availability and
fault tolerance and that must handle large amounts of traffic and data.
Indexing: MongoDB supports indexing, which improves query performance. Indexes can be created
on any field, and it also supports advanced indexing types such as geospatial indexes and text
indexes.
Replication: MongoDB supports replication, which means it can copy data from one server to others.
This is used to ensure data availability and fault tolerance.
ACID Transactions: Operations on a single document are always atomic in MongoDB. Starting from
version 4.0, MongoDB also supports multi-document ACID transactions (Atomicity, Consistency,
Isolation, Durability) on replica sets, allowing multiple operations across multiple documents to be
grouped into a single transaction.
MongoDB Use Cases (in the context of NoSQL):
1. Real-Time Applications: MongoDB is great for applications that need real-time data access, such
as social media platforms, messaging systems, and analytics tools.
2. Content Management Systems: Its flexibility and scalability make MongoDB a popular choice for
content management systems that manage large volumes of varying data (such as blogs, articles, or
media files).
3. Mobile and Web Apps: MongoDB is widely used in web and mobile applications due to its ability to
handle large amounts of data with flexibility.
4. Big Data: MongoDB can handle big data workloads effectively and can scale across multiple
machines to accommodate large datasets.
MongoDB vs SQL (Relational Databases):
Data Structure: MongoDB stores data as documents (BSON format), while relational databases use
rows and columns in tables.
Schema: MongoDB is schema-less, whereas relational databases require a predefined schema with a
fixed structure.
Scaling: MongoDB supports horizontal scaling (sharding), while SQL databases typically scale
vertically (adding more power to a single machine).
Joins: MongoDB has no SQL-style JOIN clause, but its aggregation pipeline provides the $lookup
stage, which performs a left outer join between collections and can be less efficient than SQL joins.
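To make the join behaviour concrete, the sketch below emulates in plain Python what a left outer join between two collections produces: every document from the first collection is kept, and the matching documents from the second are attached as a list. The function `lookup` and the sample collections are hypothetical and are not MongoDB's API. With a real driver the join would be expressed as a $lookup stage in an aggregation pipeline.

```python
def lookup(left, right, local_field, foreign_field, as_field):
    """Attach to each left document the list of matching right documents."""
    joined = []
    for doc in left:
        # Naive nested scan; without an index this costs len(left) * len(right).
        matches = [r for r in right if r.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined

orders = [
    {"_id": 1, "customer_id": "c-7", "total": 280},
    {"_id": 2, "customer_id": "c-9", "total": 40},
]
customers = [{"_id": "c-7", "name": "Asha"}]

result = lookup(orders, customers, "customer_id", "_id", "customer")
```

The second order has no matching customer, so it is kept with an empty list. That is what makes the join "left outer".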
Some popular open-source tools for big data analysis
1. Apache Hadoop - A framework that allows for the distributed processing of large data sets across
clusters of computers. It includes Hadoop Distributed File System (HDFS) and the MapReduce
processing model.
2. Apache Spark - A fast and general-purpose cluster computing system, Spark is used for big data
processing and analytics. It supports real-time data streaming, machine learning, and graph
processing.
3. Apache Flink - A stream-processing framework that supports high-throughput, low-latency, and
fault-tolerant processing of data.
4. Apache Kafka - A distributed event streaming platform used for building real-time data pipelines
and streaming applications.
5. Apache Hive - A data warehouse system built on top of Hadoop, enabling users to query and
manage large datasets using a SQL-like language.
6. Apache HBase - A NoSQL database that runs on top of HDFS, designed to handle large amounts of
sparse data.
7. Elasticsearch - A search and analytics engine used to index and search large datasets in real time.
8. D3.js - A JavaScript library for visualizing data through interactive charts and graphs, often used for
big data analysis in web applications.
9. Jupyter Notebooks – A web-based tool for interactive computing that is commonly used for big
data analysis, particularly with Python libraries like Pandas, NumPy, and Matplotlib.
MapReduce in Hadoop
MapReduce in Hadoop is a programming model and processing technique used to process and
generate large datasets in a distributed computing environment. It splits the task into two phases:
1. Map Phase: In this phase, the input data is split into smaller chunks, which are processed by the
"map" function. The map function processes the data and outputs key-value pairs. For example, in a
word count task, the input text is split into words, and each word is assigned a key (the word itself)
with a value of 1.
2. Reduce Phase: The reduce phase takes the output from the map phase (key-value pairs) and
processes them. It groups the pairs by their key and performs an aggregation operation, such as
summing the values for each key. Continuing the word count example, the reduce function will sum
up the counts for each word and output the final result.
o Map: Processes input data, producing intermediate key-value pairs.
o Reduce: Aggregates and processes the intermediate data based on keys, producing the final
result.
MapReduce jobs typically read their input from and write their output to the Hadoop Distributed
File System (HDFS), which provides scalable storage and data redundancy for efficient processing.
How It Helps Process Large-Scale Data in NoSQL
• Parallel Processing: Enables processing massive datasets by dividing workloads across
multiple machines.
• Scalability: Can handle petabytes of data efficiently.
• Fault Tolerance: Automatically recovers from hardware failures.
• Schema Flexibility: Works well with NoSQL databases, which have a flexible schema.
• Optimized for Read-Heavy Workloads: Suitable for analytics, indexing, and aggregation
operations.
Example:
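The word count task described earlier can be sketched in a single Python process. This only mimics the map, group-by-key, and reduce steps. It does not use Hadoop, and the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

lines = ["apple banana apple", "banana cherry"]
intermediate = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

In a real cluster each input split would be mapped on a different machine, and the grouped keys would be spread across several reducers.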
MapReduce Partition and Combining
1. Partitioning (Shuffle & Sort)
o Distributes key-value pairs to reducers based on keys.
o Uses a function like hash(key) % num_of_reducers.
2. Combining (Local Aggregation)
o Optional step to reduce data size before sending to reducers.
o Acts like a mini reducer on the Mapper side.
o Example:
Before: (Apple, 1), (Apple, 1)
After: (Apple, 2)
Flow
Map → Combiner (Optional) → Partition → Reduce
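The flow above can be sketched in one process as follows. Two assumptions are made for the example: the number of reducers is fixed at 2, and `zlib.crc32` stands in for the hash function, because Python's built-in `hash` for strings varies between runs. The names `combine` and `partition` are illustrative, not Hadoop's API.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed value for the example

def combine(pairs):
    """Mini reducer on the mapper side: pre-sum counts per key."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers=NUM_REDUCERS):
    """Stable stand-in for hash(key) % num_of_reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Output of two mappers, before combining.
mapper_outputs = [
    [("Apple", 1), ("Apple", 1), ("Mango", 1)],
    [("Apple", 1), ("Mango", 1)],
]

# reducer id -> key -> list of values received for that key
reducer_inputs = defaultdict(lambda: defaultdict(list))
for pairs in mapper_outputs:
    for key, value in combine(pairs):                      # Combiner (optional)
        reducer_inputs[partition(key)][key].append(value)  # Partition

final_counts = {
    key: sum(values)                                       # Reduce
    for groups in reducer_inputs.values()
    for key, values in groups.items()
}
```

Because the partition depends only on the key, every pair for "Apple" lands on the same reducer, and the combiner means the first mapper sends (Apple, 2) instead of two separate pairs.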