Data Science
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
print(df.isnull()) # True for missing values
o Replace missing data with a specific value or computed statistics using .fillna().
4. Forward/Backward Fill:
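Forward and backward fill can be sketched with the small DataFrame above: each missing value is copied from its nearest valid neighbor in the chosen direction.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
print(df.ffill())  # propagate the last valid value forward
print(df.bfill())  # propagate the next valid value backward
```

Note that a trailing missing value (the last entry of B) stays missing under backward fill, since there is no later value to copy from.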
Data Transformation
Data transformation involves converting data into a suitable format for analysis.
1. Scaling and Normalization:
2. Type Conversion:
df['A'] = df['A'].fillna(0).astype(int) # fill missing values first; astype(int) raises on NaN
3. Renaming Columns:
df = df.rename(columns={'A': 'Alpha'})
4. Removing Duplicates:
df = df.drop_duplicates()
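The scaling step (item 1 above) can be sketched as min-max normalization, which rescales every column to the [0, 1] range; the sample values here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 5, 9]})
# Min-max normalization: (x - min) / (max - min), applied per column
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```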
String Manipulation
1. Lowercase Conversion:
df['Name'] = df['Name'].str.lower()
2. Removing Whitespace:
df['Name'] = df['Name'].str.strip()
3. Replacing Substrings:
df['Name'] = df['Name'].str.replace('old', 'new')
4. Splitting Strings:
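Item 4 can be illustrated with `.str.split()`; the Name column of full names is a made-up example. With `expand=True` the tokens become separate columns.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['ada lovelace', 'alan turing']})
parts = df['Name'].str.split(' ')                   # Series of token lists
df[['First', 'Last']] = df['Name'].str.split(' ', expand=True)
print(df)
```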
Data Wrangling
import pandas as pd
2. Hierarchical Indexing:
data = pd.DataFrame({'City': ['NY', 'SF'], 'Year': [2020, 2021], 'Value': [100, 200]})
indexed_data = data.set_index(['City', 'Year'])
print(indexed_data)
4. Melting Data:
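Melting can be sketched with `pd.melt()`, which turns wide columns into long key-value rows; the wide table below mirrors the City/Year data used above.

```python
import pandas as pd

wide = pd.DataFrame({'City': ['NY', 'SF'], '2020': [100, 150], '2021': [200, 250]})
# Each year column becomes rows of (City, Year, Value)
long = pd.melt(wide, id_vars='City', var_name='Year', value_name='Value')
print(long)
```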
These techniques are essential for cleaning and preparing messy datasets for analysis and modeling tasks!
Unit 5
Pandas is commonly used for preparing datasets before feeding them into machine learning or statistical
models. The integration between Pandas and modeling libraries allows seamless data manipulation and
model building.
1. Data Preparation:
o Example:
import pandas as pd
from sklearn.linear_model import LinearRegression
# Prepare data
df = pd.DataFrame({'X': [1, 2, 3], 'Y': [2, 4, 6]})
X = df[['X']]
Y = df['Y']
# Fit model
model = LinearRegression()
model.fit(X, Y)
print(model.coef_) # Output: [2.]
2. Model Integration:
o Pandas DataFrames can be directly used as inputs for libraries like Scikit-learn.
Patsy is a Python library that simplifies the creation of statistical models by using formulas similar to R.
1. Formula Syntax:
o Example:
import patsy
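The formula interface can be sketched as follows: `patsy.dmatrices` takes an R-style formula and returns the outcome and design matrices, with an intercept column added automatically. The data here reuses the small X/Y frame from the earlier example.

```python
import pandas as pd
import patsy

df = pd.DataFrame({'X': [1, 2, 3], 'Y': [2, 4, 6]})
# 'Y ~ X' means: model Y as a function of X (intercept included by default)
y, X = patsy.dmatrices('Y ~ X', data=df)
print(X)  # columns: Intercept, X
```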
2. Advantages:
Introduction to Statsmodels
1. Features:
o Provides tools for linear regression, time series analysis, and more.
import pandas as pd
import statsmodels.api as sm
# Prepare data
df = pd.DataFrame({'X': [1, 2, 3], 'Y': [2, 4, 6]})
X = sm.add_constant(df['X']) # Add intercept term
Y = df['Y']
# Fit model
model = sm.OLS(Y, X).fit()
print(model.summary())
3. Advantages:
1. Basic Plotting:
2. Customizations:
o Add labels, legends, and gridlines.
o Example:
1. Line Plot:
import matplotlib.pyplot as plt
df.plot(kind='line')
plt.show()
2. Bar Plot:
df.plot(kind='bar')
plt.show()
1. Scatter Plot:
df.plot(kind='scatter', x='X', y='Y')
plt.show()
2. Heatmap:
import seaborn as sns
sns.heatmap(df.corr(), annot=True)
plt.show()
These tools provide advanced features like interactivity and integration with web applications!
A data warehouse is a centralized system designed to store and manage structured data from multiple
sources. It integrates data for business intelligence, enabling fast queries, insightful reporting, and
decision-making. It is a core component of business intelligence systems, helping organizations analyze
historical and current data to generate actionable insights[27][28][29].
Key Characteristics:
1. Subject-Oriented: Focuses on specific topics like sales or inventory rather than overall business processes.
Architecture:
1. Bottom Tier: The data warehouse database server, where data is stored and managed.
2. Middle Tier: OLAP (Online Analytical Processing) servers for complex queries.
3. Top Tier: Front-end tools for querying, reporting, and visualization.
• KDD is the overall process of extracting useful knowledge from data, including steps like data
preparation, transformation, mining, and interpretation.
Data Mining:
• A subset of KDD that involves applying algorithms to extract patterns from data.
Techniques: KDD includes preprocessing and evaluation steps, while data mining applies statistical and machine learning methods.
1. Business Understanding:
2. Data Preparation:
3. Model Building:
4. Evaluation:
5. Deployment:
o Apply insights to operational systems (e.g., CRM)[29].
1. Association Rules:
o Discover relationships between items (e.g., "If a customer buys bread, they are likely to buy
butter").
2. Classification:
3. Clustering:
o Group similar data points without predefined labels (e.g., customer segmentation).
4. Regression Analysis:
5. Anomaly Detection:
6. Neural Networks:
Knowledge representation involves encoding mined patterns in formats that are easy to interpret:
1. Rules:
o Represent relationships as "IF...THEN" statements.
o Example: "IF age > 30 AND income > $50K THEN likely to buy luxury products."
2. Decision Trees:
4. Statistical Summaries:
What is DMQL?
The Data Mining Query Language (DMQL) was proposed by Han, Fu, and Wang for the DBMiner system.
It is designed to support ad hoc and interactive data mining tasks, enabling users to define data mining
tasks and specify task-relevant data. DMQL is based on SQL and can work with databases and data
warehouses, making it a powerful tool for knowledge discovery[32][33].
Why Integrate?
1. Centralized Data Access: Data warehouses store large volumes of historical and current data in a
unified format.
2. Efficient Query Execution: Data mining systems can leverage pre-aggregated data from
warehouses for faster processing.
Issues in Integration:
Data Preprocessing
Data preprocessing is essential to prepare raw data for analysis. It includes steps like cleaning,
transformation, feature selection, and dimensionality reduction.
1. Data Cleaning
• Techniques:
o Removing Duplicates:
df = df.drop_duplicates() # returns a new DataFrame; reassign to keep the result
2. Data Transformation
• Techniques:
3. Feature Selection
Selects relevant features (variables) that contribute most to the predictive model.
• Example: Use algorithms like Recursive Feature Elimination (RFE) or mutual information.
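A minimal RFE sketch with scikit-learn; the data is hypothetical, constructed so that Y depends only on X1, which RFE should therefore keep.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: Y = 2 * X1, while X2 is irrelevant
df = pd.DataFrame({'X1': [1, 2, 3, 4, 5], 'X2': [5, 3, 1, 4, 2], 'Y': [2, 4, 6, 8, 10]})
rfe = RFE(LinearRegression(), n_features_to_select=1).fit(df[['X1', 'X2']], df['Y'])
print(rfe.support_)  # True marks the selected feature
```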
4. Dimensionality Reduction
• Techniques:
1. Business Understanding:
Define objectives (e.g., fraud detection).
2. Data Preparation:
Clean, integrate, and transform datasets.
3. Model Building:
Apply algorithms like decision trees or clustering.
4. Evaluation:
Validate patterns using metrics like accuracy or precision.
5. Deployment:
Use insights in operational systems (e.g., CRM).
1. Association Rules:
Discover relationships between items (e.g., "If bread is bought, butter is likely bought").
2. Classification:
Assign labels to data based on predefined categories (e.g., spam detection).
3. Clustering:
Group similar data points without predefined labels (e.g., customer segmentation).
4. Regression Analysis:
Predict numerical outcomes based on input variables.
5. Anomaly Detection:
Identify rare or unusual patterns in the dataset.
1. Rules:
Represent relationships as "IF...THEN" statements.
Example: "IF age > 30 AND income > $50K THEN likely to buy luxury products."
2. Decision Trees:
Visualize decisions as a tree structure with branches representing choices.
3. Graphs and Networks:
Represent relationships between entities visually (e.g., fraud ring detection).
4. Statistical Summaries:
Provide numerical summaries like mean, variance, or correlation coefficients[32][33].
Unit 5
Concept description in data mining refers to summarizing and comparing data characteristics. It involves:
Example:
• Characterizing customer purchase behavior (e.g., "Most customers buy electronics during sales").
What is AOI?
Attribute-Oriented Induction (AOI) is a method for data generalization. It transforms detailed data into
higher-level concepts by replacing specific values with generalized ones using concept hierarchies.
1. Data Preprocessing:
2. Attribute Generalization:
o Replace low-level attribute values with higher-level concepts.
3. Aggregation:
4. Presentation:
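The generalization and aggregation steps above can be sketched in pandas with a hypothetical concept hierarchy that maps cities up to regions before aggregating.

```python
import pandas as pd

# Hypothetical concept hierarchy: city -> region
hierarchy = {'NY': 'East', 'Boston': 'East', 'SF': 'West', 'LA': 'West'}
df = pd.DataFrame({'City': ['NY', 'SF', 'LA', 'Boston'], 'Sales': [10, 20, 30, 40]})
df['Region'] = df['City'].map(hierarchy)        # attribute generalization
summary = df.groupby('Region')['Sales'].sum()   # aggregation at the higher level
print(summary)
```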
Basic Concepts
Frequent patterns are recurring relationships in datasets, such as items frequently purchased together.
Mining these patterns helps identify associations and correlations.
1. Apriori Method
The Apriori algorithm is used to find frequent itemsets in transactional data by leveraging the principle
that subsets of frequent itemsets must also be frequent.
Steps:
• If "Milk" and "Bread" are frequently bought together, they form a frequent itemset.
Association rules describe relationships between items in frequent itemsets, such as:
• Metrics: support (how often the itemset appears), confidence (how often the rule holds given the antecedent), and lift (how much more often the rule holds than expected by chance).
The pattern-growth approach avoids candidate generation by recursively dividing the dataset into smaller
subsets based on frequent items (using structures like FP-trees).
Steps:
Advantages:
Concept description provides insights into datasets through characterization and comparison, while AOI
simplifies data by generalizing attributes using hierarchies. Frequent pattern mining methods like Apriori
and pattern-growth help uncover associations and correlations efficiently, enabling businesses to make
informed decisions based on hidden patterns in their data[34][35][36].
Unit 4
Classification is a supervised learning technique in data mining that assigns data instances to predefined
categories (classes) based on their features. It involves building a model using labeled training data and
applying it to predict the class labels of new, unseen data.
4. Evaluation: Assess the model’s performance using metrics like accuracy, precision, recall, and F1-score.
Types of Classification:
1. Binary Classification: Exactly two classes (e.g., spam vs. not spam).
2. Multi-Class Classification: More than two classes (e.g., categorizing fruits as apple, banana, or orange).
1. Start with the full training dataset at the root node.
2. Select the best attribute for splitting the data based on an attribute selection measure.
3. Partition the data into subsets according to the chosen attribute's values.
4. Repeat recursively for each subset until all instances belong to the same class or stopping criteria are met.
Attribute selection measures help identify the best attribute to split the dataset at each step.
1. Information Gain:
2. Gini Index:
3. Chi-Square Test:
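Information gain (item 1) compares the entropy of the parent node with the weighted entropy of the children after a split; the class counts below are invented purely to show the arithmetic.

```python
import math

def entropy(counts):
    # Shannon entropy of a class distribution, in bits
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

parent = entropy([6, 4])              # 10 samples: 6 yes, 4 no
left, right = entropy([5, 0]), entropy([1, 4])  # a candidate split
gain = parent - (5 / 10 * left + 5 / 10 * right)
print(round(gain, 3))
```

A pure branch (5 yes, 0 no) contributes zero entropy, so this split recovers most of the parent's uncertainty; the attribute with the largest gain is chosen.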
Tree Pruning
Tree pruning removes unnecessary branches from a decision tree to reduce complexity and improve
generalization.
1. Pre-Pruning: Stop tree growth early based on criteria like minimum samples per node.
2. Post-Pruning: Remove branches after the tree is built by evaluating their impact on accuracy.
Bayesian classifiers use probability theory to predict class labels based on prior probabilities and
likelihoods of features given classes.
• Formula:
P(C|X) = P(X|C) · P(C) / P(X)
• Example:
Predict whether an email is spam based on word frequencies.
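The formula can be applied directly to the spam example; the email counts below are invented purely to illustrate the arithmetic.

```python
# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
n_emails = 100
n_spam = 40
free_in_spam = 30   # spam emails containing the word "free"
free_in_ham = 6     # non-spam emails containing it

p_spam = n_spam / n_emails
p_free_given_spam = free_in_spam / n_spam
p_free = (free_in_spam + free_in_ham) / n_emails
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.833
```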
Summary
Classification organizes data into predefined categories using methods like decision trees and Bayesian
classifiers. Decision tree induction involves selecting attributes using measures like information gain and
pruning unnecessary branches for simplicity. Bayesian methods rely on probabilistic reasoning, making
them effective for tasks with independent features.
Unit 5
• Antecedent: The "IF" part of an association rule. It represents the item or itemset found in the data.
Example: In the rule {Bread} → {Butter}, "Bread" is the antecedent.
• Consequent: The "THEN" part of the association rule. It represents the item or itemset that occurs
along with the antecedent. Example: In {Bread} → {Butter}, "Butter" is the consequent[37][38].
Multi-relational association rules extend traditional association rules by considering relationships across
multiple tables or datasets. They are useful in scenarios where data is stored in relational databases,
allowing analysis across different dimensions (e.g., customer demographics and purchase history).
ECLAT Algorithm
ECLAT (Equivalence Class Transformation) is a frequent itemset mining algorithm that uses a depth-first
search strategy:
1. Steps:
2. Advantages:
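The tidset idea behind ECLAT can be sketched as follows: the data is stored vertically (item → set of transaction ids), support is just the tidset size, and a depth-first pass intersects tidsets to grow itemsets. The transactions and threshold are made up for illustration.

```python
# ECLAT sketch: vertical layout plus depth-first tidset intersection
transactions = [{'Milk', 'Bread'}, {'Milk', 'Bread', 'Butter'},
                {'Bread'}, {'Milk', 'Bread'}]
min_count = 2

tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(frozenset([item]), set()).add(tid)

def eclat(prefixes):
    # Keep itemsets whose tidset is large enough, then extend them
    frequent = {s: tids for s, tids in prefixes.items() if len(tids) >= min_count}
    pairs = list(frequent.items())
    for i, (a, ta) in enumerate(pairs):
        ext = {a | b: ta & tb for b, tb in pairs[i + 1:] if len(a | b) == len(a) + 1}
        frequent.update(eclat(ext))
    return frequent

result = eclat(tidsets)
print(sorted(tuple(sorted(s)) for s in result))
```

Unlike Apriori, no candidate scan over the raw transactions is needed: the support of {Milk, Bread} falls out of intersecting the two tidsets.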
Market Basket Analysis identifies patterns in customer purchasing behavior, such as items frequently
bought together, to optimize sales strategies.
Example: Amazon
Amazon uses market basket analysis to recommend products under headings like "Frequently bought
together" or "Customers who bought this item also bought." This improves cross-selling and enhances
customer experience[39][40][38].
Applications:
1. Retail: Optimize store layouts and product placement (e.g., placing milk near bread).
Benefits:
Cluster Analysis
Cluster analysis groups similar data points into clusters based on their attributes. It is an unsupervised
learning technique used for segmentation, anomaly detection, and pattern recognition.
Partitioning Methods
Partitioning methods divide data into non-overlapping subsets (clusters). Examples include:
1. K-Means Clustering:
o Example:
2. K-Medoids:
o Similar to K-Means but uses medoids (actual data points) as cluster centers.
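A minimal K-Means sketch with scikit-learn; the 2-D points are made up to form two obvious groups.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [8, 9], [9, 8]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # the two centroids
```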
Hierarchical Methods
1. Agglomerative Clustering:
2. Divisive Clustering:
o Starts with all points in one cluster and splits them iteratively.
Example:
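A small agglomerative sketch on the same kind of made-up 2-D points: each point starts as its own cluster and the closest clusters are merged until two remain.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

data = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [8, 9], [9, 8]])
agg = AgglomerativeClustering(n_clusters=2).fit(data)
print(agg.labels_)  # two groups, built by bottom-up merging
```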
Density-Based Methods
3. Steps:
4. Example:
from sklearn.cluster import DBSCAN
import numpy as np
# Example points: a dense group near (1, 1) plus one outlier
data = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [1.1, 1.0], [1.0, 0.9], [8, 8]])
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(data)
print(dbscan.labels_) # -1 marks noise points
5. Advantages:
o Robust to noise.
Summary
Association rule mining uncovers relationships between items, such as antecedents and consequents,
enabling insights like cross-selling opportunities through algorithms like Apriori and ECLAT. Market
Basket Analysis exemplifies its application in retail and e-commerce for optimizing product placement
and recommendations.
Cluster analysis segments data into meaningful groups using methods like K-Means, hierarchical
clustering, and DBSCAN, each suited for specific types of datasets and clustering needs. These techniques
are essential for understanding patterns and improving decision-making across industries!
1. https://www.datacamp.com/blog/what-is-data-analysis-expert-guide
2. https://www.sganalytics.com/blog/what-is-the-meaning-of-data-analysis/
3. https://www.upwork.com/resources/data-analysis-vs-data-analytics
4. https://www.questionpro.com/blog/data-analytics-vs-data-analysis/
5. https://www.w3schools.com/python/python_intro.asp
6. https://www.tutorialspoint.com/python/python_features.htm
7. https://www.sisense.com/glossary/python-for-data-analysis/
8. https://www.linkedin.com/pulse/what-makes-python-brilliant-choice-data-analysis-pratibha-kumari-jha
9. https://en.wikipedia.org/wiki/Data_warehouse
10. https://www.sap.com/mena/products/data-cloud/datasphere/what-is-a-data-warehouse.html
11. https://cdn.aaai.org/KDD/1996/KDD96-014.pdf
12. https://en.wikipedia.org/wiki/Data_mining
13. https://www.ibm.com/think/topics/data-mining
14. https://www.studocu.com/in/messages/question/4583274/discuss-data-mining-task-primitives
15. https://www.investopedia.com/terms/d/datamining.asp
16. https://www.aimasterclass.com/glossary/knowledge-representation
17. http://dataminingzone.weebly.com/uploads/6/5/9/4/6594749/ch_9data_mining_query_language.pdf
18. https://www.tutorialspoint.com/data_mining/dm_query_language.htm
19. https://data-flair.training/blogs/data-mining-query-language/
20. https://liris.cnrs.fr/Documents/Liris-1661.pdf
21. https://marketsplash.com/what-is-ipython/
22. https://domino.ai/data-science-dictionary/jupyter-notebook
23. https://jupyter.org
24. https://www.freecodecamp.org/news/the-python-guide-for-beginners/
25. https://en.wikipedia.org/wiki/Python_syntax_and_semantics
26. https://bootcamp.cvn.columbia.edu/blog/python-basics-guide/
27. https://www.techtarget.com/searchdatamanagement/definition/data-warehouse
28. https://atlan.com/data-warehouse-101/
29. https://en.wikipedia.org/wiki/Data_warehouse
30. https://www.simplilearn.com/data-warehouse-article
31. https://www.sap.com/india/products/technology-platform/datasphere/what-is-a-data-warehouse.html
32. https://www.tutorialspoint.com/data_mining/dm_query_language.htm
33. https://data-flair.training/blogs/data-mining-query-language/
34. https://www.investopedia.com/terms/d/datamining.asp
35. https://www.techtarget.com/searchbusinessanalytics/definition/data-mining
36. https://en.wikipedia.org/wiki/Data_mining
37. https://www.geeksforgeeks.org/market-basket-analysis-in-data-mining/
38. https://www.linkedin.com/pulse/what-market-basket-analysis-overview-uses-types-chatterjee-
39. https://www.simplilearn.com/what-is-market-basket-analysis-article
40. https://www.techtarget.com/searchcustomerexperience/definition/market-basket-analysis
41. https://www.alteryx.com/resources/use-case/market-basket-analysis