70 Days of Data Science

Day 1 (Date: 09/06/24) – Statistics


Statistics is the study of gathering, organizing, and analyzing any form of data. It is a
branch of mathematics that is applied to a lot of fields such as economics, business, sociology,
geology, and other sectors that may deal with trying to visualize, categorize, and essentially,
understand data. There are different methods for performing statistics, depending on the kind of data we are trying to gather. When we collect and summarize raw data directly, this is called descriptive statistics, in which we categorize the data so that it can be understood. On the other hand, inferential statistics comprises the methods for turning these data into conclusions or summaries in order to make predictions about a particular parameter.
In my own words, statistics is the field of making raw data understandable to ordinary people, and different goals call for different statistical methods. If I were to use statistics, I would use it to categorize and organize my performance over a certain time period and then adjust it based on the conclusions I drew from that data. In terms of data science, statistics is one of its foundational aspects and allows data scientists to interpret and analyze data. It serves as the basis for a large portion of the automation and decision-making found in contemporary data-driven industries.
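To make the distinction concrete, here is a minimal Python sketch in which the sample scores are invented for the example: the descriptive part summarizes the raw numbers, while a rough 95% confidence interval for the mean is a simple inferential step.
Code:
import math
import statistics

# Hypothetical daily performance scores (made-up sample data)
scores = [72, 85, 78, 90, 66, 81, 77, 88, 74, 83]

# Descriptive statistics: organize and summarize the raw data
mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)
print(f"mean={mean:.1f}, median={median}, stdev={stdev:.2f}")

# Inferential statistics: use the sample to say something about the wider
# population, here a rough 95% confidence interval for the true mean
margin = 1.96 * stdev / math.sqrt(len(scores))
print(f"95% CI for the mean: ({mean - margin:.1f}, {mean + margin:.1f})")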

Day 2 (Date: 09/07/24) – Algorithms


An algorithm is a step-by-step procedure for solving a problem or performing a computation. Rather than relying on shortcuts, an algorithm lays out a step-by-step formula so that a problem can be solved in a way that is easy to understand and effective. In simple terms, algorithms are plainly instructions or commands that are followed to solve problems in the most effective way possible. Algorithms are mostly used by programmers and developers to give computers instructions so that they can solve complex tasks in a short time.
At first glance, I thought that algorithms were a form of patterns and sequences in a
certain type of order that is mostly used on numbers. However, when I started studying
programming and learning about the foundational basics of computer science, I learned that
algorithms are more complex than that and could be used to perform tasks that seemed hard at
first but were a lot easier once understood. If I were to relate it to data science, I think that algorithms such as sorting algorithms can be used to simplify and shorten the time needed to organize data, as well as to find patterns and, essentially, teach a machine how to interpret and classify data.
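As a concrete example, the snippet below is a minimal Python sketch of a classic sorting algorithm (insertion sort); it is illustrative only and not drawn from any particular library.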
Code:
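def insertion_sort(values):
    """Sort a list in place by inserting each item into its sorted prefix."""
    for i in range(1, len(values)):
        current = values[i]
        j = i - 1
        # Shift larger items one slot to the right to make room
        while j >= 0 and values[j] > current:
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = current
    return values


print(insertion_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]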

Day 3 (Date: 09/08/24) – Big Data


Big Data is regarded as one of the cornerstones of data analytics and research into large
amounts of data that are structured, unstructured, or semi-structured and grow at an immense
rate day after day. Large companies make use of advanced database management tools to
control the large volume of data that becomes too complex for normal data management tools
to handle. They use this information and analyze it further to make a more comprehensive
database that they can utilize to make more accurate predictions or preemptive decisions. Big
Data can be categorized by three V's: Volume, Variety, and Velocity. These parameters are used to measure and describe the big data that we have on our hands.
If I were to explain Big Data in my own words, I believe it is another challenge that we encountered in the world of technology at some point. We had no proper methods to control the amount and the speed of the data we were receiving, and as a result, we were left with a lot of data that we did not understand. However, with the development of new tools and concepts such as distributed computing and machine learning, we were given the opportunity to learn from this new wealth of data and use it to our advantage, letting machines efficiently learn from data to produce predictions or structured analyses for research and other purposes.

Day 4 (Date: 09/09/24) – Linear Regression


One of the statistical methods used to define and visualize the relationship between two
variables is called Linear Regression. The premise of linear regression is that the dependent
and independent variables have a linear relationship, which can be depicted as a two-
dimensional straight line of the form y = mx + b. This statistical method is mostly used by analysts to plot large amounts of data and relate the dependent variable to the independent variable in order to arrive at a prediction. The slope (m) of the line indicates the rate of change of the dependent variable with respect to the independent variable. A positive slope means that as the independent variable increases, the dependent variable also increases. Conversely, a negative slope indicates an inverse relationship.
Linear regression is a basic statistical technique commonly applied in data science for its
capacity to uncover connections between variables and enable accurate forecasts. Plotting a
linear trendline among data points summarizes the relationship and allows data scientists to
predict patterns, estimate values, and create additional features. Beyond its predictive ability, linear regression sets the foundation for advanced machine learning algorithms, making it a crucial tool for extracting insights and supporting data-informed decision-making.
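As a small, hedged illustration, the sketch below fits a line y = mx + b to a few invented points with the ordinary least-squares formulas and then uses the fitted line to make a prediction.
Code:
# Minimal ordinary least-squares fit of y = m*x + b (invented data)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope m = covariance(x, y) / variance(x); intercept b = mean_y - m * mean_x
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - m * mean_x

print(f"fitted line: y = {m:.2f}x + {b:.2f}")
print("prediction at x = 6:", round(m * 6 + b, 2))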

Day 5 (Date: 09/10/24) – Data Modeling


Data modeling is the process of constructing an organized structure for data, similar to
drafting a blueprint. It involves structuring and presenting the data in a manner that facilitates
analysis and comprehension. By pinpointing important characteristics and establishing
connections between them, data scientists can create models that reveal concealed patterns
and trends. These models have the capability to predict outcomes, categorize data, or cluster
like data points. In a broader context, data modeling also involves the process of training a model to fit the data and the particular business goals.

The process of data modeling starts with data preparation, such as cleansing the dataset and
choosing the most important features. Later, data scientists select an appropriate model
depending on the issue they are addressing. The processed data is used to train this model in
order to identify patterns and provide predictions. After the training of the model is completed, it
is then evaluated to determine how well it performs on data that it has not been previously
exposed to. By continuously improving the model, data scientists can enhance it to extract
important insights that back data-informed decision-making and innovation.
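Below is a minimal sketch of this modeling cycle, assuming scikit-learn as the toolkit and a synthetic dataset in place of real, cleaned business data: prepare the features, choose a model, train it, and evaluate it on held-out data.
Code:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a prepared business dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Data preparation (scaling) and model choice bundled into one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)  # train the model on the prepared data

# Evaluate on data the model has not seen before
print("held-out accuracy:", model.score(X_test, y_test))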

Day 6 (Date: 09/11/24) – Neural Networks


A neural network is a machine learning model designed to recognize patterns and tackle
challenging problems by drawing inspiration from the structure of the human brain. It processes
data in a sequential manner and is composed of many interconnected layers of nodes, also
called neurons. While the final layer, called the output layer, produces predictions or
classifications, the first layer, called the input layer, processes the raw data. There are hidden
layers in between them where different mathematical transformations are applied to the data. To
find intricate patterns, these layers employ weighted connections; the weights are changed
during training to reduce the discrepancy between expected and actual results.
Training a neural network involves giving it a lot of labeled data and then tuning the
weights with algorithms such as backpropagation. Backpropagation propagates the prediction error backward through the layers so that the network can adjust its weights and reduce that error. Neural networks do exceptionally well on tasks such as image recognition, speech processing, and language analysis because this form of machine learning learns generalized patterns from data. For example, specific neural network types have been developed for particular
applications, such as recurrent neural networks for sequential data and convolutional neural
networks for image-related tasks. Besides that, big data growth and sophisticated computation
power have positioned neural networks as critical instruments in the data science domain,
consequently driving innovation in healthcare, finance, and autonomous systems.
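A minimal sketch of the layer structure described above, using NumPy with randomly initialized weights (the input values are made up): data flows from the input layer through a hidden layer to the output layer via weighted connections.
Code:
import numpy as np

rng = np.random.default_rng(0)

# A tiny network: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])       # one raw input example
hidden = sigmoid(x @ W1 + b1)        # hidden-layer transformation
output = sigmoid(hidden @ W2 + b2)   # output-layer prediction
print("prediction:", output)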

Day 7 (Date: 09/12/24) – Data Pre-Processing


Data preprocessing is a fundamental activity in the Data Science pipeline, whose aim is
to transform and prepare raw data into a clean, readable format. It generally comprises the
following activities: data cleaning, data integration, data transformation, and data reduction.
Data cleaning involves the treatment of missing values and inconsistencies and the removal of
noise or irrelevant information. Integration unifies data coming from various sources into a single coherent data set. Transformation may involve the normalization of numerical values, encoding of categorical variables, and scaling of features to get the data ready for analysis. This
step is very important since raw data typically contains errors, outliers, or inconsistencies that
may degrade the performance of machine learning models.
Feature selection and extraction simply mean the selection of the most important
attributes among the many after the cleaning and transformation are performed on the data.
This reduces dimensionality, making the dataset more manageable and interpretable; it also
improves the model's accuracy. Another important preprocessing task is feature engineering, whereby new variables are created from, or modifications are made to, existing variables to better represent meaningful patterns in the data. Spending enough time on thorough
preprocessing helps foster data quality, reduces biases, and improves predictive power within
models developed by a data scientist. Effective data preprocessing not only avoids the
possibility of misleading results but also makes the analysis robust and reliable.
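The sketch below shows a few of these steps on a tiny invented dataset, assuming pandas as the toolkit: filling a missing value, encoding a categorical column, and min-max scaling the numeric features.
Code:
import pandas as pd

# Tiny made-up raw dataset with a missing value and a categorical column
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "city": ["Manila", "Cebu", "Manila", "Davao"],
    "income": [30000, 42000, 58000, 36000],
})

# Data cleaning: fill the missing age with the median age
df["age"] = df["age"].fillna(df["age"].median())

# Encoding: turn the categorical 'city' column into indicator columns
df = pd.get_dummies(df, columns=["city"])

# Scaling: min-max scale the numeric features to the 0-1 range
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
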
Day 8 (Date: 09/13/24) – Exploratory Data Analysis (EDA)


Exploratory Data Analysis is one of the critical processes in data science, geared at
comprehending the underlying structure and characteristics of the dataset before applying
machine learning models. Its most critical goals include exploration of the pattern of data,
detection of anomalies, testing hypotheses, and assumption checking based on a combination
of various visualizations and summary statistics. Visualizing the data through various methods, including but not limited to histograms, scatter plots, and box plots, provides the data scientist with initial insight into the distribution of the data, variable relationships, and anomalies.
This helps identify fundamental trends and can indicate other areas that may be in focus for
further analysis.
Exploratory Data Analysis also involves summarizing the data with measures like mean
and median, standard deviation, and correlation coefficients. The Exploratory Data Analysis
might also be applied to perform necessary data transformation and feature engineering to
prepare the dataset better. Additionally, EDA can reveal data problems, which might include
missing values, data type errors, or highly skewed distributions that could impact model
performance. Such in-depth data exploration by the data scientist will help the scientist make
informed decisions on the selection of features, identify the right modeling techniques, and
enhance the quality of the analysis manifold. Generally, Exploratory Data Analysis provides a
foundation for constructing accurate and robust models, since it somewhat ensures an in-depth
understanding of the data at hand.
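A minimal EDA sketch, assuming pandas and an invented dataset: summary statistics, correlations, and a quick missing-value check of the kind described above.
Code:
import pandas as pd

# Made-up observations standing in for a real dataset
df = pd.DataFrame({
    "hours_studied": [1, 2, 2, 3, 4, 5, 5, 6, 7, 8],
    "score": [50, 55, 53, 60, 65, 70, 72, 78, 85, 90],
})

print(df.describe())    # mean, spread, quartiles for each variable
print(df.corr())        # relationships between the variables
print(df.isna().sum())  # data-quality check: any missing values?

# A plot would normally follow, e.g. with matplotlib:
# df.plot.scatter(x="hours_studied", y="score")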

Day 9 (Date: 09/14/24) – Data Privacy in Data Science


Data privacy, within data science, refers to the ethical and legal considerations around
data collection, storage, and use of personal information. Protecting sensitive information therefore calls for a careful, deliberate approach to privacy, especially now that data-driven technologies are becoming increasingly complex. In data science,
privacy can be maintained through adherence to strict laws on the protection of data, such as
the General Data Protection Regulation or the California Consumer Privacy Act. These
regulations institute some standards for processing personal data in order to grant individuals
control of their data. Some common techniques used when minimizing the risk of exposure of
PII, while maintaining data scientists' capability to analyze and derive insight from the data,
include data anonymization, pseudonymization, and encryption.
Privacy of data is paramount, as it is highly important in various fields like healthcare,
finance, or marketing, which often handle sensitive information. Data scientists must be very
open regarding the collection, usage, and sharing of data, with measures to prevent
unauthorized access. In addition to legal requirements, showing respect for privacy will confer
trust among users and stakeholders. Inadequate protection of personal information leads to reputational damage, financial penalties, and legal consequences. Data privacy practices can
also foster responsibility in the use of data, whereby the use of data for the intended purpose is
coupled with consideration for the rights and freedoms of subjects.
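As a small illustration of one technique mentioned above, the sketch below pseudonymizes an identifier column with a salted hash; the records and the salt are invented for the example, and a real system would manage the salt (or use stronger schemes) far more carefully.
Code:
import hashlib

# Toy records containing personally identifiable information (invented data)
records = [
    {"email": "ana@example.com", "purchase": 120},
    {"email": "ben@example.com", "purchase": 75},
]

SALT = "a-secret-salt-kept-separately"  # assumption: stored securely elsewhere

def pseudonymize(value):
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

safe_records = [
    {"user_id": pseudonymize(r["email"]), "purchase": r["purchase"]}
    for r in records
]
print(safe_records)  # analysis can proceed without exposing raw emails
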
Day 10 (Date: 15/09/24) – Hierarchical Clustering


Hierarchical clustering is an unsupervised machine learning method that goes a step beyond other clustering methods by producing groups of data points that are similar to one another. Hierarchical clustering generates a tree-like structure, known as a dendrogram, showing visually how the data points relate to one another. The process starts by considering each data point as an individual cluster and then progresses iteratively by combining the most similar pairs of clusters. The aim is to construct a hierarchy in which the clusters at various levels reflect various degrees of similarity. Hierarchical clustering can be either agglomerative or divisive. Agglomerative clustering starts by considering each data point as its own cluster and successively merges pairs of clusters based on a similarity measure such as Euclidean distance, while divisive clustering starts with one large cluster containing all points and iteratively splits it into smaller clusters.
An important advantage of the hierarchical technique is that it can generate a nested series of clusters, which reveals structure in the data at multiple levels. Because the process produces
a dendrogram, which is a hierarchical representation of clusters, it can be cut at any level to
define the number of clusters desired. This flexibility in hierarchical clustering allows it to
perform well on various data structures and makes the results interpretable. However, this is a
computationally expensive scheme, particularly in large datasets, since the algorithm needs
pairwise distances between all data points. Yet, hierarchical clustering represents a useful
exploratory technique when the total number of clusters is not defined a priori or if one seeks to
find relations present within the data over several scales.
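A minimal sketch of agglomerative clustering, assuming SciPy and a handful of invented 2-D points: build the linkage (dendrogram) structure, then cut it to obtain a chosen number of clusters.
Code:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two loose groups (made-up data)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

# Agglomerative clustering with Ward linkage builds the dendrogram structure
Z = linkage(X, method="ward")

# "Cut" the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib
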
Day 11 (Date: 16/09/24) – Natural Language Processing


Natural Language Processing (NLP) is a subfield of AI concerned with the interaction of
computers with human language. NLP combines linguistics and computer science in such a way
that machines can process text and speech in meaningful and useful ways. The activities that
go into NLP range from the seemingly simple ones, such as translation of languages and
sentiment analysis, to higher-order ones, such as named entity recognition, summarization of
texts, and answering questions. Techniques such as tokenization, part-of-speech tagging, and
syntactic parsing are employed by NLP in order to break down text into manageable units called
tokens and understand its structure and meaning.
NLP also covers machine learning, whereby, through big datasets of texts, algorithms
learn to recognize patterns and improve further with time. For example, a machine learning
model can be trained on text to predict the next word in a sentence or classify the sentiment of a
sentence as positive or negative. Deep learning has also enabled some fantastic performance
improvements for NLP in recent times, from language translation and text generation to many
more. The NLP landscape, empowered by models like BERT and GPT, has been transformed, with machines now generating text with nearly human-like fluency. Applications of NLP span a wide gamut, from virtual assistants like Siri and Alexa to chatbots, search engines, and content moderation systems, making it one of the most powerful tools for improving communication between humans and machines.
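As a toy illustration of tokenization and sentiment analysis, the sketch below uses plain Python and two tiny hand-made word lists; real NLP systems rely on much richer lexicons or learned models.
Code:
import re

# Tiny hand-made sentiment lexicons (assumptions for the example)
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(text):
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

example = "I love this phone, but the battery is bad."
print(tokenize(example))   # the word tokens
print(sentiment(example))  # one positive and one negative word -> neutral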

Day 12 (Date: 17/09/24) – Decision Tree Classifier


The Decision Tree Classifier is one of the most significant algorithms in supervised machine learning, used for classification to predict the category or label of an input sample based on a set of attributes. It represents all decisions in the form of a tree, with internal nodes representing decisions about features, branches representing the outcomes of those decisions, and leaf nodes corresponding to the eventual class labels. Beginning with the full dataset, it recursively splits the data on the features that offer the maximum information gain, using metrics such as Gini impurity or entropy. Splitting continues until the data are fully classified or a stopping criterion is reached, such as a defined maximum depth or a minimum number of samples per leaf.
Decision trees are popular because of their simplicity and interpretability. They can also handle categorical and continuous data with little preprocessing, as they can cope with missing values and outliers. However, decision trees do suffer from overfitting, especially when the tree is deep or overly complex. Various methods can be used to handle this problem, such as pruning, setting a maximum depth, or using ensemble methods such as Random Forests and Gradient Boosted Trees. Despite their shortcomings, decision trees remain one of the most useful classification tools because they provide a clear, rule-based decision process that is easy to visualize and interpret.
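A minimal sketch, assuming scikit-learn and its bundled iris dataset: a depth-limited decision tree is trained, scored on held-out data, and printed as human-readable rules.
Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth acts as a simple guard against overfitting (a form of pre-pruning)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the learned, human-readable decision rules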

Day 13 (Date: 18/09/24) – Machine Learning


Machine learning is a subfield of artificial intelligence built on algorithms and statistical models that enable computer systems to learn from data and make predictions or decisions based on patterns found in that data. Unlike traditional programming, where the computer is explicitly instructed what to do, machine learning systems discover patterns in the data supplied to them and make predictions without direct human intervention. These systems become more accurate over time as they process more data, learning from experience. Machine learning encompasses a variety of techniques: supervised learning, where a model is trained on labeled data, and unsupervised learning, where the model finds patterns in the data on its own because there are no predefined labels. Other approaches include reinforcement learning, where an agent learns to interact with its environment by receiving feedback.

Machine learning normally follows several critical steps. First, data are gathered and preprocessed so that they are clean and organized. Next, an appropriate machine learning model is chosen based on the kind of problem at hand, such as classification, regression, or clustering. The model is then trained on a training dataset so that it learns how the input variables relate to the target output. After training, the model is evaluated on test data to see how well it performs on unseen data, using performance metrics such as accuracy, precision, recall, and F1-score. Finally, as more data becomes available, the model should be fine-tuned and retrained. The
applications of machine learning range from stock price predictions to disease diagnosis, and
their uses continue to evolve with new algorithms and increased computational power.
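The sketch below walks through these steps on synthetic data, assuming scikit-learn: split the data, train a simple classifier, and report accuracy, precision, recall, and F1-score on the held-out test set.
Code:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Gather and prepare data (synthetic here)
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 2. Choose a model and 3. train it
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 4. Evaluate on unseen data with several performance metrics
pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("F1-score :", f1_score(y_test, pred))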

Day 14 (Date: 19/09/24) – Data Warehousing


Data warehousing is the process of collecting, storing, and managing large volumes of
structured data from a variety of sources in a central repository called a data warehouse. A key
function that a data warehouse should serve includes the capability to query and analyze data
efficiently for decision-making in organizations. Unlike operational databases, which are designed primarily for transactional processing, data warehouses are built to handle complex queries over large volumes of data. The data stored in data warehouses is usually historical, allowing users to analyze trends over time. Data in a warehouse are typically organized into subject areas such as sales, finance, and customer data.
The data warehousing process mainly comprises a few steps: data extraction, transformation, and loading, or ETL. First, data is extracted from different operational databases or external sources. It is then transformed into a standardized, quality-controlled form, which includes cleaning, filtering, and aggregation. Finally, the transformed data is
loaded into the data warehouse, where it is structured in such a way as to allow efficient
querying. Data warehouses are designed using specialized schemas, typically star or snowflake
schemas, that optimize performance. Advanced analytics, reporting, and creation of dashboards
can be done once the data are in the warehouse. Consolidating data into one repository assists
the organization in making choices based on data, improves operating efficiency, and brings
forth patterns in trends that may not be obvious in operational systems.
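A toy ETL sketch using Python's built-in sqlite3 module as a stand-in for a real warehouse, with invented sales records: extract the raw rows, transform them by dropping an incomplete record, load them into a table, and run an analytical query.
Code:
import sqlite3

# Extract: rows as pulled from operational sources (made-up sample data)
raw_rows = [
    ("2024-09-01", "north", 120.0),
    ("2024-09-01", "south", None),   # a bad record that needs cleaning
    ("2024-09-02", "north", 95.5),
    ("2024-09-02", "south", 130.25),
]

# Transform: drop incomplete records (cleaning/filtering step)
clean_rows = [row for row in raw_rows if row[2] is not None]

# Load: store the cleaned data in an (in-memory) warehouse table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

# The warehouse now supports analytical queries, e.g. totals per region
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)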

Day 15 (Date: 20/09/24) – Deep Learning


Deep learning is a class of machine learning where the training of artificial neural
networks, also called deep neural networks, involves a large number of layers to recognize
patterns in big datasets. Deep learning models-which borrow their fundamental structure from
the human brain-are created with interconnected layers of nodes that process data through
various stages of abstraction. These models can learn features themselves from raw data such
as images, text, or audio without resorting to explicit feature engineering. Deep learning has
been very successful in solving complex tasks like image recognition, natural language
processing, and speech recognition, which are hard to solve with traditional machine learning
methods.

The essence of deep learning is that it can handle huge amounts of data and perform
hierarchical learning where each layer in the network forms an abstraction of the input at a
higher level. Training a deep model often requires immense computation and/or big datasets to
attain the highest accuracy. These models are usually trained with techniques such as backpropagation, whereby errors are propagated backward through the network to adjust the weights so that the prediction error is minimized. While deep learning has produced large breakthroughs in artificial intelligence, it also faces challenges such as the need for large labeled datasets and the limited interpretability of its models, which makes them behave more like "black boxes." However, with the
improvement of computational resources and advances in algorithms, deep learning plays an
important role in driving innovation in various industries, from healthcare to autonomous
vehicles.
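A minimal sketch of training with backpropagation, using NumPy on the classic XOR problem; the architecture, learning rate, and iteration count are arbitrary choices for the example.
Code:
import numpy as np

# XOR: a small problem that a single linear model cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for step in range(5000):
    # Forward pass: data flows through the layers
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error and adjust the weights
    d_out = out - y                     # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)  # error signal at the hidden layer
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())  # should approach the XOR targets 0, 1, 1, 0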

Day 16 (Date: 21/09/24) – Supervised Learning


Supervised learning is a class of machine learning where models are trained using
labeled data, that is, data paired with its corresponding output label or target value. The model learns patterns or relationships in the data in order to map inputs to outputs. This type of
learning is called "supervised" because the algorithm is guided by these labeled examples much
as a teacher would supervise a learning process. Common applications of supervised learning
include classification tasks, where a model predicts discrete labels (e.g., spam or nonspam
email), and regression tasks, where a model predicts continuous values (e.g., stock prices,
house prices).
A supervised learning project basically involves a series of steps: collecting data, preparing the data, choosing a model, training, and evaluation. First comes gathering labeled data, after which relevant features are selected. A model is then chosen according to the type of problem, such as decision trees, linear regression, or neural networks. Next, the model is fitted to the labeled dataset by adjusting its internal parameters to make the difference between the predicted outputs and the real outputs as small as possible. Once training is complete, model performance is checked on an independent test set, that is, data the model has never seen. Measures such as accuracy, precision, recall, or mean squared error are calculated to express how well the model generalizes to new, unseen data. By refining the model through this feedback, supervised learning algorithms can make accurate predictions on previously unseen data, which makes them valuable in a wide range of real-world applications.
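A minimal supervised regression sketch, assuming scikit-learn and synthetic labeled data: fit a model on one portion of the data, then measure mean squared error on a test set the model has never seen.
Code:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Labeled data: each input row comes with a continuous target value
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)  # learn the mapping from inputs to targets

# Evaluate generalization on held-out, previously unseen data
pred = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))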

Day 17 (Date: 22/09/24) – K-Means Clustering


K-Means is a widely used unsupervised machine learning algorithm that segregates data
into well-defined clusters based on similarities within the data. This algorithm works by
partitioning a dataset into k predefined clusters, thereby creating groups of data points with
more similarities to one another than to points in other clusters. The process starts by randomly
selecting k initial centroids (the center of each cluster). Each of the various data points is then
allocated to the closest centroid according to some distance metric; this usually involves the use
of Euclidean distance. After all the points have been assigned to a cluster, the centroids are
recomputed as the average of all data points in each cluster. The process is repeated iteratively until the centroids no longer change significantly, or a predefined number of iterations is reached, at which point the algorithm has converged.
K-Means is designed to be straightforward yet efficient; however, it has its weaknesses.
The algorithm assumes spherical and comparatively equi-sized clusters. If the dataset is such
that there are irregular or nonlinear clusters, then this presents a problem. Another point is that
the number of clusters, k, needs to be predefined, which may itself be difficult to determine if an
appropriate number is not already known in advance. A traditional approach is the "Elbow Method", which plots the variance explained (or inertia) as a function of k, seeking
an "elbow" point beyond which increasing the number of clusters yields rapidly diminishing
returns. Despite such challenges, K-Means finds widespread application in customer
segmentation, image compression, and anomaly detection due to its ease of use and
computational efficiency on large datasets.
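A minimal sketch of the elbow method, assuming scikit-learn and invented blob data: fit K-Means for several values of k and watch where the inertia (within-cluster variance) stops dropping sharply.
Code:
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three loose blobs (made up for the example)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Elbow method: inertia (within-cluster sum of squares) for several k values
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The drop in inertia should flatten after k = 3, suggesting three clusters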
