DS_Module 4

Exploratory Data Analysis, and the Data Science Process
Introduction
• Recommendation engines, also called recommendation systems, are the typical data product and a good starting point when you’re explaining to non–data scientists what you do or what data science really is.

• Examples for recommendation systems:


1. Getting recommended movies on Netflix or YouTube

2. Getting recommended books on Flipkart or Amazon

• To build a solid recommendation system end-to-end requires an understanding of linear algebra and an ability to code.
A Real-World Recommendation Engine
• Recommendation engines are used all the time.
• Example—What movie would you like, knowing other movies you liked? What
book would you like, keeping in mind past purchases? What kind of vacation are you
likely to embark on, considering past trips?

• There are plenty of different ways to go about building such a model, but the resulting models have a very similar feel even if the implementations differ.
Example to set up a recommendation engine
• Scenario - suppose you have users, which form a set U; and you have items to
recommend, which form a set V.

• You can represent the above scenario as a bipartite graph (shown in the figure below) if each user and each item has a node to represent it—there is an edge from a user to an item if that user has expressed an opinion about that item.
Contd…
• Note that users might not always love an item, so the edges can carry weights: positive, negative, or on a continuous scale (or a discontinuous but many-valued one, like a star rating system).
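The sketch below is one minimal way to hold this structure in code: plain Python dictionaries standing in for the bipartite graph, with the user names, item names, and ratings invented purely for illustration.

```python
# Minimal sketch of the user-item setup above, assuming invented example data.
# U: set of users, V: set of items; an edge exists only where a user has
# expressed an opinion, and the edge weight is the opinion itself.
users = {"Ada", "Bob", "Cam"}
items = {"Item1", "Item2", "Item3"}

# Weighted edges of the bipartite graph: here on a -1..+1 scale, but they
# could just as well be 1-5 stars or binary like/dislike.
opinions = {
    ("Ada", "Item1"): 1.0,    # Ada loves Item1
    ("Ada", "Item3"): -0.5,   # Ada mildly dislikes Item3
    ("Bob", "Item1"): 0.8,
    ("Cam", "Item2"): -1.0,
}

def rating(user, item):
    """Return the edge weight if the user has an opinion, else None (unknown)."""
    return opinions.get((user, item))

print(rating("Ada", "Item1"))  # 1.0
print(rating("Bob", "Item2"))  # None -> exactly what a recommender must predict
```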

• Example: (figure of a weighted user–item bipartite graph)
Contd…
• Next, you have training data in the form of some preferences: you know some of the opinions of some of the users on some of the items.

• From those training data, you want to predict other preferences for
your users. That’s essentially the output for a recommendation engine.

• You may also have metadata on users (e.g., whether they are male or female) or on items (e.g., the color of the product).
• For example, users come to your website and set up accounts, so you may know each user’s
gender, age, and preferences for up to three items.
Contd…
• Next, you may represent a given user as a vector of features, sometimes including only metadata, sometimes including only preferences (which would lead to a sparse vector because you don’t know all the user’s opinions), and sometimes including both, depending on what you’re doing with the vector. Also, you can sometimes bundle all the user vectors together to get a big user matrix, which we
call U.
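As a concrete (and entirely invented) illustration of such a user matrix, the sketch below mixes metadata columns with preference columns and uses NaN for opinions we simply don't have:

```python
import numpy as np

# Hedged sketch of bundling user vectors into a user matrix U.
# Columns mix metadata (age, is_female) with preferences for three items;
# np.nan marks the opinions we don't know, which is why the matrix feels sparse.
columns = ["age", "is_female", "pref_item1", "pref_item2", "pref_item3"]

ada = [34, 1, 1.0, np.nan, -0.5]
bob = [27, 0, 0.8, np.nan, np.nan]
cam = [51, 0, np.nan, -1.0, np.nan]

U = np.array([ada, bob, cam])   # one row per user, one column per feature
print(U.shape)                  # (3, 5)
```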
Nearest Neighbor Algorithm
Review
• Let’s review the nearest neighbor algorithm
• Idea of Nearest Neighbor Algorithm is - if you want to predict whether user A likes something,
you look at a user B closest to user A who has an opinion, then you assume A’s opinion is the
same as B’s.

• To implement this you need a metric so you can measure distance.


• One example when the opinions are binary: Jaccard distance, i.e., 1–(the number of things
they both like divided by the number of things either of them likes).

• Other examples include cosine similarity or Euclidean distance.


• Which metric is best? There’s no single answer: experiment with different distance measures and compare how well each one performs.
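For the binary-opinion case, a minimal sketch of Jaccard distance plus a 1-nearest-neighbor prediction might look like the following (user names and likes are made up for illustration):

```python
# Minimal sketch of nearest-neighbor prediction with Jaccard distance,
# assuming binary "likes" stored as Python sets of item names (invented data).

def jaccard_distance(likes_a, likes_b):
    """1 - |intersection| / |union|: 0 means identical taste, 1 means nothing shared."""
    union = likes_a | likes_b
    if not union:
        return 1.0
    return 1.0 - len(likes_a & likes_b) / len(union)

likes = {
    "A": {"item1", "item2", "item5"},
    "B": {"item1", "item2", "item3"},
    "C": {"item4"},
}

def predict(target, item, likes):
    """Predict whether `target` likes `item` by copying the closest other user's opinion."""
    others = [u for u in likes if u != target]
    nearest = min(others, key=lambda u: jaccard_distance(likes[target], likes[u]))
    return item in likes[nearest]

print(predict("A", "item3", likes))  # True: B is closest to A, and B likes item3
```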
Some Problems with Nearest
Neighbors
• Curse of dimensionality
There are too many dimensions, so the closest neighbors are too
far away from each other to realistically be considered “close.”

• Overfitting
One guy is closest, but that could be pure noise. How do you adjust for that? One idea is to use k-NN with, say, k = 5 rather than k = 1, averaging over several neighbors so the noise is smoothed out. (For good results, choose the value of k carefully.)
Contd…
• Correlated features
Many of the features are, moreover, highly correlated (inter-linked) with each other.

For example, you might imagine that as you get older you become more conservative. But then counting both age and politics would mean you’re double counting a single feature in some sense.

This would lead to bad performance, because you’re using redundant information and essentially placing double the weight on some variables. It’s preferable to build in an understanding of the correlation and project onto a smaller-dimensional space.
Contd…
• Relative importance of features
Some features are more informative than others. Weighting features may therefore be helpful: maybe your age has nothing to do with your preference for item 1. You’d probably use something like covariances to choose your weights.

• Sparseness
If your vector (or matrix, if you put together the vectors) is too sparse (i.e., many entries in the vector or matrix are 0), or you have lots of missing data, then most things are unknown, and the Jaccard distance means nothing because there’s no overlap.
Contd…
• Measurement errors

There’s measurement error (also called reporting error): people may lie when providing the data.

• Computational complexity

There’s a calculation cost: computing distances to every other user is computationally expensive, and the cost grows with the number of users and items.

• Sensitivity of distance metrics

Euclidean distance also has a scaling problem: distances in age outweigh distances for other features if those are reported as 0 (for don’t like) or 1 (for like). Essentially this means that raw Euclidean distance doesn’t make much sense. Also, old and young people might think one thing but middle-aged people something else, yet we seem to be assuming a linear relationship that may not exist. Should you instead be binning by age group (creating buckets based on age), for example? (See the sketch below.)
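A hedged sketch of those two fixes (standardizing the features so age doesn't swamp the 0/1 flags, and binning age into buckets) might look like this; the feature values are invented:

```python
import numpy as np

# Two common fixes for the scaling problem described above, on invented data:
# (1) standardize each feature so age doesn't dominate 0/1 "like" flags,
# (2) bin age into buckets instead of assuming a linear age effect.

X = np.array([
    [23, 1, 0],   # columns: [age, likes_item1, likes_item2]
    [54, 0, 1],
    [37, 1, 1],
], dtype=float)

# (1) z-score standardization: every column ends up with mean 0 and std 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# (2) age buckets: young (<30), middle-aged (30-45), older (45+)
bins = [0, 30, 45, 120]
age_bucket = np.digitize(X[:, 0], bins)   # -> [1, 3, 2] for the rows above

print(X_scaled.round(2))
print(age_bucket)
```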


Contd…
• Preferences change over time
User preferences may also change over time, which falls outside the model. For
example, at eBay, they might be buying a printer, which makes them only want
ink for a short time.

• Cost to update
It’s also expensive to update the model as you add more data.

• The biggest issues are the first two on the list, namely overfitting and the curse of
dimensionality problem.
Beyond Nearest Neighbor: Machine Learning
Classification

• To deal with overfitting and the curse of dimensionality problem, we’ll build a separate
linear regression model for each item.

• With each model, we could then predict for a given user, knowing their attributes, whether
they would like the item corresponding to that model.

• So one model might be for predicting whether you like Mad Men and another model might
be for predicting whether you would like Bob Dylan.

• Denote by f_{i,j} user i’s stated preference for item j if you have it (or user i’s attribute, if item j is a metadata item like age or is_logged_in).
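A hedged sketch of the per-item regression idea, using scikit-learn and a small invented preference matrix (the column layout and values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One linear model per item: regress the known preferences for a target item
# on the remaining features f_{i,j}. Rows are users; the data is invented.
F = np.array([
    # age, is_logged_in, pref_item1, pref_item2
    [34, 1,  1.0, -0.5],
    [27, 1,  0.8,  0.2],
    [51, 0, -0.3,  0.9],
    [45, 0, -0.1,  0.7],
])

target_col = 3                              # build the model for "item2"
X = np.delete(F, target_col, axis=1)        # every feature except the target item
y = F[:, target_col]                        # known preferences for the target item

model = LinearRegression().fit(X, y)
print(model.coef_)                          # learned weights for the remaining features
print(model.intercept_)

new_user = np.array([[30, 1, 0.5]])         # same feature order as X
print(model.predict(new_user))              # predicted preference for item2
```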
Contd…

The good news: You know how to estimate the coefficients by linear algebra,
optimization, and statistical inference: specifically, linear regression.
The bad news: This model only works for one item, and to be complete, you’d
need to build as many models as you have items. Moreover, you’re not using
other items’ information at all to create the model for a given item, so you’re
not leveraging other pieces of information.
Contd…
• But wait, there’s more good news: This solves the “weighting of the features”
problem we discussed earlier, because linear regression coefficients are weights. (So
that you can know which are more important and which are less important)

• Crap, more bad news: overfitting is still a problem, and it comes in the form of
having huge coefficients when you don’t have enough data (i.e., not enough opinions
on given items).
Contd…
• To solve the overfitting problem, you impose a Bayesian prior that these weights shouldn’t be too far out of whack—this is done by adding a penalty term for large coefficients.

• That solution depends on a single parameter, which is traditionally called λ.

• But that begs the question: how do you choose λ? You could do it
experimentally: use some data as your training set, evaluate how well
you did using particular values of λ, and adjust.
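One way to act on that experiment is ridge regression, where scikit-learn calls the penalty parameter alpha rather than λ; the grid of candidate values and the random data below are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Ridge regression adds lambda * sum(beta_j ** 2) to the squared error,
# shrinking large coefficients. We pick lambda (called `alpha` here) by
# evaluating each candidate on held-out folds. Data is randomly generated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # 50 users, 5 features
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=50)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,                                          # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)                         # the lambda that generalized best
print(search.best_estimator_.coef_)                # the shrunken coefficients
```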
Contd…
• A final problem with this prior stuff: although the problem will have a unique
solution (as in the penalty will have a unique minimum) if you make λ large
enough, by that time you may not be solving the problem you care about.

• i.e., if you make λ absolutely huge, then the coefficients will all go to zero and
you’ll have no model at all.
The Dimensionality Problem
• So we’ve tackled the overfitting problem (previous slides), so now let’s think about
overdimensionality — i.e., the idea that you might have tens of thousands of items.

• We typically use both Singular Value Decomposition (SVD) and Principal


Component Analysis (PCA) to tackle this.
Contd…
• To understand how this works before we dive into the math, let’s think about how we reduce
dimensions and create “latent features” internally every day.

• For example, people invent concepts like “coolness,” but we can’t directly measure how cool
someone is. Other people exhibit different patterns of behavior, which we internally map or reduce to
our one dimension of “coolness.”

• So coolness is an example of a latent feature in that it’s unobserved and not measurable directly, and
we could think of it as reducing dimensions because perhaps it’s a combination of many “features”
we’ve observed about the person and implicitly weighted in our mind.

• Two things are happening here: the dimensionality is reduced to a single feature, and that feature is latent.
Contd…
• But in this algorithm, we don’t decide which latent factors to care about. Instead we let
the machines do the work of figuring out what the important latent features are.

• “Important” in this context means they explain the variance in the answers to the various
questions—in other words, they model the answers efficiently

• Our goal is to build a model that has a representation in a low dimensional subspace that
gathers “taste information” to generate recommendations.

• To know Linear algebra click the link: https://www.khanacademy.org/math/linear-algebra


Singular Value Decomposition
(SVD)
• Maths background:
Given an m×n matrix X of rank k, it is a theorem from linear algebra that we can always decompose it into the product of three matrices as follows:

X = U S Vᵀ

where U is m×k, S is k×k, and V is n×k (so Vᵀ is k×n); the columns of U and V are pairwise orthogonal, and S is diagonal. Note the standard statement of SVD is slightly more involved and has U and V both square unitary matrices, with the middle “diagonal” matrix rectangular. We’ll be using this form, because we’re going to be taking approximations to X of increasingly smaller rank.
Contd…
• Let’s apply the preceding matrix decomposition to our situation. X is our original dataset,
which has users’ ratings of items. We have m users, n items, and k would be the rank of X, and
consequently would also be an upper bound on the number d of latent variables we decide to
care about—note we choose d whereas m, n, and k are defined through our training dataset. So
just like in k-NN, where k is a tuning parameter (different k entirely—not trying to confuse
you!), in this case, d is the tuning parameter.

• Each row of U corresponds to a user, whereas V has a row for each item. The values along the
diagonal of the square matrix S are called the “singular values.” They measure the importance
of each latent variable—the most important latent variable has the biggest singular value.
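A minimal sketch of this decomposition with NumPy, on an invented 4×4 ratings matrix (users in rows, items in columns):

```python
import numpy as np

# Compact SVD of a toy ratings matrix X (rows = users, columns = items).
# full_matrices=False returns the form used above: U is m x k, the singular
# values come back as a vector s (the diagonal of S), and Vt is k x n.
X = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)                  # singular values, largest first: importance of each latent variable
print(U.shape, Vt.shape)  # (4, 4) and (4, 4) for this small square example

# Sanity check: X is recovered (up to floating-point error) from U, S, and Vt
print(np.allclose(X, U @ np.diag(s) @ Vt))   # True
```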
YouTube URLs for SVD
• https://youtu.be/EokL7E6o1AE

• https://youtu.be/P5mlg91as1c
Important Properties of SVD
• Because the columns of U and V are orthogonal to each other, you can order the columns by singular values via a base change operation. That way, if you put the columns in decreasing order of their corresponding singular values (which you do), then the dimensions are ordered by importance from highest to lowest. You can take a lower-rank approximation of X by throwing away part of S. In other words, replace S by a submatrix taken from the upper-left corner of S.

• Of course, if you cut off part of S you’d have to simultaneously cut off part of U and part of V, but this is OK because you’re cutting off the least important vectors. This is essentially how you choose the number of latent variables d: you no longer have the original matrix X, only an approximation of it, because d is typically much smaller than k, but it’s still pretty close to X.

• SVD can’t handle missing values.

• SVD is extremely computationally expensive.
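Continuing the toy example from the previous sketch, the truncation step (keep the d largest singular values plus the matching columns of U and rows of Vᵀ) looks roughly like this:

```python
import numpy as np

# Rank-d approximation of the same invented ratings matrix: keep only the
# upper-left d x d corner of S, the first d columns of U, and the first d
# rows of Vt. Note this plain SVD needs a fully observed X (no missing values).
X = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = 2                                            # number of latent variables to keep
X_d = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]      # best rank-d approximation of X

print(np.linalg.matrix_rank(X_d))                # 2
print(np.round(np.abs(X - X_d).max(), 3))        # worst-case entry-wise error
```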


Principal Component Analysis (PCA)
• Let’s look at another approach for predicting preferences. With this approach, you’re still looking for
U and V as before, but you don’t need S anymore, so you’re just searching for U and V such that:


Contd…
Contd…
• How do you choose d? It’s typically about 100, because it’s more than 20 (as we told
you, through the course of developing the product, we found that we had a pretty good
grasp on someone if we ask them 20 questions) and it’s as much as you care to add
before it’s computationally too much work.
YouTube URL for PCA
• https://www.youtube.com/watch?v=ZqXnPcyIAL8

• https://www.youtube.com/watch?v=yLdOS6xyM_Q

• https://youtu.be/FgakZw6K1QQ

• https://www.youtube.com/watch?v=0Jp4gsfOLMs
What is data visualization?
• Data visualization is the secret art of turning data into visual graphics that people can understand (graphs, charts, infographics, etc.).

• Data visualization allows the human eye to see trends and patterns that it otherwise can’t see
or make out.

• Data visualization ensures your information is processed faster, more easily understood and
remembered.

• In short, visual data is easier to remember than words.


• Video Link: https://drive.google.com/open?id=1-Hg1NpXdZFQrSxAvr9pdkJVUzd56ueb8
Data Visualization History
• Data collection methods used to force one to consider aggregate statistics that could reasonably be estimated from a subsample—means, for example.

• But now that one can actually get one’s hands on all the data and work with all of it, one no longer needs to focus only on the kinds of statistics that make sense in the aggregate; one can also compute individually designed statistics—say, coming from graph-like interactions—that are now possible thanks to finer control.
Online Data Visualization
Websites
• DataHero
• Plotly
• Number Picture
• Polychart
• Juice Analytics
• Weave
• Datavisual
• Zoomdata (via the cloud platforms)
• RAW
• Datawrapper
Software or Tools for data visualization
• Tableau
• SAP Lumira (including a free Personal Edition version)
• Microsoft Excel (or any other spreadsheet that includes charts)
• ClearStory
• Mathematica
• MATLAB
• MatPlotLib (if you are comfortable programming Python)
• R Programming Language
• ggplot2
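As a small illustration of the kind of chart these tools produce, here is a minimal matplotlib sketch (the labels and values are invented):

```python
import matplotlib.pyplot as plt

# A simple column chart: the years and revenue figures are invented,
# purely to show how little code a basic visualization takes.
years = ["2015", "2016", "2017", "2018"]
revenue = [12.4, 15.1, 19.8, 26.3]

plt.bar(years, revenue, color="steelblue")
plt.title("Annual revenue (illustrative data)")
plt.ylabel("Revenue ($M)")
plt.tight_layout()
plt.show()
```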
Benefits of Data Visualization
• Amplifies your message
• Gives meaning to your data
• Saves time
• Makes for better decision making
• Is more shareable and digestible
1. Amplifies your message
• Your message is amplified in a few different ways. First of all, by taking the time
to create data visualizations, you show your audience that you’ve done your
homework. That alone gives a sense of credibility to your content.

• Without visualizations, you run the risk of your audience not understanding what
you are trying to present. Your data might even be received as meaningless and
your entire message lost.
2. Gives meaning to your data

• Visualizations communicate valuable insights by creating visual representations of your data.

• For example, an Excel spreadsheet showing that Microsoft’s sales revenue almost doubled between 2011 and 2018 isn’t nearly as effective as graphing that data in a simple column chart with some formatting. (Next slide)
Contd…
3. Saves time
• Instead of spending the time trying to figure out what the facts and figures
mean, your audience members can ENGAGE with the meaning. A visual
representation allows you to analyze huge amounts of info in the blink of
an eye. As we know, the human eye can recognize and process visual
information much faster than text.
4. Makes for better decision making

• Assuming your data visualizations contain correct data and are done properly,
you’ll not only be able to make decisions faster, but they will be based on data
that you fully comprehend.
5. Is more shareable and digestible
• One of the best things about data visualizations is that they are accessible and easy to share across departments, with colleagues, your boss, or with a large audience. They can be inserted in your PowerPoint presentation, printed for seminar handouts, or even posted and shared on social media.
Contd…
• For example, below is a data visualization superimposing the Titanic over the
world’s new cruise ship (the Allure of the Seas) to demonstrate that the new ship
is almost 5 times bigger.
A Sample of Data Visualization
Projects

The figure is a projection onto a power plant’s steam cloud. The size of the green projection corresponds to the amount of energy the city is using.
Contd…
In One Tree (Figure) the artist
cloned trees and planted the
genetically identical seeds in
several areas. It displays,
among other things, the
environmental conditions in
each area where they are
planted.
Contd…

The figure shows Dusty Relief, in which the building collects pollution around it, displayed as dust.
Contd…
Project Reveal (Figure) is
a kind of magic mirror
that wirelessly connects
using facial recognition
technology and gives you
information about
yourself.
Contd…
• The SIDL is headed by Laura Kurgan, and in this piece, shown in the figure, she flipped Google’s crime statistics.

• She went into the prison population data and, for every incarcerated person, looked at their home address, measuring per home how much money the state was spending to keep the people who lived there in prison. She discovered that the state was spending $1,000,000 on some single blocks to keep their residents in prison.

• The moral of this project: just because you can put something on the map doesn’t mean you should, and it doesn’t mean there’s a new story. Sometimes you need to dig deeper and flip it over to get a new story.


Mark’s Data Visualization Projects
• New York Times Lobby: Moveable Type

Video Link: https://vimeo.com/113240712
Contd…
• It consists of 560 text displays—two walls with 280 displays on each— and they
cycle through various scenes that each have a theme and an underlying data
science model.

• In one there are waves upon waves of digital ticker-tape–like scenes that leave
behind clusters of text, and where each cluster represents a different story from
the paper. The text for a given story highlights phrases that make a given story
different from others in an information-theory sense.
Contd…
• In another scene, the numbers coming out of stories are highlighted, so you might
see “18 gorillas” on a given display. In a third scene, crossword puzzles play
themselves accompanied by sounds of pencils writing on paper.

• Figure (next slide) shows an example of a display box; the boxes are designed to convey a retro vibe. Each box has an embedded Linux processor running Python, and a sound card that makes various sounds—clicking, typing, waves—depending on what scene is playing.
Contd…

The data is collected via text from New York Times articles, blogs, and search engine activity. Every sentence is parsed using Stanford natural language processing techniques, which diagram sentences.

Altogether there are about 15 scenes so far, and it’s written in code so one can keep adding to it.
Project Cascade: Lives on a Screen
• Cascade came about from thinking about how people share New York Times
links on Twitter.

• The idea was to collect enough data so that you could see people browse, encode
the link in bitly, tweet that encoded link, see other people click on that tweet,
watch bitly decode the link, and then see those people browse the New York
Times. Figure (Next Slide) shows the visualization of that entire process, much
like Tarde suggested we should do.
Contd…
• There were of course data decisions to be made: a loose matching of tweets and clicks through time, for example. If 17 different tweets provided the same URL, they couldn’t know which tweet/link someone clicked on, so they guessed.

• They used the Twitter map of who follows who—if someone you follow tweets about something before you do, then it counts as a retweet.

• Video Link: https://www.youtube.com/watch?v=KTWWqUk7aYw
Cronkite Plaza
• It’s visible every evening at Cronkite Plaza, with
scenes projected onto the building via six different

projectors. The majority of the projected text is

sourced from Walter Cronkite’s news broadcasts,

but they also used local closed-captioned news

sources. One scene extracted the questions asked

during local news—things like “How did she

react?” or “What type of dog would you get?”

• Video Link:

• https://www.youtube.com/watch?v=NN7uKacnHjI
eBay Transactions and Books
• Mark investigated a day’s worth of eBay’s transactions that went through Paypal
and, for whatever reason, two years of book sales.

• Here’s their ingenious approach: They started with the text of Death of a Salesman
by Arthur Miller. They used a mechanical turk mechanism to locate objects in the
text that you can buy on eBay.
Contd…
• When an object is found, it is moved to a special bin, e.g., “chair” or “flute” or “table.” Once a few buyable objects have been collected, the code takes those objects, sees where they are all for sale in the day’s worth of transactions, and looks at details such as outliers. After examining the sales, the code will find a zip code in some quiet place like Montana.

• Then it flips over to the book sales data, looks at all the books bought or sold in that zip
code, picks a book (which is also on Project Gutenberg), and begins to read that book and
collect “buyable” objects from it. And it keeps going.

• Video Link:
• https://vimeo.com/50146828
Public Theater Shakespeare
Machine
• The piece is an oval structure with
37 bladed LED displays installed
above the theater’s bar. There’s
one blade for each of
Shakespeare’s plays. Longer plays
are in the long end of the oval—
you see Hamlet when you come
in.
Contd…
• The data input is the text of each play. Each scene does something different—for example, it might collect noun phrases that have something to do with the body from each play, so the “Hamlet” blade will only show a body phrase from Hamlet. In another scene, various kinds of combinations or linguistic constructs are mined, such as three-word phrases like “high and mighty” or “good and gracious,” or compound-word phrases like “devilish-holy,” “heart-sore,” or “hard-favoured.”

• Video Link:
• https://www.youtube.com/watch?v=2W8PZMV-LW8
