Data Cloud Yury Lifshits  Yahoo! Research  http://yury.name
My Beliefs The key challenge in web search is structured search Part 1: What is structured search?  The key challenge in structured search is collecting data Part 2: Data distribution & idea of Data Cloud Part 3: Demo: numeric data distribution The key challenge in collecting data is incentive design Part 4: Economics of data distribution
Structured Search
Data Structured data Entity unit:  Identifier  Metadata:  Explicit key-value pairs Relational properties Evaluation Semi-structured data Content unit:  Body: text, video, audio, or image  Metadata:  Explicit key-value pairs Relational properties Evaluation Data = data of  entities  + data of  content
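The two unit types on this slide can be sketched as plain records. This is a minimal illustration, assuming Python dataclasses; the class and field names (`Metadata`, `key_values`, `relations`, `evaluation`, `media_type`) are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Metadata:
    """The three metadata kinds listed on the slide (names are assumed)."""
    key_values: Dict[str, Any] = field(default_factory=dict)        # explicit key-value pairs
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation, target identifier)
    evaluation: Dict[str, float] = field(default_factory=dict)      # e.g. votes or ratings


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body (text, video, audio, or image) plus metadata."""
    body: str
    media_type: str
    metadata: Metadata = field(default_factory=Metadata)


# Hypothetical example records.
concert = EntityUnit(
    identifier="concert:123",
    metadata=Metadata(
        key_values={"city": "SF", "price_usd": 15},
        relations=[("performed_by", "artist:42")],
        evaluation={"popularity": 0.9},
    ),
)
review = ContentUnit(body="Great show.", media_type="text")
```

Sharing one `Metadata` shape between both unit types mirrors the slide, where entities and content carry the same three kinds of metadata and differ only in identifier versus body.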
Structured Search Factoid search "what's the value of property X of object Y" Entity hubs Domain hubs Structured object search "all concerts this weekend in SF under $20 sorted by popularity" Time focus Ranking focus  Relations focus Structured content search "all videos with Tom Brady" "all comments and blog posts about Bing"
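As a rough illustration of the structured object search query on this slide, the sketch below filters records on explicit key-value metadata and then ranks them. The event records, field names, and the `structured_search` helper are all hypothetical; a real system would run this against an index rather than a Python list.

```python
from datetime import date

# Hypothetical event records; names, dates, and prices are made up.
events = [
    {"name": "Indie Night", "type": "concert", "city": "SF",
     "date": date(2009, 6, 13), "price_usd": 15, "popularity": 120},
    {"name": "Jazz Brunch", "type": "concert", "city": "SF",
     "date": date(2009, 6, 14), "price_usd": 10, "popularity": 300},
    {"name": "Arena Show", "type": "concert", "city": "SF",
     "date": date(2009, 6, 13), "price_usd": 85, "popularity": 900},
    {"name": "Poetry Slam", "type": "reading", "city": "SF",
     "date": date(2009, 6, 13), "price_usd": 5, "popularity": 50},
    {"name": "Club Set", "type": "concert", "city": "LA",
     "date": date(2009, 6, 13), "price_usd": 12, "popularity": 400},
]


def structured_search(records, *, type_, city, start, end, max_price):
    """Filter on type, city, date range, and price, then rank by popularity."""
    hits = [
        r for r in records
        if r["type"] == type_
        and r["city"] == city
        and start <= r["date"] <= end
        and r["price_usd"] < max_price
    ]
    return sorted(hits, key=lambda r: r["popularity"], reverse=True)


# "all concerts this weekend in SF under $20 sorted by popularity"
weekend = structured_search(
    events, type_="concert", city="SF",
    start=date(2009, 6, 13), end=date(2009, 6, 14), max_price=20,
)
names = [r["name"] for r in weekend]
```

The query touches the three focuses named on the slide: a time filter (the date range), a ranking (popularity), and a typed attribute filter, none of which a plain keyword match can express.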
Yury’s Wishlist Business-generated data Products, services, news, wishlists, contact data Reality stream, sensors What has happened, and where Expert knowledge Glossary, issues, typical solutions, object databases, related objects graph Events Sport, concerts, education, corporate, community, private Market graph & signals Like, interested, use, following, want to buy; votes and ratings
Search as a Platform App 4  Classic search  App 1  App 2  App 3  Structured Data Web index Post analysis Query analysis
Data Cloud How to collect all structured data in one place?
Data Producers People:  forums, wikis, mail groups, blogs, social networks  Enterprises:  product profiles, corporate news, professional content  Sensors:  GPS modules, web cameras, traffic sensors, RFID  Transactional data
Data Distributors Data distributor is any technical solution to accumulate, organize, and provide access to structured and semi-structured data Data publisher: the original distributor of some data Data retailer: a consumer-facing distributor of some data
Data Consumers Humans Email Aggregators: news, friend feeds, RSS readers Search Browsing / random walks Intelligence projects Recommendation systems Trend mining
Data Cloud Data Cloud is a centralized, fully functional data distribution service Success metric for data cloud strategy = the total “value” of data on the cloud
To-Cloud Solutions Extraction DBpedia.org, “web tables” Semantic markup, data APIs Yahoo! SearchMonkey Feeds Yahoo! Shopping Disqus.com, js-kit.com, Facebook Connect Direct publishing
On-Cloud Solutions  Ontology maintenance Freebase Normalization, de-duplication, antispam Named entity recognition,  metadata inference, ranking Data recycling (cross-references) Amazon Public Data Sets Viral license Hosted search  Yahoo! BOSS
From-Cloud Solutions Search, audience Y! SearchMonkey, Google Base Data API, dump access, update stream Custom notifications Gnip.com Data cloud as a primary backend Access control Ad distribution. (AT&T and Yahoo!  Local deal)
Demo: webNumbr.com Joint work with Paul Tarjan
webNumbr.com: Import Crawl numbers from the web URL + XPath + regex Create “numbr pages” Update their values every hour  Keep the history Anyone can create a numbr http://webnumbr.com/create
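The "URL + XPath + regex" recipe can be sketched as follows. This is not webNumbr's actual code: the `extract_number` helper and the sample markup are hypothetical, fetching the URL and the hourly scheduling are left out, and the standard-library `xml.etree.ElementTree` is used, which only handles well-formed markup and a limited XPath subset.

```python
import re
import xml.etree.ElementTree as ET


def extract_number(markup: str, xpath: str, pattern: str) -> float:
    """Locate a node with XPath, then pull the first regex match out of its text."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None:
        raise ValueError(f"no node matches {xpath!r}")
    text = "".join(node.itertext())
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"no number matches {pattern!r} in {text!r}")
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


# Stand-in for a fetched page (assumed to be well-formed XHTML).
page = (
    "<html><body>"
    "<div id='stats'><span class='count'>Followers: 1,234</span></div>"
    "</body></html>"
)

value = extract_number(page, ".//span[@class='count']", r"([\d,]+(?:\.\d+)?)")

# "Keep the history": append one (hour, value) sample per crawl.
history = []
history.append((0, value))
```

XPath narrows the page to one element and the regex isolates the digits inside it, so the extraction is less likely to break when unrelated parts of the page change.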
webNumbr.com: Export Embed code Graphs Search & browse  RSS
Economics of Data Distribution Joint work with Ravi Kumar and Andrew Tomkins
Network Effect in Two-Sided Markets Two-sided market = every product serves consumers of two types, A and B Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers, and vice versa Examples: operating systems, credit cards, e-commerce marketplaces Two-sided network effects: A theory of information product design G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne
Basic model Distributors D_1, …, D_k Producer/consumer joins only one distributor Initial shares (p_1, c_1), …, (p_k, c_k) New consumer selects a distributor with probability proportional to p_i New producer selects a distributor with probability proportional to c_i
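The basic model can be simulated in a few lines. The sketch below assumes that arrivals alternate, with one new consumer and then one new producer per step; the slide does not specify the arrival order, and the function name and the starting counts are hypothetical.

```python
import random


def simulate_basic_model(producers, consumers, steps, seed=0):
    """Grow producer and consumer counts per distributor by cross-side preference."""
    rng = random.Random(seed)
    p, c = list(producers), list(consumers)
    indices = range(len(p))
    for _ in range(steps):
        # A new consumer joins distributor i with probability proportional to p_i.
        c[rng.choices(indices, weights=p)[0]] += 1
        # A new producer joins distributor i with probability proportional to c_i.
        p[rng.choices(indices, weights=c)[0]] += 1
    return p, c


# Hypothetical initial counts for three distributors.
p_final, c_final = simulate_basic_model([5, 3, 2], [4, 4, 2], steps=1000)
producer_shares = [x / sum(p_final) for x in p_final]
```

Running the function with different seeds gives a feel for how much the final shares depend on early random arrivals.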
Basic model [figure: diagram with nodes labeled a1, a2, a3, a4]
Market Shares Dynamics Theorem 1 Market shares will stabilize Theorem 2 With a super-linear preference rule the market will tip to one of the distributors Theorem 3 With a sub-linear preference rule market shares will flatten
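The qualitative difference between the preference rules can be explored numerically. The sketch below is an expected-value (mean-field) iteration, not the stochastic model and not a proof. It assumes the preference rule has the form "probability proportional to share raised to the power alpha", which is one possible reading of super-linear (alpha > 1) and sub-linear (alpha < 1); the function name and starting counts are hypothetical.

```python
def mean_field_shares(producers, consumers, alpha, steps):
    """Add each newcomer's expected allocation instead of sampling it."""
    p = [float(x) for x in producers]
    c = [float(x) for x in consumers]
    for _ in range(steps):
        # Expected placement of one new consumer: proportional to p_i ** alpha.
        weights = [x ** alpha for x in p]
        total = sum(weights)
        c = [ci + w / total for ci, w in zip(c, weights)]
        # Expected placement of one new producer: proportional to c_i ** alpha.
        weights = [x ** alpha for x in c]
        total = sum(weights)
        p = [pi + w / total for pi, w in zip(p, weights)]
    return [x / sum(p) for x in p]


# Two distributors starting at a 60:40 split on both sides.
tipped = mean_field_shares([6, 4], [6, 4], alpha=2.0, steps=10_000)
linear = mean_field_shares([6, 4], [6, 4], alpha=1.0, steps=10_000)
flattened = mean_field_shares([6, 4], [6, 4], alpha=0.5, steps=10_000)
```

In this simplified iteration the leader's producer share grows toward 1 for alpha = 2, stays at 0.6 for alpha = 1, and drifts toward 0.5 for alpha = 0.5, in line with the tipping and flattening behaviors named in Theorems 2 and 3.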
External Factor Preference rule with external factor: e_i + c_i/(c_1 + … + c_k) Theorem 4 Market shares will stabilize on e_1 : e_2 : … : e_k
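A quick numerical check of the external-factor rule is sketched below, again as an expected-value iteration rather than the stochastic model. It assumes the rule applies symmetrically to both sides (consumers weight distributor i by e_i plus its producer share, producers by e_i plus its consumer share); the slide states only the c_i form, and the function name and inputs are hypothetical.

```python
def external_factor_shares(producers, consumers, external, steps):
    """Mean-field iteration with weight e_i + (share of the opposite side)."""
    p = [float(x) for x in producers]
    c = [float(x) for x in consumers]
    for _ in range(steps):
        # Expected placement of one new consumer.
        total_p = sum(p)
        weights = [e + pi / total_p for e, pi in zip(external, p)]
        norm = sum(weights)
        c = [ci + w / norm for ci, w in zip(c, weights)]
        # Expected placement of one new producer.
        total_c = sum(c)
        weights = [e + ci / total_c for e, ci in zip(external, c)]
        norm = sum(weights)
        p = [pi + w / norm for pi, w in zip(p, weights)]
    return [x / sum(p) for x in p]


# Distributor 0 starts with a 20% share but has external factor 3 versus 1.
shares = external_factor_shares([2, 8], [2, 8], external=[3.0, 1.0], steps=100_000)
```

Under these assumptions the shares move away from the initial 20:80 split and approach 3:1, the ratio of the external factors, as Theorem 4 predicts.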
Coalition Data Cloud
Coalitions Theorem 5 If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors Corollary Coalitions are not monotone Example: 5 : 4 : 1 : 1
Model Variations Same-side network effect Different p-to-c and c-to-p rules Multi-homing (overlapping audiences) n^2 vs. n log n revenue models  Mature market: newcomer rate = departing rate Diverse market (many types of producers and consumers) Arriving and departing distributors Directed coalitions
Challenges
Marketing Data demand?  Data offerings? Requirements for distribution technology?
Incentive design Incentives for data sharing? Centralized or distributed? For profit or non-profit? Data licensing and ownership? Monetizing data cloud?
More Challenges Prototyping: Data marketplace: open data & data demand Search plugins: related objects, glossaries, object timelines Publishing tools for structured data Data client: structured news, bookmarking, notifications Tech design: Access management Namespace design User interface: Structured search UI Discovery UI
Thanks! Follow my research: http://twitter.com/yurylifshits http://yury.name/blog
