Pierre R.’s Post

Web Scraping at Scale with Claude Desktop + MCP Servers

From manual data collection to automated intelligence—transform any website into structured, analyzable data. In this tutorial, see how Samir builds an MCP server that gives Claude Desktop web scraping superpowers: crawling entire sites, extracting specific data, and mapping content architectures that would typically require expensive enterprise tools.

A user asks Claude to analyze a website. Claude crawls up to 150 pages, following links and respecting depth limits. Results return as structured data with SEO insights, orphaned pages, and linking patterns. This transforms what used to be hours of manual auditing into instant, conversational analysis.

🔗 Watch the tutorial & get the code: https://lnkd.in/ejuUPXUW

I'm Pierre Rizzitas, founder and CEO of PR Consulting (London). We support retail and supply chain businesses in aligning operations, sustainability, and profitability through simulation and analytics.

#webscraping #mcp #claudedesktop #python #dataextraction #supplychain #automation #aiagents
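For a feel of the crawling piece, here is a minimal breadth-first sketch in Python. It is an illustration, not Samir's code: `crawl_site`, its parameters, and the record shape are all assumptions, and it leans on the `requests` and BeautifulSoup libraries.

```python
# Hypothetical sketch of the kind of crawl tool such an MCP server might
# expose: breadth-first, same-domain, with a page cap and depth limit.
# All names here are illustrative, not from the tutorial.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_site(start_url: str, max_pages: int = 150, max_depth: int = 3) -> list[dict]:
    """Crawl same-domain pages and return structured per-page records."""
    domain = urlparse(start_url).netloc
    seen, results = {start_url}, []
    queue = deque([(start_url, 0)])
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages, keep crawling
        soup = BeautifulSoup(html, "html.parser")
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        results.append({
            "url": url,
            "title": soup.title.string if soup.title else "",
            "links": links,  # link graph enables orphan-page / linking analysis
        })
        if depth < max_depth:
            for link in links:
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results
```

Capping by pages collected rather than pages visited mirrors the 150-page limit described above; the returned link graph is what makes orphaned-page and internal-linking analysis possible downstream.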
More Relevant Posts
𝗙𝗿𝗼𝗺 𝗰𝗼𝗻𝗳𝘂𝘀𝗶𝗼𝗻 𝘁𝗼 𝗰𝗹𝗮𝗿𝗶𝘁𝘆: 𝗺𝘆 𝗹𝗶𝘁𝘁𝗹𝗲 𝗥𝗔𝗚 𝗱𝗶𝘀𝗰𝗼𝘃𝗲𝗿𝘆

When I first started working with RAG (Retrieval-Augmented Generation), one question always bugged me 👇

𝗪𝗵𝗮𝘁 𝗿𝗲𝗮𝗹𝗹𝘆 𝗵𝗮𝗽𝗽𝗲𝗻𝘀 𝘁𝗼 𝘁𝗵𝗲 𝗶𝗻𝗶𝘁𝗶𝗮𝗹 𝘂𝘀𝗲𝗿 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻 𝗼𝗻𝗰𝗲 𝗶𝘁’𝘀 𝗽𝗮𝘀𝘀𝗲𝗱 𝘁𝗼 𝘁𝗵𝗲 𝗟𝗟𝗠?

I always assumed it was just used to retrieve data from the vector database, but it turns out that's only half the story. The question actually serves two different purposes:

1️⃣ It's transformed into embeddings to help the vector DB retrieve relevant context
2️⃣ It also needs to appear again in the prompt, so the LLM understands the intent behind the retrieved data

Without it, the retriever works perfectly fine, 𝙗𝙪𝙩 𝙩𝙝𝙚 𝙇𝙇𝙈 𝙚𝙣𝙙𝙨 𝙪𝙥 𝙘𝙡𝙪𝙚𝙡𝙚𝙨𝙨, 𝙣𝙤𝙩 𝙠𝙣𝙤𝙬𝙞𝙣𝙜 𝙬𝙝𝙖𝙩 𝙩𝙤 𝙙𝙤 𝙬𝙞𝙩𝙝 𝙩𝙝𝙚 𝙘𝙤𝙣𝙩𝙚𝙭𝙩 𝙞𝙩 𝙟𝙪𝙨𝙩 𝙜𝙤𝙩.

This realization really changed how I look at prompt design and RAG pipelines. I wrote a blog post where I break down this subtle but important detail (with examples, code, and a fun story from my testing phase). If you've ever wondered why retrieval ≠ understanding, this might make it clearer 👇

🔗 RAG Facts: Retrieval ≠ Understanding – Why LLMs Need the Question Too (https://lnkd.in/gYjZcc_F)

#RAG #LLMs #ArtificialIntelligence #PromptEngineering #MachineLearning #VectorDB #AIExplained #TechLearning #OpenAI #DataRetrieval
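A minimal sketch of that double role, with `embed`, `vector_db`, and `llm` as hypothetical stand-ins for whatever stack you use:

```python
# Minimal sketch of the point above: the user's question is used twice --
# once to embed/retrieve, and again verbatim inside the final prompt.
# embed, vector_db, and llm are hypothetical stand-ins, passed in here.

def answer(question, embed, vector_db, llm, top_k=5):
    # 1) The question drives retrieval: it is embedded and matched
    #    against the vector store.
    docs = vector_db.search(embed(question), top_k=top_k)
    context = "\n\n".join(doc.text for doc in docs)
    # 2) ...and it must reappear verbatim in the prompt; otherwise the
    #    LLM receives context with no idea what to do with it.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.generate(prompt)
```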
Web scraping deserves a spec as much as any product.

Too many pipelines still live only in the head of their author—fragile, undocumented, and hard to align with business goals. A simple PRD changes that. By treating your data pipeline like a product, you:

• Anchor datasets to real business needs
• Assign clear ownership and accountability
• Build with guardrails for breakage and maintenance
• Document sources, schema, cadence, and formats

The result? Data that solves the intended problem, delivered on time and on budget.

MIT CISR research shows that organizations that adopt a product mindset for data consistently generate more value from their investments. In scraping, the same applies: clarity upfront avoids wasted effort later.

👉 Read the full story: https://lnkd.in/dYD6T7mr

#WebScraping #DataEngineering #PRD
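As a sketch, the PRD fields above could even live next to the pipeline as structured metadata; every field name and value here is illustrative, not a prescribed standard:

```python
# Hypothetical sketch of a scraping PRD captured as machine-readable
# metadata; all fields and values are illustrative examples.
scraping_prd = {
    "business_need": "Weekly competitor price monitoring for category managers",
    "owner": "data-eng@example.com",           # clear ownership and accountability
    "sources": ["https://example.com/catalog"],
    "schema": {"sku": "str", "price": "float", "scraped_at": "datetime"},
    "cadence": "weekly",
    "output_format": "parquet",
    # guardrails for breakage and maintenance:
    "breakage_policy": "alert on >5% null prices; diff page-structure checksum",
}
```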
Your RAG works in demos but fails in production. Here's the one capability you're missing.

The problem isn't your embeddings or your vector database. It's treating RAG like a pipeline instead of a reasoning system. Here's what actually works.

Traditional RAG (What Everyone Builds First):
1. Split documents into chunks
2. Create embeddings
3. Store in a vector database
4. User asks a question → retrieve top 5 results
5. Send to the LLM

Simple. Clean. Breaks on real questions.

Why It Fails:
• Single retrieval pass: "Compare Q3 to Q4 revenue" → the system gets Q3 OR Q4, not both → the LLM guesses the rest.
• No way to refine: first search misses? Done. It can't ask follow-up searches or course-correct.

Agentic RAG (What Actually Works):
Give your LLM search tools. Let it decide the strategy.

Tools:
• vector_search (semantic)
• keyword_search (exact match)
• metadata_filter (date, category, source)
• rerank (relevance scoring)

Example Flow:
User: "Compare Q3 to Q4 revenue"
Search 1: vector_search("Q3 2024 revenue")
Agent: "Got Q3, need Q4"
Search 2: vector_search("Q4 2024 revenue")
Agent: "Have both, ready to compare"
The agent decides when to stop searching.

The Metadata Trick:
User: "Latest engineering docs"
The agent applies filters first (department = "engineering", date > last_30_days), then searches 500 docs instead of 100K.

Results:
• Traditional: 1 search, 65% accuracy, hallucinations
• Agentic: 3–5 searches, 89% accuracy, cited sources

The Insight: RAG needs multiple retrieval passes with adaptation. Pipelines can't do this. Agents can.

Building RAG? What's breaking for you?
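A rough shape of that agent loop in Python. The `llm.choose_action` / `llm.generate_answer` interfaces and the tool callables are assumptions standing in for whatever framework you use:

```python
# Hypothetical sketch of an agentic retrieval loop: the model chooses a
# tool per step and decides when it has enough evidence to answer.
# llm and the tool callables are assumed stand-ins, not a real API.

def agentic_rag(question, llm, tools, max_steps=5):
    """tools: name -> callable, e.g. {"vector_search": ..., "keyword_search": ...,
    "metadata_filter": ..., "rerank": ...}."""
    evidence = []
    for _ in range(max_steps):
        # The LLM sees the question plus evidence gathered so far and
        # either picks another tool call or signals it is done.
        action = llm.choose_action(question, evidence, list(tools))
        if action.name == "finish":
            break  # the agent decides when to stop searching
        # e.g. tools["vector_search"](query="Q3 2024 revenue")
        evidence.extend(tools[action.name](**action.args))
    return llm.generate_answer(question, evidence)
```

The loop is the whole point: a second or third retrieval pass ("got Q3, need Q4") is impossible in a fixed pipeline but falls out naturally here.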
🚀 Optimizing Binary Search Trees: Removing Out-of-Range Nodes

Recently, I worked on an interesting problem involving Binary Search Trees (BSTs): "Given a BST and a range [l, r], remove all nodes that fall outside this range, ensuring the resulting tree is still a valid BST."

🧠 Why this matters: This concept has real-world relevance—think about filtering data in a database or search engine where only values within a certain range should be retained, but the underlying structure (like an index tree) must remain optimized and valid.

🔍 Logic (No Code Needed!): Thanks to the inherent properties of a BST, values in the left subtree of a node are always smaller, and values in the right subtree are always greater. This means we can skip entire subtrees when we detect that a node is out of range.

➡️ If a node's value is less than l, the entire left subtree is too small, so we only explore the right.
➡️ If a node's value is greater than r, the entire right subtree is too large, so we only explore the left.
➡️ If the node is within [l, r], we keep it and recursively fix its left and right children.

📈 Result: A clean, trimmed BST that contains only the valid range of values and still respects the structure and rules of a binary search tree.

👨‍💻 It's always satisfying to see how understanding core data structure principles can help us write more optimized and elegant solutions. Curious to hear how others have handled similar tree transformations or real-world filtering challenges in structured data! 🌳💡

#DataStructures #BinarySearchTree #Algorithms #TechTalk #ProblemSolving #Coding #SoftwareEngineering #LinkedInLearning
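The post keeps it code-free, but for reference the three cases map directly onto a short recursion; a minimal Python sketch:

```python
# The three cases above as a compact recursion (classic "trim BST" shape).
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def trim(root, l, r):
    """Return the root of the BST with all values outside [l, r] removed."""
    if root is None:
        return None
    if root.val < l:            # whole left subtree is too small: go right
        return trim(root.right, l, r)
    if root.val > r:            # whole right subtree is too large: go left
        return trim(root.left, l, r)
    root.left = trim(root.left, l, r)    # in range: keep node, fix children
    root.right = trim(root.right, l, r)
    return root
```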
Split long functions into meaningful smaller steps.

I usually split code based on its potential to change, be optimized, or break individually. For example, if my search-by-name function internally does some hashmap transformation and processing, I split those out, because it's not the first time I'm changing exactly that part of the algorithm, and I don't want to be distracted by the rest of the function or worry about shared context.

In that case it's easier to split into smaller chunks even if they are sequential steps, because you can concentrate on each particular step in isolation and be sure all other steps will still work the same as long as this step's interface stays the same.

Plus, for some functions with sequential steps, it's much easier to read an algorithm that looks like:
1. split A from B
2. transform to X
3. check for Y
4. format result to Z

than "for loop in loop if else then otherwise update x in loop of is else then if not else loop over y then while do" gibberish.

Proper separation and naming of steps makes the overall algorithm easier to follow and to reason about in abstract terms. Then again, if something is off, you can debug and test specifically "format for Z", not "that nested for loop with five conditions and shared vars on lines 178-199". It makes thinking easier, in plain language.

More code? Sure. But it's easier to both write and read.
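A toy Python illustration of the idea, with hypothetical step names following the search-by-name example above:

```python
# Toy sketch: each named step can change, be optimized, or break on its
# own, and the top-level function reads like the numbered outline above.

def search_by_name(records, query):
    index = build_name_index(records)        # the part that gets re-optimized
    key = normalize_name(query)              # transform to canonical form
    return format_results(index.get(key, []))  # format result to output shape

def build_name_index(records):
    index = {}
    for record in records:
        index.setdefault(record["name"].lower(), []).append(record)
    return index

def normalize_name(query):
    return query.strip().lower()

def format_results(matches):
    return [{"id": m["id"], "name": m["name"]} for m in matches]
```

As long as `build_name_index` keeps returning a dict keyed by normalized name, you can swap its internals freely without touching the other steps.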
Day 71 of #100DaysOfCode

Problem: Add Number Linked Lists
Difficulty: Medium

Problem Statement: We are given the heads of two singly linked lists, head1 and head2, representing two non-negative integers. The task is to return the head of a new linked list representing the sum of these two numbers.

Examples:
1️⃣ Input: 123 + 999 → Output: 1122, represented as 1 -> 1 -> 2 -> 2
2️⃣ Input: 0063 + 07 → Output: 70, represented as 7 -> 0

Brute Force Approach:
• Convert both linked lists to numbers.
• Add the numbers.
• Convert the sum back to a linked list.
Drawbacks: handles large numbers poorly, and needs extra space for number conversion.

Optimal Approach (used in my solution):
• Remove leading zeros from both lists.
• Reverse both lists to start addition from the least significant digit.
• Use a dummy head and iterate through both lists, adding corresponding digits along with the carry.
• Reverse the result to get the final sum.

Time Complexity: O(max(N, M)), where N and M are the lengths of the two lists.
Space Complexity: O(max(N, M)) for the new result list.

Edge Cases:
• Lists with leading zeros.
• Lists of different lengths.
• Sum resulting in a new digit (carry at the last node).

Key Learnings:
• Reversing linked lists simplifies addition from the least significant digit.
• Using a dummy head node avoids null checks when building the result list.
• Handling carry and edge cases is crucial in linked list arithmetic problems.

My Solution: Efficiently handles large inputs and different lengths without converting to numbers. Time to solve: ~6–7 minutes during practice.

💡 Pro Tip: Always consider edge cases and think about how data structure properties can simplify operations.

#100DaysOfCode #LinkedList #CodingChallenge #DataStructures #Algorithm #ProblemSolving #CPP #LearnToCode
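The original solution is in C++ (#CPP); here is a minimal Python sketch of the same optimal approach, for illustration only:

```python
# Sketch of the optimal approach described above:
# strip leading zeros, reverse, add digit-by-digit with carry, reverse back.
class ListNode:
    def __init__(self, val=0, nxt=None):
        self.val, self.next = val, nxt

def reverse(head):
    prev = None
    while head:
        head.next, prev, head = prev, head, head.next
    return prev

def strip_zeros(head):
    while head and head.next and head.val == 0:  # keep at least one node
        head = head.next
    return head

def add_lists(h1, h2):
    h1, h2 = reverse(strip_zeros(h1)), reverse(strip_zeros(h2))
    dummy = tail = ListNode()   # dummy head avoids null checks
    carry = 0
    while h1 or h2 or carry:    # handles different lengths and final carry
        total = carry + (h1.val if h1 else 0) + (h2.val if h2 else 0)
        carry, digit = divmod(total, 10)
        tail.next = tail = ListNode(digit)
        h1, h2 = h1 and h1.next, h2 and h2.next
    return reverse(dummy.next)  # back to most-significant-digit-first
```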
𝗛𝗼𝘄 𝘁𝗼 𝗠𝗮𝗸𝗲 𝗬𝗼𝘂𝗿 𝗤𝘂𝗲𝗿𝗶𝗲𝘀 𝟭𝟬𝘅 𝗙𝗮𝘀𝘁𝗲𝗿 𝗨𝘀𝗶𝗻𝗴 𝗩𝗶𝗿𝘁𝘂𝗮𝗹 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗱 𝗖𝗼𝗹𝘂𝗺𝗻𝘀

𝗧𝗵𝗲 𝗦𝗰𝗲𝗻𝗮𝗿𝗶𝗼: You have a user_settings table with tons of rows. Each user has a preferences JSON column storing settings like theme, language, and notifications. Now your product team wants to filter users by language preference for targeted campaigns.

𝗧𝗵𝗲 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻: 𝘝𝘪𝘳𝘵𝘶𝘢𝘭 𝘊𝘰𝘮𝘱𝘶𝘵𝘦𝘥 𝘊𝘰𝘭𝘶𝘮𝘯𝘴. Instead of migrating millions of rows to add a new column, create a 𝗰𝗼𝗺𝗽𝘂𝘁𝗲𝗱 𝗰𝗼𝗹𝘂𝗺𝗻 that extracts language directly from the JSON.

𝗪𝗵𝘆 𝘁𝗵𝗲𝘆’𝗿𝗲 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹:
- They're not physically stored (no extra storage),
- They work instantly on all existing rows,
- And they can be indexed, making filters lightning-fast.

𝗪𝗵𝗲𝗻 𝘁𝗵𝗶𝘀 𝗮𝗽𝗽𝗿𝗼𝗮𝗰𝗵 𝘀𝗵𝗶𝗻𝗲𝘀:
- You have stable JSON structures you query frequently.
- The table already exists, you have tons of rows, and you need performance without migration risk.

𝗪𝗵𝗲𝗻 𝘁𝗼 𝗸𝗲𝗲𝗽 𝗶𝘁 𝘀𝗶𝗺𝗽𝗹𝗲: If you're designing a new schema or your JSON changes often, a traditional column is cleaner and more predictable.

𝗧𝗵𝗲 𝘁𝗿𝗮𝗱𝗲-𝗼𝗳𝗳:
- You add schema complexity.
- Computed columns can't be updated directly.
- And if the computed values are large, indexes can grow quickly.

Before jumping to the easiest but most costly solution, use the engineering mindset: find the best solution not by adding more, but by using what you already have.

#BackendEngineering #DatabaseOptimization #EngineeringMindset
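A miniature of the same idea, run from Python against SQLite. The exact DDL syntax differs by engine (MySQL and PostgreSQL have their own forms); this sketch assumes SQLite ≥ 3.31 with the built-in JSON functions, and the table/column names mirror the scenario above:

```python
# Sketch: a VIRTUAL generated column extracting "language" from a JSON
# preferences blob, then indexed for fast filtering.
# Assumes SQLite >= 3.31 (generated columns + JSON1 functions).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE user_settings (
    user_id INTEGER PRIMARY KEY,
    preferences TEXT  -- JSON blob: {"theme": ..., "language": ..., ...}
);
INSERT INTO user_settings VALUES
    (1, '{"theme": "dark", "language": "fr"}'),
    (2, '{"theme": "light", "language": "en"}');

-- Virtual column: computed on read, no table rewrite, no extra storage.
ALTER TABLE user_settings ADD COLUMN language TEXT
    GENERATED ALWAYS AS (json_extract(preferences, '$.language')) VIRTUAL;

-- The index is what makes the filter lightning-fast.
CREATE INDEX idx_settings_language ON user_settings(language);
""")
print(db.execute(
    "SELECT user_id FROM user_settings WHERE language = ?", ("fr",)
).fetchall())   # -> [(1,)]
```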
✦ What is Big-O(log n)?
Big-O(log n) represents algorithms whose time complexity grows in proportion to the logarithm of the input size. Instead of checking every element, the dataset is repeatedly reduced (often halved) until the solution is found.

✦ Purpose
➤ Optimize performance and reduce unnecessary operations
➤ Scale efficiently with very large datasets
➤ Deliver faster solutions compared to linear approaches

✦ Use Cases
➤ Searching in sorted datasets (binary search)
➤ Balanced tree operations (BST, AVL, Red-Black trees)
➤ Heap insertion and deletion
➤ Divide & conquer algorithms (merge sort, quicksort in best/average cases; their overall O(n log n) cost comes from the O(log n) levels of halving)

✦ Professional Developer Perspective
➤ Efficiency over simplicity – log n solutions shine when datasets grow large
➤ Structured data required – most log n methods assume sorted or organized inputs
➤ Impact of scale – small datasets may blur the difference, but at millions of records, log n means milliseconds vs. seconds
➤ Pattern recognition – divide-and-conquer thinking naturally unlocks log n solutions

✦ Attached Examples
➤ Below, I've added supporting images for visualization
➤ Illustrated with JavaScript search approaches (linear vs. binary search)
➤ Real-world comparisons:
• O(n) → checking every item in a list, one by one
• O(log n) → splitting the search space in half repeatedly until the target is found

#BigO #TimeComplexity #JavaScript #WebDevelopment #SoftwareEngineering #SystemDesign #ProgrammingConcepts #Scalability #Algorithms #TechContent #FullStackDevelopment
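The post's attached images use JavaScript; here is the same linear-vs-binary comparison sketched in Python:

```python
# O(n) vs O(log n): linear search touches every element in the worst
# case; binary search halves the (sorted) search space each step.

def linear_search(items, target):          # O(n)
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

def binary_search(items, target):          # O(log n); items must be sorted
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1                   # discard the lower half
        else:
            hi = mid - 1                   # discard the upper half
    return -1

# On a million sorted records: ~1,000,000 checks worst case vs. ~20.
```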
🚀 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝘁𝗵𝗮𝘁 𝘀𝘂𝗿𝘃𝗶𝘃𝗲𝘀 𝗿𝗲𝗮𝗹𝗶𝘁𝘆
• OpenSearch 𝗵𝘆𝗯𝗿𝗶𝗱 (BM25 + vector) with 𝗸-𝗡𝗡/𝗛𝗡𝗦𝗪 and correct knn_vector mappings
• 𝗩𝗲𝗿𝘀𝗶𝗼𝗻𝗲𝗱 𝗲𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴 𝗴𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 (model + params stored in index metadata)
• 𝗭𝗲𝗿𝗼-𝗱𝗼𝘄𝗻𝘁𝗶𝗺𝗲 𝗿𝗲𝗶𝗻𝗱𝗲𝘅 with retries + dead-letter queue

🛡️ 𝗗𝗿𝗶𝗳𝘁-𝗽𝗿𝗼𝗼𝗳 𝗱𝗮𝘁𝗮 & 𝗺𝗼𝗱𝗲𝗹𝘀
• 𝗦𝗰𝗵𝗲𝗺𝗮 𝘀𝗶𝗴𝗻𝗮𝘁𝘂𝗿𝗲 (column list + file hash) validated at ingest
• 𝗠𝗼𝗱𝗲𝗹/𝘃𝗲𝗿𝘀𝗶𝗼𝗻 𝗽𝗶𝗻𝗻𝗶𝗻𝗴 + CI/CD release gates
• 𝗣𝗿𝗲/𝗣𝗼𝘀𝘁 𝘃𝗮𝗹𝗶𝗱𝗮𝘁𝗶𝗼𝗻: row/field parity, score-distribution shift, null-field checks

🔎 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝘁𝗵𝗮𝘁 𝗴𝗮𝘁𝗲𝘀 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝘀
• Baseline queries with 𝗻𝗗𝗖𝗚 / 𝗽𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻 targets (𝗣@𝟭𝟬 for UX; 𝗥@𝟱𝟬 for RAG pool)
• 𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝗲 𝗱𝗮𝘀𝗵𝗯𝗼𝗮𝗿𝗱 + 𝗮𝗹𝗲𝗿𝘁𝗶𝗻𝗴 (e.g., missing owner/responsibility)

📈 𝗢𝘂𝘁𝗰𝗼𝗺𝗲𝘀 (measured on N=500 canonical queries, 2-week pre/post, P@10 unless noted)
• 𝗤𝘂𝗮𝗹𝗶𝘁𝘆: +23% precision@10 • –87% missing owner field • –42% triage time
• 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆: 99.7% reindex success @ 48k docs/min (bulk ingest with retries + DLQ)
• 𝗢𝗽𝘀: MTTR –38%, on-call investigation time –31% (30-day production window)

🧰 𝗦𝘁𝗮𝗰𝗸
OpenSearch (k-NN/HNSW, hybrid BM25+vector), Python, PyTest, RAG eval, Docker/Kubernetes, GitLab CI, AWS/GCP, secrets-managed SMTP.

👉 If you're scaling RAG/search in production, happy to compare notes.

#RAG #OpenSearch #VectorSearch #HNSW #MLOps #SearchRelevance #Retrieval #LLM #Observability
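For readers new to the first two bullets, here is a hypothetical example of what a knn_vector/HNSW mapping with versioned embedding metadata can look like. The model name, dimension, and HNSW parameters are illustrative placeholders, not the author's production values:

```python
# Illustrative OpenSearch index body: knn_vector HNSW mapping for the
# vector side of hybrid search, with the embedding model and params
# versioned in the index _meta so a future reindex knows exactly what
# produced each vector. All concrete values below are assumptions.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "_meta": {  # embedding governance: versioned with the index
            "embedding_model": "example-embedder",   # hypothetical name
            "embedding_version": "2024-06-01",
            "dimension": 768,
        },
        "properties": {
            "text": {"type": "text"},                # BM25 side of hybrid
            "embedding": {                           # vector side
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
        },
    },
}
# e.g. with opensearch-py:
# client.indices.create(index="docs-v2", body=index_body)
```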
🌀 This repository is all you need to learn and build a #RAG application!

Here's what it covers:
🔹 Query Construction – Translating natural language into structured queries (#SQL, Cypher, or vector-based retrieval). (Text-to-SQL, Text-to-Cypher, Self-query retriever)
🔹 Query Translation – Decomposing and rephrasing inputs for better retrieval. (Multi-query, RAG-Fusion, Hypothetical Docs)
🔹 Routing – Dynamically selecting the right database or embedding query context for more relevant answers.
🔹 Retrieval – Ranking and refining retrieved data using Re-Rank, RankGPT, RAG-Fusion, CRAG, or even pulling real-time updates from external sources.
🔹 Indexing – Leveraging multi-representation embeddings, hierarchical summarization, and structured search optimization. (RAPTOR, ColBERT, #Fine_tuning)
🔹 Generation – Producing and refining responses with Self-RAG and RRR, enabling iterative reasoning and retrieval loops when needed.

👉 GitHub: 🔗 https://lnkd.in/g9SQ4gg6
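As a taste of one listed technique, query translation via multi-query, here is a minimal sketch with `llm` and `retriever` as hypothetical stand-ins for the repo's components:

```python
# Multi-query translation in miniature: rephrase the input several ways,
# retrieve for each phrasing, merge the pools. llm and retriever are
# assumed stand-ins, not the repository's actual API.

def multi_query_retrieve(question, llm, retriever, n_variants=3, top_k=5):
    # Ask the model for alternative phrasings of the same question.
    variants = llm.generate(
        f"Rewrite this question {n_variants} different ways:\n{question}"
    ).splitlines()
    seen, merged = set(), []
    # Retrieve for the original plus each variant, de-duplicating by id.
    for q in [question, *variants]:
        for doc in retriever.search(q, top_k=top_k):
            if doc.id not in seen:
                seen.add(doc.id)
                merged.append(doc)
    return merged
```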