Docling Adds Structured Data Extraction from Documents

Peter W. J. Staar

Principal Research Staff Member, Master Inventor, Manager of "AI for Knowledge" group in IBM Research Zurich; Chair of the technical steering committee of Docling in the Linux Foundation for AI and Data

🚀 New in Docling: Structured Data Extraction from Documents! 🚀

We’ve just added a brand-new feature: extraction of structured data from complex documents using free-form schemas 🤩

What does that mean?
🔹 You can now skip the conversion step: instead of turning documents into text or JSON first, Docling directly extracts the structured fields you care about.
🔹 The requested fields are defined in a free-form schema, so you can instantly align the extraction with the schemas of your own databases.
🔹 This makes it ideal for data pipelines that don’t need full document conversion, but do need to populate structured databases from messy, unstructured documents.

✨ It’s:
1️⃣ Super simple to use (check out the code snippet 👉 https://lnkd.in/eK6vaMEe )
2️⃣ 100% open-source and runs fully locally, no API calls needed 🙌
3️⃣ Powered by cutting-edge models from NuMind (YC S22)
4️⃣ Perfect for data pipelines where you need to populate structured databases from documents (think invoices, CVs, contracts, product datasheets, etc.)
5️⃣ Currently focused on PDFs and images (PNG); support for pure text is coming soon!

The example below shows how easy it is to define a schema and extract structured fields directly from an invoice.

👉 Try it out, break it, send us feedback, and if you like what we’re building, don’t forget to ⭐ the repo (https://lnkd.in/d4UT-6_2)!

After a long and fruitful summer, the Docling team has been cooking up many new features, and this is just the beginning! 🌟

#opensource #AI #documentAI #docling #IBM #IBMResearch
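For readers who want a feel for the schema-then-extract flow described in the post, here is a minimal sketch. It is not the snippet linked above. It assumes Docling's beta `DocumentExtractor` API with a dict template mapping field names to type names; the field names, the file path, and the `missing_fields` helper are hypothetical and only there for illustration. Check the Docling documentation for the exact interface in your installed version.

```python
from typing import Any

# Free-form schema: field name -> type name given as a plain string.
# These field names are illustrative; they are not taken from the post.
INVOICE_TEMPLATE: dict[str, str] = {
    "invoice_number": "string",
    "total": "float",
    "currency": "string",
}


def extract_fields(source: str, template: dict[str, str]) -> list[Any]:
    """Run Docling's extraction on a local PDF or image; return per-page data.

    Assumes the beta DocumentExtractor API (an assumption, see the lead-in).
    """
    # Imported lazily so the schema and helper below work without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.document_extractor import DocumentExtractor

    extractor = DocumentExtractor(
        allowed_formats=[InputFormat.PDF, InputFormat.IMAGE]
    )
    result = extractor.extract(source=source, template=template)
    return [page.extracted_data for page in result.pages]


def missing_fields(record: dict[str, Any], template: dict[str, str]) -> list[str]:
    """Hypothetical helper: schema fields that are absent or None in a record."""
    return [name for name in template if record.get(name) is None]
```

A call such as `extract_fields("invoice.pdf", INVOICE_TEMPLATE)` would return one extracted record per page; running `missing_fields` on each record before a database insert is one simple way to catch fields the model could not find.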

Luis Molina

Technical Lead AI - Engineer AI

1mo

Congrats on the release. I have to ask: LangExtract from Google does something similar, so what is the difference?

Arvind Rajpurohit

IBM Partner Data and AI Leader | Open Group Certified

1mo

Built an agent with Docling that reads U.S. Customs forms and turns them into structured, usable information. Great work, team!

Shoeb Masood

AI solutions Architect | Generative AI | LLM | RAG | Agentic AI | Python | NLP | Deep Learning | AI-Driven Automation | PyTorch | TensorFlow | Keras | LLMOps | Scalable AI Solutions | AI Consultant

1mo

Peter W. J. Staar: Absolutely amazing! 🚀 This is such a powerful addition. I’ve been handling something similar with a lot of glue code, but this feature will really help streamline the entire data ingestion pipeline. Excited to try it out soon and see how it performs in real-world workflows. Kudos to the team! 👏 Quick question: does it also support nested schemas (e.g., line items in invoices), or is it mainly for flat field extraction?

Aashish Chaudhary

Technical Product Leader in AI/ML, Real-Time 3D, and Open Source | Bridging Technology with Business Strategy

1mo

Awesome! We are big fans of Docling.

Thanks for the awesome product. It is getting better. Privacy is gold.

Jeebesh Chandra Podder

MLOps & AI Platform Engineer | Architecting & Deploying Scalable AI on AWS, Azure and GCP | RAG & Agentic AI

1mo

Loved this — super clear! 🙌 Curious: which Document Loader worked best for mixed PDFs + webpages in your experiments?

Daniel Svonava

Build better AI Search with Superlinked | xYouTube

1mo

How flexible is it when document layouts vary widely? Curious because invoice/contract formats can be really inconsistent.

Aashish Jangid

Data Scientist | EX - Celebal | Generative AI | Large Language Models (LLMs) | RAG | Deep Learning | Computer Vision | Time Series | MLOps | Azure AI | Databricks | MLFlow

1mo

Accuracy is very good, thanks for sharing.

Kürşad Laçin

Senior Forward Deployed Agentic Engineer (FDAE) - Allianz Türkiye | M.Sc. Universität Passau

1mo