🚀 Hiring: Data Engineer – Multimodal AI
AI4Bhārat is building the next generation of large-scale multimodal AI systems across Speech, NLP, and Vision. We are looking for exceptional Data Engineers (full-time + interns) to join our team in Chennai and help us push the frontier of multimodal AI.
The Role:
As a Data Engineer (Multimodal), you will be at the heart of our mission: building the data backbone for Indian Language AI, the foundation that powers frontier multimodal models.
This is not a standard ETL role. You’ll design massive-scale data pipelines, integrate state-of-the-art multimodal AI models spanning speech, vision, and language directly into those workflows, and optimize workloads across hundreds of GPUs. Your work will directly enable the training of models that set new benchmarks for Indian and global AI.
What you’ll do:
- Architect and scale data pipelines for multimodal corpora (speech, vision, text) at petabyte scale.
- Integrate AI models in-the-loop for data cleaning, filtering, and enrichment.
- Build and optimize training and benchmarking pipelines for cutting-edge AI models.
- Work with modern dataset formats (WebDataset, Arrow/Parquet, HF Datasets) and design sharding strategies that maximize training throughput.
- Evaluate and stress-test AI systems across languages, modalities, and domains.
- Work on real-world, high-impact AI challenges for Indian languages.
What we’re looking for:
- Proficiency in Python & PyTorch, with strong fundamentals in deep learning, data engineering, and distributed systems.
- Hands-on with Linux, Bash, multiprocessing/SLURM, and modern data tooling (HF Datasets, PyArrow, WebDataset, Polars).
- Good fundamentals in algorithms, systems, and databases, plus hands-on with Git, Docker, and cloud/HPC environments.
- Freshers and engineers with 1–3 years of experience welcome.
- Degree in CS, DS, or related fields (B.Tech / M.Tech / Master’s).
Bonus points if you have:
- Experience integrating LLMs into production workflows (e.g., NeMo, HuggingFace, or custom LLM APIs in pipelines).
- Hands-on exposure to large-scale, high-throughput data pipelines or multilingual AI challenges (speech, text, and vision).
- An understanding of distributed training frameworks (e.g., DeepSpeed, FSDP) and GPU cluster orchestration.
- A strong track record of open-source contributions, or of building tools and libraries that others use at scale.
Why join us:
- Work on impact-driven AI for Indian languages.
- Get access to large-scale GPU clusters to run bold experiments.
- Join a friendly, high-energy team that thrives on solving hard problems.
📍 Location: Chennai (on-site preferred, hybrid possible)
📅 Deadline: Apply before 5th September; applications will be reviewed on a rolling basis.
🚀 Start Date: Immediate (flexible for the right candidate)
📝 Apply here: https://lnkd.in/gJVafUiM
Mitesh Khapra · Anoop Kunchukuttan · S V Praveen · Kaushal Bhogale