Big Data News Weekly
Hugging Face's 200-page guide to train your own models
Plus: OpenAI-AWS $38B cloud deal

Hey folks! Let's get into Big Data and AI craziness...
In today's edition: What's Shaping the Future of Data?
How Software Engineers and Data Scientists Work Together
LangChain DeepAgents Framework Now in CLI
AI Workflows that Agents Can Build and Run On-the-Fly
Vectorless Vision-Based RAG - No OCR, No Database
Facebook Dating Is a Surprise Hit for the Social Network
New Benchmark Tests AI's Freelance Automation
AI Tutorial: How to Create Realistic AI Voices for Your Content
AI Tools and Data Tools to Check Out

Hugging Face just dropped their "Smol Training Playbook," a 200+ page deep dive into building their SmolLM3 model from scratch. The team documents the complete pipeline (pretraining, post-training, and infrastructure), sharing what worked, what failed, and how to keep training runs stable. Think of it as the field notes from training a competitive 3B-parameter model, minus the usual vendor mystique. And it's completely free.

Both data scientists and engineers must take responsibility for an issue and work to solve it at whatever step of the work it arises. Continuous communication ensures that possible discrepancies are caught at an early stage.

LangChain just shipped DeepAgents CLI, bringing their DeepAgents framework straight to your terminal. Install with pip install deepagents-cli, and you get an agent that can edit files, run shell commands, search the web, and retain information across sessions by writing memories locally: API patterns, project conventions, and context from previous conversations.

Declarative AI workflows you can read, write, and trust - like Dockerfile or SQL but for multi-step LLM pipelines. Pipelex gives you a DSL and Python runtime for repeatable AI workflows. You declare what happens at each step, and any model or provider can run it.
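To make the "declare the steps, swap the model" idea concrete, here is a minimal Python sketch. It is not Pipelex's DSL or API: the WORKFLOW structure, run_workflow, and echo_model are all hypothetical names invented for illustration, and the "model" is a stand-in function rather than a real LLM call.

```python
from typing import Callable

# Hypothetical declarative workflow: each step is plain data naming its
# output and a prompt template that may reference earlier outputs.
WORKFLOW = [
    {"name": "summary", "prompt": "Summarize: {document}"},
    {"name": "headline", "prompt": "Write a headline for: {summary}"},
]


def run_workflow(
    steps: list[dict[str, str]],
    inputs: dict[str, str],
    model: Callable[[str], str],
) -> dict[str, str]:
    """Run each declared step in order, storing every result under its name."""
    memory = dict(inputs)
    for step in steps:
        prompt = step["prompt"].format(**memory)  # fill in earlier results
        memory[step["name"]] = model(prompt)      # any callable can act as the model
    return memory


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): upper-cases the prompt."""
    return prompt.upper()


result = run_workflow(WORKFLOW, {"document": "GPUs are scarce."}, echo_model)
print(result["headline"])
```

Because the steps are data rather than code, the same WORKFLOW could be run against a different provider by passing a different callable as model.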

DeepSeek OCR is powerful, but do you even need OCR models for RAG? PageIndex takes a different approach with vision-based RAG that mimics how humans actually read documents: reasoning over a hierarchical table-of-contents structure to identify relevant pages, then processing those pages as images with VLMs like GPT-4.1 for visual understanding and answer generation.
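The page-selection idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not PageIndex's actual code or API: Section, relevance, and find_pages are hypothetical names, and a simple word-overlap score stands in for the LLM reasoning over the table of contents. The final step, rendering the selected pages as images and sending them to a VLM, is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of a document's table-of-contents tree (hypothetical type)."""
    title: str
    pages: range
    children: list[Section] = field(default_factory=list)


def subtree_titles(node: Section) -> str:
    """Concatenate the titles of a node and all of its descendants."""
    return " ".join([node.title] + [subtree_titles(c) for c in node.children])


def relevance(query: str, node: Section) -> int:
    """Toy stand-in for LLM reasoning: words shared by the query and the subtree titles."""
    return len(set(query.lower().split()) & set(subtree_titles(node).lower().split()))


def find_pages(query: str, node: Section) -> list[int]:
    """Walk down the tree, following the best-matching child at each level."""
    while node.children:
        best = max(node.children, key=lambda c: relevance(query, c))
        if relevance(query, best) == 0:
            break  # nothing below looks relevant; keep the current section
        node = best
    return list(node.pages)


toc = Section("Annual Report", range(1, 41), [
    Section("Financial Statements", range(5, 21), [
        Section("Balance Sheet", range(5, 10)),
        Section("Cash Flow Statement", range(10, 21)),
    ]),
    Section("Risk Factors", range(21, 41)),
])

pages = find_pages("cash flow from operations", toc)
print(pages)  # pages 10 through 20, the "Cash Flow Statement" section
```

No embeddings or vector database are involved: retrieval is a tree walk over section titles, which is the sense in which the approach is "vectorless."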
Data Tools, Libraries
Newsbang: Your Lens on Emerging Trends
PostgreSQL Index Advisor (GitHub Repo)
PostgreSQL Index Advisor is a PostgreSQL extension for recommending indexes to improve query performance.
pylyzer (GitHub Repo)
pylyzer is a static code analyzer and language server for Python.
AI News:

OpenAI just secured a seven-year, $38B agreement with Amazon Web Services for computing infrastructure, marking the company's largest diversification away from Microsoft's cloud services. The partnership grants OpenAI access to hundreds of thousands of Nvidia GPUs across AWS data centers, with deployment targeted for completion by late 2026.

Facebook Dating debuted in 2019. The feature lets people create a free dating profile and swipe and match with other users. It has more than 21 million daily users, making it one of the most popular online dating services. Facebook Dating shows how social networking is evolving into two broad categories: content and services.

AI cloud startup Lambda has signed a multi-billion-dollar agreement with Microsoft to deploy tens of thousands of Nvidia GPUs across its infrastructure. The partnership will expand Microsoft's use of high-end AI chips through external providers, helping it meet surging demand for model training capacity without the delays of building data centers.

Brian Koo (grandson of LG Group's founder) has co-founded Utopai East, a 50-50 joint venture with Utopai Studios to build AI-powered film and TV production infrastructure. The partnership will produce content using existing infrastructure initially, with the first piece of content expected to launch next year, focusing on Korean creators and international IP expansion.

Scale AI and the Center for AI Safety published the Remote Labor Index, a new benchmark that tests AI models on real freelance projects, revealing that even the top systems complete less than 3% of tasks at professional human standards.
AI Tutorial
How to create realistic AI voices for your content

Open Google AI Studio and select "Native Speech Generation."
Pick your mode: Single-speaker for narrations or Multi-speaker for dialogues.
Write your script, adding style notes and choosing voices for each speaker.
Click "Run" to generate the audio, then download it for your project.
Top AI tools to increase productivity:
YouBrief is a free AI tool designed to help users quickly extract summaries from YouTube videos
VocalReplica is an AI-powered web-based tool that allows users to effortlessly isolate vocals
HomeStage lets you upload a picture and our AI will add furniture within seconds.
ChatMaxima is a Conversational Marketing SaaS platform that revolutionizes the way businesses connect with customers
Wemate - Explore, craft, and communicate with the virtual companions of your dreams through Wemate.
Forewrite - Craft and enhance various content forms, including images, code, and speech-to-text
Editby - Create content for your blog, newspaper, newsletter, press notes, social networks etc. with AI.
Data Analyst AI connects Google Analytics with ChatGPT, delivering AI-powered eCommerce insights and automated weekly reports.
View our database of all the best AI tools for your needs: aitoolsup.com



