Excited to share my recent hands-on machine learning practice with K-Nearest Neighbors (KNN) regression. 🚀
I implemented KNN regression on the Diamonds dataset, focusing on:
Data cleaning and preprocessing (scaling & splitting data)
Hyperparameter tuning with GridSearchCV
Evaluating the model with R², MAE, and RMSE
This exercise helped me understand how distance-based algorithms work for regression tasks and how much parameter tuning affects performance (the short manual-prediction sketch at the end of the code below makes the distance-based idea concrete).
Looking forward to exploring more ML algorithms and sharing my journey here!
#MachineLearning #KNN #Regression #Python #Learning
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
""" Load the dataset """""
df = pd.read_csv("diamonds.csv")
print("Initial Data:")
print(df.head())
""" Data Cleaning"""
# Drop unwanted columns (like unnamed index column if present)
if "Unnamed: 0" in df.columns:
df.drop("Unnamed: 0", axis=1, inplace=True)
""" Check for missing values"""
print("\nMissing values before cleaning:")
print(df.isnull().sum())
""" Fill or drop missing values (here we drop rows with null values)"""
df.dropna(inplace=True)
# Encode categorical columns (cut, color, clarity are categorical in diamonds dataset)
categorical_cols = ['cut', 'color', 'clarity']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
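# Note: cut, color, and clarity are ordinal (e.g. Fair < Good < Very Good < Premium < Ideal),
# so an alternative sketch could encode that order explicitly instead of one-hot encoding:
# from sklearn.preprocessing import OrdinalEncoder
# enc = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
#                                  list('JIHGFED'),  # color: worst to best
#                                  ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']])
# df[categorical_cols] = enc.fit_transform(df[categorical_cols])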
print("\nData after cleaning:")
print(df.head())
""" Feature & Target Split"""
X = df.drop("price", axis=1) # Features
y = df["price"] # Target variable (regression)
""" Train-Test Split """
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Feature scaling: KNN is distance-based, so features must be on comparable scales
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
""" Model Building (KNN Regression)"""
knn = KNeighborsRegressor()
# Hyperparameter tuning using GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
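# A broader (hypothetical) grid could also tune the weighting scheme and distance metric:
# param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
#               'weights': ['uniform', 'distance'],
#               'p': [1, 2]}  # p=1 Manhattan, p=2 Euclidean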
grid = GridSearchCV(knn, param_grid, cv=5, scoring='r2')  # 5-fold cross-validation
grid.fit(X_train_scaled, y_train)
print("\nBest Parameters:", grid.best_params_)
# Final model with best parameters
best_knn = grid.best_estimator_
""" Model Evaluation"""
y_pred = best_knn.predict(X_test_scaled)
print("\nModel Performance:")
print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))