10 Clustering Algorithms With Python

Clustering or cluster analysis is an unsupervised learning problem.
It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior.
There are many clustering algorithms to choose from and no single best clustering algorithm for all cases. Instead, it is a good idea to explore a range of clustering algorithms and different configurations for each algorithm.
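As a rough illustration of that workflow, the sketch below fits k-means from scikit-learn on a synthetic dataset; the dataset, number of clusters, and parameter values are illustrative assumptions, not settings from this tutorial.

# Illustrative sketch: fit k-means clustering on synthetic data (assumed settings).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# create a synthetic dataset with 3 well-separated groups
X, _ = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=4)

# fit the clustering model and assign each example to a cluster
model = KMeans(n_clusters=3, n_init=10, random_state=4)
labels = model.fit_predict(X)
print(labels[:10])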
In this tutorial, you will discover how to fit and use top clustering algorithms in Python.
After completing this tutorial, you will know:
Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
There are many different clustering algorithms and no single best method for all datasets.
How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.
Let’s get started.
Clustering Algorithms With Python
Photo by Lars Plougmann, some rights reserved.
Tutorial..

Continue Reading...

What Is Argmax in Machine Learning?

Argmax is a mathematical function that you may encounter in applied machine learning.
For example, you may see “argmax” or “arg max” used in a research paper to describe an algorithm. You may also be instructed to use the argmax function in your algorithm implementation.
This may be the first time that you encounter the argmax function and you may wonder what it is and how it works.
In this tutorial, you will discover the argmax function and how it is used in machine learning.
After completing this tutorial, you will know:
Argmax is an operation that finds the argument that gives the maximum value from a target function.
Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.
Let’s get started.
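To make that last point concrete, here is a small sketch comparing a manual argmax with NumPy’s argmax() function; the probability values are made up for illustration.

# Illustrative sketch: argmax over predicted class probabilities (made-up values).
from numpy import argmax

# predicted probabilities for three classes
probs = [0.4, 0.5, 0.1]

# manual argmax: find the index of the largest value
best, best_index = probs[0], 0
for i, value in enumerate(probs):
    if value > best:
        best, best_index = value, i
print(best_index)  # -> 1

# the NumPy function does the same thing
print(argmax(probs))  # -> 1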
What Is argmax in Machine Learning?
Photo by Bernard Spragg. NZ, some rights reserved.
Tutorial Overview
This tutorial is divided ..

Continue Reading...

Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

Gradient boosting is a powerful ensemble machine learning algorithm.
It’s popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm or one of the main algorithms used in winning solutions to machine learning competitions, like those on Kaggle.
There are many implementations of gradient boosting available, including the standard implementation in scikit-learn and efficient third-party libraries. Each uses a different interface and even different names for the algorithm.
In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python.
Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy-paste and use in your own predictive modeling project.
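For example, a minimal sketch of the scikit-learn implementation might look like the following; the synthetic dataset and hyperparameter values are assumptions for illustration, not the tutorial’s standardized examples.

# Illustrative sketch: gradient boosting for classification with scikit-learn (assumed settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# define and evaluate the model with 10-fold cross-validation
model = GradientBoostingClassifier(n_estimators=100)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean accuracy: %.3f' % scores.mean())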
After completing this tutorial, you will know:
Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error g..

Continue Reading...

How to Calculate Feature Importance With Python

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.
There are many types and sources of feature importance scores, although popular examples include statistical correlation scores, coefficients calculated as part of linear models, importance scores derived from decision trees, and permutation importance scores.
Feature importance scores play an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem.
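As one example of the idea, the sketch below fits a random forest on a synthetic dataset and reports its impurity-based importance scores; the dataset and model are illustrative assumptions, and the tutorial covers several other scoring methods.

# Illustrative sketch: impurity-based feature importance from a random forest (assumed setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic dataset with 5 informative features plus redundant/noise features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# fit the model and report one importance score per input feature
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, y)
for i, score in enumerate(model.feature_importances_):
    print('Feature %d: %.4f' % (i, score))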
In this tutorial, you will discover feature importance scores for machine learning in Python.
After completing this tutorial, you will know:
The role of feature importance in a predictive modeling problem.
How to calculate and review feature importance from linear models and decision trees.
How to calculate and review per..

Continue Reading...

How to Develop Multi-Output Regression Models with Python

Multioutput regression problems are regression problems that involve predicting two or more numerical values given an input example.
An example might be to predict a coordinate given an input, e.g. predicting x and y values. Another example would be multi-step time series forecasting, which involves predicting multiple future values of a given variable.
Many machine learning algorithms are designed for predicting a single numeric value, referred to simply as regression. Some algorithms do support multioutput regression inherently, such as linear regression and decision trees. There are also special workaround models that can be used to wrap and use those algorithms that do not natively support predicting multiple outputs.
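As a hedged sketch of the first case, the example below fits a linear regression model that natively predicts two outputs per input; the synthetic dataset is an assumption for illustration.

# Illustrative sketch: inherently multioutput regression with linear regression (assumed data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# synthetic dataset where each example has two target values
X, y = make_regression(n_samples=1000, n_features=10, n_targets=2, random_state=1)

# linear regression supports multiple outputs directly
model = LinearRegression()
model.fit(X, y)

# predict both values for a single new example
print(model.predict(X[:1]))

For algorithms that only predict a single value, scikit-learn also provides wrapper classes such as MultiOutputRegressor.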
In this tutorial, you will discover how to develop machine learning models for multioutput regression.
After completing this tutorial, you will know:
The problem of multioutput regression in machine learning.
How to develop machine learning models that inherently suppo..

Continue Reading...

4 Distance Measures for Machine Learning

Distance measures play an important role in machine learning.
They provide the foundation for many popular and effective machine learning algorithms like k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning.
Different distance measures must be chosen and used depending on the type of data. As such, it is important to know how to implement and calculate a range of popular distance measures, as well as the intuitions behind the resulting scores.
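As a small illustration, the sketch below implements Euclidean and Manhattan distance directly from their definitions; the example vectors are made up.

# Illustrative sketch: Euclidean and Manhattan distance between two vectors (made-up data).
from math import sqrt

def euclidean_distance(a, b):
    # square root of the sum of squared differences
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
print(euclidean_distance(row1, row2))
print(manhattan_distance(row1, row2))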
In this tutorial, you will discover distance measures in machine learning.
After completing this tutorial, you will know:
The role and importance of distance measures in machine learning algorithms.
How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.
Let’s get started.
Distance Measures for Machine Learning
Photo by Prince Roy, some rights reserved.
..

Continue Reading...

PyTorch Tutorial: How to Develop Deep Learning Models with Python

Predictive modeling with deep learning is a skill that modern developers need to know.
PyTorch is one of the leading open-source deep learning frameworks, developed and maintained by Facebook.
At its core, PyTorch is a mathematical library that allows you to perform efficient computation and automatic differentiation on graph-based models. Developing models directly at that level is challenging, although thankfully, the modern PyTorch API provides classes and idioms that allow you to easily develop a suite of deep learning models.
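As a rough sketch of what those idioms look like, the snippet below defines and trains a tiny multilayer perceptron on random data; the architecture and training settings are assumptions for illustration, not the models developed in the tutorial.

# Illustrative sketch: a tiny PyTorch model trained on random data (assumed architecture and settings).
import torch
from torch import nn

# define a small multilayer perceptron
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

# random inputs and targets stand in for a real dataset
X = torch.randn(100, 8)
y = torch.randn(100, 1)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# a few training epochs: forward pass, loss, backward pass, weight update
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
print(loss.item())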
In this tutorial, you will discover a step-by-step guide to developing deep learning models in PyTorch.
After completing this tutorial, you will know:
The difference between Torch and PyTorch and how to install and confirm PyTorch is working.
The five-step life-cycle of PyTorch models and how to define, fit, and evaluate models.
How to develop PyTorch deep learning models for regression, classification, and predictive modeling tasks.
Let’s get started.
PyTorch Tutorial – How to ..

Continue Reading...

Basic Data Cleaning for Machine Learning (That You Must Perform)

Data cleaning is a critically important step in any machine learning project.
In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform.
Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results.
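To make that concrete, the sketch below applies two of these basic checks with pandas on a tiny made-up table; the data and column names are illustrative assumptions.

# Illustrative sketch: two basic cleaning checks with pandas (made-up data).
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 3],          # last row duplicates the one before it
    'constant': [0, 0, 0, 0],    # column with only a single value
    'value': [1.2, 3.4, 5.6, 5.6],
})

# identify and remove columns that contain only a single value
single_value_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=single_value_cols)

# identify and remove duplicate rows
df = df.drop_duplicates()
print(df)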
In this tutorial, you will discover basic data cleaning you should always perform on your dataset.
After completing this tutorial, you will know:
How to identify and remove column variables that only have a single value.
How to identify and consider column variables with very few unique values.
How to identify and remove rows that contain duplica..

Continue Reading...

Neural Networks are Function Approximation Algorithms

Supervised learning in machine learning can be described in terms of function approximation.
Given a dataset comprised of inputs and outputs, we assume that there is an unknown underlying function that consistently maps inputs to outputs in the target domain and that produced the dataset. We then use supervised learning algorithms to approximate this function.
Neural networks are an example of a supervised machine learning algorithm that is perhaps best understood in the context of function approximation. This can be demonstrated with examples of neural networks approximating simple one-dimensional functions that aid in developing the intuition for what is being learned by the model.
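As a hedged sketch of that idea, the snippet below fits a small scikit-learn multilayer perceptron to samples of a simple one-dimensional function; the function, network size, and library choice are assumptions for illustration.

# Illustrative sketch: approximating y = x^2 with a small neural network (assumed setup).
import numpy as np
from sklearn.neural_network import MLPRegressor

# sample the "unknown" mapping on a grid of inputs
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = (x ** 2).ravel()

# fit a small multilayer perceptron to approximate the mapping
model = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=5000, random_state=1)
model.fit(x, y)

# compare a prediction against the true function value
print(model.predict([[0.5]]), 0.5 ** 2)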
In this tutorial, you will discover the intuition behind neural networks as function approximation algorithms.
After completing this tutorial, you will know:
Training a neural network on data approximates the unknown underlying mapping function from inputs to outputs.
One dimensional input and ou..

Continue Reading...

Imbalanced Multiclass Classification with the E.coli Dataset

Multiclass classification problems are those where a label must be predicted and there are more than two labels from which to choose.
These are challenging predictive modeling problems because a sufficiently representative number of examples of each class is required for a model to learn the problem. It is made more challenging when the number of examples in each class is imbalanced, or skewed toward one or a few of the classes, with very few examples of the other classes.
Problems of this type are referred to as imbalanced multiclass classification problems, and they require both the careful design of an evaluation metric and test harness and the careful choice of machine learning models. The E.coli protein localization sites dataset is a standard dataset for exploring the challenge of imbalanced multiclass classification.
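As a hedged sketch of such a test harness, the example below evaluates a baseline model on a synthetic imbalanced multiclass dataset with repeated stratified cross-validation; the dataset, metric, and model are stand-ins for illustration, not the E.coli data or the choices made in the tutorial.

# Illustrative sketch: test harness for imbalanced multiclass classification (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic 3-class dataset skewed heavily toward the first class
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.80, 0.15, 0.05], random_state=1)

# stratified splits preserve the class ratios in every fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
model = LogisticRegression(max_iter=1000)

# balanced accuracy is one metric that accounts for the skewed classes
scores = cross_val_score(model, X, y, scoring='balanced_accuracy', cv=cv)
print('Mean balanced accuracy: %.3f' % scores.mean())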
In this tutorial, you will discover how to develop and evaluate a model for the imbalanced multiclass E.coli dataset.
After completing this tutorial, you will know:
How to load and..

Continue Reading...