Machine Learning with Coffee
16 minutes | Mar 15, 2021
20 Perceptron: Machine Learning Begins
We introduce the concept of a perceptron as the basic component of a neural network. We talk about how important it is to understand the concept of backpropagation applied to a single neuron.
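The single-neuron idea from this episode can be sketched in a few lines of Python. This is an illustrative perceptron with the classic error-driven update rule, not code from the episode; the data (the AND function) and the learning rate are made up for the example:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Train a single perceptron with the classic update rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # Move the weights toward the target only when we misclassify
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

# Learn the AND function, a linearly separable problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
preds = [1 if xi @ w + b > 0 else 0 for xi in X]
```

The same update, viewed as a gradient step on a loss, is the one-neuron case of backpropagation the episode refers to.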
16 minutes | Jan 24, 2021
19 ICA: Independent Component Analysis
We discuss Independent Component Analysis, one of the most popular and robust techniques to decompose mixed signals. ICA has important applications in audio, video, and EEG processing, and in many datasets that exhibit very high multicollinearity.
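As a rough sketch of the signal-separation idea (not code from the episode), the snippet below mixes two artificial sources, whitens the mixture, and runs a minimal FastICA-style fixed-point iteration with the tanh contrast; the sources, mixing matrix, and iteration counts are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent sources: a sine wave and a square wave
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))].T       # shape (2, n)
X = np.array([[1.0, 0.5], [0.4, 1.0]]) @ S               # observed mixed signals

# Whiten: zero mean, identity covariance
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Xw = E @ np.diag(d ** -0.5) @ E.T @ X

# FastICA-style fixed-point updates, one component at a time (deflation)
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    for _ in range(200):
        wx = w @ Xw
        w_new = (Xw * np.tanh(wx)).mean(axis=1) - (1 - np.tanh(wx) ** 2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # decorrelate from earlier components
        w = w_new / np.linalg.norm(w_new)
    W[i] = w

S_hat = W @ Xw   # recovered sources, up to sign and order
```

ICA can only recover the sources up to permutation and sign, which is why the check below compares absolute correlations.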
21 minutes | Jan 10, 2021
18 PCA: Principal Component Analysis
We discuss Principal Component Analysis, one of the most popular techniques for reducing the dimensionality of a dataset. PCA helps us be more efficient in terms of the number of variables we feed to our machine learning models.
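A minimal numpy sketch of PCA via the SVD (illustrative, not from the episode): the toy data is 3-dimensional but lies close to a 2-dimensional plane, so two components capture nearly all the variance:

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)               # PCA requires centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / (s ** 2).sum()   # fraction of variance per component
    return Xc @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
# 3-D data that really lives near a 2-D plane, plus a little noise
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(500, 3))
scores, explained = pca(X, 2)
```

The `explained` vector is the usual guide for choosing how many components to keep.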
28 minutes | Dec 22, 2020
17 Anomaly Detection: Clustering
We present 3 clustering algorithms that help us detect anomalies: DBSCAN, Gaussian Mixture Models, and K-means. These 3 algorithms are popular and simple, but have stood the test of time. All of them have many variations that try to overcome some of the disadvantages of the original implementations.
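To illustrate one of the three approaches, here is a tiny K-means implementation used for anomaly detection: cluster the data, then flag points unusually far from their assigned centroid. The data, the deterministic initialization, and the 3-sigma distance rule are all assumptions made for this sketch:

```python
import numpy as np

def kmeans(X, k, centers, iters=50):
    """Plain K-means from given initial centers."""
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    return labels, centers

rng = np.random.default_rng(1)
c1 = rng.normal([0, 0], 0.5, size=(200, 2))      # normal cluster 1
c2 = rng.normal([5, 5], 0.5, size=(200, 2))      # normal cluster 2
X = np.vstack([c1, c2, [[10.0, -10.0]]])         # last point is an obvious anomaly

labels, centers = kmeans(X, 2, np.array([X[0], X[200]]))  # one seed per cluster
dist = np.linalg.norm(X - centers[labels], axis=1)
threshold = dist.mean() + 3 * dist.std()         # simple 3-sigma distance rule
anomalies = np.where(dist > threshold)[0]
```

DBSCAN and Gaussian Mixture Models replace the distance rule with density and likelihood criteria, respectively, but the flag-the-far-points intuition is the same.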
22 minutes | Oct 19, 2020
16 Anomaly Detection: Control Charts
Anomaly detection is not new; techniques have been around for decades. Control charts are graphs with solid mathematical and statistical foundations that monitor how a process changes over time. They implement control limits which automatically flag anomalies in a process in real time. Depending on the problem at hand, control charts might be a better alternative than more sophisticated machine learning approaches to anomaly detection.
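The control-limit idea can be sketched with nothing but the standard library: estimate the mean and standard deviation from an in-control baseline, set limits at 3 sigma, and flag any new observation outside them. The numbers below are made up for the example:

```python
import statistics

def control_limits(baseline, sigmas=3):
    """Shewhart-style control limits from an in-control baseline sample."""
    mu = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return mu - sigmas * sd, mu + sigmas * sd

def flag_anomalies(stream, lcl, ucl):
    """Indices of observations outside the control limits."""
    return [i for i, x in enumerate(stream) if x < lcl or x > ucl]

baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 10.0]
lcl, ucl = control_limits(baseline)
stream = [10.0, 9.9, 10.1, 12.5, 10.0]   # 12.5 is out of control
flagged = flag_anomalies(stream, lcl, ucl)
```

Real control charts add run rules (e.g. several consecutive points on one side of the center line), but the 3-sigma limit is the core mechanism.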
18 minutes | Sep 28, 2020
15 Adaboost: Adaptive Boosting
Adaboost is one of the classic machine learning algorithms. Just like Random Forest and XGBoost, Adaboost is an ensemble model; in other words, it aggregates the results of simpler classifiers to make robust predictions. The main difference is that Adaboost is adaptive: it learns from the instances misclassified by previous models, assigning greater weight to those errors and focusing its attention on them in the next round.
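The reweighting loop described above can be written compactly with decision stumps as the weak learners. This is an illustrative sketch on made-up 1-D data, not the episode's example; note how misclassified points are upweighted each round:

```python
import numpy as np

def stump_predict(x, t, p):
    """Predict p where x >= t, else -p."""
    return np.where(x >= t, p, -p)

def adaboost(x, y, rounds=5):
    n = len(x)
    w = np.full(n, 1 / n)
    ensemble = []
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error
        best = min(((t, p) for t in np.unique(x) for p in (-1, 1)),
                   key=lambda tp: w[stump_predict(x, *tp) != y].sum())
        err = w[stump_predict(x, *best) != y].sum()
        alpha = 0.5 * np.log((1 - err) / err)
        # Upweight the misclassified points for the next round
        w *= np.exp(-alpha * y * stump_predict(x, *best))
        w /= w.sum()
        ensemble.append((alpha, best))
    return ensemble

def predict(ensemble, x):
    F = sum(a * stump_predict(x, t, p) for a, (t, p) in ensemble)
    return np.sign(F)

# No single stump can classify this pattern, but a few boosted stumps can
x = np.arange(10)
y = np.array([-1, -1, -1, 1, 1, 1, 1, -1, -1, -1])
model = adaboost(x, y)
acc = (predict(model, x) == y).mean()
```

Each stump alone gets at best 70% of this data right; the weighted vote of five stumps classifies it perfectly.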
14 minutes | Jul 26, 2020
14 XGBoost: The Winner of Many Competitions
XGBoost is an open-source software library which has won several machine learning competitions on Kaggle. It is based on the principles of gradient boosting, which builds on ideas from Leo Breiman, the creator of Random Forest; the theory behind gradient boosting was later formalized by Jerome H. Friedman. Like Random Forest, gradient boosting combines weak learners. XGBoost is an engineering implementation which adds a clever penalization of trees and a proportional shrinking of leaf nodes.
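To make the gradient-boosting principle concrete, here is a bare-bones regression version with stumps: each round fits the residual (the negative gradient of squared loss) and adds a shrunken copy of the fit. This is a sketch of plain gradient boosting, not of the XGBoost library itself, which adds regularized tree scoring on top; the data and hyperparameters are illustrative:

```python
import numpy as np

def best_stump(x, r):
    """Regression stump: split minimizing squared error against residuals r."""
    best = None
    for t in np.unique(x)[1:]:
        left, right = r[x < t].mean(), r[x >= t].mean()
        sse = ((r[x < t] - left) ** 2).sum() + ((r[x >= t] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    _, t, left, right = best
    return lambda z: np.where(z < t, left, right)

def gradient_boost(x, y, rounds=200, lr=0.5):
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        residual = y - pred            # negative gradient of squared loss
        tree = best_stump(x, residual)
        pred = pred + lr * tree(x)     # shrinkage: the "learning rate"
    return pred

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)              # smooth, non-linear target
pred = gradient_boost(x, y)
rmse = np.sqrt(((y - pred) ** 2).mean())
```

The training error starts near 0.71 (the standard deviation of the sine wave) and is driven down round by round as the stumps chip away at the residual.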
23 minutes | Jul 12, 2020
13 Random Forest
Random Forest is one of the best off-the-shelf algorithms. In this episode we try to understand the intuition behind Random Forest and how it leverages the capabilities of Decision Trees by aggregating them with a very smart trick called “bagging”. Variable importance and out-of-bag error are two of its nicest capabilities, which allow us to find the most important predictors and to compute a good generalization error, respectively.
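The out-of-bag mechanism rests on a simple fact about bootstrap sampling, which this stdlib-only sketch demonstrates (the sample sizes are arbitrary): each bootstrap sample of size n, drawn with replacement, leaves out roughly a 1/e ≈ 36.8% fraction of the points, and those left-out points give each tree a free validation set:

```python
import random

random.seed(0)
n, trees = 1000, 200
oob_counts = [0] * n

for _ in range(trees):
    # A bootstrap sample: n draws with replacement
    in_bag = {random.randrange(n) for _ in range(n)}
    for i in range(n):
        if i not in in_bag:
            oob_counts[i] += 1   # point i is out-of-bag for this tree

# Each point is out-of-bag for roughly e^-1 ≈ 36.8% of the trees
avg_oob_fraction = sum(oob_counts) / (n * trees)
```

Averaging each tree's predictions only over its out-of-bag points is what yields the out-of-bag error estimate the episode describes.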
19 minutes | May 31, 2020
12 Decision Trees
We talk about Decision Trees, one of the most basic statistical learning algorithms that every Data Scientist should know. Decision Trees are one of the few machine learning models that are easy to interpret, which makes them a favorite when we need to understand the logic behind a decision. They naturally handle all types of variables without the need to create dummy variables, require no scaling or normalization, and are very robust against outliers.
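At the heart of tree building is a single operation: scanning one feature for the split that most reduces impurity. A minimal Gini-based sketch on made-up data (the values loosely mimic a petal-length feature, but are not from any real dataset):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(x, y):
    """Threshold on a single feature that minimizes weighted Gini impurity."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[1:]:
        left, right = y[x < t], y[x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 1.2, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
t, score = best_split(x, y)   # a clean split exists at 3.0
```

A full tree applies this search recursively over all features, which also explains why scaling is unnecessary: only the ordering of each feature matters.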
16 minutes | May 10, 2020
11 Inferential Statistics
We talk about the importance of inferential statistics in Data Science. Inferential statistics are a set of techniques used to make generalizations about a population from a sample. One of the tools used in inferential statistics is hypothesis testing. In this episode we provide a couple of examples on when and why to use 1-sample t-tests and 2-sample t-tests. We also argue that the mean or average of a sample means nothing if we do not also consider the variation of the data.
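The 1-sample t-test mentioned above fits in a few stdlib lines; the statistic divides the gap between the sample mean and the hypothesized mean by the standard error. The fill-weight numbers are invented for the example, and 2.36 is the standard 5% two-sided critical value for 7 degrees of freedom:

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for H0: the population mean equals mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    return (statistics.mean(sample) - mu0) / se

# Do these fill weights come from a process centered at 500 g?
weights = [502.1, 499.8, 501.5, 500.9, 502.3, 501.1, 500.4, 501.8]
t_stat = one_sample_t(weights, 500.0)   # ≈ 4.1, well beyond ±2.36 for 7 df
```

This also illustrates the episode's closing point: the same mean gap of about 1.2 g would not be significant if the sample's variation were a few times larger, since the standard error sits in the denominator.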
23 minutes | Apr 26, 2020
10 Logistic Regression
Logistic regression is a very robust machine learning technique which can be used in three modes: binary, multinomial, and ordinal. We talk about assumptions and some misconceptions. For example, some people believe that because logistic regression fits only a linear separator it cannot model a complex boundary in the original feature space, when in fact a linear separator in an expanded feature space can be highly non-linear in the original one. Also, people often use either linear regression or multinomial logistic regression when they should be using ordinal logistic regression.
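The expanded-space point can be shown directly: on made-up data where one class surrounds the other, plain logistic regression with squared features learns a linear separator in the expanded space that is a circle in the original one. The data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.2, epochs=3000):
    """Binary logistic regression fit by plain gradient descent."""
    Xb = np.c_[np.ones(len(X)), X]               # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)
inner = rng.normal(0, 0.5, size=(50, 2))         # class 0: blob at the origin
outer = np.c_[np.cos(theta), np.sin(theta)] * 2.5 + rng.normal(0, 0.2, size=(50, 2))
X = np.vstack([inner, outer])                    # class 1: ring around it
y = np.concatenate([np.zeros(50), np.ones(50)])

# Add squared terms: a linear separator in this expanded space
# is a circular (non-linear) boundary in the original space
Xq = np.c_[X, X ** 2]
w = fit_logistic(Xq, y)
acc = ((sigmoid(np.c_[np.ones(100), Xq] @ w) > 0.5) == y).mean()
```

Without the squared terms, no linear boundary could separate a blob from the ring around it.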
16 minutes | Apr 19, 2020
09 Regularization to Deal with Overfitting
In this episode we talk about regularization, an effective technique to deal with overfitting by reducing the variance of the model. Two techniques are introduced: ridge regression and the lasso. The latter is effectively also a feature selection algorithm.
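Ridge regression has a closed form, which makes the shrinkage effect easy to demonstrate on synthetic data (the coefficients and penalty values below are made up for the sketch; the lasso has no closed form and needs an iterative solver, so it is omitted here):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                       # center so no intercept is needed
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(0, 0.5, size=100)

w_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
w_reg = ridge(X, y, 50.0)    # a heavier penalty shrinks every coefficient
```

The penalty pulls the whole coefficient vector toward zero, trading a little bias for a reduction in variance; the lasso's L1 penalty goes further and sets some coefficients exactly to zero.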
22 minutes | Apr 4, 2020
08 Linear Regression: The Return of the Queen
In this episode I will try to convince you that Linear Regression is one of the most powerful Machine Learning algorithms. We will talk about common misconceptions, especially the belief that Linear Regression cannot model non-linear relationships. We also discuss how the myth of normality leads many people to completely discard Linear Regression on non-normal data, when in reality the normality assumption concerns the errors, not the data itself. Finally, I provide advice on how to check, and most importantly how to fix, any violated assumption in Linear Regression.
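The non-linearity misconception is easy to dispel with a sketch: "linear" means linear in the coefficients, so ordinary least squares on polynomial terms fits a curve. The quadratic ground truth below is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
# A curved ground truth plus noise: y = 1 + 2x - 0.5x^2 + eps
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(0, 0.3, size=200)

# Ordinary least squares on polynomial features: still a linear model
X = np.c_[np.ones_like(x), x, x ** 2]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ≈ [1.0, 2.0, -0.5]
```

The fitted coefficients recover the curved relationship, and the residuals `y - X @ beta` are what the normality assumption actually applies to.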
11 minutes | Mar 28, 2020
We talk about how Data Science and Machine Learning can help us better understand COVID-19 challenges. In this episode, we go back to the Kaggle website, where different institutions, including the White House, have come together to analyze more than 45,000 published articles. The task is to answer 10 different questions that will help scientists around the world better understand this new virus and future pandemics.
30 minutes | Mar 15, 2020
06 How to Become a Data Scientist
We talk about what it takes to become a Data Scientist. We also discuss 4 prerequisites before preparing yourself to become a Data Scientist. Finally, we provide recommendations on 3 online courses that, if mastered, will put you above 90% of all Data Scientists out there.
24 minutes | Mar 8, 2020
05 Machine Learning: Use Cases Part 2
We continue exploring publicly available datasets to better understand Machine Learning use cases and their applications. This time we explore Kaggle, the world's largest data science community. Unlike the UCI ML Repository, which is more of an archive geared towards an academic audience, Kaggle has datasets that capture the latest trends in Machine Learning and hosts competitions, sponsored by big companies, where data scientists can participate and win big prizes.
25 minutes | Feb 23, 2020
04 Machine Learning: Use Cases
We explore the different areas of application of Machine Learning and talk about use cases ranging from biology and finance to health care. We use the UCI Machine Learning Repository to learn about the most famous datasets in the data science community, discussing the problem each is trying to solve, the response or target it is trying to predict, as well as the predictors available to achieve this goal.
40 minutes | Feb 12, 2020
03 What is Machine Learning?
We define Machine Learning and other related areas such as artificial intelligence, business analytics, business intelligence, and Big Data. These are not academic definitions extracted from books; they are real-world concepts as I see them. We discuss similarities, differences, and overlap between these sometimes confusing terms, which people tend to misuse.
29 minutes | Feb 2, 2020
02 My Personal Journey: How I Became a Data Scientist
In this episode I talk about my personal journey and how I became a Data Scientist. I start by talking about how I decided to go to college, what major to choose, and how I chose my master's degree. I talk about my time studying for a PhD in Engineering and the most useful classes I took related to machine learning and data science. Finally, I briefly talk about my work experience as a Data Scientist.
15 minutes | Jan 26, 2020
01 Introduction and Expectations
In this, our first episode, we will define the objective of the show as well as expectations. The show is designed for anyone who is interested in this fascinating world of Machine Learning.
© Stitcher 2022