Are there situations where gradient descent bumps into trouble? Two.

The first arises in a loss function with at least one **local minimum**:

- a minimum in the loss function, but not the global minimum — a kind of “false gold”
- gradient descent gets caught in a well it can’t climb back out of
- There are potentially very many of these in a given loss function (see pic from https://www.cs.umd.edu/~tomg/projects/landscapes/)
- HOWEVER: these, in large part, might not be all that bad. …
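The "well it can't climb back out of" is easy to see on a toy function. Below is a minimal sketch (the function, starting point, and learning rate are my own invented choices): f(x) = x⁴ − 3x² + x has a local minimum near x ≈ 1.13 and a deeper global minimum near x ≈ −1.30, and plain gradient descent started on the right-hand slope settles into the local minimum and stays there.

```python
# f(x) = x**4 - 3*x**2 + x: local minimum near x ~ 1.13,
# global minimum near x ~ -1.30.

def grad(x):
    # derivative of f(x) = x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

x = 2.0    # start on the slope above the *local* minimum
lr = 0.01  # learning rate

for _ in range(10_000):
    x -= lr * grad(x)  # follow the negative gradient downhill

# x has converged to ~1.13 -- the "false gold" -- even though
# the global minimum sits at ~ -1.30.
print(x)
```

Started anywhere to the left of the local maximum near x ≈ 0.17, the same loop would instead find the global minimum; which well you end up in depends entirely on initialization.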

In previous posts, I’ve talked about the importance of building on fundamental computer science concepts for data scientists and bootcamp grads (and anyone working, or wanting to work, in tech). Those concepts form the ecosystem that so much of data science and web development sits on top of, and while you rarely “need” to know those fundamentals to achieve project goals, it has been said that knowing them, and being able to leverage them to write better code, is the difference between average and exceptional coding.

“There are 2 types of software engineer: those who understand computer science well…

I’ve recently written on the importance of buffing up on computer science fundamentals for anyone working in data science (it is, after all, the ecosystem the whole of AI, machine learning, and data analytics is predicated on) and have myself been working through some of the coursework from the OSSU curriculum. I just finished [course name elided], UBC’s [course name elided], a module of which covered binary search trees (BSTs). …

If you’ve played around with language data before (classifying news articles by category, predicting a paragraph’s author, classifying tweets by sentiment), you’re already familiar with some of the ways words (or sometimes, characters) get turned into numbers, which can then be fed into some sort of classifier.

A simple solution would be a kind of **one-hot encoding, or bag of words,** of the lexicon. If there are a total of 10,000 words in the documents being classified, then each document gets a 10,000-dimensional vector, with each dimension a count of the number of times a given word appears in…
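The count-vector idea above can be sketched in a few lines of plain Python (the two toy documents and the tiny vocabulary are invented for illustration; a real lexicon would have thousands of dimensions):

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# The lexicon: one dimension per unique word across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return a count vector of length len(vocab):
    each dimension counts one word's occurrences in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
print(vectors[0])  # "the" appears twice in the first document
```

These vectors can then go straight into any off-the-shelf classifier; the obvious drawback, hinted at by the 10,000-dimension figure, is that they are long, sparse, and blind to word order.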

I recently finished Flatiron School’s bootcamp in data science, and thinking back on where I was in early 2020, it’s amazing how much stuff has been crammed into my head in the space of a year. Web scraping, data visualization, A/B testing, linear algebra, vector calculus, regression, SVMs, random forests, XGBoost, Docker, AWS, Tableau, and so many different ways that neural networks can be put together. It has been an education with both breadth and depth *within the field of data science.*

There’s kind of a caveat there: as much as I’ve learned about data science in a year of…

A.k.a. the neural network acronym post, this is in fact an announcement for a series of four articles to be published, each covering one of the four major types of modern neural network: **unsupervised pretrained networks**, including autoencoders and generative adversarial networks (GANs); **convolutional neural networks** (CNNs); **recurrent neural networks** (RNNs), including long short-term memory (LSTM) and gated recurrent unit (GRU) models; and **recursive neural networks**. Each of the following four posts, as well as this one, is in no small part a self-study of *Deep Learning: A Practitioner’s Approach* by Patterson and Gibson, and as such, each of…

Somehow, in years of schooling, I’d never heard of beta distributions until I stumbled onto them by accident over at David Robinson’s blog, but they’ve quickly become one of my favorite distributions. I mean, just look at that thing wiggling off to the left there. They’re an incredibly flexible distribution, but that’s not what makes them really cool. It’s the ease with which you can update your predictions that makes beta distributions stand out.

Beta distributions are described by two parameters, alpha and beta (you can see how changing these values affects the shape of the distribution above), and while…

Statistics and probability lie at the heart of so very, very many applied sciences today, not least amongst them data science, but it can be a frighteningly jargon-y field; after all, it is, at *its* heart, math. What’s the difference between probability, likelihood, and statistics? What’s Bayes’ theorem all about? What exactly is a prior and a posterior?
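To make those questions concrete, here is the classic medical-test example of Bayes’ theorem, P(A|B) = P(B|A)·P(A) / P(B), worked through in a few lines (the rates are invented round numbers, chosen only to make the arithmetic easy to follow):

```python
# Prior: 1% of the population has the disease.
p_disease = 0.01

# Likelihoods: how the test behaves given each state of the world.
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive test (the evidence, P(B)):
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem: P(disease | positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # only ~16%, despite the "95% accurate" test
```

The punchline is the gap between likelihood and posterior: the test is 95% sensitive, yet a positive result only raises the probability of disease to about 16%, because the 1% prior dominates.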

A good introduction to probability is Khan Academy, which can get you started with permutations and combinations, random variables, and even basic set theory. I’d call this tier 1 foundational knowledge, and while important, you hit the ceiling on Khan Academy pretty…

If you’ve worked at all with real data, you’ve probably already had to handle cases of missing data. (I wonder what the probability of missing data is in any given natural dataset. I suspect about as close to certainty as you can get.)

We should be suspicious of any dataset (large or small) which appears perfect.

— David J. Hand

How did you handle it? Row-wise deletion? Column-wise deletion? Imputation? What did you impute? If continuous, did you use the mean, baseline, or a KNN-derived value? …
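Two of the simplest strategies above, row-wise deletion and mean imputation, can be sketched on an invented toy column (None marks a missing value; real work would likely use pandas or scikit-learn rather than hand-rolled lists):

```python
# A continuous column with gaps.
values = [3.0, None, 5.0, 4.0, None, 6.0]

# Row-wise deletion: simply drop the incomplete rows.
observed = [v for v in values if v is not None]

# Mean imputation: fill each gap with the mean of the observed values.
mean = sum(observed) / len(observed)
imputed = [mean if v is None else v for v in values]

print(observed)  # loses a third of the rows
print(imputed)   # keeps all rows, but shrinks the column's variance
```

Each choice has a cost: deletion throws away whole rows (and any signal in their other columns), while mean imputation preserves the rows but artificially reduces variance, which is exactly why fancier options like KNN-derived values exist.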