Having published a handy primer for marketers called “Understand Data Science Basics for Marketing” (clients enjoy here) as a cross to the thunderous jab of my previous “Understand Big Data Basics for Marketing” (clients enjoy here), I thought I’d provide a short vocabulary builder for the people.
Herewith are 14 common data science terms you’re no doubt hearing more and more often in the digital halls. Use them with care 🙂
- Clustering: Clustering describes a family of techniques that attempt to divide data into groups that are relatively compact (members are close together) and relatively distant from other groups. K-means is an iterative method that divides data into a pre-determined number (K) of clusters.
- Decision Trees: The most intuitive model type, decision trees build a series of branches from a root node, splitting nodes into branches based on the “purity” of the resulting branches (i.e., how well the split divides data into different classes).
- Deep Learning: Deep learning is an evolution of neural networks, stacking many layers of nodes into complex and nuanced models that are particularly useful for visual analysis and intelligent optimization.
- Dimensionality Reduction: These methods use linear algebra to find correlations among variables and compress them into a smaller set of combined variables that retains most of the information. Methods include principal component analysis (PCA) and singular value decomposition (SVD).
- Graph Analysis: Graph analysis is used to analyze networks (e.g., social networks) where data points include not only a numerical or categorical value but also “edges” that are connected to other data points.
- Ensemble Methods: These methods reduce the variance of individual models by combining a number of them and averaging their predictions. Many dozens (or hundreds) of decision trees can be combined into a random forest, which adds randomness by training each tree on a random sample of the data and considering only a random subset of variables at each node, so the trees make different errors that tend to cancel out.
- Logistic Regression: Although a regression model, logistic regression is used as a classifier. It maps a weighted combination of the independent variables onto the interval between 0 and 1, so the output can be read as the probability that an observation belongs to a particular class.
- Naive Bayes: Bayesian inference focuses on conditional probability, or the likelihood that something is true given that something else is true. It is based on Bayes’ Theorem, which dates to the 18th century and has proved strangely useful in advertising technology. The “naive” part is the simplifying assumption that the input variables are independent of one another within each class.
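- Neural Networks: These are “black box” models built from layers of interconnected nodes. Training repeatedly feeds prediction errors back through the network to adjust the connection weights (backpropagation), producing very detailed representations of systems.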
- Regression: Linear regression models estimate the impact of a number of independent variables on a dependent variable (e.g., sales) by seeking a “best fit” that minimizes squared error. Other regression models extend or combine linear models for more complex scenarios.
- Similarity Measures: Clustering makes use of similarity measures to determine how close data points are to each other. The shorter the distance, the greater the similarity. Common measures are Euclidean distance, cosine similarity and Pearson’s correlation.
- Supervised Learning: One of the two basic types of machine learning, supervised learning uses labeled data to build models that make predictions. “Labeled” means that the data represents something in the past for which the outcome is known (e.g., item purchased).
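- Support Vector Machines: Despite their odd name, support vector machines are just a way to find a decision boundary between classes that maximizes the margin, i.e., the width of the empty band separating the boundary from the nearest data points of each class. The boundary is a straight line or plane by default and can be made non-linear with a technique called the kernel trick.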
- Unsupervised Learning: The other of the two basic types of machine learning, unsupervised learning applies methods to unlabeled data to identify structure. Examples include clustering and dimensionality reduction.
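
For the curious, the K-means idea from the Clustering entry can be sketched in a few lines of Python. This is a toy illustration, not how a production tool does it: the function name `kmeans`, the six sample points and the choice to start from the first K points are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Toy K-means: alternate between assigning points and moving centroids."""
    # Assumption: start from the first k points (real tools choose smarter starts).
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the average of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

# Hypothetical data: two obvious groups of customers on two made-up metrics.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

The two steps inside the loop are the “iterative” part: assign, move, repeat until the groups stop changing (here the loop simply runs a fixed number of times).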
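
The Logistic Regression entry is easier to picture with numbers. Below is a minimal sketch of the scoring step only (not the training), assuming a model that has already been fitted. The weights, intercept and visitor values are invented for illustration, as are the names `sigmoid` and `purchase_probability`.

```python
import math

def sigmoid(score):
    """Squash any real number into the interval between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def purchase_probability(features, weights, intercept):
    """Weighted sum of the inputs, mapped to a probability."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical fitted model: weights for [site visits, days since last visit].
weights = [0.8, -0.5]
intercept = -1.0

# Hypothetical visitor: 3 site visits, last seen 1 day ago.
visitor = [3.0, 1.0]

prob = purchase_probability(visitor, weights, intercept)
# Turn the probability into a class with a 0.5 cutoff (the cutoff is a choice).
predicted_class = 1 if prob >= 0.5 else 0
print(round(prob, 3), predicted_class)
```

The cutoff is what turns a regression output into a classifier. A marketer might reasonably pick something other than 0.5 depending on the cost of a wrong call.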
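
To make the conditional probability in the Naive Bayes entry concrete, here is a small worked example of Bayes’ Theorem itself. Every number in it is made up for illustration (the share of buyers, the click rates), and the variable names are hypothetical.

```python
# Hypothetical inputs, invented for illustration only.
p_buyer = 0.02                # share of all visitors who end up buying
p_click_given_buyer = 0.50    # buyers who clicked the ad
p_click_given_other = 0.05    # non-buyers who clicked the ad

# Overall chance of a click, across buyers and non-buyers.
p_click = (p_click_given_buyer * p_buyer
           + p_click_given_other * (1 - p_buyer))

# Bayes' Theorem: P(buyer | click) = P(click | buyer) * P(buyer) / P(click)
p_buyer_given_click = p_click_given_buyer * p_buyer / p_click

print(round(p_buyer_given_click, 3))
```

Under these assumed numbers, seeing a click raises the estimated chance that a visitor is a buyer from 2% to roughly 17%. That kind of update is why the theorem shows up in ad targeting.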
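
The three measures named under Similarity Measures can also be written out directly. This is a plain-Python sketch with hypothetical function names and two invented customers. In practice a library would do this for you.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Compares direction only, ignoring magnitude: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Cosine similarity after subtracting each list's own average."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)

# Hypothetical customers: monthly spend in three product categories.
customer_a = [10.0, 20.0, 30.0]
customer_b = [20.0, 40.0, 60.0]  # same mix as customer_a, twice the budget

distance = euclidean_distance(customer_a, customer_b)
cosine = cosine_similarity(customer_a, customer_b)
pearson = pearson_correlation(customer_a, customer_b)
print(distance, cosine, pearson)
```

The two invented customers buy the same mix at different budgets, so Euclidean distance treats them as far apart while cosine and Pearson treat them as essentially identical. Which measure is “right” depends on whether spend level or spend pattern matters for the question at hand.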