Words of the Week – Inference and Confidence

An often-overlooked basic part of learning new things is vocabulary: if you don’t fully understand the meaning of terms, you are handicapped. Worse, if you think you do understand, but that understanding is wrong, you are deprived of the ability to identify the gap in…

Comments Off on Words of the Week – Inference and Confidence

Word of the Week – Ruin Theory

The classic Gambler’s Ruin puzzle has an actuarial parallel:  “Ruin Theory,” the calculations that govern what an insurance company should charge in premiums to reduce the probability of “ruin” for a given insurance line.  “Ruin” means encountering claims that exhaust initial reserves plus accumulated premiums. …
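The premium calculation can be sketched with a small Monte Carlo simulation (all numbers here - reserve, claim probability, claim size - are hypothetical, chosen only for illustration, not actuarial guidance):

```python
import random

def ruin_probability(reserve, premium, p_claim, claim_size, periods, trials, seed=0):
    # Estimate the probability that claims ever exhaust initial reserves
    # plus accumulated premiums, over a fixed horizon.
    rng = random.Random(seed)
    ruined = 0
    for _ in range(trials):
        surplus = reserve
        for _ in range(periods):
            surplus += premium          # premium collected this period
            if rng.random() < p_claim:  # a claim arrives
                surplus -= claim_size
            if surplus < 0:             # ruin: reserves exhausted
                ruined += 1
                break
    return ruined / trials

# Raising the premium lowers the estimated probability of ruin:
under = ruin_probability(reserve=10, premium=1.0, p_claim=0.1, claim_size=12,
                         periods=100, trials=2000)
adequate = ruin_probability(reserve=10, premium=2.0, p_claim=0.1, claim_size=12,
                            periods=100, trials=2000)
```

With an expected claim cost of 0.1 × 12 = 1.2 per period, a premium of 1.0 runs a negative drift (frequent ruin), while 2.0 does not.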

Comments Off on Word of the Week – Ruin Theory

Word of the Week:  Bias

In this feature, we sometimes highlight terms that can have different meanings to different parts of the data science community, or in different contexts. Today’s term is “bias.” To the lay person, and to those worried about the ethical problems sometimes posed by the deployment…

Comments Off on Word of the Week:  Bias

Word of the Week – Entity Extraction

In Natural Language Processing (our course on the subject starts Jan 15), entity extraction is the process of labeling chunks of text as entities (e.g. people or organizations).  Consider this phrase from the blog on close elections linked above:   “the tie was not between Jefferson…

Comments Off on Word of the Week – Entity Extraction

Type III Error

Type I error in statistical analysis is incorrectly rejecting the null hypothesis - being fooled by random chance into thinking something interesting is happening.  The arcane machinery of statistical inference - significance testing and confidence intervals - was erected to avoid Type I error.  Type II error…

Comments Off on Type III Error

Relative Risk Ratio and Odds Ratio

The Relative Risk Ratio and Odds Ratio are both used to measure the medical effect of a treatment or variable to which people are exposed. The effect could be beneficial (from a therapy) or harmful (from a hazard).  Risk is the number of those having…
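With hypothetical counts from a 2x2 table - say 20 of 100 exposed people and 10 of 100 unexposed people experience the outcome - the two measures can be computed directly:

```python
def relative_risk(events_exposed, n_exposed, events_control, n_control):
    # Risk = number with the outcome / number in the group
    risk_exposed = events_exposed / n_exposed
    risk_control = events_control / n_control
    return risk_exposed / risk_control

def odds_ratio(events_exposed, n_exposed, events_control, n_control):
    # Odds = events / non-events within each group
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_control = events_control / (n_control - events_control)
    return odds_exposed / odds_control

rr = relative_risk(20, 100, 10, 100)   # 0.20 / 0.10 = 2.0
oratio = odds_ratio(20, 100, 10, 100)  # (20/80) / (10/90) = 2.25
```

Note the two measures diverge as the outcome becomes common; for rare outcomes they are close.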

Comments Off on Relative Risk Ratio and Odds Ratio

Endpoint or Outcome (example: Covid-19 vaccine)

In a randomized experiment, the endpoint or outcome is a formal measure (statistic) of the result of the experiment.  In a randomized clinical trial preparatory to regulatory submission, there is often more than one outcome, due to the time and expense involved in conducting a…

Comments Off on Endpoint or Outcome (example: Covid-19 vaccine)

Link Function

In generalized linear models, a link function maps a nonlinear relationship to a linear one so that a linear model can be fit (and then mapped to the original form).  For example, in logistic regression, we want to find the probability of success:  P(Y =…
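A minimal sketch of the logit link used in logistic regression: it maps a probability in (0, 1) to the whole real line, where a linear model can be fit, and its inverse (the logistic function) maps predictions back to probabilities:

```python
import math

def logit(p):
    # Link function: probability in (0, 1) -> real line
    return math.log(p / (1 - p))

def inv_logit(z):
    # Inverse link (logistic function): real line -> probability
    return 1 / (1 + math.exp(-z))

# Fitting happens on the logit scale; mapping back recovers the probability:
round_trip = inv_logit(logit(0.8))   # 0.8
```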

Comments Off on Link Function

Model Interpretability

Model interpretability refers to the ability for a human to understand and articulate the relationship between a model’s predictors and its outcome.  For linear models, including linear and logistic regression, these relationships are seen directly in the model coefficients.  For black-box models like neural nets,…

Comments Off on Model Interpretability


Polytomous

Polytomous, applied to variables (usually outcome variables), means multi-category (i.e. more than two categories).  Synonym:  multinomial.

Comments Off on Polytomous

Bayesian Statistics

Bayesian statistics provides probability estimates of the true state of the world. An unremarkable statement, you might think - what else would statistics be for? But classical frequentist statistics, strictly speaking, only provide estimates of the state of a hothouse world, estimates that must be translated…

Comments Off on Bayesian Statistics


Density

As Covid-19 continues to spread, so will research on its behavior.  Models that rely mainly on time-series data will expand to cover other relevant predictors (covariates), and one such predictor will be gregariousness.  How to measure it?  In psychology there is the standard personality trait…

Comments Off on Density


Parameterized

Parameterized code in computer programs (or visualizations or spreadsheets) is code where the arguments being operated on are defined once as a parameter, at the beginning, so they do not have to be repeatedly explicitly defined in the body of the code.  This allows for…

Comments Off on Parameterized

Sensitivity and Specificity

We defined these terms already (see this blog), but how can you remember which is which, so you don’t have to look them up?  If you can remember the order in which to recite them - sensitivity then specificity - it’s easy.  Think “positive and negative”…
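As a concrete reminder of the definitions, here is a sketch with hypothetical confusion-matrix counts (90 true positives, 10 false negatives, 80 true negatives, 20 false positives):

```python
def sensitivity(tp, fn):
    # Proportion of actual positives correctly detected
    return tp / (tp + fn)

def specificity(tn, fp):
    # Proportion of actual negatives correctly identified
    return tn / (tn + fp)

sens = sensitivity(tp=90, fn=10)   # 0.9
spec = specificity(tn=80, fp=20)   # 0.8
```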

Comments Off on Sensitivity and Specificity

Decision Stumps

A decision stump is a decision tree with just one decision, leading to two or more leaves. For example, in this decision stump a borrower score of 0.475 or greater leads to a classification of “loan will default” while a borrower score less than 0.475…

Comments Off on Decision Stumps

R0 (R-nought)

For infectious diseases, R0 (R-nought) is the unimpeded replication rate of the disease pathogen in a naive (not immune) population.  An R0 of 2 means that each person with the disease infects two others.  Some things to keep in mind:    An R0 of one means…

Comments Off on R0 (R-nought)


Hazard

In biostatistics, hazard, or the hazard rate, is the instantaneous rate of an event (death, failure…).  It is the probability of the event occurring in a (vanishingly) small period of time, divided by the amount of time (mathematically it is the limit of this quantity…

Comments Off on Hazard

Standardized Death Rate

Often the death rate for a disease is fully known only for a group where the disease has been well studied.  For example, the 3711 passengers on the Diamond Princess cruise ship are, to date, the most fully studied coronavirus population.  All passengers were tested…


Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms to predictor coefficients to discourage complex models that would otherwise overfit the data.  An example is ridge regression.

Comments Off on Regularized Model

Ridge Regression

Ridge regression is a method of penalizing coefficients in a regression model to force a simpler model (one with smaller, more stable coefficients) than would be produced by an ordinary least squares model. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities…
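A minimal sketch of the shrinkage idea, for a single predictor with no intercept (hypothetical data): the ridge solution has the closed form b = Σxy / (Σx² + λ), so λ = 0 recovers ordinary least squares and larger λ shrinks the coefficient toward zero.

```python
def ridge_slope(x, y, lam):
    # One-predictor ridge (no intercept): minimizes sum((y - b*x)^2) + lam*b^2.
    # Closed form: b = sum(x*y) / (sum(x^2) + lam).
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
ols = ridge_slope(x, y, lam=0.0)      # ordinary least squares slope
shrunk = ridge_slope(x, y, lam=5.0)   # penalty shrinks the slope toward 0
```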

Comments Off on Ridge Regression


Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict.   In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical - a factor variable is the same thing as a categorical variable.  These factor variables…

Comments Off on Factor


Purity

In classification, purity measures the extent to which a group of records share the same class.  It is also termed class purity or homogeneity, and sometimes impurity is measured instead.  The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where…
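Using the entry's two-class formula p(1-p), impurity can be computed directly from a group's labels - it is 0 for a perfectly pure group and maximal (0.25) for a 50/50 split:

```python
def gini_impurity(labels):
    # Two-class Gini impurity p(1 - p), where p is the share of class "1"
    p = sum(labels) / len(labels)
    return p * (1 - p)

pure = gini_impurity([1, 1, 1, 1])    # 0.0  (all one class)
mixed = gini_impurity([0, 1, 0, 1])   # 0.25 (50/50 split)
```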

Comments Off on Purity

Predictor P-Values in Predictive Modeling

Not so useful:  Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value - they measure the probability that a model fit to randomly shuffled data could have produced a coefficient as great as the fitted value.  They are of limited…

Comments Off on Predictor P-Values in Predictive Modeling

ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model.  It matters which one you use. The simplest is accuracy - the proportion of cases correctly classified.  In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric…

Comments Off on ROC, Lift and Gains Curves

Kernel function

In a standard linear regression, a model is fit to a set of data (the training data); the same linear model applies to all the data.  In local regression methods, multiple models are fit to different neighborhoods of the data. A kernel function is used…

Comments Off on Kernel function

Errors and Loss

Errors - differences between predicted values and actual values, also called residuals - are a key part of statistical models.  They form the raw material for various metrics of predictive model performance (accuracy, precision, recall, lift, etc.), and also the basis for diagnostics on descriptive…

Comments Off on Errors and Loss

Latin hypercube

In Monte Carlo sampling for simulation problems, random values are generated from a probability distribution deemed appropriate for a given scenario (uniform, Poisson, exponential, etc.).  In simple random sampling, each potential random value within the probability distribution has an equal chance of being selected. Just…
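A one-dimensional sketch of the Latin hypercube idea: divide the unit interval into n equal strata, draw one uniform value from each stratum (guaranteeing coverage of the whole range), then shuffle the draws into random order.

```python
import random

def latin_hypercube_1d(n, seed=0):
    # One stratified draw per interval [i/n, (i+1)/n), then shuffled
    rng = random.Random(seed)
    draws = [(i + rng.random()) / n for i in range(n)]
    rng.shuffle(draws)
    return draws

draws = latin_hypercube_1d(10)
# Exactly one draw lands in each interval [i/10, (i+1)/10).
```

In higher dimensions, each variable's strata are covered exactly once, with the pairings randomized.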

Comments Off on Latin hypercube


Regularize

The art of statistics and data science lies, in part, in taking a real-world problem and converting it into a well-defined quantitative problem amenable to useful solution. At the technical end of things lies regularization. In data science this involves various methods of simplifying models,…

Comments Off on Regularize

Intervals (confidence, prediction and tolerance)

All students of statistics encounter confidence intervals.  Confidence intervals tell you, roughly, the interval within which you can be, say, 95% confident that the true value of the quantity being estimated lies.  This is not the precise technical definition, but it is how people use the…

Comments Off on Intervals (confidence, prediction and tolerance)

Lift, Uplift, Gains

There are various metrics for assessing how well a model does, and one favored by marketers is lift, which is particularly relevant for the portion of the records predicted to be most profitable, most likely to buy, etc. 

Comments Off on Lift, Uplift, Gains


Probability

You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down.

Comments Off on Probability


Density

Density is a metric that describes how well-connected a network is.

Comments Off on Density


Algorithms

We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years.  Take a look at the results.

Comments Off on Algorithms

Gittins Index

Consider the multi-armed bandit problem where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x times the first, where x < 1).  The Gittins index is [...]

Comments Off on Gittins Index

Cold Start Problem

There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history -

Comments Off on Cold Start Problem


Autoregressive

Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.

Comments Off on Autoregressive


Tensor

A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).

Comments Off on Tensor

Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing  synonyms, like these:

Comments Off on Confusing Terms in Data Science – A Look at Synonyms

Confusing Terms in Data Science – A Look at Homonyms and more

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these: 



Comments Off on Confusing Terms in Data Science – A Look at Homonyms and more

Jaccard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the "yacht owner" problem.
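The Jaccard coefficient addresses the "yacht owner" problem by ignoring joint absences: two people who both lack a rare attribute are not thereby similar. A minimal sketch for binary (0/1) records:

```python
def jaccard(a, b):
    # Matches on 1s / attributes present in either record; (0, 0) pairs ignored
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0

a = [1, 1, 0, 0, 0]
b = [1, 0, 1, 0, 0]
sim = jaccard(a, b)   # 1 shared attribute / 3 present in either = 1/3
```

The two trailing (0, 0) pairs - the shared "non-yacht-ownership" - do not inflate the similarity.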

Comments Off on Jaccard’s coefficient

Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

Comments Off on Rectangular data

Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations; here are just a few:

Comments Off on Selection Bias

Likert Scale

A "Likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

  • How many gradients to allow, and

  • Whether to include a neutral midpoint

Comments Off on Likert Scale

Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
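Using the insurance example from this entry, the three dummy variables can be derived mechanically - one 0/1 column per category:

```python
def dummy_variables(values, categories):
    # One row of 0/1 indicators per case, one column per category
    return [[1 if v == c else 0 for c in categories] for v in values]

policies = ["residential", "commercial", "automotive", "residential"]
dummies = dummy_variables(policies, ["residential", "commercial", "automotive"])
# dummies[0] == [1, 0, 0]; dummies[2] == [0, 0, 1]
```

Each row has exactly one 1; in regression, one column is usually dropped to avoid redundancy.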

Comments Off on Dummy Variable


Curbstoning

Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed…

Comments Off on Curbstoning

Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …

Comments Off on Snowball Sampling

Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
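The answer follows from Bayes' rule, using the numbers in the question (10% fraud prevalence, 90% detection of fraud, and a 20% false-alarm rate on legitimate claims):

```python
p_fraud = 0.10               # prior: 1 in 10 claims is fraudulent
p_flag_given_fraud = 0.90    # system detects 90% of fraudulent claims
p_flag_given_legit = 0.20    # but mislabels 1 in 5 legitimate claims

# Total probability a claim is flagged, then Bayes' rule:
p_flag = p_fraud * p_flag_given_fraud + (1 - p_fraud) * p_flag_given_legit
p_fraud_given_flag = p_fraud * p_flag_given_fraud / p_flag   # 0.09 / 0.27 = 1/3
```

So only one in three flagged claims is actually fraudulent - the low base rate dominates.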

Comments Off on Conditional Probability Word of the Week


Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

Comments Off on Churn

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
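Each point on the curve corresponds to one classification threshold. A sketch with hypothetical scores and labels - classify a record as "1" when its score meets the threshold, then compute the false positive rate and true positive rate:

```python
def roc_point(scores, labels, threshold):
    # Returns (false positive rate, true positive rate) at one threshold -
    # a single point on the ROC curve.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
fpr, tpr = roc_point(scores, labels, threshold=0.5)   # (1/3, 2/3)
```

Sweeping the threshold from high to low traces the full curve from (0, 0) to (1, 1).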


Comments Off on ROC Curve

Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…

Comments Off on Prospective vs. Retrospective

“out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating," the repeated drawing of bootstrap samples from a dataset and aggregation of the results of statistical models applied to those samples. (A bootstrap sample is a resample drawn with replacement.)
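Because a bootstrap sample is drawn with replacement, some records are drawn more than once and others not at all; the records left out are "out of bag." A small simulation shows the out-of-bag fraction settling near (1 - 1/n)^n, which approaches 1/e, about 0.368:

```python
import random

def out_of_bag_fraction(n, trials=200, seed=1):
    # Draw bootstrap samples of size n and average the fraction of
    # original records that never appear in the sample.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        in_bag = {rng.randrange(n) for _ in range(n)}  # distinct records drawn
        total += 1 - len(in_bag) / n
    return total / trials

frac = out_of_bag_fraction(1000)   # close to 1/e, about 0.368
```

These out-of-bag records provide a built-in holdout set for estimating model error.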

Comments Off on “out-of-bag,” as in “out-of-bag error”


BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.

Comments Off on BOOTSTRAP

Same thing, different terms..

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.


Comments Off on Same thing, different terms..


BENFORD’S LAW

Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.

Comments Off on BENFORD’S LAW


Contingency Tables

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.



Hyperparameter

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.



SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

Comments Off on SAMPLE



SPLINE

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.

Comments Off on SPLINE


NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

Comments Off on NLP


OVERFIT

As applied to statistical models, "overfit" means the model fits the training data too closely - fitting noise, not signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:

Comments Off on OVERFIT

Week #24 – Logit

Logit is a nonlinear function of probability. If p is the probability of an event, then the corresponding logit is given by the formula:  logit(p) = log[p / (1 - p)].  Logit is widely used to construct statistical models, for example in logistic regression.

Comments Off on Week #24 – Logit

Week #23 – Intraobserver Reliability

Intraobserver reliability indicates how stable the responses obtained from the same respondent at different time points are. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument. The correlation coefficient between the responses obtained at different time points from the same respondent is often…

Comments Off on Week #23 – Intraobserver Reliability

Week #22 – Independent Events

Two events A and B are said to be independent if P(A ∩ B) = P(A)·P(B). To put it differently, events A and B are independent if the occurrence or non-occurrence of A does not influence the occurrence or non-occurrence of B, and vice versa. For example, if…

Comments Off on Week #22 – Independent Events

Week #21 – Residuals

Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis. 

Comments Off on Week #21 – Residuals

Week #20 – Concurrent Validity

The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard". The concurrent validity…

Comments Off on Week #20 – Concurrent Validity

Week #19 – Normality

Normality is a property of a random variable that is distributed according to the normal distribution. Normality plays a central role in both theoretical and practical statistics: a great number of theoretical statistical methods rest on the assumption that the data, or test statistics derived from…

Comments Off on Week #19 – Normality

Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

Comments Off on Week #18 – n

Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

Comments Off on Week #17 – Corpus

Week #16 – Weighted Kappa

Weighted kappa is a measure of agreement for categorical data.  It is a generalization of the kappa statistic to situations in which the categories are not equal in some respect - that is, weighted by an objective or subjective function.

Comments Off on Week #16 – Weighted Kappa

Week #15 – Rank Correlation Coefficient

Rank correlation is a method of finding the degree of association between two variables. The calculation for the rank correlation coefficient is the same as that for the Pearson correlation coefficient, but it is calculated using the ranks of the observations rather than their numerical values. This…

Comments Off on Week #15 – Rank Correlation Coefficient

Week #14 – Manifest Variable

In latent variable models, a manifest variable (or indicator) is an observable variable - i.e. a variable that can be measured directly. A manifest variable can be continuous or categorical. The opposite concept is the latent variable.

Comments Off on Week #14 – Manifest Variable

Week #13 – Fisher's Exact Test

Fisher's exact test is the first (historically) permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. proportion of patients recovered without complications…

Comments Off on Week #13 – Fisher's Exact Test

Week #11 – Posterior Probability

Posterior probability is a revised probability that takes into account new available information. For example, let there be two urns, urn A having 5 black balls and 10 red balls and urn B having 10 black balls and 5 red balls. Now if an urn…

Comments Off on Week #11 – Posterior Probability

Week #4 – Loss Function

A loss function specifies a penalty for an incorrect estimate from a statistical model. Typical loss functions might specify the penalty as a function of the difference between the estimate and the true value, or simply as a binary value depending on whether the estimate…

Comments Off on Week #4 – Loss Function

Week #3 – Endogenous Variable:

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model. In other words, endogenous variables have explicit causes within the model. The concept of endogenous variable is fundamental in path analysis and structural equation…

Comments Off on Week #3 – Endogenous Variable:

Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…

Comments Off on Week #2 – Causal Modeling

Week #1 – Nonstationary time series

A time series x_t is said to be nonstationary if its statistical properties depend on time. The opposite concept is a stationary time series. Most real-world time series are nonstationary. An example of a nonstationary time series is a record of readings of the…

Comments Off on Week #1 – Nonstationary time series

Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…

Comments Off on Week #10 – Arm

Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…

Comments Off on Week #9 – Sparse Matrix

Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…

Comments Off on Week #8 – Homonyms department: Sample

Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…
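In the statistics/machine-learning sense given here (subtract the mean, divide by the standard deviation - often called standardization), normalization is a one-liner; a sketch using the population standard deviation:

```python
def normalize(xs):
    # Subtract the mean and divide by the (population) standard deviation
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

z = normalize([2.0, 4.0, 6.0, 8.0])
# The result has mean 0 and standard deviation 1.
```

In database circles, by contrast, "normalization" means something entirely different (restructuring tables) - hence its place in this homonyms series.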

Comments Off on Week #7 – Homonyms department: Normalization

Week #6 – Kolmogorov-Smirnov One-sample Test

The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test, and tests whether an observed dataset is consistent with an hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution which would occur given the theoretical distribution and comparing that with the observed cumulative frequency distribution.

Comments Off on Week #6 – Kolmogorov-Smirnov One-sample Test

Week #5 – Cohort Data

Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.

Comments Off on Week #5 – Cohort Data

Week #50 – Six-Sigma

Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any…

Comments Off on Week #50 – Six-Sigma

Week #47 – Psychometrics

Psychometrics or psychological testing is concerned with quantification (measurement) of human characteristics, behavior, performance, health, etc., as well as with design and analysis of studies based on such measurements. An example of the problems being solved in psychometrics is the measurement of intelligence via "IQ"…

Comments Off on Week #47 – Psychometrics

Week #46 – Azure ML

Azure is the Microsoft Cloud Computing Platform and Services.  ML stands for Machine Learning, and is one of the services.  Like other cloud computing services, you purchase it on a metered basis - as of 2015, there was a per-prediction charge, and a compute time…

Comments Off on Week #46 – Azure ML

Week #45 – Ordered categorical data

Categorical variables are non-numeric "category" variables, e.g. color.  Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale.  Doctors rate pain on a scale of 1 to 10 - a "2" has no…

Comments Off on Week #45 – Ordered categorical data

Week #44 – Bimodal

Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers.  For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.  

Comments Off on Week #44 – Bimodal

Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

Comments Off on Week #43 – HDFS

Week #42 – Kruskal – Wallis Test

The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.

Comments Off on Week #42 – Kruskal – Wallis Test

Week #41 – Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical technique that helps in making inferences about whether three or more samples might come from populations having the same mean - specifically, whether the differences among the samples might be caused by chance variation.

Comments Off on Week #41 – Analysis of Variance (ANOVA)

Week #40 – Two-Tailed Test

A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test…

Comments Off on Week #40 – Two-Tailed Test

Week #39 – Split-Halves Method

In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests. The idea is to split the items (questions) related to the same construct to be measured, e.g. the anxiety level, and to compare the results…

Comments Off on Week #39 – Split-Halves Method

Week #38 – Life Tables

In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn), and the sum of the three called "the number…

Comments Off on Week #38 – Life Tables

Week #37 – Truncation

Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution where values above…

Comments Off on Week #37 – Truncation

Week #36 – Tukey's HSD (Honestly Significant Differences) Test

This test is used for testing the significance of unplanned pairwise comparisons. When you do multiple significance tests, the chance of finding a "significant" difference just by chance increases. Tukey's HSD test is one of several methods of ensuring that the chance of finding a…

Comments Off on Week #36 – Tukey's HSD (Honestly Significant Differences) Test