Merging data


Sometimes you run into a problem: you don't have enough data. In statistical inference, some tests require $n \ge 5$ or $n \ge 30$. In that case, you could try one of the following.

Merge datasets

If you have two datasets that share a common column, you may be able to merge them. merge() matches the rows that have the same value in that column.

merge(data1, data2, by="common_column_name")
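As a minimal sketch of what that call does, here is a made-up example. The tables, the `id` column and the values are all hypothetical; only `merge()` and its `by` and `all` arguments come from base R.

```r
# Hypothetical example data: two small tables sharing an "id" column.
data1 <- data.frame(id = c(1, 2, 3), height = c(170, 165, 180))
data2 <- data.frame(id = c(2, 3, 4), weight = c(60, 82, 75))

# Default behaviour: keep only the ids present in both tables (here 2 and 3).
inner <- merge(data1, data2, by = "id")

# all = TRUE: keep every id from both tables and fill the gaps with NA.
outer <- merge(data1, data2, by = "id", all = TRUE)

print(inner)
print(outer)
```

Note that merging this way adds columns to matching rows. If both tables instead have the same columns and describe different observations, `rbind(data1, data2)` stacks them, which is what actually increases the number of rows.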

Use Additive Smoothing

Also called Laplace smoothing (Lissage de Laplace or lissage laplacien in French). The idea is to artificially add values.

Let's say you are counting a value by year. You can treat a year with no observations as a count of $0$, then add a constant $\alpha = 1$ to every count. Because the mean is linear, this shifts the mean by exactly $\alpha$: $mean(x + \alpha) = mean(x) + \alpha$.
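A small sketch of this idea, assuming yearly counts stored in a named vector. The years and counts are invented for illustration.

```r
# Hypothetical counts per year; 2021 has no observations, so it is missing.
counts <- c("2019" = 4, "2020" = 7, "2022" = 2)
years <- as.character(2019:2022)

# Treat a year with no observations as a count of 0.
full <- setNames(rep(0, length(years)), years)
full[names(counts)] <- counts

# Add alpha to every count.
alpha <- 1
smoothed <- full + alpha

# The mean is shifted by exactly alpha.
print(mean(smoothed) - mean(full))

# Smoothed proportions: (count + alpha) / (N + alpha * k), with k categories.
# No year ends up with a proportion of 0.
probs <- smoothed / sum(smoothed)
print(probs)
```

The last step is how the smoothing is usually applied to proportions: the year without observations keeps a small non-zero share instead of $0$.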

Bootstrap method

From what I understood (though my teachers do not seem to agree), the bootstrap method lets us increase the number of data points. What I do is:

  • start from a sample $x$
  • pick $n$ (for instance 10000) elements from $x$, sampling with replacement, to create a new batch $b$
  • compute $y = mean(b)$
  • add $y$ to $x$
  • repeat until $x$ has enough values
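The steps above could be sketched as follows. The helper `grow_sample` and its parameters are hypothetical names, and the sample is made up. One assumption is labeled in the code: each batch is drawn from the original values only, not from the values added along the way.

```r
set.seed(42)  # arbitrary seed, only to make the run reproducible

# Hypothetical helper following the steps listed above.
# x: original sample, target: number of values wanted,
# n: batch size, stat: summary computed on each batch.
grow_sample <- function(x, target, n = 10000, stat = mean) {
  original <- x  # assumption: always resample from the original values
  while (length(x) < target) {
    b <- sample(original, size = n, replace = TRUE)  # batch with replacement
    x <- c(x, stat(b))                               # add the summary to x
  }
  x
}

x <- c(2.1, 3.4, 1.9, 5.0, 4.2)  # made-up sample
grown <- grow_sample(x, target = 30)
print(length(grown))
```

Passing `stat = median` gives the median variant. Keep in mind that the added values are summaries of the data you already had, so they carry no new information. In its usual form, the bootstrap keeps the batch statistics separate from $x$ and uses their spread to estimate how variable the statistic is (for instance its standard error).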

Sometimes, instead of using $mean(b)$ (because the mean may be a value that does not occur in $x$, or because the mean is sensitive to outliers, as you read before), you may use the median.