STATISTICS Course
Statistics
You start with data and try to find the distribution that the values follow. You then estimate properties of that distribution, check whether it is plausible that the distribution has those properties, and predict how your data will evolve.
This is a course where you will learn
 (hard) about estimators, likelihood (vraisemblance in French), bias, ...
 (soft) about population, variables (qualitative, quantitative), ...
 how to analyze a data file
 plots used in statistics
 tests
 linear regression
Everything is coupled with examples in R; you may add examples in other languages. To be honest, I do not understand much about statistics, so please correct any mistakes, thanks! Be sure to cross-check what you read here with a teacher or a friend.
Introduction
The main idea is that you have some data, and you need to find out how this data was generated (Which distribution function? Which parameters? ...). Once we have established a model, we can try to predict future values.
 Population: data we are observing, a matrix
 Variables: the columns of our matrix, can be quantitative or qualitative
 Individuals: the rows of our matrix
 Sample (échantillon): a part of our population
 Model: characterization of the dataset
 Population (paired): the same individuals were measured twice
 Empirical: computed from the sample rather than from the whole population
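To make this vocabulary concrete, here is a small sketch in R. It assumes that the built-in mtcars dataset plays the role of the population; the names d and s are only illustrative and do not come from the course.

```r
# Assumption: the built-in mtcars dataset stands in for our population.
d <- mtcars

nrow(d)   # number of individuals (rows)
ncol(d)   # number of variables (columns)

# cyl only takes a few distinct values, so we may treat it as qualitative
d$cyl <- factor(d$cyl)
is.numeric(d$hp)   # hp is quantitative
is.factor(d$cyl)   # cyl is now qualitative

# A sample: 10 individuals drawn at random from the population
set.seed(1)
s <- d[sample(nrow(d), 10), ]

# Empirical mean: computed from the sample, compared with the whole population
mean(s$hp)
mean(d$hp)
```

The two means usually differ a little: the sample mean is only an estimate of the population mean.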
Math (summary)
Before the next part, you need to know the following notions
 mean (empirical/arithmetic): sum of the values divided by the number of values
 mean (weighted, pondérée): E(X) = sum of { value * frequency }
 median: 50% of the values are greater and 50% are smaller than the median. It's the value x such that F(x)=0.5 (if F is the empirical cumulative distribution function)
 medial (médiale): value that splits the cumulative sum of the values in two
 mode: most repeated value (wiki)
 quantile: values splitting the distribution into X parts of equal frequency (4 = quartiles, 10 = deciles, 100 = centiles)
 covariance: if two variables are independent, then cov=0 (the converse does not hold in general)
 correlation: covariance rescaled to $[-1, 1]$; if two variables are correlated, then when one increases, the other tends to vary with it, in the direction given by the sign of the correlation
 moments: $\mathbb{E}(X)$, $V(X)$, Skewness, and Kurtosis
In R, you can use mean(x), median(x), quantile(x) (or fivenum(x) for Tukey's five-number summary, which can differ slightly from the quartiles), cov(x, y), or cor(x, y). In the library modeest, you can use mlv to get the mode value.
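The functions above can be tried on a small sample. This is a minimal sketch with made-up vectors x, y, values and freq (not data from the course); the mode is computed with table so that no extra package is needed.

```r
# Made-up data, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)
y <- c(1, 3, 2, 5, 6, 8, 10)

m   <- mean(x)     # arithmetic mean: sum(x) / length(x)
med <- median(x)   # middle value of the sorted sample
quantile(x)        # min, quartiles, max
fivenum(x)         # Tukey's five-number summary

# Weighted mean: sum of value * frequency (the frequencies sum to 1 here)
values <- c(1, 2, 3)
freq   <- c(0.2, 0.5, 0.3)
wm <- sum(values * freq)   # same result as weighted.mean(values, freq)

# Mode without an extra package: the most frequent value in the table
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])

cov(x, y)   # covariance, needs two vectors (or a matrix)
cor(x, y)   # correlation, always between -1 and 1
```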
Analysis of a sample
This section is a summary of what you will do, but not how you will do it, as it'll be explained in other sections.
Follow this link to learn about distributions in R
Estimators and likelihood
We use the notation $\theta$ (theta) for the vector of the parameters of a distribution. For instance, a binomial distribution $B(n,p)$ becomes $B(\theta)$ with $\theta = (n, p)$. This notation is used everywhere in statistics, even though you may not see it often in probability. We call $\Theta$ (capital theta) the space in which $\theta$ is defined.
Your goal is to estimate the vector $\theta$, that is, to answer "which parameters seem to have generated this sample?". We call estimator $\hat{\theta}$ (theta hat) a function of the sample whose value is a plausible value of $\theta$.
Finally, we call likelihood function $L(\theta)$ (fonction de vraisemblance) the function that tells, for a candidate value of $\theta$, how plausible the observed sample is under that value. You maximize this function to find the best $\hat{\theta}$, and that's called maximum likelihood estimation (maximum de vraisemblance).
 Estimators
 Likelihood function
 Maximum Likelihood estimation (MLE)
 Loglikelihood function
 Confidence interval
 Sufficient statistic
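As an illustration of maximizing a likelihood, here is a minimal sketch in base R. It assumes (my choice, not the course's) a sample from an exponential distribution, so $\theta$ is just the rate; loglik, theta_hat and theta_closed are hypothetical names. The log of the likelihood is maximized instead of the likelihood itself, which gives the same maximizer and is numerically safer.

```r
# Assumption: the sample comes from an exponential distribution, theta = rate.
set.seed(42)
x <- rexp(200, rate = 2)

# Log-likelihood of a candidate rate: sum of the log-densities of the sample
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# Maximize it numerically over a plausible interval
fit <- optimize(loglik, interval = c(0.001, 20), maximum = TRUE, tol = 1e-8)
theta_hat <- fit$maximum

# For this distribution the maximizer is known in closed form: 1 / mean(x)
theta_closed <- 1 / mean(x)
c(numeric = theta_hat, closed_form = theta_closed)
```

Both values should agree up to the numerical tolerance, and both should be close to (but not exactly) the true rate used to simulate the sample.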
Reading the source code of the fitdistr function taught me more about estimators. You can find the source code here. I use this newly acquired knowledge a lot in the next two pages, but unfortunately I don't know how the authors arrived at these formulas...
Other notes
Plots
In descriptive statistics, you will have to use plots and try to guess the distribution. They are all introduced in the R course.

plot (plot): simply plot your values
histogram (hist): used to see, for each value, the proportion of the distribution (=frequency, if prob=T) or the number of individuals taking that value
contingency table (table): make a table of the unique values with their number of occurrences
pivot table (qhpvt): you can use this to observe a variable in more depth, especially the relation between your variable and other variables.
Bar chart: you can see a count (or another quantity) for each level of a qualitative variable
 for instance, you can see the number of students per year
 let d be a data.frame and d$year be the year (ex: 2020) the student joined the school
 then the call would be
barplot(table(d$year))
 with a two-way table from xtabs, you may add beside=T to draw the bars side by side
barplot(xtabs(....), beside = TRUE)

Box plot: like a bar chart split by group, but for a quantitative variable you can see the quartiles, the min, and the max!
 for instance, you can see the horsepower of a car per the number of cylinders
 let d be a data.frame and d$hp (horsepower), d$cyl (cylinders) our variables
 the call would be
boxplot(d$hp ~ d$cyl)
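The calls above can be tried together. This sketch assumes that the built-in mtcars dataset plays the role of d (it has hp and cyl columns); the xtabs formula is my own example, not the one elided in the course.

```r
# Assumption: the built-in mtcars dataset plays the role of d.
d <- mtcars

pdf(NULL)   # draw to a null device; remove this line to see the plots

plot(d$hp)                          # raw values, in row order
hist(d$hp, probability = TRUE)      # histogram on the density scale
counts <- table(d$cyl)              # contingency table: cars per cylinder count
barplot(counts)                     # bar chart of those counts
# two-way table (gearbox type by cylinders), bars drawn side by side
barplot(xtabs(~ am + cyl, data = d), beside = TRUE)
boxplot(hp ~ cyl, data = d)         # horsepower per number of cylinders

invisible(dev.off())
```

The formula form boxplot(hp ~ cyl, data = d) draws the same plot as boxplot(d$hp ~ d$cyl), with shorter axis labels.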
Tests
We use tests to check whether it is plausible that a parameter has a given value, that the distribution has a given property, etc. We consider two hypotheses $H_0$ and $H_1$, and we check whether the data give enough evidence to reject $H_0$ in favor of $H_1$.
 $H_0$, null hypothesis: the default assumption, kept unless the data contradict it
 $H_1$, alternative hypothesis: what we want to demonstrate
A test could be
 two-tailed test (test bilatéral): $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$
 left-tailed test (test unilatéral gauche): $H_0: \theta = \theta_0$, $H_1: \theta \lt \theta_0$
 right-tailed test (test unilatéral droit): $H_0: \theta = \theta_0$, $H_1: \theta \gt \theta_0$
 goodness-of-fit test (test d'adéquation): $H_0: X \sim L_1$, $H_1: X \sim L_2$ with $L_1$ and $L_2$ two distributions
 ...
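In R, the kinds of tests listed above map onto arguments of the built-in test functions. The sketch below assumes a simulated sample and a one-sample Student test of the mean against $\theta_0 = 0$; the Kolmogorov-Smirnov test is one possible goodness-of-fit test among others.

```r
# Assumption: a simulated sample; we test its mean against theta_0 = 0.
set.seed(7)
x <- rnorm(30, mean = 0.4, sd = 1)

# Two-tailed, left-tailed and right-tailed versions of the same test
p_two     <- t.test(x, mu = 0, alternative = "two.sided")$p.value
p_less    <- t.test(x, mu = 0, alternative = "less")$p.value
p_greater <- t.test(x, mu = 0, alternative = "greater")$p.value

# Goodness-of-fit: H0 is "x follows a standard normal distribution"
p_fit <- ks.test(x, "pnorm", mean = 0, sd = 1)$p.value

c(two = p_two, less = p_less, greater = p_greater, fit = p_fit)
```

For this test, the two one-tailed p-values sum to 1, and the two-tailed p-value is twice the smaller of the two.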
We may make two kinds of errors when deciding

Type I error (risque de première espèce/seuil), $\alpha$: we rejected $H_0$ (accepted $H_1$) but $H_0$ was true
 $\alpha = \mathbb{P}(reject\ H_0 \mid H_0\ true)$

Type II error (risque de seconde espèce), $\beta$: we kept $H_0$ but $H_1$ was true
 $\beta = \mathbb{P}(not\ reject\ H_0 \mid H_1\ true)$
From the probabilities of the errors above, we define
 $1-\alpha$: confidence coefficient (niveau de confiance)
 $1-\beta$: power of a test (puissance d'un test)
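Both error probabilities can be estimated by simulation. This is only a sketch under assumptions of mine: a two-sided one-sample Student test of $H_0$: mean = 0, samples of size 20, and an alternative where the true mean is 0.8. The helper p_value is a hypothetical name.

```r
set.seed(123)
alpha  <- 0.05
n_sims <- 2000

# p-value of a two-sided t-test of H0: mean = 0 on a fresh sample of size 20
p_value <- function(true_mean) {
  t.test(rnorm(20, mean = true_mean), mu = 0)$p.value
}

# H0 true: the rejection rate estimates the Type I error, close to alpha
type1 <- mean(replicate(n_sims, p_value(0)) < alpha)

# H1 true (mean = 0.8): the rejection rate estimates the power 1 - beta
power <- mean(replicate(n_sims, p_value(0.8)) < alpha)
beta  <- 1 - power

c(type1 = type1, power = power, beta = beta)
```

Lowering alpha makes Type I errors rarer but, for a fixed sample size, it also lowers the power.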
We call critical region or region of rejection $W$ (zone de rejet) the set of values for which the null hypothesis is rejected. Its complement, where the null hypothesis is kept, is called the region of non-rejection. The name critical region comes from the bounds of the region, which are called critical values. To find the critical region, you need a test statistic (statistique de test) $T(x_1, ..., x_n) = T(x)$, which is a function of the sample whose value is compared to the critical values to pick a hypothesis. Of course, if you found $W$, then you don't have to look for $T$.
A test is significant if the result is in $W$. If that's the case, then $H_0$ is rejected. Otherwise, we do not say that we accept $H_0$; instead, we say that we can't reject $H_0$ at the significance level $\alpha$ (seuil alpha).
For instance, with the likelihood ratio
@ R(x) = \frac{L(x \mid \theta_1)}{L(x \mid \theta_0)} @
@ W = \{ x; R(x) \gt k \} =^\text{some operation} \{ x; T(x) \gt c \} @
with $k$ a constant and $c$ a critical value chosen so that
@ \mathbb{P}_{H_0}(W) = \mathbb{P}_{H_0}(R(x) \gt k) = \mathbb{P}_{H_0}(T(x) \gt c) = \alpha @
Depending on the test, you can also see
@ W = \{ x; T(x) \lt c \} \quad or \quad W = \{ x; T(x) \lt c_1 \text{ or } T(x) \gt c_2 \} @
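Here is a sketch of a critical region of the form $T(x) \gt c$ in R. It assumes a textbook case chosen by me: a normal sample with known standard deviation sigma, $H_0$: $\theta = 0$ against $H_1$: $\theta \gt 0$. The names T_stat, crit and in_W are hypothetical (T alone is avoided because it is an alias of TRUE in R).

```r
# Assumptions: X ~ N(theta, sigma^2) with sigma known,
# H0: theta = theta0 against H1: theta > theta0.
alpha  <- 0.05
sigma  <- 1
theta0 <- 0

# Test statistic T(x): standardized sample mean, N(0, 1) under H0
T_stat <- function(x) sqrt(length(x)) * (mean(x) - theta0) / sigma

# Critical value c such that P_H0(T(x) > c) = alpha
crit <- qnorm(1 - alpha)

# Decision rule: x is in W (H0 rejected) when T(x) > c
in_W <- function(x) T_stat(x) > crit

set.seed(3)
x <- rnorm(25, mean = 0.6, sd = sigma)
T_stat(x)
in_W(x)
```

By construction, the probability of landing in W when $H_0$ is true is exactly alpha: pnorm(crit, lower.tail = FALSE) returns 0.05.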
In R (and most of the time), software computes a p-value,
 according to the Neyman–Pearson approach
 if $\text{p-value} < \alpha$: reject H0
 else: do not reject H0
 according to Fisher
 the smaller $p$ is, the stronger the evidence against H0
 if $\alpha=0.05$, with Neyman–Pearson a p-value of $0.049$ rejects H0 while $0.050$ does not, so Neyman–Pearson is less permissive than Fisher, who reads $p$ as a degree of evidence rather than a yes/no rule.
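The decision rule on the p-value can be sketched as follows. The sample x is made up, and the test (a two-sided one-sample Student test of $H_0$: mean = 5) is my choice for the illustration. The p-value is also recomputed by hand to show where it comes from.

```r
# Made-up sample; H0: the mean is 5, two-sided alternative
x     <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2, 5.9, 5.5)
alpha <- 0.05

res <- t.test(x, mu = 5)
p   <- res$p.value

# Same p-value by hand: probability, under H0, of a statistic
# at least as extreme as the observed one (both tails)
n        <- length(x)
t_obs    <- (mean(x) - 5) / (sd(x) / sqrt(n))
p_manual <- 2 * pt(-abs(t_obs), df = n - 1)

# Neyman-Pearson style decision at level alpha
decision <- if (p < alpha) "reject H0" else "cannot reject H0"
decision
```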
This is too complex for me, I made too many mistakes, so I removed everything. You may add back what you know (examples, well-known tests) or links.
Regression
Linear regression (régression linéaire) fits the line that minimizes the sum of the squared vertical distances between the line and each point of the data.
The linear regression equation is $Y = a + b X + residual$. If $Y$ does not range over $\mathbb{R}$ but is restricted to $[0,1]$ (a probability, or a yes/no outcome coded 0/1), a straight line is not a suitable model, and you should look at logistic regression.
 What's Y, a, b, X?
 Regression table/model
 Qualitative explanatory variables
 Linear regression
 Logistic regression
 Poisson regression
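As a quick illustration of the equation $Y = a + b X + residual$, here is a sketch in R on simulated data (the true values a = 3 and b = 2 are my choice). It compares the coefficients returned by lm with the closed-form least-squares solution.

```r
# Simulated data: Y = 3 + 2 X + noise
set.seed(10)
X <- 1:30
Y <- 3 + 2 * X + rnorm(30, sd = 2)

# Least-squares fit
fit <- lm(Y ~ X)
coef(fit)   # estimated a (Intercept) and b (slope)

# Closed-form least-squares solution for one explanatory variable
b <- cov(X, Y) / var(X)
a <- mean(Y) - b * mean(X)

# The residuals are the vertical distances that the fit minimizes
sum(residuals(fit)^2)
```

For a binary outcome, glm(Y ~ X, family = binomial) would be the analogous call for logistic regression.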
Sources
This is a list of all Wikipedia pages that you may want to check
 https://en.wikipedia.org/wiki/Statistical_inference
 https://en.wikipedia.org/wiki/Descriptive_statistics
 https://en.wikipedia.org/wiki/Linear_regression
 https://en.wikipedia.org/wiki/Linear_discriminant_analysis
 https://en.wikipedia.org/wiki/Regression_analysis
 https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
Estimators and likelihood