# PCA

You may be familiar with spreadsheets and pivot tables (dynamic cross tables), which compare column behaviours such as sums and means. But what happens with about a thousand columns? You will need a more synthetic view of your data.

PCA (Principal Component Analysis) is a method belonging to the quantitative analysis (QA) branch.

It performs multidimensional analysis (in R^k space), treating the columns of a dataset as “components”.

Behaviours are calculated as covariance or correlation and represented as a 2D square matrix.
Many of these features already exist in Python modules, but Python may be slow on wide datasets.

The C++ code is a backend that handles large datasets with the best possible response time.
The Python part (Docker image) can be used to plot and/or cross-check results, either directly or from the backend.

A Matlab/Octave part is also available for cross-checking; some of its scripts can be used to generate graphics.

## Purpose

Demystify PCA and make exploration as simple as possible for C/C++ devs.

## Lexicon

Pre-processing

• The covariance matrix is the dispersion matrix of a dataset.
• The correlation matrix is a scaled covariance matrix (recognizable by its diagonal, which is all 1s).
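The two definitions above can be sketched in a few lines of C++. This is a minimal illustration, not the backend's actual code: the `Matrix` alias and the function names `covariance` and `correlation` are assumptions made for this example. It uses the sample covariance (division by n - 1), which is consistent with the iris figures shown in the sample output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major matrix: rows are observations, columns are variables.
using Matrix = std::vector<std::vector<double>>;

// Sample covariance matrix (divides by n - 1), one row/column per variable.
Matrix covariance(const Matrix& data) {
    const std::size_t n = data.size();
    assert(n > 1 && "need at least two observations");
    const std::size_t k = data[0].size();

    // Column means.
    std::vector<double> mean(k, 0.0);
    for (const auto& row : data)
        for (std::size_t j = 0; j < k; ++j) mean[j] += row[j] / n;

    // Accumulate products of centered values.
    Matrix cov(k, std::vector<double>(k, 0.0));
    for (const auto& row : data)
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j)
                cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (n - 1);
    return cov;
}

// Correlation matrix: covariance scaled by the standard deviations,
// r_ij = c_ij / sqrt(c_ii * c_jj), so the diagonal becomes 1.
Matrix correlation(const Matrix& cov) {
    const std::size_t k = cov.size();
    Matrix cor(k, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j)
            cor[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
    return cor;
}
```

Calling `correlation(covariance(data))` on a dataset yields a matrix whose diagonal is 1, which is the visual cue mentioned in the second bullet.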

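SVD (singular value decomposition) is the decomposition applied to the matrix; it returns values and vectors. For a symmetric positive semi-definite matrix such as a covariance or correlation matrix, the singular values coincide with the eigenvalues and the singular vectors with the eigenvectors (up to sign).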
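To make the "values and vectors" concrete, here is a hedged sketch of an eigen decomposition for a symmetric matrix using cyclic Jacobi rotations. This is a swapped-in textbook technique chosen because it fits in a short, dependency-free example; the backend may well rely on a different routine or on a library solver. The names `EigenResult` and `jacobi_eigen` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical result type: eigenvalues (descending) and the matching
// eigenvectors stored as the columns of `vectors`.
struct EigenResult {
    std::vector<double> values;
    Matrix vectors;
};

// Cyclic Jacobi rotations for a symmetric matrix `a` (taken by value).
EigenResult jacobi_eigen(Matrix a, int max_sweeps = 50, double tol = 1e-20) {
    const std::size_t n = a.size();
    for (const auto& row : a) assert(row.size() == n && "matrix must be square");

    Matrix v(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) v[i][i] = 1.0;

    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        // Stop once the squared off-diagonal mass is negligible.
        double off = 0.0;
        for (std::size_t p = 0; p < n; ++p)
            for (std::size_t q = p + 1; q < n; ++q) off += a[p][q] * a[p][q];
        if (off < tol) break;

        for (std::size_t p = 0; p < n; ++p) {
            for (std::size_t q = p + 1; q < n; ++q) {
                if (a[p][q] == 0.0) continue;
                // Rotation (c, s) chosen so that the new a[p][q] is zero.
                const double theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
                const double t = (theta >= 0.0 ? 1.0 : -1.0) /
                                 (std::fabs(theta) + std::sqrt(theta * theta + 1.0));
                const double c = 1.0 / std::sqrt(t * t + 1.0);
                const double s = t * c;

                for (std::size_t k = 0; k < n; ++k) {  // a <- a * J, v <- v * J
                    const double akp = a[k][p], akq = a[k][q];
                    a[k][p] = c * akp - s * akq;
                    a[k][q] = s * akp + c * akq;
                    const double vkp = v[k][p], vkq = v[k][q];
                    v[k][p] = c * vkp - s * vkq;
                    v[k][q] = s * vkp + c * vkq;
                }
                for (std::size_t k = 0; k < n; ++k) {  // a <- J^T * a
                    const double apk = a[p][k], aqk = a[q][k];
                    a[p][k] = c * apk - s * aqk;
                    a[q][k] = s * apk + c * aqk;
                }
            }
        }
    }

    // Sort eigenpairs by decreasing eigenvalue.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t i, std::size_t j) { return a[i][i] > a[j][j]; });

    EigenResult r{std::vector<double>(n), Matrix(n, std::vector<double>(n))};
    for (std::size_t j = 0; j < n; ++j) {
        r.values[j] = a[order[j]][order[j]];
        for (std::size_t i = 0; i < n; ++i) r.vectors[i][j] = v[i][order[j]];
    }
    return r;
}
```

Eigenvectors are only defined up to sign, so a column returned here may be the negation of the one printed by another tool while still describing the same component.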

Consider two forms of PCA:

• covariance based (SVD on the unscaled matrix).
• correlation based (SVD on the scaled matrix).

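As you may notice:

• covariance is lossless: it keeps each column's original dispersion, which can vary widely from one column to another.
• correlation is lossy: it rescales every column to the same dispersion, so the original scales are discarded.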

So which should I use, cov or cor?
When the columns of the dataset share the same units, use covariance; otherwise use correlation.
The method to use therefore depends on the nature of your dataset.
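The reason units matter can be shown with a small experiment: converting one column to another unit changes its covariance with the other columns but leaves the correlation untouched. The sketch below uses made-up height and weight figures, and the helpers `sample_cov`, `pearson_cor` and `scaled` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample covariance of two equally sized series (divides by n - 1).
double sample_cov(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mx += x[i] / n;
        my += y[i] / n;
    }
    double c = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        c += (x[i] - mx) * (y[i] - my) / (n - 1.0);
    return c;
}

// Pearson correlation: covariance divided by both standard deviations.
double pearson_cor(const std::vector<double>& x, const std::vector<double>& y) {
    return sample_cov(x, y) / std::sqrt(sample_cov(x, x) * sample_cov(y, y));
}

// Same measurements expressed in another unit (e.g. metres -> centimetres).
std::vector<double> scaled(std::vector<double> x, double factor) {
    for (double& v : x) v *= factor;
    return x;
}
```

With heights `{1.60, 1.70, 1.80}` in metres and weights `{55, 65, 80}` in kilograms, `sample_cov(scaled(height, 100.0), weight)` is 100 times `sample_cov(height, weight)`, while `pearson_cor` returns the same value for both. A covariance-based PCA on mixed units would therefore be dominated by whichever column happens to have the largest numbers.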

## Build

``````
./build.sh
``````

## Run

``````
./build/pca
``````

## Sample output

Performance comparison between Python and C++ (same feature set).

Processing time

• Test platform
``````
Quad Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
``````
• Python

``````
real    0m5.492s
user    0m4.160s
sys    0m0.752s
``````
• C++

``````
real    0m0,007s
user    0m0,008s
sys    0m0,000s
``````
• Result C++

``````
Fixture csv iris species 4x150
Fixture datas (matrix)

5.100000	3.500000	1.400000	0.200000
4.900000	3.000000	1.400000	0.200000
4.700000	3.200000	1.300000	0.200000
4.600000	3.100000	1.500000	0.200000
5.000000	3.600000	1.400000	0.200000
...
Covariance (matrix)

0.685694	-0.042434	1.274315	0.516271
-0.042434	0.189979	-0.329656	-0.121639
1.274315	-0.329656	3.116278	1.295609
0.516271	-0.121639	1.295609	0.581006

Correlation (matrix)

1.000000	-0.117570	0.871754	0.817941
-0.117570	1.000000	-0.428440	-0.366126
0.871754	-0.428440	1.000000	0.962865
0.817941	-0.366126	0.962865	1.000000

Eigen vectors (matrix)

0.361387	-0.656589	0.582030	0.315487
-0.084523	-0.730161	-0.597911	-0.319723
0.856671	0.173373	-0.076236	-0.479839
0.358289	0.075481	-0.545831	0.753657

Eigen values (vector)

4.228242	0.242671	0.078210	0.023835

Explained variance

C0 0.924619
C1 0.0530665
C2 0.0171026
C3 0.00521218

Projected matrix

2.818240	-5.646350	0.659768	-0.031089
2.788223	-5.149951	0.842317	0.065675
2.613375	-5.182003	0.613952	-0.013383
2.757022	-5.008654	0.600293	-0.108928
2.773649	-5.653707	0.541773	-0.094610
...
``````
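The last two sections of this output can be reproduced from the earlier ones. The explained variance of a component is its eigenvalue divided by the sum of all eigenvalues, and the printed projected rows appear to be the raw (uncentered) data rows multiplied by the eigenvector matrix: the first fixture row gives 2.818240 and -5.646350 for the first two components. The sketch below illustrates both steps; `explained_variance` and `project` are hypothetical names, not the backend's API.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Share of the total variance carried by each component:
// ratio_i = lambda_i / sum(lambda).
std::vector<double> explained_variance(const std::vector<double>& eigenvalues) {
    const double total =
        std::accumulate(eigenvalues.begin(), eigenvalues.end(), 0.0);
    assert(total > 0.0 && "eigenvalues must sum to a positive value");
    std::vector<double> ratio;
    ratio.reserve(eigenvalues.size());
    for (double v : eigenvalues) ratio.push_back(v / total);
    return ratio;
}

// Projection: data (n x k) times eigenvectors (k x k, one vector per column).
// Assumption: rows are used as-is, without centering, to match the output above.
Matrix project(const Matrix& data, const Matrix& vectors) {
    const std::size_t k = vectors.size();
    Matrix out(data.size(), std::vector<double>(k, 0.0));
    for (std::size_t r = 0; r < data.size(); ++r) {
        assert(data[r].size() == k && "row width must match eigenvector count");
        for (std::size_t c = 0; c < k; ++c)
            for (std::size_t j = 0; j < k; ++j)
                out[r][c] += data[r][j] * vectors[j][c];
    }
    return out;
}
```

Feeding the eigenvalues printed above into `explained_variance` gives about 0.9246 for the first component, so a single axis already carries roughly 92% of the dispersion of this dataset.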

## Testing

``````
./build/pca_test
``````

## Todo

• Tests implementation
• 2D graphics rendering