Pca

You may be familiar with spreadsheets and pivot-table tools for comparing column behaviours such as sums and means. But what happens when you have about a thousand columns? You will need a more synthetic view of your data.

PCA (Principal Component Analysis) is a method that belongs to the quantitative analysis (QA) branch.

It performs multidimensional analysis (R^k space), treating the columns of a dataset as “components”.

Behaviours are calculated as covariance or correlation and represented as a 2D square matrix.
Many of these features already exist in Python modules, but Python may be slow on wide datasets.

The C++ code is a backend for handling large datasets with the best possible response time.
The Python part (Docker image) can be used to plot and/or to cross-check results, either directly or from the backend.

A Matlab/Octave part is available for cross-checking; some scripts can be used to generate graphics.

Purpose

Demystify PCA and make its exploration as simple as possible for C/C++ developers.

Lexicon

Pre-processing

  • The covariance matrix is the dispersion matrix of a dataset.
  • The correlation matrix is a scaled covariance matrix (recognisable by its diagonal, which is all 1s).
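
The two definitions above can be sketched in C++ as follows. This is a minimal illustration, not the project's backend code: the names `Matrix`, `covariance` and `correlation` are hypothetical, the dataset is assumed to be row-major (one observation per row), and the sample divisor n - 1 is assumed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical alias for illustration: rows are observations, columns are variables.
using Matrix = std::vector<std::vector<double>>;

// Sample covariance matrix (divisor n - 1, an assumption) of a row-major dataset.
Matrix covariance(const Matrix& data) {
    const std::size_t n = data.size();
    const std::size_t k = data.front().size();

    // Column means.
    std::vector<double> mean(k, 0.0);
    for (const auto& row : data)
        for (std::size_t j = 0; j < k; ++j) mean[j] += row[j];
    for (double& m : mean) m /= static_cast<double>(n);

    // Accumulate the centered cross-products, then divide by n - 1.
    Matrix cov(k, std::vector<double>(k, 0.0));
    for (const auto& row : data)
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j)
                cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]);
    for (auto& r : cov)
        for (double& v : r) v /= static_cast<double>(n - 1);
    return cov;
}

// Correlation matrix: each covariance is divided by the product of the two
// standard deviations, which forces the diagonal to 1.
Matrix correlation(const Matrix& cov) {
    const std::size_t k = cov.size();
    Matrix cor(k, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j)
            cor[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
    return cor;
}
```

Calling `correlation(covariance(data))` yields a symmetric matrix whose diagonal is 1; a column with zero variance would cause a division by zero and is not handled in this sketch.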

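SVD (Singular Value Decomposition) is the decomposition applied to the matrix; on a symmetric covariance or correlation matrix it is equivalent to the eigendecomposition, and it returns the eigenvalues and eigenvectors.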
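
To give a feel for what "values and vectors" means, here is a minimal sketch that finds only the largest eigenvalue and its eigenvector using power iteration. Power iteration is a simpler stand-in chosen for illustration, not the SVD routine used by the backend, and the name `dominant_eigenpair` and its parameters are hypothetical. It assumes a symmetric, positive semi-definite input such as a covariance matrix.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Largest eigenvalue and matching unit eigenvector of a symmetric matrix,
// found by power iteration: multiply by the matrix, normalise, repeat.
std::pair<double, Vector> dominant_eigenpair(const Matrix& a,
                                             int max_iter = 1000,
                                             double tol = 1e-12) {
    const std::size_t k = a.size();
    Vector v(k, 1.0 / std::sqrt(static_cast<double>(k)));  // unit start vector
    double lambda = 0.0;
    for (int it = 0; it < max_iter; ++it) {
        // w = A * v
        Vector w(k, 0.0);
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j) w[i] += a[i][j] * v[j];

        // Rayleigh quotient v.w estimates the eigenvalue (v has unit length).
        double rayleigh = 0.0;
        double norm = 0.0;
        for (std::size_t i = 0; i < k; ++i) {
            rayleigh += v[i] * w[i];
            norm += w[i] * w[i];
        }
        norm = std::sqrt(norm);
        if (norm == 0.0) break;  // start vector lies in the null space

        for (std::size_t i = 0; i < k; ++i) v[i] = w[i] / norm;

        const bool converged = std::fabs(rayleigh - lambda) < tol;
        lambda = rayleigh;
        if (converged) break;
    }
    return {lambda, v};
}
```

The sign of an eigenvector is arbitrary, so a different start vector may return the same direction with all signs flipped. A full PCA needs every eigenpair, which calls for a complete eigen solver or SVD rather than this sketch.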

Consider two forms of PCA:

  • covariance-based (SVD on the unscaled matrix).
  • correlation-based (SVD on the scaled matrix).

As you may notice:

  • covariance is lossless: it keeps each column's original scale, so the dispersion can be wide.
  • correlation is lossy: each column is rescaled, so the dispersion is normalised.

So which should I use, covariance or correlation?
When the columns of the dataset share the same units, use covariance; otherwise use correlation.
The method to use therefore depends on the nature of your dataset.
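
The effect of units can be seen on a pair of columns. The sketch below is illustrative only: `cov_xy`, `cor_xy` and `scaled` are hypothetical helper names, and the n - 1 divisor is assumed. Changing the unit of one column (for example metres to centimetres, a factor of 100) multiplies its covariance with the other column by the same factor, while the correlation stays unchanged.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sample covariance (divisor n - 1, an assumption) between two equal-length columns.
double cov_xy(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= static_cast<double>(n);
    my /= static_cast<double>(n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += (x[i] - mx) * (y[i] - my);
    return sum / static_cast<double>(n - 1);
}

// Correlation: covariance divided by the product of the standard deviations.
double cor_xy(const std::vector<double>& x, const std::vector<double>& y) {
    return cov_xy(x, y) / std::sqrt(cov_xy(x, x) * cov_xy(y, y));
}

// Copy of a column expressed in another unit (every value times `factor`).
std::vector<double> scaled(std::vector<double> x, double factor) {
    for (double& v : x) v *= factor;
    return x;
}
```

With `x = {1, 2, 3, 4}` and `y = {2, 1, 4, 3}`, `cov_xy(x, y)` is 1 and `cov_xy(scaled(x, 100), y)` is 100, whereas `cor_xy` gives 0.6 in both cases. This is why a covariance-based PCA is dominated by whichever column happens to use the largest numbers when units differ.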

Features

References

PCA

Presentation

Tools

PCA explanation

Interpretation

Questions

Fixtures (datasets)

Hereby

Sources

Requirements

Build

./build.sh

Run

./build/pca

Sample output

Comparing performance between Python and C++ (with identical features).

Related to

Processing time

  • Test platform
Quad Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
  • Python

real    0m5.492s
user    0m4.160s
sys    0m0.752s
  • C++

real    0m0,007s
user    0m0,008s
sys    0m0,000s
  • Result C++

Fixture csv iris species 4x150
	Fixture datas (matrix)

	5.100000	3.500000	1.400000	0.200000
	4.900000	3.000000	1.400000	0.200000
	4.700000	3.200000	1.300000	0.200000
	4.600000	3.100000	1.500000	0.200000
	5.000000	3.600000	1.400000	0.200000
	         ...
	Covariance (matrix)

	0.685694	-0.042434	1.274315	0.516271
	-0.042434	0.189979	-0.329656	-0.121639
	1.274315	-0.329656	3.116278	1.295609
	0.516271	-0.121639	1.295609	0.581006

	Correlation (matrix)

	1.000000	-0.117570	0.871754	0.817941
	-0.117570	1.000000	-0.428440	-0.366126
	0.871754	-0.428440	1.000000	0.962865
	0.817941	-0.366126	0.962865	1.000000

	Eigen vectors (matrix)

	0.361387	-0.656589	0.582030	0.315487
	-0.084523	-0.730161	-0.597911	-0.319723
	0.856671	0.173373	-0.076236	-0.479839
	0.358289	0.075481	-0.545831	0.753657

	Eigen values (vector)

	4.228242	0.242671	0.078210	0.023835
	
	Explained variance

	C0 0.924619
	C1 0.0530665
	C2 0.0171026
	C3 0.00521218
	
	Projected matrix

	2.818240	-5.646350	0.659768	-0.031089
	2.788223	-5.149951	0.842317	0.065675
	2.613375	-5.182003	0.613952	-0.013383
	2.757022	-5.008654	0.600293	-0.108928
	2.773649	-5.653707	0.541773	-0.094610
	         ...

Testing

./build/pca_test

Todo

  • Tests implementation
  • 2D graphics rendering
