Vanilla PCA library based on numpy and examples of practical dimensionality reduction use in stock market analysis.


Data Capital Management Interview Exercise

Introduction

When developing trading strategies and data science analytics, it is always useful to be able to understand data in the most concise way possible. One way to achieve this is through techniques known as "dimension reduction". We ask you to implement one of these techniques and apply it to a realistic dataset.

Context

Your mission is to create a library of functions that processes raw historical price data to find the components that explain most of the risk profile of capital markets. We are interested in the relationship between the daily returns (computed from the adjusted closes) of different ETFs. Because these returns tend to be highly correlated with one another, we would like to reduce the dimensionality of the space spanned by the 40 return time series.
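As a starting point, the step from adjusted closes to daily returns can be sketched as below. This is a minimal illustration, not part of the exercise statement: the function name `daily_returns` and the toy prices are assumptions, and simple (not log) returns are assumed. With a pandas DataFrame, `DataFrame.pct_change()` computes the same quantity.

```python
import numpy as np

def daily_returns(adj_close):
    """Simple daily returns from adjusted closes.

    adj_close: 2-D array-like, rows = consecutive trading days, columns = assets.
    Returns an array with one row fewer than the input.
    """
    adj_close = np.asarray(adj_close, dtype=float)
    # r_t = P_t / P_{t-1} - 1, computed column by column
    return adj_close[1:] / adj_close[:-1] - 1.0

# Toy prices for two hypothetical ETFs over three days
prices = np.array([
    [100.00, 50.000],
    [101.00, 50.500],
    [99.99, 51.005],
])
returns = daily_returns(prices)
```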

One common technique used for dimension reduction is Principal Component Analysis (PCA). PCA forms linear combinations of the original time series to transform them into a set of linearly uncorrelated time series called "Principal Components" (PCs).

The transformation is defined in such a way that the first PC accounts for the largest variance (risk) in the data, with each successive component accounting for the next largest variance (risk) under the constraint of zero correlation with the previous components. In real-life applications, only some of the components are kept and used in subsequent steps. In a nutshell, PCA provides a good approximation to a group of variables, preserving the "dispersion" of the data as much as possible while reducing the number of variables.
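One way to decide how many components to keep is to look at the share of total variance each PC explains. The sketch below is illustrative only: the helper names `explained_variance_ratio` and `n_components_for`, the 90% threshold, and the synthetic data are all assumptions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each PC, largest first.

    X: 2-D array, rows = observations, columns = variables.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return eigvals / eigvals.sum()

def n_components_for(ratio, threshold=0.9):
    """Smallest number of leading PCs whose cumulative share reaches `threshold`."""
    k = int(np.searchsorted(np.cumsum(ratio), threshold) + 1)
    return min(k, len(ratio))

# Synthetic data: four independent variables with very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
ratio = explained_variance_ratio(X)
k = n_components_for(ratio, threshold=0.9)
```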

The diagram below shows a two-dimensional set of points together with its two PCs. With two variables, only the first PC would be kept (the large diagonal arrow below), and each point would end up being represented by a single value: its coordinate with respect to the first PC. Mathematically, this value is the dot product of the point's coordinates and the vector describing the first PC.
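The projection described above can be sketched in a few lines of NumPy. The synthetic point cloud and variable names are assumptions made for illustration; the points are centered before projecting so that the scores measure dispersion around the mean.

```python
import numpy as np

# Synthetic 2-D point cloud stretched along a diagonal direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
points = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

centered = points - points.mean(axis=0)
# eigh returns eigenvalues in ascending order, so the last column is the first PC
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
first_pc = eigvecs[:, -1]

# Each 2-D point is reduced to one number: its dot product with the first PC
scores = centered @ first_pc
```

The variance of `scores` equals the largest eigenvalue of the covariance matrix, which is the sense in which the first PC "accounts for the largest variance".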

Task

You need to create a routine that is able to decompose historical prices from several assets into a low dimensional representation. The routines must be written as a library, and be coded in a re-usable manner.

  1. If you want to understand more about how PCA works, take a look at the primer at https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf. This primer describes a method for performing PCA using Singular Value Decomposition (SVD).

  2. A Python implementation of one of the main methods for performing PCA is located at http://sebastianraschka.com/Articles/2014_pca_step_by_step.html#checking-the-eigenvector-eigenvalue-calculation. This reference contains a second methodology (with Python code) that performs PCA using eigenvectors.

  3. We want the library to contain the two methods mentioned above to do PCA: one that computes eigenvectors, and one that uses SVD.

  4. Start by adapting the Python code in the second reference into your library (consider re-organizing it as part of a generic library), and test the eigenvector methodology; write unit tests to ensure that the results of this methodology are a true PCA, based on the properties they are supposed to have.

  5. Implement the methodology based on SVD, and compare your results with the eigenvector implementation from step 4. Create tests to ensure that both methodologies always provide the same result for some samples.
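The two methods in the steps above can be sketched side by side as follows. This is one possible shape under stated assumptions, not the reference solution: the names `pca_eig`, `pca_svd`, and `_fix_signs` are hypothetical, `np.linalg.eigh` is used because the covariance matrix is symmetric, and a sign convention is imposed because each component is only defined up to a factor of -1. The comparison assumes more observations than variables and distinct eigenvalues.

```python
import numpy as np

def _fix_signs(components):
    """Flip each column so that its largest-magnitude entry is positive."""
    idx = np.argmax(np.abs(components), axis=0)
    signs = np.sign(components[idx, np.arange(components.shape[1])])
    signs[signs == 0] = 1.0
    return components * signs

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return eigvals[order], _fix_signs(eigvecs[:, order])

def pca_svd(X):
    """PCA via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    # Singular values relate to covariance eigenvalues by s**2 / (n - 1)
    variances = s**2 / (X.shape[0] - 1)
    return variances, _fix_signs(vt.T)

# Correlated synthetic sample: 200 observations of 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

var_eig, comp_eig = pca_eig(X)
var_svd, comp_svd = pca_svd(X)
same_variances = np.allclose(var_eig, var_svd)
same_components = np.allclose(comp_eig, comp_svd)
```

Both functions return the component variances and a matrix whose columns are the principal directions, so the agreement check reduces to two `np.allclose` calls.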

Bonus

  1. Attached, you will find the daily price histories of 40 exchange traded funds (ETFs). The _1,…,_5 suffixes in the file denote different categories. Apply the PCA methods to the attached price data set as follows:
  2. Extract the first PC across all of the series using your library, with either of the two methods. We will refer to this as the global principal component.
  3. Now, for each sub-category, remove the effect of the global principal component from all series and perform another PCA on the residuals to extract the component structure of that category. How many components are necessary to get a good representation of the data, per group?
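The bonus steps might be sketched as below. Everything here is an assumption for illustration: the data is synthetic, the series names (with a category suffix after the last underscore) are invented, the helper names are hypothetical, and "good representation" is taken to mean 90% of the residual variance, which is only one possible criterion.

```python
import numpy as np

def first_component(X):
    """Unit vector of the first PC of X (rows = days, columns = series)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def remove_component(X, component):
    """Residuals after subtracting each row's projection onto `component`."""
    Xc = X - X.mean(axis=0)
    scores = Xc @ component
    return Xc - np.outer(scores, component)

def n_components_needed(X, threshold=0.9):
    """Number of leading PCs needed to reach `threshold` of the total variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return min(int(np.searchsorted(cumulative, threshold) + 1), len(cumulative))

# Synthetic returns: a common "market" factor plus noise, two invented categories
rng = np.random.default_rng(1)
names = [f"ETF{i}_{c}" for c in (1, 2) for i in range(3)]
market = rng.normal(size=(250, 1))
returns = 0.01 * (market + 0.5 * rng.normal(size=(250, len(names))))

global_pc = first_component(returns)
residuals = remove_component(returns, global_pc)

# Group column indices by the suffix after the last underscore
groups = {}
for j, name in enumerate(names):
    groups.setdefault(name.rsplit("_", 1)[1], []).append(j)

per_group = {g: n_components_needed(residuals[:, cols]) for g, cols in groups.items()}
```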

What we are looking for

  • Clear, re-usable code that is adequate to become part of a numerical routine library
  • Clear and maintainable unit tests; feel free to use whatever library you like for them!
  • Your code should be reasonably well-organized and easy to navigate for the reviewer. Be ready to walk a reviewer through your code!
  • Tests must be relevant, and should include tests of the mathematical properties of PCA (within reasonable floating-point precision)
  • We strongly encourage you to use the pandas and/or NumPy libraries extensively for this task; they contain functionality for manipulating series.
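To illustrate the kind of property tests meant above, here is a minimal sketch using the standard-library `unittest` module. The `pca` helper, the test class name, the tolerances, and the synthetic data are assumptions; a real test suite would import the function from your library instead of defining it inline.

```python
import unittest
import numpy as np

def pca(X):
    """Stand-in PCA: returns (variances, components as columns, scores)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order]
    return eigvals[order], components, Xc @ components

class TestPCAProperties(unittest.TestCase):
    def setUp(self):
        rng = np.random.default_rng(42)
        mixing = rng.normal(size=(4, 4))
        self.X = rng.normal(size=(300, 4)) @ mixing
        self.variances, self.components, self.scores = pca(self.X)

    def test_components_are_orthonormal(self):
        gram = self.components.T @ self.components
        np.testing.assert_allclose(gram, np.eye(4), atol=1e-10)

    def test_scores_are_uncorrelated(self):
        # Covariance of the scores must be diagonal, with the PC variances on it
        score_cov = np.cov(self.scores, rowvar=False)
        np.testing.assert_allclose(score_cov, np.diag(self.variances), atol=1e-10)

    def test_variances_sorted_descending(self):
        self.assertTrue(np.all(np.diff(self.variances) <= 0))

    def test_total_variance_preserved(self):
        total = np.var(self.X, axis=0, ddof=1).sum()
        self.assertAlmostEqual(self.variances.sum(), total, places=10)
```

Saved in a test module, these can be run with `python -m unittest`.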