Data science boom in commodities? These are the challenges and recommendations
* Big differences exist between using machine learning in commodity trading and its usual applications in IT/finance environments.
* While machine learning in equity trading is prevalent, its use in commodity trading can be more challenging.
* Framing the right questions, along with innovation in data collection and feature engineering, matters more than sophisticated algorithms.
Most commodity companies, whether trading houses or mining giants, have started their journey in artificial intelligence. In addition to common applications such as predictive maintenance, many have also started to use machine learning to predict commodity prices.
However, there are reasons why machine learning flourished first in IT, then in finance areas such as credit rating and, more recently, equity trading, but arrived late in the commodity industry.
Putting aside non-technical factors, the major reason is that machine learning turns from science into a combination of science and dark art when it comes to most questions in the commodity industry.
Most machine learning methodologies, such as random forests, support vector machines, neural networks, etc., are designed for independent and identically distributed (i.i.d.) data. A slight violation still works, but as data moves further away from the i.i.d. premise, machine learning requires more human intervention to simulate an i.i.d. environment, and this process is highly subjective.
To illustrate what i.i.d. means, consider Amazon’s recommendation system. Samples of each buyer’s activity are close to independent and identically distributed, because it is a closed system with a stable pattern, at least in the short term. As a result, the purchase data of similar users can be used for prediction.
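The distinction can be made concrete with a tiny simulation. Below is a sketch (entirely synthetic data) comparing a series of independent draws, analogous to unrelated purchase events, with a random-walk "price" series, where today's level depends on yesterday's. The lag-1 autocorrelation is near zero for the first and near one for the second:

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Close to i.i.d.: independent draws, like unrelated purchase events.
white_noise = rng.normal(size=5000)

# Non-i.i.d.: a random-walk "price" level, where today depends on yesterday.
price = np.cumsum(rng.normal(size=5000))

print(f"white-noise lag-1 autocorrelation: {lag1_autocorr(white_noise):.2f}")
print(f"random-walk lag-1 autocorrelation: {lag1_autocorr(price):.2f}")
```

A model that assumes i.i.d. samples treats every row as fresh evidence; the second series shows why that assumption fails for price levels.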
Non-i.i.d. data, however, is quite different. To illustrate, take the example of predicting the weekly or daily change of the copper/aluminium price ratio. The challenges can be summarized as A) time-series correlation; B) an open system; C) regime changes (unstable patterns).
First, time correlation. Time-series data are inherently non-i.i.d., as what happens today is correlated with what happened in the past. One solution is feature engineering, a typical dark art in data science. For instance, to reduce the correlation, you can take differences. But what if you assume there is a momentum driver? Then the model needs the moving averages of differences over the last 2 weeks, maybe up to 4 weeks. Moreover, you may want to assign higher weights to the last 2 weeks and lower weights to older dates. Now imagine there are 100 variables, and this variable generation is done 100 times. In the end, you may find a group of leading indicators, but you cannot be sure whether they are scientific or just spurious correlations. This laborious process of feature engineering is the key to many machine learning problems, yet it is especially challenging for time series. (As Andrew Ng said, applied machine learning is basically feature engineering. To see some samples of the dark art, a good report to refer to is the ‘101 Alpha’; it is darker than the JP Morgan guide on machine learning.) In addition, model validation can also be subjective for time series, for example when choosing how to set up rolling windows.
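A minimal sketch of the feature engineering described above, using pandas on a synthetic price-ratio series (the window lengths and column names are illustrative assumptions, not a recipe): differencing to reduce the time correlation, moving averages of differences for the momentum hypothesis, and an exponentially weighted mean to favour recent dates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Illustrative daily copper/aluminium price-ratio series (a random walk).
idx = pd.bdate_range("2017-01-02", periods=500)
ratio = pd.Series(np.cumsum(rng.normal(scale=0.01, size=500)) + 3.0, index=idx)

features = pd.DataFrame(index=idx)

# 1) Differencing reduces the time correlation of the raw level.
features["d1"] = ratio.diff()

# 2) Momentum hypothesis: moving averages of the differences over
#    2- and 4-week windows (10 and 20 business days).
features["ma_diff_2w"] = features["d1"].rolling(10).mean()
features["ma_diff_4w"] = features["d1"].rolling(20).mean()

# 3) Recency hypothesis: an exponentially weighted mean gives the last
#    ~2 weeks higher weight than older dates.
features["ewm_diff_2w"] = features["d1"].ewm(halflife=10).mean()

print(features.dropna().head())
```

Each choice here (difference vs. level, 2 vs. 4 weeks, the half-life) is a subjective judgment; repeated across 100 variables, it is exactly the combinatorial dark art, and the multiple-testing risk, that the paragraph above describes.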
Second, the open system. Although using a price spread reduces the impact of macro-economic factors, the analyst may still feel that copper prices are more reactive to macro data than aluminium prices, especially on the upside, due to differences in supply-side elasticity. But once macro indicators are added, Pandora’s box is opened. For instance, if China PMI is added, then maybe M2 should be added, because credit availability affects firms’ purchasing decisions. If M2 is added, CPI should also be added to smooth out the inflation impact. If CPI is added, how about PPI, since it could be a leading indicator of CPI… Stop too early and you may miss features; stop too late and you create more spurious correlations from the astronomical number of links in the global economy.
Third, regime changes. Metal prices have been driven by different factors in recent years: the rise of China and Chinese commodity funds, copper financing, the launch of the LME aluminium premium contract, China’s winter capacity cutbacks, even coal policies that affect aluminium costs, etc. These changes create a lot of noise that is hard to remove. It is easy to drop a few outliers during black swans (algorithms can do it automatically, by balancing over-fitting and under-fitting), but if there were 10 black swans of different sizes and natures within a 3-year period, no algorithm learns effectively from such weak and changing patterns. The challenge is similar to training a self-driving vehicle on Earth and then changing the terrain to the Moon.
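The Earth-to-Moon problem can be shown on synthetic data. In this hypothetical sketch, a driver pushes the spread up in regime 1 (say, pre-policy-change) and down in regime 2; a least-squares model fitted on the first regime looks accurate in-sample, then its error explodes when the relationship flips:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_slope(x, y):
    """Ordinary least-squares slope through the origin."""
    return (x @ y) / (x @ x)

n = 1000
x = rng.normal(size=n)

# Regime 1: the driver pushes the spread up (e.g. before a policy change).
y_regime1 = 2.0 * x + rng.normal(scale=0.5, size=n)
# Regime 2: the same driver now pushes it down (after the change).
y_regime2 = -2.0 * x + rng.normal(scale=0.5, size=n)

slope = fit_slope(x, y_regime1)  # "trained on Earth"

mse_in_regime = np.mean((y_regime1 - slope * x) ** 2)
mse_new_regime = np.mean((y_regime2 - slope * x) ** 2)  # "deployed on the Moon"

print(f"MSE within the training regime: {mse_in_regime:.2f}")
print(f"MSE after the regime change:    {mse_new_regime:.2f}")
```

No outlier filter fixes this, because nothing in the training data is anomalous; the world itself changed, which is why frequent regime shifts defeat pattern-learning algorithms.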
The i.i.d. premise is not perfect for equity trading either, but it is still easier in equities than in commodities to simulate a reasonably stable pattern, due to the differences between the two:
First, the nature. Commodity prices are often driven by heterogeneous factors, and the number of traded commodities is limited compared to equities. This keeps training datasets small and makes some equity strategies unavailable or impractical.
Second, the economics. The value and frequency of decision-making are different. In equity trading at hedge funds, multi-million-dollar decisions are made frequently, but in commodity companies major trading decisions are made on a daily, if not weekly, basis. Lower frequency and lower dollar value per decision make algorithms less economical in some situations.
Third, the evaluation. Evaluating the performance of algorithms in hedge funds is much easier, as it can be based purely on P&L. But in commodity trading there are other physical-market factors behind P&L, such as premiums, spreads, and arbitrage, that make the calculation subjective, not to mention that some trades are done for strategic, diversification, or marketing reasons rather than to maximize short-term P&L, and some are integrated with asset-backed trades. As the saying goes, you can always torture the data until it confesses.
These challenges do not mean that machine learning is not applicable to commodity companies; it is just more dark art than science compared with its application in other areas. That is exactly why the industry has only recently started to embrace it, and still faces great uncertainty about where and how to apply it.
Before investing in data science, companies should be clear about the targeted problems and prioritize them by expected value. In general, they can be categorized as:
A) Productivity improvement from information management;
B) Conventional business analytics (i.i.d datasets, such as predictive maintenance and sales constraint optimization);
C) Predictive analytics for non-i.i.d. datasets (e.g. short-term market forecasts).
The first does not require data scientists; IT engineers are a good fit, and limited domain knowledge is required.
The second requires data scientists with some domain knowledge.
The third requires data scientists with more extensive domain knowledge, to use data smartly, design the system, engineer features, and monitor the validation and evaluation process.