Global Oil Export Destination Prediction: A Machine Learning Approach.

Author: Jia, Haiying

    Oil is one of the most important raw materials and is the lifeblood of the global economy. Indeed, oil and gas provide over 50% of primary energy supplies to the world (IEA, 2019) and are the primary power source for transportation. The fourteen member states of the Organization of the Petroleum Exporting Countries (OPEC) control over 80% of world crude oil reserves (OPEC, 2016), while consumption is mainly driven by the OECD (Organisation for Economic Co-operation and Development) countries, China and India. The geographical separation of oil consuming and producing nations means that oil needs to be transported over great distances, with forty percent of annual global oil production transported via the oceans in specialised oil tankers (Clarksons, 2016; Adland et al., 2017). The global oil trade is greater than that of any other commodity in terms of value, and it was the world's first trillion-dollar industry in terms of annual sales (Doyle, 1994). Major oil producing countries such as Saudi Arabia, Norway, Nigeria and Venezuela derive much of their national income from the production of oil. For many other countries, such as the US and China, the cost of importing oil is a major component of their foreign exchange balance. Thus, oil trade and the price of oil are crucial factors for both national and foreign policy.

    At the macro level, oil trade flow (i.e. the spatial supply and demand balance) is driven by factors such as population growth, per capita energy usage, and structural changes (e.g. innovation in energy efficiency and the emergence of alternative sources of energy). At the micro level, the ultimate destination of oil exports is the result of a complex and dynamic system including, for instance, contractual agreements (long-term bilateral agreements and short-term commercial contracts), political factors (sanctions or restrictions), new pipelines and refineries, the use of storage, and regional price fluctuations.

    The objective of this paper is to predict the destination of oil exports at the micro level in a data-driven framework by utilizing actual oil shipment information and training machine learning algorithms based on supervised classification techniques. Using crude oil shipment data for the period January 2013 through mid-March 2016, we investigate how destinations are determined based on four attribute clusters: cargo information (such as sellers' identity, cargo grade and cargo size), vessel information (such as vessel identity and its technical specifications), geographic information (load terminals and ports), and macroeconomic data (e.g. regional oil prices and crack spreads). We train the machine learning algorithms on historical data and demonstrate the models' out-of-sample accuracy.
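
To make the idea of training on historical shipments and scoring out-of-sample concrete, the following is a minimal sketch, not the classifiers used in this paper. It assumes a handful of made-up shipment records, a hypothetical cutoff date, and a deliberately simple frequency baseline that predicts the most common past destination for each (load port, cargo grade) pair. All names and values are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy shipments: (load date, load port, cargo grade, destination).
# These records are invented for illustration and are not from the paper's data.
shipments = [
    ("2013-02-01", "port_x", "grade_1", "asia"),
    ("2013-06-15", "port_x", "grade_1", "asia"),
    ("2014-01-10", "port_x", "grade_2", "europe"),
    ("2014-03-05", "port_y", "grade_2", "europe"),
    ("2014-09-20", "port_y", "grade_2", "europe"),
    ("2015-04-02", "port_x", "grade_1", "asia"),
    ("2015-08-11", "port_y", "grade_2", "europe"),
    ("2016-01-25", "port_y", "grade_1", "asia"),
]


def split_by_date(records, cutoff):
    """Split chronologically so the test set lies strictly after training."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


def fit_frequency_baseline(train):
    """Learn the most frequent destination per (load port, cargo grade)."""
    counts = defaultdict(Counter)
    for _, port, grade, destination in train:
        counts[(port, grade)][destination] += 1
    # Fallback for unseen feature pairs: the overall most common destination.
    overall = Counter(destination for *_, destination in train)
    fallback = overall.most_common(1)[0][0]
    by_key = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return by_key, fallback


def predict(model, port, grade):
    by_key, fallback = model
    return by_key.get((port, grade), fallback)


def accuracy(model, test):
    """Share of held-out shipments whose destination is predicted correctly."""
    hits = sum(predict(model, port, grade) == destination
               for _, port, grade, destination in test)
    return hits / len(test)


train, test = split_by_date(shipments, "2015-01-01")  # hypothetical cutoff
model = fit_frequency_baseline(train)
out_of_sample_accuracy = accuracy(model, test)
print(f"out-of-sample accuracy: {out_of_sample_accuracy:.2f}")
```

Splitting by date rather than at random keeps future shipments out of the training set, which is what an out-of-sample evaluation of a forecasting model requires. A baseline of this kind is also a natural yardstick against which a richer classifier can be compared.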

    To our knowledge there is no comparable academic research in the oil trade domain. We contribute to the literature in at least three ways. Firstly, we contribute to the choice model methodology literature by applying cutting-edge machine learning techniques to the prediction problem. Compared to traditional discrete choice models, our approach lessens the dependence on often unrealistic statistical assumptions (such as factor independence) and remains completely data-driven thanks to the increasing availability of maritime big data. Secondly, the unique dataset of micro-level oil shipment information, which is primarily derived from the Automatic Identification System (AIS) for satellite tracking of vessels, provides a new and rich source of information on global oil trades. The high dimensionality of the attributes is key in training machine learning algorithms to predict trade patterns. Thirdly, the variety of machine learning models employed in this research provides a good combination of interpretability and accuracy.

    This last contribution is key in real life applications and, thus, our methodology is potentially important as a building block in commercial applications that deal with oil and freight market analysis. For instance, the public destination information in ship tracking data is known to be of low quality and can be easily manipulated. Accordingly, analysts that want to track cargoes as a proxy for economic activity or to estimate short-term regional supply of crude oil need a tool to benchmark such information against the likely outcome predicted from past trading patterns and micro data. Importantly, our work suggests that micro data is substantially more valuable for predictive oil trade models than observable macroeconomic data such as crack spreads and oil prices.

    The remainder of the paper is organized as follows. Section 2 reviews the relevant literature, which is followed by methodology in Section 3 and data description in Section 4. Section 5 describes the feature engineering process, Section 6 presents the results, and Section 7 concludes the paper.


    The use of discrete choice models, both binary and multinomial, has been the dominant method in modelling destination choice in transportation; see, for instance, Malchow and Kanafani (2004), Rich et al. (2009), Steven and Corsi (2012), Piendl et al. (2017), and Alizadeh et al. (2016). Discrete choice models in this context have dealt with destination choices for shopping trips (Timmermans, 1996; Wang and Lo, 2007), car purchases (Train, 1986; Train and Winston, 2007) or the demolishing of ships (Alizadeh et al. 2016). Professor Daniel McFadden won the 2000 Nobel prize for his development of theory and methods for analyzing discrete choices (Manski and McFadden, 1981; McFadden, 1974, 1989; McFadden and Train, 2000). In these models, the choice made by entities (a person, firm or industry) is statistically related to the attributes of the choices. For example, the choice of which port a liner shipping company uses is statistically related to the port service level, vessel sizes, cargo information, and other attributes of each available alternative. The models estimate the probability that a particular alternative is chosen using econometric methods such as parametric models (see, for example, Allenby and Rossi, 1998; Andrews et al. 2002; Hensher and Greene, 2003) or nonparametric models (see, for example, McLachlan and Peel, 2005; Train, 2008). As an extension, the model is naturally used to predict how choices will change when the attributes of the alternatives change. However, the imposed statistical model cannot possibly include all the factors or information that lead to decisions, as their determinants are only partially observed or imperfectly measured. Therefore, discrete choice models rely on statistical assumptions and specifications to account for, for example, individual taste differences (Vij and Krueger, 2017). Traditional statistical techniques were designed for relatively small datasets with standardized structures, i.e. similar types of variables. The underlying assumption is that the relationship is homogeneous, that is, the same relationship between variables holds across the entire measurement space. This leads to models where only a few parameters are necessary to trace the effects of the various factors involved (Breiman et al. 1998).
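
As an illustration of how such a model maps attributes to choice probabilities, the sketch below uses the standard multinomial logit form, in which the probability of alternative i is exp(V_i) divided by the sum of exp(V_j) over all alternatives. It is not the specification estimated in this paper: the port names, attribute values, coefficients and the linear utility function are all hypothetical assumptions chosen for readability.

```python
import math


def linear_utility(attributes, coefficients):
    """Systematic utility as a weighted sum of attributes (an assumed form)."""
    return sum(coefficients[name] * value for name, value in attributes.items())


def choice_probabilities(utilities):
    """Multinomial logit: P(i) = exp(V_i) / sum_j exp(V_j)."""
    # Subtracting the maximum utility avoids overflow and leaves the
    # resulting probabilities unchanged.
    max_u = max(utilities.values())
    exp_u = {alt: math.exp(u - max_u) for alt, u in utilities.items()}
    total = sum(exp_u.values())
    return {alt: e / total for alt, e in exp_u.items()}


# Hypothetical ports with made-up attribute values.
ports = {
    "port_a": {"service_level": 0.8, "port_charge": 1.2},
    "port_b": {"service_level": 0.6, "port_charge": 0.9},
    "port_c": {"service_level": 0.9, "port_charge": 1.5},
}
# Hypothetical coefficients: better service raises utility, charges lower it.
coefficients = {"service_level": 2.0, "port_charge": -1.0}

utilities = {port: linear_utility(attrs, coefficients)
             for port, attrs in ports.items()}
probabilities = choice_probabilities(utilities)

for port, p in probabilities.items():
    print(f"{port}: {p:.3f}")
```

Changing an attribute of one alternative, for example lowering the charge at one port, and recomputing the probabilities shows how the model predicts shifts in choices when attributes change.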

    As a result of the increasing availability of information and the exponential growth in data in recent years, machine learning methods have been gaining popularity in various areas due to their ability to model large amounts of data without explicitly imposing a statistical model form. The term "machine learning" was coined by Samuel (1959), who suggests that computers can be programmed to "behave in a way which, if done by human beings or animals, would be described as involving the process of learning". Machine learning typically refers to the scientific study of algorithms that computer systems use to progressively improve their performance on a specific task (Bishop, 2006). Machine learning is today used in various research areas such as image recognition for oil spills (Kubat et al. 1998), cancer prediction (Cruz and Wishart, 2006), information extraction (Freitag, 2000), and biology (Kampichler et al. 2010).

    A large dataset not only involves a large number of observations for many variables, but also has high complexity in the data structure. This may include high dimensionality, a mixture of data types and nonstandard data structure (Breiman et al. 1998). High dimensionality in machine learning means there is a large number of attributes, which can be features required to represent data, or independent parameters. In this case, the number of observations may be smaller, but the rich information for each observation leads to high dimensionality, which demands better handling of the data. Mathematically, the number of parameters needed to specify a normal distribution in M dimensions grows on the order of M², unless one makes the very strong assumption that the variables are independent (the typical i.i.d. assumption in traditional statistical models). Indeed, owing to the complex impact of high dimensionality on statistics, mathematicians have termed it "the curse of dimensionality" (Bellman, 1961). With accelerating computer capability, the analysis of complex high dimensional databases with mixed data types is increasingly feasible without imposing a model structure a priori. Micro-level oil shipment data represents exactly such a dataset, which motivates our choice of the machine learning methodology.
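
The growth in parameters can be checked with a short calculation. The sketch below counts the free parameters of a normal distribution in M dimensions: M means plus M(M+1)/2 distinct entries of the symmetric covariance matrix, compared with M means plus M variances when the variables are assumed independent. The function name is a hypothetical helper introduced only for this illustration.

```python
def gaussian_parameter_count(m, independent=False):
    """Number of free parameters of a normal distribution in m dimensions."""
    mean_params = m
    if independent:
        # Independence implies a diagonal covariance: one variance per dimension.
        cov_params = m
    else:
        # Symmetric covariance matrix: m variances plus m*(m-1)/2 covariances.
        cov_params = m * (m + 1) // 2
    return mean_params + cov_params


for m in (2, 10, 100):
    full = gaussian_parameter_count(m)
    diag = gaussian_parameter_count(m, independent=True)
    print(f"M={m}: full covariance={full}, independent={diag}")
```

For M = 100 the full model needs 5150 parameters against 200 under independence, which illustrates why the independence assumption is attractive for small datasets and why it becomes restrictive when attributes are rich and correlated.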


    3.1 Multinomial Logit Model

    Discrete choice models that are based on utility maximization theory have gained popularity in transportation research, where the family of these models is typically used to predict individual choices in transport mode and routes. There is an extensive literature on the development of discrete choice models (see Cirillo and Xu, 2011, for a review). For our purpose, one of the most widely used models--the multinomial Logit model...
