Unveiling the Power of Self-Attention for Shipping Cost Prediction: Abstract and Introduction

14 Jun 2024


(1) P Aditya Sreekar, Amazon; these authors contributed equally to this work {sreekarp@amazon.com};

(2) Sahil Verma, Amazon; these authors contributed equally to this work {vrsahil@amazon.com};

(3) Varun Madhavan, Indian Institute of Technology, Kharagpur. Work done during internship at Amazon {varunmadhavan@iitkgp.ac.in};

(4) Abhishek Persad, Amazon {persadap@amazon.com}.


Abstract

Amazon ships billions of packages to its customers annually within the United States. The shipping cost of these packages is used on the day of shipping (day 0) to estimate the profitability of sales. Downstream systems utilize these day 0 profitability estimates to make financial decisions, such as setting pricing strategies and delisting loss-making products. However, obtaining accurate shipping cost estimates on day 0 is complex for reasons like delays in carrier invoicing or fixed cost components getting recorded at a monthly cadence. Inaccurate shipping cost estimates can lead to bad decisions, such as pricing items too low or too high, or promoting the wrong products to customers. Current solutions for estimating shipping costs on day 0 rely on tree-based models that require extensive manual engineering efforts. In this study, we propose a novel architecture called the Rate Card Transformer (RCT) that uses self-attention to encode all package shipping information, such as package attributes, carrier information, and route plan. Unlike other transformer-based tabular models, the RCT can encode a variable list of one-to-many relations of a shipment, allowing it to capture more information about a shipment. For example, the RCT can encode the properties of all products in a package. Our results demonstrate that cost predictions made by the RCT have 28.82% less error compared to a tree-based GBDT model. Moreover, the RCT outperforms the state-of-the-art transformer-based tabular model, FT-Transformer, by 6.08%. We also illustrate that the RCT learns a generalized manifold of the rate card that can improve the performance of tree-based models.

1. Introduction

Amazon ships packages on the order of billions annually to its customers in the United States alone. The route planning for these packages is done on the day of shipping, day 0. As part of this plan, the shipping cost for each package is estimated by breaking the package journey down into smaller legs and calculating the cost of each leg using a rate card. Day 0 cost estimates are used to compute initial profitability estimates for accounting purposes, e.g., the estimated profit/loss for each item as a result of a specific sale to a customer. These profitability estimates are used by several downstream services for decision making and planning.

However, the day 0 estimates may differ from the actual cost due to factors like improper rate card configuration, incorrect package dimensions, a wrong delivery address, etc. Inaccurate cost estimates cause skewed profitability estimates, which in turn lead to suboptimal financial decisions by downstream systems. For example, if the shipping cost of an item is consistently overestimated, the item could be removed from the catalog. On the other hand, an underestimated cost can lead pricing systems to lower the price of the item, leading to losses. Further, inaccurate estimation can also lead to promoting the wrong products to customers, causing a bad customer experience. To improve these shipping cost estimates, we propose a Transformer-based deep learning model that accurately predicts the shipping cost at day 0.

In the context of shipping, a package is characterized by its physical dimensions, weight, and contents. It also includes details about the carrier responsible for transporting it and the intended route. Additionally, a package is associated with a variable number of attributes that describe the item(s) inside and the various charges related to its shipment. Collectively, we refer to these attributes as the rate card associated with the package. For tabular datasets like package rate cards, tree-based models like Gradient Boosted Decision Trees (GBDT), XGBoost (Chen and Guestrin, 2016), etc., are considered state-of-the-art (SOTA). However, their effectiveness relies heavily on high-quality input features (Arik et al., 2019), which can require extensive feature engineering. For our use case, this problem is further accentuated by the fact that the target concept depends on high-order combinatorial interactions between rate card attributes. For example, if the rate card is improperly configured for large containers with flammable substances shipped from Washington DC to New York by carrier ABC, then the model has to learn to associate the property combination ⟨size = large, item = flammable, source = Washington, destination = New York, carrier = ABC⟩ with a high deviation between estimated and actual costs. When dealing with feature combinations, considering all possible higher-order interactions between package properties may be impractical, because the number of interactions grows exponentially with each increase in order, leading to the curse of dimensionality (Bishop, 2006). Another shortcoming of tree-based models is their inability to handle a variable-length list of features. A package may contain multiple items, and its shipping cost can be broken down into multiple charge types. Previous experiments demonstrated that adding features engineered from multiple items and charges improved GBDT performance. However, due to the inability of tree-based models to handle variable-length lists of features, the complete information contained in them could not be learned.
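The information loss described above can be made concrete with a small sketch. The code below is purely illustrative (the package contents and the aggregation features are invented for this example, not taken from the paper's pipeline): a tree model needs a fixed-width feature vector, so a variable-length item list must be collapsed into summary statistics, and two packages with very different contents can become indistinguishable.

```python
# Hypothetical illustration: variable-length item lists must be aggregated
# into fixed-width features for a tree model, discarding per-item detail.

def aggregate_items(items):
    """Fixed-width summary a GBDT could consume: (item count, total weight)."""
    return (len(items), sum(weight for _, weight in items))

# Each item is a (category, weight) pair; the values are toy examples.
package_a = [("book", 1.0), ("toy", 2.0)]
package_b = [("flammable", 2.0), ("glass", 1.0)]

# Aggregation discards which item categories were present together,
# so the two different packages collapse to identical features.
assert aggregate_items(package_a) == aggregate_items(package_b) == (2, 3.0)
```

A sequence model that consumes the item list directly, as the RCT does, avoids this collapse entirely.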

In this paper, inspired by the recent success of transformers in the tabular domain (Huang et al., 2020; Somepalli et al., 2021; Gorishniy et al., 2021), we propose a novel architecture called the Rate Card Transformer (RCT) to predict shipping cost on day 0. The proposed model is specifically designed to learn an embedding of the rate card associated with a package. The RCT leverages the self-attention mechanism to effectively capture the interdependencies between the various components of a rate card by learning interactions between input features. Specifically, our contributions in this work include:

• We propose a novel architecture, the Rate Card Transformer (RCT), which leverages the transformer architecture to learn a manifold of the rate card in order to predict shipping cost on day 0. Further, we demonstrate that the RCT outperforms both GBDTs and the state-of-the-art tabular transformer, FT-Transformer (Gorishniy et al., 2021), in shipping cost prediction.

• We perform extensive experiments to show that the learned embeddings are a sufficient representation of the rate card manifold and that self-attention layers are effective feature interaction learners. Ablation studies analyze the impact of the number of transformer layers and attention heads on model performance.
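To make the feature-interaction claim concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention over a rate card treated as a list of attribute embeddings. This is not the paper's implementation: it omits learned query/key/value projections, multiple heads, and training, and the embedding values are toy numbers. It only shows the structural point that every attribute attends to every other, so pairwise interactions are captured for any number of attributes.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention, no learned projections.

    Each token (rate-card attribute embedding) attends to all tokens, so the
    output for one attribute mixes in information from every other attribute,
    regardless of how many attributes the rate card has."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

# A toy "rate card": a variable-length list of 4-dim attribute embeddings.
rate_card = [[0.1, 0.9, 0.0, 0.2],   # e.g. package dimensions
             [0.8, 0.1, 0.3, 0.0],   # e.g. carrier
             [0.2, 0.2, 0.7, 0.5]]   # e.g. one item in the package

encoded = self_attention(rate_card)
assert len(encoded) == len(rate_card)  # one contextual embedding per attribute
```

Because the attention weights are a convex combination, each output embedding is a context-dependent blend of all attribute embeddings; stacking such layers (as transformer encoders do) composes these pairwise interactions into higher-order ones without enumerating them explicitly.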

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.