Algorithms

Gorse implements various recommendation algorithms, both non-personalized and personalized. These algorithms are the building blocks that compose the recommendation workflow.

Math formulas are used to introduce complex algorithms. The math symbols used in this section are defined as follows:

| Symbol | Meaning |
| --- | --- |
| $n$ | <cache_size> |
| $R$ | The set of positive feedbacks. |
| $\not R$ | The set of negative feedbacks. |
| $I$ | The set of items. |
| $I_u$ | The set of favorite items of user $u$. |
| $I_l$ | The set of items with label $l$. |
| $U$ | The set of users. |
| $U_i$ | The set of users who favor item $i$. |
| $U_l$ | The set of users with label $l$. |
| $L_u$ | The labels of user $u$. |
| $L_i$ | The labels of item $i$. |
| $L_U$ | The labels used by all users. |
| $L_I$ | The labels used by all items. |
| $\mathcal{N}_u$ | The neighbors of user $u$. |
| $\mathcal{N}_i$ | The neighbors of item $i$. |
| $\mathbb{I}(p)$ | Equals 1 if condition $p$ is satisfied, otherwise equals 0. |

Tips

The <cache_size> comes from the configuration file:

[recommend]

# The cache size for recommended/popular/latest items. The default value is 100.
cache_size = 100

Non-personalized Algorithms

Non-personalized algorithms recommend the same content to all users.

Latest Items

Showing the latest items to users according to their timestamps allows new items to be exposed to users promptly. To enable the latest recommender in Gorse, you need to set timestamp information for items. Without timestamps, Gorse will not generate a list of the latest items.

The latest items recommendation is equivalent to the following SQL:

select item_id from items order by time_stamp desc limit <cache_size>;

Popular Items

Many websites show recently popular items to users, such as Twitter trending. The popular items recommendation is equivalent to the following SQL:

select item_id from (
    select item_id, count(*) as feedback_count from feedback 
    where feedback_type in <positive_feedback_types> 
        and time_stamp >= NOW() - INTERVAL <popular_window> 
    group by item_id) t
order by feedback_count desc limit <cache_size>;

Tips

The <popular_window> in the configuration file corresponds to the window of popular items.

[recommend.popular]

# The time window of popular items. The default value is 4320h.
popular_window = "720h"

Similarity Algorithms

In some scenarios, users like specific types of items; for example, gamers like puzzle games and young children like to watch cartoons.

Item Similarity

An item is similar to another item if they share more common users or labels than other items. Cosine similarity is ideal for capturing the similarity between two items. Since the importance of each user or label is different, we use IDF to calculate the weight of each user or label.

  • Label weight: The label weight is defined as

$$
w_l = -\log\left(\frac{|I_l|}{|I|}\right)
$$

If a label is used by more items, this label is more generic and has a lower weight.

  • User weight: The user weight is defined as

$$
w_u = -\log\left(\frac{|I_u|}{|I|}\right)
$$

If a user has more favorite items, this user has broader preferences and therefore a lower weight.

Based on label weights and user weights, Gorse implements three similarity algorithms:

  • Similar: Calculates similarity based on label overlap between items.

$$
s_{ij} = \frac{\sum_{l\in L_i \cap L_j}w_l}{\sqrt{\sum_{l\in L_i}w_l^2}\sqrt{\sum_{l\in L_j}w_l^2}}
$$

  • Related: Calculates similarity based on user overlap between items.

$$
s_{ij} = \frac{\sum_{u\in U_i \cap U_j}w_u}{\sqrt{\sum_{u\in U_i}w_u^2}\sqrt{\sum_{u\in U_j}w_u^2}}
$$

  • Automatic: Calculates similarity based on both label overlap and user overlap between items.

$$
s_{ij} = \frac{\sum_{l\in L_i \cap L_j}w_l + \sum_{u\in U_i \cap U_j}w_u}{\sqrt{\sum_{l\in L_i}w_l^2 + \sum_{u\in U_i}w_u^2}\sqrt{\sum_{l\in L_j}w_l^2 + \sum_{u\in U_j}w_u^2}}
$$

For each item, the top $n$ items with the highest similarity are cached as neighbors of this item. The algorithm for item similarity can be set in the configuration file.

[recommend.item_neighbors]

# The type of neighbors for items. There are three types:
#   similar: Neighbors are found by number of common labels.
#   related: Neighbors are found by number of common users.
#   auto: Neighbors are found by number of common labels and users.
# The default value is "auto".
neighbor_type = "similar"

If items have high-quality labels, similar is the best choice. If items have no labels, use related. For other situations, consider auto.
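
To make the weighted cosine similarity concrete, here is a minimal Python sketch of the auto variant. The in-memory dictionaries (item_labels, item_users) are assumptions for illustration, not Gorse's internal data structures.

```python
import math
from collections import defaultdict

def item_similarity(item_labels, item_users, num_items):
    """Sketch of the IDF-weighted cosine similarity ("auto" variant) between items.

    item_labels: dict item_id -> set of labels
    item_users: dict item_id -> set of users who gave positive feedback on the item
    """
    # IDF weight of each label: w_l = -log(|I_l| / |I|)
    label_count = defaultdict(int)
    for labels in item_labels.values():
        for l in labels:
            label_count[l] += 1
    label_weight = {l: -math.log(c / num_items) for l, c in label_count.items()}

    # IDF weight of each user: w_u = -log(|I_u| / |I|)
    user_count = defaultdict(int)
    for users in item_users.values():
        for u in users:
            user_count[u] += 1
    user_weight = {u: -math.log(c / num_items) for u, c in user_count.items()}

    def norm(item):
        return math.sqrt(
            sum(label_weight[l] ** 2 for l in item_labels.get(item, ()))
            + sum(user_weight[u] ** 2 for u in item_users.get(item, ()))
        )

    def similarity(i, j):
        common_labels = item_labels.get(i, set()) & item_labels.get(j, set())
        common_users = item_users.get(i, set()) & item_users.get(j, set())
        numerator = sum(label_weight[l] for l in common_labels) \
            + sum(user_weight[u] for u in common_users)
        denominator = norm(i) * norm(j)
        return numerator / denominator if denominator > 0 else 0.0

    return similarity
```

The similar and related variants follow by keeping only the label terms or only the user terms, respectively.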

User Similarity

The computation of user similarity is similar to the computation of item similarity. First, we use IDF to calculate the weight of each item or label.

  • Label weight: The label weight is defined as

$$
w_l = -\log\left(\frac{|U_l|}{|U|}\right)
$$

If a label is used by more users, this label is more generic and has a lower weight.

  • Item weight: The item weight is defined as

$$
w_i = -\log\left(\frac{|U_i|}{|U|}\right)
$$

If an item is favored by more users, it is more popular, but it has a lower weight for characterizing users' preferences.

Based on label weights and item weights, Gorse implements three similarity algorithms:

  • Similar: Calculates similarity based on label overlap between users.

$$
s_{uv} = \frac{\sum_{l\in L_u \cap L_v}w_l}{\sqrt{\sum_{l\in L_u}w_l^2}\sqrt{\sum_{l\in L_v}w_l^2}}
$$

  • Related: Calculates similarity based on item overlap between users.

$$
s_{uv} = \frac{\sum_{i\in I_u \cap I_v}w_i}{\sqrt{\sum_{i\in I_u}w_i^2}\sqrt{\sum_{i\in I_v}w_i^2}}
$$

  • Automatic: Calculates similarity based on both label overlap and item overlap between users.

$$
s_{uv} = \frac{\sum_{l\in L_u \cap L_v}w_l + \sum_{i\in I_u \cap I_v}w_i}{\sqrt{\sum_{l\in L_u}w_l^2 + \sum_{i\in I_u}w_i^2}\sqrt{\sum_{l\in L_v}w_l^2 + \sum_{i\in I_v}w_i^2}}
$$

For each user, the top $n$ users with the highest similarity are cached as neighbors of this user. The algorithm for user similarity can be set in the configuration file.

[recommend.user_neighbors]

# The type of neighbors for users. There are three types:
#   similar: Neighbors are found by number of common labels.
#   related: Neighbors are found by number of common favorite items.
#   auto: Neighbors are found by number of common labels and favorite items.
# The default value is "auto".
neighbor_type = "similar"

If users have high-quality labels, similar is the best choice. If users have no labels, use related. For other situations, consider auto.

Clustering Index

Gorse needs to generate neighbors for each user and item, but this is an expensive procedure. The complexity of generating neighbors for all items by brute force is $O(|I|^2)$ (for simplicity, assume the complexity of the similarity calculation is constant). The clustering index[1] searches neighbors for each item more efficiently, with an acceptable loss of precision. Using the clustering index consists of two steps:

  1. Clustering: The spherical k-means algorithm is used to cluster items into $K$ clusters with centroids $c_i$ ($i\in\{1,\dots,K\}$). Each item $j$ then belongs to the $a_j$-th cluster.
  • $a_j \leftarrow \text{rand}(K)$
  • while $c_i$ or $a_j$ changed in the previous step do
    • $c_i \leftarrow$ the centroid of the items belonging to the $i$-th cluster
    • $a_j \leftarrow \arg\max_{i\in\{1,\dots,K\}}s_{c_i j}$
  • end while
  2. Search: To search neighbors for an item $i'$:
  • Step 1, find the $L$ nearest centroids to item $i'$.
  • Step 2, find the $n$ nearest items to item $i'$ within these $L$ clusters.

The time complexity is $O(|I|TK+|I|L/K)$, where $T$ is the maximal number of iterations of the clustering algorithm. In the Gorse implementation, $K = \sqrt{|I|}$, so the time complexity becomes $O(\sqrt{|I|}(|I|T+L))$. Hence, the clustering index is more efficient than brute force when $T \ll \sqrt{|I|}$.
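
The search step can be illustrated with a small sketch, assuming items and centroids are available as normalized numpy vectors so that similarity reduces to a dot product; the function and parameter names are hypothetical.

```python
import numpy as np

def cluster_index_search(query, centroids, assignments, items, L, n):
    """Approximate neighbor search with a clustering index (sketch).

    query: (d,) normalized vector of the query item
    centroids: (K, d) normalized cluster centroids
    assignments: (num_items,) cluster id of each item
    items: (num_items, d) normalized item vectors
    L: number of clusters to probe, n: number of neighbors to return
    """
    # Step 1: find the L nearest centroids to the query item.
    centroid_scores = centroids @ query
    nearest_clusters = np.argsort(-centroid_scores)[:L]

    # Step 2: scan only the items in those L clusters and keep the top n.
    mask = np.isin(assignments, nearest_clusters)
    candidate_ids = np.where(mask)[0]
    candidate_scores = items[candidate_ids] @ query
    return candidate_ids[np.argsort(-candidate_scores)[:n]]
```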

The clustering index can be switched on and off by enable_index, which is turned on by default. The clustering index requires the parameter $L$, the number of clusters to query. If $L$ is too small, the index fails to reach the required recall; if $L$ is too large, performance degrades. The construction process tries to increase $L$ until the query recall reaches index_recall or the number of growth epochs reaches index_fit_epoch.

[recommend.item_neighbors]

# Enable approximate item neighbor searching using vector index. The default value is true.
enable_index = true

# Minimal recall for approximate item neighbor searching. The default value is 0.8.
index_recall = 0.8

# Maximal number of fit epochs for approximate item neighbor searching vector index. The default value is 3.
index_fit_epoch = 3

[recommend.user_neighbors]

# Enable approximate user neighbor searching using vector index. The default value is true.
enable_index = true

# Minimal recall for approximate user neighbor searching. The default value is 0.8.
index_recall = 0.8

# Maximal number of fit epochs for approximate user neighbor searching vector index. The default value is 3.
index_fit_epoch = 3

Personalized Algorithms

There are many fancy recommendation algorithms these days, and most of them are based on deep learning[2]. However, we believe traditional methods without deep learning are sufficient to achieve promising recommendation performance.

Item Similarity-based Recommendation

For a user $u$ with favorite items $I_u$, if we know the similarity between any pair of items, the probability that user $u$ likes an item $i$ is predicted by the sum of similarities between item $i$ and the favorite items:

$$
\hat{y}_{ui}=\sum_{j\in I_u}s_{ij}
$$

where $s_{ij}$ is the similarity between item $i$ and item $j$. For a user, the time complexity of searching the top recommended items over the full item set is $O(|I_u||I|)$. In practice, the similarity is zero for most pairs of items, so we can search for recommended items among the neighbors of the favorite items instead:

  1. Collect candidates from neighbors of favorite items.

$$
C = \bigcup_{j\in I_u}\mathcal{N}_j
$$

  2. For each item $i$ in $C$, calculate the prediction score by

$$
\hat{y}_{ui}=\sum_{j\in I_u}s_{ij}\mathbb{I}(i\in\mathcal{N}_j)
$$

The indicator $\mathbb{I}(i\in\mathcal{N}_j)$ means the similarity is added to the prediction score only if $i$ is in the neighbors of item $j$. Since Gorse only caches the top $n$ neighbors of each item and their similarities, many similarities are missing from the cache. The time complexity of the optimized algorithm is $O(|I_u|n)$.
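
The optimized scoring can be sketched as follows; the cached-neighbor dictionary is an assumption for illustration, not Gorse's cache format.

```python
from collections import defaultdict

def recommend_by_item_neighbors(favorite_items, neighbors, n=10):
    """Item similarity-based recommendation over cached neighbors (sketch).

    favorite_items: set of item ids favored by the user (I_u)
    neighbors: dict item_id -> list of (neighbor_id, similarity), the cached top-n neighbors
    """
    scores = defaultdict(float)
    for j in favorite_items:
        for i, s_ij in neighbors.get(j, []):
            if i not in favorite_items:   # skip items the user already likes
                scores[i] += s_ij         # add s_ij only when i is a neighbor of j
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:n]]
```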

User Similarity-based Recommendation

The user similarity-based recommendation works analogously to the item similarity-based recommendation:

  1. Collect candidates from favorite items of neighbors of the user.

$$
C = \bigcup_{v\in\mathcal{N}_u}I_v
$$

  2. For each item $i$ in $C$, calculate the prediction score by

$$
\hat{y}_{ui} = \sum_{v\in\mathcal{N}_u}s_{uv}\mathbb{I}(i\in I_v)
$$

The indicator $\mathbb{I}(i\in I_v)$ means the similarity is added to the prediction score only if $i$ is favored by user $v$. The time complexity is $O(|I_u|n)$ as well.
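
A matching sketch for the user-based variant, again with illustrative data structures:

```python
from collections import defaultdict

def recommend_by_user_neighbors(user_favorites, neighbor_users, favorites_of, n=10):
    """User similarity-based recommendation (sketch).

    user_favorites: set of items already favored by the target user (I_u)
    neighbor_users: list of (user_id, similarity), the cached neighbors N_u
    favorites_of: dict user_id -> set of favorite items (I_v)
    """
    scores = defaultdict(float)
    for v, s_uv in neighbor_users:
        for i in favorites_of.get(v, set()):
            if i not in user_favorites:   # skip items already favored by the target user
                scores[i] += s_uv         # add s_uv only when v favors item i
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:n]]
```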

Matrix Factorization

In matrix factorization models, items and users are represented by embedding vectors. The probability that a user $u$ likes an item $i$ is predicted by the dot product of the two vectors:

$$
\hat y_{ui}=\mathbf{p}_u^T \mathbf{q}_i
$$

where $\mathbf{p}_u$ is the embedding vector of user $u$, and $\mathbf{q}_i$ is the embedding vector of item $i$. The matrix factorization model is simple, but there is more than one training algorithm.

BPR

BPR[3] (Bayesian Personalized Ranking) is a pairwise training algorithm. The training data for BPR consist of a set of triples:

$$
D_s=\{(u,i,j)|i\in I_u\wedge j\in I \setminus I_u\}
$$

The semantics of $(u, i, j) \in D_s$ is that user $u$ is assumed to prefer $i$ over $j$. Negative cases are regarded only implicitly.

The Bayesian formulation of finding the correct personalized ranking for all items is to maximize the following posterior probability, where $\Theta$ represents the parameter vector of the matrix factorization model:

$$
p(\Theta|>_u) \propto p(>_u|\Theta)p(\Theta)
$$

where $>_u$ is the desired but latent preference of user $u$. All users are presumed to act independently of each other. BPR also assumes that the ordering of each pair of items $(i, j)$ for a specific user is independent of the ordering of every other pair. Hence, the user-specific likelihood function $p(>_u|\Theta)$ can first be rewritten as a product of single densities and then combined over all users $u \in U$:

$$
\prod_{u\in U}p(>_u|\Theta)=\prod_{(u,i,j)\in D_s}p(i>_u j|\Theta)
$$

where $p(i>_u j|\Theta)=\sigma(\hat y_{uij})$ and $\hat y_{uij}=\hat y_{ui} - \hat y_{uj}$.

For the parameter vector, BPR introduces a general prior density $p(\Theta)$, which is a normal distribution with zero mean and variance-covariance matrix $\Sigma_\Theta$:

$$
p(\Theta) \sim N(0,\Sigma_\Theta)
$$

where $\Sigma_\Theta = \lambda_\Theta I$. Then the optimization criterion for personalized ranking is

$$
\begin{split}
\text{BPR-OPT}&=\ln p(\Theta|>_u)\\
&=\ln p(>_u|\Theta)p(\Theta)\\
&=\ln\prod_{(u,i,j)\in D_s}\sigma(\hat y_{uij})p(\Theta)\\
&=\sum_{(u,i,j)\in D_s}\ln \sigma(\hat y_{uij})+\ln p(\Theta)\\
&=\sum_{(u,i,j)\in D_s}\ln \sigma(\hat y_{uij})-\lambda_\Theta\|\Theta\|^2
\end{split}
$$

The gradient of BPR-OPT with respect to the model parameters is:

$$
\begin{split}
\frac{\partial\text{BPR-OPT}}{\partial\Theta}&=\sum_{(u,i,j)\in D_s}\frac{\partial}{\partial\Theta}\ln\sigma(\hat y_{uij})-\lambda_\Theta\frac{\partial}{\partial\Theta}\|\Theta\|^2\\
&\propto\sum_{(u,i,j)\in D_s}\frac{-e^{-\hat y_{uij}}}{1+e^{-\hat y_{uij}}}\cdot \frac{\partial}{\partial\Theta}\hat y_{uij}-\lambda_\Theta\Theta
\end{split}
$$

A stochastic gradient-descent algorithm based on a bootstrap sampling of training triples is as follows:

  • initialize $\Theta$
    • repeat
      • draw $(u,i,j)$ from $D_s$
      • $\Theta\leftarrow\Theta+\alpha\left(\frac{e^{-\hat y_{uij}}}{1+e^{-\hat y_{uij}}}\cdot \frac{\partial}{\partial\Theta}\hat y_{uij}+\lambda_\Theta\Theta\right)$
    • until convergence
  • return $\Theta$

where $\alpha$ is the learning rate. The derivatives with respect to the embedding parameters are

$$
\frac{\partial}{\partial\theta}\hat y_{uij}=\begin{cases}
(q_{if}-q_{jf})&\text{if }\theta=p_{uf}\\
p_{uf}&\text{if }\theta=q_{if}\\
-p_{uf}&\text{if }\theta=q_{jf}\\
0&\text{else}
\end{cases}
$$
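
A minimal numpy sketch of this training loop follows, assuming integer user/item ids, randomly initialized embedding matrices P and Q, and bootstrap sampling of triples; names and defaults are illustrative, and the regularization sign follows the gradient of BPR-OPT above.

```python
import numpy as np

def bpr_epoch(P, Q, user_items, num_items, lr=0.05, reg=0.01, num_steps=10000, rng=None):
    """One epoch of BPR with bootstrap sampling of (u, i, j) triples (sketch).

    P: (num_users, k) user embeddings, Q: (num_items, k) item embeddings
    user_items: dict user_id -> set of positive item ids (I_u)
    """
    rng = rng or np.random.default_rng()
    users = list(user_items.keys())
    for _ in range(num_steps):
        u = rng.choice(users)
        i = rng.choice(list(user_items[u]))          # positive item
        j = rng.integers(num_items)                  # sample a negative item
        while j in user_items[u]:
            j = rng.integers(num_items)
        y_uij = P[u] @ (Q[i] - Q[j])                 # \hat y_{uij}
        g = np.exp(-y_uij) / (1.0 + np.exp(-y_uij))  # e^{-y}/(1+e^{-y})
        pu = P[u].copy()
        # gradient ascent on BPR-OPT with L2 regularization
        P[u] += lr * (g * (Q[i] - Q[j]) - reg * P[u])
        Q[i] += lr * (g * pu - reg * Q[i])
        Q[j] += lr * (-g * pu - reg * Q[j])
    return P, Q
```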

eALS

eALS[4] is a point-wise training algorithm. For a pair of a user $u$ and an item $i$, the ground truth for training is

$$
y_{ui}=\begin{cases}
1&i\in I_u\\
0&i\notin I_u
\end{cases}
$$

Embedding vectors are optimized by minimizing the following cost function[5]:

$$
\mathcal{C} = \sum_{u\in U}\sum_{i \in I}w_{ui}(y_{ui}-\hat{y}_{ui}^f)^2 + \lambda\left(\sum_{u \in U}\|\mathbf{p}_u\|^2+\sum_{i \in I}\|\mathbf{q}_i\|^2\right)
$$

where $\hat{y}_{ui}^f=\hat{y}_{ui}-p_{uf}q_{if}$ and $w_{ui}$ is the weight of the feedback:

$$
w_{ui}=\begin{cases}
1&i\in I_u\\
\alpha&i\notin I_u
\end{cases}
$$

$\alpha$ ($\alpha < 1$) is the weight of negative feedback. The derivative of the objective function with respect to $p_{uf}$ is

$$
\frac{\partial \mathcal{C}}{\partial p_{uf}}=-2\sum_{i\in I}(y_{ui}-\hat y_{ui}^f)w_{ui}q_{if} + 2p_{uf}\sum_{i\in I}w_{ui}q_{if}^2 + 2\lambda p_{uf}
$$

Setting this derivative to 0, we obtain the solution for $p_{uf}$ (Eq 1):

$$
\begin{split}
p_{uf} &= \frac{\sum_{i \in I}(y_{ui}-\hat y_{ui}^f)w_{ui}q_{if}}{\sum_{i \in I}w_{ui}q^2_{if}+\lambda}\\
&=\frac{\sum_{i\in I_u}(y_{ui}-\hat{y}_{ui}^f)q_{if}-\sum_{i\notin I_u}\alpha\hat{y}_{ui}^f q_{if}}{\sum_{i\in I_u}q^2_{if}+\sum_{i \notin I_u}\alpha q_{if}^2+\lambda}\\
&=\frac{\sum_{i\in I_u}[y_{ui}-(1-\alpha)\hat{y}_{ui}^f]q_{if}-\alpha\sum_{k\neq f}p_{uk}s^q_{kf}}{\sum_{i\in I_u}(1-\alpha)q^2_{if}+\alpha s^q_{ff}+\lambda}
\end{split}
$$

where $s^q_{kf}$ denotes the $(k, f)^\text{th}$ element of the cache $\mathbf{S}^q$, defined as $\mathbf{S}^q = \mathbf{Q}^T\mathbf{Q}$. Similarly, we get the solution for an item embedding vector (Eq 2):

$$
q_{if} = \frac{\sum_{u \in U_i}[y_{ui}-(1-\alpha)\hat y_{ui}^f]p_{uf}-\alpha\sum_{k\neq f}q_{ik}s^p_{kf}}{\sum_{u \in U_i}(1-\alpha)p^2_{uf} + \alpha s^p_{ff} + \lambda}
$$

where $s^p_{kf}$ denotes the $(k, f)^\text{th}$ element of the cache $\mathbf{S}^p$, defined as $\mathbf{S}^p = \mathbf{P}^T\mathbf{P}$. The learning algorithm is summarized as

  • Randomly initialize $\mathbf{P}$ and $\mathbf{Q}$
  • for $(u, i)\in R$ do $\hat{y}_{ui}=\mathbf{p}_u^T\mathbf{q}_i$
  • while the stopping criterion is not met do
    • $\mathbf{S}^q = \mathbf{Q}^T\mathbf{Q}$
    • for $u \leftarrow 1$ to $M$ do
      • for $f \leftarrow 1$ to $K$ do
        • for $i \in I_u$ do $\hat{y}_{ui}^f\leftarrow\hat{y}_{ui}-p_{uf}q_{if}$
        • $p_{uf}\leftarrow$ Eq 1
        • for $i \in I_u$ do $\hat{y}_{ui}\leftarrow\hat{y}_{ui}^f+p_{uf}q_{if}$
      • end
    • end
    • $\mathbf{S}^p=\mathbf{P}^T\mathbf{P}$
    • for $i \leftarrow 1$ to $N$ do
      • for $f \leftarrow 1$ to $K$ do
        • for $u \in U_i$ do $\hat{y}_{ui}^f\leftarrow\hat{y}_{ui}-p_{uf}q_{if}$
        • $q_{if}\leftarrow$ Eq 2
        • for $u \in U_i$ do $\hat{y}_{ui}\leftarrow\hat{y}_{ui}^f+p_{uf}q_{if}$
      • end
    • end
  • return $\mathbf{P}$ and $\mathbf{Q}$
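
One sweep over the user embeddings can be sketched as follows, following Eq 1 with a uniform negative-feedback weight; the item sweep (Eq 2) is symmetric. The data structures and parameter names are illustrative only.

```python
import numpy as np

def eals_update_users(P, Q, user_items, alpha=0.1, lam=0.01):
    """One eALS sweep over user embeddings, following Eq 1 (sketch).

    P: (M, K) user embeddings, Q: (N, K) item embeddings
    user_items: dict user_id -> list of item ids with positive feedback (I_u)
    alpha: weight of negative (missing) feedback, lam: regularization strength
    """
    Sq = Q.T @ Q                                    # the S^q cache
    K = P.shape[1]
    for u, items in user_items.items():
        Qu = Q[items]                               # (|I_u|, K) embeddings of favorite items
        y_hat = Qu @ P[u]                           # \hat y_{ui} for i in I_u
        for f in range(K):
            y_hat_f = y_hat - P[u, f] * Qu[:, f]    # \hat y^f_{ui} = \hat y_{ui} - p_uf q_if
            numer = np.sum((1.0 - (1.0 - alpha) * y_hat_f) * Qu[:, f]) \
                - alpha * (P[u] @ Sq[:, f] - P[u, f] * Sq[f, f])
            denom = (1.0 - alpha) * np.sum(Qu[:, f] ** 2) + alpha * Sq[f, f] + lam
            P[u, f] = numer / denom                 # Eq 1 (y_ui = 1 for i in I_u)
            y_hat = y_hat_f + P[u, f] * Qu[:, f]    # restore \hat y_{ui} with the new p_uf
    return P
```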

Random Search for Hyper-parameters

There are hyper-parameters for model training, such as the learning rate and regularization strength. Gorse uses random search[6] to find the best hyper-parameters. The hyper-parameter optimization behavior is set by model_search_epoch and model_search_trials. Larger values might lead to better recommendations but cost more CPU time. The optimal model hyper-parameters are relatively stable unless the dataset changes dramatically.

[recommend.collaborative]

# The number of epochs for model searching. The default value is 100.
model_search_epoch = 100

# The number of trials for model searching. The default value is 10.
model_search_trials = 10
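
Conceptually, random search samples hyper-parameter combinations, trains a model for a fixed number of epochs per trial, and keeps the best one. A minimal sketch follows; the fit_and_score callable and the sampled ranges are placeholders, not Gorse APIs or defaults.

```python
import random

def random_search(fit_and_score, num_trials=10, num_epochs=100, seed=0):
    """Random search over hyper-parameters (sketch).

    fit_and_score: callable(params, num_epochs) -> validation score (higher is better);
                   a placeholder for training and evaluating a model.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {
            "lr": 10 ** rng.uniform(-3, -1),           # learning rate on a log scale
            "reg": 10 ** rng.uniform(-4, -1),          # regularization strength
            "n_factors": rng.choice([8, 16, 32, 64]),  # embedding dimension
        }
        score = fit_and_score(params, num_epochs)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```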

HNSW Index

The matrix factorization model in Gorse represents users and items as embedding vectors. For each user, the items whose embedding vectors have the largest dot products with the user's embedding vector are selected as recommended items. Therefore, the most intuitive way to search for recommended items is to scan all items, calculate the dot products during the scan, and select the top items with the largest dot products as the recommendation result. The computational complexity of generating recommendations for all users this way is $O(|I||U|)$, which is unacceptable when the numbers of items and users are large.
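
For reference, the brute-force scan could look like this small numpy sketch; the HNSW index described below replaces exactly this step with an approximate search.

```python
import numpy as np

def brute_force_recommend(P, Q, n=10):
    """Exhaustive top-n search by dot product (sketch); O(|U||I|) overall.

    P: (num_users, k) user embeddings, Q: (num_items, k) item embeddings
    """
    scores = P @ Q.T                                       # (num_users, num_items) dot products
    # indices of the n largest scores per user, then sort them in descending order
    top = np.argpartition(-scores, n - 1, axis=1)[:, :n]
    order = np.take_along_axis(-scores, top, axis=1).argsort(axis=1)
    return np.take_along_axis(top, order, axis=1)
```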

A more efficient approach is to use the HNSW[7] vector index. The HNSW index builds a navigation graph over all item vectors. The results from HNSW are not exact, but the small loss in recall is worth the large performance gain. HNSW requires a parameter, ef_construction, to be set. If ef_construction is too small, the vector index cannot reach the required recall; if it is too large, search performance suffers. The build process keeps increasing ef_construction and stops when the recall reaches index_recall or the number of epochs reaches index_fit_epoch.

[recommend.collaborative]

# Enable approximate collaborative filtering recommend using vector index. The default value is true.
enable_index = true

# Minimal recall for approximate collaborative filtering recommend. The default value is 0.9.
index_recall = 0.9

# Maximal number of fit epochs for approximate collaborative filtering recommend vector index. The default value is 3.
index_fit_epoch = 3

Note

The HNSW index is complex and cannot be fully introduced on this page. For more information, read the original paper: https://arxiv.org/abs/1603.09320

Factorization Machines

User labels and item labels are important information for personalized recommendation, but matrix factorization only handles user embeddings and item embeddings. Factorization machines generate recommendations with rich features such as user features and item features.

Unlike the learning algorithms for matrix factorization, negative feedback is used in factorization machine training. The training dataset is constructed by

$$
D=\{((x_1,\dots,x_f,\dots,x_F),1)|(u,i)\in R\}\cup\{((x_1,\dots,x_f,\dots,x_F),0)|(u,i)\in\not R\}
$$

The dimension of the input vector $\mathbf{x}$ is the sum of the numbers of items, users, item labels, and user labels: $F = |I| + |U| + |L_I| + |L_U|$. Each element of $\mathbf{x}$ for a pair $(u,i)$ is defined by

$$
x_f=\begin{cases}
\mathbb{I}(f=i)&0<f\le |I|\\
\mathbb{I}(f-|I|=u)&|I|<f\le |I|+|U|\\
\mathbb{I}(f-|I|-|U|\in L_i)&|I|+|U|<f\le |I|+|U|+|L_I|\\
\mathbb{I}(f-|I|-|U|-|L_I| \in L_u)&|I|+|U|+|L_I|<f\le F
\end{cases}
$$
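
A sketch of building this one-hot/multi-hot vector for a (u, i) pair, assuming integer user and item indices and hypothetical label-index mappings:

```python
import numpy as np

def build_input_vector(u, i, item_labels, user_labels, num_items, num_users,
                       item_label_index, user_label_index):
    """Build the input vector x for a (u, i) pair (sketch).

    u, i: integer user and item indices
    item_labels, user_labels: dict id -> list of labels
    item_label_index, user_label_index: dict label -> position within its block
    """
    F = num_items + num_users + len(item_label_index) + len(user_label_index)
    x = np.zeros(F)
    x[i] = 1.0                                              # item block
    x[num_items + u] = 1.0                                  # user block
    for l in item_labels.get(i, []):                        # item label block
        x[num_items + num_users + item_label_index[l]] = 1.0
    for l in user_labels.get(u, []):                        # user label block
        x[num_items + num_users + len(item_label_index) + user_label_index[l]] = 1.0
    return x
```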

The prediction output for an input vector $\mathbf{x}$ is

$$
\hat y = w_0 + \sum^F_{i=1}w_i x_i + \sum^F_{i=1}\sum^F_{j=i+1}\left<\mathbf{v}_i,\mathbf{v}_j\right>x_i x_j
$$

where the model parameters to be estimated are $w_0\in\mathbb{R}$, $\mathbf{w}\in\mathbb{R}^F$, and $\mathbf{V}\in\mathbb{R}^{F\times k}$, and $\left<\cdot,\cdot\right>$ is the dot product of two vectors. Parameters are optimized by the logit loss with SGD. The loss function is

$$
\mathcal C=\sum_{(\mathbf{x},y)\in D}-y\log(\hat y)-(1-y)\log(1-\hat y)
$$

The gradient for each parameter is

$$
\frac{\partial}{\partial\theta}\hat y=\begin{cases}
1,&\text{if }\theta\text{ is }w_0\\
x_i,&\text{if }\theta\text{ is }w_i\\
x_i\sum^F_{j=1}v_{j,f}x_j-v_{i,f}x^2_i,&\text{if }\theta\text{ is }v_{i,f}
\end{cases}
$$
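
A compact numpy sketch of the prediction and one SGD step on the logit loss; it applies a sigmoid to map the raw score to a probability, and all names and hyper-parameters are illustrative assumptions.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction for one input vector x (sketch).

    w0: bias, w: (F,) linear weights, V: (F, k) factor matrix.
    Uses the identity sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2].
    """
    linear = w0 + w @ x
    s = V.T @ x                                          # (k,) vector of sum_i v_{i,f} x_i
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return 1.0 / (1.0 + np.exp(-(linear + pairwise)))    # probability via sigmoid

def fm_sgd_step(x, y, w0, w, V, lr=0.01):
    """One SGD step on the logit loss for a single example (sketch)."""
    p = fm_predict(x, w0, w, V)
    err = p - y                                          # d(loss)/d(raw score) for the logit loss
    s = V.T @ x
    w0 -= lr * err
    w -= lr * err * x
    V -= lr * err * (np.outer(x, s) - V * (x ** 2)[:, None])  # gradient from the case for v_{i,f}
    return w0, w, V
```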

Hyper-parameters are optimized by random search and the configuration recommend.collaborative is reused.


  1. Auvolat, Alex, et al. "Clustering is efficient for approximate maximum inner product search." arXiv preprint arXiv:1507.05910 (2015). ↩︎

  2. Zhang, Shuai, et al. "Deep learning based recommender system: A survey and new perspectives." ACM Computing Surveys (CSUR) 52.1 (2019): 1-38. ↩︎

  3. Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit feedback." Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 2009. ↩︎

  4. He, Xiangnan, et al. "Fast matrix factorization for online recommendation with implicit feedback." Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016. ↩︎

  5. Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008. ↩︎

  6. Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of machine learning research 13.2 (2012). ↩︎

  7. Malkov, Yu A., and Dmitry A. Yashunin. "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs." IEEE transactions on pattern analysis and machine intelligence 42.4 (2018): 824-836. ↩︎