System cold start and recommender systems

Move over, GoodReads! There’s a disruptor in town, known as PO-REC, your friendly neighborhood poem-recommending robot. (Just kidding, GoodReads. Feel free to make PO-REC an offer.)

The preparation

First, you’ll need to get your data in good order. Basically all you’ll need is a DataFrame in which every column is numerical and no value is NaN. For my project I used seven features I engineered based on the structure and form of poems, as well as 100-dimensional document vectors created using a Doc2Vec model.

Refer to my previous article for some nitty-gritty details about the project. For PO-REC, the seven features I used were:

Length of poem (number of lines)
Length of line (average number of words per line)
Sentiment, polarity score (a measure of positivity, negativity, or neutrality)
Sentiment, subjectivity score (a measure of…subjectivity)
End rhyme ratio (number of end rhymes to number of lines)
Word complexity (average number of syllables per word)
Lexical richness (unique words divided by total words)
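
To make a few of these features concrete, here is a minimal sketch of how three of the simpler ones (poem length, line length, and lexical richness) could be computed. The function name, the tokenization, and the punctuation handling are all assumptions for illustration; the article does not show how the real features were built.

```python
def structural_features(poem: str) -> dict:
    """Compute three of the simpler features for one poem (illustrative only)."""
    # Keep only non-empty lines so stanza breaks don't count as lines.
    lines = [line for line in poem.splitlines() if line.strip()]
    # Lowercase and strip basic punctuation so "Rose" and "rose," match.
    words = [w.strip(".,;:!?\"'()").lower() for line in lines for w in line.split()]
    words = [w for w in words if w]
    return {
        "num_lines": len(lines),
        "avg_words_per_line": len(words) / len(lines) if lines else 0.0,
        "lexical_richness": len(set(words)) / len(words) if words else 0.0,
    }


sample = "A rose is a rose\n\nis a rose"
features = structural_features(sample)
print(features)
```

The sentiment, rhyme, and syllable features need extra tooling, so they are left out of this sketch.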

So I ended up with 107-dimensional “poem vectors”. This may sound more complicated than it actually is — any dataset that happens to have 107 numerical features (i.e. columns in a DataFrame) could be considered a group of data points with 107-dimensional vectors.
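
As a rough sketch of what that looks like in practice, the snippet below glues seven engineered features onto 100 document-vector dimensions. The random numbers are stand-ins for real feature values and real Doc2Vec output, and the corpus size is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_poems = 5                                  # hypothetical corpus size

engineered = rng.random((n_poems, 7))        # stand-in for the seven features
doc_vectors = rng.random((n_poems, 100))     # stand-in for Doc2Vec vectors

# One row per poem, 7 + 100 = 107 columns.
poem_vectors = np.hstack([engineered, doc_vectors])

# The data must be fully numerical with no NaN values.
has_nan = bool(np.isnan(poem_vectors).any())
print(poem_vectors.shape, has_nan)
```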

But why all this talk of vectors? Well, it’s important for a pillar of content-based recommendation systems: cosine similarity.

Cosine similarity

Cosine similarity is simply the cosine of the angle between two vectors. A smaller angle results in a larger cosine value. Thus, the smaller the angle, the more similar the vectors. The image below gives you a sense of what different angles mean in terms of similarity.

Cosine values range from -1 to 1. Vectors that run in completely opposite directions of each other (a 180-degree angle) have a value of -1; and vectors that run in the same exact direction (a 0-degree angle) have a value of 1. In the image above, the left-most angle has a cosine value close to 1, the middle a value around 0, and the right-most a value nearing -1.
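
If you want to see those numbers for yourself, a few lines with Python's standard math module are enough. The particular angles are just sample points chosen for illustration.

```python
import math

angles = (0, 45, 90, 135, 180)

# Round to two decimals so floating-point noise near zero disappears.
cosines = {deg: round(math.cos(math.radians(deg)), 2) for deg in angles}

for deg, value in cosines.items():
    print(f"{deg:>3} degrees -> cosine {value:+.2f}")
```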

But what about magnitude?

I know what you’re thinking: this doesn’t take magnitude into account. And while that’s true, for many use cases (and for most recommendation systems) magnitude typically doesn’t matter as much as direction. I recall an example of comparing three grocery orders. Person A buys 1 egg, 1 grapefruit, and 1 steak. Person B buys 1 tofu slab (extra firm), 1 bag of chips, and 1 bag of rice. Person C buys 100 tofu slabs (extra firm), 100 bags of chips, and 100 bags of rice. And the question is: whose orders are more similar?

Cosine similarity will tell you that Person B and Person C are as similar as it can get. Their vectors point in exactly the same direction, so the angle between them is 0 degrees and the cosine value is 1.
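
Here is a small sketch of the grocery example in plain Python. Encoding each order as counts over six items, in the order egg, grapefruit, steak, tofu, chips, rice, is my assumption for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Item order: egg, grapefruit, steak, tofu, chips, rice
person_a = [1, 1, 1, 0, 0, 0]
person_b = [0, 0, 0, 1, 1, 1]
person_c = [0, 0, 0, 100, 100, 100]

cos_bc = cosine(person_b, person_c)   # same direction, different magnitude
cos_ab = cosine(person_a, person_b)   # no items in common
print(round(cos_bc, 6), round(cos_ab, 6))
```

B and C come out at (essentially) 1, while A and B, who share no items, come out at 0.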

This sad old man will tell you that the meat-eater and the vegan have much more similar orders because they ordered the same number of things. In other words, the endpoints of the A and B vectors are far closer than the endpoints of the B and C vectors. If one measures similarity using Euclidean distance, one favors magnitude over direction. It can easily be argued that both matter, so it really does depend on your use case.
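
The Euclidean view of the same three orders can be sketched with math.dist from the standard library (Python 3.8 or later). As before, the six-item encoding (egg, grapefruit, steak, tofu, chips, rice) is an assumption for illustration.

```python
import math

# Item order: egg, grapefruit, steak, tofu, chips, rice
person_a = [1, 1, 1, 0, 0, 0]
person_b = [0, 0, 0, 1, 1, 1]
person_c = [0, 0, 0, 100, 100, 100]

dist_ab = math.dist(person_a, person_b)   # small orders, different items
dist_bc = math.dist(person_b, person_c)   # same items, very different sizes
print(round(dist_ab, 2), round(dist_bc, 2))
```

By this measure A and B are only about 2.45 apart, while B and C are about 171.47 apart, which is the opposite ranking from cosine similarity.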

In my case, as is true of most text-based projects, direction matters more. For example, a short poem about death is more similar to a long poem about death than it is to a short poem about water. That said, you can include measures of magnitude within each vector, as I described above, which will change the angle of that vector accordingly.

Implementation

While it is difficult to describe (and impossible to envision) the angle between two 107-dimensional vectors, one can still calculate that angle using the dot product. This is easily done with scikit-learn’s cosine_similarity function:

from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(your_dataframe)
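
For intuition, the dot-product calculation can also be written out by hand: divide the dot product of the two vectors by the product of their lengths. The helper below is a hypothetical NumPy sketch, not part of scikit-learn, and it assumes neither vector is all zeros (which would cause a division by zero).

```python
import numpy as np


def cosine_from_dot(u: np.ndarray, v: np.ndarray) -> float:
    """cos(theta) = (u . v) / (|u| * |v|); assumes non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# Dot product is 24 and both lengths are 5, so the cosine is 24 / 25.
value = cosine_from_dot(u, v)
print(round(value, 2))
```

The same formula works unchanged for 107 dimensions; only the length of the arrays differs.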

This will return a cosine similarity matrix, which is basically an array of arrays with each text’s similarity to all other texts. The shape of the matrix will be the length of your DataFrame by the length of your DataFrame. It’s important to note that this includes each text’s similarity with itself, which will always equal 1. If you want to return a list of most similar items, you’d most likely want to exclude that value.
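
One way to handle the self-similarity is sketched below. To keep the example dependency-light it builds the matrix with NumPy directly (normalize each row, then multiply by the transpose) rather than calling scikit-learn, and the four 2-dimensional vectors are made up for illustration.

```python
import numpy as np

vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [2.0, 0.1],
])

# Scale every row to length 1; dot products of unit vectors are cosines.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarities = unit @ unit.T              # shape (4, 4), diagonal is 1

# Mask the diagonal so a text can never be its own best match.
masked = similarities.copy()
np.fill_diagonal(masked, -np.inf)
most_similar = masked.argmax(axis=1)
print(similarities.shape, most_similar.tolist())
```

Each entry of most_similar is the row index of the closest other text.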

A problem emerges

Despite the fact that the cosine_similarity function runs in an instant, the resulting matrix can get rather large rather quickly. If you’re trying to host a recommendation system on something like Heroku, as I was, you can’t exactly upload a several hundred megabyte file. So for me, it was best to calculate the cosine similarity one text at a time, as needed. The code for that was something more like:
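
A back-of-the-envelope estimate shows how quickly the full matrix grows. The sketch assumes each similarity is stored as an 8-byte float, and the corpus sizes are hypothetical rather than the actual size of the poem dataset.

```python
def matrix_megabytes(n_texts: int, bytes_per_value: int = 8) -> float:
    """Approximate size of an n x n similarity matrix in megabytes."""
    return n_texts * n_texts * bytes_per_value / 1_000_000


for n in (1_000, 5_000, 10_000):          # hypothetical corpus sizes
    print(f"{n:>6} texts -> {matrix_megabytes(n):,.0f} MB")
```

The size grows with the square of the number of texts, whereas a single row of similarities grows only linearly.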

similarities = enumerate(
    cosine_similarity(
        your_dataframe.iloc[text_id].values.reshape(1, -1),
        your_dataframe,
    )[0]
)

Disregarding the enumerate portion for now, you want to use one text as an input and compare it to all of the texts. Using iloc grabs the desired row, values converts it into a NumPy array, and reshape gets it into the required shape. Pulling a single row from a DataFrame gives a one-dimensional array of shape (n_features,), but cosine_similarity expects two-dimensional input with one row per sample, so reshape(1, -1) turns the row into a matrix of shape (1, n_features). The second input is simply the entire DataFrame, or whatever you wish to compare the text to. Indexing the result with [0] returns the similarities as a single flat array, as opposed to a nested array with a length of 1.
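
The effect of reshape is easiest to see by printing shapes. This sketch uses a made-up 107-value array in place of a real DataFrame row.

```python
import numpy as np

row = np.arange(107, dtype=float)     # stand-in for one poem's values
as_matrix = row.reshape(1, -1)        # 1 row; -1 lets NumPy infer 107 columns

print(row.shape)                      # one-dimensional
print(as_matrix.shape)                # two-dimensional, a single sample
```

The values themselves are untouched; only the shape changes from (107,) to (1, 107).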

Why enumerate? The enumerate portion is indeed optional, but because I needed to sort from most to least similar and return a specific poem, I needed to make sure I tracked which poem corresponded to which similarity measure, and enumerate provides that index number (assuming your DataFrame has a default integer index running from 0 to one less than its length).
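
A tiny example, with made-up similarity scores, shows what enumerate produces. One thing worth knowing: enumerate returns a lazy iterator, so it can only be consumed once.

```python
scores = [0.31, 1.0, 0.87, 0.12]      # made-up similarity values

# Each score is paired with its position, i.e. the row it came from.
pairs = list(enumerate(scores))
print(pairs)

# An enumerate object is exhausted after one pass.
lazy = enumerate(scores)
first_pass = list(lazy)
second_pass = list(lazy)              # nothing left to iterate over
print(len(first_pass), len(second_pass))
```

If you need the pairs more than once, wrap the enumerate call in list() first.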

Sorting hint: to sort the tuples produced by enumerate by the cosine value (rather than the index number), you can use the following code:

from operator import itemgetter

similar_texts = sorted(similarities,
                       key=itemgetter(1),
                       reverse=True)

You can also use a lambda, but from what I’ve read, itemgetter is faster. Lastly, reverse=True simply sorts in descending order.
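
Putting the sorting together with the earlier point about excluding a text's similarity with itself, a recommendation helper might look something like this. The function name, its parameters, and the scores are hypothetical; this is a sketch, not the code PO-REC actually runs.

```python
from operator import itemgetter


def top_matches(similarities, text_id, n=3):
    """Return the n most similar (index, score) pairs, skipping text_id."""
    # key=lambda pair: pair[1] would sort the same way.
    ranked = sorted(similarities, key=itemgetter(1), reverse=True)
    # Filter by index rather than dropping the first item, in case
    # another text ties with the query at a score of 1.
    return [(idx, score) for idx, score in ranked if idx != text_id][:n]


similarities = enumerate([0.31, 1.0, 0.87, 0.12, 0.55])   # made-up scores
matches = top_matches(similarities, text_id=1, n=2)
print(matches)
```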

Project repo

You can check out my Heroku deployment of PO-REC, or look at my project repo on GitHub (where you can clone and run the app locally): https://github.com/p-szymo/poetry_genre_classifier
