Apache Mahout is a library of machine learning algorithms for Hadoop.

Machine learning is an area of artificial intelligence that focuses on learning from available data in order to make predictions on unseen data, without explicit programming.

The Apache Mahout recommendations module helps recommend items to users based on their preferences.

You can find this kind of algorithm on Amazon for example.

 

[Screenshot: recommendations on Amazon]

The algorithm used by Amazon is called collaborative filtering.

In this tutorial I am going to cover content-based filtering and collaborative filtering.

Both of them are implemented in Apache Mahout.

 

Content-based filtering

Content-based filtering is an unsupervised mechanism based on the attributes of the items and on the preferences and profile of the user.

What I mean by unsupervised learning is a type of algorithm that tries to find correlations without any external input other than the raw data. It does not try to find any logic in the data; it clusters it to determine groups. The data are grouped by similarity, but are not labeled.

 

For example, if a user views a movie with a certain set of attributes such as genre, actors, and awards, the system recommends items with similar attributes. The preferences of the user are mapped to the attributes or features of the recommended item.

 

ex: if I watch Star Wars I and II, the recommendation algorithm will probably propose Star Wars III, along with other sci-fi movies of the same genre or featuring the same actors.
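Mahout's recommenders are collaborative, so to make the content-based idea concrete here is a small plain-Java sketch (not a Mahout API): movies are ranked by how much their attributes overlap with a profile built from what the user already watched. The titles and attribute sets are made up for illustration.

```java
import java.util.*;

public class ContentBasedSketch {
    // Jaccard similarity between two attribute sets: |A ∩ B| / |A ∪ B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical catalogue: each movie described by its attributes
        Map<String, Set<String>> attrs = new HashMap<>();
        attrs.put("Star Wars III", Set.of("sci-fi", "space-opera", "Ewan McGregor"));
        attrs.put("Solaris",       Set.of("sci-fi", "drama"));
        attrs.put("Notting Hill",  Set.of("romance", "comedy"));

        // Profile built from the movies the user watched (Star Wars I and II)
        Set<String> profile = Set.of("sci-fi", "space-opera", "Ewan McGregor");

        // Rank candidate movies by attribute overlap with the profile
        attrs.entrySet().stream()
             .sorted((x, y) -> Double.compare(jaccard(profile, y.getValue()),
                                              jaccard(profile, x.getValue())))
             .forEach(e -> System.out.println(e.getKey() + " -> "
                                              + jaccard(profile, e.getValue())));
    }
}
```

With this profile, Star Wars III ranks first (full overlap), followed by Solaris, which shares only the sci-fi attribute.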

 

Collaborative filtering

Collaborative filtering approaches consider the notion of similarity between items and users. Unlike in content-based filtering, no features of the products or properties of the users are considered here.

It is a supervised learning algorithm: the algorithm is fed with data in order to make a decision and find a logic in it. All your data are labeled, so you know exactly what type of data you give to the algorithm, and the data can be categorised.

The collaborative filtering approach uses historical data on user behaviour, such as clicks, views, and purchases, to provide better recommendations.

The algorithm learns from the users to better understand their needs.

Once again, the best example I can give is Amazon, as shown above.

In collaborative filtering, a neighbourhood of similar items or users is formed around each item or user. Once you act on an item, the recommendations are drawn from that neighbourhood.

Collaborative filtering can be achieved using the following techniques:

  • User-based recommendation
  • Item-based recommendation

 

User-based recommendations

In user-based recommenders, similar users from a given neighbourhood are identified, and item recommendations are based on what those similar users already bought or viewed that the target user has not bought or viewed yet.

 

In this example, you can see that dude D1 likes pizzas and salads.

Dude D2 also likes pizzas and salads, but beers too.

The two of them are pretty similar: they have the same tastes.

So the algorithm will propose beers to dude D1, given his tastes and his similarity with dude D2.
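The dude example can be sketched in a few lines of plain Java (not the Mahout API): once D1 and D2 are known to be similar, the candidate recommendations are simply the items the neighbour likes that the target user has not tried yet.

```java
import java.util.*;

public class UserBasedIdea {
    // Given a target user and a similar neighbour, recommend what the
    // neighbour likes that the target user has not tried yet.
    static Set<String> recommend(Set<String> target, Set<String> neighbour) {
        Set<String> recs = new LinkedHashSet<>(neighbour);
        recs.removeAll(target);
        return recs;
    }

    public static void main(String[] args) {
        Set<String> d1 = Set.of("pizza", "salad");
        Set<String> d2 = Set.of("pizza", "salad", "beer");
        System.out.println(recommend(d1, d2)); // [beer]
    }
}
```

Real user-based recommenders additionally weight each candidate item by how similar the neighbour is and how highly the neighbour rated it, which is exactly what the Mahout code below does.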

 

[Diagram: user-based recommendation]

 

Another example, with movies: 5 users rate several movies from 1 (lowest) to 10 (highest).

The ratings are stored in a .csv file.

[Screenshot: the movie.csv ratings file]
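The actual ratings from the screenshot are not reproduced here, but FileDataModel expects one rating per line in the form userID,itemID,value. A few hypothetical lines, just to show the format:

```csv
1,1,5.0
1,4,6.0
2,1,4.0
3,4,3.0
4,6,9.0
5,6,8.0
```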

 

 

How to implement it in Java?

 

// Load ratings (userID,itemID,rating per line) from the CSV file
DataModel model = new FileDataModel(new File("movie.csv"));
// Compare users by the Pearson correlation of their common ratings
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Neighbourhood of the 2 nearest users
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// Ask for 2 recommendations for user 3
List<RecommendedItem> recommendations = recommender.recommend(3, 2);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}

The result is :

RecommendedItem[item:6, value:8.0]
RecommendedItem[item:3, value:5.181073]

The value for item 6 is higher than that of item 3 because both user 4 and user 5 have rated item 6 highly.

Even though user 1 and user 2 share some item interests (item 1, item 4) with user 3, they are not considered, due to the low ratings given to those co-occurring items.

 

More explanation about the code is available on the Mahout website.

 

Data model

The data model represents how we read data from different data sources. In our code example, we used FileDataModel, which takes CSV input.

In addition, Apache Mahout supports the following input methods:

  • JDBCDataModel – reads data from a database via a JDBC driver
  • GenericDataModel – an in-memory model populated through Java calls
  • GenericBooleanPrefDataModel – like GenericDataModel, but for boolean (value-less) preferences; suitable for small experiments

 

Item-based recommendation

It measures the similarities between different items and picks the top k closest (in similarity) items to a given item, in order to arrive at a rating prediction or recommendation for a given user and a given item.

For example, if I buy a CD of a rock band, the algorithm will also propose me CDs similar to the one I bought. It is based only on the products.
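To see what a distance-based item similarity is doing, here is a small plain-Java sketch (not Mahout code): each item is represented by the vector of ratings its co-rating users gave it, and two items are similar when those vectors are close. The 1 / (1 + distance) mapping is one common way to turn a distance into a similarity, and the ratings are made up.

```java
public class ItemBasedIdea {
    // Similarity of two items from the ratings the same users gave them.
    // Smaller Euclidean distance between the rating vectors means the
    // items were received similarly, hence a higher similarity score.
    static double similarity(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    public static void main(String[] args) {
        // Ratings for three CDs by the same four users (hypothetical values)
        double[] rockCd1   = {9, 8, 7, 9};
        double[] rockCd2   = {8, 9, 7, 8};
        double[] classical = {2, 3, 9, 1};

        System.out.println("rock1~rock2:     " + similarity(rockCd1, rockCd2));
        System.out.println("rock1~classical: " + similarity(rockCd1, classical));
    }
}
```

The two rock CDs come out far more similar to each other than either does to the classical CD, which is why buying one rock CD pulls the other into the recommendations.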


How to implement this algorithm?

Same example as before, with 5 users who rate several movies.

// Reuse the same CSV data model
DataModel model = new FileDataModel(new File("movie.csv"));
// Compare items by the Euclidean distance between their rating vectors
ItemSimilarity itemSimilarity = new EuclideanDistanceSimilarity(model);
Recommender itemRecommender = new GenericItemBasedRecommender(model, itemSimilarity);
// Ask for 2 recommendations for user 3
List<RecommendedItem> itemRecommendations = itemRecommender.recommend(3, 2);
for (RecommendedItem itemRecommendation : itemRecommendations) {
    System.out.println("Item: " + itemRecommendation);
}

Here as well, user 3 is asking for 2 recommendations.

Here is the result:

Item: RecommendedItem[item:2, value:7.7220707]
Item: RecommendedItem[item:3, value:7.5602336]

When the algorithm is based only on the items, the recommendations are totally different from the ones we saw before: here the algorithm proposes items 2 and 3 instead of items 6 and 3 (with user-based recommendation).

More explanation about the code is available on the Mahout website.

Similarity

In the examples, I have used PearsonCorrelationSimilarity to find the similarity between two users.

It implements UserSimilarity, so it is pretty efficient for computing user similarities.

We can also use it for computing item similarity (it implements ItemSimilarity as well), but this is generally too slow to be useful.

 

Some other available similarity measures are listed below.

  • EuclideanDistanceSimilarity treats users or items as points in a space whose dimensions are the rated items or users, with the preference values as coordinates; the similarity is derived from the Euclidean distance between those points. EuclideanDistanceSimilarity will not work if you have not given preference values.
  • TanimotoCoefficientSimilarity is applicable if the preference values consist of binary responses: it is the number of items two users both bought, divided by the total number of items either of them bought.
  • LogLikelihoodSimilarity is a measure based on likelihood ratios.
  • SpearmanCorrelationSimilarity compares the relative rankings of the preference values instead of the preference values themselves.
  • UncenteredCosineSimilarity is an implementation of cosine similarity.
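As a sanity check on what PearsonCorrelationSimilarity computes, here is the Pearson correlation written out in plain Java over the items two users both rated (the ratings are made up): covariance of the two rating vectors divided by the product of their standard deviations, giving a value between -1 and +1.

```java
public class PearsonSketch {
    // Pearson correlation of two users' ratings over their co-rated items:
    // covariance(a, b) / (stddev(a) * stddev(b))
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        double[] u1 = {8, 7, 9, 2};   // hypothetical ratings on four shared movies
        double[] u2 = {7, 6, 8, 1};   // same taste, slightly harsher rater
        double[] u3 = {2, 3, 1, 9};   // opposite taste
        System.out.println(pearson(u1, u2)); // close to +1
        System.out.println(pearson(u1, u3)); // close to -1
    }
}
```

Note that centering on each user's mean makes the measure insensitive to how generous a rater is overall: u2 rates every movie one point lower than u1 yet correlates perfectly with them.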

When defining similarity measures, you need to keep in mind that not all datasets will work with all similarity measures; consider the nature of the dataset when selecting one.

Also, to determine the optimal similarity measure for your scenario, you need a good understanding of the dataset.

Trying out different similarity measures with your training dataset is essential to find the optimal one.

Conclusion

The optimal recommendation algorithm depends on the nature of data and the scenario in hand.

However, if you have fewer users than items, it is better to use user-based recommendations; in contrast, if you have fewer items than users, item-based recommendations give better performance.
