Supervised learning
is tasked with learning a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam, labeling Web pages according to their genre, and recognizing handwriting. Many algorithms are used to create supervised learners, the most common being neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers.
Unsupervised learning
is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups. It also can be used to reduce the number of dimensions in a data set in order to focus on only the most useful attributes, or to detect trends. Common approaches to unsupervised learning include k-Means, hierarchical clustering, and self-organizing maps.
------------------------------------------------------------------------------------------------------------------------------------
Mahout currently implements three specific machine-learning tasks:
- Collaborative filtering
- Clustering
- Categorization
Collaborative filtering
Collaborative filtering (CF) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks, and purchases to provide recommendations to other site users. CF is often used to recommend consumer items such as books, music, and movies, but it is also used in other applications where multiple actors need to collaborate to narrow down data.
Given a set of users and items, CF applications provide recommendations to the current user of the system. Four ways of generating recommendations are typical:
- User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users (a code sketch of this approach follows the list).
- Item-based: Calculate similarity between items and make recommendations. Items usually don't change much, so this often can be computed offline.
- Slope-One: A very fast and simple item-based recommendation approach applicable when users have given ratings (and not just boolean preferences).
- Model-based: Provide recommendations based on developing a model of users and their ratings.
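To make the user-based approach concrete, here is a minimal sketch using Mahout's Taste recommender API (the org.apache.mahout.cf.taste packages from the Mahout 0.x era). The preferences file name intro.csv and the user ID are hypothetical; treat this as an illustration of the API shape rather than production code.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Load userID,itemID,rating triples from a CSV file (hypothetical path).
        DataModel model = new FileDataModel(new File("intro.csv"));

        // Similarity between users, based on how their ratings correlate.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Consider only the 10 most similar users when recommending.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for the user with ID 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```

An item-based recommender follows the same pattern, with an ItemSimilarity feeding an item-based recommender implementation instead of a user neighborhood.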
******
Clustering
Given large data sets, whether they are text or numeric, it is often useful to group together, or cluster, similar items automatically. For instance, given all of the news for the day from all of the newspapers in the United States, you might want to group all of the articles about the same story together automatically; you can then choose to focus on specific clusters and stories without needing to wade through a lot of unrelated ones. Another example: Given the output from sensors on a machine over time, you could cluster the outputs to determine normal versus problematic operation, because normal operations would all cluster together and abnormal operations would be in outlying clusters.
Like CF, clustering calculates the similarity between items in the collection, but its only job is to group together similar items. In many implementations of clustering, items in the collection are represented as vectors in an n-dimensional space. Given the vectors, one can calculate the distance between two items using measures such as the Manhattan distance, Euclidean distance, or cosine similarity. Then, the actual clusters can be calculated by grouping together the items that are close in distance.
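With items represented as plain double[] vectors, the three measures just mentioned reduce to a few lines of arithmetic. The following is a library-free sketch of that arithmetic, not Mahout's own DistanceMeasure classes; the sample vectors are made up for illustration.

```java
public class DistanceSketch {

    // Sum of absolute coordinate differences ("city block" distance).
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Straight-line distance in n-dimensional space.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the vectors: 1.0 means they point the same way.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical term-weight vectors for two short documents.
        double[] docA = {1.0, 3.0, 0.0};
        double[] docB = {2.0, 1.0, 1.0};
        System.out.println("Manhattan: " + manhattan(docA, docB));
        System.out.println("Euclidean: " + euclidean(docA, docB));
        System.out.println("Cosine:    " + cosineSimilarity(docA, docB));
    }
}
```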
There are many approaches to calculating the clusters, each with its own trade-offs. Some approaches work from the bottom up, building up larger clusters from smaller ones, whereas others break a single large cluster into smaller and smaller clusters. Both have criteria for exiting the process at some point before they break down into a trivial cluster representation (all items in one cluster or all items in their own cluster). Popular approaches include k-Means and hierarchical clustering.
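As a rough illustration of how k-Means behaves, the sketch below runs the assign-then-recompute loop on a handful of 2-D points. It is a toy, single-machine version with fixed initial centroids and a fixed iteration count, not Mahout's distributed k-Means implementation.

```java
import java.util.Arrays;

public class KMeansSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy data: five 2-D points forming two obvious groups.
        double[][] points = {{1, 1}, {1.5, 2}, {1, 1.5}, {8, 8}, {9, 8.5}};
        // k = 2, with (arbitrary) initial centroids taken from the data.
        double[][] centroids = {{1, 1}, {8, 8}};
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (euclidean(points[p], centroids[c]) < euclidean(points[p], centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[p][d];
                        }
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        System.out.println("Assignments: " + Arrays.toString(assignment));
        System.out.println("Centroids:   " + Arrays.deepToString(centroids));
    }
}
```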
******
Categorization
The goal of categorization (often also called classification) is to label unseen documents, thus grouping them together. Many classification approaches in machine learning calculate a variety of statistics that associate the features of a document with the specified label, thus creating a model that can be used later to classify unseen documents. For example, a simple approach to classification might keep track of the words associated with a label, as well as the number of times those words are seen for a given label. Then, when a new document is classified, the words in the document are looked up in the model, probabilities are calculated, and the best result is output, usually along with a score indicating the confidence the result is correct.
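A stripped-down version of the word-counting approach just described might look like the following. It is essentially a tiny naive Bayes-style sketch with assumptions of my own (whitespace tokenization, add-one smoothing, log-probability scoring), not Mahout's classifier; the training sentences are invented examples.

```java
import java.util.HashMap;
import java.util.Map;

// Toy word-count classifier: tracks how often each word appears with each label,
// then scores new documents against those counts.
public class WordCountClassifierSketch {

    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWordsPerLabel = new HashMap<>();

    // Training: record how often each word is seen with each label.
    void train(String label, String document) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : document.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            totalWordsPerLabel.merge(label, 1, Integer::sum);
        }
    }

    // Classification: sum log probabilities of the document's words per label
    // (add-one smoothing keeps unseen words from zeroing out a label) and pick the best.
    String classify(String document) {
        String bestLabel = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : wordCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(label);
            double total = totalWordsPerLabel.get(label);
            double score = 0.0;
            for (String word : document.toLowerCase().split("\\s+")) {
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + counts.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                bestLabel = label;
            }
        }
        return bestLabel;
    }

    public static void main(String[] args) {
        WordCountClassifierSketch clf = new WordCountClassifierSketch();
        clf.train("spam", "cheap pills buy now limited offer");
        clf.train("ham", "meeting agenda for the project review tomorrow");
        System.out.println(clf.classify("buy cheap pills"));        // prints: spam
        System.out.println(clf.classify("project review meeting")); // prints: ham
    }
}
```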
Features for classification might include words, weights for those words (based on frequency, for instance), parts of speech, and so on. Of course, features really can be anything that helps associate a document with a label and can be incorporated into the algorithm.
Reference
http://www.ibm.com/developerworks/java/library/j-mahout/