http://www.quora.com/How-does-LinkedIns-recommendation-system-work
I gave this talk earlier this week at Hadoop World (http://www.hadoopworld.com/sessi...), a conference that evangelizes Hadoop by highlighting how people across the industry are solving big business challenges by leveraging it. I am posting here the slides with an approximate transcript of my talk.
Ever since I studied Machine Learning and Data Mining at Stanford 3 years ago, I have been enamored by the idea that it is now possible to write programs that can sift through TBs of data to recommend useful things.
So here I am with my colleague Adil Aijaz, for a talk on some of the lessons we learnt and challenges we faced in building large-scale recommender systems.
At LinkedIn we believe in building platforms, not verticals. Our talk is divided into 2 parts. In the first part of this talk, I will talk about our motivation for building the recommendation platform, followed by a discussion of how we do recommendations. No analytics platform is complete without Hadoop. So, in the next part of our talk, Adil will talk about leveraging Hadoop for scaling our products.
‘Think Platform, Leverage Hadoop’ is our core message.
Throughout our talk, we will provide examples that highlight how these two ideas have helped us ‘scale innovation’.
With north of 135 million members, we’re making great strides toward our mission of connecting the world’s professionals to make them more productive and successful. For us this not only means helping people to find their dream jobs, but also enabling them to be great at the jobs they’re already in.
With terabytes of data flowing through our systems, generated from members’ profiles, their connections and their activity on LinkedIn, we have amassed rich, structured data on one of the most influential, affluent and highly-educated audiences on the web.
This huge semi-structured dataset is updated in real time and growing at a tremendous pace, and we are all very excited about the data opportunity at LinkedIn.
For an average user, there is so much data that there is no way they can leverage all of it on their own.
We need to put the right information in front of the right user, at the right time.
With such rich data on members, jobs, groups, news, companies, schools, discussions and events, we do all kinds of recommendations in a relevant and engaging way.
We have products like
‘Job Recommendations’: here, using profile data, we suggest the top jobs that a member might be interested in. The idea is to make our users aware of the possibilities out there for them.
‘Talent Match’: when recruiters post jobs, we suggest, in real time, the top candidates for the job.
‘News Recommendation’: using articles shared per industry, we suggest the top news our users need to stay up to date with the latest happenings.
‘Companies You May Want to Follow’: using a combination of content matching and collaborative filtering, we recommend companies a user might be interested in keeping up to date with.
We have recommendation solutions for everyone: individuals, recruiters and advertisers.
Before we discuss our motivation for building a recommendation platform, how we do recommendations, or how we leverage Hadoop, let’s first answer a basic question: are recommendations really important? To put things in perspective, 50% of total job applications and job views by members are a direct result of recommendations. Interestingly, in the past year and a half that share has risen from 6% to 50%.
Let us start with an example of the kind of data we have.
For a member, we have positions, education, summary, specialty, experience and skills from the profile itself. Then, from the member’s activity, we have data about the member’s connections, the groups the member has joined, and the companies the member follows, amongst others.
Before we can start leveraging data for recommendations, we first need to clean and canonicalize it. For example:
‘Technical Yahoo’ is a Software Engineer at Yahoo
‘Member Technical Staff’ is a Software Engineer at Oracle
‘Software Development Engineer’ is a Software Engineer at Microsoft
‘SDE’ is a Software Engineer at Amazon
Solving this problem is itself a research topic broadly referred to as ‘Entity Resolution’.
As another example, how many variations do you think we have for the company name ‘IBM’?
We apply machine-learnt classifiers to this entity resolution problem, using a host of features for company standardization.
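To make this concrete, here is a minimal sketch of the canonicalization step, assuming a hand-built lookup table; our actual standardizer uses machine-learnt classifiers over many features, so the dictionary and names below are purely illustrative.

```python
import re

# Illustrative mapping from raw title variants to a canonical title.
# A production standardizer would instead use machine-learnt classifiers
# over many features (company, seniority, free-text tokens, ...).
CANONICAL_TITLES = {
    "technical yahoo": "Software Engineer",
    "member technical staff": "Software Engineer",
    "software development engineer": "Software Engineer",
    "sde": "Software Engineer",
}

def normalize(raw_title: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", raw_title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def canonical_title(raw_title: str) -> str:
    """Map a raw job title to its canonical form, falling back to the input."""
    return CANONICAL_TITLES.get(normalize(raw_title), raw_title)

if __name__ == "__main__":
    for t in ["Technical Yahoo", "SDE", "Sr. Staff Scientist"]:
        print(t, "->", canonical_title(t))
```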
Let’s look at ‘News Recommendation’, which finds relevant news for our users. Relevant news today might be old news tomorrow. Hence, news recommendation has to have a strong real-time component.
On the other hand, we have another product called ‘Similar Profiles’. The motivation here is that a hiring manager may already know the kind of person he wants to hire; it could be someone like a person already on his team, or like one of his connections on LinkedIn. Using that as the source profile, we suggest the top similar profiles for hiring. Since people don’t reinvent themselves every day, people who are similar today are most likely similar tomorrow. So, we can potentially do this computation completely offline with a more sophisticated model.
In solving the completely real-time vs. completely offline problem, we could have gone down the route of creating separate solutions optimized for each use case. In the short run, that would have been quicker.
But we went down the platform route because we realized that we would churn out more and more such verticals as LinkedIn grows. As a result, the same code computes recommendations online as well as offline. Moreover, in the production system, caching and an expiry policy allow us to keep recommendations fresh irrespective of how we compute them. So for newer verticals, we easily get ‘freshness’ of recommendations whether we compute them online or offline.
Another interesting trade-off is choosing between content analysis and collaborative filtering.
For the collaborative filtering side, we have a product called ‘Viewers of this profile also viewed…’. When a member views more than one profile within a single session, we record it as a co-view. Aggregating these co-views across members gives us, for any given profile, all the profiles that get co-viewed with it. This is classical collaborative-filtering-based recommendation, much like Amazon’s ‘people who viewed this item also viewed’.
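To make the idea concrete, here is a minimal sketch of aggregating co-views from per-session profile views; the data format and function names are illustrative, not the production pipeline.

```python
from collections import defaultdict
from itertools import combinations

def coview_counts(sessions):
    """Count how often each pair of profiles is viewed in the same session.

    `sessions` is an iterable of lists of profile ids viewed in one session.
    Returns {profile_id: {other_profile_id: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for viewed in sessions:
        for a, b in combinations(set(viewed), 2):  # each unordered pair once
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

if __name__ == "__main__":
    sessions = [["alice", "bob", "carol"], ["alice", "bob"], ["bob", "dave"]]
    counts = coview_counts(sessions)
    # Profiles most co-viewed with "bob": 'viewers of this profile also viewed'.
    print(sorted(counts["bob"].items(), key=lambda kv: -kv[1]))
```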
Most other recommendations are hybrid. For example, for ‘Similar Jobs’, jobs that have high content overlap with each other are similar. Interestingly, jobs that get applied to or viewed by the same members are also similar. So, Similar Jobs is a nice mix of content and collaborative filtering.
Finally, the last key trade-off is of precision vs recall.
Even if a single job recommendation looks bad to the user, either because the job’s seniority is too low or because the recommendation is for a company that the user is not fond of, our users might feel less than pleased.
Here, getting the absolute best 3 jobs even at the cost of aggressively filtering out a lot of jobs is acceptable.
On the other hand, we have another recommendation product called ‘Similar Profiles’ for hiring managers who are actively looking for candidates. Here, once one finds a candidate, we suggest other candidates like the original one in terms of overall experience, specialty, education background and a host of other features.
Since the hiring manager is actively looking, in this case they are more open to getting a few bad recommendations as long as they get a lot of good ones too. In essence, recall is more important here.
Here, we look at a host of different features, such as:
1. User provided features like ‘Title, Specialty, Education, experience amongst others’
2. Complex derived features like ‘Seniority’ and ‘Skills’, computed using machine learnt classifiers.
3. While both of these kinds of features help precision, we also have features like ‘Related Titles’ and ‘Related Companies’ that help increase recall.
Intuitively, one might imagine that we use the following pairs of features to compute Similar Profiles. In the next slide, we will discuss a more principled approach to figuring out which pairs of features to match against. Here, in order to compute the overall similarity between me and Adil, we first compute the similarity between our specialties, our skills, our titles and other attributes.
With this we get a ‘similarity score vector’ for how similar Adil is to me; similarly, we can get such a vector for other profiles.
Moreover, the fact that our skills match might matter more for hiring than whether our education matches. Hence, there should be a relative importance of one feature over the others.
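To illustrate the idea, here is a toy sketch of a per-attribute similarity score vector combined with an importance weight vector; the attributes, similarity functions and weights are assumptions for illustration, not our production model.

```python
# A toy sketch of the 'similarity score vector' idea: compute one similarity
# per attribute, then combine them with an importance weight vector.

def jaccard(a, b):
    """Set-overlap similarity, used here for skills and specialties."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_vector(p, q):
    return {
        "title":     1.0 if p["title"] == q["title"] else 0.0,
        "skills":    jaccard(p["skills"], q["skills"]),
        "specialty": jaccard(p["specialty"], q["specialty"]),
        "education": 1.0 if p["education"] == q["education"] else 0.0,
    }

# Hypothetical importance weights: skills matter more than education for hiring.
WEIGHTS = {"title": 0.3, "skills": 0.4, "specialty": 0.2, "education": 0.1}

def overall_similarity(p, q):
    v = similarity_vector(p, q)
    return sum(WEIGHTS[k] * v[k] for k in WEIGHTS)

if __name__ == "__main__":
    me   = {"title": "Software Engineer", "skills": {"java", "lucene", "hadoop"},
            "specialty": {"search", "recommendations"}, "education": "Stanford"}
    adil = {"title": "Software Engineer", "skills": {"java", "hadoop", "pig"},
            "specialty": {"recommendations"}, "education": "UIUC"}
    print(similarity_vector(me, adil))
    print(overall_similarity(me, adil))
```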
Once we get the top-K recommendations, we also apply application-specific filtering with the goal of leveraging domain knowledge.
For example, it could be that for a ‘Data Engineer’ role you, as a hiring manager, are looking for a candidate like one of your team members, but who is local. Whereas, for all you know, the ideal Data Engineer most similar to the one you are looking for in terms of skills might be working somewhere in India.
To ensure our recommendation quality keeps improving as more and more people use our products, we use explicit and implicit user feedback combined with crowd-sourcing to construct high-quality training and test sets for learning the ‘importance weight vector’. Moreover, a classifier with L1 regularization helps prune out the weakly correlated features. We use this for figuring out which features to match profiles against.
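As an illustration of the L1-pruning idea (we have historically used R, and more recently Mahout, rather than the library below), here is a toy sketch with scikit-learn on synthetic data; the features and labels are made up for illustration only.

```python
# Illustrative sketch of learning the 'importance weight vector' with an
# L1-regularized classifier. In practice labels would come from explicit and
# implicit feedback plus crowd-sourced judgments; this data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each row is a similarity score vector (title, skills, specialty, education, noise).
X = rng.random((1000, 5))
# Synthetic labels: a "good match" depends mostly on skills and title similarity.
y = (0.5 * X[:, 1] + 0.3 * X[:, 0] + 0.1 * rng.random(1000) > 0.45).astype(int)

# The L1 penalty drives weights of weakly correlated features toward exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

for name, w in zip(["title", "skills", "specialty", "education", "noise"],
                   clf.coef_[0]):
    print(f"{name:10s} weight = {w:+.3f}")
```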
We just discussed an example. However, the same concepts apply to all the recommendation verticals.
And now, the technologies that drive it all.
The core of our matching algorithm uses Lucene with our custom query implementation.
Lucene does not provide fast real-time indexing. To keep our indices up to date, we use a real-time indexing library on top of Lucene called Zoie.
We provide facets to our members for drilling down and exploring recommendation results. This is made possible by a faceted search library called Bobo.
For storing features and for caching recommendation results, we use a key-value store called Voldemort.
For analyzing tracking and reporting data, we use a distributed messaging system called Kafka.
Of these, Bobo, Zoie, Voldemort and Kafka were developed at LinkedIn and are open sourced. In fact, Kafka is an Apache Incubator project.
Historically, we have used R for model training. We have recently started experimenting with Mahout for model training and are excited about it.
Now Adil will talk about how we leverage Hadoop.
Our approach combines:
a) Offline batch computations on Hadoop, copied to online caches
b) An aggressive online caching policy
c) Online computation when the cache expires
Together, these have scaled our Similar Profiles recommendations while maintaining high precision.
This works well because ‘Similar Profiles’ has:
a) High-latency computation
b) High QPS
c) Not-so-stringent freshness requirements
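As a rough illustration of that serve path, here is a minimal sketch assuming a simple TTL cache and a shared scoring function; the store, TTL and names are illustrative, not the production system.

```python
# Recommendations are read from a cache populated by the offline batch job;
# when an entry has expired or is missing, the same scoring code is run online
# and the cache is refreshed.
import time

CACHE_TTL_SECONDS = 24 * 3600
_cache = {}  # member_id -> (timestamp, recommendations)

def compute_recommendations_online(member_id):
    """Stand-in for the shared scoring code that also runs in the batch job."""
    return [f"profile-{member_id}-{i}" for i in range(3)]

def get_recommendations(member_id):
    entry = _cache.get(member_id)
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                                    # fresh: serve from cache
    recs = compute_recommendations_online(member_id)       # expired or missing
    _cache[member_id] = (time.time(), recs)
    return recs

def load_offline_batch(batch_results):
    """Called when the daily batch output is copied to the online store."""
    now = time.time()
    for member_id, recs in batch_results.items():
        _cache[member_id] = (now, recs)
```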
Our basic blending solution is this: While constructing the query for content based similar profiles, we fetch collaborative filtering recommendations and their scores, and attach them to the query. In the scoring of content based recommendations, we can use the collaborative filtering score as a boost. An alternative approach is a bag of models approach with content and collaborative filtering serving as two of the models in the bag.
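As a rough illustration of the first (boost) approach, here is a toy sketch; the scoring formula and the boost weight are assumptions for illustration, not the production implementation.

```python
# Attach collaborative-filtering scores fetched from the key-value store to the
# query, and use them as a boost on top of the content-based scores.

def blend(content_scores, cf_scores, boost_weight=0.25):
    """content_scores, cf_scores: {candidate_id: score}. Returns a blended ranking."""
    blended = {}
    for cand, score in content_scores.items():
        blended[cand] = score + boost_weight * cf_scores.get(cand, 0.0)
    return sorted(blended.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    content = {"a": 0.82, "b": 0.78, "c": 0.60}
    coviews = {"b": 0.9, "c": 0.4}     # collaborative filtering scores
    print(blend(content, coviews))
```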
In either solution, we need a way to keep collaborative filtering results fresh. If two member profiles were coviewed yesterday, we should be able to use that knowledge today.
We thought more about this problem and realized two important aspects: 1) Coview counts can be updated in batch mode. 2) We can tolerate delay in updating our collaborative filtering results.
These two properties, batch computation and tolerance for delay in impacting the online world, led us to leverage Hadoop to solve this problem. Our production servers produce tracking events every time a member profile is viewed. These tracking events are copied over to HDFS, where every day we use them to batch-compute a fresh set of collaborative filtering recommendations. These recommendations are then copied to online key-value stores, where we use the blending approaches outlined earlier to blend collaborative filtering and content-based recommendations.
The lesson we derive from this case study is that by leveraging Hadoop, we were able to experiment with collaborative filtering in Similar Profiles without significant investment in an online system to keep collaborative filtering results fresh. Once our proof of concept was successful, we could always go back and see whether reducing the lag between a profile co-view and its impact on Similar Profiles, by building an online system, would be useful. If it is, we could invest in a non-Hadoop system. However, by leveraging Hadoop, we were able to defer that decision until the point when we had data to back up our assumptions.
As our next case study, let’s take a look at how we approached solving this problem for our users. Let’s say we come up with an algorithm that assigns each LinkedIn member a ‘JobSeeker’ score, which indicates how open she is to taking the next step in her career. As we said already, this feature would be very useful for ‘Similar Profiles’. However, the utility of this feature would be directly related to how many members have this score, a.k.a. coverage. The key challenge we faced was that since ‘Similar Profiles’ was already in production, we had to add this new feature while continuing to serve recommendations. We call this problem “grandfathering”.
So, we scrap the naïve solution and look for one that will batch-update all members with this score in all data centers while serving traffic.
A second-pass solution is to run a ‘batch’ feature extraction pipeline in parallel with the production feature extraction pipeline. This batch pipeline queries the database for all members and adds a ‘job seeker score’ to every member. This solution ensures that we have an upper bound on the time it takes to grandfather all members with the job seeker score. It works great for small startups whose member base is in the few-million range.
However, the downsides of this solution at LinkedIn’s scale are:
1) It adds load on the production databases serving live traffic.
2) To avoid that load, we end up throttling the batch solution, which in turn makes the batch pipeline run for days or weeks. This slows down the rate of batch updates.
3) Lastly, the two factors above combine to make grandfathering a ‘dreaded word’. You only end up grandfathering ‘once a quarter’, which is clearly not helpful in innovating faster.
Using Hadoop, we take a snapshot of member profiles in production, move it to HDFS, grandfather members with a ‘job seeker score’ in a matter of hours, and copy this data back online. This way we can grandfather all members with a job seeker score in hours.
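To make the pattern concrete, here is a toy sketch of grandfathering from a snapshot rather than the live databases; the snapshot format and scoring heuristic are hypothetical placeholders, not the real model.

```python
# Score every member from a snapshot of profiles (one JSON object per line)
# instead of querying the live databases, then bulk-load the output online.
import json

def job_seeker_score(profile):
    """Hypothetical stand-in for the real model's score in [0, 1]."""
    signals = "open to opportunities" in profile.get("summary", "").lower()
    return 0.9 if signals else 0.2

def grandfather(snapshot_path, output_path):
    """Read one JSON profile per line, emit 'member_id<TAB>score' per line."""
    with open(snapshot_path) as src, open(output_path, "w") as dst:
        for line in src:
            profile = json.loads(line)
            dst.write(f"{profile['member_id']}\t{job_seeker_score(profile)}\n")
    # The output would then be bulk-loaded into the online key-value store.
```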
The biggest advantage of using Hadoop here is that grandfathering is no longer a ‘dreaded word’. We can grandfather when ready instead of once a quarter, which speeds up innovation.
Our next case study is about evaluating new models and ideas. As a baseline solution, we could always push every model to production. We can A/B test the models with real traffic and see which ones sink and which ones float. The simplicity of this approach is very attractive; however, there are some major flaws with it:
1) For each model, we have to push code to production. This takes up valuable development resources.
2) There is an upper limit on the number of A/B tests one can run at the same time. This can be due to user experience concerns and/or revenue concerns.
3) Since online tests need to run for days before enough data is accumulated to make a call, this approach slows down the rate of progress.
As you can guess, Hadoop provides a very good sandbox for our ideas. We are able to filter out many of the craziest ideas and double down on only those few that show promise. Plus, it allows us to use relatively large gold sets, which gives us strong confidence in our evaluation results.
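As an illustration of how such offline evaluation might look, here is a toy precision@k computation over a tiny hypothetical gold set; the metric choice and the data are for illustration only.

```python
# Score candidate models against a gold set offline (precision@k here) and
# only promote the promising ones to an online A/B test.

def precision_at_k(recommended, relevant, k=3):
    """Fraction of the top-k recommendations that appear in the gold relevant set."""
    top_k = recommended[:k]
    return sum(1 for r in top_k if r in relevant) / k

if __name__ == "__main__":
    gold = {"query-1": {"a", "b"}, "query-2": {"x"}}
    model_outputs = {
        "model-A": {"query-1": ["a", "c", "b"], "query-2": ["x", "y", "z"]},
        "model-B": {"query-1": ["c", "d", "e"], "query-2": ["y", "x", "z"]},
    }
    for model, outputs in model_outputs.items():
        scores = [precision_at_k(outputs[q], gold[q]) for q in gold]
        print(model, "mean precision@3 =", sum(scores) / len(scores))
```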
The key requirement of A/B testing is that the time to evaluate which bucket to send traffic to should ideally be under 1 ms, and at worst a few ms.
So, we go straight to Hadoop. For complex criteria like this, we run over our entire member base on Hadoop every couple of hours, assigning each member to the appropriate bucket for each test. The results of this computation are pushed online, where the problem of A/B testing reduces to, given a member and a test, fetching from the cache which bucket to send the traffic to.
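A minimal sketch of this split, with hypothetical bucketing criteria: the heavy evaluation happens in a periodic offline job, and the online path is a simple cache lookup.

```python
# Complex bucketing criteria are evaluated offline (on Hadoop in practice);
# online, assigning a bucket is just a key-value lookup.

def assign_buckets_offline(members, test_name):
    """Run over the full member base every few hours; criteria are illustrative."""
    assignments = {}
    for m in members:
        if m["connections"] >= 100 and m["industry"] == "software":
            assignments[(m["id"], test_name)] = "treatment"
        else:
            assignments[(m["id"], test_name)] = "control"
    return assignments   # pushed to the online key-value store / cache

def bucket_for(cache, member_id, test_name):
    """Online path: a sub-millisecond lookup, defaulting to control."""
    return cache.get((member_id, test_name), "control")

if __name__ == "__main__":
    members = [{"id": 1, "connections": 250, "industry": "software"},
               {"id": 2, "connections": 30, "industry": "finance"}]
    cache = assign_buckets_offline(members, "new-similar-profiles-model")
    print(bucket_for(cache, 1, "new-similar-profiles-model"))
```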
Our last case study involves the last step of a model deployment process: tracking and reporting. These two steps allow us to have an unbiased, data-driven way of saying whether or not a new model is successful in lifting our desired metrics: CTR, revenue, engagement, or whatever else one is interested in. Our production servers generate tracking events every time a recommendation is impressed, clicked or rejected by the member.
Joining and reporting on these streams online has two limitations:
a) One cannot look back beyond a certain time window.
b) As the number of tracking streams increases, it becomes harder and harder to join them online.
To increase the time window, we would have to spend significant engineering resources architecting a scalable online reporting system, which would be overkill. Instead, we placed our bet on Hadoop. All tracking events at LinkedIn are stored on HDFS. Add to this data Pig or plain MapReduce, and we can do arbitrary k-way joins across billions of rows to come up with reports that look as far into the past as we want.
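As a toy, in-memory stand-in for what such a Pig or MapReduce job computes, here is a join of hypothetical impression and click streams into a CTR report; the field names are illustrative.

```python
# Join impression and click tracking events and report CTR per model.
from collections import Counter

def ctr_report(impressions, clicks):
    """impressions/clicks: iterables of event dicts carrying a 'model' field."""
    shown = Counter(e["model"] for e in impressions)
    clicked = Counter(e["model"] for e in clicks)
    return {model: clicked[model] / shown[model] for model in shown}

if __name__ == "__main__":
    impressions = [{"model": "A"}, {"model": "A"}, {"model": "B"}, {"model": "B"}]
    clicks = [{"model": "A"}]
    print(ctr_report(impressions, clicks))   # {'A': 0.5, 'B': 0.0}
```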
We can say without any hesitation that Hadoop has now become an integral part of the whole life-cycle of our workflow starting from prototyping a new idea to eventually tracking the impact of that idea.
By leveraging Hadoop, we were able to continuously improve the quality and scale the computations.
Hence, these 2 ideas helped us ‘scale innovation’ at LinkedIn.