
Building the next generation of job search at LinkedIn

Job search is a critical part of LinkedIn, helping our 1.2 billion members find the right opportunities. With recent advances in AI, we now have the opportunity to fundamentally transform how people discover their next role.

Traditionally, job seekers have relied on keyword-based search, which often amounts to simply searching for a job title. While this can be effective, it has limitations: it cannot understand the nuances of a search, and it relies on the job seeker to supply the exact keywords that will produce the best results.

Recognizing these shortcomings of traditional keyword-based search, and empowered by large language model (LLM) technology, we embarked on a journey to build a new AI-powered job search. This new search experience lets members describe the job they want in their own words and receive results that include jobs they might not have considered and that are more closely aligned with their ideal job. The tool creates a more intuitive experience for a greater number of job seekers, including those new to the workforce who may not know where to start with keywords. 

Building the next generation of job search at LinkedIn represents a significant leap beyond traditional keyword-driven methodologies. Harnessing the power of advanced LLMs, embedding-based retrieval, and intelligent distillation techniques, we've created a system capable of deeply understanding job seekers’ nuanced intents and delivering highly personalized and relevant results at unprecedented scale.

In this blog, we delve into how we built this tool, from defining the high-level problem to building and scaling the underlying system with an LLM and GPU-first approach.

Beyond keywords: semantic understanding

For years, job search relied heavily on precise keyword queries and search filters matching job descriptions. This method often fails to capture nuanced user intent. We want to unlock a future where job seekers describe their desired roles naturally, trusting the system to understand and deliver relevant results.

Consider complex queries like:

  • "Find software engineer jobs in Silicon Valley or Seattle that are mostly remote but not from sourcing companies, posted recently, with above median pay."
  • "I want to make a difference in the environment using my marketing background, ideally working with people I’ve worked with before."

These examples wouldn’t be possible with traditional keyword search and underline the need for semantic comprehension, handling complex criteria, and leveraging external context data. Our AI-powered job search addresses this, moving beyond keyword constraints toward natural language understanding. However, building a system that can truly understand the nuances of natural language queries and match them against millions of job postings tens of thousands of times every second presented significant technical challenges.

Model capacity: Our existing candidate selection and ranking models, based on fixed taxonomy-based methods and older LLM technology, lacked the capacity for deep semantic understanding. We recognized the need to leverage modern LLM architectures, already fine-tuned on trillions of tokens and proven capable in generative AI applications, to achieve the required level of natural language understanding.

Infrastructure demands: Advanced LLMs significantly increased compute requirements, prompting investment in GPU infrastructure, embedding-based retrieval (EBR), and high-dimensional data handling.

The solution: LLM-powered retrieval and ranking

In an ideal world with infinite computational power, we would have a single, very powerful model comparing every search query to every job and ranking them in the ideal order. Obviously, generating such an “ideal ranking” is not cost-effective at scale, so our goal is to build a system that balances cost and compute efficiency with accuracy.

A systems architecture that enables efficient inference is important, and efficiency is achieved by moving as much compute (both training and inference) as possible offline. This means splitting that single powerful model into two steps: first retrieval, then ranking. Our high-level approach was to build a very powerful foundational “teacher” model that could accurately score a user’s query against a job, and then use a variety of fine-tuning and distillation techniques so that the retrieval and ranking models are both aligned with, and closely mimic, the foundational model.

Multi-stage distillation diagram
Figure 1. Multi-stage distillation

This is also a critical step for developer efficiency. Our old system for serving jobs had grown more and more complex over many years of evolution, with each component acting independently of the others because they were optimized for different goals. At one point, nine different stages made up the pipeline for searching and matching a job with a job seeker, and these were frequently duplicated across more than a dozen job search and recommendation channels, making it difficult to identify where relevance problems stemmed from and to make the appropriate fixes. By reducing model pipeline complexity by an order of magnitude and ensuring alignment between each layer by design, we not only improved model performance and interpretability, but also significantly enhanced developer velocity.

In addition to pure semantic matching, the jobs marketplace takes a holistic view on matching members and jobs:

  • Semantic textual similarity: Measuring semantic alignment between queries and job postings, aligning job relevance with LinkedIn’s quality standards.
  • Engagement-based prediction: Estimating user engagement likelihood (clicking a job, applying to a job).
  • Value-based prediction: Estimating whether the member matches the qualifications the job poster is looking for, and the probability of being shortlisted and hired.

To combine these signals, we use the common technique of multi-objective optimization (MOO). To keep retrieval and ranking aligned, retrieval must rank documents using the same MOO objective that the ranking stage uses, while staying simple and avoiding unnecessary burden on AI developer productivity.
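To make this concrete, here is a minimal sketch of how a MOO score might blend these signals; the weights and signal names below are illustrative assumptions, not our production configuration.

```python
# Minimal sketch of multi-objective score blending; weights and signal names
# are illustrative, not production values.
from dataclasses import dataclass

@dataclass
class MatchSignals:
    semantic_similarity: float   # query <-> job textual alignment, in [0, 1]
    p_engagement: float          # predicted probability of click/apply
    p_hire_value: float          # predicted probability of shortlist/hire

def moo_score(s: MatchSignals,
              w_sim: float = 0.5,
              w_eng: float = 0.3,
              w_val: float = 0.2) -> float:
    """Blend the objectives into one score shared by retrieval and ranking,
    so the two stages stay aligned by construction."""
    return w_sim * s.semantic_similarity + w_eng * s.p_engagement + w_val * s.p_hire_value

# Example: score a single candidate job for a query
print(moo_score(MatchSignals(semantic_similarity=0.82, p_engagement=0.35, p_hire_value=0.12)))
```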

Fueling the system: generating quality training data

The success of our semantic search hinges on the quality and quantity of training data. We recognized that relying solely on existing click log data would not be sufficient to capture the nuances of future natural language query patterns. Therefore, we proposed a strategy of augmenting our real data with synthetic data. This synthetic data was initially generated using advanced LLMs with prompt templates designed for semantic textual similarity tasks and aligned with our new 5-point grading policy.

Even determining the policy was a challenge in and of itself, because we needed to be explicit about dozens of different cases and how to properly grade them. If a human cannot explain why something is rated “good” vs “excellent,” how could an LLM? Once this was done, we aimed to synthesize records containing natural language queries, members, job postings, and policy grades, reflecting the expected user behavior in a semantic search environment. Initially, we had a group of human evaluators use this product policy to “grade” these queries and job postings. However, this was incredibly time consuming, so to scale further we built an LLM fine-tuned on human annotations to apply the learned product policies and grade arbitrary query-member-job records. This approach allows for automated and scalable data annotation (to the tune of millions or tens of millions of grades per day, well beyond what is possible with humans). It lets us not only continually train our models so they improve over time, but also safeguard the experience, ensuring it stays highly relevant as we test and experiment with new features.
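As a rough illustration of this LLM-as-judge setup, the sketch below shows how a grading prompt and response schema might be structured; the policy text, JSON schema, and call_llm stand-in are hypothetical, not our actual prompts or serving APIs.

```python
# Hedged sketch of an LLM-as-judge grader applying a 5-point relevance policy.
# The policy text, schema, and call_llm() stand-in are illustrative assumptions.
import json

GRADING_POLICY = """Rate how well the job matches the member's query on a 1-5 scale:
5 = excellent match, 4 = good, 3 = fair, 2 = poor, 1 = not relevant.
Apply the rubric literally; explain the single deciding factor in one sentence."""

def build_grading_prompt(query: str, member_summary: str, job_posting: str) -> str:
    return (
        f"{GRADING_POLICY}\n\n"
        f"Member query: {query}\n"
        f"Member summary: {member_summary}\n"
        f"Job posting: {job_posting}\n\n"
        'Respond as JSON: {"grade": <1-5>, "reason": "<one sentence>"}'
    )

def grade_record(query: str, member_summary: str, job_posting: str, call_llm) -> dict:
    """call_llm stands in for the fine-tuned grader endpoint: it takes a prompt
    string and returns the model's text completion."""
    raw = call_llm(build_grading_prompt(query, member_summary, job_posting))
    return json.loads(raw)  # e.g. {"grade": 4, "reason": "..."}
```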

Architecting semantic understanding

Retrieval of jobs that match the user’s query requires more than generating an embedding based on a raw user query and doing embedding-based retrieval. Semantic understanding of the user query is needed to augment the input into query embedding generation and create additional strict filtering queries to satisfy the user criteria.

We developed our query engine to fulfill this purpose. The query engine constructs the appropriate retrieval strategy by classifying the user intent, fetching external data such as profile and preferences needed for an effective search, and performing named entity recognition to tag strict taxonomy elements needed for filtering. As an example, if a job seeker says “jobs in New York Metro Area where I have a connection already,” the query engine resolves “New York Metro Area” to a geo ID and invokes our graph service to look up the company IDs where they have connections. These IDs are then passed along to the search index as strict filters, while non-strict criteria are captured and passed along in the form of a query embedding. To accomplish these goals, we leverage the Tool Calling pattern when querying our fine-tuned LLM.
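To illustrate the idea, here is a simplified sketch of the kind of structured retrieval plan the query engine produces: strict filters resolved through tool calls, plus the residual free-text intent that feeds query embedding generation. All names and the string-matching shortcuts are hypothetical; in production, the fine-tuned LLM performs the entity tagging and decides which tools to invoke.

```python
# Illustrative sketch of a query-engine output: strict filters resolved via
# tools (geo lookup, connection graph), plus the residual free-text intent
# that is embedded for semantic retrieval. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class RetrievalPlan:
    embedding_text: str                                    # non-strict intent, fed to the query embedder
    geo_ids: list[int] = field(default_factory=list)       # strict location filter
    company_ids: list[int] = field(default_factory=list)   # strict "has a connection" filter

def build_plan(user_query: str, resolve_geo, lookup_connection_companies) -> RetrievalPlan:
    """resolve_geo and lookup_connection_companies stand in for the tool calls
    the fine-tuned LLM would invoke (geo resolution, graph service)."""
    plan = RetrievalPlan(embedding_text=user_query)
    if "New York Metro Area" in user_query:          # in practice the LLM tags entities
        plan.geo_ids.append(resolve_geo("New York Metro Area"))
    if "connection" in user_query:
        plan.company_ids.extend(lookup_connection_companies())
    return plan
```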

Beyond understanding and refining queries, the query engine generates personalized suggestions to help members build and iterate on their job searches. These suggestions serve two key purposes: 

  1. Exploring potential facets and attributes to clarify ambiguous or broad queries.
  2. Exploiting high-accuracy attributes to refine and narrow search results after they have been retrieved.

For instance, a member typing "project management jobs" might be prompted with additional facets like industry, seniority level, or certification requirements that further specify their intent before searching. These attributes are mined and extracted from millions of job postings offline, stored in a vector database, and passed to the query engine LLM via the RAG pattern. Conversely, once results are returned, precise filters, such as specific company tags, allow members to drill down efficiently into relevant opportunities. This dynamic suggestion model ensures members not only articulate their needs clearly but also explore opportunities they might not have explicitly considered.
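As a simplified picture of that RAG step, the sketch below retrieves the offline-mined facets closest to a query embedding so they can ground the suggestion prompt; the in-memory "vector database" and array shapes are illustrative only.

```python
# Minimal sketch of RAG-style facet suggestion: retrieve the offline-mined
# attributes closest to the query embedding and hand them to the suggestion LLM.
# The in-memory arrays stand in for a real vector database.
import numpy as np

def top_k_facets(query_vec: np.ndarray,
                 facet_vecs: np.ndarray,      # (num_facets, dim), L2-normalized offline
                 facet_names: list[str],
                 k: int = 5) -> list[str]:
    scores = facet_vecs @ query_vec           # cosine similarity when both sides are normalized
    top = np.argsort(-scores)[:k]
    return [facet_names[i] for i in top]

# The returned facets (e.g. "industry: healthcare", "seniority: entry level")
# are injected into the query-engine prompt as grounding context.
```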

Scaling the query engine involves a variety of techniques to handle a high volume of user interactions efficiently while keeping latency manageable. Some of these include:

  • Non-personalized queries, being broadly applicable, are cached separately from personalized queries, which depend heavily on individual profiles and network contexts (see the caching sketch after this list).
  • KV caching in the LLM serving engine significantly reduces computation overhead by cutting duplicate work across requests, allowing faster responses for frequently invoked model calls.
  • Optimization of tokens in the response schema (verbose XML/JSON reduced to minimized equivalents).
  • Reduction in model size through distillation and fine-tuning.
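The caching split in the first bullet might look something like the sketch below, where non-personalized query understanding is keyed purely on the normalized query text; the heuristic, cache, and function names are illustrative stand-ins for the real intent classifier and distributed cache.

```python
# Illustrative cache-keying sketch: non-personalized query understanding can be
# shared across members, personalized results cannot. lru_cache stands in for
# whatever distributed cache is actually used.
from functools import lru_cache
from typing import Optional

def run_query_engine(normalized_query: str, member_id: Optional[int] = None) -> dict:
    """Stand-in for the expensive LLM-backed query-engine call."""
    return {"query": normalized_query, "member_id": member_id}

def is_personalized(query: str) -> bool:
    # Toy heuristic; in production the intent classifier makes this decision.
    return any(t in query.lower() for t in ("my ", "i've", "connection", "network"))

@lru_cache(maxsize=100_000)
def understand_non_personalized(normalized_query: str) -> dict:
    return run_query_engine(normalized_query)           # shared across all members

def understand(query: str, member_id: int) -> dict:
    normalized = " ".join(query.lower().split())
    if is_personalized(normalized):
        return run_query_engine(normalized, member_id)  # bypasses the shared cache
    return understand_non_personalized(normalized)
```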

Scalable GPU infrastructure for retrieval

The modern industry-standard approach for search is embedding-based retrieval. Within this, there are two fundamental approaches: approximate nearest neighbor search or exhaustive nearest neighbor search. Each approach, along with related index structures like HNSW, IVFPQ, and locality-sensitive hashing, has its own pros and cons. Over the past few years, we have been using an approximate nearest neighbor search index at LinkedIn, as discussed here. However, we still struggled to meet our needs for low latency, high index turnover (jobs are often only live for a few weeks), maximizing liquidity, and the ability to apply complex hard filters. All of these were needed for AI-powered Job Search, so we decided to explore alternative ideas.

One of the simplest approaches is exhaustive search: scanning over a flat list of documents laid out in contiguous memory and computing the distance between the query embedding and each document embedding (i.e., exhaustive nearest neighbor search). Typically, this kind of O(n) operation is slow and only feasible for very small datasets, but we knew that GPUs can attain astronomical performance when they do the exact same thing again and again (no pointer chasing, no constant passing of data between the GPU and CPU, etc.). After spinning up a proof of concept, we found that by doing only dense matrix multiply operations, investing in fused kernels, and lightly sub-partitioning the data layout, we were able to beat more complex index structures on latency while retaining the simplicity of managing a flat list of vectors. In the real world, an O(n) approach can sometimes beat an O(log n) one when the constant factors are sufficiently different.
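The core of that proof of concept boils down to a single dense matrix multiply followed by a top-k, as in the sketch below; the sizes are illustrative (the production index holds millions of postings), and the real system layers fused kernels, sub-partitioning, and hard-filter masking on top of this idea.

```python
# Sketch of exhaustive (flat) nearest-neighbor search as one dense matmul + top-k
# on the GPU. Sizes are illustrative and kept small enough to run anywhere.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

num_jobs, dim = 100_000, 256
job_embs = F.normalize(torch.randn(num_jobs, dim, device=device), dim=-1)  # flat, contiguous index

def search(query_embs: torch.Tensor, k: int = 100):
    """query_embs: (batch, dim), already normalized. Returns (scores, job indices)."""
    scores = query_embs @ job_embs.T          # (batch, num_jobs) dense matrix multiply
    return torch.topk(scores, k, dim=-1)      # exact top-k, no approximation

queries = F.normalize(torch.randn(8, dim, device=device), dim=-1)
top_scores, top_ids = search(queries)
```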

After figuring out the right index to meet all of our needs, we were able to place it into our existing ecosystem to index the offline-generated embeddings for job postings and serve the K closest job postings for a given search query in just a few milliseconds per query. Then the problem became the quality of the AI model itself.

To improve the quality of this embedding-based retrieval, we fine-tuned open source models on millions of pairwise query-job examples, optimizing for both retrieval quality and score calibration. Key characteristics of the fine-tuning approach included:

  • RL training loop: During training, the model retrieves top-K jobs per query. These are scored in real-time by the teacher model, which acts as a reward model to guide updates. This setup allows us to train the model directly on what good retrieval looks like, rather than relying only on static labeled data.
  • Composite loss function: We jointly optimized for pairwise (contrastive) accuracy, list-level ranking (ListNet), a KL divergence loss (to prevent catastrophic forgetting across iterations), and score regularization to maintain a well-calibrated output distribution that cleanly separates Good, Fair, and Poor matches (sketched after this list).
  • Infra & scaling: Training used Fully Sharded Data Parallel (FSDP), BF16 precision, and cosine learning rate scheduling to maximize throughput and stability.
  • Evaluation: As we iterate on models, we run automated evaluations to measure aggregate and per-query-type changes in relevance metrics between models.
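A minimal sketch of such a composite loss is shown below; the exact formulations and term weights are illustrative assumptions rather than our production recipe.

```python
# Hedged sketch of a composite retrieval loss: contrastive + ListNet-style
# listwise term + KL anchor to the previous checkpoint + score regularization.
import torch
import torch.nn.functional as F

def composite_loss(student_scores: torch.Tensor,     # (batch, list_size) scores for retrieved jobs
                   teacher_scores: torch.Tensor,     # (batch, list_size) reward-model scores
                   prev_model_scores: torch.Tensor,  # (batch, list_size) previous checkpoint's scores
                   w_pair=1.0, w_list=1.0, w_kl=0.1, w_reg=0.01):
    # Pairwise (contrastive) term: the teacher's best job should outscore the rest.
    pos = teacher_scores.argmax(dim=-1)
    pairwise = F.cross_entropy(student_scores, pos)

    # List-level term (ListNet-style): match the teacher's full ranking distribution.
    listnet = F.kl_div(F.log_softmax(student_scores, dim=-1),
                       F.softmax(teacher_scores, dim=-1), reduction="batchmean")

    # KL to the previous model's distribution, to limit catastrophic forgetting.
    anchor = F.kl_div(F.log_softmax(student_scores, dim=-1),
                      F.softmax(prev_model_scores, dim=-1), reduction="batchmean")

    # Score regularization keeps outputs in a calibrated, well-separated range.
    reg = (student_scores ** 2).mean()

    return w_pair * pairwise + w_list * listnet + w_kl * anchor + w_reg * reg
```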

This RL-based fine-tuning approach leads to significantly better retrieval performance, more interpretable match scores, and a faster iteration loop for future model improvements.

Enhanced ranking with cross-encoder models

Although we get a reasonable set of candidates from retrieval, the low-rank model used there is not sufficient to meet our relevance bar for the product. For classic job search, we already use an AI model based on the Deep and Cross Network architecture to estimate engagement and relevance from dozens of features, taking into account the sequence of behaviors from a job seeker. However, compared with our Foundational (Teacher) Model, that model is not able to meet our relevance bar and does not have the ability to learn from the Teacher. Because the Teacher model is too large, slow, and expensive to run for every single request, we built a “small” language model (SLM) that learns from the Teacher while retaining close performance.

Below is a high-level sketch of the model architecture. The model takes the text of the job and the text of the query as input, and outputs a relevance score.

Cross-Encoder Architecture
Figure 2. Cross-Encoder Architecture

Our approach is to use supervised distillation (paper) to distill from the Teacher model. The benefit of supervised distillation is that the student (SLM) learns not only from the labels produced by the teacher, but also from the teacher’s full logits, which are provided during training and carry richer information than the soft label alone. By combining this distillation with techniques like model pruning / cutting and serving optimizations such as intelligent KV-caching and sparse attention, we’re able to meet our threshold for relevance accuracy while reducing our reliance on dozens of feature pipelines and aligning our entire stack.
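For illustration, a minimal version of this kind of distillation objective is sketched below, blending a hard-label term with a soft-label KL against the teacher's logits; the temperature and mixing weight are assumptions, not our production settings.

```python
# Minimal sketch of supervised distillation for the cross-encoder student:
# hard-label cross-entropy blended with a soft-label KL against the teacher's
# full logits. Temperature and mixing weight are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,  # (batch, num_grades)
                      teacher_logits: torch.Tensor,  # (batch, num_grades)
                      labels: torch.Tensor,          # (batch,) graded relevance classes
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    return alpha * hard + (1 - alpha) * soft
```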

Our journey with AI-powered job search, though filled with complex technical challenges—from crafting quality synthetic training data to scaling GPU infrastructure—has established a robust foundation for future innovation. As we continue refining this semantic-first approach, we remain committed to empowering job-seekers by making the job discovery process more intuitive, inclusive, and effective, ultimately bringing us closer to achieving LinkedIn’s vision of creating economic opportunity for every member of the global workforce.

Acknowledgements

To the more than 100 team members who’ve contributed to AI Job Search across infrastructure, AI modeling, user experience, and data science: thank you for your creativity, grit, and teamwork. This milestone belongs to all of us.