Revenue Attribution Report: how we used homomorphic encryption to enhance privacy and cut network congestion by 99%
A recent LinkedIn survey revealed that 87% of B2B marketers say it’s getting harder to measure the long-term impact of a campaign. LinkedIn’s Revenue Attribution Report (RAR) helps solve this challenge, enabling advertisers to measure the business value of their LinkedIn marketing campaigns, such as LinkedIn ads’ attributed sales revenue and win rates. Customers rely on it to bridge the gap between marketing efforts and real business outcomes, making it a trusted tool for guiding ROI measurement.
To protect the large datasets of members’ LinkedIn ad activity data and advertiser data that RAR analyzes, we encrypt all data on a real-time distributed data store (Apache Pinot) and store the keys securely in a central key management system. Generating the RAR requires computing aggregate queries on these datasets, such as the total revenue and count grouped by status, where entries in some of the columns are encrypted. Previously, our system would retrieve all the records from Pinot, decrypt them row by row, compute the aggregates on the plaintext data, and apply privacy guardrails (such as adding differentially private noise to the aggregates to prevent member re-identification) before surfacing the results back to the advertiser.
The process was slow and resource-intensive, so we began exploring ways to enhance the system’s performance and security by avoiding the decryption of each row while processing a query.
This blog details how our new system computes queries over encrypted records without decrypting each sensitive individual row, while handling large-scale data and maintaining low latency. To do so, we leverage privacy-enhancing technologies such as additive symmetric homomorphic encryption, as introduced in this paper. The new approach has reduced network congestion by over 99% and unlocked the use of Pinot’s in-built aggregation, enabling a wider range of reporting capabilities.
RAR
High-level overview
RAR is a report in Business Manager that allows advertisers to attribute their Customer Relationship Management (CRM) data to ad activities on LinkedIn, helping them understand how their marketing investments influence business outcomes through metrics such as revenue influenced, return on ad spend, and pipeline generated. The datasets used to generate the report are created by combining CRM data from advertisers with LinkedIn's first-party ad engagement data. These datasets are then stored on Pinot and consist of two main tables, each containing millions of records: Leads and Opportunities. The system responsible for generating the report (hereafter referred to as the API server) has to handle a high volume of queries per second (QPS) from advertisers, with each query aggregating several thousand records.
Some of the columns in these tables, such as advertiser revenue, are sensitive. Protecting member and customer privacy necessitates that they be handled with strong security and privacy guardrails.
Previous approach to generating reports
Previously, the system encrypted all the sensitive columns using AES encryption, created DEKs (data encryption keys) using envelope encryption, and stored the encrypted tables on Pinot. Each advertiser has unique keys stored securely in LinkedIn’s central key management system, which includes robust guardrails and access control policies. Then, for each query issued by the advertiser, the API servers running RAR did the following:
- Retrieve all records corresponding to the query from Pinot.
- Decrypt the sensitive columns in each row.
- Perform aggregation operations on all records post-decryption.
- Post-process the aggregated sum to include additional guardrails like adding differentially private noise.
- Return the response to the advertiser.
All of the above steps were performed in real time, making queries expensive in terms of both CPU and network traffic. Moreover, our API server had to handle sensitive data in plaintext after decryption and before aggregation. This setup also required us to re-implement online analytical processing (OLAP) style aggregation functions that were already present in Pinot, causing unnecessary engineering toil.
Figure 1 illustrates a simple query example representing the previous system, considering two encrypted sensitive columns (Revenue and Count). All data exchanged between Pinot and our API server, or between the advertiser/web portal and our server, is encrypted using the transport layer security (TLS) protocol. For simplicity, further details are omitted.
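To make the cost of this flow concrete, here is a minimal Python sketch of the row-by-row pattern described above. It is an illustration under stated assumptions, not the production implementation: `toy_cipher` is a placeholder that stands in for the AES step (it is not AES and is not secure), the Laplace mechanism is assumed as one common way to add differentially private noise, and all names, keys, and values are hypothetical.

```python
import hashlib
import hmac
import random

MASK_64 = (1 << 64) - 1

def toy_cipher(value: int, row_id: int, key: bytes) -> int:
    """Placeholder for per-row AES: XOR with a keyed pad (NOT real AES, demo only)."""
    digest = hmac.new(key, str(row_id).encode(), hashlib.sha256).digest()
    pad = int.from_bytes(digest[:8], "big")
    return (value ^ pad) & MASK_64  # XOR is its own inverse, so this also decrypts

def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two Exp(1) samples follows a Laplace(0, 1) distribution."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def previous_flow(encrypted_rows, key, noise_scale, rng):
    """Fetch every row, decrypt each one, aggregate in plaintext, then add noise."""
    total = 0
    for row_id, ciphertext in encrypted_rows:         # all matching rows cross the network
        total += toy_cipher(ciphertext, row_id, key)  # one decryption per row
    return total + laplace_noise(noise_scale, rng), len(encrypted_rows)

# Hypothetical data: three revenue values for one advertiser.
key = b"demo-advertiser-key"
revenues = [100, 250, 75]
rows = [(i, toy_cipher(v, i, key)) for i, v in enumerate(revenues)]
noisy_total, rows_decrypted = previous_flow(rows, key, 5.0, random.Random(7))
print(rows_decrypted, round(noisy_total, 2))
```

Both the number of decryptions and the amount of data transferred grow with the number of matching rows, which is the cost the rest of this post sets out to remove.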
Additive symmetric homomorphic encryption
To improve this experience, we explored the privacy-enhancing technology of additive symmetric homomorphic encryption (ASHE), introduced by researchers from Microsoft and their collaborators in this paper. An ASHE scheme allows a new ciphertext to be computed from any set of ciphertexts without knowing the secret key; the result is an encryption of the sum of the underlying messages. It offers significant performance advantages over public key additive homomorphic encryption systems like the Paillier cryptosystem, as elaborated upon in the paper, making it a powerful alternative when the data owner and analyzer are the same entity.
Unlike traditional encryption or public key homomorphic encryption, ASHE ties message encryption to a tag/identifier. Operating on the corresponding ciphertext, either to perform summation or decryption, requires knowledge of this tag. In more detail, the structure of an ASHE scheme is as follows:
1. ASHE.Setup(): Generates a secret key sk.
2. ASHE.Encrypt(msg, id, sk): An algorithm that takes as input a message msg, an identifier id and the secret key sk to generate a ciphertext ct.
3. ASHE.Decrypt(ct, id, sk): Outputs a plaintext msg.
4. ASHE.Add((ct1, id1), (ct2, id2)): Outputs a ciphertext and identifier tuple (ct3, id3) such that:
ASHE.Decrypt(ct3, id3, sk) =
ASHE.Decrypt(ct1, id1, sk) + ASHE.Decrypt(ct2, id2, sk).
While we will not delve into the specifics of the encryption scheme's construction, we will extract and highlight some notable observations and properties of the scheme that will be instrumental for our work.
Properties
- When ciphertexts are represented with a standard datatype such as Long, ASHE.Add((ct1, id1), (ct2, id2)) needs no custom function: the resulting ciphertext is the regular sum of the input ciphertexts, that is, ct3 = ct1 + ct2.
- The output identifier id3 is created by concatenating id1 and id2. The paper also describes several ways to compress the identifiers, but we ignore that detail for this exposition.
- The same identifier can be reused to encrypt multiple values that belong to the same record. That is, the entries in multiple columns of a row can be encrypted under one identifier, as long as each column uses a different encryption key.
- ASHE.Add extends naturally to summing more than two ciphertexts.
For the rest of this blog, we use an ASHE scheme as an abstract system with the above structure and properties.
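For readers who prefer code, the following Python sketch shows one toy scheme that satisfies the structure and properties above. It is an assumption-laden illustration, not the construction from the paper and not the production implementation: it masks each message with a pseudorandom value derived from the key and identifier (HMAC-SHA256 is our choice here), works modulo 2^64 to mimic a Long, and represents a concatenated identifier as a Python list. It handles non-negative messages only, and the function names simply mirror the abstract interface.

```python
import hashlib
import hmac
import secrets

MODULUS = 1 << 64  # ciphertexts fit in a 64-bit Long

def setup() -> bytes:
    """ASHE.Setup(): generate a secret key sk."""
    return secrets.token_bytes(32)

def _mask(sk: bytes, ident: str) -> int:
    """Pseudorandom 64-bit mask tied to one identifier (assumed PRF: HMAC-SHA256)."""
    digest = hmac.new(sk, ident.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def encrypt(msg: int, ident: str, sk: bytes) -> int:
    """ASHE.Encrypt(msg, id, sk): hide msg behind the mask for this identifier."""
    return (msg + _mask(sk, ident)) % MODULUS

def add(a, b):
    """ASHE.Add: plain sum of ciphertexts, concatenation of identifiers. No key needed."""
    (ct1, ids1), (ct2, ids2) = a, b
    return (ct1 + ct2) % MODULUS, ids1 + ids2

def decrypt(ct: int, ids: list, sk: bytes) -> int:
    """ASHE.Decrypt(ct, id, sk): remove the mask of every identifier that was summed."""
    return (ct - sum(_mask(sk, i) for i in ids)) % MODULUS

# Same identifier per row, different key per column (Revenue and Count).
sk_revenue, sk_count = setup(), setup()
rev_1 = (encrypt(1200, "row-1", sk_revenue), ["row-1"])
rev_2 = (encrypt(800, "row-2", sk_revenue), ["row-2"])
cnt_1 = (encrypt(1, "row-1", sk_count), ["row-1"])
cnt_2 = (encrypt(1, "row-2", sk_count), ["row-2"])

rev_sum = add(rev_1, rev_2)
cnt_sum = add(cnt_1, cnt_2)
print(decrypt(*rev_sum, sk_revenue), decrypt(*cnt_sum, sk_count))  # prints "2000 2"
```

Note that `add` never touches the secret key, which is what lets an untrusted data store perform the summation.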
New system for RAR
To encrypt sensitive columns that get aggregated as part of queries, we now use ASHE instead of AES encryption. These ASHE-encrypted columns, along with their associated identifiers (as part of a new column), are now stored on Pinot. As before, each advertiser has unique keys which are securely stored in LinkedIn’s central key management system.
To process advertisers’ queries, we issue queries to Pinot that aggregate directly on these ASHE-encrypted columns and return an encrypted result, together with the concatenated identifiers of the aggregated rows (using Pinot’s in-built functions). At a high level, here are the steps run by the API server for each query:
- Issue modified aggregation query to Pinot.
- Pinot runs this query, which includes:
  - Aggregating the results of all necessary (encrypted) columns.
  - Concatenating the associated identifiers using Pinot’s in-built function ArrayAgg.
- API server decrypts the aggregated result. Decryption no longer happens at the record level; instead, there is one decryption per returned column per query.
- Post-process the aggregated sum to include additional guardrails, like adding differentially private noise.
- Return the response to the advertiser.
Figure 2 shows a pictorial representation of the new system with two sensitive columns (Revenue and Count).
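The steps above can be sketched in Python as follows. This is a simulation under stated assumptions rather than the actual service code: `simulated_pinot_aggregate` stands in for the aggregation that Pinot runs (a sum per encrypted column plus the collected identifiers), the masking scheme is the same kind of toy ASHE-style construction with HMAC-SHA256 as an assumed PRF, and the table layout, key names, and values are hypothetical. The differential privacy step is omitted to keep the example short.

```python
import hashlib
import hmac

MODULUS = 1 << 64

def mask(sk: bytes, ident: str) -> int:
    """Toy per-identifier mask (assumed PRF: HMAC-SHA256 truncated to 64 bits)."""
    return int.from_bytes(hmac.new(sk, ident.encode(), hashlib.sha256).digest()[:8], "big")

def encrypt(msg: int, ident: str, sk: bytes) -> int:
    return (msg + mask(sk, ident)) % MODULUS

def simulated_pinot_aggregate(rows):
    """Data-store side: sums ciphertext columns and collects identifiers, with no keys."""
    return {
        "revenue_ct": sum(r["revenue_ct"] for r in rows) % MODULUS,
        "count_ct": sum(r["count_ct"] for r in rows) % MODULUS,
        "ids": [r["id"] for r in rows],
    }

def decrypt_aggregate(ct: int, ids, sk: bytes) -> int:
    return (ct - sum(mask(sk, i) for i in ids)) % MODULUS

def api_server_report(rows, keys):
    """API-server side: one aggregate row comes back, one decryption per column."""
    agg = simulated_pinot_aggregate(rows)
    return {
        "revenue": decrypt_aggregate(agg["revenue_ct"], agg["ids"], keys["revenue"]),
        "count": decrypt_aggregate(agg["count_ct"], agg["ids"], keys["count"]),
    }

# Hypothetical advertiser keys (one per column) and three stored records.
keys = {"revenue": b"demo-revenue-key", "count": b"demo-count-key"}
table = [
    {"id": f"row-{n}",
     "revenue_ct": encrypt(revenue, f"row-{n}", keys["revenue"]),
     "count_ct": encrypt(1, f"row-{n}", keys["count"])}
    for n, revenue in enumerate([1200, 800, 500])
]
print(api_server_report(table, keys))  # prints {'revenue': 2500, 'count': 3}
```

In this sketch the identifier list still grows with the number of aggregated rows; as noted earlier, the paper describes ways to compress identifiers, which we do not reproduce here.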
Extensions
In our system, we extend the above example to encrypt multiple sensitive columns that need to be aggregated. We also consider more complicated queries of the form “Select Sum(Revenue) GroupBy Status” where Status is also a sensitive column – we encrypt the entries of this column using a deterministic encryption scheme and build on ideas from the SPLASHE encryption scheme from the same paper to defend against frequency attacks.
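As a rough illustration of the grouped query, the Python sketch below groups on a deterministic token and sums ASHE-style ciphertexts within each group. Several things here are assumptions for the example: a keyed hash (HMAC-SHA256) stands in for the deterministic encryption scheme, the API server is assumed to know the set of possible Status values so it can map tokens back to names, and the masking scheme, keys, and data are hypothetical. The sketch deliberately does not include the SPLASHE-inspired defenses against frequency attacks mentioned above; on its own, a deterministic token reveals how often each value occurs.

```python
import hashlib
import hmac
from collections import defaultdict

MODULUS = 1 << 64

def prf(key: bytes, data: str) -> int:
    """Toy 64-bit mask (assumed PRF: HMAC-SHA256)."""
    return int.from_bytes(hmac.new(key, data.encode(), hashlib.sha256).digest()[:8], "big")

def status_token(status: str, key: bytes) -> str:
    """Deterministic: equal Status values always produce equal tokens."""
    return hmac.new(key, status.encode(), hashlib.sha256).hexdigest()

def encrypt_row(row_id: str, status: str, revenue: int, keys) -> dict:
    return {"id": row_id,
            "status_token": status_token(status, keys["status"]),
            "revenue_ct": (revenue + prf(keys["revenue"], row_id)) % MODULUS}

def group_and_sum(rows):
    """Data-store side: Sum(Revenue) GroupBy Status over ciphertexts and tokens only."""
    groups = defaultdict(lambda: {"ct": 0, "ids": []})
    for row in rows:
        group = groups[row["status_token"]]
        group["ct"] = (group["ct"] + row["revenue_ct"]) % MODULUS
        group["ids"].append(row["id"])
    return dict(groups)

def decrypt_groups(groups, keys, known_statuses):
    """API-server side: one decryption per group, then map tokens back to names."""
    names = {status_token(s, keys["status"]): s for s in known_statuses}
    result = {}
    for token, group in groups.items():
        masks = sum(prf(keys["revenue"], i) for i in group["ids"])
        result[names[token]] = (group["ct"] - masks) % MODULUS
    return result

keys = {"status": b"demo-status-key", "revenue": b"demo-revenue-key"}
data = [("r1", "Won", 1200), ("r2", "Lost", 500), ("r3", "Won", 800)]
rows = [encrypt_row(i, s, v, keys) for i, s, v in data]
report = decrypt_groups(group_and_sum(rows), keys, ["Won", "Lost", "Open"])
print(report)  # prints {'Won': 2000, 'Lost': 500}
```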
Impact
Privacy
Our new system enhances RAR's privacy and security by removing row-level decryption and ensuring sensitive information remains protected during processing. This improves member and customer trust while adhering to the data minimization requirements of privacy regulations.
Performance
Our new system reduces network congestion by over 99%, lowers CPU usage, and minimally affects latency and service times. It also unlocks the use of Pinot’s in-built aggregation, allowing for more performant horizontal scaling by reducing memory footprint and leveraging Pinot's column-based storage optimizations. This allows us to enhance reporting for a wider range of campaigns and offer better insights for our customers.
| Metric | Old System | New System |
| --- | --- | --- |
| Service API E2E | 418 ms | 422 ms |
| Lead - Pinot Service Latency | 14 ms | 13 ms |
| Lead - Decryption Time | 15 ms | 12 ms |
| Lead - Response Size | 695 KB | 5 KB (~99% reduction) |
| Opportunity - Pinot Service Latency | 22 ms | 27 ms |
| Opportunity - Decryption Time | 41 ms | 37 ms |
| Opportunity - Response Size | 2000 KB | 5 KB (~99% reduction) |
| CPU | Usage spikes | Spikes mitigated |
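The response-size reductions quoted in the table can be checked with a few lines of arithmetic. The short Python snippet below uses only the sizes reported above; the helper name is ours.

```python
def percent_reduction(old_kb: float, new_kb: float) -> float:
    """Relative reduction in response size, as a percentage."""
    return (1 - new_kb / old_kb) * 100

lead = percent_reduction(695, 5)
opportunity = percent_reduction(2000, 5)
print(f"Lead: {lead:.2f}%")                # prints "Lead: 99.28%"
print(f"Opportunity: {opportunity:.2f}%")  # prints "Opportunity: 99.75%"
```

Both tables therefore see a reduction of slightly more than 99% in response size.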
Acknowledgements
This work was the result of a fruitful collaboration between several teams including AI Privacy, Ads Privacy Engineering, Ads Attribution, Ads Reporting, Audience Insights, and Legal. We would like to thank Miao Cheng, Ethan Pan, Yuan Wu, Jiss Jose, Xiaolan Gu, Brandon Poole, Yajun Wang, Vivek Iyer Vaidyanathan, Siddharth Teotia, Catalin Cosovanu, Kamola Kobildjanova, and several others for their contributions and for providing valuable feedback. We would like to thank Joe Xue, Rahul Tandra, and John McCarthy for their leadership support.