How Vectors, RAG and Llama 3 Are Changing First-Party Data

27 Jun 2024

First-Party Data is Back… With a Little Help From Its Friends: How Vectors, RAG and Llama 3 Are Driving a Sea Change

Over the last five years, the dominant narrative surrounding data infrastructure has emphasized the importance of companies owning and leveraging their data as they acquire as much information about their users and customers as possible. Due to mounting privacy regulations, they need to gather the data themselves instead of relying on third-party data operators, such as ad networks or platforms like Google, Meta, and Amazon. Companies have fallen in line with this narrative and made the shift.

But, in the battle for the best data, is first-party really better? Not by itself, but it could be with a little help from vectors, frameworks like RAG, and open-source foundation models like Llama 3.

The Push for First-Party Data

The argument for first-party data generally goes like this: companies need to become better stewards of data acquisition and management amid growing calls for data privacy. Consumers increasingly want to know who is hanging onto their personal information, how they got it, why they have it, and what is being done with it, and they generally don’t like the answers to those questions. Access and Delete requests are rapidly increasing, and the data privacy landscape is changing quickly. Companies are finding it hard enough to figure out their own privacy management processes; they don’t want to have to worry about those of third parties as well.

Users are increasingly concerned with what happens to the data they give companies, and data privacy laws are becoming stricter.

But the shift to first-party is not all about privacy. There is also the notion that third-party data will lose value as we move towards the proverbial cookieless future. Companies can’t get the same granular details they once could, so why should they devote budget to a service that delivers less than it used to?

And then there is the perpetual concern that the big platforms and ad networks will make unexpected changes. For example, they could, with little or no notice, change their algorithms, restrict access to certain types of data, or alter advertising policies in a manner that harms a business’s performance. Being dependent on another company’s practices leaves one vulnerable. Companies feel stuck because they’ve already invested so much time, money, and resources into their data strategy. In this light, the push to take back control of data seems essential. But is it practical?

The First-Party Data Problems No One is Talking About

Early results for companies taking a gamble on first-party data have not met expectations. We’re seeing case after case of consumer companies that made the shift and are now suffering. Farfetch, Allbirds, and SmileDirectClub are just a few examples. Would better, more strategic use of first-party data turn the tide on rising customer acquisition costs?

Still, the current reliance on first-party data, and on the practices used to extract that data, is a commonality across several companies that are experiencing tougher times today. It’s noticeable enough to make analysts, VCs, and marketers themselves ask if they were wrong to prioritize first-party data.

The downsides of first-party data, as it is currently acquired and leveraged, are often attributed to companies underestimating the disappearance of device IDs, changing IP addresses, ad blockers, and consumer adoption of fake emails. While true, much more significant issues are at play.

First, there is a massive talent gap. Big Tech and the platform companies snag the best. They have the most to offer, making it difficult for up-and-coming consumer companies to compete for the data scientists and ML talent needed to make sense of the information being collected and analyzed. Without truly exceptional people, companies have struggled.

There is also a real tooling problem. The offerings available to companies are nowhere near equivalent to the tools Big Tech companies boast internally (which can be a factor in the talent gap). Tooling makes a difference, and most companies simply can’t compete right now.

Finally, there is a volume challenge. Big Tech and the ad networks have mountains of data, as these companies pool and anonymize hundreds of billions of data points to make their models run effectively. By contrast, if a company has only its own data to work with, there is just not enough for ML to work as promised.

While these problems seem serious, is it time to give up on the potential of, and the need for, first-party data? No way!

The Power of Vectors

The biggest issue hindering first-party data is how companies are trying to access it. Up to this point, businesses have taken an Old World approach. Models must be built from scratch to extract the value from data that businesses require. This takes time, money, and most of all talent; it depends on how good your ML engineers and data scientists are. However, as discussed above, there is not enough available talent to make this approach work better than leveraging third-party data. The lack of talent creates the bottleneck.

Representing information as vectors allows for deeper understanding and semantic relationship analysis.

However, this does not mean we need to give up on first-party data. We just need to change how we approach it. In the New World that is possible today, vectors and vector embeddings are the key. Vectors are generic mathematical objects that can represent the features or attributes of data points. Embedding models analyze patterns in data to generate these vectors as information-packed, meaningful representations that capture semantic relationships. Vector embeddings are a format that can encode everything you know about a user or a customer and make that information available to an analytics system, to personalize the user experience, or even to catch fraud. There are so many possibilities. Vectors are poised to drive a sea change because they can power analytics in a fundamentally different way.
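To make "capturing semantic relationships" a little more concrete, here is a minimal sketch in Python. The user names and the four-dimensional vectors are invented for illustration (real embeddings come from an embedding model and have hundreds of dimensions); the point is only that once customers are vectors, "who looks like whom" becomes a simple similarity calculation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: hand-written stand-ins for what an embedding
# model would produce from each user's first-party data.
users = {
    "runner_a": [0.9, 0.1, 0.0, 0.2],
    "runner_b": [0.8, 0.2, 0.1, 0.3],
    "gamer_c": [0.1, 0.9, 0.7, 0.0],
}


def most_similar(name, vectors):
    """Return the other user whose vector is closest to the named user's."""
    target = vectors[name]
    scores = {
        other: cosine_similarity(target, vec)
        for other, vec in vectors.items()
        if other != name
    }
    return max(scores, key=scores.get)


print(most_similar("runner_a", users))
```

The same comparison could feed a lookalike audience, a recommendation, or an anomaly check for fraud; only the vectors and the threshold would change.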

Retrieval-augmented generation (RAG) is generating a lot of buzz right now for all it can enable, but vector embeddings are what make RAG useful. They are a central component of the framework, helping with context, responses, the integration of retrieval, and the fine-tuning of models. Generating high-quality vectors and querying them correctly is critical to making any RAG system actually work. There are other frameworks, but RAG is particularly well-suited for a first-party data revolution.
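The retrieval half of RAG can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: `toy_embed` is a word-counting stand-in for a real embedding model, and `VOCAB` and `DOCS` are made-up data. The sketch shows only the shape of the idea: embed the query, rank stored documents by vector similarity, and hand the best match to the model as context.

```python
import math

# Hypothetical vocabulary for the toy embedding below.
VOCAB = ["refund", "shipping", "size", "return", "delivery", "fit"]

# Hypothetical first-party knowledge base.
DOCS = [
    "refund requests are processed within five days of a return",
    "shipping and delivery take three to five business days",
    "our size guide explains fit for every shoe",
]


def toy_embed(text):
    """Stand-in for an embedding model: counts occurrences of VOCAB words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Return the k documents whose vectors are closest to the query's."""
    query_vec = toy_embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, toy_embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("how long does delivery take", DOCS))
```

If the vectors are poor or the query is embedded differently from the documents, the wrong context is retrieved and the model answers from the wrong material, which is why vector quality matters so much here.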

Sounds great, so let’s all figure out how to use vectors and vector embeddings, right? Not so fast: this is not the total answer. Issues like limited data sets and tooling remain. All is not yet wrapped up in a pretty bow, but I believe it soon will be, because open-source, pre-trained foundation models like Meta’s Llama 2 and its more robust successor, Llama 3, can level the playing field. The issue of not having enough data volume compared to Big Tech gets mitigated. An open-source model pre-trained on large and diverse datasets has a certain level of knowledge and understanding built in. Companies simply need to fine-tune Llama 2 (or Llama 3) on their specific domain or task with their own data. This eases the bottleneck because, in many cases, you no longer have to train a model from scratch.

This may sound like an oversimplification: Llama helps companies deal with text, but most of the data companies run on is not text. The structured data companies run on needs to be integrated into this process. For example, user behavioural events, which typically form a large percentage of first-party data, are not suitable to be processed by any LLM. This is changing, so companies should be ready as new multimodal solutions emerge. Similarly, tooling is still lacking, but there is a lot of attention on the space, so great strides are being taken. It’s coming along!
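To illustrate the gap, here is one simple way (an assumption for illustration, not an established recipe) to turn raw behavioural events into a fixed-length vector that can sit alongside text embeddings. The event types and records are hypothetical; a production system would likely use a learned model rather than plain counts.

```python
from collections import Counter

# Hypothetical event taxonomy; each position in the vector is one event type.
EVENT_TYPES = ["page_view", "add_to_cart", "purchase", "search"]


def events_to_vector(events):
    """Encode a user's events as the share of activity per event type."""
    counts = Counter(event["type"] for event in events)
    total = sum(counts[event_type] for event_type in EVENT_TYPES)
    if total == 0:
        # No known events: return an all-zero vector of the same length.
        return [0.0] * len(EVENT_TYPES)
    return [counts[event_type] / total for event_type in EVENT_TYPES]


# Hypothetical event log for a single user.
events = [
    {"user": "u1", "type": "page_view"},
    {"user": "u1", "type": "page_view"},
    {"user": "u1", "type": "add_to_cart"},
    {"user": "u1", "type": "purchase"},
]

print(events_to_vector(events))  # [0.5, 0.25, 0.25, 0.0]
```

A count-based vector like this throws away ordering and timing, which is exactly the kind of information better tooling and multimodal models are expected to preserve.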

With the biggest issues fundamentally addressed, first-party data hype is back, baby! Companies don’t have to worry about third parties violating privacy, nor do they have to depend on Big Tech to help them know their customers. Expect to see first-party data explode this year as companies finally take full advantage, especially with Llama 3 at the ready. For all of its promise, perhaps the greatest potential for Llama 3 will be to actually solve the first-party data problem once and for all.