# Data Ingestion & Profile Construction

### About this export

| Field | Value |
| --- | --- |
| **content_type** | course |
| **platform** | contentstack-academy |
| **source_url** | https://www.contentstack.com/academy/courses/data-insights-data-ingestion-profile-construction |
| **language** | en |
| **product_area** | Contentstack Academy |
| **learning_path** | data-and-insights-practitioner-certification |
| **course_id** | data-insights-data-ingestion-profile-construction |
| **slug** | data-insights-data-ingestion-profile-construction |
| **version** | 2026-03-01 |
| **last_updated** | 2026-04-28 |
| **status** | published |
| **keywords** | ["Contentstack Academy"] |
| **summary_one_line** | This course dives deep into the technical architecture of customer profile creation and data unification. You'll master the data processing pipeline and learn advanced techniques for enriching customer understanding thro… |
| **total_duration_minutes** | 75 |
| **lessons_count** | 16 |
| **video_lessons_count** | 15 |
| **text_lessons_count** | 1 |
| **linked_learning_path** | data-and-insights-practitioner-certification |
| **linked_assessment_ref** | LMS_UNCONFIGURED_COURSE_ASSESSMENT |
| **markdown_file_url** | /academy/md/courses/data-insights-data-ingestion-profile-construction.md |
| **generated_at** | 2026-04-28T06:55:44.122Z |
| **intended_audience** | [] |
| **prerequisites** | [] |
| **related_courses** | [] |

> **Academy MD v3** — companion `.md` for Ask AI. Quizzes and graded assessments are **LMS-only**; this file never contains answer keys.

## Course Overview

| Metadata | Value |
| --- | --- |
| Catalog duration | 1h 15m 8s |
| Released (if known) | 2026-03-01 |
| Product area | Contentstack Academy |

### Description

_This course dives deep into the technical architecture of customer profile creation and data unification. You'll master the data processing pipeline and learn advanced techniques for enriching customer understanding through multiple data sources and intelligent modeling._

### Overview

### What You'll Learn

This comprehensive session teaches you how to build robust, unified customer profiles using advanced identity resolution, data integration, and enrichment techniques. You'll gain practical experience with schema design, data mapping, and leveraging AI-powered features for deeper customer insights.

### What We'll Cover

We'll explore how identity resolution automatically merges data fragments using shared identifiers like email addresses, demonstrating real-time profile unification in action. You'll master the data processing pipeline architecture, learning to create custom fields with proper data types and merge operators, configure mappings to transform raw data into profile attributes, and publish schema changes with version control. We'll cover advanced data integration through Cloud Connect for warehouse data, implement lookalike modeling to score unknown visitors, and leverage automated content classification and interest scoring. Finally, you'll learn to set up triggers that enable real-time personalization and downstream activations.

### Learning objectives

1. Follow each lesson in order.
2. Practice in a training stack using placeholders **YOUR_STACK_API_KEY** and **YOUR_DELIVERY_TOKEN** in local `.env` files only (see the sketch after this list).
3. Validate API responses against the official documentation.
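
For step 2, the sketch below shows one way the local setup could look. It assumes Node 18+ with the `dotenv` package and the standard Content Delivery API host; confirm the exact endpoint, region host, and header names against the official documentation before relying on it.

```typescript
// .env (local only, never committed):
//   YOUR_STACK_API_KEY=...
//   YOUR_DELIVERY_TOKEN=...
import "dotenv/config";

async function smokeTest(): Promise<void> {
  // Endpoint and header names assume the standard Delivery API;
  // adjust the host for your region per the documentation.
  const res = await fetch("https://cdn.contentstack.io/v3/content_types", {
    headers: {
      api_key: process.env.YOUR_STACK_API_KEY ?? "",
      access_token: process.env.YOUR_DELIVERY_TOKEN ?? "",
    },
  });
  const body = await res.json();

  // Step 3: validate the response shape against the documented schema,
  // e.g. a `content_types` array on a 200 response.
  console.log(res.status, Array.isArray(body.content_types));
}

smokeTest().catch(console.error);
```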

### Topics covered

Contentstack Academy

## Course structure

```text
data-insights-data-ingestion-profile-construction/
├── 01-data-insights-course-3--identity-resolution-recap · video · 248s
├── 02-data-insights-course-3--the-data-pipeline · video · 334s
├── 03-data-insights-course-3--leveraging-common-schema · video · 406s
├── 04-data-insights-course-3--customizing-schema · video · 540s
├── 05-data-insights-course-3--the-importance-of-identity-fields · video · 150s
├── 06-data-insights-course-3--publishing-schema-version-control · video · 98s
├── 07-data-insights-course-3--working-with-apis-csvs · video · 301s
├── 08-data-insights-course-3--working-with-integrations · video · 450s
├── 09-data-insights-course-3--identifier-ranks · video · 253s
├── 10-data-insights-course-3--working-with-warehouse-data · video · 579s
├── 11-data-insights-course-3--building-lookalike-models · video · 224s
├── 12-data-insights-course-3--interest-scores-classification · video · 340s
├── 13-data-insights-course-3--example-exploring-classified-content · video · 73s
├── 14-data-insights-course-3--content-recommendations · video · 289s
├── 15-data-insights-course-3--what-are-triggers · video · 223s
└── 16-data-insights-course-3--quiz · quiz (LMS only) · 3 min
```

## Lessons

### Lesson 01 — Identity Resolution (recap)

<!-- ai_metadata: {"lesson_id":"01","type":"video","duration_seconds":248,"video_url":"https://cdn.jwplayer.com/previews/qzUxiNrH","thumbnail_url":"https://cdn.jwplayer.com/v2/media/qzUxiNrH/poster.jpg?width=720","topics":["Identity","Resolution","recap"]} -->

#### Video details

#### At a glance

- **Title:** 9-data-insights-identity-resolution-recap
- **Duration:** 4m 8s
- **Media link:** https://cdn.jwplayer.com/previews/qzUxiNrH
- **Publish date (unix):** 1752870442

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113697 bps
- video/mp4 · 180p · 148960 bps
- video/mp4 · 270p · 173809 bps
- video/mp4 · 360p · 194886 bps
- video/mp4 · 406p · 209537 bps
- video/mp4 · 540p · 260745 bps
- video/mp4 · 720p · 342337 bps
- video/mp4 · 1080p · 577512 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/qzUxiNrH-120.vtt`

#### Video transcript

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish -->
[00:00] Transcript not attached in source entry.
```

#### Key takeaways

- Connect **Identity Resolution (recap)** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 02 — The Data Pipeline

<!-- ai_metadata: {"lesson_id":"02","type":"video","duration_seconds":334,"video_url":"https://cdn.jwplayer.com/previews/iDsatXS7","thumbnail_url":"https://cdn.jwplayer.com/v2/media/iDsatXS7/poster.jpg?width=720","topics":["The","Data","Pipeline"]} -->

#### Video details

#### At a glance

- **Title:** 10-data-insights-the-data-pipeline
- **Duration:** 5m 34s
- **Media link:** https://cdn.jwplayer.com/previews/iDsatXS7
- **Publish date (unix):** 1752870616

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113538 bps
- video/mp4 · 180p · 139245 bps
- video/mp4 · 270p · 155091 bps
- video/mp4 · 360p · 169179 bps
- video/mp4 · 406p · 178649 bps
- video/mp4 · 540p · 213373 bps
- video/mp4 · 720p · 267723 bps
- video/mp4 · 1080p · 425608 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/iDsatXS7-120.vtt`

#### Transcript

So, that all happens in real time. Why I wanted to kind of refresh and show that again is it speaks to then, okay, how does it have the logic to understand for all of these different sort of data sources and streams, how do they ultimately map to the profile, if there's a UID or a first name in each of those streams, which one wins, right, so like is it the first in, is it the last in, what's that logic?

So really the process that the data, so every single event, takes as it goes through Lytics is it starts in a data stream, so under data pipeline you have access to all of your streams. The default stream is where the web information will go by default. You can customize that in the JavaScript tag if you want to, if there's some obscure use case or maybe you're collecting stuff from different websites and you want to map it a little bit differently, but all of that data automatically goes to this sort of raw event stream.

From there, it uses mappings, as we call them, to say, okay, I want to take the data that comes in from this stream. So we'll use email as an example. Inside of building profiles, under schema, you have access to your fields and mappings. So if I, for instance, look for email under mappings, you'll see that there's a number of different streams and ways that this sort of system tells it how to handle the event. If I, for instance, go to the default stream, and actually maybe I'll just go into the field, it's a little bit easier to see. So if I click on email, you'll see all of the mappings, and you can see on the default stream, there's a few different ways that we map it, but in most cases it's just taking, okay, if I see email, just like this, the raw key, all lowercase, I'm going to do some normalization, verify that it's an actual valid email, and then I'm going to push that up to the email field. So the mapping is sort of that translation layer between raw data that comes into a stream and how it can ultimately send it to a field.

The field itself, if I go back into email, has all of the controls on how that merge happens. So for email in particular, and we'll run through all of these and go through a few different examples of building them from scratch, but for email address, you define a data type. So in the case of email, it's just a string. There's essentially every data type that you could ever possibly want that we can support, but in this case, it's just a string. In this case, it's flagged as an identity key. So again, that's what's telling the system that if I see email in the default stream, and I also see email in this MailChimp stream, I can merge those fragments together to build the profile. Likewise, if I see an event with an email in just the default stream, and then another event in that stream with email, just like we did in our demo, it can merge those fragments together to build that unified profile. So this flag for an identifier key is really, really important. It's also one of the ways that customers can get themselves in trouble by being overzealous on what actually is an identity key, which causes you to overmerge profiles into one big super profile.

The merge operator is how you actually handle the data coming in. We'll come back to this one on a field that hasn't been predefined so that I can actually show you the different options, but you can actually say that, okay, for this particular field, maybe I only want the first value that it's ever seen. Maybe I want the latest value. Maybe it's an array, and I want to merge them together. All sorts of different merge operations are controlled at the field level, which is what tells you how to make that information come together. Otherwise, you would have first name over here, this different from this first name over here, and you'd have this big, crazy, nebulous, unusable profile. The merge operators are really important in how that unified profile actually gets resolved and surfaced to the end user.

There's a variety of things that we can go through when we actually build a field on the format type to, you know, do we want to Base64 encode it? Do you want to set any type of length characteristics? We'll come back to cap and keep, which helps us sort of limit the size of arrays over time and how long data hangs around.

So all of that said, essentially, streams receive data. Mappings allow that data to be translated and sent to a field, and then a field is ultimately kind of like the rule master on how that data gets represented on the profile. These fields are ultimately what you see over here on an actual profile when you see it. So the profile fields are driven from that schema, and that's kind of the resolved piece there. Anything at a high level that I missed, Eric?

Just one minor clarification. We use the term cap and keep, which is our internal term for capacity and number of days to keep. It's a restriction that we put on fields; we'll often shorten it to cap and keep, so it might slip out when we're talking, but it means the capacity, which is the number of things that can be stored in a map, and keep, the number of days to keep each element.
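
To make the stream, mapping, and field relationship from this walkthrough concrete, here is a minimal, hypothetical sketch in TypeScript. The type names and object shapes are invented for illustration only; they are not the product's real configuration format, and email's merge operator is shown purely as an example since the lesson does not state it.

```typescript
// Hypothetical sketch (not the product's real schema format): how a stream,
// a mapping, and a field relate, per the walkthrough above.
type MergeOperator = "first" | "latest" | "sum" | "merge"; // illustrative subset

interface FieldDefinition {
  id: string;                 // slug the profile attribute lives under
  dataType: "string" | "int" | "map" | "array";
  identityKey: boolean;       // true only for values safe to merge profiles on
  mergeOperator: MergeOperator;
  capAndKeep?: { capacity: number; daysToKeep: number }; // for arrays/maps
}

interface Mapping {
  stream: string;             // e.g. "default" for web events
  rawKey: string;             // key as it appears in the incoming raw event
  targetField: string;        // FieldDefinition.id to write into
  expression?: string;        // optional normalization step
}

// The email example from the lesson, expressed in this sketch:
const emailField: FieldDefinition = {
  id: "email",
  dataType: "string",
  identityKey: true,          // lets fragments sharing an email merge into one profile
  mergeOperator: "latest",    // illustrative choice only
};

const emailMapping: Mapping = {
  stream: "default",
  rawKey: "email",            // lowercase raw key seen in the default stream
  targetField: "email",
  expression: "normalize_and_validate_email", // hypothetical expression name
};
```

The design point the lesson stresses: only flag a field as an identity key when a shared value genuinely means two fragments belong to the same person; otherwise unrelated profiles over-merge into one super profile.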

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:18.320
So, that all happens in real time.

2
00:00:18.320 --> 00:00:23.000
Why I wanted to kind of refresh and show that again is it speaks to then, okay, how does

3
00:00:23.000 --> 00:00:28.560
it have the logic to understand for all of these different sort of data sources and streams,

4
00:00:28.560 --> 00:00:33.280
how do they ultimately map to the profile, if there's a UID or a first name in each of

5
00:00:33.280 --> 00:00:37.440
those streams, which one wins, right, so like is it the first in, is it the last in, what's

6
00:00:37.440 --> 00:00:39.480
that logic?

7
00:00:39.480 --> 00:00:46.280
So really the process that the data, so every single event, takes as it goes through Linux

8
00:00:46.280 --> 00:00:52.800
is it starts in a data stream, so under data pipeline you have access to all of your streams.

9
00:00:52.800 --> 00:00:56.840
The default stream is where the web information will go by default.

10
00:00:56.840 --> 00:01:00.600
You can customize that in the JavaScript tag if you want to, if there's some obscure

11
00:01:00.600 --> 00:01:03.480
use case or maybe you're collecting stuff from different websites and you want to map

12
00:01:03.480 --> 00:01:08.520
it a little bit differently, but all of that data automatically goes to this sort of raw

13
00:01:08.520 --> 00:01:10.800
event stream.

14
00:01:10.800 --> 00:01:16.200
From there, it uses mappings, is what we call them, to say, okay, I want to take the data

15
00:01:16.200 --> 00:01:21.840
that comes in from this stream, so we'll use email as an example inside of building profiles

16
00:01:21.840 --> 00:01:24.960
under schema, you have access to your fields and mappings.

17
00:01:25.080 --> 00:01:30.240
So if I, for instance, look for email under mappings, you'll see that there's a number

18
00:01:30.240 --> 00:01:36.400
of different streams and ways that this sort of system tells it how to handle the event.

19
00:01:36.400 --> 00:01:40.720
If I, for instance, go to the default stream, and actually maybe I'll just go into the field,

20
00:01:40.720 --> 00:01:43.520
it's a little bit easier to see.

21
00:01:43.520 --> 00:01:50.240
So if I click on email, you'll see all of the mappings, and you can see on the default

22
00:01:50.240 --> 00:01:54.120
stream, there's a few different ways that we map it, but in most cases it's just taking,

23
00:01:54.280 --> 00:01:58.840
okay, if I see email, just like this, the raw key, all lowercase, I'm going to do some

24
00:01:58.840 --> 00:02:02.760
normalization, verify that it's an actual valid email, and then I'm going to push that

25
00:02:02.760 --> 00:02:04.960
up to the email field.

26
00:02:04.960 --> 00:02:09.240
So the mapping is sort of that translation layer between raw data that comes into a stream

27
00:02:09.240 --> 00:02:13.280
and how it can ultimately send it to a field.

28
00:02:13.280 --> 00:02:20.360
The field itself on it, if I go back into email, has all of the controls on how that

29
00:02:20.360 --> 00:02:21.880
merge happens.

30
00:02:21.880 --> 00:02:24.840
So for email in particular, and we'll run through all of these and go through a few

31
00:02:24.840 --> 00:02:30.880
different examples of building them from scratch, but for email address, you define a data type.

32
00:02:30.880 --> 00:02:32.640
So in the case of email, it's just a string.

33
00:02:32.640 --> 00:02:37.440
There's essentially every data type that you could ever possibly want that we can support,

34
00:02:37.440 --> 00:02:40.840
but in this case, it's just a string.

35
00:02:40.840 --> 00:02:43.600
In this case, it's flagged as an identity key.

36
00:02:43.600 --> 00:02:47.560
So again, that's what's telling the system that if I see email in the default stream,

37
00:02:47.560 --> 00:02:53.960
and I also see email in this MailChimp stream, I can merge those fragments together to build

38
00:02:53.960 --> 00:02:54.960
the profile.

39
00:02:54.960 --> 00:02:58.520
Likewise, if I see an event with an email in just the default stream, and then another

40
00:02:58.520 --> 00:03:03.000
event in that stream with email, just like we did in our demo, it can merge those fragments

41
00:03:03.000 --> 00:03:05.580
together to build that unified profile.

42
00:03:05.580 --> 00:03:09.000
So this flag for an identifier key is really, really important.

43
00:03:09.000 --> 00:03:14.840
It's also one of the ways that customers can get themselves in trouble by being overzealous

44
00:03:14.840 --> 00:03:19.200
on what actually is an identity key, which causes you to overmerge profiles into one

45
00:03:19.200 --> 00:03:22.120
big super profile.

46
00:03:22.120 --> 00:03:26.680
The merge operator is how you actually handle the data coming in.

47
00:03:26.680 --> 00:03:29.640
We'll come back to this one on a field that hasn't been predefined so that I can actually

48
00:03:29.640 --> 00:03:33.960
show you the different options, but you can actually say that, okay, for this particular

49
00:03:33.960 --> 00:03:38.120
field, maybe I only want the first value that it's ever seen.

50
00:03:38.120 --> 00:03:39.480
Maybe I want the latest value.

51
00:03:39.480 --> 00:03:42.440
Maybe it's an array, and I want to merge them together.

52
00:03:42.440 --> 00:03:46.560
All sorts of different merge operations are controlled at the field level, which is what

53
00:03:46.560 --> 00:03:49.880
tells you how to make that information come together.

54
00:03:49.880 --> 00:03:53.320
Otherwise, you would have first name over here, this different from this first name

55
00:03:53.320 --> 00:03:57.960
over here, and you'd have this big, crazy, nebulous, unusable profile.

56
00:03:57.960 --> 00:04:03.360
The merge operators are really important in how that unified profile actually gets resolved

57
00:04:03.360 --> 00:04:07.000
and surfaced to the end user.

58
00:04:07.000 --> 00:04:09.760
There's a variety of things that we can go through when we actually build a field on

59
00:04:09.760 --> 00:04:13.080
the format type to, you know, do we want to base 64 encode it?

60
00:04:13.080 --> 00:04:15.080
Do you want to set any type?

61
00:04:15.080 --> 00:04:17.000
Length sort of characteristics.

62
00:04:17.000 --> 00:04:23.080
We'll come back to kind of cap and keep helps us sort of limit the size of arrays over time

63
00:04:23.080 --> 00:04:24.360
and how long data hangs around.

64
00:04:24.360 --> 00:04:29.760
So all of that said, essentially, streams receive data.

65
00:04:29.760 --> 00:04:33.920
Mappings allow that data to be translated and sent to a field, and then a field is ultimately

66
00:04:33.920 --> 00:04:39.360
kind of like the rule master on how that data gets represented on the profile.

67
00:04:39.800 --> 00:04:45.040
These fields are ultimately what you see over here on an actual profile when you see it.

68
00:04:45.040 --> 00:04:51.400
So the profile fields are driven from that schema, and that's kind of the resolved piece

69
00:04:51.400 --> 00:04:53.520
there.

70
00:04:53.520 --> 00:04:55.560
Anything at a high level that I missed, Eric?

71
00:04:55.560 --> 00:04:57.640
Just one minor clarification.

72
00:04:57.640 --> 00:05:07.920
Use the term cap and keep, which is our internal term for capacity and number of days to keep.

73
00:05:07.960 --> 00:05:13.960
It's a restriction that we put on fields to, we'll shorten it to cap and keep often, so

74
00:05:13.960 --> 00:05:20.200
it might slip out when we're talking, but it means the capacity, which is the number

75
00:05:20.200 --> 00:05:24.120
of things that can be stored in a map and keep the number of days to keep each element.

```

#### Key takeaways

- Connect **The Data Pipeline** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 03 — Leveraging Common Schema

<!-- ai_metadata: {"lesson_id":"03","type":"video","duration_seconds":406,"video_url":"https://cdn.jwplayer.com/previews/IpTB9DvQ","thumbnail_url":"https://cdn.jwplayer.com/v2/media/IpTB9DvQ/poster.jpg?width=720","topics":["Leveraging","Common","Schema"]} -->

#### Video details

#### At a glance

- **Title:** 11-data-insights-leveraging-common-schema
- **Duration:** 6m 46s
- **Media link:** https://cdn.jwplayer.com/previews/IpTB9DvQ
- **Publish date (unix):** 1752871658

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113582 bps
- video/mp4 · 180p · 148044 bps
- video/mp4 · 270p · 171185 bps
- video/mp4 · 360p · 186101 bps
- video/mp4 · 406p · 201597 bps
- video/mp4 · 540p · 249425 bps
- video/mp4 · 720p · 324626 bps
- video/mp4 · 1080p · 551509 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/IpTB9DvQ-120.vtt`

#### Transcript

All right, so let's pull up one thing that I like to show folks as you start to play with the system: Lytics comes with a set of what we call the common schema. It's a set of attributes that you can use out of the box that don't require any mappings, any maintenance, any sort of configuration. Definitely always recommend that folks start with those and just sending data into the fields that exist. So I like to call out a couple of things that are really easy to use as you start to play with it. And then once you kind of get the hang of stuff, you can go look at your schema and see what all is available.

So if I go back over here. Let's pull, make sure we're on the right profile. Okay, so this is our profile on the left hand side. We'll make this a little bit bigger because it is probably teeny tiny. So we went through sort of just like first name, last name, email. As I'm playing with accounts, two really, really useful fields that are kind of special in our predefined schema or common schema is one, this attribute. So you can send, and we should do this within reason. You shouldn't send like 10,000 different attributes to be very clear. But as you're starting to play with it, you're just like, I want to send data into the system so that I can build an audience and see how it works. There's this sort of special attribute field where if you prefix a value with HTR underscore and then send a string, it'll actually add it to a map on the profile.

So for instance, if I go in here and say, I'm going to just send the attribute underscore example as testing, I'm going to copy and paste that, send it as that user. And then over here, if I refresh and go to my profiles. So you'll see this field called custom user attributes. And in this case, we pass example as value testing, you can then actually build segments and do some sort of playing with that particular piece of data with no mapping, no configuration, nothing necessary. It works very similarly to some of our event and actions. So for instance, if I pass event open, it's going to map that to an email open and so on and so forth.

But as an example, if we then go into the mapping for this attribute, just to see how it's defined, how we're able to essentially be smart enough to just pull if it matches that prefix, it gets mapped, we'll go into our schema, we'll go into fields, and we'll search for attribute. So if I pull up this field, which is not an identity field, so it's not going to merge profile data together, it's just going to surface information that you can segment and ultimately leverage, it's a map of string to string, so it's going to take it and create a map of the information that you pass.

And then we talked about merge operators. In this case, it's doing the merge operator of merge, which just means it's going to continue to push that object together, it's not going to fully overwrite it. So if I push a new key of like, example two, it's just going to add it to that object and combine the two together. So it's not going to overwrite it. If you open this up, there's a number of merge operators. So like I was mentioning, in the case of say, like, I want to know the first item that a user ever purchased, maybe you want to keep the oldest value of that. If I want to know the latest product that they purchased, you can keep the latest. Same thing with latest map and oldest map. So the different field types have different merge operators.

I don't know if it's worth walking through, like, all of the different options, they're all in our docs. But I would say as those things come up, and you'll see it kind of come up in some of the examples that we'll build, the merge operator is really important, in that it tells the system how to sort of combine that data together. Merge, merge is the one that I always like to touch on, because it's probably poorly named in the long run. But it's what sort of like keeps the old data and merges it with the new data that comes in so that you're not wholesale overwriting the object.

So you've been doing, or we've been using, the JavaScript tag to collect data, which just uses our collection API. But there are a number of different ways that you can ultimately pull data into the system. So one of which is just our API. So for instance, I have just a simple, very simple JSON file over here, that's just an array of two objects. If I wanted to just push that information to a particular stream, and again, I'll share this document out so that you have some of the API calls, but I can just essentially fire a POST request to our collection API. The last parameter is the stream that it's going to pass to. So it's important to make sure that whatever stream you're passing data to has the mappings in place necessary to ultimately surface that on the profile.

One major gotcha that a lot of folks run into is if you collect the data before it's mapped, it's not going to retroactively surface that data. We do have the concept of a replay or a rebuild. I always like to describe it as the last-ditch effort, in that if a customer were to go and goof up their data or mess up their mappings and need to essentially replay all of their events, we have the ability to do that from scratch to fix some of the issues that they run into. But it's a thing that, one, can be very expensive, and two, we don't like to do it very often. So it's important to think about before you actually collect data, where are you sending it, and making sure that the mappings are in place to translate those raw events to the fields. Otherwise, it'll just sit in your stream and never get used.

Could you just walk through the stream functionality once, actually? And like, streams are just a tag that gets tagged onto the event that comes in. It's commonly just used to give you an idea of what source it's from. So like if the data is coming from like Salesforce, the data from Salesforce will be in the Salesforce stream. If it's coming from the web, it goes to the default stream, which is typically web data. So it's really just a way to say, oh, this data from this source, I want to treat with these mappings, which are how the, it's kind of like the ETL process of how we convert that stream into fields. Yep.
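
As a hedged illustration of the "POST a JSON array to a stream via the collection API" step described above, the sketch below uses a placeholder host and token. The real collection endpoint, path, and auth scheme should come from the API document the instructor shares; only the shape of the call is shown, and the target stream must already have its mappings in place before you send data.

```typescript
// Hedged sketch: push a JSON array of events to a named stream.
// COLLECT_URL and YOUR_DATA_API_TOKEN are placeholders; take the real
// endpoint and auth scheme from the shared API document.
import { readFile } from "node:fs/promises";

async function pushToStream(streamName: string, filePath: string): Promise<void> {
  const events = JSON.parse(await readFile(filePath, "utf8")); // array of objects

  // The lesson notes the stream is the last parameter of the call, and that
  // unmapped data will just sit in the stream and never reach the profile.
  const COLLECT_URL = `https://YOUR_COLLECTION_API_HOST/collect/json/${streamName}`;

  const res = await fetch(COLLECT_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.YOUR_DATA_API_TOKEN ?? ""}`, // placeholder token
    },
    body: JSON.stringify(events),
  });
  console.log("collection API status:", res.status);
}

// Example: send the two-object JSON file from the demo to a custom stream.
pushToStream("demo_custom_csv", "./demo-events.json").catch(console.error);
```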

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:21.620
All right, so let's pull up one thing that I like to show folks is just some sort of

2
00:00:21.620 --> 00:00:26.680
as you start to play with the system, Lytx comes with a set of what we call the common

3
00:00:26.680 --> 00:00:27.680
schema.

4
00:00:27.680 --> 00:00:31.440
It's a set of attributes that you can use out of the box that don't require any mappings,

5
00:00:31.440 --> 00:00:34.360
any maintenance, any sort of configuration.

6
00:00:34.360 --> 00:00:37.680
Definitely always recommend that folks start with those and just sending data into the

7
00:00:37.680 --> 00:00:38.680
fields that exist.

8
00:00:38.680 --> 00:00:42.840
So I like to call it a couple of things that are really easy to use as you start to play

9
00:00:42.840 --> 00:00:43.840
with it.

10
00:00:43.840 --> 00:00:45.820
And then once you kind of get the hang of stuff, you can go look at your scheme and

11
00:00:45.820 --> 00:00:48.120
see what all is available.

12
00:00:48.120 --> 00:00:54.640
So if I go back over here.

13
00:00:54.640 --> 00:01:03.280
Let's pull, make sure we're on the right profile.

14
00:01:03.280 --> 00:01:09.040
Okay, so this is our profile on the left hand side.

15
00:01:09.040 --> 00:01:13.400
We'll make this a little bit bigger because it is probably teeny tiny.

16
00:01:13.400 --> 00:01:16.080
So we went through sort of just like first name, last name, email.

17
00:01:16.080 --> 00:01:20.920
As I'm playing with accounts, two really, really useful fields that are kind of special

18
00:01:20.920 --> 00:01:25.400
in our predefined schema or common schema is one, this attribute.

19
00:01:25.400 --> 00:01:28.720
So you can send, and we should do this within reason.

20
00:01:28.720 --> 00:01:32.300
You shouldn't send like 10,000 different attributes to be very clear.

21
00:01:32.300 --> 00:01:34.840
But as you're starting to play with it, you're just like, I want to send data into the system

22
00:01:34.840 --> 00:01:37.720
so that I can build an audience and see how it works.

23
00:01:37.720 --> 00:01:43.560
There's this sort of special attribute field where if you prefix a value with HTR underscore

24
00:01:43.560 --> 00:01:47.760
and then send a string, it'll actually add it to a map on the profile.

25
00:01:47.760 --> 00:01:52.880
So for instance, if I go in here and say, I'm going to just send the attribute underscore

26
00:01:52.880 --> 00:01:59.240
example as testing, I'm going to copy and paste that, send it as that user.

27
00:01:59.240 --> 00:02:05.040
And then over here, if I refresh and go to my profiles.

28
00:02:05.040 --> 00:02:07.760
So you'll see this field called custom user attributes.

29
00:02:07.760 --> 00:02:11.920
And in this case, we pass example as value testing, you can then actually build segments

30
00:02:11.920 --> 00:02:17.640
and do some sort of playing with that particular piece of data with no mapping, no configuration,

31
00:02:17.640 --> 00:02:19.400
nothing necessary.

32
00:02:19.400 --> 00:02:24.520
It works very similarly to some of our event and actions.

33
00:02:24.520 --> 00:02:29.340
So for instance, if I pass event open, it's going to map that to an email open and so

34
00:02:29.340 --> 00:02:30.520
on and so forth.

35
00:02:30.520 --> 00:02:36.360
But as an example, if we then go into the mapping for this attribute, just to see how

36
00:02:36.360 --> 00:02:40.960
it's defined, how we're able to essentially be smart enough to just pull if it matches

37
00:02:40.960 --> 00:02:47.080
that prefix, it gets mapped, we'll go into our schema, we'll go into fields, and we'll

38
00:02:47.080 --> 00:02:51.120
search for attribute.

39
00:02:51.120 --> 00:02:55.440
So if I pull up this field, which is not an identity field, so it's not going to merge

40
00:02:55.440 --> 00:02:59.320
profile data together, it's just going to surface information that you can segment,

41
00:02:59.320 --> 00:03:04.980
ultimately leverage it, one is a map string string, so it's going to take it and create

42
00:03:04.980 --> 00:03:08.160
a map of the information that you pass.

43
00:03:08.160 --> 00:03:12.720
And then we talked about merge operators, in this case, it's doing the merge operator

44
00:03:12.720 --> 00:03:16.680
of merge, which just means it's going to continue to push that object together, it's not going

45
00:03:16.680 --> 00:03:17.680
to fully overwrite it.

46
00:03:17.680 --> 00:03:24.080
So if I push a new key of like, example two, it's just going to add it to that object and

47
00:03:24.080 --> 00:03:25.080
combine the two together.

48
00:03:25.080 --> 00:03:26.720
So it's not going to overwrite it.

49
00:03:26.720 --> 00:03:30.240
If you open this up, there's a number of merge operators.

50
00:03:30.240 --> 00:03:35.240
So like I was mentioning, in the case of say, like, I want to know the first item that a

51
00:03:35.240 --> 00:03:39.400
user ever purchased, maybe you want to keep the oldest value of that.

52
00:03:39.400 --> 00:03:44.240
If I want to know the latest product that they purchased, you can keep the latest.

53
00:03:44.240 --> 00:03:45.920
Same thing with latest map and oldest map.

54
00:03:45.920 --> 00:03:50.560
So the different field types have different merge operators.

55
00:03:50.560 --> 00:03:54.520
I don't know if it's worth walking through, like all of the different options, they're

56
00:03:54.520 --> 00:03:56.000
all in our docs.

57
00:03:56.000 --> 00:03:58.840
But I would say as those things come up, and you'll see it kind of come up in some of the

58
00:03:58.840 --> 00:04:03.320
examples that we'll build, the merge operator is really important, and that it tells the

59
00:04:03.320 --> 00:04:05.720
system how to sort of combine that data together.

60
00:04:05.720 --> 00:04:09.760
Merge, merge is the one that I always like to touch on, because it's probably poorly

61
00:04:09.760 --> 00:04:12.680
named in the long run.

62
00:04:12.680 --> 00:04:16.280
But it's what sort of like keeps the old data and merges it with the new data that comes

63
00:04:16.280 --> 00:04:22.080
in so that you're not wholesale overwriting the object.

64
00:04:22.080 --> 00:04:27.240
So you've been doing or we've been using the JavaScript tag to collect data, which just

65
00:04:27.240 --> 00:04:30.240
uses our collection API.

66
00:04:30.240 --> 00:04:35.520
But there are a number of different ways that you can ultimately pull data into the system.

67
00:04:35.520 --> 00:04:39.040
So one of which is just our API.

68
00:04:39.040 --> 00:04:43.800
So for instance, I have just a simple, very simple JSON file over here, that's just an

69
00:04:43.800 --> 00:04:45.560
array of two objects.

70
00:04:45.560 --> 00:04:50.840
If I wanted to just push that information to a particular stream, and again, I'll share

71
00:04:50.840 --> 00:04:54.840
this document out so that you have some of the API calls, but I can just essentially

72
00:04:54.840 --> 00:04:57.920
fire a POST request to our collection API.

73
00:04:57.920 --> 00:05:01.440
The last parameter is the stream that it's going to pass to.

74
00:05:01.440 --> 00:05:06.560
So it's important to make sure that whatever stream you're passing data to has the mappings

75
00:05:06.560 --> 00:05:11.440
in place necessary to ultimately surface that on the profile.

76
00:05:11.440 --> 00:05:16.680
One major gotcha that a lot of folks run into is if you collect the data before it's mapped,

77
00:05:16.680 --> 00:05:20.600
it's not going to retroactively surface that data.

78
00:05:20.600 --> 00:05:24.520
We do have the concept of a replay or a rebuild.

79
00:05:24.520 --> 00:05:29.640
It's sort of I always like to describe it as like the last, the last ditch effort and

80
00:05:29.640 --> 00:05:33.320
that like if a customer were to go and goof up their data or mess up their mappings and

81
00:05:33.320 --> 00:05:37.680
need to essentially replay all of their events, we have the ability to do that from scratch

82
00:05:37.680 --> 00:05:40.120
to fix some of the issues that they run into.

83
00:05:40.120 --> 00:05:43.960
But it's a thing that one, can be very expensive, two, we don't like to do it very often.

84
00:05:43.960 --> 00:05:48.640
So it's important to think about before you actually collect data, where are you sending

85
00:05:48.640 --> 00:05:52.640
it and making sure that the mappings are in place to translate those raw events to the

86
00:05:52.640 --> 00:05:53.640
fields.

87
00:05:53.640 --> 00:05:56.240
Otherwise, it'll just sit in your stream and never get used.

88
00:05:57.120 --> 00:06:03.280
Could you just walk through the stream functionality stream once actually?

89
00:06:03.280 --> 00:06:08.320
And like streams just a tag that gets tagged onto the event that comes in.

90
00:06:08.320 --> 00:06:14.040
It's commonly just used to give you an idea of what the source it's from.

91
00:06:14.040 --> 00:06:19.080
So like if the data is coming from like Salesforce, the data from Salesforce will be in the Salesforce

92
00:06:19.080 --> 00:06:20.080
stream.

93
00:06:20.080 --> 00:06:26.160
If it's coming from the web, it goes to the default stream, which is typically web data.

94
00:06:27.080 --> 00:06:30.000
So it's really just a way to say, oh, this data from this source, I want to treat with

95
00:06:30.000 --> 00:06:34.520
these mappings, which are how the, it's kind of like the ETL process of how we convert

96
00:06:34.520 --> 00:06:36.520
that stream into fields.

97
00:06:36.520 --> 00:06:37.520
Yep.

```

#### Key takeaways

- Connect **Leveraging Common Schema** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 04 — Customizing Schema (fields & mappings)

<!-- ai_metadata: {"lesson_id":"04","type":"video","duration_seconds":540,"video_url":"https://cdn.jwplayer.com/previews/fGpn7GIn","thumbnail_url":"https://cdn.jwplayer.com/v2/media/fGpn7GIn/poster.jpg?width=720","topics":["Customizing","Schema","fields","mappings"]} -->

#### Video details

#### At a glance

- **Title:** 12-data-insights-custom-schema
- **Duration:** 9m
- **Media link:** https://cdn.jwplayer.com/previews/fGpn7GIn
- **Publish date (unix):** 1752872612

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113479 bps
- video/mp4 · 180p · 140452 bps
- video/mp4 · 270p · 157205 bps
- video/mp4 · 360p · 174433 bps
- video/mp4 · 540p · 224977 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/fGpn7GIn-120.vtt`

#### Transcript

The main thing that I want to cover in this session is how the fields and the mappings work and what their relationship is to streams. So a perfect example of that is we have a common schema, a bunch of the heavy lifting has already been done, we'll connect MailChimp in a little while and it automatically gets the fields and mappings, but it is inevitable that you're going to want to push custom data into Lytics.

So I built a very simple, also Game of Thrones themed CSV, where we're pulling in things like email and first name and last name, and we use the attribute field, which comes kind of pre-mapped. But then you run into this field where it's likelihood to rain, which is, you know, akin to say like a custom score that's coming out of your warehouse or whatever it may be. So this is a thing that isn't in the system, it's totally custom. We need to make sure that when we upload this data, whichever stream we upload it to, it's going to ultimately get mapped properly to that particular profile. So I wanted to walk through that example together anyway, so we can just start to do that.

So if I were to upload this information right now with no mappings, you would see it in the streams. So I'm not going to do that, but you could ultimately upload this to whatever stream you would want. You could go over here and you would see the raw data in the stream that actually was received, but it's not going to have anywhere to go with it. So it's going to sit in the stream as sort of a metric. It's never going to get mapped to the profile. It's just going to sort of confuse you. So it's really, really important to know that streams, to Eric's point, are kind of just a construct to help you separate the logic for mapping that data to a particular field.

So in our case, we want to add the field likelihood to rain. The way that I would start to do this, and there's a couple of different approaches, but because we know that this field is net new, you're going to first need to go create a field, right? So you need to put a place on the profile where this information can actually live. We don't want to use any of the common schema fields, so I'm going to go to create new. You first select an ID. This can be any sort of sluggified input, so I'm just going to paste likelihood to rain. We'll just call it that for now. We'll say, you know, custom score on likelihood of rain, which is really just informational, surfaced in the UI.

We'll choose the data type. So in our case, it looks like it's an integer. We want to be able to do things like greater than, less than in the segmentation engine. That's one really important point when you're choosing the data type. It impacts the way that you can actually leverage that particular field in segmentation. So for instance, if I were to make this a string, it would work, but when I go to build a segment, if I wanted to find anybody that was like greater than 50 on likelihood to rain, it's not going to work. It's not going to allow you to do that because it can't do a sort of numeric operation on a string. So those kind of things are really important to keep in mind as you're doing custom mapping. Obviously there's lots of different data types that you could use. There's arrays, there's string arrays, there's time arrays, there's maps, et cetera, et cetera. We're just going to choose an integer in this case. You can add an optional description. This is just purely for the UI, so we won't do that here.

You could choose it as an identity key if it's a string; you can't do that with an integer. We don't want to do that with likelihood to rain. If I were to accidentally do that, the net result would essentially be that anybody that had, say, like a 95 score of likelihood to rain would be merged into one super profile, which is definitely not what you would want to do. So that's why only clicking the identity key when it's absolutely necessary and you want to use it to merge data together is very, very important.

For a merge operator, because it's an integer, you'll see that there's a number of different ways that you can ultimately handle this data. Maybe I want the total purchase amount over time. You could do the integer as a sum. As new events come in, it's going to continue to add those up so that you have this one value that represents total purchase over time. You could just count the events. So instead of having an understanding of, like, the specific number, it's just how many times I've seen this sort of field change. The max number, the min number, the latest, the oldest. In our case, we want it to just represent the actual score value, and we want it to be the latest value so that if a new one comes in tomorrow when we push this information, it gets updated and overridden. So we're going to do latest. And then we'll leave all of the other fields. They're all grayed out, actually, in this case, because you can't set them. If we were to set or choose an array or a map, you'd have some more options, to Eric's kind of earlier point, on the size limit and how long some of that information hangs around.

The other thing that's really important is you can flag your fields as to whether they contain PII or not. Because a CDP is collecting user information, a lot of that information can be personally identifiable, which becomes really, really important. We have lots of controls in the UI to hide or show that information for particular users. In a lot of cases, our customers will maybe encrypt or encode or Base64 encode an email address to never expose that. So there's lots of controls that you'll see around privacy. This just lets you know the field may contain PII so that throughout the UI, we can put the proper controls in place. And then there's some categorization options that are totally optional. They just show up in the actual profile kind of filtering when you're exploring. So we'll just say that that's behavior for now.

So this is how we're going to configure our field. If I hit create field, this essentially just gives it a place to live on the profile. But there's still no logic on how do I take the raw data that's coming in and map it to that particular profile, right? There's still that kind of gap. So that's where mappings come in. So if I go back to my field, likelihood to rain, I'll go to current mappings and I see that there's no mappings. So we'll hit create new mapping and you'll see three new concepts that we have not touched on yet. So well, two of the three we haven't touched on yet. So stream is going to say, where do I want to create this mapping to? I can choose one of the streams that exist, or we'll just do demo custom CSV. I'm going to copy this and put it over here so I don't forget that.

And then you have an opportunity to do an expression or a condition. So what the expression does is there's a few different functions that we have. So you saw it like on email, as an example, where it's going to take it, it's going to lowercase it, it's going to do some basic validation. There's things for phone number to normalize. There's ways that you can split, you can uppercase, you can lowercase. There's a whole list of the different expressions in our documentation that we can kind of come back to. But this is one of the, I wouldn't call it a cleansing layer by any means, but it's one of the ways that you can manipulate some of the endpoints to make them consistent. In our case, we're just going to do likelihood to rain. So we just want to take a direct translation, oops, I pasted the wrong value. That's not what we want. So all I want to do is I want to take the key that'll ultimately come in as likelihood to rain in our CSV. And I want to say anytime that that value comes in, as long as it's on the demo custom CSV stream, we're going to map it up to that profile.

The conditions we'll come back to, and I'll show you a few different examples, but conditions give you a way to say, like when we're mapping conversion events or events in general, if it's a particular type, then I want to map it in this way. So you can add essentially if-else-equals type logic to a particular mapping so that there's conditional logic on whether it should map that field or not, as opposed to just, in this case, anytime that likelihood to rain shows up in this particular stream, it's going to map it to the field that I chose, which is this custom score likelihood to rain. So I'm going to create, oh yeah, go ahead.

One comment here real quick. If this was like an untrusted source where this data was coming from, you probably would want to put an expression in there that would validate that it's a number in the range of one to a hundred. But since this is a source that we generated that we know is clean the whole way through, we can just take the value straight from the CSV. Yep. And we'll pull up the docs and show some of like the conditions and expressions and whatnot, just so you know where it is. There's a whole bunch of things that you can do ultimately from a logic perspective, but try to keep it as simple as possible for this example. So I am going to create that mapping for that field.
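
To summarize the field-plus-mapping configuration built in this lesson, here is a hypothetical sketch. The interfaces and property names are invented for illustration (the real configuration is done in the product UI, and the stream name formatting is assumed), but the values mirror the walkthrough above.

```typescript
// Hypothetical sketch only; this object shape is invented for illustration.
interface CustomFieldSketch {
  id: string;
  description?: string;
  dataType: "int" | "string" | "map" | "array";
  identityKey: boolean;
  mergeOperator: "latest" | "oldest" | "sum" | "count" | "max" | "min";
  containsPII: boolean;
  category?: string;
}

interface MappingSketch {
  stream: string;        // stream the raw data arrives on
  expression: string;    // how the raw key is read or normalized
  condition?: string;    // optional if/else-style guard
  targetField: string;   // CustomFieldSketch.id to write into
}

// The "likelihood to rain" example from the lesson:
const likelihoodToRainField: CustomFieldSketch = {
  id: "likelihood_to_rain",
  description: "Custom score on likelihood of rain",
  dataType: "int",          // integer, so segments can use greater-than/less-than
  identityKey: false,       // a shared score must never merge profiles together
  mergeOperator: "latest",  // new uploads overwrite the previous score
  containsPII: false,
  category: "behavior",
};

const likelihoodToRainMapping: MappingSketch = {
  stream: "demo_custom_csv",         // assumed slug for the demo custom CSV stream
  expression: "likelihood_to_rain",  // direct read of the incoming CSV column
  targetField: "likelihood_to_rain",
  // For an untrusted source, the lesson suggests an expression that also
  // validates the value is a number in the 1-100 range.
};
```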

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:19.160
The main thing that I want to cover in this session is how the fields and the mappings

2
00:00:19.160 --> 00:00:21.060
work and how their relationship is to stream.

3
00:00:21.060 --> 00:00:25.920
So a perfect example of that is we have a common schema, a bunch of the heavy lifting

4
00:00:25.920 --> 00:00:29.600
has already been done, we'll connect MailChimp in a little while and it automatically gets

5
00:00:29.600 --> 00:00:34.100
the fields and mappings, but it is inevitable that you're going to want to push custom data

6
00:00:34.100 --> 00:00:35.100
into Lytics.

7
00:00:35.100 --> 00:00:41.320
So I built a very simple, also Game of Thrones themed CSV, where we're pulling in things

8
00:00:41.320 --> 00:00:44.760
like email and first name and last name, and we use the attribute field, which comes kind

9
00:00:44.760 --> 00:00:45.900
of pre-mapped.

10
00:00:45.900 --> 00:00:49.640
But then you run into this field where it's likelihood to rain, which is, you know, akin

11
00:00:49.640 --> 00:00:53.520
to say like a custom score that's coming out of your warehouse or whatever it may be.

12
00:00:53.520 --> 00:00:57.300
So this is a thing that isn't in the system, it's totally custom.

13
00:00:57.300 --> 00:01:01.980
We need to make sure that when we upload this data, whichever stream we upload it to,

14
00:01:01.980 --> 00:01:05.800
it's going to ultimately get mapped properly to that particular profile.

15
00:01:05.800 --> 00:01:11.020
So I wanted to walk through that example together anyway, so we can just start to do that.

16
00:01:11.020 --> 00:01:17.980
So if I were to upload this information right now with no mappings, you would see it in

17
00:01:17.980 --> 00:01:18.980
the streams.

18
00:01:18.980 --> 00:01:23.100
So I'm not going to do that, but you could ultimately upload this to whatever stream

19
00:01:23.100 --> 00:01:24.380
you would want.

20
00:01:24.420 --> 00:01:29.580
You could go over here and you would see the raw data in the stream that actually was received,

21
00:01:29.580 --> 00:01:31.140
but it's not going to have anywhere to go with it.

22
00:01:31.140 --> 00:01:33.860
So it's going to sit in the stream as sort of a metric.

23
00:01:33.860 --> 00:01:35.780
It's never going to get mapped to the profile.

24
00:01:35.780 --> 00:01:37.380
It's just going to sort of confuse you.

25
00:01:37.380 --> 00:01:41.700
So it's really, really important to know that streams to Eric's point are kind of just a

26
00:01:41.700 --> 00:01:48.020
construct to help you separate the logic for mapping that data to a particular field.

27
00:01:48.020 --> 00:01:53.120
So in our case, we want to add the field likelihood to rain.

28
00:01:53.120 --> 00:01:55.800
The way that I would start to do this, and there's a couple of different approaches,

29
00:01:55.800 --> 00:02:00.840
but because we know that this field is net new, you're going to first need to go create

30
00:02:00.840 --> 00:02:01.840
a field, right?

31
00:02:01.840 --> 00:02:05.720
So you need to put a place on the profile where this information can actually live.

32
00:02:05.720 --> 00:02:09.920
We don't want to use any of the common schema fields, so I'm going to go to create new.

33
00:02:09.920 --> 00:02:11.320
You first select an ID.

34
00:02:11.320 --> 00:02:16.360
This can be any sort of sluggified input, so I'm just going to paste likelihood to rain.

35
00:02:16.360 --> 00:02:18.400
We'll just call it that for now.

36
00:02:18.400 --> 00:02:29.080
We'll say, you know, custom score on like rain, which is really just surfaced in the

37
00:02:29.080 --> 00:02:30.840
information in the UI.

38
00:02:30.840 --> 00:02:32.400
We'll choose the data type.

39
00:02:32.400 --> 00:02:35.000
So in our case, it looks like it's an integer.

40
00:02:35.000 --> 00:02:40.120
We want to be able to do things like greater than less than in the segmentation engine.

41
00:02:40.120 --> 00:02:43.720
That's one really important point when you're choosing the data type.

42
00:02:43.720 --> 00:02:48.560
It impacts the way that you can actually leverage that particular field in segmentation.

43
00:02:48.560 --> 00:02:52.440
So for instance, if I were to make this a string, it would work, but when I go to build

44
00:02:52.440 --> 00:02:58.320
a segment, if I wanted to find anybody that was like greater than 50 on likelihood to rain,

45
00:02:58.320 --> 00:02:59.320
it's not going to work.

46
00:02:59.320 --> 00:03:02.640
It's not going to allow you to do that because it can't do a sort of numeric operation on

47
00:03:02.640 --> 00:03:04.000
a string.

48
00:03:04.000 --> 00:03:10.520
So those kind of things are really important to keep in mind as you're doing custom mapping.

49
00:03:10.520 --> 00:03:14.760
Obviously there's lots of different data types that you could use.

50
00:03:14.760 --> 00:03:19.000
There's arrays, there's string arrays, there's time arrays, there's maps, et cetera, et cetera.

51
00:03:19.000 --> 00:03:21.960
We're just going to choose an integer in this case.

52
00:03:21.960 --> 00:03:23.200
You can add an optional description.

53
00:03:23.200 --> 00:03:26.140
This is just purely for the UI, so we won't do that here.

54
00:03:26.140 --> 00:03:28.200
You could choose it as an identity key.

55
00:03:28.200 --> 00:03:30.120
If it's a string, you can't do that with an integer.

56
00:03:30.120 --> 00:03:33.120
We don't want to do that with likelihood to rain.

57
00:03:33.120 --> 00:03:38.920
If I were to accidentally do that, the net result would essentially be that anybody that

58
00:03:38.920 --> 00:03:45.120
had, say, like a 95 score of likelihood to rain would be merged into one super profile, which

59
00:03:45.120 --> 00:03:47.160
is definitely not what you would want to do.

60
00:03:47.160 --> 00:03:51.720
So that's why only clicking the identity key when it's absolutely necessary and you want

61
00:03:51.720 --> 00:03:54.720
to use it to merge data together is very, very important.

62
00:03:54.720 --> 00:03:59.080
For a merge operator, because it's an integer, you'll see that there's a number of different

63
00:03:59.080 --> 00:04:01.720
ways that you can ultimately handle this data.

64
00:04:01.720 --> 00:04:05.880
Maybe I want the total purchase amount over time.

65
00:04:05.880 --> 00:04:07.280
You could do the integer as a sum.

66
00:04:07.280 --> 00:04:10.120
As new events come in, it's going to continue to add those up so that you have this one

67
00:04:10.120 --> 00:04:14.000
value that represents total purchase over time.

68
00:04:14.000 --> 00:04:15.400
You could just count the events.

69
00:04:15.400 --> 00:04:19.280
So instead of having an understanding of, like, the specific number, it's just how many

70
00:04:19.280 --> 00:04:22.280
times I've seen this sort of field change.

71
00:04:22.280 --> 00:04:24.280
The max number, the min number, the latest, the oldest.

72
00:04:24.280 --> 00:04:27.880
In our case, we want it to just represent the actual score value, and we want it to

73
00:04:27.880 --> 00:04:32.240
be the latest value so that if a new one comes in tomorrow when we push this information,

74
00:04:32.240 --> 00:04:33.520
it gets updated and overridden.

75
00:04:33.520 --> 00:04:35.680
So we're going to do latest.

76
00:04:35.680 --> 00:04:39.200
And then we'll leave all of the other fields.

77
00:04:39.200 --> 00:04:41.640
They're all grayed out, actually, in this case, because you can't set them.

78
00:04:41.640 --> 00:04:45.480
If we were to set or choose an array or a map, you'd have some more options to Eric's

79
00:04:45.480 --> 00:04:51.280
kind of earlier point on the size limit and how long some of that information hangs around.

80
00:04:51.280 --> 00:04:56.640
The other thing that's really important is you can flag your fields as if they contain

81
00:04:56.640 --> 00:04:58.240
PII or not.

82
00:04:58.240 --> 00:05:02.480
Because a CDP is collecting user information, a lot of that information can be personally

83
00:05:02.480 --> 00:05:04.420
identified, which becomes really, really important.

84
00:05:04.420 --> 00:05:09.860
We have lots of controls in the UI to hide or show that information for particular users.

85
00:05:09.860 --> 00:05:15.780
In a lot of cases, our customers will maybe encrypt or encode or Base64 encode an email

86
00:05:15.780 --> 00:05:17.220
address to never expose that.

87
00:05:17.220 --> 00:05:21.180
So there's lots of controls that you'll see around privacy.

88
00:05:21.180 --> 00:05:26.460
This just lets you know the field may contain PII so that throughout the UI, we can put

89
00:05:26.460 --> 00:05:27.940
the proper controls in place.

90
00:05:27.940 --> 00:05:32.140
And then there's some categorization options that are totally optional.

91
00:05:32.140 --> 00:05:36.500
They just show up in the actual profile kind of filtering when you're exploring.

92
00:05:36.500 --> 00:05:40.060
So we'll just say that that's behavior for now.

93
00:05:40.060 --> 00:05:43.220
So this is how we're going to configure our field.

94
00:05:43.220 --> 00:05:48.140
If I hit create field, this essentially just gives it a place to live on the profile.

95
00:05:48.140 --> 00:05:53.420
But there's still no logic on how do I take the raw data that's coming in and map it to

96
00:05:53.420 --> 00:05:54.900
that particular profile, right?

97
00:05:54.900 --> 00:05:56.780
There's still that kind of gap.

98
00:05:56.780 --> 00:05:58.480
So that's where mappings come in.

99
00:05:58.560 --> 00:06:04.240
So if I go back to my field, likelihood to rain, I'll go to current mappings and I see that

100
00:06:04.240 --> 00:06:06.280
there's no mappings.

101
00:06:06.280 --> 00:06:13.400
So we'll hit create new mapping and you'll see three new concepts that we have not touched

102
00:06:13.400 --> 00:06:14.400
on yet.

103
00:06:14.400 --> 00:06:16.840
So well, two of the three we haven't touched on yet.

104
00:06:16.840 --> 00:06:20.600
So stream is going to say, where do I want to create this mapping to?

105
00:06:20.600 --> 00:06:28.040
I can choose one of the streams that exist, or we'll just do demo custom CSV.

106
00:06:28.080 --> 00:06:34.000
I'm going to copy this and put it over here so I don't forget that.

107
00:06:34.000 --> 00:06:38.480
And then you have an opportunity to do an expression or a condition.

108
00:06:38.480 --> 00:06:42.600
So what the expression does is there's a few different functions that we have.

109
00:06:42.600 --> 00:06:45.960
So you saw it like on email, as an example, where it's going to take it, it's going to

110
00:06:45.960 --> 00:06:49.040
lowercase it, it's going to do some basic validation.

111
00:06:49.040 --> 00:06:50.840
There's things for phone number to normalize.

112
00:06:50.840 --> 00:06:54.280
There's ways that you can split, you can uppercase, you can lowercase.

113
00:06:54.280 --> 00:06:58.040
There's a whole list of the different expressions in our documentation that we can kind of come

114
00:06:58.040 --> 00:06:59.280
back to.

115
00:06:59.280 --> 00:07:04.320
But this is one of the, I wouldn't call it a cleansing layer by any means, but it's one

116
00:07:04.320 --> 00:07:08.200
of the ways that you can manipulate some of the endpoints to make them consistent.

117
00:07:08.200 --> 00:07:14.240
In our case, we're just going to do likelihood to rain.

118
00:07:14.240 --> 00:07:19.480
So we just want to like take a direct translation, oops, I pasted the wrong value.

119
00:07:19.480 --> 00:07:21.900
That's not what we want.

120
00:07:21.900 --> 00:07:26.300
So all I want to do is I want to take the key that'll ultimately come in as likelihood

121
00:07:26.300 --> 00:07:28.060
to rain in our CSV.

122
00:07:28.060 --> 00:07:31.780
And I want to say anytime that that value comes in, we're going to, as long as it's

123
00:07:31.780 --> 00:07:36.700
on the demo custom CSV stream, we're going to map it up to that profile.

124
00:07:36.700 --> 00:07:39.940
The conditions we'll come back to, and I'll show you a few different examples, but conditions

125
00:07:39.940 --> 00:07:46.820
give you a way to say, like when we're mapping conversion events or events in general, if

126
00:07:46.820 --> 00:07:49.740
it's a particular type, then I want to map it in this way.

127
00:07:49.740 --> 00:07:55.580
So you can add essentially if else equals type logic to a particular mapping so that

128
00:07:55.580 --> 00:07:59.980
there's conditional logic, whether it should map that field or not, as opposed to just

129
00:07:59.980 --> 00:08:05.260
in this case, anytime that likelihood to rain shows up in this particular stream, it's going

130
00:08:05.260 --> 00:08:11.200
to map it to the field that I chose, which is this custom score likelihood to rain.

131
00:08:11.200 --> 00:08:13.420
So I'm going to create, oh yeah, go ahead.

132
00:08:13.420 --> 00:08:14.980
One comment here real quick.

133
00:08:14.980 --> 00:08:21.260
If this was like a untrusted source, where this data was coming from, you probably would

134
00:08:21.260 --> 00:08:24.980
want to put an expression in there that would validate that it's a number in the range of

135
00:08:24.980 --> 00:08:27.120
one to a hundred.

136
00:08:27.120 --> 00:08:31.380
But since this is a source that we generated that we know is clean the whole way through,

137
00:08:31.380 --> 00:08:34.220
we can just take the value straight from the CSV.

138
00:08:34.220 --> 00:08:35.220
Yep.

139
00:08:35.220 --> 00:08:39.540
And we'll pull up the docs and show some of like the conditions and expressions and whatnot,

140
00:08:39.540 --> 00:08:40.780
just so you know where it is.

141
00:08:40.780 --> 00:08:45.940
There's a whole bunch of things that you can do ultimately from a logic perspective, but

142
00:08:45.940 --> 00:08:47.980
try to keep it as simple as possible for this example.

143
00:08:47.980 --> 00:08:51.460
So I am going to create that mapping for that field.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] The main thing that I want to cover in this session is how the fields and the mappings
[00:19] work and how their relationship is to stream.
[00:21] So a perfect example of that is we have a common schema, a bunch of the heavy lifting
[00:25] has already been done, we'll connect MailChimp in a little while and it automatically gets
[00:29] the fields and mappings, but it is inevitable that you're going to want to push custom data
[00:34] into Lytics.
[00:35] So I built a very simple, also Game of Thrones themed CSV, where we're pulling in things
[00:41] like email and first name and last name, and we use the attribute field, which comes kind
[00:44] of pre-mapped.
[00:45] But then you run into this field where it's likelihood to rain, which is, you know, akin
[00:49] to say like a custom score that's coming out of your warehouse or whatever it may be.
[00:53] So this is a thing that isn't in the system, it's totally custom.
[00:57] We need to make sure that when we upload this data, whichever stream we upload it to,
[01:01] it's going to ultimately get mapped properly to that particular profile.
[01:05] So I wanted to walk through that example together anyway, so we can just start to do that.
[01:11] So if I were to upload this information right now with no mappings, you would see it in
[01:17] the streams.
[01:18] So I'm not going to do that, but you could ultimately upload this to whatever stream
[01:23] you would want.
[01:24] You could go over here and you would see the raw data in the stream that actually was received,
[01:29] but it's not going to have anywhere to go with it.
[01:31] So it's going to sit in the stream as sort of a metric.
[01:33] It's never going to get mapped to the profile.
[01:35] It's just going to sort of confuse you.
[01:37] So it's really, really important to know that streams to Eric's point are kind of just a
[01:41] construct to help you separate the logic for mapping that data to a particular field.
[01:48] So in our case, we want to add the field likelihood to rain.
[01:53] The way that I would start to do this, and there's a couple of different approaches,
[01:55] but because we know that this field is net new, you're going to first need to go create
[02:00] a field, right?
[02:01] So you need to put a place on the profile where this information can actually live.
[02:05] We don't want to use any of the common schema fields, so I'm going to go to create new.
[02:09] You first select an ID.
[02:11] This can be any sort of sluggified input, so I'm just going to paste likelihood to rain.
[02:16] We'll just call it that for now.
[02:18] We'll say, you know, custom score on like rain, which is really just surfaced in the
[02:29] information in the UI.
[02:30] We'll choose the data type.
[02:32] So in our case, it looks like it's an integer.
[02:35] We want to be able to do things like greater than less than in the segmentation engine.
[02:40] That's one really important point when you're choosing the data type.
[02:43] It impacts the way that you can actually leverage that particular field in segmentation.
[02:48] So for instance, if I were to make this a string, it would work, but when I go to build
[02:52] a segment, if I wanted to find anybody that was like greater than 50 on likelihood to rain,
[02:58] it's not going to work.
[02:59] It's not going to allow you to do that because it can't do a sort of numeric operation on
[03:02] a string.
[03:04] So those kind of things are really important to keep in mind as you're doing custom mapping.
[03:10] Obviously there's lots of different data types that you could use.
[03:14] There's arrays, there's string arrays, there's time arrays, there's maps, et cetera, et cetera.
[03:19] We're just going to choose an integer in this case.
[03:21] You can add an optional description.
[03:23] This is just purely for the UI, so we won't do that here.
[03:26] You could choose it as an identity key.
[03:28] If it's a string, you can't do that with an integer.
[03:30] We don't want to do that with likelihood to rain.
[03:33] If I were to accidentally do that, the net result would essentially be that anybody that
[03:38] had, say, like a 95 score of likelihood to rain would be merged into one super profile, which
[03:45] is definitely not what you would want to do.
[03:47] So that's why only clicking the identity key when it's absolutely necessary and you want
```

#### Key takeaways

- Connect **Customizing Schema (fields & mappings)** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 05 — The Importance of "Identity" Fields

<!-- ai_metadata: {"lesson_id":"05","type":"video","duration_seconds":150,"video_url":"https://cdn.jwplayer.com/previews/6luUta7L","thumbnail_url":"https://cdn.jwplayer.com/v2/media/6luUta7L/poster.jpg?width=720","topics":["The","Importance","Identity","Fields"]} -->

#### Video details

#### At a glance

- **Title:** 13-data-insights-importance-of-identity-fields
- **Duration:** 2m 30s
- **Media link:** https://cdn.jwplayer.com/previews/6luUta7L
- **Publish date (unix):** 1752873256

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 114055 kbps
- video/mp4 · 180p · 180p · 145867 kbps
- video/mp4 · 270p · 270p · 165550 kbps
- video/mp4 · 360p · 360p · 178700 kbps
- video/mp4 · 406p · 406p · 191694 kbps
- video/mp4 · 540p · 540p · 230255 kbps
- video/mp4 · 720p · 720p · 292367 kbps
- video/mp4 · 1080p · 1080p · 470432 kbps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/6luUta7L-120.vtt`

#### Transcript

So now I have a way to map likelihood to rain from this new stream that we're going to create based on the CSV to the new likelihood to rain field; however, every stream has to have at least one identity key in order to map it up to that profile. So if I were to just publish this and import my CSV, it would say, okay, I found likelihood to rain and that score is 85 for Jon Snow. But I don't have any context that Jon Snow is the sort of owner of that particular event until I've created a mapping from the new stream that we're going to create to the email field. If I go into fields and email again, as an example, there's a whole bunch of mappings, none of which are represented by this new stream that we're going to create. So I need to also create, oops, a, oh boy, misclicked, go back in here. So I need to also create a new mapping for the new stream that we're going to add. That's why I copied it over here. So I also want to say that, okay, in the demo custom CSV stream that we're about to create, I also want to make sure that I map email to email, I think we called it email, yeah. So email to email, just like it is. So it's going to do some, and I don't know if you know, off the top of your head, Eric, all the logic that the email function actually does. I think it's pretty basic validation. Yeah, basically there's a giant regular expression under the hood that validates that it's an email. Got it. Basically, if it matches that regular expression that says it's a valid email, then it's an email. Perfect. So we're going to make sure that we do that just so that it's consistent and mapped the same way that all the other sources are. So now I can create that mapping. So now at this point, I've created some changes, I've added a new field, I've mapped that new field to the new stream, and I've also made the association from the new stream that has an email to give it a path essentially to map it to that master profile.
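As a rough mental model of why the stream also needs an email mapping, here is a small, hypothetical Python sketch: without a usable identity key the event has no profile to attach to, and the email expression lowercases and loosely validates the value first. The regex and function names are illustrative assumptions, not Lytics' actual validation logic.

```python
# Illustrative sketch only -- not Lytics' identity-resolution code.
import re

# Deliberately simple pattern; the real email function uses a much larger regex.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def normalize_email(raw):
    """Lowercase and loosely validate an email identity key."""
    email = (raw or "").strip().lower()
    return email if EMAIL_RE.match(email) else None

profiles = {}  # profiles keyed by the email identity key

def ingest(event):
    """Attach a demo-custom-CSV event to a profile via its identity key."""
    email = normalize_email(event.get("email"))
    if email is None:
        return None  # no identity key mapped: the row stays in the stream only
    profile = profiles.setdefault(email, {"email": email})
    if "likelihood to rain" in event:
        profile["likelihood_to_rain"] = event["likelihood to rain"]  # latest wins
    return profile

ingest({"email": "Jon.Snow@Example.com", "likelihood to rain": 85})
print(profiles)
# {'jon.snow@example.com': {'email': 'jon.snow@example.com', 'likelihood_to_rain': 85}}
```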

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:20.360
So now I have a way to map likelihood to rain from this new stream that we're going to create

2
00:00:20.360 --> 00:00:27.000
based on the CSV to the new likelihood to rain field, however, every stream has to have

3
00:00:27.000 --> 00:00:30.840
at least one identity key in order to map it up to that profile.

4
00:00:30.840 --> 00:00:35.660
So if I were to just publish this and import my CSV, it would say, okay, I found likelihood

5
00:00:35.660 --> 00:00:39.060
to rain and that score is 85 for Jon Snow.

6
00:00:39.060 --> 00:00:45.180
But I don't have any context of that Jon Snow is the sort of owner of that particular event

7
00:00:45.180 --> 00:00:50.220
until I've created a mapping from the new stream that we're going to create to the email

8
00:00:50.220 --> 00:00:51.220
field.

9
00:00:52.060 --> 00:00:58.780
If I go into fields and email again, as an example, there's a whole bunch of mappings,

10
00:00:58.780 --> 00:01:03.380
none of which are represented by this new stream that we're going to create.

11
00:01:03.380 --> 00:01:18.020
So I need to also create, oops, a, oh boy, misclicked, go back in here.

12
00:01:18.020 --> 00:01:22.220
So I need to also create a new mapping for the new stream that we're going to add.

13
00:01:22.220 --> 00:01:24.560
That's why I copied it over here.

14
00:01:24.560 --> 00:01:28.820
So I also want to say that, okay, in the demo custom CSV stream that we're about to create,

15
00:01:28.820 --> 00:01:34.940
I also want to make sure that I map email to email, I think we called it email, yeah.

16
00:01:34.940 --> 00:01:37.100
So email to email, just like it is.

17
00:01:37.100 --> 00:01:40.020
So it's going to do some, and I don't know if you know all the logic actually off the

18
00:01:40.020 --> 00:01:44.660
top of your head, Eric, that the email function actually does, I think it's pretty basic validation.

19
00:01:45.260 --> 00:01:51.900
Yeah, it basically, there's a giant regular expression under the hood of that, that validates

20
00:01:51.900 --> 00:01:52.900
that it's an email.

21
00:01:52.900 --> 00:01:53.900
Got it.

22
00:01:53.900 --> 00:01:59.140
It's basically, if it has that regular expression that says that it's a valid email, then it's

23
00:01:59.140 --> 00:02:00.140
an email.

24
00:02:00.140 --> 00:02:01.140
Perfect.

25
00:02:01.140 --> 00:02:04.700
So we're going to make sure that we do that just so that it's consistent and mapped the

26
00:02:04.700 --> 00:02:07.620
same way that all the other sources are.

27
00:02:07.620 --> 00:02:10.100
So now I can create that mapping.

28
00:02:10.100 --> 00:02:14.020
So now at this point, I've created some changes, I've added a new field, I've mapped that new

29
00:02:14.020 --> 00:02:17.060
field to the new stream, and I've also made the association from the new stream that has

30
00:02:17.060 --> 00:02:21.780
an email to give it a path essentially to map it to that master profile.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] So now I have a way to map likelihood to rain from this new stream that we're going to create
[00:20] based on the CSV to the new likelihood to rain field, however, every stream has to have
[00:27] at least one identity key in order to map it up to that profile.
[00:30] So if I were to just publish this and import my CSV, it would say, okay, I found likelihood
[00:35] to rain and that score is 85 for Jon Snow.
[00:39] But I don't have any context of that Jon Snow is the sort of owner of that particular event
[00:45] until I've created a mapping from the new stream that we're going to create to the email
[00:50] field.
[00:52] If I go into fields and email again, as an example, there's a whole bunch of mappings,
[00:58] none of which are represented by this new stream that we're going to create.
[01:03] So I need to also create, oops, a, oh boy, misclicked, go back in here.
[01:18] So I need to also create a new mapping for the new stream that we're going to add.
[01:22] That's why I copied it over here.
[01:24] So I also want to say that, okay, in the demo custom CSV stream that we're about to create,
[01:28] I also want to make sure that I map email to email, I think we called it email, yeah.
[01:34] So email to email, just like it is.
[01:37] So it's going to do some, and I don't know if you know all the logic actually off the
[01:40] top of your head, Eric, that the email function actually does, I think it's pretty basic validation.
[01:45] Yeah, it basically, there's a giant regular expression under the hood of that, that validates
[01:51] that it's an email.
[01:52] Got it.
[01:53] It's basically, if it has that regular expression that says that it's a valid email, then it's
[01:59] an email.
[02:00] Perfect.
[02:01] So we're going to make sure that we do that just so that it's consistent and mapped the
[02:04] same way that all the other sources are.
[02:07] So now I can create that mapping.
[02:10] So now at this point, I've created some changes, I've added a new field, I've mapped that new
[02:14] field to the new stream, and I've also made the association from the new stream that has
[02:17] an email to give it a path essentially to map it to that master profile.
```

#### Key takeaways

- Connect **The Importance of "Identity" Fields** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 06 — Publishing Schema & Version Control

<!-- ai_metadata: {"lesson_id":"06","type":"video","duration_seconds":98,"video_url":"https://cdn.jwplayer.com/previews/PCj1HuBz","thumbnail_url":"https://cdn.jwplayer.com/v2/media/PCj1HuBz/poster.jpg?width=720","topics":["Publishing","Schema","Version","Control"]} -->

#### Video details

#### At a glance

- **Title:** 14-data-insights-publishing-schema
- **Duration:** 1m 38s
- **Media link:** https://cdn.jwplayer.com/previews/PCj1HuBz
- **Publish date (unix):** 1752873656

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113834 kbps
- video/mp4 · 270p · 270p · 171748 kbps

#### Transcript

The final step as you're changing your schema is to actually publish the version. So if, on the left-hand side under schema, I go to versions, you'll see that, okay, there are three unpublished changes that we've just made. So I actually have to go in here and hit publish changes. It'll walk me through the sort of diff of what was added: we added a new field, we added two mappings, and there are no ranking changes. We will come back to ranking and the importance of that, you know, in a by-field or an identity field; we'll go through that next. So you're going to name the version something, so "added likelihood", and we'll just do the same thing for the description. And I'm going to publish that change. So now at this point, all of those changes have been merged with my actual live schema. One caveat here to be aware of is that it can take some time for schema changes to be reflected in the UI and take effect. But if, as you're playing with it, you're actually customizing the fields and adding some stuff and it's not working right away, we have to kick what we call field info on our side, which is what refreshes all the things to make sure that the mappings and whatnot take place.
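A rough way to picture what the publish screen summarizes is to diff the draft schema against the live one and list what would change. This is a hypothetical sketch of the idea only; the data shapes and names are invented for illustration, not the Lytics versioning API.

```python
# Hypothetical sketch of the kind of diff a schema publish walks you through.
live_schema = {
    "fields": {"email": "string"},
    "mappings": {("default", "email"): "email"},
}
draft_schema = {
    "fields": {"email": "string", "likelihood_to_rain": "integer"},
    "mappings": {
        ("default", "email"): "email",
        ("demo_custom_csv", "likelihood to rain"): "likelihood_to_rain",
        ("demo_custom_csv", "email"): "email",
    },
}

added_fields = sorted(set(draft_schema["fields"]) - set(live_schema["fields"]))
added_mappings = sorted(set(draft_schema["mappings"]) - set(live_schema["mappings"]))

print(f"{len(added_fields)} new field(s): {added_fields}")
print(f"{len(added_mappings)} new mapping(s): {added_mappings}")
# Publishing would merge this draft into the live schema as a named, versioned change.
```

The counts line up with the lesson: one new field plus two new mappings are the three unpublished changes waiting to be published.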

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:20.440
The final step as you're changing your schema is to actually publish the version.

2
00:00:20.440 --> 00:00:24.260
So if on the left hand side under schema, I go to versions, you'll see that okay, there's

3
00:00:24.260 --> 00:00:27.520
three unpublished changes that we've just made.

4
00:00:27.520 --> 00:00:30.960
So I actually have to go in here and hit publish changes, it'll walk me through the sort of

5
00:00:30.960 --> 00:00:36.680
diff what was added, we added a new field, we added two mappings, there's no ranking

6
00:00:36.680 --> 00:00:40.560
changes, we will come back to ranking and the importance of that, you know, in a by-field

7
00:00:40.560 --> 00:00:43.840
or an identity field, we'll go through next.

8
00:00:43.840 --> 00:00:45.960
So you're going to name something around the version.

9
00:00:45.960 --> 00:00:58.000
So added, likely, we'll just do the same thing for the description.

10
00:00:58.000 --> 00:00:59.720
And I'm going to publish that change.

11
00:00:59.720 --> 00:01:05.840
So now at this point, all of those changes have been merged with my actual live schema.

12
00:01:05.840 --> 00:01:12.160
One caveat here to be aware of is, it can take some time for schema changes to be reflected

13
00:01:12.160 --> 00:01:15.640
in the UI and take effect.

14
00:01:15.640 --> 00:01:18.440
But if as you're playing with it, you're actually customizing the fields and adding

15
00:01:18.440 --> 00:01:23.600
some stuff, and it's not working right away, we have to kick what we call field info on

16
00:01:23.600 --> 00:01:27.280
our side, which is what refreshes all the things to make sure that the mappings and

17
00:01:27.280 --> 00:01:28.200
whatnot take place.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] The final step as you're changing your schema is to actually publish the version.
[00:20] So if on the left hand side under schema, I go to versions, you'll see that okay, there's
[00:24] three unpublished changes that we've just made.
[00:27] So I actually have to go in here and hit publish changes, it'll walk me through the sort of
[00:30] diff what was added, we added a new field, we added two mappings, there's no ranking
[00:36] changes, we will come back to ranking and the importance of that, you know, in a by-field
[00:40] or an identity field, we'll go through next.
[00:43] So you're going to name something around the version.
[00:45] So added, likely, we'll just do the same thing for the description.
[00:58] And I'm going to publish that change.
[00:59] So now at this point, all of those changes have been merged with my actual live schema.
[01:05] One caveat here to be aware of is, it can take some time for schema changes to be reflected
[01:12] in the UI and take effect.
[01:15] But if as you're playing with it, you're actually customizing the fields and adding
[01:18] some stuff, and it's not working right away, we have to kick what we call field info on
[01:23] our side, which is what refreshes all the things to make sure that the mappings and
[01:27] whatnot take place.
```

#### Key takeaways

- Connect **Publishing Schema & Version Control** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 07 — Working with APIs & CSVs

<!-- ai_metadata: {"lesson_id":"07","type":"video","duration_seconds":301,"video_url":"https://cdn.jwplayer.com/previews/5OaqXTP0","thumbnail_url":"https://cdn.jwplayer.com/v2/media/5OaqXTP0/poster.jpg?width=720","topics":["Working","with","APIs","CSVs"]} -->

#### Video details

#### At a glance

- **Title:** 15-data-insights-working-with-apis-and-csvs
- **Duration:** 5m 1s
- **Media link:** https://cdn.jwplayer.com/previews/5OaqXTP0
- **Publish date (unix):** 1752876729

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113680 kbps
- video/mp4 · 180p · 180p · 145874 kbps
- video/mp4 · 270p · 270p · 166323 kbps
- video/mp4 · 360p · 360p · 183204 kbps
- video/mp4 · 406p · 406p · 196247 kbps
- video/mp4 · 540p · 540p · 240152 kbps
- video/mp4 · 720p · 720p · 307691 kbps
- video/mp4 · 1080p · 1080p · 506438 kbps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/5OaqXTP0-120.vtt`

#### Video transcript

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish -->
[00:00] Transcript not attached in source entry.
```

#### Key takeaways

- Connect **Working with APIs & CSVs** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 08 — Working with Integrations

<!-- ai_metadata: {"lesson_id":"08","type":"video","duration_seconds":450,"video_url":"https://cdn.jwplayer.com/previews/gzn6uDlP","thumbnail_url":"https://cdn.jwplayer.com/v2/media/gzn6uDlP/poster.jpg?width=720","topics":["Working","with","Integrations"]} -->

#### Video details

#### At a glance

- **Title:** 16-data-insights-working-with-integrationsed
- **Duration:** 7m 30s
- **Media link:** https://cdn.jwplayer.com/previews/gzn6uDlP
- **Publish date (unix):** 1752876102

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113470 kbps
- video/mp4 · 180p · 180p · 143275 kbps
- video/mp4 · 270p · 270p · 163596 kbps
- video/mp4 · 360p · 360p · 184251 kbps
- video/mp4 · 406p · 406p · 198040 kbps
- video/mp4 · 540p · 540p · 243738 kbps
- video/mp4 · 720p · 720p · 316433 kbps
- video/mp4 · 1080p · 1080p · 530638 kbps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/gzn6uDlP-120.vtt`

#### Transcript

The other popular way to get data into Lytics, which is super nice and magical because all of the fields and mappings have already been handled by our data team, is using one of our pre-built integrations. So out of the box, there are hundreds of different ways to connect with different tools. And under most of these tools, there are several different jobs. So for instance, if I search for Google Cloud, as an example, and I go under here, not only can we integrate with Google Cloud, you can integrate with BigQuery and Cloud Storage and PubSub and Event Stream, essentially any way that you'd ever dream of pulling data in or pushing data out. There is probably already a connector for it. If there's not a connector for it, there's some really cool capabilities that we'll probably cover in tomorrow's session around webhooks and webhook templates that allow you to essentially build your own integration. But for our example, real quick, just to kind of walk through what it looks like to connect data from a typical marketing tool, we will go here, Create New. Before this meeting, I created a free MailChimp account. So hopefully I remember all of my passwords and stuff, but we'll just do MailChimp. We'll import audiences and activity data. You can also export your audience and do some webhook stuff within MailChimp. The first thing it's going to ask you to do is create an authorization. Authorizations are going to be dependent on the particular downstream tool. Sometimes it's OAuth, sometimes it's a JWT, sometimes it's just an access token; it just depends on what that sort of downstream tool needs. I'll hit Create New. MailChimp, I believe, just uses OAuth. So I will sign in. Hopefully this pre-fills and it logs me in. I will allow it. Perfect. I'll just name it so that it's in the account. I'll continue. And then so now once I have an authorization over here in MailChimp, I don't know how many folks have seen MailChimp or any ESP sort of system. They all fundamentally work the same in that they have some sort of list mechanism. I just ingested sort of like 50 records, 20 records, whatever it is of sample email addresses with first name, last name, just to kind of show what it looks like to connect that information. So this is the account that we actually connected to. If I go back over here to the right screen. So all of our pre-built integrations have a set of configuration options. It's obviously going to be custom for the particular channel. Some are super simple. Some have a lot of different options. In the case of, say, like Salesforce Marketing Cloud, you can choose specifically which fields to pull in and how those fields map to the profile fields. And there's lots of different configuration options. MailChimp is pretty simple in that you just say this is kind of like demo MailChimp import. We don't need a description. You'll choose the list to import. My list was just named Lytics because it's in the lytics.com account. You'll have some options on, do you want to sync subscribes and unsubscribes? Do you want to import just a portion of the data? Maybe you have 100 million records and some of them are old, so you only want to do the last year, whatever, controls like that. You can import activity data as well. This account's not going to have any activity data, but in most ESPs, traditionally, you would have opens and clicks and bounces and that kind of information that you also want to pull in. And then you can have the same sort of mechanism to control how far back you go.
This account is brand new as of like 45 minutes ago, so there's not a whole lot of history there. So I'm not going to do any of those toggles. And then when I hit complete, two things are going to happen. So one very important thing is that the job will automatically kick off in the background. All of our integrations are built to be real time if they can be. So that means that as a user enters the audience, we have that trigger capability. We're going to sync individual profiles as fast as we possibly can. Many of our integrations don't support real time syncs, so they'll do some sort of batch cadence on every hour, every day, whatever it may be. There's a whole bunch of functionality at the end of the day that we can support in real time, and throttling and controls, which I know have come up a lot in kind of the Lytics to Contentstack integration. MailChimp, I believe, is real time. So as users enter the audiences that we ultimately push out, they're going to be pushed to those lists in real time. So the job on the back end is sort of thing one that happens. Thing two that is really important, that our customers love, is it's also going to automatically update your schema based on what we know about MailChimp. So over here, just as an example: all of our pre-built integrations, outside of webhooks and, like, CSV and custom data type integrations, come with a predefined set of fields and mappings. What you see on the screen is our older logic, still very useful. We call it LQL. So it's kind of more of a raw view of how those mappings happen. But you can see, in the case of MailChimp, these are all the fields and mappings that come out of the box. So we map email to the email field. We pull off the domain with the email domain function. We hash it. We pull in the list IDs. Some of the consent stuff that comes from MailChimp, first name, last name. So all of our integrations come pre-mapped, which makes it really, really simple to get data in. In the case that you inevitably have some custom mechanism in these tools, you can update the schema to kind of match accordingly. But for most of our customers, the out of the box sort of integrations do the heavy lifting so that you don't have to think about identity or how the things merge together or which ID to use. We've already done all of that work for every single one of our integrations, which makes connecting things super easy. You'll also see, in this case, now that I've connected MailChimp, there are actually 54 unpublished changes. This is one of the things that we're working on in our UI to make a little bit more seamless and easy. But it'll actually ask me to go through and publish the MailChimp fields. Oops. I think I need to refresh. So you can see all of the fields and mappings and stuff that come from MailChimp that ultimately get added, all of the different mappings. All of this stuff gets added out of the box. Soon we'll bypass having to actually approve the schema changes and will just apply what we call a patch. So this is sort of a temporary thing that we're working through that's a little bit clunky, but it allows you to sort of see all the things that are happening out of the box. I'll just say MailChimp connect and publish. So now all of those fields are also in my account. As the MailChimp data comes in, you'll see those users and profiles. So lots of different mechanisms to kind of pull data in automatically. But most of our pre-built integrations are that easy.
It's just clicking through, doing an auth, filling in some questions, and then we do all the heavy lifting on the back end.
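To picture what "pre-mapped" means for an integration like MailChimp, here is a small, hypothetical Python sketch of the kinds of transforms the lesson lists (lowercased email, email domain, a hash, list IDs, names). The record shape is simplified and the function is an illustration, not the actual pre-built LQL that ships with the connector.

```python
# Illustrative stand-in for a pre-built integration's field mappings.
import hashlib

def map_list_member(member):
    """Map a raw email-list member record onto hypothetical profile fields."""
    email = member["email_address"].strip().lower()
    return {
        "email": email,
        "email_domain": email.split("@", 1)[1],
        "email_hash": hashlib.sha256(email.encode()).hexdigest(),
        "list_ids": [member["list_id"]],
        "first_name": member.get("merge_fields", {}).get("FNAME"),
        "last_name": member.get("merge_fields", {}).get("LNAME"),
        "subscribed": member.get("status") == "subscribed",
    }

# Simplified example record (field names are assumptions for the sketch).
raw = {
    "email_address": "Arya.Stark@Example.com",
    "list_id": "abc123",
    "status": "subscribed",
    "merge_fields": {"FNAME": "Arya", "LNAME": "Stark"},
}
print(map_list_member(raw))
```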

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:22.240
The other popular way to get data into Lytics, which is super nice and magical because all

2
00:00:22.240 --> 00:00:27.120
of the fields and mappings have already been handled by our data team, is using one of

3
00:00:27.120 --> 00:00:28.560
our pre-built integrations.

4
00:00:28.560 --> 00:00:34.680
So out of the box, there are hundreds of different ways to connect with different tools.

5
00:00:34.680 --> 00:00:38.120
And under most of these tools, there's several different jobs.

6
00:00:38.120 --> 00:00:43.680
So for instance, if I search for Google Cloud, as an example, and I go under here, not only

7
00:00:43.680 --> 00:00:46.960
can we integrate with Google Cloud, you can integrate with BigQuery and Cloud Storage

8
00:00:46.960 --> 00:00:52.360
and PubSub and Event Stream, essentially any way that you'd ever dream of pulling data

9
00:00:52.360 --> 00:00:54.440
in or pushing data out.

10
00:00:54.440 --> 00:00:57.780
There is probably already a connector for it.

11
00:00:57.780 --> 00:01:01.380
If there's not a connector for it, there's some really cool capabilities that we'll probably

12
00:01:01.380 --> 00:01:06.040
cover in tomorrow's session around webhooks and webhook templates that allow you to essentially

13
00:01:06.040 --> 00:01:08.480
build your own integration.

14
00:01:08.480 --> 00:01:11.660
But for our example, real quick, just to kind of walk through what it looks like to connect

15
00:01:11.660 --> 00:01:17.220
data from a typical marketing tool, we will go here, Create New.

16
00:01:17.220 --> 00:01:20.940
Before this meeting, I created a free MailChimp account.

17
00:01:20.940 --> 00:01:26.020
So hopefully I remember all of my passwords and stuff, but we'll just do MailChimp.

18
00:01:26.020 --> 00:01:28.300
We'll import audiences and activity data.

19
00:01:28.300 --> 00:01:33.300
You can also export your audience and do some webhook stuff within MailChimp.

20
00:01:33.300 --> 00:01:36.460
First thing that's going to ask you to do is create an authorization.

21
00:01:36.460 --> 00:01:39.940
Authorizations are going to be dependent on the particular downstream tool.

22
00:01:39.940 --> 00:01:43.740
Sometimes it's OAuth, sometimes it's a JWT, sometimes it's just an access token, just

23
00:01:43.740 --> 00:01:47.220
depends on what that sort of downstream tool needs.

24
00:01:47.220 --> 00:01:48.940
I'll hit Create New.

25
00:01:48.940 --> 00:01:51.660
MailChimp, I believe, just uses OAuth.

26
00:01:51.660 --> 00:01:53.300
So I will sign in.

27
00:01:53.620 --> 00:01:55.620
Hopefully this pre-fills and it logs me in.

28
00:02:00.460 --> 00:02:02.660
I will allow it.

29
00:02:02.660 --> 00:02:03.660
Perfect.

30
00:02:07.300 --> 00:02:09.460
I'll just name it so that it's in the account.

31
00:02:09.460 --> 00:02:12.140
I'll continue.

32
00:02:12.140 --> 00:02:17.540
And then so now once I have an authorization over here in MailChimp, I don't know how many

33
00:02:17.540 --> 00:02:20.920
folks have seen MailChimp or any ESP sort of system.

34
00:02:20.920 --> 00:02:24.880
They all fundamentally work the same in that they have some sort of list mechanism.

35
00:02:24.880 --> 00:02:30.720
I just ingested sort of like 50 records, 20 records, whatever it is of sample email addresses

36
00:02:30.720 --> 00:02:35.600
with first name, last name, just to kind of show what it looks like to connect that information.

37
00:02:35.600 --> 00:02:38.120
So this is the account that we actually connected to.

38
00:02:38.120 --> 00:02:44.800
If I go back over here to the right screen.

39
00:02:44.800 --> 00:02:49.240
So all of our pre-built integrations have a set of configuration options.

40
00:02:49.240 --> 00:02:52.120
It's obviously going to be custom for the particular channel.

41
00:02:52.120 --> 00:02:53.120
Some are super simple.

42
00:02:53.120 --> 00:02:54.120
Some have a lot of different options.

43
00:02:54.120 --> 00:02:58.520
In the case of, say, like Salesforce Marketing Cloud, you can choose specifically which fields

44
00:02:58.520 --> 00:03:02.000
to pull in and how those fields map to the profile fields.

45
00:03:02.000 --> 00:03:04.920
And there's lots of different configuration options.

46
00:03:04.920 --> 00:03:11.800
MailChimp is pretty simple in that you just say this is kind of like demo MailChimp import.

47
00:03:11.800 --> 00:03:13.400
We don't need a description.

48
00:03:13.400 --> 00:03:14.980
You'll choose the list to import.

49
00:03:14.980 --> 00:03:19.780
My list was just named Lytics because it's in the lytics.com account.

50
00:03:19.780 --> 00:03:25.440
You'll have some options on, do you want to sync subscribes and unsubscribes?

51
00:03:25.440 --> 00:03:27.900
Do you want to import just a portion of the data?

52
00:03:27.900 --> 00:03:30.540
Maybe you have 100 million records and some of them are old, so you only want to do the

53
00:03:30.540 --> 00:03:34.320
last year, whatever, controls like that.

54
00:03:34.320 --> 00:03:36.260
You can import activity data as well.

55
00:03:36.260 --> 00:03:39.940
This account's not going to have any activity data, but in most ESPs, traditionally, you

56
00:03:39.940 --> 00:03:43.060
would have opens and clicks and bounces and that kind of information that you also want

57
00:03:43.060 --> 00:03:44.900
to pull in.

58
00:03:44.900 --> 00:03:49.420
And then you can have the same sort of mechanism to control how far back you go.

59
00:03:49.420 --> 00:03:53.460
This account is brand new as of like 45 minutes ago, so there's not a whole lot of history

60
00:03:53.460 --> 00:03:54.460
there.

61
00:03:54.460 --> 00:03:55.460
So I'm not going to do any of those toggles.

62
00:03:55.460 --> 00:03:59.360
And then when I hit complete, two things are going to happen.

63
00:03:59.360 --> 00:04:05.900
So one very important thing is that the job will automatically kick off in the background.

64
00:04:05.900 --> 00:04:09.140
All of our integrations are built to be real time if they can be.

65
00:04:09.140 --> 00:04:14.000
So that means that as a user enters the audience, we have that trigger capability.

66
00:04:14.000 --> 00:04:18.380
We're going to sync individual profiles as fast as we possibly can.

67
00:04:18.380 --> 00:04:22.700
Many of our integrations don't support real time syncs, so they'll do some sort of batch

68
00:04:22.700 --> 00:04:26.440
cadence on every hour, every day, whatever it may be.

69
00:04:26.440 --> 00:04:29.740
There's a whole bunch of functionality at the end of the day that we can support in

70
00:04:29.740 --> 00:04:34.420
real time and throttling and controls, which I know have come up a lot in kind of the Lytics

71
00:04:34.420 --> 00:04:37.060
to Contentstack integration.

72
00:04:37.060 --> 00:04:39.220
MailChimp, I believe, is real time.

73
00:04:39.220 --> 00:04:43.620
So as users enter that audiences that we ultimately push out, they're going to be pushed to those

74
00:04:43.640 --> 00:04:45.640
lists in real time.

75
00:04:45.640 --> 00:04:49.400
So the job on the back end is sort of thing one that happens.

76
00:04:49.400 --> 00:04:54.360
Thing two that is really important that our customers love is it's also going to automatically

77
00:04:54.360 --> 00:04:59.240
update your schema based on what we know about MailChimp.

78
00:04:59.240 --> 00:05:08.480
So over here, just as a example, so all of our prebuilt integrations outside of webhooks

79
00:05:08.540 --> 00:05:15.820
and like CSV and like custom data type integrations come with a predefined set of fields and mappings.

80
00:05:15.820 --> 00:05:19.320
What you see on the screen is our older logic, still very useful.

81
00:05:19.320 --> 00:05:20.320
We call it LQL.

82
00:05:20.320 --> 00:05:23.000
So it's kind of more of a raw view of how those mappings happen.

83
00:05:23.000 --> 00:05:24.000
But you can see.

84
00:05:24.000 --> 00:05:27.920
So in the case of MailChimp, these are all the fields and mappings that come out of the

85
00:05:27.920 --> 00:05:28.920
box.

86
00:05:28.920 --> 00:05:31.160
So we map email to the email field.

87
00:05:31.160 --> 00:05:35.160
We pull off the domain with the email domain function.

88
00:05:35.160 --> 00:05:36.160
We hash it.

89
00:05:36.160 --> 00:05:38.200
We pull in the list IDs.

90
00:05:38.200 --> 00:05:40.760
Some of the consent stuff that comes from MailChimp, first name, last name.

91
00:05:40.760 --> 00:05:45.680
So all of our integrations come pre-mapped, which makes it really, really simple to get

92
00:05:45.680 --> 00:05:46.680
data in.

93
00:05:46.680 --> 00:05:52.560
In the case that you're, you inevitably have some custom mechanism in these tools.

94
00:05:52.560 --> 00:05:55.600
You can update the schema to kind of match accordingly.

95
00:05:55.600 --> 00:06:00.360
But for most of our customers, the out of the box sort of integrations do the heavy

96
00:06:00.360 --> 00:06:04.560
lifting and that you don't have to think about identity or how the things merge together

97
00:06:04.560 --> 00:06:05.800
or which ID to use.

98
00:06:05.800 --> 00:06:11.160
We've already done all of that work for every single one of our integrations, which makes

99
00:06:11.160 --> 00:06:14.240
connecting things super easy.

100
00:06:14.240 --> 00:06:16.600
You'll also see in this case.

101
00:06:16.600 --> 00:06:21.160
So now that I've connected MailChimp, there's actually 54 unpublished changes.

102
00:06:21.160 --> 00:06:24.680
This is one of the things that we're working on in our UI to make a little bit more seamless

103
00:06:24.680 --> 00:06:25.680
and easy.

104
00:06:25.680 --> 00:06:29.040
But it'll actually ask me to go through and publish the MailChimp fields.

105
00:06:29.040 --> 00:06:30.040
Oops.

106
00:06:30.040 --> 00:06:35.240
I think I need to refresh.

107
00:06:35.240 --> 00:06:38.960
So you could see all of the fields and mappings and stuff that come from MailChimp that ultimately

108
00:06:38.960 --> 00:06:41.360
get added all of the different mappings.

109
00:06:41.360 --> 00:06:44.320
All of this stuff gets added out of the box.

110
00:06:44.320 --> 00:06:48.840
Soon we'll bypass the having to actually approve the schema changes that will just apply what

111
00:06:48.840 --> 00:06:50.680
we call a patch.

112
00:06:50.680 --> 00:06:53.240
So this is sort of a temporary thing that we're working through that's a little bit

113
00:06:53.240 --> 00:06:56.840
clunky, but it allows you to sort of see all the things that are happening out of the box.

114
00:06:56.840 --> 00:07:03.280
I'll just say MailChimp connect and publish.

115
00:07:03.320 --> 00:07:06.840
So now all of those fields are also in my account.

116
00:07:06.840 --> 00:07:09.600
As the MailChimp data comes in, you'll see those users and profiles.

117
00:07:09.600 --> 00:07:13.680
So lots of different mechanisms to kind of pull data in automatically.

118
00:07:13.680 --> 00:07:15.720
But most of our pre-built integrations are that easy.

119
00:07:15.720 --> 00:07:19.240
It's just clicking through, doing an auth, filling in some questions, and then we do

120
00:07:19.240 --> 00:07:20.600
all the heavy lifting on the back end.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] The other popular way to get data into Lytics, which is super nice and magical because all
[00:22] of the fields and mappings have already been handled by our data team, is using one of
[00:27] our pre-built integrations.
[00:28] So out of the box, there are hundreds of different ways to connect with different tools.
[00:34] And under most of these tools, there's several different jobs.
[00:38] So for instance, if I search for Google Cloud, as an example, and I go under here, not only
[00:43] can we integrate with Google Cloud, you can integrate with BigQuery and Cloud Storage
[00:46] and PubSub and Event Stream, essentially any way that you'd ever dream of pulling data
[00:52] in or pushing data out.
[00:54] There is probably already a connector for it.
[00:57] If there's not a connector for it, there's some really cool capabilities that we'll probably
[01:01] cover in tomorrow's session around webhooks and webhook templates that allow you to essentially
[01:06] build your own integration.
[01:08] But for our example, real quick, just to kind of walk through what it looks like to connect
[01:11] data from a typical marketing tool, we will go here, Create New.
[01:17] Before this meeting, I created a free MailChimp account.
[01:20] So hopefully I remember all of my passwords and stuff, but we'll just do MailChimp.
[01:26] We'll import audiences and activity data.
[01:28] You can also export your audience and do some webhook stuff within MailChimp.
[01:33] First thing that's going to ask you to do is create an authorization.
[01:36] Authorizations are going to be dependent on the particular downstream tool.
[01:39] Sometimes it's OAuth, sometimes it's a JWT, sometimes it's just an access token, just
[01:43] depends on what that sort of downstream tool needs.
[01:47] I'll hit Create New.
[01:48] MailChimp, I believe, just uses OAuth.
[01:51] So I will sign in.
[01:53] Hopefully this pre-fills and it logs me in.
[02:00] I will allow it.
[02:02] Perfect.
[02:07] I'll just name it so that it's in the account.
[02:09] I'll continue.
[02:12] And then so now once I have an authorization over here in MailChimp, I don't know how many
[02:17] folks have seen MailChimp or any ESP sort of system.
[02:20] They all fundamentally work the same in that they have some sort of list mechanism.
[02:24] I just ingested sort of like 50 records, 20 records, whatever it is of sample email addresses
[02:30] with first name, last name, just to kind of show what it looks like to connect that information.
[02:35] So this is the account that we actually connected to.
[02:38] If I go back over here to the right screen.
[02:44] So all of our pre-built integrations have a set of configuration options.
[02:49] It's obviously going to be custom for the particular channel.
[02:52] Some are super simple.
[02:53] Some have a lot of different options.
[02:54] In the case of, say, like Salesforce Marketing Cloud, you can choose specifically which fields
[02:58] to pull in and how those fields map to the profile fields.
[03:02] And there's lots of different configuration options.
[03:04] MailChimp is pretty simple in that you just say this is kind of like demo MailChimp import.
[03:11] We don't need a description.
[03:13] You'll choose the list to import.
[03:14] My list was just named Lytx because it's in the lytx.com account.
[03:19] You'll have some options on, do you want to sync subscribes and unsubscribes?
[03:25] Do you want to import just a portion of the data?
[03:27] Maybe you have 100 million records and some of them are old, so you only want to do the
[03:30] last year, whatever, controls like that.
[03:34] You can import activity data as well.
[03:36] This account's not going to have any activity data, but in most ESPs, traditionally, you
[03:39] would have opens and clicks and bounces and that kind of information that you also want
[03:43] to pull in.
[03:44] And then you can have the same sort of mechanism to control how far back you go.
[03:49] This account is brand new as of like 45 minutes ago, so there's not a whole lot of history
[03:53] there.
```
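
As a quick way to capture the kind of artifact the takeaways below ask for, the sketch that follows posts one MailChimp-style record to a generic JSON collection endpoint so you can see the shape of the data an import job produces. The endpoint URL, stream name, and `CDP_ACCESS_TOKEN` variable are illustrative assumptions (keep real tokens in a local `.env`, per the course conventions); the pre-built integration does all of this mapping for you.

```python
import os
import requests

# Hypothetical sketch only: the endpoint path, stream name, and token variable
# are illustrative assumptions, not the API the pre-built MailChimp job uses.
COLLECT_URL = "https://api.example-cdp.io/collect/json/mailchimp_audience"

record = {
    "email": "jon.snow@example.com",  # the identifier the profile merge keys on
    "first_name": "Jon",
    "last_name": "Snow",
    "status": "subscribed",
}

resp = requests.post(
    COLLECT_URL,
    json=record,
    params={"access_token": os.environ["CDP_ACCESS_TOKEN"]},  # placeholder kept in a local .env
    timeout=10,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```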

#### Key takeaways

- Connect **Working with Integrations** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 09 — Identifier Ranks

<!-- ai_metadata: {"lesson_id":"09","type":"video","duration_seconds":253,"video_url":"https://cdn.jwplayer.com/previews/FFIRINGI","thumbnail_url":"https://cdn.jwplayer.com/v2/media/FFIRINGI/poster.jpg?width=720","topics":["Identifier","Ranks"]} -->

#### Video details

#### At a glance

- **Title:** 17-data-insights-working-with-ranks
- **Duration:** 4m 13s
- **Media link:** https://cdn.jwplayer.com/previews/FFIRINGI
- **Publish date (unix):** 1752877181

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113666 bps
- video/mp4 · 180p · 141064 bps
- video/mp4 · 270p · 157856 bps
- video/mp4 · 360p · 171996 bps
- video/mp4 · 406p · 182307 bps
- video/mp4 · 540p · 217867 bps
- video/mp4 · 720p · 276684 bps
- video/mp4 · 1080p · 443311 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/FFIRINGI-120.vtt`

#### Transcript

I'm coming back real quick, just because I don't want to skip over it. So if I were to go in and create a field, so we'll just do it, example one, like let's say this was, you know, Mark's school ID, and I want to make it a string. And in this case, I want to make it an identity key. I'll let Eric talk through this: there's one setting for all identity keys that's really important and ultimately how we resolve those identities, called ranks. So it's managed over here in the ranks section, I'll back out of here, but you'll see all the different identifiers that are in this particular account. So there's an email, there's a UID, there's a chat user ID, external ID, et cetera, et cetera, in kind of a priority order. So I'll kick it over to Eric for a moment to talk about the importance of ranks and how they actually impact the merging and building of profiles.

Yeah, so if you think about identity, there's lots and lots of identities that represent us as real people. If you ask a marketer, they'll always tell you that, well, no, there's just Eric, like there's just Eric. But if you start to think about it, like I have four or five email addresses, so that's not really a good identifier for me, but it's better than all the cookies that I probably have. I probably have thousands of cookies on the web. And so there's this concept of identifiers being stronger and stronger as you get to certain identifiers, to the point where maybe the strongest is a customer ID. And so at my bank, my account, my social security number, my government ID is probably my strongest identifier. And so without establishing rank on identifiers, if we just had a graph with no ranking structure, we would never have this concept of a canonical ID. And all marketers really want this concept of what we call a canonical ID, which is that they can just call Eric, Eric, and they're always targeting Eric. And if Eric opts out, they know that Eric's opted out and that kind of thing. And they don't want my identifier to churn and change all the time. So the rank statements allow us to say, this identifier has a higher precedence. And what we do is when we generate what we call the Lytics ID, which is this canonical ID that marketers can use as a key in databases and things like that, is we look at these ranks and we always choose the highest ranked identifier out of the list of all the identifiers that we know about a user. So there's the Lytics ID on that profile. And that one is actually derived from, if you look at the identifiers that are on this profile, it's derived from the highest ranking ID. This one's a bad one, because it only has- It's only got a UID. Well, if there was an email on here, it would be derived from emails. So externally, that's the primary use case for the concept of ranks. Internally, we use it a lot to optimize graphs. And that's maybe at the very end, when we get into some more architecture discussions, I'll start to talk about how we maintain graph health and optimize graph lookups. And we have this concept of identifiers aging out and expiring, and all of that is related to ranks as well, but we'll get to that one in the last session.
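
To make the rank behavior described above concrete, here is a minimal sketch of rank-based canonical ID selection, assuming an example rank order and example identifier names (they are not this account's actual configuration):

```python
# Minimal sketch of rank-based canonical ID selection (illustrative only;
# the rank order and identifier names below are examples, not this account's config).
RANKS = ["customer_id", "email", "external_id", "chat_user_id", "uid", "cookie"]  # highest first

def canonical_id(identifiers: dict[str, str]) -> tuple[str, str]:
    """Return (kind, value) of the highest-ranked identifier present on a profile."""
    for kind in RANKS:
        if identifiers.get(kind):
            return kind, identifiers[kind]
    raise ValueError("profile has no known identifiers")

# A profile that only has a UID resolves to the UID...
print(canonical_id({"uid": "a1b2c3"}))  # ('uid', 'a1b2c3')
# ...but once an email is merged onto it, the canonical ID is re-derived from the email.
print(canonical_id({"uid": "a1b2c3", "email": "eric@example.com"}))  # ('email', 'eric@example.com')
```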

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:19.760
I'm coming back real quick, just because I don't want to skip over it.

2
00:00:19.760 --> 00:00:25.440
So if I were to go in and create a field, so we'll just like do it example one, like

3
00:00:25.440 --> 00:00:33.920
let's say this was, you know, Mark's school ID, and I want to make it a string.

4
00:00:33.920 --> 00:00:36.720
And in this case, I want to make it an identity key.

5
00:00:36.720 --> 00:00:41.480
I'll let Eric talk through so there's one setting for all identity keys, that's really

6
00:00:41.480 --> 00:00:45.080
important and ultimately how we resolve those identity called ranks.

7
00:00:45.080 --> 00:00:48.760
So it's managed over here in the ranks section, I'll back out of here, but you'll see all

8
00:00:48.760 --> 00:00:51.360
the different identifiers that are in this particular account.

9
00:00:51.360 --> 00:00:55.600
So there's an email, there's a UID, there's a chat user ID, external ID, et cetera, et

10
00:00:55.600 --> 00:00:57.600
cetera, in kind of a priority order.

11
00:00:57.600 --> 00:01:01.120
So I'll kick it over to Eric for a moment to talk about the importance of ranks and

12
00:01:01.120 --> 00:01:04.920
how they actually impact the merging and building of profiles.

13
00:01:04.920 --> 00:01:11.760
Yeah, so if you think about identity, there's lots and lots of identities that represent

14
00:01:11.760 --> 00:01:15.480
us as real, real, real people.

15
00:01:15.480 --> 00:01:19.760
If you ask a marketer, they'll always tell you that, well, no, there's just Eric, like

16
00:01:19.760 --> 00:01:21.260
there's just Eric.

17
00:01:21.260 --> 00:01:24.980
But if you think about it, if you start to think about it, like I have four or five email

18
00:01:24.980 --> 00:01:30.740
addresses, so that's not really a good identifier for me, but it's better than all the cookies

19
00:01:30.740 --> 00:01:31.740
that I probably have.

20
00:01:31.740 --> 00:01:33.960
I probably have thousands of cookies on the web.

21
00:01:33.960 --> 00:01:39.140
And so there's this concept of ID, identifiers being stronger and stronger as you get to

22
00:01:39.140 --> 00:01:44.180
certain identifiers to the point where maybe the strongest is a customer ID.

23
00:01:44.180 --> 00:01:52.300
And so at my bank, my account, my social security number, my government ID is probably

24
00:01:52.300 --> 00:01:55.940
my strongest identifier.

25
00:01:55.940 --> 00:02:06.900
And so without establishing rank and identifiers, if we just had a graph with no ranking structure,

26
00:02:06.900 --> 00:02:09.340
we would never have this concept of a canonical ID.

27
00:02:09.340 --> 00:02:15.300
And all marketers really want this concept of what we call a canonical ID, which is that

28
00:02:15.300 --> 00:02:18.140
they can just call Eric, Eric, and they're always targeting Eric.

29
00:02:18.140 --> 00:02:24.260
And if Eric opts out, they know that Eric's opted out and that kind of thing.

30
00:02:24.260 --> 00:02:29.020
And they don't want my identifier to churn and change all the time.

31
00:02:29.020 --> 00:02:36.500
So the rank statements allow us to say, this identifier has a higher precedence.

32
00:02:36.500 --> 00:02:42.820
And what we do is when we generate what we call the Lytx ID, which is this canonical

33
00:02:42.820 --> 00:02:49.340
ID that marketers can use as like a key in databases and things like that, is we look

34
00:02:49.340 --> 00:02:55.380
at these ranks and we always choose the highest ranked identifier out of the list of all the

35
00:02:55.380 --> 00:02:57.960
identifiers that we know about a user.

36
00:02:57.960 --> 00:03:03.780
So there's the Lytx ID and that profile.

37
00:03:03.780 --> 00:03:09.900
And that one is actually derived from, if you look at the identifiers that are on this

38
00:03:09.900 --> 00:03:17.340
profile, it's derived from the highest ranking ID.

39
00:03:17.340 --> 00:03:18.580
This one's a bad one, because it only has-

40
00:03:18.580 --> 00:03:19.580
It's only got a UID.

41
00:03:19.580 --> 00:03:25.820
Well, if there's an email on here, if there was an email on here, it would be derived

42
00:03:25.820 --> 00:03:27.860
from emails.

43
00:03:27.860 --> 00:03:33.620
So externally, that's the primary use case for it, for the concept of ranks.

44
00:03:33.620 --> 00:03:38.740
Internally, we use it a lot to optimize graphs.

45
00:03:38.740 --> 00:03:44.660
And that's maybe at the very end, when we get into some more architecture discussions,

46
00:03:44.660 --> 00:03:51.420
I'll start to talk about how we maintain graph health and optimize graph lookups.

47
00:03:51.420 --> 00:03:58.500
And we have this concept of identifiers aging out and expiring, and all of that is related

48
00:03:58.500 --> 00:04:03.820
to ranks as well, but we'll get to that one in the last session.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] I'm coming back real quick, just because I don't want to skip over it.
[00:19] So if I were to go in and create a field, so we'll just like do it example one, like
[00:25] let's say this was, you know, Mark's school ID, and I want to make it a string.
[00:33] And in this case, I want to make it an identity key.
[00:36] I'll let Eric talk through so there's one setting for all identity keys, that's really
[00:41] important and ultimately how we resolve those identity called ranks.
[00:45] So it's managed over here in the ranks section, I'll back out of here, but you'll see all
[00:48] the different identifiers that are in this particular account.
[00:51] So there's an email, there's a UID, there's a chat user ID, external ID, et cetera, et
[00:55] cetera, in kind of a priority order.
[00:57] So I'll kick it over to Eric for a moment to talk about the importance of ranks and
[01:01] how they actually impact the merging and building of profiles.
[01:04] Yeah, so if you think about identity, there's lots and lots of identities that represent
[01:11] us as real, real, real people.
[01:15] If you ask a marketer, they'll always tell you that, well, no, there's just Eric, like
[01:19] there's just Eric.
[01:21] But if you think about it, if you start to think about it, like I have four or five email
[01:24] addresses, so that's not really a good identifier for me, but it's better than all the cookies
[01:30] that I probably have.
[01:31] I probably have thousands of cookies on the web.
[01:33] And so there's this concept of ID, identifiers being stronger and stronger as you get to
[01:39] certain identifiers to the point where maybe the strongest is a customer ID.
[01:44] And so at my bank, my account, my social security number, my government ID is probably
[01:52] my strongest identifier.
[01:55] And so without establishing rank and identifiers, if we just had a graph with no ranking structure,
[02:06] we would never have this concept of a canonical ID.
[02:09] And all marketers really want this concept of what we call a canonical ID, which is that
[02:15] they can just call Eric, Eric, and they're always targeting Eric.
[02:18] And if Eric opts out, they know that Eric's opted out and that kind of thing.
[02:24] And they don't want my identifier to churn and change all the time.
[02:29] So the rank statements allow us to say, this identifier has a higher precedence.
[02:36] And what we do is when we generate what we call the Lytx ID, which is this canonical
[02:42] ID that marketers can use as like a key in databases and things like that, is we look
[02:49] at these ranks and we always choose the highest ranked identifier out of the list of all the
[02:55] identifiers that we know about a user.
[02:57] So there's the Lytx ID and that profile.
[03:03] And that one is actually derived from, if you look at the identifiers that are on this
[03:09] profile, it's derived from the highest ranking ID.
[03:17] This one's a bad one, because it only has-
[03:18] It's only got a UID.
[03:19] Well, if there's an email on here, if there was an email on here, it would be derived
[03:25] from emails.
[03:27] So externally, that's the primary use case for it, for the concept of ranks.
[03:33] Internally, we use it a lot to optimize graphs.
[03:38] And that's maybe at the very end, when we get into some more architecture discussions,
[03:44] I'll start to talk about how we maintain graph health and optimize graph lookups.
[03:51] And we have this concept of identifiers aging out and expiring, and all of that is related
[03:58] to ranks as well, but we'll get to that one in the last session.
```

#### Key takeaways

- Connect **Identifier Ranks** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 10 — Working w/ Warehouse Data

<!-- ai_metadata: {"lesson_id":"10","type":"video","duration_seconds":579,"video_url":"https://cdn.jwplayer.com/previews/u1mD3rGg","thumbnail_url":"https://cdn.jwplayer.com/v2/media/u1mD3rGg/poster.jpg?width=720","topics":["Working","Warehouse","Data"]} -->

#### Video details

#### At a glance

- **Title:** 18-data-insights-warehouses
- **Duration:** 9m 39s
- **Media link:** https://cdn.jwplayer.com/previews/u1mD3rGg
- **Publish date (unix):** 1752878424

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113588 bps
- video/mp4 · 180p · 135498 bps
- video/mp4 · 270p · 148923 bps
- video/mp4 · 360p · 162108 bps
- video/mp4 · 406p · 170215 bps
- video/mp4 · 540p · 200907 bps
- video/mp4 · 720p · 245465 bps
- video/mp4 · 1080p · 382520 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/u1mD3rGg-120.vtt`

#### Transcript

So, we talked about integrations that pull from marketing tools. We talked about all their APIs, our JavaScript tag, which uses our APIs. The thing that we haven't touched on that I think is pretty easy to demo, it certainly can be a bigger conversation to go into the weeds. But the other source of data that is super common to pull in is from your warehouse. So we have integrations in the data pipeline, if you just want to stream the entire table in and not have any filtering from your warehouse, you can do that through the connection just like we did MailChimp. But we have a special tool called Cloud Connect that allows you to actually connect to your warehouse. We support all of the major warehouses, BigQuery, Snowflake, Redshift, et cetera. What it allows you to do is create a connection, which is very similar to what we did with MailChimp, so I won't walk through that. Like it's a BigQuery, we just use a JWT to get authorization into BigQuery. But what it allows you to then do is build data models, which is essentially a SQL query against that particular table or set of tables inside of your warehouse to pull that data in uniquely to the profile. It goes through, and this is why we'll have a bigger conversation on what it actually means to the profile and how it works, but it maps things in a pretty unique way in that it doesn't go through a data stream. It doesn't necessarily have to adhere to the mapping and the sort of those rules. It creates its new own set of fields that also go away if you lose access to the data. So this comes up a lot when customers want to add scores or information to a profile, but they don't want to create a duplicate and copy and stream it in and go through all of the inherent kind of risks there. They just want to plop something on a profile and kind of like override some of those settings. It allows you to go in, in this sample BigQuery instance, it'll actually pull up a SQL editor. I have a very simple query that I wrote. You can just paste it and say, I want to select everybody from the sample data set with email first name, last name, and an average annual revenue from sample customers. You can test the query. It'll actually query that, in this case, BigQuery in real time. And then as you connect it, you can actually then describe how you want to map that to a profile. So we'll just say BQ test. The only thing really you have to choose is the primary key. So from my data set, I want to map the email that I just selected. Again, that's the only kind of context. You still have to tell it how you're going to map this Cloud Connect data, this warehouse data to a profile. So I want to merge it based on the email address. I want to merge that with the email field. And then you can choose optionally if you want to pull in additional information. So I want to add first name, last name, and average annual revenue. With Cloud Connect, because it's less of a real-time thing, more of a query-based thing, you can then choose the cadence of how often you want that to run. For tests and demo, I always do an hour. In reality, you probably don't want to just spam your warehouse instance with these big expensive queries every single hour. So most customers are going to do 24 hours or 48 hours or whatever it may be. But you have that. You can flag if you want to create net new profiles from that data. If you don't check this box, it's only going to map to the profiles that exist and never create net new ones. 
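
As a rough illustration of the pieces just described, the sketch below expresses a Cloud Connect data model as plain data: the SQL query, the one required primary-key mapping, the optional fields to import, the refresh cadence, and the net-new-profile flag. The table, column, and option names are placeholders, not the product's exact configuration schema:

```python
# Illustrative sketch of the Cloud Connect data-model pieces described above.
# Table, column, and option names are placeholders, not the product's schema.
DATA_MODEL_QUERY = """
SELECT email, first_name, last_name, avg_annual_revenue
FROM sample_dataset.sample_customers
"""

data_model = {
    "name": "BQ test",
    "query": DATA_MODEL_QUERY,
    # The one required mapping: which column is the primary key,
    # and which profile identifier it merges on.
    "primary_key": {"column": "email", "merge_with": "email"},
    # Optional extra columns to append to matched profiles.
    "import_fields": ["first_name", "last_name", "avg_annual_revenue"],
    "schedule_hours": 24,         # demos use 1h; production is usually 24-48h
    "create_new_profiles": True,  # unchecked = only enrich profiles that already exist
}
```
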
Because our sample database isn't built of Game of Thrones characters, I'll create new so that it creates those profiles. Then ultimately, you create this data model, and it's going to go through, query that BigQuery database on that cadence that we described. And then ultimately, those profiles will come into the UI. I don't know how long that will take. So I think in our next session, I'll be sure to show you what that data looks like on a profile because it looks a little bit different. All of the segmentation and activation capabilities are exactly the same. But I just wanted to touch quickly to introduce the idea, since that's the final piece of where data can come from. The Cloud Connect product represents a little bit of a different method for getting data into Lytics. All of the other methods that we talked about, the JavaScript tag collection, the APIs, all of our background jobs, all of that kind of stuff uses our streaming pipeline. So it goes into a stream, a stream maps to a field, fields ultimately show up on the profile. Cloud Connect, just to kind of re-cover this part, is quite a bit different, actually, in that it doesn't use our streaming pipeline to actually get data onto the profile. It has a whole different mechanism that we can talk about at length. At some point, Eric can go into details there, but it essentially bypasses that streaming pipeline and injects the results of that query directly onto the profile. It's really useful for a few different reasons, but the one context that it comes up often is around security and control. So think about a situation where customer A wants to share a subset of data with one of their partners, with an agency, with another customer, whatever it may be, but they don't want to just give access to the raw data, something that they can copy and own forever. So all of the warehouses have a different kind of methodology for how you can share and unlock that capability. In BigQuery, you essentially can give access to a specific dataset, and then you don't have to necessarily expose all of the raw data. With how Cloud Connect works, where it doesn't stream that data in, it's not creating a sort of hard copy that gets written to our system, that gets backed up in our files; it's creating kind of a less persistent, temporary store for that data. If customer A wants to unshare that data from their partner, from the other customer, whatever it may be, and they then lift that access, so they prevent somebody from actually being able to query that database, the next time that query runs, it'll actually clean up that system all the way through, so you don't have that kind of legacy data existing in that stream system and all that kind of stuff. So it comes up often when security or data control, data access, is a key part of that conversation. And it's because, like I said, it doesn't actually stream the data to Lytics in the exact same way.
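
The cleanup behavior described above can be sketched in a few lines: because the warehouse is re-queried on a schedule rather than copied into the streaming pipeline, revoking access means the next run comes back empty and the imported fields disappear from profiles. Everything below is an in-memory stand-in, not product code:

```python
# Conceptual sketch only: scheduled re-query instead of a streamed copy, so
# revoking warehouse access removes the imported fields on the next refresh.
profiles = {
    "arya@example.com": {"email": "arya@example.com"},
    "sansa@example.com": {"email": "sansa@example.com"},
}

def run_shared_query(access_granted: bool) -> list[dict]:
    if not access_granted:
        return []  # the partner revoked access to the shared dataset
    return [{"email": "arya@example.com", "avg_annual_revenue": 120000}]

def refresh(access_granted: bool) -> None:
    rows = {r["email"]: r for r in run_shared_query(access_granted)}
    for email, profile in profiles.items():
        if email in rows:
            profile["cc_avg_annual_revenue"] = rows[email]["avg_annual_revenue"]
        else:
            # the field disappears again once the data is no longer queryable
            profile.pop("cc_avg_annual_revenue", None)

refresh(access_granted=True)   # the imported field appears on matching profiles
refresh(access_granted=False)  # the next scheduled run cleans it up
```
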
It just pulls in a set of sample users, first name, last name, email, customer type, and then just an example of a score, for instance, that would maybe be in your warehouse. The thing that we didn't totally cover was how you then get the Cloud Connect data, this warehouse query, into Lytics to store it on a profile so that it functions essentially like one of our normal attributes. There's a publishing process in this. So when I hit next, I think we might have briefly touched on this, if I recall, but we didn't actually complete it. And then we definitely didn't show it on the profile. So with Cloud Connect, you don't have to have everything mapped. You don't have to have all of the attributes configured in the same way that the streaming pipeline does. The only question that you have to answer is essentially how to map this data to a single profile. So you have to essentially pick the key from the data that you're querying from, say BigQuery, and what key you want to be able to write it to, so which identifier inside of Lytics you want to associate that data with. So in this case, in this one that I've already pre-configured, we're basically just saying, in this query there's a bunch of stuff, but all we want to do is find anybody that matches on email. And if they match on email, we're going to append this information to that profile. The thing that is unique about Cloud Connect, if I go to a profile, for instance, that I think I had pulled up, yeah. So this one is one of the records that's in that sample data set. They have a profile, just like any other user, regardless of where that was generated from. But if you scroll down and see the data that came in for that particular data model on their profile, you're going to see a few different things. One, you'll see the raw attributes that we pulled in, first name, last name, but they're going to be independent of the other first name, last name fields that are already in the schema. And then, and this is actually the more useful part of this particular thing, there's this unique membership attribute that now gets added. So back to that example of, say, I'm Nike.com and I'm sharing data with a partner and I want to give them access to everybody that has a high propensity to buy women's running shoes or whatever it may be, but I don't want to give them all of the data that I needed to use to pull this score and build that list. I just want to give them sort of that Boolean yes or no. That's where this membership flag can come in, in that you don't have to have access to all of the information in order to make the calculation. You're just essentially pulling this information in temporarily as long as you have access. And then when you go to build a segment, it functions full scale, just like all of the other attributes in the system. So you can mix and match. There are no limitations there, but the one thing to just be aware of and know is the use case where this particular method for pulling data in from a warehouse is super useful, which is around that sort of access control. I want to be able to pull things away. I don't want that data to persist. I don't want to store it somewhere. Whereas the streaming method also has a warehouse connection, so if I just want to pull everything from a particular table or whatever it may be, you can use our kind of back-end integration.
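
The membership flag can then drive a segment on its own, without exposing the raw columns or the underlying score. A toy sketch with illustrative field names:

```python
# Sketch of the "membership flag" idea: a partner-facing audience can be built
# from the Boolean membership attribute alone. Field names are illustrative.
def high_propensity_audience(profiles: list[dict]) -> list[str]:
    """Return emails of profiles that carry the Cloud Connect membership flag."""
    return [p["email"] for p in profiles if p.get("cc_membership") is True]

sample = [
    {"email": "arya@example.com", "cc_membership": True, "cc_avg_annual_revenue": 120000},
    {"email": "sansa@example.com"},  # not present in the shared query result
]
print(high_propensity_audience(sample))  # ['arya@example.com']
```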

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:19.280
So, we talked about integrations that pull from marketing tools.

2
00:00:19.280 --> 00:00:23.260
We talked about all their APIs, our JavaScript tag, which uses our APIs.

3
00:00:23.260 --> 00:00:26.160
The thing that we haven't touched on that I think is pretty easy to demo, it certainly

4
00:00:26.160 --> 00:00:28.800
can be a bigger conversation to go into the weeds.

5
00:00:28.800 --> 00:00:32.960
But the other source of data that is super common to pull in is from your warehouse.

6
00:00:32.960 --> 00:00:37.840
So we have integrations in the data pipeline, if you just want to stream the entire table

7
00:00:37.840 --> 00:00:41.440
in and not have any filtering from your warehouse, you can do that through the connection just

8
00:00:41.440 --> 00:00:42.920
like we did MailChimp.

9
00:00:42.920 --> 00:00:47.240
But we have a special tool called Cloud Connect that allows you to actually connect to your

10
00:00:47.240 --> 00:00:48.240
warehouse.

11
00:00:48.240 --> 00:00:52.400
We support all of the major warehouses, BigQuery, Snowflake, Redshift, et cetera.

12
00:00:52.400 --> 00:00:56.360
What it allows you to do is create a connection, which is very similar to what we did with

13
00:00:56.360 --> 00:00:57.920
MailChimp, so I won't walk through that.

14
00:00:57.920 --> 00:01:03.040
Like it's a BigQuery, we just use a JWT to get authorization into BigQuery.

15
00:01:03.040 --> 00:01:08.120
But what it allows you to then do is build data models, which is essentially a SQL query

16
00:01:08.120 --> 00:01:14.240
against that particular table or set of tables inside of your warehouse to pull that data

17
00:01:14.240 --> 00:01:16.360
in uniquely to the profile.

18
00:01:16.360 --> 00:01:19.920
It goes through, and this is why we'll have a bigger conversation on what it actually

19
00:01:19.920 --> 00:01:24.600
means to the profile and how it works, but it maps things in a pretty unique way in that

20
00:01:24.600 --> 00:01:26.640
it doesn't go through a data stream.

21
00:01:26.640 --> 00:01:31.480
It doesn't necessarily have to adhere to the mapping and the sort of those rules.

22
00:01:31.480 --> 00:01:37.360
It creates its new own set of fields that also go away if you lose access to the data.

23
00:01:37.360 --> 00:01:42.160
So this comes up a lot when customers want to add scores or information to a profile,

24
00:01:42.160 --> 00:01:45.280
but they don't want to create a duplicate and copy and stream it in and go through all

25
00:01:45.280 --> 00:01:48.060
of the inherent kind of risks there.

26
00:01:48.060 --> 00:01:51.820
They just want to plop something on a profile and kind of like override some of those settings.

27
00:01:51.820 --> 00:01:57.580
It allows you to go in, in this sample BigQuery instance, it'll actually pull up a SQL editor.

28
00:01:57.580 --> 00:02:01.220
I have a very simple query that I wrote.

29
00:02:01.220 --> 00:02:04.980
You can just paste it and say, I want to select everybody from the sample data set with email

30
00:02:04.980 --> 00:02:08.880
first name, last name, and an average annual revenue from sample customers.

31
00:02:08.880 --> 00:02:10.300
You can test the query.

32
00:02:10.300 --> 00:02:14.860
It'll actually query that, in this case, BigQuery in real time.

33
00:02:14.860 --> 00:02:18.940
And then as you connect it, you can actually then describe how you want to map that to

34
00:02:18.940 --> 00:02:19.940
a profile.

35
00:02:20.060 --> 00:02:23.900
So we'll just say BQ test.

36
00:02:23.900 --> 00:02:27.120
The only thing really you have to choose is the primary key.

37
00:02:27.120 --> 00:02:30.660
So from my data set, I want to map the email that I just selected.

38
00:02:30.660 --> 00:02:32.720
Again, that's the only kind of context.

39
00:02:32.720 --> 00:02:36.380
You still have to tell it how you're going to map this Cloud Connect data, this warehouse

40
00:02:36.380 --> 00:02:38.980
data to a profile.

41
00:02:38.980 --> 00:02:42.220
So I want to merge it based on the email address.

42
00:02:42.220 --> 00:02:46.340
I want to merge that with the email field.

43
00:02:46.340 --> 00:02:49.980
And then you can choose optionally if you want to pull in additional information.

44
00:02:49.980 --> 00:02:53.700
So I want to add first name, last name, and average annual revenue.

45
00:02:53.700 --> 00:02:58.660
With Cloud Connect, because it's less of a real-time thing, more of a query-based thing,

46
00:02:58.660 --> 00:03:02.320
you can then choose the cadence of how often you want that to run.

47
00:03:02.320 --> 00:03:04.260
For tests and demo, I always do an hour.

48
00:03:04.260 --> 00:03:09.340
In reality, you probably don't want to just spam your warehouse instance with these big

49
00:03:09.340 --> 00:03:11.220
expensive queries every single hour.

50
00:03:11.220 --> 00:03:14.980
So most customers are going to do 24 hours or 48 hours or whatever it may be.

51
00:03:15.540 --> 00:03:16.940
But you have that.

52
00:03:16.940 --> 00:03:20.740
You can flag if you want to create net new profiles from that data.

53
00:03:20.740 --> 00:03:24.540
If you don't check this box, it's only going to map to the profiles that exist and never

54
00:03:24.540 --> 00:03:26.220
create net new ones.

55
00:03:26.220 --> 00:03:30.140
Because our sample database isn't built of Game of Thrones characters, I'll create new

56
00:03:30.140 --> 00:03:32.260
so that it creates those profiles.

57
00:03:32.260 --> 00:03:35.940
Then ultimately, you create this data model, and it's going to go through, query that BigQuery

58
00:03:35.940 --> 00:03:39.380
database on that cadence that we described.

59
00:03:39.380 --> 00:03:43.060
And then ultimately, those profiles will come in to the UI.

60
00:03:43.460 --> 00:03:46.460
I don't know how long that will take.

61
00:03:46.460 --> 00:03:50.500
So I think in our next session, I'll be sure to show you what that data looks like on a

62
00:03:50.500 --> 00:03:52.620
profile because it looks a little bit different.

63
00:03:52.620 --> 00:03:57.100
All of the segmentation and activation capabilities are exactly the same.

64
00:03:57.100 --> 00:04:01.820
But I just wanted to touch quickly just to introduce the idea that's the final piece

65
00:04:01.820 --> 00:04:04.260
of where data can come from.

66
00:04:04.260 --> 00:04:08.820
The Cloud Connect product represents a little bit of a different method for getting data

67
00:04:08.820 --> 00:04:10.580
into Lytics.

68
00:04:10.580 --> 00:04:15.060
All of the other methods that we talked about, the JavaScript tag collection, the APIs, all

69
00:04:15.060 --> 00:04:19.340
of our background jobs, all of that kind of stuff uses our streaming pipeline.

70
00:04:19.340 --> 00:04:23.660
So it goes into a stream, a stream maps to a field, fields ultimately show up on the

71
00:04:23.660 --> 00:04:25.220
profile.

72
00:04:25.220 --> 00:04:29.980
Cloud Connect, just to kind of re-cover this part, is quite a bit different, actually,

73
00:04:29.980 --> 00:04:34.220
in that it doesn't use our streaming pipeline to actually get data onto the profile.

74
00:04:34.220 --> 00:04:38.820
It has a whole different mechanism that we can talk at length at.

75
00:04:38.820 --> 00:04:42.700
At some point, Eric can go into details there, but it essentially bypasses that streaming

76
00:04:42.700 --> 00:04:47.980
pipeline and injects the results of that query directly onto the profile.

77
00:04:47.980 --> 00:04:52.420
It's really useful for a few different reasons, but the one context that it comes up often

78
00:04:52.420 --> 00:04:54.620
is around sort of security and control.

79
00:04:54.620 --> 00:05:00.980
So think about like a situation where customer A wants to share a subset of data with one

80
00:05:00.980 --> 00:05:04.740
of their partners, with an agency, with another customer, whatever it may be, but they don't

81
00:05:04.740 --> 00:05:08.060
want to just give access to the raw data, something that they can like copy and own

82
00:05:08.060 --> 00:05:09.060
forever.

83
00:05:09.060 --> 00:05:13.900
So all of the warehouses have a different kind of methodology for how you can share

84
00:05:13.900 --> 00:05:16.060
and unlock that capability.

85
00:05:16.060 --> 00:05:21.100
In BigQuery, you essentially can give access to a specific dataset, and then you don't

86
00:05:21.100 --> 00:05:24.100
have to necessarily expose all of the raw data.

87
00:05:24.100 --> 00:05:28.780
With how Cloud Connect works, where it doesn't stream that data in, it's not creating a sort

88
00:05:28.780 --> 00:05:33.020
of like hard copy that gets written to our system, that gets backed up in our files,

89
00:05:33.020 --> 00:05:38.060
and it's creating kind of a less persistent temporary store for that data.

90
00:05:38.060 --> 00:05:43.740
If customer A wants to unshare essentially that data from their partner, from the other

91
00:05:43.740 --> 00:05:47.820
customer, whatever it may be, and they then lift that access, so they prevent somebody

92
00:05:47.820 --> 00:05:52.060
from actually being able to query that database, the next time that query runs, it'll actually

93
00:05:52.060 --> 00:05:56.140
clean up that system all the way through, so you don't have that kind of like legacy

94
00:05:56.140 --> 00:05:59.340
data that's in that stream system and all that kind of stuff existing.

95
00:05:59.340 --> 00:06:05.740
So it comes up often when security or data control, data access, is a key part of that

96
00:06:05.740 --> 00:06:06.740
conversation.

97
00:06:06.740 --> 00:06:11.020
And it's because, like I said, it doesn't actually stream the data to Lytx in the exact

98
00:06:11.020 --> 00:06:12.020
same way.

99
00:06:12.020 --> 00:06:16.460
So to just quickly kind of recap on Cloud Connect, it's under data pipeline, the same

100
00:06:16.460 --> 00:06:21.740
place that all of our jobs and all the other sort of collection profile sort of building

101
00:06:21.740 --> 00:06:22.740
is.

102
00:06:22.740 --> 00:06:25.220
Within Cloud Connect, you have the idea of connections, which is just that connection

103
00:06:25.220 --> 00:06:26.220
to the database.

104
00:06:26.220 --> 00:06:27.220
We won't rehash there.

105
00:06:27.220 --> 00:06:30.500
You have the data model, which is essentially the query that's going to run.

106
00:06:30.500 --> 00:06:34.980
In our case, in the last conversation, we built this sample query.

107
00:06:34.980 --> 00:06:39.020
It just pulls in a set of sample users, first name, last name, email, customer type, and

108
00:06:39.020 --> 00:06:43.300
then just an example of a score, for instance, that would be maybe in your warehouse.

109
00:06:43.300 --> 00:06:47.820
The thing that we didn't totally cover was how you then get the Cloud Connect data, this

110
00:06:47.820 --> 00:06:53.500
warehouse query into Lytx to store it on a profile so that it functions essentially fundamentally

111
00:06:53.500 --> 00:06:56.460
like one of our normal attributes.

112
00:06:56.460 --> 00:06:58.140
There's a publishing process in this.

113
00:06:58.140 --> 00:07:01.420
So when I hit next, but I think we might have briefly touched on this, if I recall, but

114
00:07:01.420 --> 00:07:02.420
we didn't actually complete it.

115
00:07:02.420 --> 00:07:04.740
And then we definitely didn't show it on the profile.

116
00:07:04.740 --> 00:07:08.860
So with Cloud Connect, you don't have to have everything mapped.

117
00:07:08.860 --> 00:07:12.440
You don't have to have all of the attributes configured in the same way that the streaming

118
00:07:12.440 --> 00:07:13.780
pipeline is.

119
00:07:13.780 --> 00:07:19.380
The only question that you have to answer is essentially how to map this data to a single

120
00:07:19.380 --> 00:07:20.380
profile.

121
00:07:20.380 --> 00:07:25.220
So you have to essentially pick the key from your data that you're querying from, say BigQuery,

122
00:07:25.220 --> 00:07:28.580
and what key you want to be able to write it to, so which identifier inside of Lytx

123
00:07:28.580 --> 00:07:30.940
you want to associate that data with.

124
00:07:30.940 --> 00:07:34.540
So in this case, in this one that I've already pre-configured, we're basically just saying

125
00:07:34.540 --> 00:07:37.180
in this query, there's a bunch of stuff, but all we want to do is we want to find anybody

126
00:07:37.180 --> 00:07:38.940
that matches on email.

127
00:07:38.940 --> 00:07:43.420
And if they match on email, we're going to append this information to that profile.

128
00:07:43.420 --> 00:07:49.140
The thing that is unique about Cloud Connect, if I go to a profile, for instance, that I

129
00:07:49.140 --> 00:07:52.100
think I had pulled up, yeah.

130
00:07:52.100 --> 00:07:55.660
So this one is one of the records that's in that sample data set.

131
00:07:55.660 --> 00:08:00.820
They have a profile, just like any other user, regardless of where that was generated from.

132
00:08:00.820 --> 00:08:05.180
But if you scroll down and see the data that came in for that particular data model on

133
00:08:05.180 --> 00:08:07.940
their profile, you're going to see a few different things.

134
00:08:07.940 --> 00:08:11.580
One, you'll see the raw attribute that we pulled in, first name, last name, but it's

135
00:08:11.580 --> 00:08:16.740
going to be independent of the other first name, last name fields that are already in

136
00:08:16.740 --> 00:08:18.080
the schema.

137
00:08:18.080 --> 00:08:21.420
And then you see this unique, and this is actually the more useful part of this particular

138
00:08:21.420 --> 00:08:25.580
thing, is there's this unique membership attribute that now gets added.

139
00:08:25.580 --> 00:08:30.940
So back to that example of like, I'm Nike.com and I'm sharing data with a partner and I

140
00:08:30.940 --> 00:08:35.300
want to give them access to everybody that has a high propensity to buy women's running

141
00:08:35.300 --> 00:08:39.260
shoes or whatever it may be, but I don't want to give them all of the data that I needed

142
00:08:39.260 --> 00:08:41.220
to use to pull this score to build that list.

143
00:08:41.220 --> 00:08:44.300
I just want to give them sort of that Boolean yes, no.

144
00:08:44.300 --> 00:08:47.540
That's where this membership flag can come in and that you don't have to have access

145
00:08:47.540 --> 00:08:50.540
to all of the information in order to make the calculation.

146
00:08:50.660 --> 00:08:55.140
You're just essentially pulling this information in temporarily as long as you have access.

147
00:08:55.140 --> 00:08:59.540
And then when you go to build a segment, it functions full scale, just like all of the

148
00:08:59.540 --> 00:09:00.940
other attributes in the system.

149
00:09:00.940 --> 00:09:02.660
So you can mix and match.

150
00:09:02.660 --> 00:09:06.580
There's no sort of limitations there, but like the one thing to just kind of be aware

151
00:09:06.580 --> 00:09:11.060
of and know is the use case of where this particular method for pulling data in from

152
00:09:11.060 --> 00:09:15.980
a warehouse is super useful, is around that sort of access control.

153
00:09:15.980 --> 00:09:17.420
I want to be able to pull things away.

154
00:09:17.420 --> 00:09:18.420
I don't want that data to persist.

155
00:09:18.620 --> 00:09:20.340
I don't want to store it somewhere.

156
00:09:20.340 --> 00:09:23.780
Whereas the streaming method, which also has a warehouse connection, or if I just want

157
00:09:23.780 --> 00:09:28.380
to pull everything from a particular table or whatever it may be, you can use our kind

158
00:09:28.380 --> 00:09:30.460
of back end integration.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] So, we talked about integrations that pull from marketing tools.
[00:19] We talked about all their APIs, our JavaScript tag, which uses our APIs.
[00:23] The thing that we haven't touched on that I think is pretty easy to demo, it certainly
[00:26] can be a bigger conversation to go into the weeds.
[00:28] But the other source of data that is super common to pull in is from your warehouse.
[00:32] So we have integrations in the data pipeline, if you just want to stream the entire table
[00:37] in and not have any filtering from your warehouse, you can do that through the connection just
[00:41] like we did MailChimp.
[00:42] But we have a special tool called Cloud Connect that allows you to actually connect to your
[00:47] warehouse.
[00:48] We support all of the major warehouses, BigQuery, Snowflake, Redshift, et cetera.
[00:52] What it allows you to do is create a connection, which is very similar to what we did with
[00:56] MailChimp, so I won't walk through that.
[00:57] Like it's a BigQuery, we just use a JWT to get authorization into BigQuery.
[01:03] But what it allows you to then do is build data models, which is essentially a SQL query
[01:08] against that particular table or set of tables inside of your warehouse to pull that data
[01:14] in uniquely to the profile.
[01:16] It goes through, and this is why we'll have a bigger conversation on what it actually
[01:19] means to the profile and how it works, but it maps things in a pretty unique way in that
[01:24] it doesn't go through a data stream.
[01:26] It doesn't necessarily have to adhere to the mapping and the sort of those rules.
[01:31] It creates its new own set of fields that also go away if you lose access to the data.
[01:37] So this comes up a lot when customers want to add scores or information to a profile,
[01:42] but they don't want to create a duplicate and copy and stream it in and go through all
[01:45] of the inherent kind of risks there.
[01:48] They just want to plop something on a profile and kind of like override some of those settings.
[01:51] It allows you to go in, in this sample BigQuery instance, it'll actually pull up a SQL editor.
[01:57] I have a very simple query that I wrote.
[02:01] You can just paste it and say, I want to select everybody from the sample data set with email
[02:04] first name, last name, and an average annual revenue from sample customers.
[02:08] You can test the query.
[02:10] It'll actually query that, in this case, BigQuery in real time.
[02:14] And then as you connect it, you can actually then describe how you want to map that to
[02:18] a profile.
[02:20] So we'll just say BQ test.
[02:23] The only thing really you have to choose is the primary key.
[02:27] So from my data set, I want to map the email that I just selected.
[02:30] Again, that's the only kind of context.
[02:32] You still have to tell it how you're going to map this Cloud Connect data, this warehouse
[02:36] data to a profile.
[02:38] So I want to merge it based on the email address.
[02:42] I want to merge that with the email field.
[02:46] And then you can choose optionally if you want to pull in additional information.
[02:49] So I want to add first name, last name, and average annual revenue.
[02:53] With Cloud Connect, because it's less of a real-time thing, more of a query-based thing,
[02:58] you can then choose the cadence of how often you want that to run.
[03:02] For tests and demo, I always do an hour.
[03:04] In reality, you probably don't want to just spam your warehouse instance with these big
[03:09] expensive queries every single hour.
[03:11] So most customers are going to do 24 hours or 48 hours or whatever it may be.
[03:15] But you have that.
[03:16] You can flag if you want to create net new profiles from that data.
[03:20] If you don't check this box, it's only going to map to the profiles that exist and never
[03:24] create net new ones.
[03:26] Because our sample database isn't built of Game of Thrones characters, I'll create new
[03:30] so that it creates those profiles.
[03:32] Then ultimately, you create this data model, and it's going to go through, query that BigQuery
[03:35] database on that cadence that we described.
[03:39] And then ultimately, those profiles will come in to the UI.
[03:43] I don't know how long that will take.
```

#### Key takeaways

- Connect **Working w/ Warehouse Data** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 11 — Building Lookalike Models

<!-- ai_metadata: {"lesson_id":"11","type":"video","duration_seconds":224,"video_url":"https://cdn.jwplayer.com/previews/qZ02gNuc","thumbnail_url":"https://cdn.jwplayer.com/v2/media/qZ02gNuc/poster.jpg?width=720","topics":["Building","Lookalike","Models"]} -->

#### Video details

#### At a glance

- **Title:** 19-data-insights-lookalike-models
- **Duration:** 3m 44s
- **Media link:** https://cdn.jwplayer.com/previews/qZ02gNuc
- **Publish date (unix):** 1752878955

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113880 bps
- video/mp4 · 180p · 132922 bps
- video/mp4 · 270p · 144948 bps
- video/mp4 · 360p · 153786 bps
- video/mp4 · 406p · 159929 bps
- video/mp4 · 540p · 180737 bps
- video/mp4 · 720p · 211742 bps
- video/mp4 · 1080p · 295901 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/qZ02gNuc-120.vtt`

#### Transcript

Lytics has a pretty powerful lookalike modeling feature which allows you essentially to go into the UI; it makes it super easy for marketers and less technical folks. If you have a data science team, they're probably already doing scoring, it probably lives in your warehouse, and you probably don't want to just bank on lookalike models. But for somebody that wants to fill gaps in some of their data or get a better understanding easily in a few clicks, here's what a lookalike model allows you to do, and I think one of the best use cases: say I am a brand and I use 6sense. For those that don't know, 6sense allows you to understand where traffic, like what business traffic, comes from. When I go to a website, they can, for a portion of my audience, associate it with, okay, Mark works at Contentstack, Contentstack has X number of employees and here's their annual revenue and all this information. The downside to tools like 6sense is that they only effectively analyze 10, 15-ish percent of traffic. So for that 10% of the users, you have a really good understanding of, is this a highly qualified lead, right? So even Contentstack as a company, when people go to the website, you want to understand, do they work for and represent a company that's a high-value company that looks like a good Contentstack customer? You can do that with a tool like 6sense for about 10% of the audience, but you lose the other 90%, and you don't really know where they're coming from or what they could do. So what a lookalike model allows you to do is take a target audience, say the 10% of the users that you know, where you've built an audience of, okay, here's my highly qualified users, here's the people that work for a company that's big enough and has the right sort of focus or whatever it may be. And then I want to compare that to the other 90% of the audience and see where their behaviors overlap. So I don't necessarily know specifically where they work or what companies, but I can understand that this 90% of the audience behaves like, acts like, looks at and clicks on the same things that that 10% of highly qualified leads looks at. Effectively, if you have a portion of your audience that you don't understand and a portion of your audience that you do understand, you can go in here, you can create a new lookalike model, you can choose your source audience. So if I go to one that's already configured, for instance, this one that we just did, it is going to essentially compare the audience with no 6sense data to the audience that already has 6sense data. And then from there, it's going to essentially put a score on how much they look alike based on their behavior, based on their interest scores and their behavioral scores and all that information that we're ultimately putting on these profiles, to tell you how much or how little they actually look like a particular set of users. That data, too, is represented by scores on the profile. So we can come back and dig deeper into lookalike models, but there are particular marketing use cases that are super useful. The net result, though, is ultimately a score on the profile. So you'll be able to see that Mark Hayden has a score for this particular model, and it's a 76 out of 100 or a 10 out of 100. So then you can go build an audience of people that look more, rather than less, like highly qualified leads.
A lot of the marketing tools also have similar functionality; when you get to the ad tech and whatnot, they'll have their own lookalike modeling as well. But we find this is really useful when you're trying to explore your data. So that's it for this week. I hope you found this video helpful. If you did, please like and subscribe. And if you want to see more videos like this, please subscribe to our YouTube channel. And if you have any questions about the product, or any other topics that you'd like to see covered in this video, please leave them in the comments below. And I'll see you in the next one. Bye.
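
As a rough mental model of what that 0-100 score represents (and explicitly not the product's actual algorithm), you can think of each unknown visitor being compared against the behavioral fingerprint of the seed audience:

```python
# Toy stand-in for the lookalike idea: score profiles you don't understand by how
# closely their behavioral/interest features resemble a seed audience you do
# understand (here, cosine similarity to the seed centroid).
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Feature vectors, e.g. interest/behavior scores already sitting on each profile.
seed_audience = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2]]  # known highly qualified visitors
centroid = [sum(col) / len(seed_audience) for col in zip(*seed_audience)]

unknown_visitors = {"visitor_a": [0.85, 0.75, 0.15], "visitor_b": [0.1, 0.2, 0.9]}
for visitor, features in unknown_visitors.items():
    score = round(cosine(features, centroid) * 100)  # 0-100, like the score on the profile
    print(visitor, score)  # visitor_a scores high, visitor_b scores much lower
```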

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:22.000
Lytics has a pretty powerful lookalike modeling feature which allows you essentially to go

2
00:00:22.000 --> 00:00:26.520
into the UI, it makes it super easy for marketers and less technical folks.

3
00:00:26.520 --> 00:00:30.760
If you have a data science team, they're probably already doing scoring, it probably lives in

4
00:00:30.760 --> 00:00:33.720
your warehouse, you probably don't want to just bank on lookalike models.

5
00:00:33.720 --> 00:00:37.960
But for somebody that wants to kind of like fill gaps in some of their data or get a better

6
00:00:37.960 --> 00:00:42.000
understanding easily in a few clicks, what lookalike model allows you to do, and I think

7
00:00:42.000 --> 00:00:48.040
one of the best use cases is like, if I am a brand and I use Sixth Sense, for those that

8
00:00:48.040 --> 00:00:52.360
don't know, Sixth Sense allows you to understand where traffic, like what business traffic

9
00:00:52.360 --> 00:00:53.360
comes from.

10
00:00:53.600 --> 00:00:57.560
When I go to a website, they can, for a portion of my audience associated with, okay, Mark

11
00:00:57.560 --> 00:01:01.280
works at ContentStack, ContentStack has X number of employees and here's their annual

12
00:01:01.280 --> 00:01:03.560
revenue and all this information.

13
00:01:03.560 --> 00:01:10.720
The downside to tools like Sixth Sense is that they only effectively analyze 10, 15-ish

14
00:01:10.720 --> 00:01:12.720
percent of traffic.

15
00:01:12.720 --> 00:01:16.840
So for that 10% of the users, you have a really good understanding of, is this a highly qualified

16
00:01:16.840 --> 00:01:17.840
lead, right?

17
00:01:17.840 --> 00:01:23.080
So even like ContentStack as a company, when people go to the website, do you want to understand,

18
00:01:23.080 --> 00:01:26.240
do they work for and represent a company that's a high value company that looks like

19
00:01:26.240 --> 00:01:28.160
a good ContentStack customer?

20
00:01:28.160 --> 00:01:32.760
You can do that with a tool like Sixth Sense for about 10% of the audience, but you lose

21
00:01:32.760 --> 00:01:36.760
the other 90% and that you don't really know where they're coming from or what they could

22
00:01:36.760 --> 00:01:37.760
do.

23
00:01:37.760 --> 00:01:42.520
So what lookalike model allows you to do is take a target audience, say the 10% of the

24
00:01:42.520 --> 00:01:46.640
users that you know that you've built an audience that, okay, here's my highly qualified users.

25
00:01:46.640 --> 00:01:49.560
Here's the people that work for a company that's big enough that has the right sort

26
00:01:49.560 --> 00:01:52.040
of focuses or whatever it may be.

27
00:01:52.040 --> 00:01:56.560
And then I want to compare that to the other 90% of the audience and see where their behaviors

28
00:01:56.560 --> 00:01:57.560
overlap.

29
00:01:57.560 --> 00:02:03.760
So I don't necessarily know specifically where they work or what companies, but I can understand

30
00:02:03.760 --> 00:02:07.840
that this 90% of the audience, whether they behave like, they act like, they're looking

31
00:02:07.840 --> 00:02:12.400
at the things, they're clicking on the same things, that sort of like 10% of highly qualified

32
00:02:12.400 --> 00:02:14.040
leads looks at.

33
00:02:14.040 --> 00:02:19.440
Effectively, if you have a portion of your audience that you don't understand and a portion

34
00:02:19.440 --> 00:02:23.200
of your audience that you do understand, you can go in here, you can create a new lookalike

35
00:02:23.200 --> 00:02:26.040
model, you can choose your source audience.

36
00:02:26.040 --> 00:02:30.800
So if I go to one that's already configured, for instance, this one that we just did, it

37
00:02:30.800 --> 00:02:36.280
is going to essentially compare the no Sixth Sense data to the all audience that it already

38
00:02:36.280 --> 00:02:37.840
has Sixth Sense data.

39
00:02:37.840 --> 00:02:42.160
And then from there, it's going to essentially put a score on how much they look like based

40
00:02:42.160 --> 00:02:45.640
on their behavior, based on their interest scores and their behavioral scores and all

41
00:02:45.640 --> 00:02:50.720
that information that we're ultimately putting on these profiles to tell you how much or

42
00:02:50.720 --> 00:02:55.560
how little they actually look like a particular set of users.

43
00:02:55.560 --> 00:02:59.600
That data too is represented by scores on the profile.

44
00:02:59.600 --> 00:03:04.640
So like, so we can come back and kind of like dig deeper into lookalike models.

45
00:03:04.640 --> 00:03:07.720
But there are particular like marketing use cases that are super useful.

46
00:03:07.720 --> 00:03:10.480
The net result, though, is ultimately a score on the profile.

47
00:03:10.640 --> 00:03:15.840
So you'll be able to see that Mark Hayden has a score for this particular model, and

48
00:03:15.840 --> 00:03:18.320
it's a 76 out of 100 or a 10 out of 100.

49
00:03:18.320 --> 00:03:22.840
So then you can go build an audience of people that look more like highly qualified leads

50
00:03:22.840 --> 00:03:25.120
than less than highly qualified leads.

51
00:03:25.120 --> 00:03:28.440
A lot of the marketing tools also have similar functionality, like when you get to the ad

52
00:03:28.440 --> 00:03:32.480
tech and whatnot, they'll have their own lookalike modeling and whatnot.

53
00:03:32.480 --> 00:03:35.560
But we find this is really useful when you're trying to explore your data.

54
00:03:40.840 --> 00:03:42.120
So that's it for this week.

55
00:03:42.120 --> 00:03:43.720
I hope you found this video helpful.

56
00:03:43.720 --> 00:03:45.520
If you did, please like and subscribe.

57
00:03:45.520 --> 00:03:48.840
And if you want to see more videos like this, please subscribe to our YouTube channel.

58
00:03:48.840 --> 00:03:52.760
And if you have any questions, or you have any questions about the product, or any other

59
00:03:52.760 --> 00:03:56.480
topics that you'd like to see covered in this video, please leave them in the comments

60
00:03:56.480 --> 00:03:57.480
below.

61
00:03:57.480 --> 00:03:58.480
And I'll see you in the next one.

62
00:03:58.480 --> 00:03:59.480
Bye.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] Lytics has a pretty powerful lookalike modeling feature which allows you essentially to go
[00:22] into the UI, it makes it super easy for marketers and less technical folks.
[00:26] If you have a data science team, they're probably already doing scoring, it probably lives in
[00:30] your warehouse, you probably don't want to just bank on lookalike models.
[00:33] But for somebody that wants to kind of like fill gaps in some of their data or get a better
[00:37] understanding easily in a few clicks, what lookalike model allows you to do, and I think
[00:42] one of the best use cases is like, if I am a brand and I use Sixth Sense, for those that
[00:48] don't know, Sixth Sense allows you to understand where traffic, like what business traffic
[00:52] comes from.
[00:53] When I go to a website, they can, for a portion of my audience associated with, okay, Mark
[00:57] works at ContentStack, ContentStack has X number of employees and here's their annual
[01:01] revenue and all this information.
[01:03] The downside to tools like Sixth Sense is that they only effectively analyze 10, 15-ish
[01:10] percent of traffic.
[01:12] So for that 10% of the users, you have a really good understanding of, is this a highly qualified
[01:16] lead, right?
[01:17] So even like ContentStack as a company, when people go to the website, do you want to understand,
[01:23] do they work for and represent a company that's a high value company that looks like
[01:26] a good ContentStack customer?
[01:28] You can do that with a tool like Sixth Sense for about 10% of the audience, but you lose
[01:32] the other 90% and that you don't really know where they're coming from or what they could
[01:36] do.
[01:37] So what lookalike model allows you to do is take a target audience, say the 10% of the
[01:42] users that you know that you've built an audience that, okay, here's my highly qualified users.
[01:46] Here's the people that work for a company that's big enough that has the right sort
[01:49] of focuses or whatever it may be.
[01:52] And then I want to compare that to the other 90% of the audience and see where their behaviors
[01:56] overlap.
[01:57] So I don't necessarily know specifically where they work or what companies, but I can understand
[02:03] that this 90% of the audience, whether they behave like, they act like, they're looking
[02:07] at the things, they're clicking on the same things, that sort of like 10% of highly qualified
[02:12] leads looks at.
[02:14] Effectively, if you have a portion of your audience that you don't understand and a portion
[02:19] of your audience that you do understand, you can go in here, you can create a new lookalike
[02:23] model, you can choose your source audience.
[02:26] So if I go to one that's already configured, for instance, this one that we just did, it
[02:30] is going to essentially compare the no Sixth Sense data to the all audience that it already
[02:36] has Sixth Sense data.
[02:37] And then from there, it's going to essentially put a score on how much they look like based
[02:42] on their behavior, based on their interest scores and their behavioral scores and all
[02:45] that information that we're ultimately putting on these profiles to tell you how much or
[02:50] how little they actually look like a particular set of users.
[02:55] That data too is represented by scores on the profile.
[02:59] So like, so we can come back and kind of like dig deeper into lookalike models.
[03:04] But there are particular like marketing use cases that are super useful.
[03:07] The net result, though, is ultimately a score on the profile.
[03:10] So you'll be able to see that Mark Hayden has a score for this particular model, and
[03:15] it's a 76 out of 100 or a 10 out of 100.
[03:18] So then you can go build an audience of people that look more like highly qualified leads
[03:22] than less than highly qualified leads.
[03:25] A lot of the marketing tools also have similar functionality, like when you get to the ad
[03:28] tech and whatnot, they'll have their own lookalike modeling and whatnot.
[03:32] But we find this is really useful when you're trying to explore your data.
[03:40] So that's it for this week.
[03:42] I hope you found this video helpful.
[03:43] If you did, please like and subscribe.
[03:45] And if you want to see more videos like this, please subscribe to our YouTube channel.
[03:48] And if you have any questions, or you have any questions about the product, or any other
[03:52] topics that you'd like to see covered in this video, please leave them in the comments
[03:56] below.
```
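
The net effect described above is a numeric score per lookalike model on each profile, which you can then threshold into an audience. Here is a minimal TypeScript sketch of that idea, assuming an illustrative score field name and threshold; neither is a real Lytics attribute name.

```typescript
// Minimal sketch: turning a lookalike-model score into an audience.
// The field name "score_lookalike_qualified" and the threshold of 70 are
// illustrative assumptions, not real Lytics attribute names.

interface Profile {
  id: string;
  // Per-model lookalike scores, 0-100 (higher = more similar to the source audience).
  scores: Record<string, number>;
}

function buildLookalikeAudience(
  profiles: Profile[],
  modelField: string,
  threshold: number
): Profile[] {
  // Keep only profiles whose score for this model meets the threshold.
  return profiles.filter((p) => (p.scores[modelField] ?? 0) >= threshold);
}

// Usage: visitors with no firmographic data, scored by the lookalike model.
const visitors: Profile[] = [
  { id: "anon-1", scores: { score_lookalike_qualified: 76 } },
  { id: "anon-2", scores: { score_lookalike_qualified: 10 } },
];

const qualifiedLookalikes = buildLookalikeAudience(
  visitors,
  "score_lookalike_qualified",
  70
);
console.log(qualifiedLookalikes.map((p) => p.id)); // ["anon-1"]
```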

#### Key takeaways

- Connect **Building Lookalike Models** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 12 — Interest Scores & Classification

<!-- ai_metadata: {"lesson_id":"12","type":"video","duration_seconds":340,"video_url":"https://cdn.jwplayer.com/previews/iRhAPCRQ","thumbnail_url":"https://cdn.jwplayer.com/v2/media/iRhAPCRQ/poster.jpg?width=720","topics":["Interest","Scores","Classification"]} -->

#### Video details

#### At a glance

- **Title:** 20-data-insights-interest-scores-classification
- **Duration:** 5m 40s
- **Media link:** https://cdn.jwplayer.com/previews/iRhAPCRQ
- **Publish date (unix):** 1752879751

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113718 bps
- video/mp4 · 180p · 138386 bps
- video/mp4 · 270p · 155796 bps
- video/mp4 · 360p · 172959 bps
- video/mp4 · 406p · 183666 bps
- video/mp4 · 540p · 221294 bps
- video/mp4 · 720p · 279070 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/iRhAPCRQ-120.vtt`

#### Transcript

But we covered our behavioral scores as well, right? The momentum and propensity, those are really, really important behavioral statistics, behavioral scores that help fuel and empower these models. The other one that is maybe the most important thing, certainly one of the most important things in the Lytics-coming-together-with-Contentstack story, is our interest scores. So you saw them, I think, in the very first conversation where we go back to Petsy and turn on our trusty Chrome extension. At the bottom of the Chrome extension, you'll see a set of interest scores. These, for most customers, just come out of the box. So we'll talk about how they actually work and what we're doing to classify the associated content, but ultimately what it allows you to do as a customer is this: as I browse, you'll see in real time (and it might be subtle because I've used this profile a lot) my scores change with every single piece of content that I interact with. So as I go here and look at pet carriers (this is a sandbox demo account), you'll see those scores; it's a little hard to see at this scale, but they're being recalculated every single time that an event comes into the pipeline. That is super, super useful in helping you understand anonymous users, which is most users in the case of marketing. That's where a lot of other CDPs fail and fall short: they don't talk about the importance of anonymous users and how their product helps with the anonymous use cases. It's always around "we help you get your data together and build known profiles," and at the end of the day, it's pretty easy to match two people if you have their email address and everything about them; it's much more difficult when you only have the information that they give you across different browser sessions, essentially. So to start with how all of this works: these values on the profile are called interest scores. So if I go to the raw details for this particular user and find them... the thing that's actually happening is that for every single one of the topics (we'll talk about topics here in a second), there is a score of how much or how little I am interested. This works the exact same way that all of our other scores do. So not only do you understand what I'm interested in, you can target the people that have an above-average interest in a specific topic on your website. That granularity is super important. But what you see in the bar is essentially this data getting digitized. Today, how this works is that every time a user visits your site that's been tagged, we get a URL. Our system goes out, scrapes that URL, and runs it through a series of different systems to do NLP, image analysis, and some of those kinds of things, to ultimately uncover the topics that are associated with a particular document. So for instance, if I close this to give us some more screen real estate, go into our Petsy sandbox account real quick, and just go to documents with images, you can see what it's actually doing automatically. This is with no configuration from the user other than saying you're okay to classify this domain. We're going to go out, we're going to look at the content, we're going to pull that content in, and we're going to understand what topics it's about, what images are on it, all of this information, so that every time a user visits that particular URL, we know what the content is about.
And then we can start to build scores for how much the topics on that particular document align with all of the other interests that we've seen from that particular user. So essentially: we go out, we scrape the content, we turn it into topics. And then every time the user sees that content, we understand what it's about, we can associate that with their scores, and we can update them in real time. So we're actually going through what we call classifying the documents. So if I go into, for instance, this Elegant Paws Cat Carrier (again, make-believe content), this is the thing that I wanted to touch on. One of the things that is really cool, and my biased opinion about Lytics, is that the same exact identity resolution model, the way that we handle profiles and that graph and all of that kind of stuff, works in the exact same way for content. We essentially build a user table out of the box, and that's how we're able to associate all the different profiles for Mark together. We also have a content table. So for every document, we go out, we analyze that document, we collect the information about that document, and we essentially build document profiles inside of Lytics. On the screen they're not as pretty, because we haven't spent as much time showcasing this particular thing. But you can see that for a piece of content, this URL that we just clicked on, you understand the hashed value, which is how we associate it with the user, the number of users that have seen it, the entire body of that document, the topics, the header image, the primary image; we can pull in meta tags and all of this meta information around not just the content, but the sort of document that ultimately users interact with. Thank you.
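
To make the flow above concrete, here is a minimal TypeScript sketch of the idea: a classified document carries topic weights, and each page view nudges the viewer's per-topic interest scores. The capped additive update rule is an illustrative assumption, not the actual Lytics scoring math.

```typescript
// Minimal sketch of interest scoring: a classified document carries topic
// weights, and each page view bumps the viewer's per-topic scores.
// The update rule (capped additive bump) is an illustrative assumption.

interface DocumentProfile {
  url: string;
  hashedUrl: string;               // how the document is joined to user events
  topics: Record<string, number>;  // topic -> relevance weight, 0..1
}

type InterestScores = Record<string, number>; // topic -> score, 0..100

function applyPageView(
  scores: InterestScores,
  doc: DocumentProfile,
  bump = 5
): InterestScores {
  const next = { ...scores };
  for (const [topic, weight] of Object.entries(doc.topics)) {
    // Each view adds a weighted bump, capped at 100.
    next[topic] = Math.min(100, (next[topic] ?? 0) + bump * weight);
  }
  return next;
}

// Usage: a visitor reads a (classified) cat-carrier product page twice.
const catCarrierPage: DocumentProfile = {
  url: "https://petsy.example/products/elegant-paws-cat-carrier",
  hashedUrl: "a1b2c3",
  topics: { cats: 0.9, travel: 0.4 },
};

let mark: InterestScores = { dogs: 20 };
mark = applyPageView(mark, catCarrierPage);
mark = applyPageView(mark, catCarrierPage);
console.log(mark); // { dogs: 20, cats: 9, travel: 4 } (scores rise with each event)
```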

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:18.400
But we covered our behavioral scores as well, right?

2
00:00:18.400 --> 00:00:22.560
The momentum and propensity, those are really, really important behavioral statistics, behavioral

3
00:00:22.560 --> 00:00:25.840
scores that help fuel and empower these models.

4
00:00:25.840 --> 00:00:29.800
The other one that is maybe the most important thing, certainly one of the most important

5
00:00:29.800 --> 00:00:35.720
things in the kind of Lytics coming together with Contentstack story is our interest scores.

6
00:00:35.720 --> 00:00:45.440
So you saw them, I think, in the very first conversation where we go back to Petsy, turn

7
00:00:45.440 --> 00:00:53.920
on our trustee Chrome extension.

8
00:00:53.920 --> 00:00:59.560
So at the bottom of the Chrome extension, you'll see a set of interest scores.

9
00:00:59.560 --> 00:01:02.320
These, for most customers, just come out of the box.

10
00:01:02.320 --> 00:01:05.920
So we'll talk about how they actually work, what we're doing to classify the content associated,

11
00:01:05.920 --> 00:01:11.320
but ultimately what it allows you to do as a customer is as I browse, you'll see in real

12
00:01:11.320 --> 00:01:14.400
time and it might be kind of like subtle because I've used this profile a lot, but you'll see

13
00:01:14.400 --> 00:01:18.760
my scores change with every single piece of content that I interact with.

14
00:01:18.760 --> 00:01:23.920
So as I go here and I look at pet carriers, obviously, and this is like a sandbox demo

15
00:01:23.920 --> 00:01:27.440
account, but you'll see those scores, like again, it's a little bit kind of like hard

16
00:01:27.440 --> 00:01:32.320
to see in the bigger scale, but they're being recalculated every single time that an event

17
00:01:32.320 --> 00:01:38.640
comes into the pipeline, which is super, super useful in one, helping you understand anonymous

18
00:01:38.640 --> 00:01:42.000
users, which is most users in the case of marketing.

19
00:01:42.000 --> 00:01:45.880
That's where a lot of other CDPs fail and fall short as they don't talk about the importance

20
00:01:45.880 --> 00:01:50.760
of anonymous and how that particular product helps with the anonymous use cases.

21
00:01:50.760 --> 00:01:51.760
It's always around.

22
00:01:51.760 --> 00:01:55.280
We help you get your data together and build known profiles and it will, at the end of

23
00:01:55.280 --> 00:01:58.920
the day, it's pretty easy to mark two people that, you know, if you have their email address

24
00:01:58.920 --> 00:02:03.760
and everything about them, it's much more difficult when you only have the information

25
00:02:03.760 --> 00:02:10.400
that they give you in sort of like different browser sessions, essentially.

26
00:02:10.400 --> 00:02:17.280
So kind of to just start at how all of this works, these values on the profile are called

27
00:02:17.280 --> 00:02:18.960
interest scores.

28
00:02:18.960 --> 00:02:26.480
So if I go to the raw details for this particular user and find.

29
00:02:26.480 --> 00:02:30.520
So the thing that's actually happening is for every single one of the topics, which

30
00:02:30.520 --> 00:02:35.240
we'll talk about topics here in a second, there is a score of how much or how little

31
00:02:35.240 --> 00:02:36.480
I am interested.

32
00:02:36.480 --> 00:02:39.160
This works the exact same way that all of our other scores do.

33
00:02:39.160 --> 00:02:42.560
So not only do you understand what I'm interested in, you can target the people that have an

34
00:02:42.560 --> 00:02:45.960
above average interest in a specific topic on your website.

35
00:02:45.960 --> 00:02:48.000
So that granularity is super important.

36
00:02:48.000 --> 00:02:52.400
But what you see in the bar is essentially this data getting digitized.

37
00:02:52.400 --> 00:02:58.200
Today how this works is every time that a user visits your site that's been tagged,

38
00:02:58.200 --> 00:02:59.680
we get a URL.

39
00:02:59.680 --> 00:03:04.860
Our system actually goes out, scrapes that URL and runs it through a series of different

40
00:03:04.860 --> 00:03:11.760
systems to do NLP, image analysis, some of those kind of things to ultimately uncover

41
00:03:11.760 --> 00:03:15.300
the topics that are associated with a particular document.

42
00:03:15.300 --> 00:03:22.540
So for instance, if I close this, give us some more screen real estate.

43
00:03:22.540 --> 00:03:28.500
If I go into our Petsy sandbox account real quick and just go to like documents with images,

44
00:03:28.500 --> 00:03:30.700
what it's actually doing automatically.

45
00:03:30.700 --> 00:03:34.900
So this is with no configuration from the user other than saying, you're okay to classify

46
00:03:34.900 --> 00:03:36.480
this domain.

47
00:03:36.480 --> 00:03:39.540
We're going to go out, we're going to look at the content, we're going to pull that content

48
00:03:39.540 --> 00:03:43.780
in, we're going to understand what topics it's about, what images are on it, all of

49
00:03:43.780 --> 00:03:45.860
this information.

50
00:03:45.860 --> 00:03:51.100
So that every time that a user visits that particular URL, we know what the content is

51
00:03:51.100 --> 00:03:52.100
about.

52
00:03:52.100 --> 00:03:57.620
And then we can start to build scores for how much the topics on that particular document

53
00:03:57.620 --> 00:04:01.140
align with all of the other interests that we've seen from that particular user.

54
00:04:01.140 --> 00:04:05.780
So it's essentially, we go out, we scrape the content, we turn it into topics.

55
00:04:05.780 --> 00:04:09.940
And then every time that the user then sees that content, we understand what it's about,

56
00:04:09.940 --> 00:04:13.060
we can associate that with their scores and we can update them in real time.

57
00:04:13.060 --> 00:04:15.980
So we're actually going through what we call classifying the documents.

58
00:04:15.980 --> 00:04:22.100
So if I go into, for instance, this like Elegant Paws Cat Carrier, again, make-believe content.

59
00:04:22.100 --> 00:04:25.100
And this is the thing that I wanted to touch on.

60
00:04:25.100 --> 00:04:29.100
One of the things that is really cool, and my biased opinion about Lytics is that the

61
00:04:29.100 --> 00:04:34.540
same exact identity resolution model, the way that we handle profiles and that graph

62
00:04:34.540 --> 00:04:38.820
and all of that kind of stuff, works in the exact same way for content.

63
00:04:38.820 --> 00:04:42.540
So we essentially out of the box build a user table, and that's how we're able to associate

64
00:04:42.540 --> 00:04:44.940
all the different profiles with mark together.

65
00:04:44.940 --> 00:04:46.660
We also have a content table.

66
00:04:46.660 --> 00:04:51.980
So for every document, we go out, we analyze that document, we collect the information

67
00:04:51.980 --> 00:04:57.760
about that document, and we essentially build document profiles inside of Lytics.

68
00:04:57.760 --> 00:05:01.340
So on the screen, they're not as pretty, because we haven't spent as much time sort of like

69
00:05:01.340 --> 00:05:04.040
showcasing this particular thing.

70
00:05:04.040 --> 00:05:09.940
But you can see that like for a piece of content, this URL that we just clicked on, you understand

71
00:05:09.940 --> 00:05:14.420
the hashed value, which is how we associate it with the user, the number of users that

72
00:05:14.420 --> 00:05:20.220
have seen it, the different information, the entire body of that document, the topics,

73
00:05:20.220 --> 00:05:24.580
the header image, the primary image, we can pull in meta tags, whether it's fail, like

74
00:05:24.580 --> 00:05:30.260
all of this meta information around not just the content, but this sort of document that

75
00:05:30.260 --> 00:05:31.820
ultimately users interact with.

76
00:05:39.940 --> 00:05:40.940
Thank you.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] But we covered our behavioral scores as well, right?
[00:18] The momentum and propensity, those are really, really important behavioral statistics, behavioral
[00:22] scores that help fuel and empower these models.
[00:25] The other one that is maybe the most important thing, certainly one of the most important
[00:29] things in the kind of Lytics coming together with Contentstack story is our interest scores.
[00:35] So you saw them, I think, in the very first conversation where we go back to Petsy, turn
[00:45] on our trustee Chrome extension.
[00:53] So at the bottom of the Chrome extension, you'll see a set of interest scores.
[00:59] These, for most customers, just come out of the box.
[01:02] So we'll talk about how they actually work, what we're doing to classify the content associated,
[01:05] but ultimately what it allows you to do as a customer is as I browse, you'll see in real
[01:11] time and it might be kind of like subtle because I've used this profile a lot, but you'll see
[01:14] my scores change with every single piece of content that I interact with.
[01:18] So as I go here and I look at pet carriers, obviously, and this is like a sandbox demo
[01:23] account, but you'll see those scores, like again, it's a little bit kind of like hard
[01:27] to see in the bigger scale, but they're being recalculated every single time that an event
[01:32] comes into the pipeline, which is super, super useful in one, helping you understand anonymous
[01:38] users, which is most users in the case of marketing.
[01:42] That's where a lot of other CDPs fail and fall short as they don't talk about the importance
[01:45] of anonymous and how that particular product helps with the anonymous use cases.
[01:50] It's always around.
[01:51] We help you get your data together and build known profiles and it will, at the end of
[01:55] the day, it's pretty easy to mark two people that, you know, if you have their email address
[01:58] and everything about them, it's much more difficult when you only have the information
[02:03] that they give you in sort of like different browser sessions, essentially.
[02:10] So kind of to just start at how all of this works, these values on the profile are called
[02:17] interest scores.
[02:18] So if I go to the raw details for this particular user and find.
[02:26] So the thing that's actually happening is for every single one of the topics, which
[02:30] we'll talk about topics here in a second, there is a score of how much or how little
[02:35] I am interested.
[02:36] This works the exact same way that all of our other scores do.
[02:39] So not only do you understand what I'm interested in, you can target the people that have an
[02:42] above average interest in a specific topic on your website.
[02:45] So that granularity is super important.
[02:48] But what you see in the bar is essentially this data getting digitized.
[02:52] Today how this works is every time that a user visits your site that's been tagged,
[02:58] we get a URL.
[02:59] Our system actually goes out, scrapes that URL and runs it through a series of different
[03:04] systems to do NLP, image analysis, some of those kind of things to ultimately uncover
[03:11] the topics that are associated with a particular document.
[03:15] So for instance, if I close this, give us some more screen real estate.
[03:22] If I go into our Petsy sandbox account real quick and just go to like documents with images,
[03:28] what it's actually doing automatically.
[03:30] So this is with no configuration from the user other than saying, you're okay to classify
[03:34] this domain.
[03:36] We're going to go out, we're going to look at the content, we're going to pull that content
[03:39] in, we're going to understand what topics it's about, what images are on it, all of
[03:43] this information.
[03:45] So that every time that a user visits that particular URL, we know what the content is
[03:51] about.
[03:52] And then we can start to build scores for how much the topics on that particular document
[03:57] align with all of the other interests that we've seen from that particular user.
[04:01] So it's essentially, we go out, we scrape the content, we turn it into topics.
[04:05] And then every time that the user then sees that content, we understand what it's about,
[04:09] we can associate that with their scores and we can update them in real time.
[04:13] So we're actually going through what we call classifying the documents.
[04:15] So if I go into, for instance, this like Elegant Paws Cat Carrier, again, make-believe content.
[04:22] And this is the thing that I wanted to touch on.
[04:25] One of the things that is really cool, and my biased opinion about Lytics is that the
```

#### Key takeaways

- Connect **Interest Scores & Classification** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 13 — Example: Exploring Classified Content

<!-- ai_metadata: {"lesson_id":"13","type":"video","duration_seconds":73,"video_url":"https://cdn.jwplayer.com/previews/r5mEINso","thumbnail_url":"https://cdn.jwplayer.com/v2/media/r5mEINso/poster.jpg?width=720","topics":["Example","Exploring","Classified","Content"]} -->

#### Video details

#### At a glance

- **Title:** 21-data-insights-exploring-classified-content
- **Duration:** 1m 13s
- **Media link:** https://cdn.jwplayer.com/previews/r5mEINso
- **Publish date (unix):** 1752879987

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 114950 bps
- video/mp4 · 180p · 145410 bps
- video/mp4 · 270p · 165318 bps
- video/mp4 · 360p · 171917 bps
- video/mp4 · 406p · 181549 bps
- video/mp4 · 540p · 212738 bps
- video/mp4 · 720p · 263879 bps
- video/mp4 · 1080p · 399501 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/r5mEINso-120.vtt`

#### Transcript

So with that, real quick, just to show how Lytics works today. If I close this, let's just go to the bond toy. And if I go to classification, you can do a manual classification. Again, we do this automatically: we look at what URLs users are engaging with, we make sure that we have a document that matches each one, we keep it up to date, etc. But this will essentially go out and classify that particular document. It's going to pull the page in so you can make sure you have the right thing, and it's going to say, here's the image. In some cases there's more context, and you can manually curate topics; I think inevitably, over time, this goes away, because that's a thing that you would want to do inside of the CMS portion of the management system. But this is how users can start to get a sneak peek of what's working, what we're seeing, and ultimately what we're classifying.
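
If you want to capture an artifact for this step, a small script that requests classification for a single URL and prints the resulting topics is enough. The sketch below is hypothetical: the endpoint path and response shape are assumptions for illustration only, so check the Lytics content and classification API docs for the real routes before using it.

```typescript
// Hypothetical sketch of kicking off a manual classification for one URL and
// reading back the topics. The endpoint and response shape are assumptions,
// not a documented Lytics route.

interface ClassificationResult {
  url: string;
  topics: string[];
  primaryImage?: string;
}

async function classifyUrl(
  apiToken: string,
  url: string
): Promise<ClassificationResult> {
  // Placeholder endpoint on a placeholder domain.
  const res = await fetch("https://api.lytics.example/classify", {
    method: "POST",
    headers: {
      Authorization: apiToken,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`classification failed: ${res.status}`);
  return (await res.json()) as ClassificationResult;
}

// Usage (token kept in a local .env, never hard-coded):
// const result = await classifyUrl(process.env.LYTICS_API_TOKEN!, "https://petsy.example/products/bond-toy");
// console.log(result.topics);
```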

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:21.220
So with that, so real quick, just to kind of show how Lytics works today.

2
00:00:21.220 --> 00:00:29.140
If I close this, so let's say let's just go to the bond toy.

3
00:00:29.140 --> 00:00:34.560
And if I go to classification, you can do a manual classification.

4
00:00:34.560 --> 00:00:38.140
So again, we do this automatically, we look at what URLs users are engaging with, we make

5
00:00:38.140 --> 00:00:42.480
sure that we have a document that matches that, keep it up to date, etc, etc.

6
00:00:42.480 --> 00:00:45.960
But it'll essentially go out and classify that particular document.

7
00:00:45.960 --> 00:00:48.520
So it's going to pull in to make sure you have the right thing, it's going to say here's

8
00:00:48.520 --> 00:00:49.520
the image.

9
00:00:49.520 --> 00:00:54.160
In some cases, there's more context, you can manually curate topics, I think inevitably

10
00:00:54.160 --> 00:00:57.600
over time, this goes away, that's a thing that you would want to do inside of the CMS

11
00:00:57.600 --> 00:00:59.040
portion of the management system.

12
00:00:59.040 --> 00:01:03.160
But this is how users can start to get kind of a sneak peek of what's working and what

13
00:01:03.160 --> 00:01:04.800
we're seeing and ultimately classifying.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] So with that, so real quick, just to kind of show how Lytics works today.
[00:21] If I close this, so let's say let's just go to the bond toy.
[00:29] And if I go to classification, you can do a manual classification.
[00:34] So again, we do this automatically, we look at what URLs users are engaging with, we make
[00:38] sure that we have a document that matches that, keep it up to date, etc, etc.
[00:42] But it'll essentially go out and classify that particular document.
[00:45] So it's going to pull in to make sure you have the right thing, it's going to say here's
[00:48] the image.
[00:49] In some cases, there's more context, you can manually curate topics, I think inevitably
[00:54] over time, this goes away, that's a thing that you would want to do inside of the CMS
[00:57] portion of the management system.
[00:59] But this is how users can start to get kind of a sneak peek of what's working and what
[01:03] we're seeing and ultimately classifying.
```

#### Key takeaways

- Connect **Example: Exploring Classified Content** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 14 — Content Recommendations

<!-- ai_metadata: {"lesson_id":"14","type":"video","duration_seconds":289,"video_url":"https://cdn.jwplayer.com/previews/zt7iltQ2","thumbnail_url":"https://cdn.jwplayer.com/v2/media/zt7iltQ2/poster.jpg?width=720","topics":["Content","Recommendations"]} -->

#### Video details

#### At a glance

- **Title:** 22-data-insights-understanding-content-recommendations
- **Duration:** 4m 49s
- **Media link:** https://cdn.jwplayer.com/previews/zt7iltQ2
- **Publish date (unix):** 1752880653

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113809 bps
- video/mp4 · 180p · 136938 bps
- video/mp4 · 270p · 150874 bps
- video/mp4 · 360p · 163745 bps
- video/mp4 · 406p · 172426 bps
- video/mp4 · 540p · 200843 bps
- video/mp4 · 720p · 246386 bps
- video/mp4 · 1080p · 366924 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/zt7iltQ2-120.vtt`

#### Transcript

That's kind of how classification works at a high level. The net result is that we understand the content, we have topics, and those topics are associated with the user. The thing that we haven't talked about at all is how a customer can actually use that information. One of the simple ways is that you can build audiences, of course. I can go in here, build an audience, and say that it's content, say content topics, and I want to target anybody that has a... we're just making up numbers on a demo site... any number you want for a higher interest in this particular topic, and it's going to pull up three users. So you can build a very targeted audience of just the people that are going to want to engage with this particular type of content; that's kind of the obvious way to use it. So Lytics, out of the box, for every one of our interest engines (interest engines is what we used to call them; we're now calling them context layers as we start to rebrand some things) can actually make a recommendation for a specific user from a selection of content. So when Mark visits the website, I want to surface some portion of content that we know he's going to be interested in based on his past behaviors. If I go back over here real quick... so if I do a recommendation (and I have not tested this in a little while, but it still works), it's going to actually go through and select from a collection, which we'll talk about here in a second. But based on the information that I know specifically about Mark and what he is interested in, as well as the specific corpus of content that could be surfaced to Mark, here's a set of recommendations that also have scores on how well or how little they align. We have our Pathfora web personalization tool, which allows you to surface content recommendations. Content recommendations are one of our most common use cases; it's the easiest thing to stand up, and you can just surface a modal with a recommendation for content based on whatever a user is going to want, to help drive their session depth, essentially. So that all comes out of the box with Lytics, and the APIs are already all there to support it. And then the other unique aspect of content in that recommendation pipeline is what we call collections. You can think of them as just a segment of content. So again, in the same way that Lytics is building profiles, Lytics is building document profiles. If you can build an audience of Lytics users, nothing stops you from building an audience of Lytics content. Recommendations are made based on content collections. So if I only want to recommend documents that have images, or documents by a particular author, or documents that are about dogs, whatever it is, you can build a segment of content. And then when you make a recommendation, you can choose that content collection to recommend from, to whittle down the content that you can choose from. That way you're not making a recommendation from all of your content; you're only making one from the products you want to focus on in fall, or whatever it may be. So content collections (and we can walk through creating a collection real quick) are super simple. You can go here and just say I want particular titles or authors, whether it has an image, whether it has a description.
And then I think where I usually go is this top-right button: there's an advanced editor, which takes you into the literal segment builder, but now you're focused on a document. So you can go in here and say, you know, any content that was created after a certain day, things that you're going to publish in the future, certain attributes, certain brands. Again, part of what I think we can unlock is that when you're building the content model, more information that you all probably already have inside of Contentstack could be surfaced on the document profile so that you can then segment on that information, for instance. But anyway, you can build a collection off of essentially any of the document attributes that are on there.
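
As a rough mental model of what the recommendation call is doing, the sketch below ranks the documents in a collection by how well their topics align with a user's interest scores and returns the best matches. The dot-product ranking is an illustrative assumption, not the actual Lytics recommender.

```typescript
// Minimal sketch: score each document in a collection by how well its topics
// align with a user's interest scores, then return the top matches.
// The dot-product ranking is an illustrative assumption.

interface Doc {
  url: string;
  topics: Record<string, number>;   // topic -> relevance weight, 0..1
}

type UserInterests = Record<string, number>; // topic -> interest score, 0..100

function recommend(user: UserInterests, collection: Doc[], limit = 3) {
  return collection
    .map((doc) => {
      // Alignment = sum over shared topics of (user score * document weight).
      const score = Object.entries(doc.topics).reduce(
        (sum, [topic, weight]) => sum + (user[topic] ?? 0) * weight,
        0
      );
      return { url: doc.url, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Usage: a "dog content with images" style collection, ranked for one user.
const collection: Doc[] = [
  { url: "/blog/best-dog-leashes", topics: { dogs: 0.8 } },
  { url: "/blog/cat-trees-101", topics: { cats: 0.9 } },
];
console.log(recommend({ dogs: 60, cats: 10 }, collection, 2));
// [{ url: "/blog/best-dog-leashes", score: 48 }, { url: "/blog/cat-trees-101", score: 9 }]
```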

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:18.600
That's kind of how classification works at a high level.

2
00:00:18.600 --> 00:00:20.520
The net result is we understand the content.

3
00:00:20.520 --> 00:00:21.520
We have topics.

4
00:00:21.520 --> 00:00:23.640
Those topics are associated with the user.

5
00:00:23.640 --> 00:00:28.280
The thing that we haven't talked about at all is kind of then how can a customer use

6
00:00:28.280 --> 00:00:29.400
that information?

7
00:00:30.400 --> 00:00:33.320
One of the simple ways is you can build audiences, of course.

8
00:00:33.320 --> 00:00:34.920
I can go into here.

9
00:00:34.920 --> 00:00:36.840
I can build an audience.

10
00:00:36.840 --> 00:00:46.960
I can say that it's content, say content topics, and I want to just target anybody that has

11
00:00:46.960 --> 00:00:54.280
a ... We're just making up numbers on a demo site.

12
00:00:54.280 --> 00:00:58.480
Any number that you want to have a higher interest for this particular topic and it's

13
00:00:58.480 --> 00:01:02.280
going to pull up three users, so you can build a very targeted audience of just people

14
00:01:02.280 --> 00:01:06.120
that are going to want to engage with this particular type of content is kind of like

15
00:01:06.120 --> 00:01:09.080
obvious way one to use it.

16
00:01:09.080 --> 00:01:13.040
So Lytics out of the box for every one of our interest engines, which we'll kind of

17
00:01:13.040 --> 00:01:17.080
cover what ... Interest engines, what we used to call them, we're now calling them context

18
00:01:17.080 --> 00:01:20.000
layers as we kind of like start to rebrand some things.

19
00:01:20.000 --> 00:01:25.600
But for any of these interest engines, you could actually make a recommendation for a

20
00:01:25.600 --> 00:01:29.160
specific user from a selection of content.

21
00:01:29.160 --> 00:01:35.440
So when Mark visits the website, I want to surface some portion of content that we know

22
00:01:35.440 --> 00:01:37.960
that he's going to be interested in based on his past behaviors.

23
00:01:37.960 --> 00:02:04.560
If I go back over here real quick ... So if I do a recommendation, and I have not tested

24
00:02:04.560 --> 00:02:06.640
this in a little while, but it still works.

25
00:02:06.640 --> 00:02:10.000
So it's going to actually go through and it's going to select from a collection, which we'll

26
00:02:10.000 --> 00:02:11.760
talk about here in a second.

27
00:02:11.760 --> 00:02:16.760
But based on my information that I know specifically about Mark and what he is interested in, as

28
00:02:16.760 --> 00:02:21.600
well as the specific corpus of content that could be surfaced to Mark, here's a set of

29
00:02:21.600 --> 00:02:26.520
recommendations that also have scores on how well or how little they align.

30
00:02:26.520 --> 00:02:31.440
We have our path forward kind of web personalization tool allows you to surface content recommendations.

31
00:02:31.440 --> 00:02:33.760
Content recommendations is one of our most common use cases.

32
00:02:33.760 --> 00:02:37.520
It's the easiest thing to stand up and you can just surface a modal with recommendation

33
00:02:37.520 --> 00:02:41.760
for content based on whatever a user is going to want to help drive their sort of session

34
00:02:41.760 --> 00:02:43.440
depth essentially.

35
00:02:43.440 --> 00:02:47.040
So that all comes out of the box with Lytx.

36
00:02:47.040 --> 00:02:49.200
The APIs are already all there to support it.

37
00:02:49.200 --> 00:02:54.640
And then the other unique aspect of content in that sort of recommendation pipeline is

38
00:02:54.640 --> 00:02:55.960
what we call collections.

39
00:02:55.960 --> 00:02:58.720
You can think of them as just a segment of content.

40
00:02:58.720 --> 00:03:04.040
So again, in the same way that Lytx is building profiles, Lytx is building document profiles.

41
00:03:04.040 --> 00:03:08.680
If you can build an audience of Lytx users, nothing stops you from building an audience

42
00:03:08.680 --> 00:03:10.940
essentially of Lytx content.

43
00:03:10.940 --> 00:03:14.840
So recommendations are made based on content collections.

44
00:03:14.840 --> 00:03:21.680
So if I only want to recommend documents that have images or documents by a particular author

45
00:03:21.680 --> 00:03:27.820
or documents that are about dogs, whatever it is, you can build a segment of content.

46
00:03:27.920 --> 00:03:31.940
And then when you make a recommendation, you can choose that segment collector, that content

47
00:03:31.940 --> 00:03:38.500
collection to recommend from to sort of whittle down the content that you can choose from.

48
00:03:38.500 --> 00:03:41.980
That way you're not making just a recommendation from all of your content, you're making one

49
00:03:41.980 --> 00:03:45.020
only from products that want to focus on in fall or whatever it may be.

50
00:03:45.020 --> 00:03:50.740
So content collections, and we can walk through just creating a collection real quick, are

51
00:03:50.740 --> 00:03:51.740
super simple.

52
00:03:51.740 --> 00:03:55.540
You can go here and just say I want particular titles or authors, if it has an image, if

53
00:03:55.540 --> 00:03:57.140
it has a description.

54
00:03:57.140 --> 00:04:00.580
And then I think where I usually go is in this top right button, there's an advanced

55
00:04:00.580 --> 00:04:05.940
editor, which takes you into the literal segment builder, but now you're focused on a document.

56
00:04:05.940 --> 00:04:13.060
So you can go in here and say, you know, any, you know, content that was created after whatever

57
00:04:13.060 --> 00:04:18.300
certain day, things that you're going to publish in the future, certain attributes, certain

58
00:04:18.300 --> 00:04:23.900
brands, again, like there's some part of I think what we can unlock is that when you're

59
00:04:23.900 --> 00:04:28.660
building the content model, more information that you all probably already have instead

60
00:04:28.660 --> 00:04:33.340
of content stack could be surfaced on the document profile so that you can then segment

61
00:04:33.340 --> 00:04:36.260
on that information, for instance.

62
00:04:36.260 --> 00:04:40.220
But anyway, you can build a collection off of essentially any of the document attributes

63
00:04:40.220 --> 00:04:40.860
that are on there.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] That's kind of how classification works at a high level.
[00:18] The net result is we understand the content.
[00:20] We have topics.
[00:21] Those topics are associated with the user.
[00:23] The thing that we haven't talked about at all is kind of then how can a customer use
[00:28] that information?
[00:30] One of the simple ways is you can build audiences, of course.
[00:33] I can go into here.
[00:34] I can build an audience.
[00:36] I can say that it's content, say content topics, and I want to just target anybody that has
[00:46] a ... We're just making up numbers on a demo site.
[00:54] Any number that you want to have a higher interest for this particular topic and it's
[00:58] going to pull up three users, so you can build a very targeted audience of just people
[01:02] that are going to want to engage with this particular type of content is kind of like
[01:06] obvious way one to use it.
[01:09] So Lytics out of the box for every one of our interest engines, which we'll kind of
[01:13] cover what ... Interest engines, what we used to call them, we're now calling them context
[01:17] layers as we kind of like start to rebrand some things.
[01:20] But for any of these interest engines, you could actually make a recommendation for a
[01:25] specific user from a selection of content.
[01:29] So when Mark visits the website, I want to surface some portion of content that we know
[01:35] that he's going to be interested in based on his past behaviors.
[01:37] If I go back over here real quick ... So if I do a recommendation, and I have not tested
[02:04] this in a little while, but it still works.
[02:06] So it's going to actually go through and it's going to select from a collection, which we'll
[02:10] talk about here in a second.
[02:11] But based on my information that I know specifically about Mark and what he is interested in, as
[02:16] well as the specific corpus of content that could be surfaced to Mark, here's a set of
[02:21] recommendations that also have scores on how well or how little they align.
[02:26] We have our path forward kind of web personalization tool allows you to surface content recommendations.
[02:31] Content recommendations is one of our most common use cases.
[02:33] It's the easiest thing to stand up and you can just surface a modal with recommendation
[02:37] for content based on whatever a user is going to want to help drive their sort of session
[02:41] depth essentially.
[02:43] So that all comes out of the box with Lytx.
[02:47] The APIs are already all there to support it.
[02:49] And then the other unique aspect of content in that sort of recommendation pipeline is
[02:54] what we call collections.
[02:55] You can think of them as just a segment of content.
[02:58] So again, in the same way that Lytx is building profiles, Lytx is building document profiles.
[03:04] If you can build an audience of Lytx users, nothing stops you from building an audience
[03:08] essentially of Lytx content.
[03:10] So recommendations are made based on content collections.
[03:14] So if I only want to recommend documents that have images or documents by a particular author
[03:21] or documents that are about dogs, whatever it is, you can build a segment of content.
[03:27] And then when you make a recommendation, you can choose that segment collector, that content
[03:31] collection to recommend from to sort of whittle down the content that you can choose from.
[03:38] That way you're not making just a recommendation from all of your content, you're making one
[03:41] only from products that want to focus on in fall or whatever it may be.
[03:45] So content collections, and we can walk through just creating a collection real quick, are
[03:50] super simple.
[03:51] You can go here and just say I want particular titles or authors, if it has an image, if
[03:55] it has a description.
[03:57] And then I think where I usually go is in this top right button, there's an advanced
[04:00] editor, which takes you into the literal segment builder, but now you're focused on a document.
[04:05] So you can go in here and say, you know, any, you know, content that was created after whatever
[04:13] certain day, things that you're going to publish in the future, certain attributes, certain
[04:18] brands, again, like there's some part of I think what we can unlock is that when you're
[04:23] building the content model, more information that you all probably already have instead
[04:28] of content stack could be surfaced on the document profile so that you can then segment
```

#### Key takeaways

- Connect **Content Recommendations** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

### Lesson 15 — What are "triggers"?

<!-- ai_metadata: {"lesson_id":"15","type":"video","duration_seconds":223,"video_url":"https://cdn.jwplayer.com/previews/AW2Mx2Bb","thumbnail_url":"https://cdn.jwplayer.com/v2/media/AW2Mx2Bb/poster.jpg?width=720","topics":["What","are","triggers"]} -->

#### Video details

#### At a glance

- **Title:** 23-data-insights-trigger-basics
- **Duration:** 3m 43s
- **Media link:** https://cdn.jwplayer.com/previews/AW2Mx2Bb
- **Publish date (unix):** 1752881258

#### Streaming renditions

- application/vnd.apple.mpegurl
- audio/mp4 · AAC Audio · 113752 bps
- video/mp4 · 180p · 130023 bps
- video/mp4 · 270p · 138491 bps
- video/mp4 · 360p · 145476 bps
- video/mp4 · 406p · 150625 bps
- video/mp4 · 540p · 166917 bps
- video/mp4 · 720p · 192897 bps
- video/mp4 · 1080p · 263885 bps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/AW2Mx2Bb-120.vtt`

#### Transcript

Once you have a profile built and you start to build audiences, what's the actual mechanism that allows you to do something with that? How do we know somebody just entered an audience? How do we know they exited an audience? How do we know a value changed? How do the jobs that are real-time syncs even work at all? The answers to all of those questions are what we call triggers. So essentially (this is where I'm going to explain it like a noob; if we want to go into more detail, Eric can go into much more detail on the processing pipeline), data comes into Lytics. You have raw events: the stuff that we're collecting, the stuff that we're pulling from systems. We do what I call enrichment. That's where our behavioral scores are calculated, audience memberships are calculated, and the model scores are calculated. So there are a bunch of things happening where we're improving, updating, and adding additional context onto the profile. Based on the results of that, it can then trigger things. Most of the triggers, certainly the most common triggers, are around segment membership. So at that point in the pipeline, once data has been collected and we've done all this analysis, we now understand all of the audiences that this user is either a member of and no longer a member of, or isn't a member of and now will be a member of. At that point in time, you can use what we call a trigger to say: Eric is now joining segment A. Because Eric's joining segment A, I want to fire off that information, that profile, to these 16 different systems, or I want to fire this webhook, or I want to do something downstream based on that specific signal. So segment entry is one of those signals: I'm now a member of a segment. Segment exit is another: now I'm no longer a member of that segment. Yep. A new trigger, I guess, technically, is flows. We'll go through and talk about flows specifically, but flows sit somewhere between segment and attribute. They're a little bit of their own special thing, but flow, stage, and step membership is also a trigger; it's all contained within flows, which is in a segment context. But yeah, the triggers are super important in that they're how data gets out of the system, and understanding what you can and can't do with them is important. And then also, for our jobs, you have the non-trigger-based kind, which runs on a cadence: we're going to go through, do a scan of the users in that segment, and pass it along. That's still a possibility as well. But more and more, we try to do everything in that real-time streaming system: every time a trigger happens, it sends the information. Yeah, the other thing that we probably won't get into very deeply here is that there are also candidates to be triggered later. It's the concept of, maybe we need to look at this profile again in the future. And that's another thing that we look at during the same evaluation. So if one of these custom rules is "has been inactive for five days," then as soon as they have any activity, it would make them a candidate to be inactive five days in the future. And that's another thing that happens at the same time that we're calculating these triggers.
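
A compact way to picture segment-entry and segment-exit triggers is as a diff between a profile's audience memberships before and after enrichment. The sketch below is illustrative only; the event shape and the webhook step are assumptions, not the Lytics wire format.

```typescript
// Minimal sketch: after enrichment re-evaluates audience membership, diff the
// old and new membership sets and emit entry/exit signals that downstream
// syncs or webhooks can react to. The payload shape is an assumption.

type SegmentId = string;

interface TriggerEvent {
  userId: string;
  segment: SegmentId;
  kind: "segment_entry" | "segment_exit";
}

function diffMemberships(
  userId: string,
  before: Set<SegmentId>,
  after: Set<SegmentId>
): TriggerEvent[] {
  const events: TriggerEvent[] = [];
  for (const seg of after) {
    if (!before.has(seg)) events.push({ userId, segment: seg, kind: "segment_entry" });
  }
  for (const seg of before) {
    if (!after.has(seg)) events.push({ userId, segment: seg, kind: "segment_exit" });
  }
  return events;
}

// Usage: Eric's latest event moved him into "segment-a" and out of "inactive-5d".
const events = diffMemberships(
  "eric",
  new Set(["inactive-5d"]),
  new Set(["segment-a"])
);
for (const e of events) {
  // In a real activation this would POST to a webhook or hand off to a sync job.
  console.log(`${e.kind}: ${e.userId} -> ${e.segment}`);
}
```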

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:22.040
Once you have a profile built and you start to build audiences, what's the actual mechanism

2
00:00:22.040 --> 00:00:24.680
to allow you to do something with that?

3
00:00:24.680 --> 00:00:27.040
How do we know somebody just entered an audience?

4
00:00:27.040 --> 00:00:28.560
How do we know they exited an audience?

5
00:00:28.560 --> 00:00:30.520
How do we know a value changed?

6
00:00:30.520 --> 00:00:34.800
How do the jobs that are real-time syncs even work at all?

7
00:00:34.800 --> 00:00:37.360
All of the answers to those questions are what we call triggers.

8
00:00:37.360 --> 00:00:40.360
So essentially, this is where I'm going to explain it like a noob.

9
00:00:40.360 --> 00:00:43.360
If we want to go into more detail, Eric can go into much more detail on the processing

10
00:00:43.360 --> 00:00:44.360
pipeline.

11
00:00:44.360 --> 00:00:46.100
But essentially, data comes into Lytx.

12
00:00:46.100 --> 00:00:47.100
You have raw events.

13
00:00:47.100 --> 00:00:48.100
It's the stuff that we're collecting.

14
00:00:48.100 --> 00:00:50.280
It's the stuff that we're pulling from systems.

15
00:00:50.280 --> 00:00:51.760
We do what I call enrichment.

16
00:00:51.760 --> 00:00:55.000
So that's where our behavioral scores are calculated.

17
00:00:55.000 --> 00:00:56.240
Audience memberships are calculated.

18
00:00:56.240 --> 00:00:57.520
The model scores are calculated.

19
00:00:57.520 --> 00:01:01.920
So there's a bunch of things that are happening to where we're improving, updating, and adding

20
00:01:01.920 --> 00:01:04.880
additional context onto the profile.

21
00:01:04.880 --> 00:01:08.960
Based on the results of that, it then can trigger things.

22
00:01:08.960 --> 00:01:13.720
Most of the triggers, certainly the most common triggers, are around segment membership.

23
00:01:13.720 --> 00:01:18.280
So at that point in the pipeline, once data has been collected, we've done all this analysis,

24
00:01:18.280 --> 00:01:22.520
now we understand all of the audiences that this user is either a member of and no longer

25
00:01:22.520 --> 00:01:26.520
a member of or isn't a member of and now will be a member of.

26
00:01:26.520 --> 00:01:32.200
At that point in time, you can essentially use what we call a trigger to say, Eric is

27
00:01:32.200 --> 00:01:38.080
now joining segment A. Because Eric's joining segment A, I want to fire off that information,

28
00:01:38.080 --> 00:01:42.560
that profile, in these 16 different systems, or I want to fire this webhook, or I want

29
00:01:42.560 --> 00:01:47.400
to do something downstream based on that specific signal.

30
00:01:47.400 --> 00:01:49.320
So segment entry is one of those.

31
00:01:49.320 --> 00:01:51.280
I'm now a member of a segment.

32
00:01:51.280 --> 00:01:53.080
Segment exit is one of those signals.

33
00:01:53.080 --> 00:01:55.760
So now I'm no longer a member of that segment.

34
00:01:56.000 --> 00:01:57.000
Yep.

35
00:01:57.000 --> 00:02:00.800
A new trigger, I guess, technically, is flows.

36
00:02:00.800 --> 00:02:06.000
We'll go through and talk about flows specifically, but flows is somewhere between segment and

37
00:02:06.000 --> 00:02:07.000
attribute.

38
00:02:07.000 --> 00:02:11.360
It's a little bit different than its own special thing, but sort of like flow, stage, step

39
00:02:11.360 --> 00:02:15.240
membership is also a trigger, but it's all contained kind of within flows, which is in

40
00:02:15.240 --> 00:02:17.240
a segment context.

41
00:02:17.240 --> 00:02:25.120
But yeah, the triggers are super important in that it's how data gets out of the system

42
00:02:25.480 --> 00:02:28.200
and kind of understanding what you can do and what you can't do with them is important.

43
00:02:28.200 --> 00:02:32.560
And then also for our jobs, you have the non-trigger base, which is sort of like on a cadence.

44
00:02:32.560 --> 00:02:35.560
We're going to go through and just do a scan of the users in that segment and pass it.

45
00:02:35.560 --> 00:02:37.080
That's still a possibility as well.

46
00:02:37.080 --> 00:02:44.400
But I think more and more, we try to do everything in that sort of real-time streaming system.

47
00:02:44.400 --> 00:02:47.200
Every time a trigger happens, it sends the information.

48
00:02:47.200 --> 00:02:59.360
Yeah, the other thing that we probably won't get into very deep here is there are sort

49
00:02:59.360 --> 00:03:02.640
of like candidates to be triggered later.

50
00:03:02.640 --> 00:03:09.200
So it's the concept of like, maybe we need to look at this profile again in the future.

51
00:03:09.200 --> 00:03:13.080
And that's another thing that we look at during the same evaluation.

52
00:03:13.120 --> 00:03:20.400
So if one of these custom rules is "inactive for five days", that would be like, as soon

53
00:03:20.400 --> 00:03:25.800
as they have any activity, it would make them a candidate to be inactive five days in the future.

54
00:03:25.800 --> 00:03:33.560
And that's another thing that happens all at the same time that we're calculating these triggers.

```

```transcript
<!-- PLACEHOLDER: replace with real transcript before publish if cues were auto-derived from WebVTT -->
[00:00] Once you have a profile built and you start to build audiences, what's the actual mechanism
[00:22] to allow you to do something with that?
[00:24] How do we know somebody just entered an audience?
[00:27] How do we know they exited an audience?
[00:28] How do we know a value changed?
[00:30] How do the jobs that are real-time syncs even work at all?
[00:34] All of the answers to those questions are what we call triggers.
[00:37] So essentially, this is where I'm going to explain it like a noob.
[00:40] If we want to go into more detail, Eric can go into much more detail on the processing
[00:43] pipeline.
[00:44] But essentially, data comes into Lytics.
[00:46] You have raw events.
[00:47] It's the stuff that we're collecting.
[00:48] It's the stuff that we're pulling from systems.
[00:50] We do what I call enrichment.
[00:51] So that's where our behavioral scores are calculated.
[00:55] Audience memberships are calculated.
[00:56] The model scores are calculated.
[00:57] So there's a bunch of things that are happening to where we're improving, updating, and adding
[01:01] additional context onto the profile.
[01:04] Based on the results of that, it then can trigger things.
[01:08] Most of the triggers, certainly the most common triggers, are around segment membership.
[01:13] So at that point in the pipeline, once data has been collected, we've done all this analysis,
[01:18] now we understand all of the audiences that this user is either a member of and no longer
[01:22] a member of or isn't a member of and now will be a member of.
[01:26] At that point in time, you can essentially use what we call a trigger to say, Eric is
[01:32] now joining segment A. Because Eric's joining segment A, I want to fire off that information,
[01:38] that profile, in these 16 different systems, or I want to fire this webhook, or I want
[01:42] to do something downstream based on that specific signal.
[01:47] So segment entry is one of those.
[01:49] I'm now a member of a segment.
[01:51] Segment exit is one of those signals.
[01:53] So now I'm no longer a member of that segment.
[01:56] Yep.
[01:57] A new trigger, I guess, technically, is flows.
[02:00] We'll go through and talk about flows specifically, but flows is somewhere between segment and
[02:06] attribute.
[02:07] It's a little bit different than its own special thing, but sort of like flow, stage, step
[02:11] membership is also a trigger, but it's all contained kind of within flows, which is in
[02:15] a segment context.
[02:17] But yeah, the triggers are super important in that it's how data gets out of the system
[02:25] and kind of understanding what you can do and what you can't do with them is important.
[02:28] And then also for our jobs, you have the non-trigger-based ones, which are sort of like on a cadence.
[02:32] We're going to go through and just do a scan of the users in that segment and pass it.
[02:35] That's still a possibility as well.
[02:37] But I think more and more, we try to do everything in that sort of real-time streaming system.
[02:44] Every time a trigger happens, it sends the information.
[02:47] Yeah, the other thing that we probably won't get into very deep here is there are sort
[02:59] of like candidates to be triggered later.
[03:02] So it's the concept of like, maybe we need to look at this profile again in the future.
[03:09] And that's another thing that we look at during the same evaluation.
[03:13] So if one of these custom rules is "inactive for five days", that would be like, as soon
[03:20] as they have any activity, it would make them a candidate to be inactive five days in the future.
[03:25] And that's another thing that happens all at the same time that we're calculating these triggers.
```
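
The transcript frames triggers as the mechanism that pushes data out of the system the moment something changes, for example firing a webhook when a profile enters or exits a segment. As a rough, non-authoritative sketch of the downstream side, the TypeScript below is a minimal webhook receiver for such a signal. The payload shape and the event names (`segment_entered`, `segment_exited`) are assumptions made for illustration, not the documented trigger schema; validate against the real payload you receive before relying on any field.

```typescript
// Minimal sketch of a downstream receiver for segment-membership triggers.
// ASSUMPTION: the payload shape and event names below are illustrative only,
// not the documented trigger schema.
import { createServer, IncomingMessage, ServerResponse } from "node:http";

interface TriggerPayload {
  event: "segment_entered" | "segment_exited"; // assumed event names
  segment: string;                             // e.g. "high_intent_visitors"
  profileId: string;                           // unified profile identifier
  attributes?: Record<string, unknown>;        // enriched profile attributes
}

// Route the signal to whatever downstream action you need: sync to an ESP,
// call another webhook, update a personalization store, and so on.
function handleTrigger(payload: TriggerPayload): void {
  if (payload.event === "segment_entered") {
    console.log(`Profile ${payload.profileId} entered ${payload.segment}`);
  } else {
    console.log(`Profile ${payload.profileId} exited ${payload.segment}`);
  }
}

const server = createServer((req: IncomingMessage, res: ServerResponse) => {
  if (req.method !== "POST") {
    res.writeHead(405);
    res.end();
    return;
  }
  let body = "";
  req.on("data", (chunk) => {
    body += chunk;
  });
  req.on("end", () => {
    try {
      handleTrigger(JSON.parse(body) as TriggerPayload);
      res.writeHead(200);
      res.end("ok");
    } catch {
      res.writeHead(400);
      res.end("bad payload");
    }
  });
});

server.listen(8080, () => console.log("Trigger receiver listening on :8080"));
```

Pointing a trigger's webhook destination at an endpoint like this in your training environment is an easy way to capture the concrete artifact the key takeaways below ask for.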

#### Key takeaways

- Connect **What are "triggers"?** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.
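
One point from the transcript worth pinning down before the quiz is the idea of "candidates to be triggered later": during the same evaluation pass, a profile can be scheduled for a future re-check, for example so an "inactive for five days" rule can fire even if no new event ever arrives for that profile. The sketch below illustrates that scheduling idea only; every name in it (`ReevaluationCandidate`, `sweepDueCandidates`, the five-day rule label) is hypothetical and not the product's internal implementation.

```typescript
// Sketch of scheduling a profile for re-evaluation after future inactivity.
// ASSUMPTION: all names and the in-memory queue are illustrative, not the
// product's internal pipeline.

interface ReevaluationCandidate {
  profileId: string;
  rule: string;          // e.g. "inactive_for_5_days" (hypothetical label)
  reevaluateAt: number;  // epoch ms at which to re-run segment evaluation
}

const FIVE_DAYS_MS = 5 * 24 * 60 * 60 * 1000;
const pending: ReevaluationCandidate[] = [];

// Called from the enrichment step whenever a profile records any activity:
// the activity itself is what makes the profile a future inactivity candidate.
function onProfileActivity(profileId: string, now: number = Date.now()): void {
  pending.push({
    profileId,
    rule: "inactive_for_5_days",
    reevaluateAt: now + FIVE_DAYS_MS,
  });
}

// A periodic sweep picks up candidates whose re-check time has arrived and
// re-runs evaluation for just those profiles; ones that are still inactive
// would then emit a segment-entry trigger for the "inactive" audience.
function sweepDueCandidates(now: number = Date.now()): string[] {
  const due = pending.filter((c) => c.reevaluateAt <= now);
  for (const candidate of due) {
    pending.splice(pending.indexOf(candidate), 1); // consume the candidate
  }
  return due.map((c) => c.profileId);
}

// Example: activity today makes the profile a candidate for the inactivity
// check five days from now.
onProfileActivity("profile-123");
console.log(sweepDueCandidates(Date.now() + FIVE_DAYS_MS + 1)); // ["profile-123"]
```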

### Lesson 16 — Data Insights: Building Profiles Quiz

<!-- ai_metadata: {"lesson_id":"16","type":"text","duration_minutes":1,"topics":["LMS","Knowledge check"]} -->

#### Lesson text

**This lesson is a knowledge check hosted in the Academy LMS.** This companion Markdown contains **no quiz questions, answers, scoring rules, or explanations**.

#### Key takeaways

- Connect **Data Insights: Building Profiles Quiz** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

## Resources & references

| Page | Companion Markdown |
| --- | --- |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--identity-resolution-recap | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--identity-resolution-recap.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--the-data-pipeline | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--the-data-pipeline.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--leveraging-common-schema | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--leveraging-common-schema.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--customizing-schema | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--customizing-schema.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--the-importance-of-identity-fields | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--the-importance-of-identity-fields.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--publishing-schema-version-control | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--publishing-schema-version-control.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-apis-csvs | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-apis-csvs.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-integrations | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-integrations.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--identifier-ranks | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--identifier-ranks.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-warehouse-data | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-warehouse-data.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--building-lookalike-models | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--building-lookalike-models.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--interest-scores-classification | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--interest-scores-classification.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--example-exploring-classified-content | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--example-exploring-classified-content.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--content-recommendations | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--content-recommendations.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--what-are-triggers | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--what-are-triggers.md |
| /courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--quiz | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--quiz.md |

## Supplement for indexing

### Content summary

This course dives deep into the technical architecture of customer profile creation and data unification. You'll master the data processing pipeline and learn advanced techniques for enriching customer understanding thro… This course dives deep into the technical architecture of customer profile creation and data unification. You'll master the data processing pipeline and learn advanced techniques for enriching customer understanding through multiple data sources and intelligent modeling. What You'll Learn This comprehensive session teaches you how to build robust, unified customer profiles using advanced identity resolution, data integration, and enrichment techniques. You'll gain practical experience with schema design, data mapping, and leveraging AI-powered features for deeper customer insights. What We'll Cover We'll explore how identity resolution automatically merges data fragments using shared identifiers like email addresses, demonstrating real-time profile unification in action. You'll master the data processing pipeline architecture, learning to create custom fields

### Retrieval tags

- Contentstack Academy
- data-insights-data-ingestion-profile-construction
- Identity
- Resolution
- recap
- The
- Data
- Pipeline
- Leveraging
- Common
- Schema
- Customizing
- fields
- mappings

### Indexing notes

Chunk at each "### Lesson NN — Title" heading; copy lesson_id and topics from the preceding HTML comment into chunk metadata for RAG filters.
Course slug: data-insights-data-ingestion-profile-construction. Union of lesson topic tokens: Identity, Resolution, recap, The, Data, Pipeline, Leveraging, Common, Schema, Customizing, fields, mappings, Importance, Fields, Publishing, Version, Control, Working, with, APIs, CSVs, Integrations, Identifier, Ranks, Warehouse, Building, Lookalike, Models, Interest, Scores, Classification, Example, Exploring, Classified, Content, Recommendations, What, are, triggers, Insights, Profiles, Quiz.
Do not embed or retrieve LMS-only quiz items or mastery exam answer keys from this export.

### Asset references

| Label | URL |
| --- | --- |
| Video thumbnail: Identity Resolution (recap) | `https://cdn.jwplayer.com/v2/media/qzUxiNrH/poster.jpg?width=720` |
| Video thumbnail: The Data Pipeline | `https://cdn.jwplayer.com/v2/media/iDsatXS7/poster.jpg?width=720` |
| Video thumbnail: Leveraging Common Schema | `https://cdn.jwplayer.com/v2/media/IpTB9DvQ/poster.jpg?width=720` |
| Video thumbnail: Customizing Schema (fields & mappings) | `https://cdn.jwplayer.com/v2/media/fGpn7GIn/poster.jpg?width=720` |
| Video thumbnail: The Importance of "Identity" Fields | `https://cdn.jwplayer.com/v2/media/6luUta7L/poster.jpg?width=720` |
| Video thumbnail: Publishing Schema & Version Control | `https://cdn.jwplayer.com/v2/media/PCj1HuBz/poster.jpg?width=720` |
| Video thumbnail: Working with APIs & CSVs | `https://cdn.jwplayer.com/v2/media/5OaqXTP0/poster.jpg?width=720` |
| Video thumbnail: Working with Integrations | `https://cdn.jwplayer.com/v2/media/gzn6uDlP/poster.jpg?width=720` |
| Video thumbnail: Identifier Ranks | `https://cdn.jwplayer.com/v2/media/FFIRINGI/poster.jpg?width=720` |
| Video thumbnail: Working w/ Warehouse Data | `https://cdn.jwplayer.com/v2/media/u1mD3rGg/poster.jpg?width=720` |
| Video thumbnail: Building Lookalike Models | `https://cdn.jwplayer.com/v2/media/qZ02gNuc/poster.jpg?width=720` |
| Video thumbnail: Interest Scores & Classification | `https://cdn.jwplayer.com/v2/media/iRhAPCRQ/poster.jpg?width=720` |
| Video thumbnail: Example: Exploring Classified Content | `https://cdn.jwplayer.com/v2/media/r5mEINso/poster.jpg?width=720` |
| Video thumbnail: Content Recommendations | `https://cdn.jwplayer.com/v2/media/zt7iltQ2/poster.jpg?width=720` |
| Video thumbnail: What are "triggers"? | `https://cdn.jwplayer.com/v2/media/AW2Mx2Bb/poster.jpg?width=720` |

### External links

| Label | URL |
| --- | --- |
| Contentstack Academy home | `https://www.contentstack.com/academy/` |
| Training instance setup | `https://www.contentstack.com/academy/training-instance` |
| Academy playground (GitHub) | `https://github.com/contentstack/contentstack-academy-playground` |
| Contentstack documentation | `https://www.contentstack.com/docs/` |
