# The Data Pipeline

### About this export

| Field | Value |
| --- | --- |
| **content_type** | lesson |
| **platform** | contentstack-academy |
| **source_url** | https://www.contentstack.com/academy/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--the-data-pipeline |
| **course_slug** | data-insights-data-ingestion-profile-construction |
| **lesson_slug** | data-insights-course-3--the-data-pipeline |
| **markdown_file_url** | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--the-data-pipeline.md |
| **generated_at** | 2026-04-28T06:55:44.145Z |

> Part of **[Data Ingestion & Profile Construction](https://www.contentstack.com/academy/courses/data-insights-data-ingestion-profile-construction)** on Contentstack Academy. **Academy MD v3** — structured for retrieval; no quiz or assessment keys.

<!-- ai_metadata: {"lesson_id":"02","type":"video","duration_seconds":334,"video_url":"https://cdn.jwplayer.com/previews/iDsatXS7","thumbnail_url":"https://cdn.jwplayer.com/v2/media/iDsatXS7/poster.jpg?width=720","topics":["The Data Pipeline"]} -->

#### Video details

#### At a glance

- **Title:** 10-data-insights-the-data-pipeline
- **Duration:** 5m 34s
- **Media link:** https://cdn.jwplayer.com/previews/iDsatXS7
- **Publish date (unix):** 1752870616

#### Streaming renditions

- application/vnd.apple.mpegurl (HLS)
- audio/mp4 · AAC audio · 113538 bps (≈114 kbps)
- video/mp4 · 180p · 139245 bps (≈139 kbps)
- video/mp4 · 270p · 155091 bps (≈155 kbps)
- video/mp4 · 360p · 169179 bps (≈169 kbps)
- video/mp4 · 406p · 178649 bps (≈179 kbps)
- video/mp4 · 540p · 213373 bps (≈213 kbps)
- video/mp4 · 720p · 267723 bps (≈268 kbps)
- video/mp4 · 1080p · 425608 bps (≈426 kbps)

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/iDsatXS7-120.vtt`

#### Transcript

So, that all happens in real time. The reason I wanted to refresh and show that again is that it raises the question: how does the system have the logic to understand, for all of these different data sources and streams, how they ultimately map to the profile? If there's a UID or a first name in each of those streams, which one wins? Is it the first in, is it the last in, what's that logic?

The process that the data, every single event, takes as it goes through Lytics is that it starts in a data stream. Under Data Pipeline you have access to all of your streams. The default stream is where web information goes by default. You can customize that in the JavaScript tag if you want to, if there's some unusual use case, or maybe you're collecting from different websites and want to map it a little differently, but all of that data automatically goes to this raw event stream. From there, the system uses what we call mappings to take the data that comes in from a given stream and translate it. We'll use email as an example: inside Building Profiles, under Schema, you have access to your fields and mappings. If you look for email under mappings, you'll see a number of different streams and rules that tell the system how to handle the event. It's a little easier to see from the field itself, so if I click into the email field, you'll see all of its mappings. On the default stream there are a few different ways we map it, but in most cases it just says: if I see the raw key `email`, all lowercase, I'm going to do some normalization, verify that it's an actual valid email, and then push that up to the email field. So the mapping is the translation layer between raw data that comes into a stream and the field it ultimately gets sent to.
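The normalize-validate-push behavior described for the email mapping can be sketched in a few lines. This is an illustrative model only; the function name, the simple regex, and the returned dict shape are assumptions for this sketch, not the product's actual mapping API.

```python
import re

# Very loose email shape check for illustration; real validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def map_email(event: dict) -> dict:
    """Translate a raw stream event into profile-field updates."""
    raw = event.get("email")          # the raw key, all lowercase
    if raw is None:
        return {}
    normalized = raw.strip().lower()  # normalization step
    if not EMAIL_RE.match(normalized):
        return {}                     # invalid emails never reach the field
    return {"email": normalized}      # push up to the profile's email field
```

For example, an event carrying `"  Jane@Example.COM "` would resolve to the field value `"jane@example.com"`, while an event with no valid email contributes nothing to the profile.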
The field itself, if I go back into email, has all of the controls on how that merge happens. We'll run through all of these and build a few from scratch, but for email address, you define a data type; in the case of email, it's just a string. Essentially every data type you could want is supported, but here it's a string. This field is also flagged as an identity key. That's what tells the system that if I see email in the default stream, and I also see email in this Mailchimp stream, I can merge those fragments together to build the profile. Likewise, if I see an event with an email in just the default stream, and then another event in that stream with the same email, just like we did in our demo, it can merge those fragments together into that unified profile. So this identity key flag is really important. It's also one of the ways customers can get themselves in trouble: being overzealous about what actually counts as an identity key causes you to overmerge profiles into one big super profile. The merge operator is how you actually handle the data coming in. We'll come back to this on a field that hasn't been predefined so I can show you the different options, but you can say that for a particular field, maybe I only want the first value ever seen. Maybe I want the latest value. Maybe it's an array and I want to merge the values together. All sorts of merge operations are controlled at the field level, and that's what tells the system how to bring the information together. Otherwise you'd have one first name over here that differs from another first name over there, and you'd end up with a big, nebulous, unusable profile. The merge operators are really important to how that unified profile actually gets resolved and surfaced to the end user.
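The three merge behaviors mentioned (keep the first value, take the latest, union arrays) can be modeled with a small dispatcher. Operator names and data shapes here are assumptions for illustration; the product's actual configuration keys may differ.

```python
def merge_field(operator: str, existing, incoming):
    """Resolve one field across profile fragments per its merge operator."""
    if existing is None:
        return incoming            # nothing stored yet: accept the value
    if operator == "first":
        return existing            # keep the first value ever seen
    if operator == "latest":
        return incoming            # overwrite with the newest value
    if operator == "union":
        merged = list(existing)    # array field: accumulate distinct values
        for v in incoming:
            if v not in merged:
                merged.append(v)
        return merged
    raise ValueError(f"unknown merge operator: {operator}")
```

This is where the "which one wins" logic from the start of the lesson lives: the same two fragments produce different unified profiles depending on the operator chosen for each field.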
There's a variety of other settings when we actually build a field: the format type (for example, do we want to Base64-encode it?), any length characteristics, and so on. We'll also come back to cap and keep, which helps us limit the size of arrays over time and how long data hangs around. So, all of that said: streams receive data; mappings translate that data and send it to a field; and the field is ultimately the rule master for how that data gets represented on the profile. These fields are what you see on an actual profile, so the profile fields are driven from that schema, and that's the resolved piece. Anything at a high level that I missed, Eric?

Just one minor clarification. You used the term cap and keep, which is our internal shorthand for capacity and number of days to keep. It's a restriction we put on fields; we often shorten it to cap and keep, so it might slip out when we're talking, but it means the capacity, the number of elements that can be stored in the field, and the keep, the number of days to keep each element.
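Cap and keep, as Eric defines it, bounds an array field in two dimensions: how many elements it holds (capacity) and how long each element survives (days to keep). A minimal sketch, assuming elements are stored as (value, unix-timestamp) pairs, which is an illustrative representation and not the product's internal format:

```python
import time

def apply_cap_and_keep(entries, capacity, keep_days, now=None):
    """entries: list of (value, unix_timestamp); newest entries win."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400                        # seconds per day
    fresh = [(v, ts) for v, ts in entries if ts >= cutoff]  # drop expired elements
    fresh.sort(key=lambda e: e[1])                          # oldest first
    return fresh[-capacity:]                                # retain the newest N
```

Applied on every update, this keeps array fields from growing without bound over a profile's lifetime, which is exactly the "limit the size of arrays over time" behavior described above.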

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:18.320
So, that all happens in real time.

2
00:00:18.320 --> 00:00:23.000
Why I wanted to kind of refresh and show that again is it speaks to then, okay, how does

3
00:00:23.000 --> 00:00:28.560
it have the logic to understand for all of these different sort of data sources and streams,

4
00:00:28.560 --> 00:00:33.280
how do they ultimately map to the profile, if there's a UID or a first name in each of

5
00:00:33.280 --> 00:00:37.440
those streams, which one wins, right, so like is it the first in, is it the last in, what's

6
00:00:37.440 --> 00:00:39.480
that logic?

7
00:00:39.480 --> 00:00:46.280
So really the process that the data, so every single event, takes as it goes through Lytics

8
00:00:46.280 --> 00:00:52.800
is it starts in a data stream, so under data pipeline you have access to all of your streams.

9
00:00:52.800 --> 00:00:56.840
The default stream is where the web information will go by default.

10
00:00:56.840 --> 00:01:00.600
You can customize that in the JavaScript tag if you want to, if there's some obscure

11
00:01:00.600 --> 00:01:03.480
use case or maybe you're collecting stuff from different websites and you want to map

12
00:01:03.480 --> 00:01:08.520
it a little bit differently, but all of that data automatically goes to this sort of raw

13
00:01:08.520 --> 00:01:10.800
event stream.

14
00:01:10.800 --> 00:01:16.200
From there, it uses mappings, is what we call them, to say, okay, I want to take the data

15
00:01:16.200 --> 00:01:21.840
that comes in from this stream, so we'll use email as an example inside of building profiles

16
00:01:21.840 --> 00:01:24.960
under schema, you have access to your fields and mappings.

17
00:01:25.080 --> 00:01:30.240
So if I, for instance, look for email under mappings, you'll see that there's a number

18
00:01:30.240 --> 00:01:36.400
of different streams and ways that this sort of system tells it how to handle the event.

19
00:01:36.400 --> 00:01:40.720
If I, for instance, go to the default stream, and actually maybe I'll just go into the field,

20
00:01:40.720 --> 00:01:43.520
it's a little bit easier to see.

21
00:01:43.520 --> 00:01:50.240
So if I click on email, you'll see all of the mappings, and you can see on the default

22
00:01:50.240 --> 00:01:54.120
stream, there's a few different ways that we map it, but in most cases it's just taking,

23
00:01:54.280 --> 00:01:58.840
okay, if I see email, just like this, the raw key, all lowercase, I'm going to do some

24
00:01:58.840 --> 00:02:02.760
normalization, verify that it's an actual valid email, and then I'm going to push that

25
00:02:02.760 --> 00:02:04.960
up to the email field.

26
00:02:04.960 --> 00:02:09.240
So the mapping is sort of that translation layer between raw data that comes into a stream

27
00:02:09.240 --> 00:02:13.280
and how it can ultimately send it to a field.

28
00:02:13.280 --> 00:02:20.360
The field itself on it, if I go back into email, has all of the controls on how that

29
00:02:20.360 --> 00:02:21.880
merge happens.

30
00:02:21.880 --> 00:02:24.840
So for email in particular, and we'll run through all of these and go through a few

31
00:02:24.840 --> 00:02:30.880
different examples of building them from scratch, but for email address, you define a data type.

32
00:02:30.880 --> 00:02:32.640
So in the case of email, it's just a string.

33
00:02:32.640 --> 00:02:37.440
There's essentially every data type that you could ever possibly want that we can support,

34
00:02:37.440 --> 00:02:40.840
but in this case, it's just a string.

35
00:02:40.840 --> 00:02:43.600
In this case, it's flagged as an identity key.

36
00:02:43.600 --> 00:02:47.560
So again, that's what's telling the system that if I see email in the default stream,

37
00:02:47.560 --> 00:02:53.960
and I also see email in this MailChimp stream, I can merge those fragments together to build

38
00:02:53.960 --> 00:02:54.960
the profile.

39
00:02:54.960 --> 00:02:58.520
Likewise, if I see an event with an email in just the default stream, and then another

40
00:02:58.520 --> 00:03:03.000
event in that stream with email, just like we did in our demo, it can merge those fragments

41
00:03:03.000 --> 00:03:05.580
together to build that unified profile.

42
00:03:05.580 --> 00:03:09.000
So this flag for an identifier key is really, really important.

43
00:03:09.000 --> 00:03:14.840
It's also one of the ways that customers can get themselves in trouble by being overzealous

44
00:03:14.840 --> 00:03:19.200
on what actually is an identity key, which causes you to overmerge profiles into one

45
00:03:19.200 --> 00:03:22.120
big super profile.

46
00:03:22.120 --> 00:03:26.680
The merge operator is how you actually handle the data coming in.

47
00:03:26.680 --> 00:03:29.640
We'll come back to this one on a field that hasn't been predefined so that I can actually

48
00:03:29.640 --> 00:03:33.960
show you the different options, but you can actually say that, okay, for this particular

49
00:03:33.960 --> 00:03:38.120
field, maybe I only want the first value that it's ever seen.

50
00:03:38.120 --> 00:03:39.480
Maybe I want the latest value.

51
00:03:39.480 --> 00:03:42.440
Maybe it's an array, and I want to merge them together.

52
00:03:42.440 --> 00:03:46.560
All sorts of different merge operations are controlled at the field level, which is what

53
00:03:46.560 --> 00:03:49.880
tells you how to make that information come together.

54
00:03:49.880 --> 00:03:53.320
Otherwise, you would have first name over here, this different from this first name

55
00:03:53.320 --> 00:03:57.960
over here, and you'd have this big, crazy, nebulous, unusable profile.

56
00:03:57.960 --> 00:04:03.360
The merge operators are really important in how that unified profile actually gets resolved

57
00:04:03.360 --> 00:04:07.000
and surfaced to the end user.

58
00:04:07.000 --> 00:04:09.760
There's a variety of things that we can go through when we actually build a field on

59
00:04:09.760 --> 00:04:13.080
the format type to, you know, do we want to base 64 encode it?

60
00:04:13.080 --> 00:04:15.080
Do you want to set any type?

61
00:04:15.080 --> 00:04:17.000
Length sort of characteristics.

62
00:04:17.000 --> 00:04:23.080
We'll come back to kind of cap and keep helps us sort of limit the size of arrays over time

63
00:04:23.080 --> 00:04:24.360
and how long data hangs around.

64
00:04:24.360 --> 00:04:29.760
So all of that said, essentially, streams receive data.

65
00:04:29.760 --> 00:04:33.920
Mappings allow that data to be translated and sent to a field, and then a field is ultimately

66
00:04:33.920 --> 00:04:39.360
kind of like the rule master on how that data gets represented on the profile.

67
00:04:39.800 --> 00:04:45.040
These fields are ultimately what you see over here on an actual profile when you see it.

68
00:04:45.040 --> 00:04:51.400
So the profile fields are driven from that schema, and that's kind of the resolved piece

69
00:04:51.400 --> 00:04:53.520
there.

70
00:04:53.520 --> 00:04:55.560
Anything at a high level that I missed, Eric?

71
00:04:55.560 --> 00:04:57.640
Just one minor clarification.

72
00:04:57.640 --> 00:05:07.920
Use the term cap and keep, which is our internal term for capacity and number of days to keep.

73
00:05:07.960 --> 00:05:13.960
It's a restriction that we put on fields to, we'll shorten it to cap and keep often, so

74
00:05:13.960 --> 00:05:20.200
it might slip out when we're talking, but it means the capacity, which is the number

75
00:05:20.200 --> 00:05:24.120
of things that can be stored in a map and keep the number of days to keep each element.

```


#### Key takeaways

- Connect **The Data Pipeline** back to your stack configuration before moving to the next module.
- Capture one concrete artifact (screenshot, Postman call, or code snippet) that proves the step works in your environment.
- Re-read the delivery versus management boundary for anything you changed in the entry model.

## Supplement for indexing

### Content summary

The Data Pipeline: lesson 02 of Data Ingestion & Profile Construction (data-insights-data-ingestion-profile-construction). Covers how events flow from data streams through mappings into schema fields, and how identity keys and merge operators resolve the unified profile.

### Retrieval tags

- data-insights-data-ingestion-profile-construction
- lesson 02
- The Data Pipeline
- data-insights-data-ingestion-profile-construction lesson

### Indexing notes

Index this lesson as a primary chunk tagged with lesson_id "02" and topic "The Data Pipeline".
Parent course slug: data-insights-data-ingestion-profile-construction. Use asset_references URLs as thumbnail hints in search results when present.
Never surface LMS quiz content or assessment answers from this file.

### Asset references

| Label | URL |
| --- | --- |
| Video thumbnail: The Data Pipeline | `https://cdn.jwplayer.com/v2/media/iDsatXS7/poster.jpg?width=720` |

### External links

| Label | URL |
| --- | --- |
| Contentstack Academy home | `https://www.contentstack.com/academy/` |
| Training instance setup | `https://www.contentstack.com/academy/training-instance` |
| Academy playground (GitHub) | `https://github.com/contentstack/contentstack-academy-playground` |
| Contentstack documentation | `https://www.contentstack.com/docs/` |
