# Working w/ Warehouse Data

### About this export

| Field | Value |
| --- | --- |
| **content_type** | lesson |
| **platform** | contentstack-academy |
| **source_url** | https://www.contentstack.com/academy/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-warehouse-data |
| **course_slug** | data-insights-data-ingestion-profile-construction |
| **lesson_slug** | data-insights-course-3--working-with-warehouse-data |
| **markdown_file_url** | /academy/md/courses/data-insights-data-ingestion-profile-construction/data-insights-course-3--working-with-warehouse-data.md |
| **generated_at** | 2026-04-28T06:55:44.156Z |

> Part of **[Data Ingestion & Profile Construction](https://www.contentstack.com/academy/courses/data-insights-data-ingestion-profile-construction)** on Contentstack Academy. **Academy MD v3** — structured for retrieval; no quiz or assessment keys.

<!-- ai_metadata: {"lesson_id":"10","type":"video","duration_seconds":579,"video_url":"https://cdn.jwplayer.com/previews/u1mD3rGg","thumbnail_url":"https://cdn.jwplayer.com/v2/media/u1mD3rGg/poster.jpg?width=720","topics":["Working","Warehouse","Data"]} -->

#### Video details

#### At a glance

- **Title:** 18-data-insights-warehouses
- **Duration:** 9m 39s
- **Media link:** https://cdn.jwplayer.com/previews/u1mD3rGg
- **Publish date (unix):** 1752878424

#### Streaming renditions

- application/vnd.apple.mpegurl (HLS)
- audio/mp4 · AAC audio · 113588 kbps
- video/mp4 · 180p · 135498 kbps
- video/mp4 · 270p · 148923 kbps
- video/mp4 · 360p · 162108 kbps
- video/mp4 · 406p · 170215 kbps
- video/mp4 · 540p · 200907 kbps
- video/mp4 · 720p · 245465 kbps
- video/mp4 · 1080p · 382520 kbps

#### Timed text tracks (delivery)

- **thumbnails:** `https://cdn.jwplayer.com/strips/u1mD3rGg-120.vtt`

#### Transcript

So, we talked about integrations that pull from marketing tools. We talked about all of their APIs, and our JavaScript tag, which uses our APIs. The thing that we haven't touched on, which I think is pretty easy to demo (though it can certainly be a bigger conversation if you go into the weeds), is the other source of data that is super common to pull in: your warehouse. We have integrations in the data pipeline, so if you just want to stream an entire table in from your warehouse without any filtering, you can do that through a connection, just like we did with MailChimp. But we also have a special tool called Cloud Connect that allows you to connect directly to your warehouse. We support all of the major warehouses: BigQuery, Snowflake, Redshift, et cetera. What it allows you to do is create a connection, which is very similar to what we did with MailChimp, so I won't walk through that; since this is BigQuery, we just use a JWT to get authorization into BigQuery. What it then allows you to do is build data models, which are essentially SQL queries against a particular table or set of tables inside of your warehouse, to pull that data into the profile in a unique way. This is why we'll have a bigger conversation about what it actually means for the profile and how it works, but it maps things in a pretty unique way in that it doesn't go through a data stream. It doesn't have to adhere to the stream mapping and those sorts of rules. It creates its own new set of fields, which also go away if you lose access to the data. This comes up a lot when customers want to add scores or information to a profile, but they don't want to create a duplicate copy, stream it in, and take on all of the inherent risks there. They just want to place something on a profile and override some of those settings. In this sample BigQuery instance, it will actually pull up a SQL editor. I have a very simple query that I wrote.
You can just paste it in and say: I want to select everybody from the sample data set, with email, first name, last name, and an average annual revenue, from sample customers. You can test the query; it will actually query BigQuery, in this case, in real time. Then, as you connect it, you describe how you want to map that data to a profile. We'll just call it BQ test. The only thing you really have to choose is the primary key. From my data set, I want to map the email field that I just selected. Again, that's the only required context: you still have to tell it how you're going to map this Cloud Connect data, this warehouse data, to a profile. So I want to merge it based on the email address; I want to merge that with the email field. Then you can optionally choose to pull in additional information, so I'll add first name, last name, and average annual revenue. With Cloud Connect, because it's less of a real-time thing and more of a query-based thing, you then choose the cadence of how often you want it to run. For tests and demos, I always do an hour. In reality, you probably don't want to spam your warehouse instance with big, expensive queries every single hour, so most customers are going to do 24 hours, or 48 hours, or whatever it may be. You can also flag whether you want to create net-new profiles from that data. If you don't check this box, it will only map to the profiles that already exist and never create net-new ones. Because our sample database isn't built of Game of Thrones characters, I'll check create new so that it creates those profiles. Then, ultimately, you create this data model, it will query that BigQuery database on the cadence we described, and those profiles will come into the UI. I don't know how long that will take, so in our next session I'll be sure to show you what that data looks like on a profile, because it looks a little bit different.
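A data model boils down to a scheduled SQL query. Here is a sketch of what the demo's query might look like; the dataset, table, and column names are invented stand-ins, since the lesson only names the fields verbally. A tiny helper shows which columns the model selects:

```python
# Hypothetical stand-in for the demo's data-model query: select a key
# (email) plus the fields to append to profiles. Dataset, table, and
# column names are invented for illustration.
DATA_MODEL_QUERY = """
SELECT
    email,
    first_name,
    last_name,
    avg_annual_revenue
FROM `sample_dataset.sample_customers`
"""


def columns_selected(query: str) -> list[str]:
    """Naively extract column names from a simple SELECT list."""
    select_list = query.split("FROM")[0].replace("SELECT", "")
    return [col.strip() for col in select_list.split(",") if col.strip()]
```

The first column doubles as the primary key when the results are mapped to profiles, which is the one required choice in the Cloud Connect setup.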
All of the segmentation and activation capabilities are exactly the same. But I just wanted to touch on this quickly to introduce the final piece of where data can come from. The Cloud Connect product represents a somewhat different method for getting data into Lytics. All of the other methods that we talked about (the JavaScript tag collection, the APIs, all of our background jobs) use our streaming pipeline: data goes into a stream, a stream maps to a field, and fields ultimately show up on the profile. Cloud Connect, just to cover this part again, is quite a bit different in that it doesn't use our streaming pipeline to get data onto the profile. It has a whole different mechanism that we can talk about at length; at some point, Eric can go into the details there, but it essentially bypasses that streaming pipeline and injects the results of the query directly onto the profile. It's really useful for a few different reasons, but the context in which it comes up most often is security and control. Think about a situation where customer A wants to share a subset of data with one of their partners, with an agency, with another customer, whatever it may be, but they don't want to give access to the raw data, something the other party could copy and own forever. All of the warehouses have a different methodology for how you can share and unlock that capability; in BigQuery, you can essentially give access to a specific dataset without necessarily exposing all of the raw data. Because of how Cloud Connect works, where it doesn't stream that data in, it isn't creating a hard copy that gets written to our system and backed up in our files; it's creating a less persistent, temporary store for that data.
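That less persistent store can be modeled as a refresh loop: re-apply the query results on each scheduled run, and strip the model's fields if warehouse access has been revoked. This is only a toy sketch under that assumption; Lytics' real mechanism is internal, and the function and field names here are invented:

```python
def refresh_data_model(profiles, run_query, fields):
    """One scheduled refresh of a warehouse-backed data model.

    If the warehouse still grants access, re-apply the query results;
    if access has been lifted, remove the model's fields so no stale
    copy of the shared data lingers on profiles.
    """
    try:
        rows = run_query()
    except PermissionError:
        # Access was revoked upstream: clean up all the way through.
        for profile in profiles:
            for field in fields:
                profile.pop(field, None)
        return profiles
    by_email = {row["email"]: row for row in rows}
    for profile in profiles:
        row = by_email.get(profile.get("email"))
        if row is not None:
            for field in fields:
                if field in row:
                    profile[field] = row[field]
    return profiles
```

The key design point this mirrors is that the warehouse stays the source of truth: losing query access removes the data, rather than leaving a copy behind.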
If customer A wants to unshare that data from their partner, the other customer, whatever it may be, and they lift that access so that nobody can query that database anymore, then the next time the query runs, it will actually clean up the system all the way through, so you don't have legacy data lingering in that stream system. So it comes up often when security, data control, or data access is a key part of the conversation. And that's because, like I said, it doesn't actually stream the data into Lytics in the same way. To quickly recap Cloud Connect: it's under the data pipeline, the same place as all of our jobs and the other collection and profile-building tools. Within Cloud Connect, you have the idea of connections, which is just the connection to the database; we won't rehash that. You have the data model, which is essentially the query that's going to run. In our case, in the last conversation, we built this sample query. It just pulls in a set of sample users (first name, last name, email, customer type) and then an example of a score that might be in your warehouse. The thing we didn't totally cover was how you then get this Cloud Connect data, this warehouse query, into Lytics and store it on a profile so that it functions fundamentally like one of our normal attributes. There's a publishing process in this. I think we might have briefly touched on this, if I recall, but we didn't actually complete it, and we definitely didn't show it on the profile. With Cloud Connect, you don't have to have everything mapped; you don't have to have all of the attributes configured the way the streaming pipeline does. The only question you have to answer is essentially how to map this data to a single profile.
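That one mapping decision (merge on a single key, optionally append extra fields, optionally create net-new profiles) can be sketched as a plain merge over dict-shaped records. The shapes and names below are illustrative, not Lytics internals:

```python
def apply_data_model(profiles, rows, key="email",
                     fields=("first_name", "last_name"),
                     create_new=False):
    """Merge warehouse rows into profiles, matching on `key`.

    Mirrors the demo's choices: merge on email, append the selected
    fields, and only create brand-new profiles when `create_new` is
    True (the checkbox in the Cloud Connect setup).
    """
    by_key = {p[key]: p for p in profiles if key in p}
    for row in rows:
        profile = by_key.get(row[key])
        if profile is None:
            if not create_new:
                continue  # box unchecked: map to existing profiles only
            profile = {key: row[key]}
            profiles.append(profile)
            by_key[row[key]] = profile
        for field in fields:
            if field in row:
                profile[field] = row[field]
    return profiles
```

With `create_new=False`, rows that don't match an existing identifier are simply dropped, which is exactly the behavior described for the unchecked box.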
So you have to pick the key from the data you're querying, say in BigQuery, and the key you want to write it to; that is, which identifier inside of Lytics you want to associate that data with. In this one that I've already pre-configured, we're basically saying: there's a bunch of stuff in this query, but all we want to do is find anybody that matches on email, and if they match on email, we append this information to their profile. Here's the thing that is unique about Cloud Connect. If I go to a profile, for instance, this one that I had pulled up, this is one of the records in that sample data set. They have a profile just like any other user, regardless of where it was generated from. But if you scroll down and look at the data that came in from that particular data model on their profile, you'll see a few different things. One, you'll see the raw attributes that we pulled in, first name and last name, but they're independent of the other first name and last name fields that are already in the schema. And then, and this is actually the more useful part, there's a unique membership attribute that now gets added. Back to that example: say I'm Nike.com and I'm sharing data with a partner, and I want to give them access to everybody with a high propensity to buy women's running shoes, or whatever it may be, but I don't want to give them all of the data I needed in order to build that score and that list. I just want to give them that Boolean yes or no. That's where this membership flag comes in: the partner doesn't need access to all of the underlying information used to make the calculation. You're just pulling this information in temporarily, for as long as you have access. And when you go to build a segment, it functions at full scale, just like all of the other attributes in the system.
So you can mix and match; there are no limitations there. The one thing to be aware of is the use case where this particular method for pulling data in from a warehouse is super useful: access control. I want to be able to pull things away; I don't want that data to persist; I don't want to store it somewhere. Whereas if you just want to pull everything from a particular table, or whatever it may be, the streaming method, which also has a warehouse connection, is available through our back-end integration.
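The membership attribute described above, the shareable yes/no, could be derived along these lines. The score field name and threshold are made up for the example:

```python
def membership_flag(row, score_field="propensity_score", threshold=0.7):
    """Collapse a sensitive score into a shareable Boolean.

    A partner granted this flag sees only yes/no membership, never the
    underlying inputs used to compute the score.
    """
    return row.get(score_field, 0.0) >= threshold
```

In a segment builder, a flag like this behaves the same as any other attribute, while the raw score and its inputs stay in the warehouse.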

#### Subtitles (WebVTT)

```webvtt
WEBVTT

1
00:00:00.000 --> 00:00:19.280
So, we talked about integrations that pull from marketing tools.

2
00:00:19.280 --> 00:00:23.260
We talked about all their APIs, our JavaScript tag, which uses our APIs.

3
00:00:23.260 --> 00:00:26.160
The thing that we haven't touched on that I think is pretty easy to demo, it certainly

4
00:00:26.160 --> 00:00:28.800
can be a bigger conversation to go into the weeds.

5
00:00:28.800 --> 00:00:32.960
But the other source of data that is super common to pull in is from your warehouse.

6
00:00:32.960 --> 00:00:37.840
So we have integrations in the data pipeline, if you just want to stream the entire table

7
00:00:37.840 --> 00:00:41.440
in and not have any filtering from your warehouse, you can do that through the connection just

8
00:00:41.440 --> 00:00:42.920
like we did MailChimp.

9
00:00:42.920 --> 00:00:47.240
But we have a special tool called Cloud Connect that allows you to actually connect to your

10
00:00:47.240 --> 00:00:48.240
warehouse.

11
00:00:48.240 --> 00:00:52.400
We support all of the major warehouses, BigQuery, Snowflake, Redshift, et cetera.

12
00:00:52.400 --> 00:00:56.360
What it allows you to do is create a connection, which is very similar to what we did with

13
00:00:56.360 --> 00:00:57.920
MailChimp, so I won't walk through that.

14
00:00:57.920 --> 00:01:03.040
Like it's a BigQuery, we just use a JWT to get authorization into BigQuery.

15
00:01:03.040 --> 00:01:08.120
But what it allows you to then do is build data models, which is essentially a SQL query

16
00:01:08.120 --> 00:01:14.240
against that particular table or set of tables inside of your warehouse to pull that data

17
00:01:14.240 --> 00:01:16.360
in uniquely to the profile.

18
00:01:16.360 --> 00:01:19.920
It goes through, and this is why we'll have a bigger conversation on what it actually

19
00:01:19.920 --> 00:01:24.600
means to the profile and how it works, but it maps things in a pretty unique way in that

20
00:01:24.600 --> 00:01:26.640
it doesn't go through a data stream.

21
00:01:26.640 --> 00:01:31.480
It doesn't necessarily have to adhere to the mapping and the sort of those rules.

22
00:01:31.480 --> 00:01:37.360
It creates its own new set of fields that also go away if you lose access to the data.

23
00:01:37.360 --> 00:01:42.160
So this comes up a lot when customers want to add scores or information to a profile,

24
00:01:42.160 --> 00:01:45.280
but they don't want to create a duplicate and copy and stream it in and go through all

25
00:01:45.280 --> 00:01:48.060
of the inherent kind of risks there.

26
00:01:48.060 --> 00:01:51.820
They just want to plop something on a profile and kind of like override some of those settings.

27
00:01:51.820 --> 00:01:57.580
It allows you to go in, in this sample BigQuery instance, it'll actually pull up a SQL editor.

28
00:01:57.580 --> 00:02:01.220
I have a very simple query that I wrote.

29
00:02:01.220 --> 00:02:04.980
You can just paste it and say, I want to select everybody from the sample data set with email

30
00:02:04.980 --> 00:02:08.880
first name, last name, and an average annual revenue from sample customers.

31
00:02:08.880 --> 00:02:10.300
You can test the query.

32
00:02:10.300 --> 00:02:14.860
It'll actually query that, in this case, BigQuery in real time.

33
00:02:14.860 --> 00:02:18.940
And then as you connect it, you can actually then describe how you want to map that to

34
00:02:18.940 --> 00:02:19.940
a profile.

35
00:02:20.060 --> 00:02:23.900
So we'll just say BQ test.

36
00:02:23.900 --> 00:02:27.120
The only thing really you have to choose is the primary key.

37
00:02:27.120 --> 00:02:30.660
So from my data set, I want to map the email that I just selected.

38
00:02:30.660 --> 00:02:32.720
Again, that's the only kind of context.

39
00:02:32.720 --> 00:02:36.380
You still have to tell it how you're going to map this Cloud Connect data, this warehouse

40
00:02:36.380 --> 00:02:38.980
data to a profile.

41
00:02:38.980 --> 00:02:42.220
So I want to merge it based on the email address.

42
00:02:42.220 --> 00:02:46.340
I want to merge that with the email field.

43
00:02:46.340 --> 00:02:49.980
And then you can choose optionally if you want to pull in additional information.

44
00:02:49.980 --> 00:02:53.700
So I want to add first name, last name, and average annual revenue.

45
00:02:53.700 --> 00:02:58.660
With Cloud Connect, because it's less of a real-time thing, more of a query-based thing,

46
00:02:58.660 --> 00:03:02.320
you can then choose the cadence of how often you want that to run.

47
00:03:02.320 --> 00:03:04.260
For tests and demo, I always do an hour.

48
00:03:04.260 --> 00:03:09.340
In reality, you probably don't want to just spam your warehouse instance with these big

49
00:03:09.340 --> 00:03:11.220
expensive queries every single hour.

50
00:03:11.220 --> 00:03:14.980
So most customers are going to do 24 hours or 48 hours or whatever it may be.

51
00:03:15.540 --> 00:03:16.940
But you have that.

52
00:03:16.940 --> 00:03:20.740
You can flag if you want to create net new profiles from that data.

53
00:03:20.740 --> 00:03:24.540
If you don't check this box, it's only going to map to the profiles that exist and never

54
00:03:24.540 --> 00:03:26.220
create net new ones.

55
00:03:26.220 --> 00:03:30.140
Because our sample database isn't built of Game of Thrones characters, I'll create new

56
00:03:30.140 --> 00:03:32.260
so that it creates those profiles.

57
00:03:32.260 --> 00:03:35.940
Then ultimately, you create this data model, and it's going to go through, query that BigQuery

58
00:03:35.940 --> 00:03:39.380
database on that cadence that we described.

59
00:03:39.380 --> 00:03:43.060
And then ultimately, those profiles will come in to the UI.

60
00:03:43.460 --> 00:03:46.460
I don't know how long that will take.

61
00:03:46.460 --> 00:03:50.500
So I think in our next session, I'll be sure to show you what that data looks like on a

62
00:03:50.500 --> 00:03:52.620
profile because it looks a little bit different.

63
00:03:52.620 --> 00:03:57.100
All of the segmentation and activation capabilities are exactly the same.

64
00:03:57.100 --> 00:04:01.820
But I just wanted to touch quickly just to introduce the idea that's the final piece

65
00:04:01.820 --> 00:04:04.260
of where data can come from.

66
00:04:04.260 --> 00:04:08.820
The Cloud Connect product represents a little bit of a different method for getting data

67
00:04:08.820 --> 00:04:10.580
into Lytics.

68
00:04:10.580 --> 00:04:15.060
All of the other methods that we talked about, the JavaScript tag collection, the APIs, all

69
00:04:15.060 --> 00:04:19.340
of our background jobs, all of that kind of stuff uses our streaming pipeline.

70
00:04:19.340 --> 00:04:23.660
So it goes into a stream, a stream maps to a field, fields ultimately show up on the

71
00:04:23.660 --> 00:04:25.220
profile.

72
00:04:25.220 --> 00:04:29.980
Cloud Connect, just to kind of re-cover this part, is quite a bit different, actually,

73
00:04:29.980 --> 00:04:34.220
in that it doesn't use our streaming pipeline to actually get data onto the profile.

74
00:04:34.220 --> 00:04:38.820
It has a whole different mechanism that we can talk at length at.

75
00:04:38.820 --> 00:04:42.700
At some point, Eric can go into details there, but it essentially bypasses that streaming

76
00:04:42.700 --> 00:04:47.980
pipeline and injects the results of that query directly onto the profile.

77
00:04:47.980 --> 00:04:52.420
It's really useful for a few different reasons, but the one context that it comes up often

78
00:04:52.420 --> 00:04:54.620
is around sort of security and control.

79
00:04:54.620 --> 00:05:00.980
So think about like a situation where customer A wants to share a subset of data with one

80
00:05:00.980 --> 00:05:04.740
of their partners, with an agency, with another customer, whatever it may be, but they don't

81
00:05:04.740 --> 00:05:08.060
want to just give access to the raw data, something that they can like copy and own

82
00:05:08.060 --> 00:05:09.060
forever.

83
00:05:09.060 --> 00:05:13.900
So all of the warehouses have a different kind of methodology for how you can share

84
00:05:13.900 --> 00:05:16.060
and unlock that capability.

85
00:05:16.060 --> 00:05:21.100
In BigQuery, you essentially can give access to a specific dataset, and then you don't

86
00:05:21.100 --> 00:05:24.100
have to necessarily expose all of the raw data.

87
00:05:24.100 --> 00:05:28.780
With how Cloud Connect works, where it doesn't stream that data in, it's not creating a sort

88
00:05:28.780 --> 00:05:33.020
of like hard copy that gets written to our system, that gets backed up in our files,

89
00:05:33.020 --> 00:05:38.060
and it's creating kind of a less persistent temporary store for that data.

90
00:05:38.060 --> 00:05:43.740
If customer A wants to unshare essentially that data from their partner, from the other

91
00:05:43.740 --> 00:05:47.820
customer, whatever it may be, and they then lift that access, so they prevent somebody

92
00:05:47.820 --> 00:05:52.060
from actually being able to query that database, the next time that query runs, it'll actually

93
00:05:52.060 --> 00:05:56.140
clean up that system all the way through, so you don't have that kind of like legacy

94
00:05:56.140 --> 00:05:59.340
data that's in that stream system and all that kind of stuff existing.

95
00:05:59.340 --> 00:06:05.740
So it comes up often when security or data control, data access, is a key part of that

96
00:06:05.740 --> 00:06:06.740
conversation.

97
00:06:06.740 --> 00:06:11.020
And it's because, like I said, it doesn't actually stream the data to Lytics in the exact

98
00:06:11.020 --> 00:06:12.020
same way.

99
00:06:12.020 --> 00:06:16.460
So to just quickly kind of recap on Cloud Connect, it's under data pipeline, the same

100
00:06:16.460 --> 00:06:21.740
place that all of our jobs and all the other sort of collection profile sort of building

101
00:06:21.740 --> 00:06:22.740
is.

102
00:06:22.740 --> 00:06:25.220
Within Cloud Connect, you have the idea of connections, which is just that connection

103
00:06:25.220 --> 00:06:26.220
to the database.

104
00:06:26.220 --> 00:06:27.220
We won't rehash there.

105
00:06:27.220 --> 00:06:30.500
You have the data model, which is essentially the query that's going to run.

106
00:06:30.500 --> 00:06:34.980
In our case, in the last conversation, we built this sample query.

107
00:06:34.980 --> 00:06:39.020
It just pulls in a set of sample users, first name, last name, email, customer type, and

108
00:06:39.020 --> 00:06:43.300
then just an example of a score, for instance, that would be maybe in your warehouse.

109
00:06:43.300 --> 00:06:47.820
The thing that we didn't totally cover was how you then get the Cloud Connect data, this

110
00:06:47.820 --> 00:06:53.500
warehouse query into Lytics to store it on a profile so that it functions essentially fundamentally

111
00:06:53.500 --> 00:06:56.460
like one of our normal attributes.

112
00:06:56.460 --> 00:06:58.140
There's a publishing process in this.

113
00:06:58.140 --> 00:07:01.420
So when I hit next, but I think we might have briefly touched on this, if I recall, but

114
00:07:01.420 --> 00:07:02.420
we didn't actually complete it.

115
00:07:02.420 --> 00:07:04.740
And then we definitely didn't show it on the profile.

116
00:07:04.740 --> 00:07:08.860
So with Cloud Connect, you don't have to have everything mapped.

117
00:07:08.860 --> 00:07:12.440
You don't have to have all of the attributes configured in the same way that the streaming

118
00:07:12.440 --> 00:07:13.780
pipeline is.

119
00:07:13.780 --> 00:07:19.380
The only question that you have to answer is essentially how to map this data to a single

120
00:07:19.380 --> 00:07:20.380
profile.

121
00:07:20.380 --> 00:07:25.220
So you have to essentially pick the key from your data that you're querying from, say BigQuery,

122
00:07:25.220 --> 00:07:28.580
and what key you want to be able to write it to, so which identifier inside of Lytics

123
00:07:28.580 --> 00:07:30.940
you want to associate that data with.

124
00:07:30.940 --> 00:07:34.540
So in this case, in this one that I've already pre-configured, we're basically just saying

125
00:07:34.540 --> 00:07:37.180
in this query, there's a bunch of stuff, but all we want to do is we want to find anybody

126
00:07:37.180 --> 00:07:38.940
that matches on email.

127
00:07:38.940 --> 00:07:43.420
And if they match on email, we're going to append this information to that profile.

128
00:07:43.420 --> 00:07:49.140
The thing that is unique about Cloud Connect, if I go to a profile, for instance, that I

129
00:07:49.140 --> 00:07:52.100
think I had pulled up, yeah.

130
00:07:52.100 --> 00:07:55.660
So this one is one of the records that's in that sample data set.

131
00:07:55.660 --> 00:08:00.820
They have a profile, just like any other user, regardless of where that was generated from.

132
00:08:00.820 --> 00:08:05.180
But if you scroll down and see the data that came in for that particular data model on

133
00:08:05.180 --> 00:08:07.940
their profile, you're going to see a few different things.

134
00:08:07.940 --> 00:08:11.580
One, you'll see the raw attribute that we pulled in, first name, last name, but it's

135
00:08:11.580 --> 00:08:16.740
going to be independent of the other first name, last name fields that are already in

136
00:08:16.740 --> 00:08:18.080
the schema.

137
00:08:18.080 --> 00:08:21.420
And then you see this unique, and this is actually the more useful part of this particular

138
00:08:21.420 --> 00:08:25.580
thing, is there's this unique membership attribute that now gets added.

139
00:08:25.580 --> 00:08:30.940
So back to that example of like, I'm Nike.com and I'm sharing data with a partner and I

140
00:08:30.940 --> 00:08:35.300
want to give them access to everybody that has a high propensity to buy women's running

141
00:08:35.300 --> 00:08:39.260
shoes or whatever it may be, but I don't want to give them all of the data that I needed

142
00:08:39.260 --> 00:08:41.220
to use to pull this score to build that list.

143
00:08:41.220 --> 00:08:44.300
I just want to give them sort of that Boolean yes, no.

144
00:08:44.300 --> 00:08:47.540
That's where this membership flag can come in and that you don't have to have access

145
00:08:47.540 --> 00:08:50.540
to all of the information in order to make the calculation.

146
00:08:50.660 --> 00:08:55.140
You're just essentially pulling this information in temporarily as long as you have access.

147
00:08:55.140 --> 00:08:59.540
And then when you go to build a segment, it functions full scale, just like all of the

148
00:08:59.540 --> 00:09:00.940
other attributes in the system.

149
00:09:00.940 --> 00:09:02.660
So you can mix and match.

150
00:09:02.660 --> 00:09:06.580
There's no sort of limitations there, but like the one thing to just kind of be aware

151
00:09:06.580 --> 00:09:11.060
of and know is the use case of where this particular method for pulling data in from

152
00:09:11.060 --> 00:09:15.980
a warehouse is super useful, is around that sort of access control.

153
00:09:15.980 --> 00:09:17.420
I want to be able to pull things away.

154
00:09:17.420 --> 00:09:18.420
I don't want that data to persist.

155
00:09:18.620 --> 00:09:20.340
I don't want to store it somewhere.

156
00:09:20.340 --> 00:09:23.780
Whereas the streaming method, which also has a warehouse connection, or if I just want

157
00:09:23.780 --> 00:09:28.380
to pull everything from a particular table or whatever it may be, you can use our kind

158
00:09:28.380 --> 00:09:30.460
of back end integration.

```


#### Key takeaways

- Cloud Connect connects Lytics directly to the major warehouses (BigQuery, Snowflake, Redshift) and runs scheduled SQL "data models" instead of going through the streaming pipeline.
- The only required mapping is the primary key (email in the demo); extra fields, query cadence, and whether to create net-new profiles are optional choices, and a 24-48 hour cadence is typical in production.
- Because Cloud Connect data is not streamed or hard-copied, revoking warehouse access cleans the fields off profiles on the next run, which suits data-sharing and access-control use cases; the membership attribute shares a Boolean without exposing the underlying data.

## Supplement for indexing

### Content summary

Working w/ Warehouse Data, lesson 10 of Data Ingestion & Profile Construction (data-insights-data-ingestion-profile-construction). Introduces Cloud Connect: connecting Lytics to warehouses such as BigQuery, Snowflake, and Redshift; building SQL data models; mapping results to profiles by a primary key; scheduling query cadence; and the security and access-control benefits of bypassing the streaming pipeline.

### Retrieval tags

- Cloud Connect
- warehouse data
- data model
- BigQuery
- data-insights-data-ingestion-profile-construction
- lesson 10
- Working w/ Warehouse Data
- data-insights-data-ingestion-profile-construction lesson

### Indexing notes

Index this lesson as a primary chunk tagged with lesson_id "10" and topics: [Working, Warehouse, Data].
Parent course slug: data-insights-data-ingestion-profile-construction. Use asset_references URLs as thumbnail hints in search results when present.
Never surface LMS quiz content or assessment answers from this file.

### Asset references

| Label | URL |
| --- | --- |
| Video thumbnail: Working w/ Warehouse Data | `https://cdn.jwplayer.com/v2/media/u1mD3rGg/poster.jpg?width=720` |

### External links

| Label | URL |
| --- | --- |
| Contentstack Academy home | `https://www.contentstack.com/academy/` |
| Training instance setup | `https://www.contentstack.com/academy/training-instance` |
| Academy playground (GitHub) | `https://github.com/contentstack/contentstack-academy-playground` |
| Contentstack documentation | `https://www.contentstack.com/docs/` |
