Blog

Cleaning up email addresses at scale using GPT in Clay

Written by Harris Kenny | August 9, 2023

This week, we developed a new way to build cold outbound campaigns. This includes web scraping, email validation, and ChatGPT. All in Clay!

First, the challenge is to find unique lead lists that may not exist in the databases we're all familiar with. So we started with Google Maps. And then, from the associated websites, we gather email addresses.

Often, the email extraction also retrieves extraneous information, making the email undeliverable. Our solution? ChatGPT.

Next, we use ZeroBounce to verify the emails and then we take a second pass with ChatGPT on a specially trained model. Instead of handling each email individually, we applied GPT programmatically across our entire dataset, which is over 3000 rows. 

See how we did it, along with pros/cons of taking this approach. 

 

Transcript

[00:00:00] Introduction

[00:00:00] How's it going? I want to share something that we did this past week that I think is pretty cool and pretty useful.

[00:00:06] I'm going to zoom in here in a second and show you, but first let me just give you an overview.

[00:00:10] So what we're doing is cold outreach. We're starting with a list built using Google Maps. And then we're going to the websites that are listed in that Google Maps listing, extracting the email addresses from the site, and when we do that, we're getting a bunch of extraneous information. Extra things in the site source code. That make the email address itself, not deliverable.

[00:00:32] And the way we're going to fix that is using ChatGPT.

[00:00:36] So I want to show you how we trained a model to take these inputs from websites, email addresses from websites, and then extract basically what the email address should be or could be using GPT.

[00:00:47] Starting with Google Maps and extracted addresses

[00:00:47] I got a whole table here with a lot of different things. So I'm not going to show everything, but like I said, we start with Google Maps and then as we proceed down, we got a bunch of information from the Google Maps listings. And in this case, we've got landscape design companies.

[00:01:01] And so what are we gonna do from there?

[00:01:04] Initial email validation with ZeroBounce

[00:01:04] We ran them through ZeroBounce first, which is a bounce checker or an email validation tool that you may be familiar with. Now you may have heard of email validation tools. It's a generally a very highly, strongly recommended practice to validate emails before you send them. This makes sure that they're deliverable.

[00:01:19] And, you know, if you're using a tool like HubSpot and you're doing email marketing, that's one thing. But in this case, we're doing cold outreach.

[00:01:25] And in either case we recommend validation, but in cold outreach, it's especially important because bounce rates and other issues that can come from cold outreach can cause a lot more problems for your inbox, basically.

[00:01:38] We ran them through ZeroBounce. Now there were certain ones that passed the ZeroBounce check. So in that case, we're going to consider those good. And we're going to move on.

[00:01:45] Handling undeliverable email addresses

[00:01:45] But there's a lot that didn't.

[00:01:46] I'm going to show you how we handled those. Now here, you've got a column of website emails that came through and then we didn't get a clear code from a ZeroBounce on the status of the inbox.

[00:01:55] This one is the one I'm going to focus on in particular. It's a really good example for the issue we had in this dataset. You see items dot 3 0 3 dot 7 5 0 dot 7 8 6 8 info.

[00:02:08] That is the problem. That's the thing that we're trying to solve.

[00:02:10] And you can't —a validator—an email tool is not gonna be able to do that. What's happening here is it's got a phone number that it's concatenating with an email address and no email validation tool is going to solve that.

[00:02:21] This is how we're going to use GPT.

[00:02:23] Developing the initial prompt for the model

[00:02:23] And you may notice that on the right, the way we're doing this, as we're doing this programmatically, or we're doing this at scale across an entire sheet, rather than how you may be used to using GPT through the mobile app or through the desktop app where you're doing single prompts.

[00:02:36] I want to talk through the way that we built this prompt here and on the right, starting out, we assign the role to the AI of data processor. And then I want to explain the way we structured this prompt.

[00:02:47] I'm not going to read you the whole thing. You can pause it. But I'm just giving the context that we've got email addresses that may have extra information at the beginning of the end.

[00:02:56] This has been provided from websites and the most likely error is that there's going to be extra data at the beginning of the end of the string.

[00:03:02] So look out for characters that are obviously incorrect, like letters, on the end of a common domain, like a .com.

[00:03:08] If you suspect you extracted an unknown email address, that's not deliverable, then I got a typo here, oops, return the answer UNKNOWN. If an input is not provided return the answer UNKNOWN.

[00:03:18] And so the reason we're doing that is that we want to be able to filter out, um, Results that are not email addresses very quickly and very easily before we export this list for a cold outreach. And then we drop it in the input.

[00:03:31] Training the model with examples

[00:03:31] And I gave it a few examples.

[00:03:33] So this is an example where it's a word on the end.

[00:03:35] This is an example where there's a phone number on the beginning.

[00:03:37] This is an example of just a string where the, from the website extraction, it's just pulling the at symbol and that's not even an email at all. And then finally a blank input

[00:03:45] Here you can see this number through GPT. We ended up with a deliverable, proper email address.

[00:03:50] Caveat emptor

[00:03:50] Now this is a catch all. And so running these types of campaigns, they have more risks associated with them.

[00:03:54] this is definitely a more advanced thing. And I would say, would use caution when doing outreach like this. because these addresses are going to have different deliverability. They're going to have different responses, different user reported, spam, different, lots of different issues. This data is really fresh. We just got it. And, and we're going to treat it accordingly.

[00:04:11] That's subject for a whole other video, but I wanted to share this idea how we did it.

[00:04:14] And if you find this useful, if you do something similar, let me know what you think. I would love, love to hear ideas.

[00:04:20] Programmatically using GPT at scale with Clay

[00:04:20] This type of application for GPT is really exciting to me and, you know, I'll just look at the number of rows here on this spreadsheet.

[00:04:26] We have over 3000 rows here in this spreadsheet and clay. And so you can see that we were really able to apply this across a large dataset and significantly improve the yield that we got from this list. Let me know what you try and, uh, thanks for watching.