Connecting AWS SageMaker to Snowflake

Towards ingesting data for Machine Learning

Suman Gautam
5 min read · Aug 15, 2022
Image: Aaron Burden from Unsplash

One of the important tasks in deploying a Machine Learning model is having a seamless flow of data from one platform to another. We may use AWS SageMaker to preprocess data, train models, and make inferences, but our dataset is often hosted on a different platform. In this tutorial, I would like to show you, step by step, how to connect AWS SageMaker to the Snowflake environment. By doing so, we will be able to ingest data stored in Snowflake prior to any Machine Learning tasks.

I will also show, step by step, how to store the username and password in the AWS Parameter Store, which is a more secure way of making the connection to Snowflake.

Step 1: Create SageMaker Notebook Instance

  • From the SageMaker Console, select Notebook instances -> Create notebook instance
  • Attach an appropriate IAM role, e.g. SageMakerExecutionRole (later on, we will also attach some additional policies to this role)

When creating the notebook instance, the Network settings need to be set as well (for a SageMaker-only notebook, we usually skip this option). Also note that you may not be able to update the Network settings of an existing notebook, in which case you will have to start with a new one.

Please contact your Admin for the values of the following fields:

VPC: <*********>
Subnet: <******>
Security Group: <*******>
Direct Internet Access: Disable
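
If you prefer to script the notebook creation, the same network settings can be passed to the CreateNotebookInstance API through boto3. The sketch below is illustrative only; the instance name, instance type, role ARN, region, subnet ID, and security group ID are placeholders you would replace with the values from your admin.

import boto3

sm = boto3.client('sagemaker', region_name='<region>')

# Placeholders: substitute your own name, role ARN, subnet and security group
sm.create_notebook_instance(
    NotebookInstanceName='snowflake-ingest-notebook',
    InstanceType='ml.t3.medium',
    RoleArn='arn:aws:iam::<accountid>:role/SageMakerExecutionRole',
    SubnetId='<subnet-id>',
    SecurityGroupIds=['<security-group-id>'],
    DirectInternetAccess='Disabled'
)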

Step 2: Store the Snowflake Service Account User and Password in the AWS Parameter Store

You can always hard-code the user/password in the SageMaker notebook when creating the Snowflake connector, but it is better practice to use a service account whose credentials are stored in AWS, either in Secrets Manager or in the Parameter Store. In this example, I will show you how to store and use the user/password in the AWS Parameter Store and avoid hard-coding.

Go to AWS Systems Manager -> Parameter Store -> Create parameter

1. Give the user parameter a meaningful name, e.g. '****-user'
2. Choose the 'SecureString' type
3. Enter the user name in the 'Value' field (the user is the Snowflake service account name)
4. Choose Create parameter
5. Repeat steps 1 to 4 to store the password as well (in step 3, enter the password in the 'Value' field)
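
If you prefer to create these parameters from code instead of the console, a minimal boto3 sketch looks like the following. The parameter names '****-user' and '****-pw' match the ones used later in this tutorial; replace the Value fields and region with your own.

import boto3

ssm = boto3.client('ssm', region_name='<region>')

# Store the Snowflake service account user name and password as SecureString parameters
ssm.put_parameter(Name='****-user', Value='<service-account-user>',
                  Type='SecureString', Overwrite=True)
ssm.put_parameter(Name='****-pw', Value='<service-account-password>',
                  Type='SecureString', Overwrite=True)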

We will then have to provide the names of the user and password parameters (not their values) in a policy we are going to create.

Step 3: Create a Credential Policy and Attach It to an Existing or New Role

In our case, we will create a credential policy and attach it to the existing SageMaker Execution Role.

Go to IAM -> Policies -> Create policy -> select JSON -> copy and paste the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "kms:Decrypt",
                "ssm:GetParameterHistory",
                "ssm:GetParametersByPath",
                "ssm:GetParameters",
                "ssm:GetParameter",
                "ssm:DeleteParameters"
            ],
            "Resource": [
                "arn:aws:ssm:<region>:<accountid>:parameter/<user>",
                "arn:aws:ssm:<region>:<accountid>:parameter/<pw>"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeParameters",
                "kms:ListAliases"
            ],
            "Resource": "*"
        }
    ]
}

Go to AWS Systems Manager -> Parameter Store and find the names of the user and password parameters (not the actual user/password values).

  • Replace <user> with the name of the user parameter, e.g. SF_SERVICE_USER
  • Replace <pw> with the name of the password parameter, e.g. SF_USER_PSS
  • Give the policy a meaningful name, e.g. '****-parameter-store'

Attach the policy to the SageMaker Execution Role associated with the Notebook instance.

Go to IAM -> select Roles -> select your SageMaker execution role (e.g. AmazonSageMaker-ExecutionRole, or whichever role your notebook uses)

Add permissions -> Attach policies -> type the name of the policy created above in the search bar and attach it.
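
The same policy creation and attachment can also be done programmatically. Below is a minimal boto3 sketch, assuming the JSON document above has been saved locally as parameter-store-policy.json and that you substitute the name of your own execution role for the placeholder.

import boto3

iam = boto3.client('iam')

# Load the policy document shown above
with open('parameter-store-policy.json') as f:
    policy_doc = f.read()

# Create the policy and attach it to the notebook's execution role
resp = iam.create_policy(PolicyName='****-parameter-store',
                         PolicyDocument=policy_doc)
iam.attach_role_policy(RoleName='<your-sagemaker-execution-role>',
                       PolicyArn=resp['Policy']['Arn'])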

Step 4: Connect the SageMaker Notebook to Snowflake

Open the notebook you created in Step 1. In your notebook, copy the following code, edit the region, and execute it.

!pip install snowflake-connector-python

import sys
import boto3
import snowflake.connector

# Ask your admin which region your Snowflake account was created in
region = 'us-west-2'

We are going to use the Python boto3 library to retrieve the user/password for the Snowflake connection. Below is a simple function that fetches them from the AWS Parameter Store.

# List of parameter names in the Parameter Store (not the actual user/password)
params = ['****-user', '****-pw']

def get_credentials(params):
    ssm = boto3.client('ssm', region)
    response = ssm.get_parameters(
        Names=params,
        WithDecryption=True
    )
    # Build a dict of credentials
    param_values = {k['Name']: k['Value'] for k in response['Parameters']}
    return param_values

param_values = get_credentials(params)

Explore the param_values dictionary; it is used in the Snowflake connector below.

con = snowflake.connector.connect(
    user=param_values['****-user'],
    account="********",
    password=param_values['****-pw'],
    database="<database>",
    schema="<schema>"
)

Note: the 'account' here is your Snowflake account identifier and is different from the service account you stored earlier as user/password.

The next step is to check whether the connection works. Run the code snippet below:

# Check that the connection works
cur = con.cursor()
print(sys.executable)
print(sys.version)
print(sys.version_info)
try:
    cur.execute("select current_date")
    one_row = cur.fetchone()
    print("Current_Date:", one_row[0])
    cur.execute("SELECT current_version()")
    one_row = cur.fetchone()
    print("Snowflake_Version:", one_row[0])
finally:
    cur.close()

Upon a successful connection, the output of the above code should look like this:

/home/ec2-user/anaconda3/envs/python3/bin/python
3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)
[GCC 9.4.0]
sys.version_info(major=3, minor=8, micro=12, releaselevel='final', serial=0)
Current_Date: 2022-08-08
Snowflake_Version: 6.26.1

Step 5: Perform a Query and Save the Results to an Amazon S3 Bucket

import pandas as pd

cur = con.cursor()
cur.execute("SELECT TOP 10 * FROM <some_table>")
result = cur.fetchall()
# Save the results into a DataFrame
result_df = pd.DataFrame(result)
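
Note that fetchall() returns plain tuples, so the DataFrame above has no column names. If you want them, one option (a small sketch using the cursor's standard DB-API description attribute) is:

# cur.description holds metadata for the last query; the first field of each entry is the column name
columns = [col[0] for col in cur.description]
result_df = pd.DataFrame(result, columns=columns)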

Once you are happy with the query, you can write the DataFrame to an S3 bucket for further preprocessing, model training, and inference. An example snippet that writes the file to an S3 bucket is below:

import sagemaker

sess = sagemaker.Session()
result_df.to_csv('result.csv')
data = sess.upload_data('result.csv',
                        bucket='<your s3 bucket>',
                        key_prefix='<folder inside your bucket>')

Note: make sure the SageMaker Execution Role has a policy attached that allows access to the S3 bucket.
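
The upload_data call returns the S3 URI of the uploaded object, which you can hand to your preprocessing or training steps. For example, assuming the s3fs package is installed, pandas can read the file back directly:

print(data)              # e.g. s3://<your s3 bucket>/<folder inside your bucket>/result.csv
df = pd.read_csv(data)   # requires the s3fs package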

Tada! Congratulations on connecting SageMaker to Snowflake. If you have any suggestions/comments, feel free to write.
