Llama Inference using TGI #3168

Open

Akhil-ender opened this issue Apr 13, 2025 · 0 comments

System Info

I am trying to deploy a pretrained Llama 3 8B model as a SageMaker endpoint on an ml.g5.2xlarge instance, and I am getting the following error:

Error:

```
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|█▍        | 1/7 [00:00<00:04, 1.29it/s]
Loading checkpoint shards:  29%|██▊       | 2/7 [00:01<00:03, 1.31it/s]
Loading checkpoint shards:  43%|████▎     | 3/7 [00:02<00:03, 1.30it/s]
Loading checkpoint shards:  57%|█████▋    | 4/7 [00:03<00:02, 1.29it/s]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:03<00:01, 1.30it/s]
Loading checkpoint shards:  86%|████████▌ | 6/7 [00:04<00:00, 1.30it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.35it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.32it/s]
```

Error: DownloadError

```
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 127, in download_weights
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)
```

After this log, I get an error that the endpoint did not pass health checks.
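
The traceback ends inside `utils.download_and_unload_peft`, which the TGI download step runs when the `HF_MODEL_ID` repo contains PEFT adapter weights rather than full model weights. A possible workaround is to merge the adapter into its base model locally and deploy the merged weights instead; the sketch below assumes the repo is a LoRA adapter on `meta-llama/Meta-Llama-3-8B` (the base-model id is an assumption, not taken from the repo):

```python
# Sketch under assumptions: merge a PEFT (LoRA) adapter into its base model
# so TGI downloads plain weights and never calls download_and_unload_peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"             # assumed base model
adapter_id = "AkhilenderK/Nutrition_Med_Llama_V2"  # repo from this issue

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save merged weights plus tokenizer, then push this directory to the Hub
# (or S3) and point HF_MODEL_ID at it instead of the adapter repo.
merged.save_pretrained("merged-model", safe_serialization=True)
AutoTokenizer.from_pretrained(adapter_id).save_pretrained("merged-model")
```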

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

```python
import sagemaker
import boto3
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFaceModel
import json

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)

# sagemaker config
instance_type = "ml.p3.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "AkhilenderK/Nutrition_Med_Llama_V2",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800,
)
```
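
Once the endpoint is in service, it would be queried through the predictor returned by `deploy()`; a minimal invocation sketch follows (the prompt and generation parameters are illustrative, not from the failing run):

```python
# Illustrative call against the deployed TGI endpoint.
response = llm.predict({
    "inputs": "What foods are high in vitamin B12?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
})
print(response[0]["generated_text"])
```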

Expected behavior

The model should be deployed to the endpoint.
