Llama Inference using TGI #3168

Open

Akhil-ender opened this issue Apr 13, 2025 · 0 comments

System Info

I am trying to deploy a pretrained Llama 3 8B model as a SageMaker endpoint on an ml.g5.2xlarge instance, and I am getting the following error:

Error:

```
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|█▍        | 1/7 [00:00<00:04, 1.29it/s]
Loading checkpoint shards:  29%|██▊       | 2/7 [00:01<00:03, 1.31it/s]
Loading checkpoint shards:  43%|████▎     | 3/7 [00:02<00:03, 1.30it/s]
Loading checkpoint shards:  57%|█████▋    | 4/7 [00:03<00:02, 1.29it/s]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:03<00:01, 1.30it/s]
Loading checkpoint shards:  86%|████████▌ | 6/7 [00:04<00:00, 1.30it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.35it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.32it/s]
```

Error: DownloadError

```
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 127, in download_weights
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)
```

After this log, I get an error that the endpoint did not pass health checks.
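
The traceback ends inside `utils.download_and_unload_peft`, which the TGI download step runs when the `HF_MODEL_ID` repo contains PEFT adapter weights rather than full model weights. A possible workaround is to merge the adapter into its base model locally and deploy the merged weights instead; the sketch below assumes the repo is a LoRA adapter on `meta-llama/Meta-Llama-3-8B` (the base-model id is an assumption, not taken from the repo):

```python
# Sketch under assumptions: merge a PEFT (LoRA) adapter into its base model
# so TGI downloads plain weights and never calls download_and_unload_peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"             # assumed base model
adapter_id = "AkhilenderK/Nutrition_Med_Llama_V2"  # repo from this issue

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save merged weights plus tokenizer, then push this directory to the Hub
# (or S3) and point HF_MODEL_ID at it instead of the adapter repo.
merged.save_pretrained("merged-model", safe_serialization=True)
AutoTokenizer.from_pretrained(adapter_id).save_pretrained("merged-model")
```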

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

```python
import sagemaker
import boto3
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFaceModel
import json

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)

# sagemaker config
instance_type = "ml.p3.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "AkhilenderK/Nutrition_Med_Llama_V2",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800,
)
```
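
Once the endpoint is in service, it would be queried through the predictor returned by `deploy()`; a minimal invocation sketch follows (the prompt and generation parameters are illustrative, not from the failing run):

```python
# Illustrative call against the deployed TGI endpoint.
response = llm.predict({
    "inputs": "What foods are high in vitamin B12?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
})
print(response[0]["generated_text"])
```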

Expected behavior

The model should be deployed to the endpoint.
