System Info
I am trying to deploy a pretrained Llama 3 8B model as a SageMaker endpoint on an ml.g5.2xlarge instance, and I am getting the following error:
Error:
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|█▍ | 1/7 [00:00<00:04, 1.29it/s]
Loading checkpoint shards: 29%|██▊ | 2/7 [00:01<00:03, 1.31it/s]
Loading checkpoint shards: 43%|████▎ | 3/7 [00:02<00:03, 1.30it/s]
Loading checkpoint shards: 57%|█████▋ | 4/7 [00:03<00:02, 1.29it/s]
Loading checkpoint shards: 71%|███████▏ | 5/7 [00:03<00:01, 1.30it/s]
Loading checkpoint shards: 86%|████████▌ | 6/7 [00:04<00:00, 1.30it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.35it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.32it/s]
Error: DownloadError
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 127, in download_weights
utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)
After this log, I get an error that the endpoint did not pass health checks.
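The traceback ends inside download_and_unload_peft, which text-generation-inference only takes when the repo looks like a PEFT adapter (it carries an adapter_config.json) rather than full model weights. A quick diagnostic sketch to confirm that, assuming the repo is publicly readable:

from huggingface_hub import list_repo_files

files = list_repo_files("AkhilenderK/Nutrition_Med_Llama_V2")
# adapter_config.json present -> PEFT adapter repo, which triggers the
# download_and_unload_peft path seen in the traceback
print("adapter_config.json" in files)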
Reproduction
import sagemaker
import boto3
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFaceModel
import json
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
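As a quick sanity check before deploying, the resolved role and bucket can be printed back; this only reads what the session already resolved:

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"sagemaker bucket: {sess.default_bucket()}")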
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)
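Note that TGI container version 1.0.3 predates Llama 3, so a newer image is likely needed for a Llama 3 8B checkpoint. A sketch pinning a newer release; the exact version string is an assumption and depends on what the installed sagemaker SDK exposes:

llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="2.0.0"  # assumed available; check the sagemaker SDK release notes
)
print(llm_image)  # inspect the resolved container URI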
# sagemaker config
instance_type = "ml.p3.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
# Define Model and Endpoint configuration parameter
config = {
    'HF_MODEL_ID': "AkhilenderK/Nutrition_Med_Llama_V2", # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
    # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}
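If HF_MODEL_ID does point at a PEFT adapter, one possible workaround (a sketch, not a verified fix) is to merge the adapter into its base model locally and deploy the merged weights instead; "merged-model" below is a placeholder path:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "AkhilenderK/Nutrition_Med_Llama_V2"
model = AutoPeftModelForCausalLM.from_pretrained(adapter_id, torch_dtype="auto")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
merged.save_pretrained("merged-model")  # or merged.push_to_hub(...) and point
tokenizer.save_pretrained("merged-model")  # HF_MODEL_ID at the merged repo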
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800
)
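Once the endpoint passes its health checks, the returned predictor can be invoked directly; the payload follows the TGI request schema and the prompt below is just an example:

response = llm.predict({
    "inputs": "What are good dietary sources of vitamin B12?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
})
print(response)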
Expected behavior
The model should be deployed to the endpoint.