
Add matplotlib arxiv workflow #724


Merged
11 commits merged into main from embarrassing-parallel on Mar 28, 2023

Conversation

jrbourbeau
Member

This PR adds a new representative workflow for an embarrassingly parallel computation, based on https://www.coiled.io/blog/how-popular-is-matplotlib.
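For orientation, a minimal sketch of the embarrassingly parallel pattern the workflow follows: each arXiv PDF tar file can be processed independently, so the work maps a function over the file listing with a Dask client and gathers the results. The per-file task below is a hypothetical placeholder, not the PR's actual code.

```python
from dask.distributed import Client


def count_matplotlib_mentions(tar_filename):
    # Hypothetical placeholder for the real per-file task, which downloads one
    # arXiv PDF tar file and counts how many papers in it use matplotlib.
    # Files are independent of each other -- hence "embarrassingly parallel".
    return {"filename": tar_filename, "mentions": 0}


if __name__ == "__main__":
    # In CI this is a 100-worker Coiled cluster (see the config below);
    # a bare Client() starts a local cluster for illustration.
    client = Client()
    filenames = [f"example_{i:03d}.tar" for i in range(10)]  # placeholder names
    futures = client.map(count_matplotlib_mentions, filenames)
    results = client.gather(futures)
    print(f"processed {len(results)} files")
```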

# For tests/workflows/test_embarrassingly_parallel.py
embarrassingly_parallel_cluster:
  n_workers: 100
  backend_options:
Contributor

Is there a reason we are not specifying the VM types here? Using the default t3 will cause a lot of volatility in the times and trigger regressions. I'd recommend using m6i.xlarge, which will take ~4 min total.

Member Author

I want this to be representative of what a naive user might do. My current thinking is that they'll just use the default worker specification, though maybe you or someone else has a different perspective. Definitely open to thoughts from others.

Contributor

It makes sense, although this will likely cause regression detections. I saw the runtime roughly double between a run that Mat did and one I did with the defaults.
Maybe @ntabris has some opinions on this?

@@ -530,6 +530,7 @@ def s3():
    return s3fs.S3FileSystem(
        key=os.environ.get("AWS_ACCESS_KEY_ID"),
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        requester_pays=True,
Contributor

How does this affect the tests that do not require this?

Member Author

My guess is we'll pay more in transfer fees (https://s3fs.readthedocs.io/en/latest/#requester-pays-buckets) -- though this is probably already happening for datasets in the Coiled AWS account. Some public datasets require us to specify requester_pays=True. We didn't have a mechanism for passing s3fs options through, so I just hard-coded it. cc @ntabris for thoughts on whether this is problematic.
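For reference, a minimal sketch of what requester-pays access looks like with s3fs (this assumes AWS credentials are already configured in the environment):

```python
import s3fs

# The arXiv bucket is requester-pays: transfer costs are billed to the caller's
# AWS account, and requests fail unless s3fs is told to send the
# requester-pays header on each call.
s3 = s3fs.S3FileSystem(requester_pays=True)
print(s3.ls("s3://arxiv/pdf")[:5])
```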

Member

Not sure if it's problematic. This feels like something that would make sense to have configurable per-test though, right? (That feels to me like the easier way of making sure it isn't more problematic, but of course it would be someone else doing that work, so of course it feels easier to me.)

Contributor

I agree with Nat; the only test that needs requester_pays is the one you are adding here.

Member Author

I've added a new s3_factory fixture for creating non-default S3FileSystem instances. Note it's function-scoped instead of session-scoped like the existing s3 fixture (changing s3 to be function-scoped breaks other fixtures where s3 is currently used).
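For readers following along, a sketch of what such a factory fixture could look like; this is illustrative rather than the PR's exact code, and it assumes the s3_storage_options fixture discussed below supplies the key/secret pair:

```python
import pytest
import s3fs


@pytest.fixture(scope="function")
def s3_factory(s3_storage_options):
    # Return a callable so a test can build an S3FileSystem with non-default
    # options (e.g. requester_pays=True) without touching the session-scoped
    # s3 fixture that other tests rely on.
    def factory(**extra_options):
        kwargs = {**s3_storage_options, **extra_options}
        return s3fs.S3FileSystem(**kwargs)

    return factory
```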

@jrbourbeau (Member Author) commented Mar 21, 2023

Hmm we're getting PermissionError: Access Denied errors when running test_embarassingly_parallel here (e.g. this CI build). I can run this example successfully on my laptop when using the oss Coiled AWS account. @ncclementi any ideas on what I might be doing wrong here?

@ncclementi (Contributor)

Hmm we're getting PermissionError: Access Denied errors when running test_embarassingly_parallel here. I can run this example successfully on my laptop when using the oss Coiled AWS account. @ncclementi any ideas on what I might be doing wrong here?

Can you point me to the error? I can't find it in the previous CI runs. The only thing that comes to mind is how requester_pays=True is passed through to the cluster; maybe something is wrong with that?

The other thing you can try is using the bot's credentials to see if you can replicate it locally; they are in 1Password in the dask-eng vault.

@jrbourbeau (Member Author)

https://github.com/coiled/coiled-runtime/actions/runs/4475231686/jobs/7864609830 has the permissions error

The only thing that comes to mind is how requester_pays=True is passed through to the cluster; maybe something is wrong with that?

I agree that seems like the place where things could be going wrong, but given that I can run this test successfully locally, it makes me think it's more of a credentials issue and requester_pays= is being passed through appropriately.

The other thing you can try is using the bot's credentials to see if you can replicate it locally; they are in 1Password in the dask-eng vault.

Good idea. I need to do a few things to access 1Password on my machine. Would you mind trying the bot credentials? We could screenshare for 5 minutes to see if we can reproduce it.

        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )
def s3(s3_storage_options):
    return s3fs.S3FileSystem(**s3_storage_options)
Member Author

Note this change is just cosmetic. We define key / secret above in s3_storage_options and then do the same thing here again; this change just reuses s3_storage_options instead of grabbing the environment variables a second time. Should be okay as both fixtures are session-scoped.
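In other words, the refactor looks roughly like this (a sketch based on the diff above; surrounding code in the real conftest is omitted):

```python
import os

import pytest
import s3fs


@pytest.fixture(scope="session")
def s3_storage_options():
    # Single place that reads the CI credentials from the environment.
    return {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }


@pytest.fixture(scope="session")
def s3(s3_storage_options):
    # Reuse the options above instead of reading the environment variables
    # a second time; purely cosmetic, as noted in the comment.
    return s3fs.S3FileSystem(**s3_storage_options)
```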

@jrbourbeau (Member Author) commented Mar 21, 2023

Alright, it looks like @ntabris has fixed the permissions issue -- restarting those failing CI builds...

@jrbourbeau (Member Author)

cc @douglasdavis since we were talking about this example offline

@@ -31,6 +31,12 @@ parquet_cluster:
  n_workers: 15
  worker_vm_types: [m5.xlarge] # 4 CPU, 16 GiB

# For tests/workflows/test_embarrassingly_parallel.py
embarrassingly_parallel_cluster:
  n_workers: 100
Contributor

I am slightly concerned about cost and CI runtime.
Looking at the cluster, this uses about 50-100 credits, and I believe this one job runs for 10-20 min.
I don't think this should run on every commit.

Member

We should probably separate the real-world benchmarks from the artificial ones, and then only run the real-world ones (they're probably larger in general) on demand?

Member

I don't know if that's easy, though.

Contributor

It should be easy to do; this is all pytest stuff, so we can use markers, paths, etc.

I'm fine with running on demand, and maybe once per week or per release.

Contributor

For the record, the cost is about $1-3 per run when using the t3 instances, and the times are on the order of 5-10 min. I suggested using m6i instances, which give a more consistent runtime of ~4 min every time, but James mentioned he wanted to use the default t3s. For some info on cost and times, see this comment from Sarah, who ran this workflow multiple times for the ARM vs Intel comparison:
https://github.com/coiled/platform/issues/645#issuecomment-1459019060

Member Author

I've updated things so that workflows (i.e. all tests in the tests/workflows/ directory) are only run when requested and on the nightly cron job. Locally, you can request that workflows be run by adding the --run-workflows flag to your pytest ... command, and on PRs you can request them by adding the workflows label.
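For context, the usual way to wire up an opt-in flag like this in pytest is a pair of conftest.py hooks; the sketch below is illustrative and may differ from the PR's actual implementation:

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-workflows",
        action="store_true",
        default=False,
        help="Also run the representative workflow tests in tests/workflows/",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-workflows"):
        return
    skip_workflows = pytest.mark.skip(reason="needs --run-workflows to run")
    for item in items:
        # Skip anything collected from the workflows directory unless opted in.
        if "tests/workflows/" in str(item.fspath):
            item.add_marker(skip_workflows)
```

With something like this in place, `pytest --run-workflows` is the local opt-in, matching the flag named above.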

def test_embarassingly_parallel(embarrassingly_parallel_client, s3_factory):
    # How popular is matplotlib?
    s3 = s3_factory(requester_pays=True)
    directories = s3.ls("s3://arxiv/pdf")
Contributor

IIUC we'll just let this run on everything that's there. I don't think this is necessary, and considering that a new directory pops up every month, this would also increase our runtime every month, which is bad for a benchmark.

Contributor

What if we make a copy of that into our own bucket, like a snapshot in time? That way we also avoid the whole requester-pays situation.

Member Author

I've updated this to only use data for a specific range of years (1991-2022). That should give us a fixed set of data files.

Contributor

@jrbourbeau looking at the plot in the README of the arxiv example (https://github.com/mrocklin/arxiv-matplotlib#results), it doesn't seem very reasonable to start grabbing data at 1991. If we are already going to select a range, I'd suggest (based on the plot) going from 2004/2005 to 2022.

Member Author

1991 corresponds to the beginning of the dataset. I think we want to analyze the full dataset (up to some time cutoff so we have a consistent data volume). The 2004/2005 lower bound you're referencing is just for visualization purposes. If you look at the full notebook (https://github.com/mrocklin/arxiv-matplotlib/blob/main/arxiv-aws.ipynb), you'll see an earlier plot that goes back to 1991.

Contributor

What I meant to say is that the curve is pretty flat at the beginning, and including less data for those years would probably save us some money. If we are running this once a day, that is a cost of ~$2 * 365 = ~$730 a year for this workflow alone.

@jrbourbeau (Member Author) Mar 27, 2023

Ah, I see what you mean. In general, I agree with this sentiment, though in this particular case there's not much data before 2004. Comparing back-to-back runs on 1991-2022 data and 2004-2022 data, the rough costs were ~$1.66 and ~$1.74, respectively -- a small difference.

Thinking about it more, extending down into the 90s actually has some value, as it lets us confirm that the filename_to_date utility is working as expected. I've added a few light validation asserts to the end of this test.

We can always revisit the subset of data we use here and the frequency at which we run the workflows in the future if we want to do some price optimization.
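To make the filename_to_date point concrete, here is an illustrative sketch of the kind of check being described. The helper name comes from the workflow itself, but the implementation and the assumed filename pattern (a two-digit year and month embedded in the tar file name) are illustrative assumptions, not the PR's actual code:

```python
from datetime import date


def filename_to_date(filename):
    # Illustrative only: assumes names like "arXiv_pdf_9107_001.tar", where
    # "9107" would encode July 1991 and "2201" January 2022.
    yymm = filename.split("_")[2]
    yy, mm = int(yymm[:2]), int(yymm[2:4])
    year = 1900 + yy if yy > 80 else 2000 + yy  # disambiguate the century
    return date(year, mm, 1)


# Reaching back into the 90s exercises the century rollover, which is the sort
# of thing a few light validation asserts at the end of the test can confirm.
assert filename_to_date("arXiv_pdf_9107_001.tar") == date(1991, 7, 1)
assert filename_to_date("arXiv_pdf_2201_001.tar") == date(2022, 1, 1)
```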

Member Author

Just checking in -- @ncclementi, does ^ seem reasonable to you?

Contributor

This makes sense, thanks @jrbourbeau
That was my last comment, I think this is good to go!

@jrbourbeau added and then removed the workflows (Related to representative Dask user workflows) label Mar 23, 2023
@jrbourbeau added the workflows (Related to representative Dask user workflows) label Mar 27, 2023
@ncclementi (Contributor) left a comment

LGTM! Thank you @jrbourbeau

@ncclementi merged commit decbba5 into main Mar 28, 2023
@jrbourbeau deleted the embarrassing-parallel branch March 28, 2023 16:39