Add matplotlib arxiv workflow #724
Conversation
# For tests/workflows/test_embarrassingly_parallel.py
embarrassingly_parallel_cluster:
  n_workers: 100
  backend_options:
Is there a reason we are not specifying the VM types here? Using the default t3 will cause a lot of volatility in the times and trigger regressions. I'd recommend using m6i.xlarge, which will take ~4 min total.
I want this to be representative of what a naive user might do. My current thinking is that they'll just use the default worker specification. Though maybe you or someone else has a different perspective. Definitely open to thoughts from others.
It makes sense, although this will likely cause regression detections. I was able to see a difference of double the time between a run that Mat did and one I did with the defaults.
Maybe @ntabris has some opinions on this?
tests/conftest.py (outdated)
@@ -530,6 +530,7 @@ def s3():
    return s3fs.S3FileSystem(
        key=os.environ.get("AWS_ACCESS_KEY_ID"),
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        requester_pays=True,
How does this affect the tests that do not require this?
My guess is we'll pay more transfer fees (https://s3fs.readthedocs.io/en/latest/#requester-pays-buckets) -- though this is probably already happening for datasets in the Coiled AWS account. Some public datasets require us to specify requester_pays=True. We didn't have a mechanism for passing s3fs options through, so I just hard-coded it. cc @ntabris for thoughts on whether this is problematic.
Not sure if it's problematic. This feels like something that would make sense to have configurable per-test though, right? (That feels to me like the easier way of making sure it isn't more problematic, but of course it would be someone else doing that work, so of course it feels easier to me.)
I agree with Nat; the only test that needs requester_pays is the one you are adding here.
I've added a new s3_factory fixture for creating non-default S3FileSystem instances. Note it's function-scoped instead of session-scoped like the existing s3 fixture (changing s3 to be function-scoped breaks other fixtures where s3 is currently used).
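For reference, a minimal sketch of what such a factory fixture could look like, assuming it reuses the same environment-variable credentials as the existing s3 fixture (the actual implementation in tests/conftest.py may differ):

```python
import os

import pytest
import s3fs


@pytest.fixture  # function-scoped, unlike the session-scoped ``s3`` fixture
def s3_factory():
    def factory(**extra_options):
        # Reuse the credentials from the environment and layer on any
        # non-default options, e.g. requester_pays=True for public
        # requester-pays buckets.
        return s3fs.S3FileSystem(
            key=os.environ.get("AWS_ACCESS_KEY_ID"),
            secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
            **extra_options,
        )

    return factory
```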
Hmm, we're getting a permissions error in CI.
Can you point me to the error? I can't find it in the previous CI runs. The only thing that comes to mind is how the [...]. The other thing you can try locally is using the bot's credentials to see if you can replicate it; they are in 1Password in the dask-eng vault.
https://github.com/coiled/coiled-runtime/actions/runs/4475231686/jobs/7864609830 has the permissions error
I agree that seems like the place where things should be going wrong, but given that I can run this test successfully locally, it makes me think it's more of a credentials issue [...]
Good idea. I need to do a few things to access 1Password on my machine. Would you mind trying out the bot credentials? We could screenshare for 5 minutes to see if we can reproduce it.
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )

def s3(s3_storage_options):
    return s3fs.S3FileSystem(**s3_storage_options)
Note this change is just cosmetic. We define key / secret above in s3_storage_options and then do the same thing here again. This change just reuses s3_storage_options instead of grabbing the environment variables again. Should be okay as both fixtures are session-scoped.
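For context, here's a sketch of what the pair of fixtures described above could look like together; the actual code in tests/conftest.py may differ in details (e.g. exact fixture decorators):

```python
import os

import pytest
import s3fs


@pytest.fixture(scope="session")
def s3_storage_options():
    # Single source of truth for the S3 credentials used by the test suite
    return {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }


@pytest.fixture(scope="session")
def s3(s3_storage_options):
    # Reuse the shared options instead of reading the environment again
    return s3fs.S3FileSystem(**s3_storage_options)
```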
Alright, it looks like @ntabris has fixed the permissions issue -- restarting those failing CI builds...
cc @douglasdavis since we were talking about this example offline
@@ -31,6 +31,12 @@ parquet_cluster:
  n_workers: 15
  worker_vm_types: [m5.xlarge]  # 4 CPU, 16 GiB

# For tests/workflows/test_embarrassingly_parallel.py
embarrassingly_parallel_cluster:
  n_workers: 100
I am slightly concerned about cost and CI runtime.
Looking at the cluster, this uses up about 50-100 credits, and I believe this one job runs for 10-20 min.
I don't think this should run on every commit.
We should probably separate the real-world benchmarks from the artificial ones, and then only run the real-world ones (they're probably larger in general) on demand?
I don't know if that's easy, though.
It should be easy to do; this is all pytest stuff, so we can use markers, paths, etc.
I'm fine with running on demand and maybe once per week or per release.
For the record, the cost is about $1-3 per run when using the t3 instances, and the times are on the order of 5-10 min. I suggested using m6i instances, which give a more consistent runtime of ~4 min every time, but James mentioned he wanted to use the default t3s. For some info on cost and times, you can see this comment from Sarah, who ran this workflow multiple times for the ARM vs. Intel comparison:
https://github.com/coiled/platform/issues/645#issuecomment-1459019060
I've updated things so that workflows (i.e. all tests in the tests/workflows/ directory) are only run when requested and on the nightly cron job. Locally you can request that workflows be run by adding the --run-workflows flag to your pytest ... command, and on PRs you can request that workflows be run by adding the workflows label.
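For illustration, a rough sketch of how that opt-in could be wired up in conftest.py. The --run-workflows flag name comes from the comment above, but the rest (hook bodies, skip reason, path check) is assumed rather than taken from the PR:

```python
import pathlib

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-workflows",
        action="store_true",
        default=False,
        help="Also run the tests under tests/workflows/",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-workflows"):
        return
    skip_workflows = pytest.mark.skip(reason="needs --run-workflows")
    for item in items:
        # Skip anything that lives inside a tests/workflows/ directory
        if "workflows" in pathlib.Path(str(item.fspath)).parts:
            item.add_marker(skip_workflows)
```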
def test_embarassingly_parallel(embarrassingly_parallel_client, s3_factory):
    # How popular is matplotlib?
    s3 = s3_factory(requester_pays=True)
    directories = s3.ls("s3://arxiv/pdf")
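For orientation, the rest of the workflow roughly follows the "How popular is Matplotlib?" blog post linked in the PR description: each tar archive of PDFs is scanned for the string "matplotlib" in parallel. The sketch below continues from the snippet above (reusing s3, directories, and the client fixture) and is illustrative rather than the test's exact code:

```python
    import io
    import tarfile

    def extract(filename, fs):
        """Return (pdf_name, mentions_matplotlib) for each PDF in one tar archive."""
        out = []
        with fs.open(filename) as f, io.BytesIO(f.read()) as bio:
            with tarfile.TarFile(fileobj=bio) as tf:
                for member in tf.getmembers():
                    if member.isfile() and member.name.endswith(".pdf"):
                        data = tf.extractfile(member).read()
                        out.append((member.name, b"matplotlib" in data.lower()))
        return out

    # Embarrassingly parallel: one task per tar file, fanned out across the cluster
    futures = embarrassingly_parallel_client.map(extract, directories, fs=s3)
    results = embarrassingly_parallel_client.gather(futures)
```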
IIUC we'll just let this run on everything that's there. I don't think this is necessary, and considering that a new directory pops up every month, this would also increase our runtime every month, which is bad for a benchmark.
What if we make a copy of that into our own bucket, like a snapshot in time, and that way we also avoid the whole request_payer situation?
I've updated this to only use data for a specific range of years (1991-2022). That should give us a fixed set of data files.
@jrbourbeau looking at the plot in the README of the arxiv example (https://github.com/mrocklin/arxiv-matplotlib#results), it doesn't seem very reasonable to start grabbing at 1991. If we are already going to select a range, I'd suggest (based on the plot) doing 2004/2005 to 2022.
1991 corresponds to the beginning of the dataset. I think we want to analyze the full dataset (up to some time cutoff so we have a consistent data volume). The 2004/2005 lower bound you're referencing is just for visualization purposes. If you look at the full notebook (https://github.com/mrocklin/arxiv-matplotlib/blob/main/arxiv-aws.ipynb), you'll see an earlier plot that goes back to 1991.
What I meant to say is that the curve is pretty flat at the beginning, and including less data for those years would probably save us some money. If we are running this once a day, that is a cost of ~$2 * 365 = ~$730 a year for this workflow alone.
Ah, I see what you mean. In general, I agree with this sentiment, though in this particular case there's not much data before 2004. Comparing back-to-back runs on 1991-2022 data and 2004-2022 data, I saw rough prices of ~$1.66 and ~$1.74, respectively, so only a small difference.
Thinking about it more, extending down into the 90s actually has some value, as it lets us confirm that the filename_to_date utility is working as expected. I've added a few light validation asserts to the end of this test.
We can always revisit the subset of data we use here and the frequency we run the workflows at if we want to do some price optimization in the future.
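As an illustration of the validation being described, here's a sketch of a date-parsing helper plus the kind of light asserts mentioned above. The filename pattern ("arXiv_pdf_YYMM_NNN.tar") is an assumption about the public arXiv bucket layout, and this helper may not match the test's actual filename_to_date:

```python
import datetime


def filename_to_date(filename: str) -> datetime.date:
    # Assumes names like "arXiv_pdf_9107_001.tar" (YYMM in the third field);
    # two-digit years >= 91 are taken to be in the 1900s, the rest in the 2000s.
    yymm = filename.split("_")[2]
    year, month = int(yymm[:2]), int(yymm[2:])
    year += 1900 if year >= 91 else 2000
    return datetime.date(year, month, 1)


# Light validation: the 1990s files exercise the century handling
assert filename_to_date("arXiv_pdf_9107_001.tar") == datetime.date(1991, 7, 1)
assert filename_to_date("arXiv_pdf_2201_001.tar") == datetime.date(2022, 1, 1)
```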
Just checking in @ncclementi does ^ seem reasonable to you?
This makes sense, thanks @jrbourbeau
That was my last comment; I think this is good to go!
LGTM! Thank you @jrbourbeau
This PR adds a new representative workflow for an embarrassingly parallel computation. Based on https://www.coiled.io/blog/how-popular-is-matplotlib.