Add matplotlib arxiv workflow #724
Conversation
# For tests/workflows/test_embarrassingly_parallel.py
embarrassingly_parallel_cluster:
  n_workers: 100
  backend_options:
Is there a reason we are not specifying the VM types here? Using the default t3 will cause a lot of volatility in the times and trigger regressions. I'd recommend using m6i.xlarge, which will take ~4 min total.
I want this to be representative of what a naive user might do. My current thinking is that they'll just use the default worker specification. Though maybe you or someone else has a different perspective. Definitely open to thoughts from others.
It makes sense, although this will likely cause regression detections. I was able to see a difference of double the time between a run that Mat did and one I did with the defaults.
Maybe @ntabris has some opinions on this?
tests/conftest.py (outdated)
@@ -530,6 +530,7 @@ def s3():
    return s3fs.S3FileSystem(
        key=os.environ.get("AWS_ACCESS_KEY_ID"),
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        requester_pays=True,
How does this affect the tests that do not require this?
My guess is we'll pay more transfer fees (https://s3fs.readthedocs.io/en/latest/#requester-pays-buckets) -- though this is probably already happening for datasets in the Coiled AWS account. Some public datasets require us to specify requester_pays=True. We didn't have a mechanism for passing s3fs options through, so I just hard-coded it. cc @ntabris for thoughts on whether this is problematic.
Not sure if it's problematic. This feels like something that would make sense to have configurable per-test though, right? (That feels to me like the easier way of making sure it isn't more problematic, but of course it would be someone else doing that work, so of course it feels easier to me.)
I agree with Nat; the only test that needs requester_pays is the one you are adding here.
I've added a new s3_factory fixture for creating non-default S3FileSystem instances. Note it's function-scoped instead of session-scoped like the existing s3 fixture (changing s3 to be function-scoped breaks other fixtures where s3 is currently used).
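For reference, a minimal sketch of what such a factory fixture could look like, assuming it reuses the same environment-variable credentials as the existing s3 fixture (the actual implementation in tests/conftest.py may differ):

```python
import os

import pytest
import s3fs


@pytest.fixture  # function-scoped, unlike the session-scoped ``s3`` fixture
def s3_factory():
    def factory(**extra_options):
        # Reuse the credentials from the environment and layer on any
        # non-default options, e.g. requester_pays=True for public
        # requester-pays buckets.
        return s3fs.S3FileSystem(
            key=os.environ.get("AWS_ACCESS_KEY_ID"),
            secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
            **extra_options,
        )

    return factory
```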
Hmm, we're getting a permissions error in CI.
Can you point me to the error? I can't find it in the previous CI runs. The only thing that comes to mind is how the [...]. The other thing you can try locally is using the bot's credentials to see if you can replicate it; they are in 1Password in the dask-eng vault.
https://github.com/coiled/coiled-runtime/actions/runs/4475231686/jobs/7864609830 has the permissions error
I agree that seems like the place where things should be going wrong, but given that I can run this test successfully locally, it makes me think it's more of a credentials issue [...]
Good idea. I need to do a few things to access 1Password on my machine. Would you mind trying out the bot credentials? We could screenshare for 5 minutes to see if we can reproduce it.
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )

def s3(s3_storage_options):
    return s3fs.S3FileSystem(**s3_storage_options)
Note this change is just cosmetic. We define key / secret above in s3_storage_options and then do the same thing here again. This change just reuses s3_storage_options instead of grabbing the environment variables again. Should be okay as both fixtures are session-scoped.
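For context, here's a sketch of what the pair of fixtures described above could look like together; the actual code in tests/conftest.py may differ in details (e.g. exact fixture decorators):

```python
import os

import pytest
import s3fs


@pytest.fixture(scope="session")
def s3_storage_options():
    # Single source of truth for the S3 credentials used by the test suite
    return {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }


@pytest.fixture(scope="session")
def s3(s3_storage_options):
    # Reuse the shared options instead of reading the environment again
    return s3fs.S3FileSystem(**s3_storage_options)
```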
Alright, it looks like @ntabris has fixed the permissions issue -- restarting those failing CI builds...
cc @douglasdavis since we were talking about this example offline
@@ -31,6 +31,12 @@ parquet_cluster:
  n_workers: 15
  worker_vm_types: [m5.xlarge]  # 4 CPU, 16 GiB

# For tests/workflows/test_embarrassingly_parallel.py
embarrassingly_parallel_cluster:
  n_workers: 100
I am slightly concerned about cost and CI runtime.
Looking at the cluster, this uses up about 50-100 credits, and I believe this one job runs for 10-20 min.
I don't think this should run on every commit.
We should probably separate the real-world benchmarks from the artificial ones, and then only run the real-world ones (they're probably larger in general) on demand?
I don't know if that's easy, though.
It should be easy to do; this is all pytest stuff, so we can use markers, paths, etc.
I'm fine with running on demand and maybe once per week or per release.
For the record, the cost is about $1-3 per run when using the t3 instances, and the times are on the order of 5-10 min. I suggested using m6i instances, which give a more consistent runtime of ~4 min every time, but James mentioned he wanted to use the default t3s. For some info on cost and times, you can see this comment from Sarah, who ran this workflow multiple times for the ARM vs. Intel comparison:
https://github.com/coiled/platform/issues/645#issuecomment-1459019060
I've updated things so that workflows (i.e. all tests in the tests/workflows/ directory) are only run when requested and on the nightly cron job. Locally you can request that workflows be run by adding the --run-workflows flag to your pytest ... command, and on PRs you can request that workflows be run by adding the workflows label.
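For illustration, a rough sketch of how that opt-in could be wired up in conftest.py. The --run-workflows flag name comes from the comment above, but the rest (hook bodies, skip reason, path check) is assumed rather than taken from the PR:

```python
import pathlib

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-workflows",
        action="store_true",
        default=False,
        help="Also run the tests under tests/workflows/",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-workflows"):
        return
    skip_workflows = pytest.mark.skip(reason="needs --run-workflows")
    for item in items:
        # Skip anything that lives inside a tests/workflows/ directory
        if "workflows" in pathlib.Path(str(item.fspath)).parts:
            item.add_marker(skip_workflows)
```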
def test_embarassingly_parallel(embarrassingly_parallel_client, s3_factory):
    # How popular is matplotlib?
    s3 = s3_factory(requester_pays=True)
    directories = s3.ls("s3://arxiv/pdf")
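For orientation, the rest of the workflow roughly follows the "How popular is Matplotlib?" blog post linked in the PR description: each tar archive of PDFs is scanned for the string "matplotlib" in parallel. The sketch below continues from the snippet above (reusing s3, directories, and the client fixture) and is illustrative rather than the test's exact code:

```python
    import io
    import tarfile

    def extract(filename, fs):
        """Return (pdf_name, mentions_matplotlib) for each PDF in one tar archive."""
        out = []
        with fs.open(filename) as f, io.BytesIO(f.read()) as bio:
            with tarfile.TarFile(fileobj=bio) as tf:
                for member in tf.getmembers():
                    if member.isfile() and member.name.endswith(".pdf"):
                        data = tf.extractfile(member).read()
                        out.append((member.name, b"matplotlib" in data.lower()))
        return out

    # Embarrassingly parallel: one task per tar file, fanned out across the cluster
    futures = embarrassingly_parallel_client.map(extract, directories, fs=s3)
    results = embarrassingly_parallel_client.gather(futures)
```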
IIUC we'll just let this run on everything that's there. I don't think this is necessary, and considering that a new directory pops up every month, this would also increase our runtime every month, which is bad for a benchmark.
What if we make a copy of that into our own bucket, like a snapshot in time, and that way we also avoid the whole request_payer situation?
I've updated this to only use data for a specific range of years (1991-2022). That should give us a fixed set of data files.
@jrbourbeau looking at the plot in the README of the arxiv example (https://github.com/mrocklin/arxiv-matplotlib#results), it doesn't seem very reasonable to start grabbing at 1991. If we are already going to select a range, I'd suggest (based on the plot) doing 2004/2005 to 2022.
1991 corresponds to the beginning of the dataset. I think we want to analyze the full dataset (up to some time cutoff so we have a consistent data volume). The 2004/2005 lower bound you're referencing is just for visualization purposes. If you look at the full notebook (https://github.com/mrocklin/arxiv-matplotlib/blob/main/arxiv-aws.ipynb), you'll see an earlier plot that goes back to 1991.
What I meant to say is that the curve is pretty flat at the beginning, and including less data for those years would probably save us some money. If we are running this once a day, that is a cost of ~$2 * 365 = ~$730 a year for this workflow alone.
Ah, I see what you mean. In general, I agree with this sentiment, though in this particular case there's not much data before 2004. Comparing back-to-back runs on 1991-2022 data and 2004-2022 data, I saw rough prices of ~$1.66 and ~$1.74, respectively, so only a small difference.
Thinking about it more, extending down into the 90s actually has some value, as it lets us confirm that the filename_to_date utility is working as expected. I've added a few light validation asserts to the end of this test.
We can always revisit the subset of data we use here and the frequency we run the workflows at if we want to do some price optimization in the future.
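As an illustration of the validation being described, here's a sketch of a date-parsing helper plus the kind of light asserts mentioned above. The filename pattern ("arXiv_pdf_YYMM_NNN.tar") is an assumption about the public arXiv bucket layout, and this helper may not match the test's actual filename_to_date:

```python
import datetime


def filename_to_date(filename: str) -> datetime.date:
    # Assumes names like "arXiv_pdf_9107_001.tar" (YYMM in the third field);
    # two-digit years >= 91 are taken to be in the 1900s, the rest in the 2000s.
    yymm = filename.split("_")[2]
    year, month = int(yymm[:2]), int(yymm[2:])
    year += 1900 if year >= 91 else 2000
    return datetime.date(year, month, 1)


# Light validation: the 1990s files exercise the century handling
assert filename_to_date("arXiv_pdf_9107_001.tar") == datetime.date(1991, 7, 1)
assert filename_to_date("arXiv_pdf_2201_001.tar") == datetime.date(2022, 1, 1)
```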
Just checking in @ncclementi does ^ seem reasonable to you?
This makes sense, thanks @jrbourbeau
That was my last comment; I think this is good to go!
LGTM! Thank you @jrbourbeau
This PR adds a new representative workflow for an embarrassingly parallel computation. Based on https://www.coiled.io/blog/how-popular-is-matplotlib.