bpo-45466: Add download feature to urllib.request module #29217


Closed
wants to merge 7 commits into from

Conversation

pohlt

@pohlt pohlt commented Oct 25, 2021

Similar to http.server, the urllib.request module could offer download functionality:

python -m urllib.request https://python.org/ --output file.html

To keep the code lean, output is the only optional parameter.
A typical use case could be downloading some installation scripts or other data from within a container where curl/wget is not available.

https://bugs.python.org/issue45466

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

Recognized GitHub username

We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames:

@pohlt

This might be simply due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

Similar to http.server, the urllib.request offers a download
functionality:
python -m urllib.request https://python.org/ --output file.html
@ericvsmith
Member

Tests are needed.

@pohlt
Author

pohlt commented Oct 26, 2021

Yep, tests are next.
I'm planning to use subprocess.run(sys.executable, ...). Any recommendations on which server to test against? Set up my own HTTP server, or use python.org (the pipeline would fail if it were down for maintenance)?

I just found test_urllib2_localnet.py which offers exactly what is needed for the tests.

out = stdout.buffer if args.output is None else open(args.output, "wb")

with urlopen(args.URL) as response:
while data := response.read(1024 * 1024):
Author

Is 1 MB a reasonable choice?

Member

1 MB is a bit large. curl uses a buffer size of 32768.

The code will also do lots of allocations and deallocations. It's possible to avoid them with memoryview(bytearray(32768)) and readinto().
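The readinto() suggestion can be illustrated with a generic copy loop. This is only a sketch using in-memory streams in place of an HTTP response; `http.client.HTTPResponse` also provides `readinto()`, so the same loop applies to a real download.

```python
import io


def copy_stream(src, dst, bufsize=32768):
    # One buffer is allocated up front; readinto() refills it in place,
    # so no new bytes object is created for every chunk.
    buf = memoryview(bytearray(bufsize))
    while n := src.readinto(buf):
        # Only the filled portion of the buffer is written out.
        dst.write(buf[:n])


src = io.BytesIO(b"x" * 100_000)
dst = io.BytesIO()
copy_stream(src, dst)
```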

Author

done

And some housekeeping.
@pohlt
Author

pohlt commented Oct 27, 2021

I looked into the doc building issue, but couldn't figure out what went wrong. Could someone please give me a hint?

@pohlt
Author

pohlt commented Oct 27, 2021

Fixed some non-critical typos and reformatted the news blurb to remove newlines from within code parts. Maybe this fixes the docs issue.

@ericvsmith
Member

I'm sure the docs failure is an unrelated problem. I think there's an open issue with the Sphinx version being used, but now I can't find the mention of that problem.

@pohlt
Author

pohlt commented Nov 1, 2021

I guess it is unlikely for a core developer to look at the PR as long as there are open basic issues like breaking the docs. Is there anything I can do to re-run the pipeline to see if the docs issue has been fixed?

@ericvsmith ericvsmith closed this Nov 1, 2021
@ericvsmith
Member

Closing and re-opening to trigger the doc build step.

@ericvsmith ericvsmith reopened this Nov 1, 2021
Member

@ericvsmith ericvsmith left a comment

I'd prefer if we use f-strings for new code.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Member

@tiran tiran left a comment

I still don't think it is a good idea to add a download feature at all.

While the PR implements enough functionality to address your use case, other users will open issues to request more features. I expect that users will ask for "take filename from remote", "custom HTTP headers" and "POST requests" next. curl has over 200 command line options for a reason. Soon we'll end up with a poor clone of curl and a new source of security bugs. :)

args = parser.parse_args()
out = stdout.buffer if args.output is None else open(args.output, "wb")

with urlopen(args.URL) as response:
Member

This will print an exception. You should add error checking and a nice error output in case connection or download fails.

Author

I'm catching URLError now.
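Catching URLError might look roughly like this (a hypothetical sketch, not the PR's exact code):

```python
import sys
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url):
    # URLError covers connection failures; its subclass HTTPError
    # additionally covers 4xx/5xx responses, so one except clause
    # handles both cases.
    try:
        with urlopen(url) as response:
            return response.read()
    except URLError as exc:
        print(f"Download failed: {exc.reason}", file=sys.stderr)
        return None
```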

parser.add_argument(
"-o",
"--output",
type=str,
Member

argparse has built-in file handling:

Suggested change
type=str,
type=argparse.FileType('wb'), default=sys.stdout.buffer

Author

TIL, thanks.

Comment on lines 2786 to 2787
if __name__ == "__main__":
from argparse import ArgumentParser
Member

Please turn this into a helper method, e.g. def main().

Author

Done

Author

Is it common practice (or impolite) to resolve a fixed issue?

pohlt added 2 commits November 1, 2021 20:25
* use buffer to avoid buffer reallocations
* catch URLError and print error output
* use argparse file handling
@pohlt
Author

pohlt commented Nov 1, 2021

Sorry if I screwed up the review process. I've never done this on GitHub...

@pohlt
Author

pohlt commented Nov 1, 2021

While the PR implements enough functionality to address your use case, other users will open issues to request more features. [...]

The default answer for those requests could be: If you need more functionality than that, install curl/wget and use it instead.

Anyway, thanks for the reviews.

@ericvsmith
Member

After discussing this among the core devs, we've decided not to accept this patch. Sorry, @pohlt. I hope you at least gained some experience in working with the code and our processes. I'll comment on the issue about why we're not accepting it.

@ericvsmith ericvsmith closed this Nov 1, 2021
@pohlt
Author

pohlt commented Nov 2, 2021

Thanks, @ericvsmith, for your support.
