robots.txt should steer search engines away from old docs #94

Closed
smontanaro opened this issue Nov 3, 2016 · 16 comments · Fixed by #99

@smontanaro

See https://mail.python.org/pipermail/pydotorg-www/2016-November/003921.html for original discussion.

When searching Google for "Python timeit" recently, the first hit was for

https://docs.python.org/2/library/timeit.html

The second hit, unfortunately, was for

https://docs.python.org/3.0/library/timeit.html

The first page of results didn't mention

https://docs.python.org/3/library/timeit.html

at all. It seems that the robots.txt file should be tweaked to strongly discourage search engine crawlers from traversing outdated documentation, at least versions older than 3.2 or 2.6. It's been a long while since I messed with a robots.txt file (so I won't pretend I could submit a proper PR), but something like

User-agent: *
disallow: /3.0/
disallow: /3.1/
disallow: /2.5/
disallow: /2.4/
disallow: /2.3/
disallow: /2.2/
disallow: /2.1/
disallow: /2.0/

should steer well-behaved crawlers away from obsolete documentation.
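
A minimal sketch of how the proposed rules could be checked before deploying them, using Python's standard `urllib.robotparser`. The rules are copied from the snippet above; the helper names `build_parser` and `is_crawlable` and the crawler name `ExampleBot` are hypothetical, and nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the proposal above; parsed locally, not fetched.
PROPOSED_RULES = """\
User-agent: *
disallow: /3.0/
disallow: /3.1/
disallow: /2.5/
disallow: /2.4/
disallow: /2.3/
disallow: /2.2/
disallow: /2.1/
disallow: /2.0/
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given crawler may fetch the URL under the parsed rules."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = build_parser(PROPOSED_RULES)
    for url in (
        "https://docs.python.org/3.0/library/timeit.html",
        "https://docs.python.org/3/library/timeit.html",
        "https://docs.python.org/2/library/timeit.html",
    ):
        print(url, "allowed" if is_crawlable(parser, url) else "disallowed")
```

Because the rules are matched as path prefixes, `/3.0/` is blocked while `/3/` and `/2/` stay crawlable.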

@berkerpeksag
Member

I couldn't find robots.txt in the psf-salt repo. @benjaminp do you know where it is?

@MarkMangoba

@berkerpeksag
Member

@MarkMangoba good point! Could you please check whether https://docs.python.org/robots.txt is created on Fastly? I don't have a Fastly account so I can't check it myself.

Also, I think we can safely add 3.2 and 3.3 to the list Skip shared in https://github.com/python/pythondotorg/issues/1030#issue-187084143.

@brainwane assigned ewdurbin and unassigned MarkMangoba Aug 11, 2019
@brainwane

@ewdurbin Is this something we could talk about this week, as we work to close out Python 2 sunsetting communications tasks?

@brainwane

Hi @JulienPalard -- could you confirm that this is ok to do?

@JulienPalard
Member

A disallow looks violent (and dangerous: one typo and boom; I probably fear robots.txt more than I should).

We already have proper canonical links in some builds, like:

$ curl https://docs.python.org/3.6/tutorial/index.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/tutorial/index.html" />

They are placed there by Doc/tools/templates/layout.html.

This should be enough to get /3/ instead of /3.6/ in search results. But we don't have it for really old docs like 3.4:

$ curl https://docs.python.org/3.4/tutorial/index.html | grep canonical

See also #51.

Using a canonical relation to point from 2.7 to 3 is probably a bad idea: it looks like a lie, and some pages have moved (following modules that were renamed or moved), but it could be discussed.

Using a disallow on 2.7 is not a good idea: everyone still has the right to explicitly search for Python 2.7 things in search engines. For example, searching for python urlparse should give them https://docs.python.org/2/library/urlparse.html (this module no longer exists in Python 3; it's now https://docs.python.org/3/library/urllib.parse.html).
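
The canonical-link check done with curl and grep above can also be sketched in Python with the standard `html.parser` module. This is only an illustration: the class name `CanonicalLinkFinder`, the helper `find_canonical`, and the two sample HTML strings are hypothetical, with the first sample modelled on the 3.6 output quoted above.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here by default.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page fragments: one with a canonical link, one without.
PAGE_WITH_CANONICAL = (
    '<head><link rel="canonical" '
    'href="https://docs.python.org/3/tutorial/index.html" /></head>'
)
PAGE_WITHOUT_CANONICAL = "<head><title>The Python Tutorial</title></head>"

if __name__ == "__main__":
    print(find_canonical(PAGE_WITH_CANONICAL))
    print(find_canonical(PAGE_WITHOUT_CANONICAL))
```

Run against the fetched HTML of each versioned build, a check like this would list which old versions still lack a canonical link.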

@enda

enda commented Jun 4, 2020

I agree with @JulienPalard: a disallow looks violent, and I will still want to be able to find the version 2 docs in a few years if I need them.

We should let search engines do their job, but maybe we can help them a little:

  • By editing the link to v3 ("Read the Python documentation for the current stable release.") and putting the page's top keywords in the link text:
    For example, at https://docs.python.org/2.7/library/functions.html, the link should be something like
    You should upgrade and read the <a href="https://docs.python.org/3/library/functions.html">Built-in Functions - Python 3 documentation</a>.
    instead of
    You should upgrade and read the <a href="https://docs.python.org/3/library/functions.html">Python documentation for the current stable release</a>.

  • By trying to add a SearchAction markup, pointing only to the Python 3 documentation.

  • By creating a sitemap only for /3/, and adding it to robots.txt and Google Search Console (see Build a sitemap #56).

  • By pushing from /2/ to /3/ but not from /3/ to /2/.

  • By finding external /2/ links and emailing their owners to change them to /3/ ...
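
As a rough sketch of the sitemap suggestion in the list above, the following builds a sitemap that lists only pages under /3/, plus the `Sitemap:` line that would advertise it in robots.txt. The function names, the page list, and the sitemap URL are hypothetical; a real build would presumably collect the paths from the generated documentation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, page_paths: list) -> str:
    """Return sitemap XML listing each page path under base_url."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in page_paths:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = base_url.rstrip("/") + "/" + path.lstrip("/")
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_sitemap_line(sitemap_url: str) -> str:
    """Return the robots.txt directive that points crawlers at the sitemap."""
    return f"Sitemap: {sitemap_url}"


if __name__ == "__main__":
    # Hypothetical page list and sitemap location, for illustration only.
    pages = ["library/timeit.html", "library/functions.html"]
    print(build_sitemap("https://docs.python.org/3/", pages))
    print(robots_sitemap_line("https://docs.python.org/3/sitemap.xml"))
```

Listing only /3/ pages hints at which URLs are preferred without blocking anything, so the older versions stay searchable.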

@ewdurbin
Member

ewdurbin commented Jun 4, 2020

I defer to the Docs team to make a decision on how to move forward here.

Personally I don't see an immediate action that should be taken. Search is hard and I think there are enough little traps described above that this should be approached gradually and with a plan.

@brainwane

I agree @ewdurbin - should we move this issue perhaps to the CPython repo, or to https://github.com/python/docsbuild-scripts ?

@hugovk
Member

hugovk commented Jun 4, 2020

  • By finding external /2/ links and emailing their owners to change them to /3/ ...

Here are 95 links on English Wikipedia that begin with https://docs.python.org/2 and 15 with http://docs.python.org/2.

Not all need to be or should be changed, but I'll do a few now. (Edit: now 80 and 15.)

@JulienPalard
Member

I'd gladly take the issue on docsbuild-scripts, but in any case I won't add disallows to robots.txt.

Thanks a lot @hugovk for trying to fix external links; it's a good way, as it also helps users, not only search engines 👍

Also please note that the situation has already improved since 2016: I get /fr/3/library/timeit.html first and /3/library/timeit.html second on Google, and /3/library/timeit.html first on DuckDuckGo. I did not experiment further. This issue can be closed and reopened as needed on docsbuild-scripts if you see search results clearly lagging behind.

@berkerpeksag
Member

We have another report showing links to Python 2 docs in python/pythondotorg#1619. I was going to transfer this issue to docsbuild-scripts, but I couldn't find it in the repository list.

@JulienPalard
Member

@berkerpeksag
Member

I'm well aware of the location of the docsbuild-scripts repository. I was talking about the "transfer issue" feature of GitHub.

@ewdurbin transferred this issue from python/pythondotorg Jul 27, 2020
@ewdurbin
Member

Not sure why you were unable to transfer @berkerpeksag, but it has been completed.

@berkerpeksag
Member

@ewdurbin thank you very much!
