Skip to content

Sorting crates by top recently downloaded #857

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

natboehm
Copy link
Contributor

@natboehm natboehm commented Jul 7, 2017

I'm currently working on issue #702, sort crates by number of downloads in the past 90 days. In the function index of the file src/krate.rs, I'm having trouble converting the SQL query to Diesel and incorporating it into the already formed query variable statement.

Even when the inner_join is placed above the into_boxed function, as @sgrif recommended in my previous pull request #817, a different error persists in which the types are not matching when executing the query. My latest attempt places the inner_join after into_boxed, as I could not figure out how to keep the logic of the query while calling inner_join before into_boxed.

The query written in src/recent_downloads_query.sql currently works when inserted in straight SQL with Diesel. This is done in lines 912 - 916 of src/krate.rs. Currently when served locally, this branch is able to serve only a list of crates sorted by number of downloads in the past 90 days. All other functionality within fn index is commented out. I did this to show that the query does successfully get the top crates downloaded in the past 90 days. I can't seem be able to translate correctly to Diesel, or to be able to incorporate it to work correctly with the initialization of query and the other ways in which query is used.

In terms of functionality, when on the search page we want to be able to both show the list of crates sorted by top recently downloaded and list their respective recent download count instead of the all-time download count. We want this in addition to the current state of sorting crates by all-time downloads.

I'd appreciate some help translating my query into the Diesel code, as well as some serious guidance in incorporating it into the variable query in the function index in src/krate.rs.

My latest attempt at the Diesel query currently resides in krate.rs, lines 820 - 822. The working SQL query is in src/recent_downloads_query.sql.

@sgrif @carols10cents

@sgrif
Copy link
Contributor

sgrif commented Jul 7, 2017

I'm confused as to why this needs to join at all? Could this do what you need?

query = query.order(crate_downloads.select(sum(downloads))
    .filter(date.gt(now - 90.days()))
    .filter(crate_id.eq(sql::<Integer>("crates.id"))))

@natboehm
Copy link
Contributor Author

natboehm commented Jul 7, 2017

I thought I needed the join to associate crates.id, and associated metadata, with crate_downloads.crate_id. Without the join, I didn't think the function would be able to return the crate title, description, etc listed when searching.

@shepmaster
Copy link
Member

shepmaster commented Jul 7, 2017

why this needs to join at all?

I think it also needs to join because we want to show the recent download count in the UI. Is it possible to get the crate's ID, crate's name (and other metadata) as well as the # of downloads in the last 90 days, sorted by the number of downloads, without a join?

@shepmaster
Copy link
Member

shepmaster commented Jul 7, 2017

One potential idea I tried in parallel was to introduce a view which can then be joined to by Diesel. I have an example repo available. This might work, but is unfortunate because I couldn't figure out how to get Diesel to infer the schema for the view, which means that the Rust representation could go stale.

It doesn't matter in this case, but a view would also be impossible to "parameterize", which is something I've seen in other cases of joining to a subselect.

@sgrif
Copy link
Contributor

sgrif commented Jul 7, 2017

You don't need to join in order to do that. You can just order by a subselect like I wrote above. The query plan generated is nearly identical

@sgrif
Copy link
Contributor

sgrif commented Jul 7, 2017

Ah I see where I misunderstood now

@sgrif
Copy link
Contributor

sgrif commented Jul 7, 2017

Still, it shouldn't need to join onto a subselect -- I'd think this would work fine:

let recent_downloads = sql::<BigInt>("SUM(crate_downloads.downloads)");
crates.inner_join(crate_downloads::table)
    .group_by(crates::id)
    .filter(crate_downloads::date.gt(now - 90.days()))
    .select((crates::all_columns, recent_downloads))
    .order(recent_downloads);

That still doesn't solve the issue mentioned here though. It sounds like you just need to completely branch this case off, which you'd have to do regardless of Diesel. The type that you're returning from the query would be different, and the response you return would therefore be different as well. It'd have to be something like:

if sort_by_recent_downloads {
    // perform the full query, including execution, returning the result
} else {
    // what the method does now
}

@natboehm
Copy link
Contributor Author

Okay, thank you. I'm confused about your statement of the type that we're returning from the query being different, could you explain that a bit more? The function currently works returning only the top recently downloaded results, it's only when I try to incorporate it into the main query where I start getting the errors. Do you mean that the query we are trying to construct is a different type from the queries constructed in the other if/else blocks and thus is not able to type check correctly, but that the overall return type of the function is still consistent?

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

I was referring to this comment above:

I think it also needs to join because we want to show the recent download count in the UI.

If we are to do that, we'd be selecting additional data in this case that isn't present in any of the other branches, which means the query is now different from the rest of the function.

@carols10cents
Copy link
Member

@sgrif the data types don't have to be different though-- we could either:

  • replace the total download count with the recent download count when sorting by recent downloads
  • always return both total and recent download counts no matter what the sort is

sooo are there remaining problems?

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

always return both total and recent download counts no matter what the sort is

Would likely be the simplest solution in the short term. Anything else will either require some new features in Diesel, or a pretty deep refactoring of this function.

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

I'm going to open a PR to at least add the join methods onto boxed queries. It won't fully allow this, since the type would still change from BoxedQuery<_, crates, _> to BoxedQuery<_, Join<crates, crates_downloads, _>. However it would at least allow us to do this:

let data = if sort == "recent_downloads" {
    query.inner_join(crates_downloads::table)
        .select(all_columns_but_with_sum_where_total_downloads_is)
        .group_by(crates::id)
        .filter(crate_downloads::date.gt(now - 90.days()))
        .paginate(limit, offset)
        .load(&*conn)
} else {
    query.paginate(limit, offset).load(&*conn)
};

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

Ref diesel-rs/diesel#1015

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

diesel-rs/diesel#1016 would be needed as well.

@natboehm
Copy link
Contributor Author

Thanks, that would likely be helpful. I tried the query you wrote in the previous comment:

let recent_downloads = sql::<BigInt>("SUM(crate_downloads.downloads)");
crates.inner_join(crate_downloads::table)
    .group_by(crates::id)
    .filter(crate_downloads::date.gt(now - 90.days()))
    .select((crates::all_columns, recent_downloads))
    .order(recent_downloads);

translated in our case to:

let recent_downloads = sql::<BigInt>("SUM(crate_downloads.downloads)");
let mut query = crates::table
    .inner_join(crate_downloads::table)
    .group_by(crates::id)
    .filter(crate_downloads::date.gt(date(now - 90.days())))
    .select((ALL_COLUMNS, AsExpression::<Bool>::as_expression(false), recent_downloads.clone()))
    .into_boxed();

and received the following error:

ERROR:conduit_log_requests: 127.0.0.1:59577 [2017-07-10T13:14:20-04:00]
Get /api/v1/crates - 3ms 500: column "crates.id" must appear in the
GROUP BY clause or be used in an aggregate function

I recall getting this error before, and think that this is why we decided that a subquery was necessary. Do you have any insight into if this could be caused by a different problem?

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

into_boxed is dropping the group by clause right now, that's what diesel-rs/diesel#1016 addresses (Diesel doesn't technically support group_by yet, that method is just there so there's some way to work around it in the mean time without having to write the whole query with raw SQL)

Never mind that PR has a bug

@carols10cents
Copy link
Member

ohhhhhh i thought it was the join that doesnt' work with into_boxed

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

It's both. Well, you can join before boxing, but group_by just doesn't do anything to boxed queries right now

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

@natboehm Can you give this a try using #852 as a base branch and then adding this to Cargo.toml:

[replace]
"diesel:0.14.0" = { git = "https://github.com/diesel-rs/diesel.git", branch = "sg-crates-io-brainstorming" }

To see if the query I mentioned above works for this use case with those two changes?

@natboehm
Copy link
Contributor Author

Sure, when trying to run the backend server I received this error message:

ERROR:conduit_log_requests: 127.0.0.1:60274 [2017-07-10T14:43:32-04:00] 
Get /api/v1/crates - 3ms 500: syntax error at or near "GROUP"

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

/headdesk

Fixed... (We have little test coverage for the feature since it's not "officially" supported)

@natboehm
Copy link
Contributor Author

Awesome, it seems to be working, the group_by errors seem to have been resolved and the top recently downloaded crates display instead of all-time downloads on my local version. I'm going to keep working off of this branch in anticipation of these changes being merged at some point. Thanks!

@sgrif
Copy link
Contributor

sgrif commented Jul 10, 2017

Yeah, I'll merge them once CI is green. @carols10cents What do we want to do for the crates.io dependency? I'm not planning a release for a few months. Are we comfortable pointing at git?

@natboehm
Copy link
Contributor Author

From running the tests we've found that this query doesn't work for crates with no downloads. In order to support crates with no downloads, we need to change the inner_join to a left_join and the filter on the date needs to be a condition on the join. Is this possible to write in Diesel currently?

SELECT 
    crates.*
FROM (
    crates 
    LEFT OUTER JOIN
        crate_downloads 
    ON 
        crates.id = crate_downloads.crate_id 
    AND crate_downloads.date > date(CURRENT_DATE - INTERVAL '90 days')
)
GROUP BY crates.id 
ORDER BY crates.name ASC 
LIMIT 10 
OFFSET 0;

Essentially this is what we need to generate.

@sgrif
Copy link
Contributor

sgrif commented Jul 13, 2017

You'd have to reach into our internals in order to do that currently.

use diesel::query_source::joins::LeftOuter;

query.join(crate_downloads::table, LeftOuter, crates::id.eq(crate_downloads::crate_id).and(crate_downloads::date.gt(now - 90.days()))

@sgrif
Copy link
Contributor

sgrif commented Jul 13, 2017

I'll have an actual API for this some time this weekend. The way to do this will be:

query.left_join(crate_downloads::table.on(crates::id.eq(crate_downloads::crate_id).and(crate_downloads::date.gt(now - 90.days())))

@natboehm
Copy link
Contributor Author

Sounds good, thanks! The query works with Diesel 0.14.1, some of the rows with zero downloads end up null and it looks like you added an update to deal with that issue.

@sgrif
Copy link
Contributor

sgrif commented Jul 14, 2017

API added in diesel-rs/diesel#1026

@natboehm natboehm closed this Jul 18, 2017
@natboehm natboehm deleted the sort-crate-recent-downloads branch September 15, 2017 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants