
generate historycache for directories #1704


Open
totembe opened this issue Jul 31, 2017 · 17 comments


@totembe

totembe commented Jul 31, 2017

I have one particularly active project with 9265 commits, which span 371 history pages. Each page takes about 15 seconds to load. The project's git repository is 890MB in size.

Another project has a total of 2717 commits and loads in 1-2 seconds, which is a far better and acceptable duration. This project is about 130MB.

A third project has a total of 6955 commits, is 93MB in size, and also takes 1-2 seconds per page.

It seems that load time increases with the size of the git repository. Is it possible to optimize this?

I am running in VirtualBox on an SSD, on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 64GB RAM; the VM is allocated 4 cores with a 100% execution cap and 16GB of RAM.

totembe changed the title from "With long commit history, paginated listing takes a lot of time" to "With large git repository size, paginated history listing takes a lot of time" on Jul 31, 2017
@vladak
Member

vladak commented Jul 31, 2017

Do you use the history cache? If you run the git log commands by hand, do the times correspond to the times observed in OpenGrok?

@totembe
Author

totembe commented Jul 31, 2017

git log returns instantly.

I don't know of any history cache setting; I am using OpenGrok with stock settings. There is a historycache directory under the /var/opengrok/data folder.

1.7G ./index
788M ./historycache
1.6G ./xref
4.0G .

@vladak
Member

vladak commented Aug 1, 2017

That's strange. Could you inspect/instrument what the systems (both client and server) are doing during those 15 seconds? E.g. is the client CPU loaded, or is the server performing some heavy I/O?

Also, what OpenGrok version are you running? Since #1049, even if the history for a given directory/file is very long, the output is paginated, so it should not take too long to display/render.

@totembe
Author

totembe commented Aug 1, 2017

Both the requesting client and the OpenGrok server are located on my personal office machine. The machine is mostly idle. OpenGrok is served to a small team of 10 people.

[screenshot: system resource monitor covering the request]

I marked the start of my request with S and the end of the page render with E.

I am using OpenGrok-1.1-rc8.

@totembe
Author

totembe commented Aug 1, 2017

I found the culprit. As you pointed out, git is the source of the delay. When I dumped git log to a txt file, it took 9 seconds. I failed to spot this at first because git pipes the log into less and starts displaying it immediately; I had assumed the whole process finished first and displayed afterwards.

time /opt/git/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict > /home/ethem/test.txt

real 0m8.994s
user 0m7.372s
sys 0m1.588s

Instead of a full dump, the log retrieval can be optimized by using the --skip option and closing the pipe once the required data has been read.

E.g.:
time /opt/git/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict --skip=10 | head -n 100 > /home/ethem/test.txt

real 0m0.014s
user 0m0.008s
sys 0m0.004s

@totembe
Author

totembe commented Aug 1, 2017

As I am new to git, I now understand that this is a git problem.

After repacking the repository with git repack -a, the duration decreased from 8 seconds to 2.5 seconds. That is still not responsive enough for navigating between history log pages. The log itself is only a few megabytes (3.2MB in my case), so the git log output could be cached and OpenGrok could use that file.

@vladak
Member

vladak commented Aug 1, 2017

Well, if the history cache is used, running git log when displaying the history view page is avoided, because the history is stored in a compressed XML file (basically representing a set of HistoryEntry objects) under the historycache directory.

Do you run the indexer with the -H option?
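
For anyone who wants to verify what is inside such an entry, here is a minimal sketch, assuming the .gz files are gzip-compressed java.beans.XMLEncoder output (the class name and example path are illustrative, not OpenGrok code):

    import java.beans.XMLDecoder;
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.util.zip.GZIPInputStream;

    // Hypothetical illustration: decode one gzipped XML history cache entry and print it.
    public class ReadCacheEntry {
        public static void main(String[] args) throws Exception {
            String path = "/var/opengrok/data/historycache/foo/bar.txt.gz"; // example path
            try (XMLDecoder decoder = new XMLDecoder(new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(path))))) {
                // Assuming the XMLEncoder format described above, the decoded bean would be
                // the history object holding the HistoryEntry list; Object keeps this
                // sketch self-contained.
                Object history = decoder.readObject();
                System.out.println(history);
            }
        }
    }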

@totembe
Author

totembe commented Aug 2, 2017

I tried with

sudo OPENGROK_GENERATE_HISTORY=on /opt/opengrok-1.1-rc8/bin/OpenGrok index /home/ethem/og/src

I couldn't find any compressed XML files under /var/opengrok/data/historycache.

Am I missing something?

@vladak
Member

vladak commented Aug 2, 2017

So what are those 788M in the historycache directory?

If you have a project called foo that has a file bar.txt located directly in it (i.e. /var/opengrok/src/foo/bar.txt exists and /var/opengrok/src/foo is a Git/Mercurial/... repository), its historycache entry will be located in the file /var/opengrok/data/historycache/foo/bar.txt.gz. If that file is not there, the history view in the webapp has to resort to running git log directly, hence the delay you're seeing.

It seems that history cache generation failed for some reason. Do the indexer logs contain anything of interest?
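
As a small illustration of that layout, here is a hypothetical helper mapping a source file to its expected cache entry (the roots follow the stock /var/opengrok layout used in this thread; the helper itself is made up):

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Hypothetical helper: compute where the history cache entry for a source file should live.
    public class CachePath {
        static Path cacheEntryFor(Path sourceFile) {
            Path srcRoot = Paths.get("/var/opengrok/src");
            Path dataRoot = Paths.get("/var/opengrok/data/historycache");
            // Mirror the path under the source root and append ".gz".
            Path relative = srcRoot.relativize(sourceFile);
            return dataRoot.resolve(relative.toString() + ".gz");
        }

        public static void main(String[] args) {
            System.out.println(cacheEntryFor(Paths.get("/var/opengrok/src/foo/bar.txt")));
            // prints /var/opengrok/data/historycache/foo/bar.txt.gz
        }
    }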

vladak added the question label on Aug 2, 2017
vladak changed the title from "With large git repository size, paginated history listing takes a lot of time" to "git historycache generation failed. Why?" on Aug 2, 2017
@totembe
Author

totembe commented Aug 2, 2017

There are .gz files for each code file in our projects. While browsing the history of a single code file I didn't have any issue, so I think those files are the per-file caches.

My problem is with browsing the history of the entire repository.

URL (takes time):
http://opengrokhost/source/history/foo

URL(no problems at the moment):
http://opengrokhost/source/history/foo/bar.txt

@vladak
Member

vladak commented Aug 2, 2017

Aha! :-) The per-directory history is not cached, so git log is run every time. The reason lies in how the history cache is created: it uses a trick to convert per-repository history into per-file history by mapping changesets to the files changed in them and then inverting this map. I am not sure this trick can be used for creating a per-directory history cache. If it can, it will certainly demand more space for storing the historycache and it will make reindexing longer too.

The other option is to create the history cache for directories on demand. Then only the first display will take a long time, and subsequent displays (leveraging the incremental history generation using the OpenGroklatestRev file) will be fast, as long as not too many history entries have been added.
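
A minimal sketch of the changeset-to-file inversion trick described above, using plain strings in place of OpenGrok's HistoryEntry objects (the class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: flip a per-repository log (changeset -> files it touched)
    // into a per-file map (file -> changesets that touched it).
    public class InvertHistory {
        static Map<String, List<String>> invert(Map<String, List<String>> changesetToFiles) {
            Map<String, List<String>> fileToChangesets = new HashMap<>();
            for (Map.Entry<String, List<String>> e : changesetToFiles.entrySet()) {
                for (String file : e.getValue()) {
                    fileToChangesets.computeIfAbsent(file, f -> new ArrayList<>()).add(e.getKey());
                }
            }
            return fileToChangesets;
        }

        public static void main(String[] args) {
            Map<String, List<String>> log = new HashMap<>();
            log.put("c1", Arrays.asList("foo/bar.txt", "foo/baz.txt"));
            log.put("c2", Arrays.asList("foo/bar.txt"));
            System.out.println(invert(log)); // e.g. {foo/baz.txt=[c1], foo/bar.txt=[c1, c2]}
        }
    }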

vladak changed the title from "git historycache generation failed. Why?" to "generate historycache for directories" on Aug 2, 2017
@vladak
Member

vladak commented Aug 2, 2017

Another idea would be to store the historycache at least for the top-level directory of a given repository, since that history is available anyway, i.e. change FileHistoryCache.java#store() to also store the history parameter in a file. Filed #1716 to track this.
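
A hedged sketch of that idea: write whatever history object store() receives into one gzipped XML file for the repository's top-level directory (the .dir_history.gz name and the use of XMLEncoder here are assumptions for illustration, not the actual FileHistoryCache code):

    import java.beans.XMLEncoder;
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;

    // Hypothetical sketch: persist the per-repository history for the top-level directory.
    public class StoreDirectoryHistory {
        static void storeTopLevel(Object history, String repositoryCacheDir) throws Exception {
            // ".dir_history.gz" is a made-up file name for this illustration.
            String target = repositoryCacheDir + "/.dir_history.gz";
            try (XMLEncoder encoder = new XMLEncoder(new GZIPOutputStream(
                    new BufferedOutputStream(new FileOutputStream(target))))) {
                encoder.writeObject(history);
            }
        }
    }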

@vladak
Member

vladak commented Aug 3, 2017

The reason why history is not cached for directories is given in FileHistoryCache.java#get():

            // Don't cache history-information for directories, since the
            // history information on the directory may change if a file in
            // a sub-directory change. This will cause us to present a stale
            // history log until a the current directory is updated and
            // invalidates the cache entry.

So if a directory cache is implemented, that would mean traversing the directory hierarchy all the way up from the changed file and invalidating all directory cache entries along the way, or devising a better solution.
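
A minimal sketch of that upward invalidation walk, assuming each directory's cached history lives in a hypothetical .dir_history.gz file inside it:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical sketch: when a file changes, drop the (assumed) directory history cache
    // entry of every ancestor directory up to and including the repository root.
    public class InvalidateDirCaches {
        static void invalidate(Path changedFile, Path repositoryRoot) throws IOException {
            Path dir = changedFile.getParent();
            while (dir != null && dir.startsWith(repositoryRoot)) {
                Files.deleteIfExists(dir.resolve(".dir_history.gz"));
                if (dir.equals(repositoryRoot)) {
                    break;
                }
                dir = dir.getParent();
            }
        }
    }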

@totembe
Author

totembe commented Aug 3, 2017

The latest revision hash can be parsed from git log. After parsing the first record, the command's output pipe can be closed because we don't need the rest of the records, which gives a performance boost. After getting the latest revision hash, it can be compared with the revision hash associated with the history cache; if they are not equal, the cache can be invalidated and a new cache generated. I tried this for subdirectories and git log works.

@vladak
Member

vladak commented Aug 3, 2017

The latest changeset is easy to acquire via git log -n1 (plus some templating), no need to close the pipe.

Anyhow, there are (at least) 2 different ways to approach this:

  1. do not cache anything and just cut the number of log entries using the -n option of git log. If the first page is displayed, cfg.getSearchMaxItems() changesets will have to be fetched, for the second page double that, etc. This would work only if there is a cheap way to retrieve the number of changesets for a given directory, since it is needed to construct the slider in history.jsp:
        // We have a lots of results to show: create a slider for them
        request.setAttribute("history.jsp-slider", Util.createSlider(start, max, totalHits, request));
  2. get the full history and cache it on the first request. Invalidate and refetch the history using the approach described above.

The first option has the advantage that it might be fast for the first couple of history pages, but it will get progressively worse (assuming the history is not cached for the session). Also, git log has to be called for each page.

The advantage of the second option is that once the cache is populated and valid, the history fetch will be quick. However, the first request will always be slow. Also, if the repository changes often and reindexing is done often too, the cache will be mostly invalid, saving no time.
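
A sketch of the freshness check the second option would need, assuming the cached directory history remembers the changeset it was generated from (the class name is illustrative and error handling is simplified):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.InputStreamReader;

    // Hypothetical sketch: compare the repository's latest changeset with the hash
    // remembered alongside the cached directory history; a mismatch means the cache is stale.
    public class CacheFreshness {
        static String latestChangeset(File repositoryRoot) throws Exception {
            Process p = new ProcessBuilder("git", "log", "-n1", "--pretty=format:%H")
                    .directory(repositoryRoot)
                    .start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String hash = r.readLine();
                p.waitFor();
                return hash;
            }
        }

        static boolean cacheIsStale(File repositoryRoot, String cachedHash) throws Exception {
            return !latestChangeset(repositoryRoot).equals(cachedHash);
        }
    }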

@totembe
Author

totembe commented Aug 3, 2017

The 1st method seems better.

The total number of changesets can be acquired with git rev-list --count --all $subdir

Edit: for the repository's root directory, omitting $subdir is much better performance-wise.

$ time git rev-list --count --all
9304

real 0m0.041s
user 0m0.036s
sys 0m0.004s
$ time git rev-list --count --all .
9304

real 0m0.686s
user 0m0.612s
sys 0m0.068s

Performance won't degrade with the --skip option:

git log -n $history_per_page --skip=$(( ($page - 1) * $history_per_page )) $subdir (I tried 0 for the first page and it works)
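
The same pagination arithmetic expressed as a Java sketch (page is 1-based; the class name and hard-coded page size are illustrative, and historyPerPage would come from cfg.getSearchMaxItems() in OpenGrok):

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch: build the git log invocation for one history page of a directory.
    public class PaginatedLog {
        static List<String> logCommand(String subdir, int page, int historyPerPage) {
            int skip = (page - 1) * historyPerPage;
            return Arrays.asList("git", "log",
                    "-n", String.valueOf(historyPerPage),
                    "--skip=" + skip,
                    "--", subdir);
        }

        public static void main(String[] args) {
            // First page of 20 entries for the "src" subdirectory:
            System.out.println(logCommand("src", 1, 20));
            // prints [git, log, -n, 20, --skip=0, --, src]
        }
    }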

@vladak
Member

vladak commented Aug 3, 2017

Well, it should work not only for git but ideally also for other SCMs that support per-directory history retrieval.
