
generate historycache for directories #1704


Open
totembe opened this issue Jul 31, 2017 · 17 comments


@totembe

totembe commented Jul 31, 2017

I have one particularly active project with 9265 commits, which span 371 history pages. Each page takes about 15 seconds to load. The project's git repository is 890MB in size.

Another project has a total of 2717 commits and loads in 1-2 seconds, which is a far better and acceptable duration. This project is about 130MB.

A third project has a total of 6955 commits, is 93MB in size, and also takes 1-2 seconds per page.

It seems that load time increases with the size of the git repository. Is it possible to optimize this?

I am running in VirtualBox on an SSD, on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 64GB RAM; the VM is allocated 4 cores with a 100% execution cap and 16GB of RAM.

totembe changed the title from "With long commit history, paginated listing takes a lot of time" to "With large git repository size, paginated history listing takes a lot of time" on Jul 31, 2017
@vladak
Member

vladak commented Jul 31, 2017

Do you use the history cache? If you run the git log commands by hand, do the times correspond to the times observed in OpenGrok?

@totembe
Author

totembe commented Jul 31, 2017

git log returns instantly.

I don't know of any history cache setting; I am using OpenGrok with stock settings. There is a historycache directory under the /var/opengrok/data folder.

1.7G ./index
788M ./historycache
1.6G ./xref
4.0G .

@vladak
Member

vladak commented Aug 1, 2017

That's strange. Could you inspect/instrument what the systems (both client and server) are doing during those 15 seconds? E.g. is the client CPU loaded, or is the server performing some heavy I/O?

Also, what OpenGrok version are you running? Since #1049, even if the history for a given directory/file is very long, the output is paginated, so it should not take too long to display/render.

@totembe
Author

totembe commented Aug 1, 2017

Both the requesting client and the OpenGrok server are located on my personal office machine. The machine is mostly idle. OpenGrok is served to a small team of 10 people.

[screenshot: system resource monitor covering the request]

I marked the start of my request with S and the end of the page render with E.

I am using OpenGrok-1.1-rc8.

@totembe
Author

totembe commented Aug 1, 2017

I found the culprit. As you pointed out, git is the source of the delay. When I dumped git log to a txt file, it took 9 seconds. I failed to spot this at first because git pipes the log into less and starts displaying it immediately; I had assumed the whole process finished first and displayed afterwards.

time /opt/git/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict > /home/ethem/test.txt

real 0m8.994s
user 0m7.372s
sys 0m1.588s

Instead of a full dump, the log retrieval can be optimized by using the --skip option and closing the pipe once the required data has been read.

E.g.:
time /opt/git/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict --skip=10 | head -n 100 > /home/ethem/test.txt

real 0m0.014s
user 0m0.008s
sys 0m0.004s

@totembe
Author

totembe commented Aug 1, 2017

As I am new to git, I now understand that this is a git problem.

After repacking the repository with git repack -a, the duration decreased from 8 seconds to 2.5 seconds. That is still not responsive enough for navigating between history log pages. The log itself is only a few megabytes (3.2MB in my case), so the git log output could be cached and OpenGrok could use that file.

@vladak
Member

vladak commented Aug 1, 2017

Well, if the history cache is used, running git log when displaying the history view page is avoided, because the history is stored in a compressed XML file (basically representing a set of HistoryEntry objects) under the historycache directory.

Do you run the indexer with the -H option?
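
For anyone who wants to verify what is inside such an entry, here is a minimal sketch, assuming the .gz files are gzip-compressed java.beans.XMLEncoder output (the class name and example path are illustrative, not OpenGrok code):

    import java.beans.XMLDecoder;
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.util.zip.GZIPInputStream;

    // Hypothetical illustration: decode one gzipped XML history cache entry and print it.
    public class ReadCacheEntry {
        public static void main(String[] args) throws Exception {
            String path = "/var/opengrok/data/historycache/foo/bar.txt.gz"; // example path
            try (XMLDecoder decoder = new XMLDecoder(new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(path))))) {
                // Assuming the XMLEncoder format described above, the decoded bean would be
                // the history object holding the HistoryEntry list; Object keeps this
                // sketch self-contained.
                Object history = decoder.readObject();
                System.out.println(history);
            }
        }
    }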

@totembe
Author

totembe commented Aug 2, 2017

I tried with

sudo OPENGROK_GENERATE_HISTORY=on /opt/opengrok-1.1-rc8/bin/OpenGrok index /home/ethem/og/src

I couldn't find any compressed XML files under /var/opengrok/data/historycache.

Am I missing something?

@vladak
Member

vladak commented Aug 2, 2017

So what are those 788M in the historycache directory?

If you have a project called foo that has a file bar.txt located directly in it (i.e. /var/opengrok/src/foo/bar.txt exists and /var/opengrok/src/foo is a Git/Mercurial/... repository), its historycache entry will be located in the file /var/opengrok/data/historycache/foo/bar.txt.gz. If that file is not there, the history view in the webapp has to resort to running git log directly, hence the delay you're seeing.

It seems that history cache generation failed for some reason. Do the indexer logs contain anything of interest?
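
As a small illustration of that layout, here is a hypothetical helper mapping a source file to its expected cache entry (the roots follow the stock /var/opengrok layout used in this thread; the helper itself is made up):

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Hypothetical helper: compute where the history cache entry for a source file should live.
    public class CachePath {
        static Path cacheEntryFor(Path sourceFile) {
            Path srcRoot = Paths.get("/var/opengrok/src");
            Path dataRoot = Paths.get("/var/opengrok/data/historycache");
            // Mirror the path under the source root and append ".gz".
            Path relative = srcRoot.relativize(sourceFile);
            return dataRoot.resolve(relative.toString() + ".gz");
        }

        public static void main(String[] args) {
            System.out.println(cacheEntryFor(Paths.get("/var/opengrok/src/foo/bar.txt")));
            // prints /var/opengrok/data/historycache/foo/bar.txt.gz
        }
    }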

vladak added the question label on Aug 2, 2017
vladak changed the title from "With large git repository size, paginated history listing takes a lot of time" to "git historycache generation failed. Why?" on Aug 2, 2017
@totembe
Author

totembe commented Aug 2, 2017

There are .gz files for each code file in our projects. While browsing the history of a single code file I didn't have any issue, so I think those files are the per-file caches.

My problem is with browsing the history of the entire repository.

URL (takes time):
http://opengrokhost/source/history/foo

URL(no problems at the moment):
http://opengrokhost/source/history/foo/bar.txt

@vladak
Member

vladak commented Aug 2, 2017

Aha! :-) The per-directory history is not cached, so git log is run every time. The reason lies in how the history cache is created: it uses a trick to convert per-repository history into per-file history by mapping changesets to the files changed in them and then inverting this map. I am not sure this trick can be used for creating a per-directory history cache. If it can, it will certainly demand more space for storing the historycache and it will make reindexing longer too.

The other option is to create the history cache for directories on demand. Then only the first display will take a long time, and subsequent displays (leveraging the incremental history generation using the OpenGroklatestRev file) will be fast, as long as not too many history entries have been added.
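
A minimal sketch of the changeset-to-file inversion trick described above, using plain strings in place of OpenGrok's HistoryEntry objects (the class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: flip a per-repository log (changeset -> files it touched)
    // into a per-file map (file -> changesets that touched it).
    public class InvertHistory {
        static Map<String, List<String>> invert(Map<String, List<String>> changesetToFiles) {
            Map<String, List<String>> fileToChangesets = new HashMap<>();
            for (Map.Entry<String, List<String>> e : changesetToFiles.entrySet()) {
                for (String file : e.getValue()) {
                    fileToChangesets.computeIfAbsent(file, f -> new ArrayList<>()).add(e.getKey());
                }
            }
            return fileToChangesets;
        }

        public static void main(String[] args) {
            Map<String, List<String>> log = new HashMap<>();
            log.put("c1", Arrays.asList("foo/bar.txt", "foo/baz.txt"));
            log.put("c2", Arrays.asList("foo/bar.txt"));
            System.out.println(invert(log)); // e.g. {foo/baz.txt=[c1], foo/bar.txt=[c1, c2]}
        }
    }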

vladak changed the title from "git historycache generation failed. Why?" to "generate historycache for directories" on Aug 2, 2017
@vladak
Member

vladak commented Aug 2, 2017

Another idea would be to store the historycache at least for the top-level directory of a given repository, since that history is available anyway, i.e. change FileHistoryCache.java#store() to also store the history parameter in a file. Filed #1716 to track this.
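
A hedged sketch of that idea: write whatever history object store() receives into one gzipped XML file for the repository's top-level directory (the .dir_history.gz name and the use of XMLEncoder here are assumptions for illustration, not the actual FileHistoryCache code):

    import java.beans.XMLEncoder;
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;

    // Hypothetical sketch: persist the per-repository history for the top-level directory.
    public class StoreDirectoryHistory {
        static void storeTopLevel(Object history, String repositoryCacheDir) throws Exception {
            // ".dir_history.gz" is a made-up file name for this illustration.
            String target = repositoryCacheDir + "/.dir_history.gz";
            try (XMLEncoder encoder = new XMLEncoder(new GZIPOutputStream(
                    new BufferedOutputStream(new FileOutputStream(target))))) {
                encoder.writeObject(history);
            }
        }
    }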

@vladak
Member

vladak commented Aug 3, 2017

The reason why history is not cached for directories is given in FileHistoryCache.java#get():

            // Don't cache history-information for directories, since the
            // history information on the directory may change if a file in
            // a sub-directory change. This will cause us to present a stale
            // history log until a the current directory is updated and
            // invalidates the cache entry.

So if a directory cache is implemented, that would mean traversing the directory hierarchy all the way up from the changed file and invalidating all directory cache entries along the way, or devising a better solution.
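
A minimal sketch of that upward invalidation walk, assuming each directory's cached history lives in a hypothetical .dir_history.gz file inside it:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical sketch: when a file changes, drop the (assumed) directory history cache
    // entry of every ancestor directory up to and including the repository root.
    public class InvalidateDirCaches {
        static void invalidate(Path changedFile, Path repositoryRoot) throws IOException {
            Path dir = changedFile.getParent();
            while (dir != null && dir.startsWith(repositoryRoot)) {
                Files.deleteIfExists(dir.resolve(".dir_history.gz"));
                if (dir.equals(repositoryRoot)) {
                    break;
                }
                dir = dir.getParent();
            }
        }
    }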

@totembe
Author

totembe commented Aug 3, 2017

The latest revision hash can be parsed from git log. After parsing the first record, the command's output pipe can be closed because we don't need the rest of the records, which gives a performance boost. After getting the latest revision hash, it can be compared with the revision hash associated with the history cache; if they are not equal, the cache can be invalidated and a new cache generated. I tried this for subdirectories and git log works.

@vladak
Member

vladak commented Aug 3, 2017

The latest changeset is easy to acquire via git log -n1 (plus some templating), no need to close the pipe.

Anyhow, there are (at least) 2 different ways to approach this:

  1. do not cache anything and just cut the number of log entries using the -n option of git log. If the first page is displayed, cfg.getSearchMaxItems() changesets will have to be fetched, for the second page double that, etc. This would work only if there is a cheap way to retrieve the number of changesets for a given directory, since it is needed to construct the slider in history.jsp:
        // We have a lots of results to show: create a slider for them
        request.setAttribute("history.jsp-slider", Util.createSlider(start, max, totalHits, request));
  2. get the full history and cache it on the first request. Invalidate and refetch the history using the approach described above.

The first option has the advantage that it might be fast for the first couple of history pages, but it will get progressively worse (assuming the history is not cached for the session). Also, git log has to be called for each page.

The advantage of the second option is that once the cache is populated and valid, the history fetch will be quick. However, the first request will always be slow. Also, if the repository changes often and reindexing is done often too, the cache will be mostly invalid, saving no time.
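
A sketch of the freshness check the second option would need, assuming the cached directory history remembers the changeset it was generated from (the class name is illustrative and error handling is simplified):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.InputStreamReader;

    // Hypothetical sketch: compare the repository's latest changeset with the hash
    // remembered alongside the cached directory history; a mismatch means the cache is stale.
    public class CacheFreshness {
        static String latestChangeset(File repositoryRoot) throws Exception {
            Process p = new ProcessBuilder("git", "log", "-n1", "--pretty=format:%H")
                    .directory(repositoryRoot)
                    .start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String hash = r.readLine();
                p.waitFor();
                return hash;
            }
        }

        static boolean cacheIsStale(File repositoryRoot, String cachedHash) throws Exception {
            return !latestChangeset(repositoryRoot).equals(cachedHash);
        }
    }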

@totembe
Author

totembe commented Aug 3, 2017

The 1st method seems better.

The total number of changesets can be acquired with git rev-list --count --all $subdir

Edit: for the repository's root directory, omitting $subdir is much better performance-wise.

$ time git rev-list --count --all
9304

real 0m0.041s
user 0m0.036s
sys 0m0.004s
$ time git rev-list --count --all .
9304

real 0m0.686s
user 0m0.612s
sys 0m0.068s

Performance won't degrade with the --skip option:

git log -n $history_per_page --skip=$(( ($page - 1) * $history_per_page )) $subdir (I tried 0 for the first page and it works)
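
The same pagination arithmetic expressed as a Java sketch (page is 1-based; the class name and hard-coded page size are illustrative, and historyPerPage would come from cfg.getSearchMaxItems() in OpenGrok):

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch: build the git log invocation for one history page of a directory.
    public class PaginatedLog {
        static List<String> logCommand(String subdir, int page, int historyPerPage) {
            int skip = (page - 1) * historyPerPage;
            return Arrays.asList("git", "log",
                    "-n", String.valueOf(historyPerPage),
                    "--skip=" + skip,
                    "--", subdir);
        }

        public static void main(String[] args) {
            // First page of 20 entries for the "src" subdirectory:
            System.out.println(logCommand("src", 1, 20));
            // prints [git, log, -n, 20, --skip=0, --, src]
        }
    }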

@vladak
Member

vladak commented Aug 3, 2017

Well, it should work not only for git but ideally also for other SCMs that support per-directory history retrieval.
