
Commit 375fd15

wolfogre and GiteaBot authored
Refactor indexer (#25174)
Refactor `modules/indexer` to make it more maintainable and to make it easier to support more features. I'm trying to solve some issue-searching problems, and this is a precursor to making functional changes.

Currently supported engines and index versions:

| engines | issues | code |
| - | - | - |
| db | Just a wrapper for database queries, doesn't need a version | - |
| bleve | The index version is **2** | The index version is **6** |
| elasticsearch | The old index has no version and will be treated as version **0** in this PR | The index version is **1** |
| meilisearch | The old index has no version and will be treated as version **0** in this PR | - |

## Changes

### Split

Split it into multiple packages:

```text
indexer
├── internal
│   ├── bleve
│   ├── db
│   ├── elasticsearch
│   └── meilisearch
├── code
│   ├── bleve
│   ├── elasticsearch
│   └── internal
└── issues
    ├── bleve
    ├── db
    ├── elasticsearch
    ├── internal
    └── meilisearch
```

- `indexer/internal`: Internal shared package for indexer.
- `indexer/internal/[engine]`: Internal shared package for each engine (bleve/db/elasticsearch/meilisearch).
- `indexer/code`: Implementations for code indexer.
- `indexer/code/internal`: Internal shared package for code indexer.
- `indexer/code/[engine]`: Implementation via each engine for code indexer.
- `indexer/issues`: Implementations for issues indexer.

### Deduplication

- Combine `Init/Ping/Close` for code indexer and issues indexer.
- ~Combine `issues.indexerHolder` and `code.wrappedIndexer` to `internal.IndexHolder`.~ Remove it, use dummy indexer instead when the indexer is not ready.
- Deduplicate the two copies of the code that creates ES clients.
- Deduplicate the two copies of `indexerID()`.

### Enhancement

- [x] Support index version for elasticsearch issues indexer; the old index without a version will be treated as version 0.
- [x] Fix the spelling of `elastic_search/ElasticSearch`; it should be `Elasticsearch`.
- [x] Improve versioning of ES index. We don't need `Aliases`:
  - Gitea doesn't need aliases for "Zero Downtime" because it never deletes old indexes.
  - The old issues indexer code uses the original name to create the issue index, so it's tricky to convert it to an alias.
- [x] Support index version for meilisearch issues indexer; the old index without a version will be treated as version 0.
- [x] Do "ping" only when `Ping` has been called; don't ping periodically and cache the status.
- [x] Support the context parameter whenever possible.
- [x] Fix outdated example config.
- [x] Give up the requeue logic of issues indexer: when indexing fails, call Ping to check if it was caused by the engine being unavailable, and only requeue the task if the engine is unavailable.
  - It is fragile and tricky, and could cause data loss (it did happen when I was doing some tests for this PR). It also works for ES only.
  - Just always requeue the failed task; if the failure is caused by bad data, it's a bug in Gitea which should be fixed.

---------

Co-authored-by: Giteabot <[email protected]>
1 parent b0215c4 commit 375fd15
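
The versioning rule in the commit message (an old index without a version is treated as version 0, and no aliases are used) can be sketched roughly as follows. This is a minimal illustration, not code from the commit: the helper name `versionedIndexName` and the `.v<N>` suffix are assumptions chosen for the example.

```go
package main

import "fmt"

// versionedIndexName is a hypothetical helper, not taken from the commit.
// Version 0 stands for a legacy index created before versioning existed,
// so it keeps the bare configured name; later versions get a suffix
// (the ".v<N>" format is an assumption made for this sketch).
func versionedIndexName(name string, version int) string {
	if version == 0 {
		return name
	}
	return fmt.Sprintf("%s.v%d", name, version)
}

func main() {
	fmt.Println(versionedIndexName("gitea_issues", 0))
	fmt.Println(versionedIndexName("gitea_issues", 1))
}
```

Because each version maps to a distinct index name and old indexes are never deleted, a scheme like this has no need for an alias that is switched between indexes.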

43 files changed, with 1373 additions and 1425 deletions

custom/conf/app.example.ini

Lines changed: 3 additions & 3 deletions
@@ -1334,10 +1334,10 @@ LEVEL = Info
 ;; Issue indexer storage path, available when ISSUE_INDEXER_TYPE is bleve
 ;ISSUE_INDEXER_PATH = indexers/issues.bleve ; Relative paths will be made absolute against _`AppWorkPath`_.
 ;;
-;; Issue indexer connection string, available when ISSUE_INDEXER_TYPE is elasticsearch or meilisearch
-;ISSUE_INDEXER_CONN_STR = http://elastic:changeme@localhost:9200
+;; Issue indexer connection string, available when ISSUE_INDEXER_TYPE is elasticsearch (e.g. http://elastic:password@localhost:9200) or meilisearch (e.g. http://:apikey@localhost:7700)
+;ISSUE_INDEXER_CONN_STR =
 ;;
-;; Issue indexer name, available when ISSUE_INDEXER_TYPE is elasticsearch
+;; Issue indexer name, available when ISSUE_INDEXER_TYPE is elasticsearch or meilisearch.
 ;ISSUE_INDEXER_NAME = gitea_issues
 ;;
 ;; Timeout the indexer if it takes longer than this to start.
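
The two example connection strings above put the credentials in the URL's userinfo part; the meilisearch form has an empty user name and carries the API key in the password slot. A minimal sketch of how such a string could be split with Go's standard `net/url` package follows. The helper `splitConnStr` is hypothetical and is not the parsing code Gitea uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// splitConnStr is a hypothetical helper (not from the commit) that separates
// the credentials embedded in a connection string from the bare endpoint.
func splitConnStr(connStr string) (endpoint, user, secret string, err error) {
	u, err := url.Parse(connStr)
	if err != nil {
		return "", "", "", err
	}
	if u.User != nil {
		user = u.User.Username()
		secret, _ = u.User.Password()
		u.User = nil // drop the credentials from the endpoint URL
	}
	return u.String(), user, secret, nil
}

func main() {
	for _, s := range []string{
		"http://elastic:password@localhost:9200",
		"http://:apikey@localhost:7700",
	} {
		endpoint, user, secret, err := splitConnStr(s)
		fmt.Println(endpoint, user, secret, err)
	}
}
```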

docs/content/doc/administration/config-cheat-sheet.en-us.md

Lines changed: 3 additions & 3 deletions
@@ -458,15 +458,15 @@ relation to port exhaustion.
 ## Indexer (`indexer`)
 
 - `ISSUE_INDEXER_TYPE`: **bleve**: Issue indexer type, currently supported: `bleve`, `db`, `elasticsearch` or `meilisearch`.
-- `ISSUE_INDEXER_CONN_STR`: ****: Issue indexer connection string, available when ISSUE_INDEXER_TYPE is elasticsearch, or meilisearch. i.e. http://elastic:changeme@localhost:9200
-- `ISSUE_INDEXER_NAME`: **gitea_issues**: Issue indexer name, available when ISSUE_INDEXER_TYPE is elasticsearch
+- `ISSUE_INDEXER_CONN_STR`: ****: Issue indexer connection string, available when ISSUE_INDEXER_TYPE is elasticsearch (e.g. http://elastic:password@localhost:9200) or meilisearch (e.g. http://:apikey@localhost:7700)
+- `ISSUE_INDEXER_NAME`: **gitea_issues**: Issue indexer name, available when ISSUE_INDEXER_TYPE is elasticsearch or meilisearch.
 - `ISSUE_INDEXER_PATH`: **indexers/issues.bleve**: Index file used for issue search; available when ISSUE_INDEXER_TYPE is bleve and elasticsearch. Relative paths will be made absolute against _`AppWorkPath`_.
 
 - `REPO_INDEXER_ENABLED`: **false**: Enables code search (uses a lot of disk space, about 6 times more than the repository size).
 - `REPO_INDEXER_REPO_TYPES`: **sources,forks,mirrors,templates**: Repo indexer units. The items to index could be `sources`, `forks`, `mirrors`, `templates` or any combination of them separated by a comma. If empty then it defaults to `sources` only, as if you'd like to disable fully please see `REPO_INDEXER_ENABLED`.
 - `REPO_INDEXER_TYPE`: **bleve**: Code search engine type, could be `bleve` or `elasticsearch`.
 - `REPO_INDEXER_PATH`: **indexers/repos.bleve**: Index file used for code search.
-- `REPO_INDEXER_CONN_STR`: ****: Code indexer connection string, available when `REPO_INDEXER_TYPE` is elasticsearch. i.e. http://elastic:changeme@localhost:9200
+- `REPO_INDEXER_CONN_STR`: ****: Code indexer connection string, available when `REPO_INDEXER_TYPE` is elasticsearch. i.e. http://elastic:password@localhost:9200
 - `REPO_INDEXER_NAME`: **gitea_codes**: Code indexer name, available when `REPO_INDEXER_TYPE` is elasticsearch
 
 - `REPO_INDEXER_INCLUDE`: **empty**: A comma separated list of glob patterns (see https://github.com/gobwas/glob) to **include** in the index. Use `**.txt` to match any files with .txt extension. An empty list means include all files.

modules/context/repo.go

Lines changed: 1 addition & 1 deletion
@@ -593,7 +593,7 @@ func RepoAssignment(ctx *Context) (cancel context.CancelFunc) {
 
 	ctx.Data["RepoSearchEnabled"] = setting.Indexer.RepoIndexerEnabled
 	if setting.Indexer.RepoIndexerEnabled {
-		ctx.Data["CodeIndexerUnavailable"] = !code_indexer.IsAvailable()
+		ctx.Data["CodeIndexerUnavailable"] = !code_indexer.IsAvailable(ctx)
 	}
 
 	if ctx.IsSigned {

modules/indexer/code/bleve.go renamed to modules/indexer/code/bleve/bleve.go

Lines changed: 36 additions & 120 deletions
@@ -1,14 +1,13 @@
 // Copyright 2019 The Gitea Authors. All rights reserved.
 // SPDX-License-Identifier: MIT
 
-package code
+package bleve
 
 import (
 	"bufio"
 	"context"
 	"fmt"
 	"io"
-	"os"
 	"strconv"
 	"strings"
 	"time"
@@ -17,12 +16,13 @@ import (
 	"code.gitea.io/gitea/modules/analyze"
 	"code.gitea.io/gitea/modules/charset"
 	"code.gitea.io/gitea/modules/git"
-	gitea_bleve "code.gitea.io/gitea/modules/indexer/bleve"
+	"code.gitea.io/gitea/modules/indexer/code/internal"
+	indexer_internal "code.gitea.io/gitea/modules/indexer/internal"
+	inner_bleve "code.gitea.io/gitea/modules/indexer/internal/bleve"
 	"code.gitea.io/gitea/modules/log"
 	"code.gitea.io/gitea/modules/setting"
 	"code.gitea.io/gitea/modules/timeutil"
 	"code.gitea.io/gitea/modules/typesniffer"
-	"code.gitea.io/gitea/modules/util"
 
 	"github.com/blevesearch/bleve/v2"
 	analyzer_custom "github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
@@ -31,10 +31,8 @@ import (
 	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
 	"github.com/blevesearch/bleve/v2/analysis/token/unicodenorm"
 	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
-	"github.com/blevesearch/bleve/v2/index/upsidedown"
 	"github.com/blevesearch/bleve/v2/mapping"
 	"github.com/blevesearch/bleve/v2/search/query"
-	"github.com/ethantkoenig/rupture"
 	"github.com/go-enry/go-enry/v2"
 )
 
@@ -59,38 +57,6 @@ func addUnicodeNormalizeTokenFilter(m *mapping.IndexMappingImpl) error {
 	})
 }
 
-// openBleveIndexer open the index at the specified path, checking for metadata
-// updates and bleve version updates. If index needs to be created (or
-// re-created), returns (nil, nil)
-func openBleveIndexer(path string, latestVersion int) (bleve.Index, error) {
-	_, err := os.Stat(path)
-	if err != nil && os.IsNotExist(err) {
-		return nil, nil
-	} else if err != nil {
-		return nil, err
-	}
-
-	metadata, err := rupture.ReadIndexMetadata(path)
-	if err != nil {
-		return nil, err
-	}
-	if metadata.Version < latestVersion {
-		// the indexer is using a previous version, so we should delete it and
-		// re-populate
-		return nil, util.RemoveAll(path)
-	}
-
-	index, err := bleve.Open(path)
-	if err != nil && err == upsidedown.IncompatibleVersion {
-		// the indexer was built with a previous version of bleve, so we should
-		// delete it and re-populate
-		return nil, util.RemoveAll(path)
-	} else if err != nil {
-		return nil, err
-	}
-	return index, nil
-}
-
 // RepoIndexerData data stored in the repo indexer
 type RepoIndexerData struct {
 	RepoID int64
@@ -111,8 +77,8 @@ const (
 	repoIndexerLatestVersion = 6
 )
 
-// createBleveIndexer create a bleve repo indexer if one does not already exist
-func createBleveIndexer(path string, latestVersion int) (bleve.Index, error) {
+// generateBleveIndexMapping generates a bleve index mapping for the repo indexer
+func generateBleveIndexMapping() (mapping.IndexMapping, error) {
 	docMapping := bleve.NewDocumentMapping()
 	numericFieldMapping := bleve.NewNumericFieldMapping()
 	numericFieldMapping.IncludeInAll = false
@@ -147,42 +113,28 @@ func createBleveIndexer(path string, latestVersion int) (bleve.Index, error) {
 	mapping.AddDocumentMapping(repoIndexerDocType, docMapping)
 	mapping.AddDocumentMapping("_all", bleve.NewDocumentDisabledMapping())
 
-	indexer, err := bleve.New(path, mapping)
-	if err != nil {
-		return nil, err
-	}
-
-	if err = rupture.WriteIndexMetadata(path, &rupture.IndexMetadata{
-		Version: latestVersion,
-	}); err != nil {
-		return nil, err
-	}
-	return indexer, nil
+	return mapping, nil
 }
 
-var _ Indexer = &BleveIndexer{}
+var _ internal.Indexer = &Indexer{}
 
-// BleveIndexer represents a bleve indexer implementation
-type BleveIndexer struct {
-	indexDir string
-	indexer bleve.Index
+// Indexer represents a bleve indexer implementation
+type Indexer struct {
+	inner *inner_bleve.Indexer
+	indexer_internal.Indexer // do not composite inner_bleve.Indexer directly to avoid exposing too much
 }
 
-// NewBleveIndexer creates a new bleve local indexer
-func NewBleveIndexer(indexDir string) (*BleveIndexer, bool, error) {
-	indexer := &BleveIndexer{
-		indexDir: indexDir,
+// NewIndexer creates a new bleve local indexer
+func NewIndexer(indexDir string) *Indexer {
+	inner := inner_bleve.NewIndexer(indexDir, repoIndexerLatestVersion, generateBleveIndexMapping)
+	return &Indexer{
+		Indexer: inner,
+		inner: inner,
 	}
-	created, err := indexer.init()
-	if err != nil {
-		indexer.Close()
-		return nil, false, err
-	}
-	return indexer, created, err
 }
 
-func (b *BleveIndexer) addUpdate(ctx context.Context, batchWriter git.WriteCloserError, batchReader *bufio.Reader, commitSha string,
-	update fileUpdate, repo *repo_model.Repository, batch *gitea_bleve.FlushingBatch,
+func (b *Indexer) addUpdate(ctx context.Context, batchWriter git.WriteCloserError, batchReader *bufio.Reader, commitSha string,
+	update internal.FileUpdate, repo *repo_model.Repository, batch *inner_bleve.FlushingBatch,
 ) error {
 	// Ignore vendored files in code search
 	if setting.Indexer.ExcludeVendored && analyze.IsVendor(update.Filename) {
@@ -227,7 +179,7 @@ func (b *BleveIndexer) addUpdate(ctx context.Context, batchWriter git.WriteClose
 	if _, err = batchReader.Discard(1); err != nil {
 		return err
 	}
-	id := filenameIndexerID(repo.ID, update.Filename)
+	id := internal.FilenameIndexerID(repo.ID, update.Filename)
 	return batch.Index(id, &RepoIndexerData{
 		RepoID: repo.ID,
 		CommitID: commitSha,
237189
})
238190
}
239191

240-
func (b *BleveIndexer) addDelete(filename string, repo *repo_model.Repository, batch *gitea_bleve.FlushingBatch) error {
241-
id := filenameIndexerID(repo.ID, filename)
192+
func (b *Indexer) addDelete(filename string, repo *repo_model.Repository, batch *inner_bleve.FlushingBatch) error {
193+
id := internal.FilenameIndexerID(repo.ID, filename)
242194
return batch.Delete(id)
243195
}
244196

245-
// init init the indexer
246-
func (b *BleveIndexer) init() (bool, error) {
247-
var err error
248-
b.indexer, err = openBleveIndexer(b.indexDir, repoIndexerLatestVersion)
249-
if err != nil {
250-
return false, err
251-
}
252-
if b.indexer != nil {
253-
return false, nil
254-
}
255-
256-
b.indexer, err = createBleveIndexer(b.indexDir, repoIndexerLatestVersion)
257-
if err != nil {
258-
return false, err
259-
}
260-
261-
return true, nil
262-
}
263-
264-
// Close close the indexer
265-
func (b *BleveIndexer) Close() {
266-
log.Debug("Closing repo indexer")
267-
if b.indexer != nil {
268-
err := b.indexer.Close()
269-
if err != nil {
270-
log.Error("Error whilst closing the repository indexer: %v", err)
271-
}
272-
}
273-
log.Info("PID: %d Repository Indexer closed", os.Getpid())
274-
}
275-
276-
// Ping does nothing
277-
func (b *BleveIndexer) Ping() bool {
278-
return true
279-
}
280-
281197
// Index indexes the data
282-
func (b *BleveIndexer) Index(ctx context.Context, repo *repo_model.Repository, sha string, changes *repoChanges) error {
283-
batch := gitea_bleve.NewFlushingBatch(b.indexer, maxBatchSize)
198+
func (b *Indexer) Index(ctx context.Context, repo *repo_model.Repository, sha string, changes *internal.RepoChanges) error {
199+
batch := inner_bleve.NewFlushingBatch(b.inner.Indexer, maxBatchSize)
284200
if len(changes.Updates) > 0 {
285201

286202
// Now because of some insanity with git cat-file not immediately failing if not run in a valid git directory we need to run git rev-parse first!
@@ -308,14 +224,14 @@ func (b *BleveIndexer) Index(ctx context.Context, repo *repo_model.Repository, s
 }
 
 // Delete deletes indexes by ids
-func (b *BleveIndexer) Delete(repoID int64) error {
+func (b *Indexer) Delete(_ context.Context, repoID int64) error {
 	query := numericEqualityQuery(repoID, "RepoID")
 	searchRequest := bleve.NewSearchRequestOptions(query, 2147483647, 0, false)
-	result, err := b.indexer.Search(searchRequest)
+	result, err := b.inner.Indexer.Search(searchRequest)
 	if err != nil {
 		return err
 	}
-	batch := gitea_bleve.NewFlushingBatch(b.indexer, maxBatchSize)
+	batch := inner_bleve.NewFlushingBatch(b.inner.Indexer, maxBatchSize)
 	for _, hit := range result.Hits {
 		if err = batch.Delete(hit.ID); err != nil {
 			return err
@@ -326,7 +242,7 @@ func (b *BleveIndexer) Delete(repoID int64) error {
 
 // Search searches for files in the specified repo.
 // Returns the matching file-paths
-func (b *BleveIndexer) Search(ctx context.Context, repoIDs []int64, language, keyword string, page, pageSize int, isMatch bool) (int64, []*SearchResult, []*SearchResultLanguages, error) {
+func (b *Indexer) Search(ctx context.Context, repoIDs []int64, language, keyword string, page, pageSize int, isMatch bool) (int64, []*internal.SearchResult, []*internal.SearchResultLanguages, error) {
 	var (
 		indexerQuery query.Query
 		keywordQuery query.Query
@@ -379,14 +295,14 @@ func (b *BleveIndexer) Search(ctx context.Context, repoIDs []int64, language, ke
 		searchRequest.AddFacet("languages", bleve.NewFacetRequest("Language", 10))
 	}
 
-	result, err := b.indexer.SearchInContext(ctx, searchRequest)
+	result, err := b.inner.Indexer.SearchInContext(ctx, searchRequest)
 	if err != nil {
 		return 0, nil, nil, err
 	}
 
 	total := int64(result.Total)
 
-	searchResults := make([]*SearchResult, len(result.Hits))
+	searchResults := make([]*internal.SearchResult, len(result.Hits))
 	for i, hit := range result.Hits {
 		startIndex, endIndex := -1, -1
 		for _, locations := range hit.Locations["Content"] {
@@ -405,11 +321,11 @@ func (b *BleveIndexer) Search(ctx context.Context, repoIDs []int64, language, ke
 		if t, err := time.Parse(time.RFC3339, hit.Fields["UpdatedAt"].(string)); err == nil {
 			updatedUnix = timeutil.TimeStamp(t.Unix())
 		}
-		searchResults[i] = &SearchResult{
+		searchResults[i] = &internal.SearchResult{
 			RepoID: int64(hit.Fields["RepoID"].(float64)),
 			StartIndex: startIndex,
 			EndIndex: endIndex,
-			Filename: filenameOfIndexerID(hit.ID),
+			Filename: internal.FilenameOfIndexerID(hit.ID),
 			Content: hit.Fields["Content"].(string),
 			CommitID: hit.Fields["CommitID"].(string),
 			UpdatedUnix: updatedUnix,
@@ -418,15 +334,15 @@ func (b *BleveIndexer) Search(ctx context.Context, repoIDs []int64, language, ke
 		}
 	}
 
-	searchResultLanguages := make([]*SearchResultLanguages, 0, 10)
+	searchResultLanguages := make([]*internal.SearchResultLanguages, 0, 10)
 	if len(language) > 0 {
 		// Use separate query to go get all language counts
 		facetRequest := bleve.NewSearchRequestOptions(facetQuery, 1, 0, false)
 		facetRequest.Fields = []string{"Content", "RepoID", "Language", "CommitID", "UpdatedAt"}
 		facetRequest.IncludeLocations = true
 		facetRequest.AddFacet("languages", bleve.NewFacetRequest("Language", 10))
 
-		if result, err = b.indexer.Search(facetRequest); err != nil {
+		if result, err = b.inner.Indexer.Search(facetRequest); err != nil {
 			return 0, nil, nil, err
 		}
 
@@ -436,7 +352,7 @@ func (b *BleveIndexer) Search(ctx context.Context, repoIDs []int64, language, ke
 			if len(term.Term) == 0 {
 				continue
 			}
-			searchResultLanguages = append(searchResultLanguages, &SearchResultLanguages{
+			searchResultLanguages = append(searchResultLanguages, &internal.SearchResultLanguages{
 				Language: term.Term,
 				Color: enry.GetColor(term.Term),
 				Count: term.Count,
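
The new `Indexer` struct in this file keeps the concrete engine in a private field and embeds only an interface, with the comment "do not composite inner_bleve.Indexer directly to avoid exposing too much". The general Go pattern can be sketched as below. All names here (`lifecycle`, `engine`, `wrapper`, `newWrapper`, `describe`) are hypothetical stand-ins invented for the example, not types from Gitea.

```go
package main

import (
	"errors"
	"fmt"
)

// lifecycle is a hypothetical narrow interface, standing in for the shared
// indexer interface that the wrapper wants to expose.
type lifecycle interface {
	Ping() error
	Close()
}

// engine is a hypothetical concrete type with more surface than the
// wrapper wants to make part of its own API.
type engine struct {
	Handle string // raw engine detail that should not be promoted
	closed bool
}

func (e *engine) Ping() error {
	if e.closed {
		return errors.New("engine is closed")
	}
	return nil
}

func (e *engine) Close() { e.closed = true }

// wrapper embeds only the interface, so just Ping and Close are promoted.
// The concrete pointer is kept in a named field for internal use.
type wrapper struct {
	inner     *engine
	lifecycle // embedding *engine here would also promote Handle
}

func newWrapper(handle string) *wrapper {
	e := &engine{Handle: handle}
	return &wrapper{inner: e, lifecycle: e}
}

// describe uses the private concrete field; callers cannot write w.Handle.
func (w *wrapper) describe() string { return "index at " + w.inner.Handle }

func main() {
	w := newWrapper("indexers/repos.bleve")
	fmt.Println(w.describe(), w.Ping() == nil)
	w.Close()
	fmt.Println(w.Ping() != nil)
}
```

Both fields point at the same value, so the promoted lifecycle methods and the engine-specific code operate on one shared state.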

modules/indexer/code/bleve_test.go

Lines changed: 0 additions & 30 deletions
This file was deleted.
