kallithea Changeset - caef0be39948


default

FUJIWARA Katsunori - 2017-01-22 18:17:38
foozy@lares.dti.ne.jp

search: make "repository:" condition work as expected

Before this revision, a "repository:foo" condition in a search for
"File contents" or "File names" matched files in all of the
repositories below.

- foo
- foo/bar
- foo-bar
- and so on ...

The Whoosh library, which is used to parse text for indexing and
searching, does the following:

- treats almost all non-alphanumeric characters as delimiters, both
  when indexing search items and when parsing search conditions
- indexes each field of a search item by multiple values

For example, files in the "foo/bar" repository are indexed by "foo" and
"bar" in the "repository" field. This tokenization makes a
"repository:foo" search condition match files in the "foo/bar"
repository, too.

In addition, using plain TEXT causes "stop words" in search conditions
to be ignored unintentionally. For example, "this", "a", "you", and so
on are dropped at both indexing and parsing, because they are too
generic (from the point of view of generic "text search").
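The effect described above can be sketched without Whoosh itself. The
snippet below is only a rough emulation using the standard "re" module:
the function name text_tokens and the short STOP_WORDS set are
assumptions made for illustration, not Whoosh's actual code or its
actual stop word list.

```python
import re

# Illustrative subset of stop words. This is an assumption for the sketch,
# not the list that Whoosh actually ships.
STOP_WORDS = frozenset(["this", "is", "it", "a", "you"])


def text_tokens(value):
    """Roughly emulate a word-splitting, lowercasing, stop-word-removing analyzer."""
    words = re.findall(r"\w+", value.lower())
    return [w for w in words if w not in STOP_WORDS]


print(text_tokens("foo"))         # ['foo']
print(text_tokens("foo/bar"))     # ['foo', 'bar']
print(text_tokens("foo-bar"))     # ['foo', 'bar']
print(text_tokens("this-is-it"))  # []
```

Because "foo" appears among the tokens of all three "foo*" names, a
condition on "foo" matches each of them, and a name made only of stop
words leaves nothing to match against.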

This issue can't be resolved by using ID instead of TEXT for
"repository" of SCHEMA, as previous revisions did for JOURNAL_SCHEMA,
because:

- highlighting file content requires SCHEMA to support the "positions"
  feature, but using ID instead of TEXT disables it
- using ID violates the current case-insensitive search policy, because
  it preserves the case of text

To make the "repository:" condition work as expected, this revision
explicitly specifies an "analyzer", which:

- avoids tokenization
- matches case-insensitively
- avoids removing "stop words" from text
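The three properties listed above can be illustrated with a small
stand-alone sketch. It only emulates the intended behaviour of a
"whole value plus lowercase" analyzer in plain Python; the helper names
icase_id_tokens and repository_matches are hypothetical and are not
part of Kallithea or Whoosh.

```python
def icase_id_tokens(value):
    """Emulate a whole-value, lowercasing analyzer: always exactly one token."""
    return [value.lower()]


def repository_matches(condition, repo_name):
    """A condition matches only if the single lowercased tokens are equal."""
    return icase_id_tokens(condition) == icase_id_tokens(repo_name)


for repo in ["foo", "foo/bar", "foo-bar", "FOO"]:
    print(repo, repository_matches("foo", repo))
# foo True
# foo/bar False
# foo-bar False
# FOO True

print(icase_id_tokens("this-is-it"))  # ['this-is-it']
```

Under these assumptions, "repository:foo" no longer matches "foo/bar"
or "foo-bar", still matches "FOO", and a name such as "this-is-it"
survives as a single searchable value.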

This revision requires fully rebuilding the index tables, because the
indexing schema has changed.

BTW, the "repository:" condition in a search for "Commit messages"
uses CHGSETS_SCHEMA instead of SCHEMA. The former uses ID for
"repository", which:

- avoids the issues caused by tokenization and by removing "stop words"

- disables the "positions" feature of CHGSETS_SCHEMA

  But highlighting file content isn't needed when searching for
  "Commit messages". Therefore, this can be ignored.

- preserves the case of text

  This violates the current case-insensitive search policy. This issue
  will be fixed by a subsequent revision, because fixing it isn't so
  simple.

2 files changed with 13 insertions and 4 deletions:

kallithea/lib/indexers/__init__.py

kallithea/tests/functional/test_search_indexing.py

kallithea/lib/indexers/__init__.py

 # -*- coding: utf-8 -*-
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation, either version 3 of the License, or
 # (at your option) any later version.
 #
 # This program is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
 #
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 """
 kallithea.lib.indexers
 ~~~~~~~~~~~~~~~~~~~~~~
 Whoosh indexing module for Kallithea
 This file was forked by the Kallithea project in July 2014.
 Original author and date, and relevant copyright and licensing information is below:
 :created_on: Aug 17, 2010
 :author: marcink
 :copyright: (c) 2013 RhodeCode GmbH, and others.
 :license: GPLv3, see LICENSE.md for more details.
 """
 import os
 import sys
 import logging
 from os.path import dirname
 # Add location of top level folder to sys.path
 sys.path.append(dirname(dirname(dirname(os.path.realpath(__file__)))))
-from whoosh.analysis import RegexTokenizer, LowercaseFilter
+from whoosh.analysis import RegexTokenizer, LowercaseFilter, IDTokenizer
 from whoosh.fields import TEXT, ID, STORED, NUMERIC, BOOLEAN, Schema, FieldType, DATETIME
 from whoosh.formats import Characters
 from whoosh.highlight import highlight as whoosh_highlight, HtmlFormatter, ContextFragmenter
 from kallithea.lib.utils2 import LazyProperty
 log = logging.getLogger(__name__)
 # CUSTOM ANALYZER wordsplit + lowercase filter
 ANALYZER = RegexTokenizer(expression=r"\w+") | LowercaseFilter()
+# CUSTOM ANALYZER raw-string + lowercase filter
+#
+# This is useful to:
+# - avoid tokenization
+# - avoid removing "stop words" from text
+# - search case-insensitively
+#
+ICASEIDANALYZER = IDTokenizer() | LowercaseFilter()
 #INDEX SCHEMA DEFINITION
 SCHEMA = Schema(
     fileid=ID(unique=True),
     owner=TEXT(),
-    repository=TEXT(stored=True),
+    repository=TEXT(stored=True, analyzer=ICASEIDANALYZER),
     path=TEXT(stored=True),
     content=FieldType(format=Characters(), analyzer=ANALYZER,
                       scorable=True, stored=True),
     modtime=STORED(),
     extension=TEXT(stored=True)
 )
 IDX_NAME = 'HG_INDEX'
 FORMATTER = HtmlFormatter('span', between='\n<span class="break">...</span>\n')
 FRAGMENTER = ContextFragmenter(200)
 CHGSETS_SCHEMA = Schema(
     raw_id=ID(unique=True, stored=True),
     date=NUMERIC(stored=True),
     last=BOOLEAN(),
     owner=TEXT(),
     repository=ID(unique=True, stored=True),
     author=TEXT(stored=True),
     message=FieldType(format=Characters(), analyzer=ANALYZER,
                       scorable=True, stored=True),
     parents=TEXT(),
     added=TEXT(),
     removed=TEXT(),
     changed=TEXT(),
 )
 CHGSET_IDX_NAME = 'CHGSET_INDEX'
 # used only to generate queries in journal
 JOURNAL_SCHEMA = Schema(
     username=ID(),
     date=DATETIME(),
     action=TEXT(),
     repository=ID(),
     ip=TEXT(),
 )
 class WhooshResultWrapper(object):
     def __init__(self, search_type, searcher, matcher, highlight_items,
                  repo_location):
         self.search_type = search_type
         self.searcher = searcher
         self.matcher = matcher
         self.highlight_items = highlight_items
         self.fragment_size = 200
         self.repo_location = repo_location

kallithea/tests/functional/test_search_indexing.py

@@ -68,99 +68,99 @@ def rebuild_index(full_index):
 class TestSearchControllerIndexing(TestController):
     @classmethod
     def setup_class(cls):
         for reponame, init_or_fork, groupname in repos:
             if groupname and groupname not in groupids:
                 group = fixture.create_repo_group(groupname)
                 groupids[groupname] = group.group_id
             if callable(init_or_fork):
                 repo = fixture.create_repo(reponame,
                                            repo_group=groupname)
                 init_or_fork(repo)
             else:
                 repo = fixture.create_fork(init_or_fork, reponame,
                                            repo_group=groupname)
             repoids[reponame] = repo.repo_id
         # treat "it" as indexable filename
         filenames_mock = list(INDEX_FILENAMES)
         filenames_mock.append('it')
         with mock.patch('kallithea.lib.indexers.daemon.INDEX_FILENAMES',
                         filenames_mock):
             rebuild_index(full_index=False) # only for newly added repos
     @classmethod
     def teardown_class(cls):
         # delete in reversed order, to delete fork destination at first
         for reponame, init_or_fork, groupname in reversed(repos):
             RepoModel().delete(repoids[reponame])
         for reponame, init_or_fork, groupname in reversed(repos):
             if groupname in groupids:
                 RepoGroupModel().delete(groupids.pop(groupname),
                                         force_delete=True)
         Session().commit()
         Session.remove()
         rebuild_index(full_index=True) # rebuild fully for subsequent tests
     @parametrize('reponame', [
         (u'indexing_test'),
         (u'indexing_test-fork'),
         (u'group/indexing_test'),
         (u'this-is-it'),
         (u'*-fork'),
         (u'group/*'),
     ])
     @parametrize('searchtype,query,hit', [
-        #('content', 'this_should_be_unique_content', 1),
+        ('content', 'this_should_be_unique_content', 1),
         ('commit', 'this_should_be_unique_commit_log', 1),
-        #('path', 'this_should_be_unique_filename.txt', 1),
+        ('path', 'this_should_be_unique_filename.txt', 1),
     ])
     def test_repository_tokenization(self, reponame, searchtype, query, hit):
         self.log_user()
         q = 'repository:%s %s' % (reponame, query)
         response = self.app.get(url(controller='search', action='index'),
                                 {'q': q, 'type': searchtype})
         response.mustcontain('>%d results' % hit)
     @parametrize('searchtype,query,hit', [
         ('content', 'this_should_be_unique_content', 2),
         ('commit', 'this_should_be_unique_commit_log', 1),
         ('path', 'this_should_be_unique_filename.txt', 2),
     ])
     def test_repository_case_sensitivity(self, searchtype, query, hit):
         self.log_user()
         lname = u'indexing_test-foo'
         uname = u'indexing_test-FOO'
         # (1) "repository:REPONAME" condition should match against
         # repositories case-insensitively
         q = 'repository:%s %s' % (lname, query)
         response = self.app.get(url(controller='search', action='index'),
                                 {'q': q, 'type': searchtype})
         response.mustcontain('>%d results' % hit)
         # (2) on the other hand, searching under the specific
         # repository should return results only for that repository,
         # even if specified name matches against another repository
         # case-insensitively.
         response = self.app.get(url(controller='search', action='index',
                                     repo_name=uname),
                                 {'q': query, 'type': searchtype})
         response.mustcontain('>%d results' % hit)
         # confirm that there is no matching against lower name repository
         assert uname in response
         #assert lname not in response
     @parametrize('searchtype,query,hit', [
         ('content', 'path:this/is/it def test', 37),
         ('commit', 'added:this/is/it bother to ask where', 4),
         # this condition matches against files below, because
         # "path:" condition is also applied on "repository path".
         # - "this/is/it" in "stopword_test" repo
