
User:Claus chr/Searching

From KDE Wiki Sandbox

Searching

The problems

There are (at least) three distinct problems with searching UserBase; they are (in order of increasing obscurity to regular readers and utility to writers and administrators):

  • Using the search box gives unhelpful and possibly incomplete results
  • Using What links here doesn't find Special:myLanguage links
  • Using DPL yields incomplete results

The search box

The problem: The query searches both page names and page content (by default only in the main namespace). This means you will often get hits for pages in many different languages, and the search results are presented in no particular order, so the results are frequently, if not useless, at least unreasonably difficult to use.

I'm not even sure we find all relevant hits. Last time I searched for "Amarok/Man" I got two pages' worth of results (about 40 pages in all), most of them translated pages. The search ought to have found all the Amarok Manual pages (and all their translations). In fact, only two English manual pages were found: Amarok/Manual/Various/FAQ and Amarok/Manual/Various/FAQ/en! (As an interesting side note, the search box query did find some pages containing a Special:myLanguage link!) (See the following Note box.)

Note

This needs to be rechecked. My feeling is that we usually get far more hits, so maybe this is a transient phenomenon. It's worrying nonetheless.

Update: The problem described above was caused by my searching on an incomplete page name. It seems that in this case only the page content is searched. Searching for Amarok/Manual/ gives a lot of hits in many languages - including English!


Going to the advanced search page is no help at all; that just gives us the option to cast an even wider net, searching more namespaces. (Btw. translated pages all live in the main namespace. The namespace Translation holds individual translated units.)

What we should wish for in a general search:

  • Only results in the reader's language by default
  • Some sort of Google-like prioritizing of results (?)
  • The option to specify desired language(s) as well as namespaces on the advanced page.
  • Search results should be comprehensive (e.g. all Amarok manual pages should be found)

What Links Here

In its current form What Links Here is simply useless - it doesn't know about Translate links (Special:myLanguage), so it doesn't pick up most pages. Only lingering old style links are found (mostly on translated pages). We need a Translate-aware What Links Here replacement.

DPL

In theory DPL should be able to overcome the limitations of both of the other two options. Sadly, that doesn't hold in the real world. For some queries it doesn't find all matching pages. Experiments shown on User:Claus chr/DPL/Test seem to indicate that some fixed capacity is exceeded. Sometimes performing a broad search on a pattern gives fewer results than searching for the same pattern in a narrower range of pages (e.g. searching only the User namespace as opposed to searching both the User and main namespaces in one query). The results are very reproducible; always the same hits (and misses).

Note

The text on User:Claus chr/DPL/Test reflects my thoughts on the problem as they evolved in response to the experiments, so they should be most accurate (and confusing) towards the end of the page. Also the description does not match the results currently displayed; this is just because more pages have been added to UserBase since the test page was written. (Yes, more pages can lead to fewer hits - see above)


Update

We have an answer from the DPL forum: There is a limit to the number of pages considered when searching with 'includematch' (as we must). By default that limit is 500 pages. We can limit the number of pages considered by, e.g., specifying a namespace, which is why we sometimes get more results from a narrower search. However, the Main namespace alone contains 5.5k pages (and growing! - about 1000 translatable pages + translations + category pages, etc.). If we try to use 'titlematch' and 'notitlematch' to narrow the search we only make things worse, because in that case namespace filtering is turned off, which means that all the units in the translate system are included (each unit is a separate page in the Translation namespace); that would account for about 150k pages by my best guesstimate. Filtering those out would be possible, if messy, and we would almost certainly filter out something that really should be included.
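The effect described above (a narrower search returning more hits) can be reproduced with a toy model. This is only a sketch under stated assumptions: it assumes the candidate pages are selected by namespace first, cut off at the limit, and only then matched on content. The function `capped_search`, the page records and the ordering of candidates are all hypothetical; this is not DPL's actual code.

```python
CAP = 500  # default limit on pages considered, as reported by the DPL forum


def capped_search(pages, namespaces, predicate, cap=CAP):
    """Toy model: filter by namespace, keep the first `cap` candidates,
    then apply the content match to those candidates only."""
    candidates = [p for p in pages if p["ns"] in namespaces]
    considered = candidates[:cap]  # everything past the cap is silently ignored
    return [p["title"] for p in considered if predicate(p)]


# Hypothetical wiki: 600 Main pages without the pattern, followed by
# 100 User pages of which every tenth contains the pattern.
pages = [{"ns": "Main", "title": f"Page{i}", "text": "plain"} for i in range(600)]
pages += [
    {
        "ns": "User",
        "title": f"User:Test{i}",
        "text": "uses Special:myLanguage" if i % 10 == 0 else "plain",
    }
    for i in range(100)
]


def has_link(page):
    return "Special:myLanguage" in page["text"]


broad = capped_search(pages, {"Main", "User"}, has_link)
narrow = capped_search(pages, {"User"}, has_link)

print(len(broad), len(narrow))
```

In this model the broad query uses up its whole budget on Main pages and never reaches the User pages, while the narrow query considers all 100 User pages and finds every match.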

Our options seem to be these:

  1. Increase the limit to how many pages are searched. There is a config setting for that.
    • Can the server handle that? We would need to raise the number from 500 to at least 6000, or better to 10000 - we need to guard against banging our heads against the limit again; no warning is given when that happens, and the page count increases daily thanks to many new translators.
    • What will that mean for search times? Increased search times may be acceptable for administrative work, but probably not for user searches.
  2. Try to split each search into many partial searches with clever use of 'titlematch' and 'notitlematch' to filter the number of pages considered.
    • It is not clear what criteria based on simple wild card matching we can use for this.
    • We'd need to keep an eye on the number of pages in each group. Not something I'm looking forward to maintaining.
  3. Find a different extension
    • I have already searched the MediaWiki extensions many times without coming up with anything that looks remotely promising.
    • Otoh, we might be able to make it ourselves. It seems that fundamentally what we need is a fairly simple query on the underlying database. I think that, in a way, the main problem with DPL is that it is much more powerful than what we need. A simpler tool might fill our needs without putting as much strain on the system. The only problem is that this simpler tool doesn't seem to exist - yet.

In my opinion the first option is by far the best (unless it slows our server to a crawl). It would be very simple to implement and require very little effort. If we can't use that, I think we should seriously consider the third option. I don't feel that the second option is worth considering as anything but a temporary stop-gap. Added: Even though the original problem with titlematch was all in my head, the second option is still very unattractive. We'd have to split our pages (including the translated ones) into 15+ groups of ca. 400 pages each (to allow for the addition of new pages) based on simple wildcard matching on page titles. And we'd have to monitor the actual size of these groups regularly, since we get no warning when there are too many pages in a group - the extra pages are silently ignored!
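The group count mentioned above follows from simple arithmetic. A minimal sketch, using the approximate figures quoted on this page (about 5.5k pages in Main, a limit of 500 and a target of ca. 400 pages per group); the variable names are only illustrative:

```python
import math

total_pages = 5500   # approximate size of the Main namespace quoted above
cap = 500            # default limit on pages considered per query
group_target = 400   # target group size, leaving room for new pages

groups_now = math.ceil(total_pages / group_target)
headroom_per_group = cap - group_target

print(groups_now, headroom_per_group)
```

At the current page count this gives 14 groups with room for 100 more pages each, so any growth quickly pushes the number to 15 or more.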

Update 2

It turns out that the second option above may not be as problematic as I first thought. If we assume that all searches can be limited to one language, then it would be enough to split searches into three: pages with titles beginning with A-J, with K, and with L-Z. That might be manageable. This is possible because we already seem to have a higher search limit than the default 500. Raising the limit would still be the simplest solution, but if that puts too much strain on the servers we can manage with option 2.

Update 3

My mistake. I was too optimistic, and did not go carefully enough through the pages found under the letter K. DPL reports 755 pages in namespace Main with names beginning with 'K'; however, only the first 500 of these are actually considered by DPL. Compounding the problem, TechBase seems to have even more pages in Main, and differently distributed; the letter C alone accounts for 503 pages. Using option two, we would need different solutions for each wiki. And even worse: searching for wiki markup can't be accomplished in a template, so we need to copy-paste and modify the entire DPL code for each such search, which is awkward and error-prone. I really, really think option one is the best way forward.
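The kind of monitoring that option two would require can be sketched as follows. This is only an illustration, assuming a list of page titles is available from somewhere (for example an export of all titles in Main); the function `oversized_buckets` and the sample titles are hypothetical, and only the counts 755 and 500 are taken from the text above.

```python
from collections import Counter


def oversized_buckets(titles, cap=500):
    """Count titles by first letter and return the letters whose
    count exceeds the cap, i.e. groups DPL would silently truncate."""
    counts = Counter(title[0].upper() for title in titles if title)
    return {letter: n for letter, n in counts.items() if n > cap}


# Hypothetical title list: 755 titles beginning with 'K' (the count
# reported above) and 120 beginning with 'A' (an invented figure).
titles = [f"K{i}" for i in range(755)] + [f"A{i}" for i in range(120)]

oversized = oversized_buckets(titles)
print(oversized)
```

A check like this would have to be rerun regularly and separately for each wiki, since the distribution of titles differs between UserBase and TechBase.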