Wikimedia Developer Support

Elastica error blocking PDF indexing

Search is working in general, but I’m troubleshooting a problem with PDF document indexing. Taking this file as an example, you should be able to find it by searching the wiki for Jeffrey Berman, but the Cirrus index doesn’t contain the full text of the document.

The original indexing process seems to have completed fine (all indexes are green).

curl localhost:9200/_cat/indices
green open mediawiki_namic_archive_first  fQCZpSBRRCG3YctuburKTw 4 0     0    0     1kb     1kb
green open mediawiki_igtpg_archive_first  4T-xH9-MS1iy0HAkN4Lv5w 4 0     0    0     1kb     1kb
green open mediawiki_igtpg_content_first  BYlEmVRbQmup3m_ZLDwZng 4 0     1    0  16.4kb  16.4kb
green open mw_cirrus_metastore_first      _E7izfXIQfavxxn2pT5Wsg 1 0   153   30  33.4kb  33.4kb
green open mediawiki_labs_archive_first   ZqP5fBpuQv21XWGGlDGDNw 4 0     0    0     1kb     1kb
green open mediawiki_slicer_general_first fuCZnC4SQBybd6SB0iVofg 4 0  6824 1309  30.5mb  30.5mb
green open mediawiki_namic_general_first  AP1cVZViTAuTit-_UQ_skQ 4 0 10802 1548  27.6mb  27.6mb
green open mediawiki_www_general_first    buOeI5qSQ4yEjYBT27gQKQ 4 0  3096   58  10.9mb  10.9mb
green open mediawiki_www_content_first    5WxBex3XSLKLsBIWe5PE0Q 4 0   991  197  28.4mb  28.4mb
green open mediawiki_slicer_archive_first LJltLq5XSEaUxCI7VnXmAQ 4 0     0    0     1kb     1kb
green open mediawiki_labs_general_first   LOJZ9YiXRNy5DjDwO57ZOQ 4 0     6    0 160.7kb 160.7kb
green open mediawiki_ncigt_general_first  PGDnvsckQK-F_ll08IcQog 4 0 12660 2525  39.6mb  39.6mb
green open mediawiki_slicer_content_first anp0BAVuRtywoVrcBbSzbA 4 0 11884 2995 236.2mb 236.2mb
green open mediawiki_namic_content_first  PcCSBPbOS6a7HhCuUMmhXw 4 0  4997  990 111.2mb 111.2mb
green open mediawiki_igtpg_general_first  kezT1avOQU6FZXVO0XykcA 4 0    18    0 183.7kb 183.7kb
green open mediawiki_labs_content_first   70wGSEXNRrCmTqGnK8otwg 4 0     1    0  11.8kb  11.8kb
green open mediawiki_ncigt_content_first  m8oQ4nhkRSujrJ9nAXkgEA 4 0  2298  397 126.3mb 126.3mb
green open mediawiki_www_archive_first    vT8kFUUDS6eJNQLoXuCFDQ 4 0     0    0     1kb     1kb
green open mediawiki_ncigt_archive_first  lnnw9XmwSJeyfxz7Q0laSg 4 0     0    0     1kb     1kb

If I try the regular process to force a complete re-index, I get a strange error that seems to indicate the indexes are read-only (or blocked somehow).

php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=slicer
php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --wiki=slicer
php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --wiki=slicer
php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=slicer
indexing namespaces...
        Indexing namespaces...[ee3fbf29c1e1dad97a3a2c5a] [no req]   Elastica\Exception\Bulk\ResponseException from line 410 of /opt/mediawiki/1.33.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Bulk.php: Error in one or more bulk request actions:

index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer--2 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer--1 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-1 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-2 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-3 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-4 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-5 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-6 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-7 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-8 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-9 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-10 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-11 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-12 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-13 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-14 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-15 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-100 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-101 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-274 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-275 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-2300 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-2301 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-2302 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
index: /mw_cirrus_metastore_first/mw_cirrus_metastore/namespace-mediawiki_slicer-2303 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];

Backtrace:
#0 /opt/mediawiki/1.33.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Bulk.php(359): Elastica\Bulk->_processResponse(Elastica\Response)
#1 /opt/mediawiki/1.33.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(361): Elastica\Bulk->send()
#2 /opt/mediawiki/1.33.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index.php(182): Elastica\Client->addDocuments(array, array)
#3 /opt/mediawiki/1.33.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Type.php(202): Elastica\Index->addDocuments(array, array)
#4 /opt/mediawiki/1.33.0/extensions/CirrusSearch/includes/MetaStore/MetaNamespaceStore.php(66): Elastica\Type->addDocuments(array)
#5 /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/indexNamespaces.php(37): CirrusSearch\MetaStore\MetaNamespaceStore->reindex(LanguageEn)
#6 /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(53): CirrusSearch\Maintenance\IndexNamespaces->execute()
#7 /opt/mediawiki/1.33.0/maintenance/doMaintenance.php(96): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute()
#8 /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(69): require_once(string)
#9 {main}

I tried forcing an index of just that document, which ‘worked’ (or failed silently), but there is no difference in the cirrusDump output or in the search results.

php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php --fromId 6349 --toId 6350 --wiki=slicer
[    mediawiki_slicer] Indexed 2 pages ending at 6350 at 0.1/second
Indexed a total of 2 pages at 0.1/second

So I ran the saneitizer to try to identify and fix inconsistencies in the index.
Fixed 5855 page(s) (19223 checked)
Still no joy.

Note: I believe this is an unrelated problem: PdfHandler is installed but the previews aren’t working and dimensions are 0x0 on the File: pages. I did run maintenance/rebuildImages.php also but there was no change in the search results or cirrusdump.

{
  "name" : "labs",
  "cluster_name" : "QualityBox Search",
  "cluster_uuid" : "tMm0lB2ESIivoHvGlMaIpQ",
  "version" : {
    "number" : "6.8.3",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "0c48c0e",
    "build_date" : "2019-08-29T19:05:24.312154Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Hi,

CirrusSearch relies on PdfHandler to index the text content of PDF files, so I’d suggest looking in this direction and fixing PdfHandler first. According to the documentation, you must install some packages to make it work properly (e.g. pdftotext must be present on your system).

You can use the ?action=cirrusDump param to look at the data currently indexed inside elasticsearch on any page URL of your wiki.
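
The same check can be scripted from the command line. The following is a minimal sketch, not an official tool: the wiki URL, the file title, and the helper names cirrus_dump_url and file_text_preview are placeholders, and it assumes the extracted PDF text shows up in the file_text field of the dump.

```shell
#!/bin/sh
# Hypothetical value: replace with your wiki's index.php URL.
WIKI_INDEX_URL="${WIKI_INDEX_URL:-https://wiki.example.org/index.php}"

# Build the cirrusDump URL for a page title. Spaces become underscores;
# other special characters would still need URL-encoding.
cirrus_dump_url() {
    title=$(printf '%s' "$2" | tr ' ' '_')
    printf '%s?title=%s&action=cirrusDump\n' "$1" "$title"
}

# Read a cirrusDump JSON response on stdin and print the first 80
# characters of the file_text value. Prints nothing if the field is
# absent, null, or empty.
file_text_preview() {
    grep -o '"file_text" *: *"[^"]\{0,80\}'
}

# Example (File:Example.pdf is a placeholder title):
# curl -s "$(cirrus_dump_url "$WIKI_INDEX_URL" 'File:Example.pdf')" | file_text_preview
```

If the last command prints nothing for a PDF that clearly contains text, the text has probably not been extracted or indexed.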

Hope it helps,

David.

Despite the fact that the requisite software was installed at the default paths, I needed to explicitly specify the following (on Debian 10) in LocalSettings.php:

$wgUseImageMagick = true;
$wgImageMagickConvertCommand = "/usr/bin/convert";
// Maximum amount of virtual memory available to shell processes under Linux, in KiB
$wgMaxShellMemory = 0; // unlimited; default is 300*1024
$wgMaxShellFileSize = 512000; // default is 102400
$wgPdfProcessor = '/usr/bin/gs'; 
$wgPdfPostProcessor = $wgImageMagickConvertCommand; // if defined via ImageMagick
// $wgPdfPostProcessor = '/usr/bin/convert';  // if not defined via ImageMagick
$wgPdfInfo = '/usr/bin/pdfinfo'; 
$wgPdftoText = '/usr/bin/pdftotext';

Because I have a cron job* that exercises the job queue, this made image previews work for PDF files. However, there is still a problem with indexing / searching the PDF text (as seen with action=cirrusDump).

* e.g. cron job to keep MediaWiki’s job queue happy
0 0 * * * root /usr/bin/php /opt/mediawiki/1.33.0/maintenance/runJobs.php --wiki=slicer 2> /var/log/runJobs.log
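
The binary paths set in LocalSettings.php above can be sanity-checked from a shell. This is only a sketch: check_tools is a made-up helper name, and www-data as the web server user is an assumption (the Debian default), not something confirmed for this setup.

```shell
#!/bin/sh
# Report, for each path given, whether it is an executable regular file.
# Returns non-zero if any path is missing or not executable.
check_tools() {
    status=0
    for tool in "$@"; do
        if [ -f "$tool" ] && [ -x "$tool" ]; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Paths as configured in LocalSettings.php:
# check_tools /usr/bin/convert /usr/bin/gs /usr/bin/pdfinfo /usr/bin/pdftotext
#
# To confirm that text extraction itself works for the web server user
# (www-data and the PDF path are placeholders):
# sudo -u www-data /usr/bin/pdftotext /path/to/some.pdf - | head
```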

After providing the absolute paths for the PDF tools, did you re-run all maintenance scripts (maintenance/rebuildImages.php and forceSearchIndex.php)?
Could you also upload a new PDF file to test? There might be issues that prevent existing pages from being refreshed.

I focused on the error message blocked by: [FORBIDDEN/12/index read-only / allow delete (api)] and found several Elasticsearch references indicating that this means your disk is running low on space, so the indexes are automatically switched to read-only.
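
A quick way to test that theory is to compare the usage of the partition holding the Elasticsearch data with the flood-stage watermark (95% by default). A minimal sketch, assuming the Debian package's default data directory /var/lib/elasticsearch; the function names are made up for illustration:

```shell
#!/bin/sh
# Assumption: default data directory of the Debian package.
# Check path.data in elasticsearch.yml if yours differs.
ES_DATA_DIR="${ES_DATA_DIR:-/var/lib/elasticsearch}"
# Default of cluster.routing.allocation.disk.watermark.flood_stage, in percent.
FLOOD_STAGE_PCT=95

# Print the used percentage (digits only) of the filesystem holding $1.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed when the given percentage is at or above the flood stage.
over_flood_stage() {
    [ "$1" -ge "$FLOOD_STAGE_PCT" ]
}

if [ -d "$ES_DATA_DIR" ]; then
    used=$(disk_used_pct "$ES_DATA_DIR")
    if over_flood_stage "$used"; then
        echo "WARNING: $ES_DATA_DIR is ${used}% full (flood stage: ${FLOOD_STAGE_PCT}%)"
    else
        echo "OK: $ES_DATA_DIR is ${used}% full"
    fi
fi

# Elasticsearch's own view of disk usage per node:
# curl -s 'localhost:9200/_cat/allocation?v'
```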

Sure enough, my main disk partition was 97% full. After cleaning house so that I have plenty of space available, now I’m trying to unlock those indexes. According to https://discuss.elastic.co/t/forbidden-12-index-read-only-allow-delete-api/126067/12 – once you reach the “flood stage” of 95% full, you need to manually unlock each index with

PUT /your-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}

but I’m not sure of the exact curl command. The following attempt returns an error:

curl -XPUT localhost:9200/mediawiki_slicer_content_first/_settings {"index.blocks.read_only_allow_delete": null }

Sorry, I completely overlooked this error message in your first message.
The following command:

curl -XPUT localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

should fix all your indices. Then run forceSearchIndex.php.

Thanks @DCausse_WMF. I had to specify an additional header, so my final fix was:
curl -X PUT localhost:9200/_all/_settings -H 'Content-Type: application/json' -d '{ "index.blocks.read_only_allow_delete": null }'

Now I’m able to rebuild my indexes and everything should be working as expected.
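
To confirm that the block is really gone before re-indexing, the index settings can be queried and the remaining blocks counted. This is a rough sketch: count_blocked is a made-up helper, and it assumes the get-settings API is called with flat_settings=true so that the setting appears as a single key.

```shell
#!/bin/sh
ES_URL="${ES_URL:-localhost:9200}"

# Read the JSON returned by
#   GET /_all/_settings/index.blocks.read_only_allow_delete?flat_settings=true
# on stdin and print how many indices still carry the block.
count_blocked() {
    grep -o '"index.blocks.read_only_allow_delete" *: *"true"' | wc -l | tr -d ' '
}

# Usage:
# curl -s "$ES_URL/_all/_settings/index.blocks.read_only_allow_delete?flat_settings=true" | count_blocked
#
# Once that prints 0, rebuild with the commands from the first post:
# php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip --wiki=slicer
# php /opt/mediawiki/1.33.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse --wiki=slicer
```

If the count goes back above 0 after a while, the disk has most likely crossed the flood-stage watermark again.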