Description
I have set up the OpenSearch container with the ingest plugin. I have set up collective.elastic.ingest in a local Python venv.
I have set up a simple Plone 6.1 site: not multilingual, language German, no content, collective.elastic.plone installed. Communication between the instances works.
I use the mappings.json file from the example docker-os directory in this package.
I add a Page with:
- title: Himbeere
- description: Birne
- richtext: Apfel
I add a PDF File with:
- title: Rot
- description: Gelb
- the PDF contains only one word: Grün
Now I use a REST client for easier debugging and send a request with the following body to http://localhost:9200/plone/_search
{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "the word i search",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "file__extracted.content",
        "text__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "80%"
    }
  }
}
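To repeat the tests without a REST client, the request above can be scripted. This is only a minimal sketch using the Python standard library: the helper names `build_query` and `search` are hypothetical, and the URL and query parameters are copied from the request body shown above.

```python
import json
import urllib.request

# URL taken from the report; adjust if the index or port differs.
SEARCH_URL = "http://localhost:9200/plone/_search"


def build_query(term: str) -> dict:
    """Return the multi_match request body from the report for one search term."""
    return {
        "_source": True,
        "query": {
            "multi_match": {
                "query": term,
                "fields": [
                    "title*^1.9",
                    "description*^1.5",
                    "file__extracted.content",
                    "text__extracted.content",
                ],
                "analyzer": "german",
                "operator": "or",
                "fuzziness": "AUTO",
                "prefix_length": 2,
                "type": "most_fields",
                "minimum_should_match": "80%",
            }
        },
    }


def search(term: str, url: str = SEARCH_URL) -> dict:
    """POST the query to the search endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_query(term)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `search("Apfel")["hits"]["total"]["value"]` against the running container would then give the hit count for one term.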
My search tests (every term returns a hit except the last one):
- Rot
- Gelb
- Grün
- Himbeere
- Birne
- Apfel -> no hits
When I inspect the result of the query for "Himbeere" (the Plone page), I see the term "Apfel", but not as plain text: the field text__extracted.content still contains the HTML.
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.491389,
    "hits": [
      {
        "_source": {
          "text__extracted": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "mt",
            "content": "<p>Apfel</p>",
            "content_length": 13
          },
          "text": {
            "data": "<p>Apfel</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          },
        }
      }
    ]
  }
}
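One way to narrow this down might be to ask OpenSearch which tokens it produces for the stored content, using the `_analyze` endpoint with the `field` parameter so that the analyzer from the index mapping is applied. The sketch below assumes the index name `plone` and the field name from the response above; the helper names `build_analyze_body` and `analyze` are hypothetical.

```python
import json
import urllib.request

# Assumption: same host and index as in the search request of this report.
ANALYZE_URL = "http://localhost:9200/plone/_analyze"


def build_analyze_body(text: str, field: str = "text__extracted.content") -> dict:
    """Request body that analyzes `text` with the analyzer mapped to `field`."""
    return {"field": field, "text": text}


def analyze(text: str, field: str = "text__extracted.content",
            url: str = ANALYZE_URL) -> list:
    """Return the list of token strings OpenSearch produces for `text`."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_analyze_body(text, field)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Each entry in "tokens" carries the token text under the "token" key.
    return [entry["token"] for entry in payload.get("tokens", [])]
```

Comparing `analyze("<p>Apfel</p>")` with `analyze("Apfel")` would show whether the markup changes the indexed tokens.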
When I inspect the result of the query for "grün" (the PDF file in my Plone site), I see the term "Grün" as plain text in the field file__extracted.content.
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6019437,
    "hits": [
      {
        "_source": {
          "file__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "Grün",
            "content_length": 6
          },
          "file": {
            "download": "http://carusnet.local/farben.pdf/@@download/file",
            "filename": "farben.pdf",
            "size": 6,
            "content-type": "application/pdf"
          },
        }
      }
    ]
  }
}
Two problems:
- The term in the richtext field is not found.
- Shouldn't the HTML markup be stripped from the text__extracted.content field? Perhaps that would also solve the first problem.
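
To illustrate what I would expect in text__extracted.content, here is a minimal sketch of HTML stripping with the Python standard library. It is not how collective.elastic.ingest or the ingest plugin work internally; the helper name `strip_html` is hypothetical and only shows the expected result for the richtext value from my example.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of `markup` with tags removed.

    Text nodes are joined with a single space, so adjacent block elements
    stay separate words. Simplification: inline tags inside a word
    (for example "<b>Ap</b>fel") would also be split by a space.
    """
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.parts).split())
```

For the page above, `strip_html("<p>Apfel</p>")` returns `"Apfel"`, which is the value I would expect to be indexed.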