Can anyone explain the mapping of plone.richtext behavior?

I have set up the OpenSearch container with the ingest plugin, and I have set up `collective.elastic.ingest` locally in a Python venv.
I have also set up a simple Plone 6.1 site: not multilingual, language German, no content, with `collective.elastic.plone` installed. Communication between the instances works.

I use the mappings.json file from the example docker-os directory in this package.

I add a Page with:
 - title: Himbeere
 - description: Birne
 - richtext: Apfel

I add a PDF File with:
 - title: Rot
 - description: Gelb
 - the PDF contains only one word: Grün

Now I use a REST client for easier debugging and send a request to http://localhost:9200/plone/_search:

```
{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "the word i search",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "file__extracted.content",
        "text__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "80%"
    }
  }
}
```
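For anyone who wants to reproduce the tests without a REST client, here is a minimal Python sketch of the same request, using only the standard library. The URL and the query body are taken from this issue; the helper names `build_query`, `search` and `hit_count` are made up for illustration and are not part of `collective.elastic.plone` or `collective.elastic.ingest`.

```python
import json
import urllib.request

# Search endpoint used in this issue; adjust host and index name if yours differ.
SEARCH_URL = "http://localhost:9200/plone/_search"


def build_query(term):
    """Return the multi_match request body from this issue for one search term."""
    return {
        "_source": True,
        "query": {
            "multi_match": {
                "query": term,
                "fields": [
                    "title*^1.9",
                    "description*^1.5",
                    "file__extracted.content",
                    "text__extracted.content",
                ],
                "analyzer": "german",
                "operator": "or",
                "fuzziness": "AUTO",
                "prefix_length": 2,
                "type": "most_fields",
                "minimum_should_match": "80%",
            }
        },
    }


def search(term, url=SEARCH_URL):
    """POST the query to the search endpoint and return the decoded JSON response."""
    body = json.dumps(build_query(term)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hit_count(response):
    """Read the total number of hits from a search response."""
    return response["hits"]["total"]["value"]
```

With the containers running, `hit_count(search("Apfel"))` can then be called for each of the six test terms.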

My search tests:

- [x] Rot
- [x] Gelb
- [x] Grün
- [x] Himbeere
- [x] Birne
- [ ] Apfel -> no hits

When I inspect the result of the query for the term "Himbeere" (the Plone page), I can see the term "Apfel", but not as plain text: the field `text__extracted.content` contains the HTML markup.

```
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.491389,
    "hits": [
      {
        "_source": {          
          "text__extracted": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "mt",
            "content": "<p>Apfel</p>",
            "content_length": 13
          },
          "text": {
            "data": "<p>Apfel</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          }
        }
      }
    ]
  }
}
```

When I inspect the result of the query for the term "grün" (the PDF file in my Plone site), I see the term "Grün" as plain text in the field `file__extracted.content`.

```
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6019437,
    "hits": [
      {
        "_source": {          
          "file__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "Grün",
            "content_length": 6
          },
          "file": {
            "download": "http://carusnet.local/farben.pdf/@@download/file",
            "filename": "farben.pdf",
            "size": 6,
            "content-type": "application/pdf"
          }
        }
      }
    ]
  }
}
```

Two problems:

- The term in the rich text field is not found.
- Shouldn't the HTML markup be stripped from the `text__extracted.content` field? Perhaps that would solve the first problem.
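To illustrate what the second point asks for, here is a minimal sketch of how the rich text value could be reduced to plain text, using Python's standard `html.parser`. This is not how `collective.elastic.ingest` or the ingest plugin processes the field; `TextExtractor` and `strip_html` are hypothetical names, and the sketch only shows the content I would expect in `text__extracted.content`. It does not confirm that the markup is the reason "Apfel" is not found.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &uuml; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces so adjacent paragraphs do not run together.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>Apfel</p>"))
```

Joining text nodes with a space keeps words from neighbouring block elements apart, at the cost of splitting a word that is interrupted by inline markup such as `<b>Ap</b>fel`.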

Any hints, @jensens or @ksuess?

