Can anyone explain the mapping of plone.richtext behavior? #35

@1letter

I have set up the OpenSearch container with the ingest plugin, and I have set up collective.elastic.ingest locally in a Python venv.
I have also set up a simple Plone 6.1 site: not multilingual, language German, no content, with collective.elastic.plone installed. Communication between the instances works.

I use the mappings.json file from the example docker-os directory in this package.

I add a Page with:

  • title: Himbeere
  • description: Birne
  • richtext: Apfel

I add a PDF File with:

  • title: Rot
  • description: Gelb
  • the PDF contains only one word: Grün

Now I use a REST client for easier debugging and send this request to http://localhost:9200/plone/_search:

{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "the word i search",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "file__extracted.content",
        "text__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "80%"
    }
  }
}
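
For anyone who wants to reproduce this without a REST client, the same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_query` and `search` are made up for illustration, and the URL is the one quoted in this issue.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:9200/plone/_search"  # endpoint from this issue


def build_query(term):
    """Return the multi_match request body shown above for one search term."""
    return {
        "_source": True,
        "query": {
            "multi_match": {
                "query": term,
                "fields": [
                    "title*^1.9",
                    "description*^1.5",
                    "file__extracted.content",
                    "text__extracted.content",
                ],
                "analyzer": "german",
                "operator": "or",
                "fuzziness": "AUTO",
                "prefix_length": 2,
                "type": "most_fields",
                "minimum_should_match": "80%",
            }
        },
    }


def search(term, url=SEARCH_URL):
    """POST the query to the search endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_query(term)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the containers running, `search("Apfel")["hits"]["total"]["value"]` gives the hit count for one term, which makes it easy to loop over all six test words.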

My search tests (every term returns a hit unless noted):

  • Rot
  • Gelb
  • Grün
  • Himbeere
  • Birne
  • Apfel -> no hits

When I inspect the result of the query for "Himbeere" (the Plone page), I do see the term "Apfel", but not as plain text: the field text__extracted.content contains the HTML.

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.491389,
    "hits": [
      {
        "_source": {          
          "text__extracted": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "mt",
            "content": "<p>Apfel</p>",
            "content_length": 13
          },
          "text": {
            "data": "<p>Apfel</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          }
        }
      }
    ]
  }
}
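
To narrow this down, it may help to ask OpenSearch directly how it treats that field. The sketch below prepares three checks: the field's mapping, the tokens its analyzer produces for the stored HTML, and a plain `match` query on that field alone. `_mapping/field` and `_analyze` are standard OpenSearch endpoints; the helper names `diagnostic_requests` and `run` are hypothetical, and the index name `plone` is taken from the search URL above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # host from this issue
INDEX = "plone"                     # index name from the search URL
FIELD = "text__extracted.content"


def diagnostic_requests(sample="<p>Apfel</p>", term="Apfel"):
    """Return (method, path, body) tuples for three checks on FIELD."""
    return [
        # 1. Which mapping (type, analyzer) does the field actually have?
        ("GET", f"/{INDEX}/_mapping/field/{FIELD}", None),
        # 2. Which tokens does the field's analyzer produce for the stored HTML?
        ("POST", f"/{INDEX}/_analyze", {"field": FIELD, "text": sample}),
        # 3. Does a plain match query on this field alone find the term?
        ("POST", f"/{INDEX}/_search", {"query": {"match": {FIELD: term}}}),
    ]


def run(method, path, body=None):
    """Send one request to OpenSearch and return the parsed JSON response."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Running `[run(*req) for req in diagnostic_requests()]` against the local container returns the three responses in order. If the third query finds the page while the multi_match query does not, the difference lies in the multi_match parameters rather than in the indexed content.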

When I inspect the result of the query for "grün" (the PDF file in my Plone site), I see the term "Grün" as plain text in the field file__extracted.content.

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6019437,
    "hits": [
      {
        "_source": {          
          "file__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "Grün",
            "content_length": 6
          },
          "file": {
            "download": "http://carusnet.local/farben.pdf/@@download/file",
            "filename": "farben.pdf",
            "size": 6,
            "content-type": "application/pdf"
          }
        }
      }
    ]
  }
}
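
The two responses differ in more than the markup: the page's extracted text is reported as ISO-8859-1 with language "mt", the PDF's as UTF-8 with language "de". Whether that matters for the missing hit is not established here. A small helper like the following can put the `*__extracted` metadata of several responses side by side; `extracted_summary` is a hypothetical name, and the sample dictionaries are trimmed copies of the two responses in this issue.

```python
def extracted_summary(response):
    """Collect the metadata of every `*__extracted` field in a search response."""
    rows = []
    for hit in response["hits"]["hits"]:
        for field, value in hit["_source"].items():
            if field.endswith("__extracted"):
                rows.append(
                    (
                        field,
                        value.get("content_type"),
                        value.get("language"),
                        value.get("content"),
                    )
                )
    return rows


# Trimmed copies of the two responses shown in this issue.
page_response = {
    "hits": {"hits": [{"_source": {"text__extracted": {
        "content_type": "text/plain; charset=ISO-8859-1",
        "language": "mt",
        "content": "<p>Apfel</p>",
        "content_length": 13,
    }}}]}
}
pdf_response = {
    "hits": {"hits": [{"_source": {"file__extracted": {
        "content_type": "text/plain; charset=UTF-8",
        "language": "de",
        "content": "Grün",
        "content_length": 6,
    }}}]}
}

for row in extracted_summary(page_response) + extracted_summary(pdf_response):
    print(row)
```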

Two problems:

  • The term in the richtext field is not found.
  • Shouldn't the HTML markup be stripped from the ‘text__extracted.content’ field? Perhaps that would solve the first problem.
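
To make the second point concrete, this is the kind of plain text one would expect in the extracted field. The sketch uses only Python's standard `html.parser` module; it is not how collective.elastic.ingest or the ingest plugin processes the field, and `TextExtractor` and `strip_html` are names invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join text nodes with a space, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>Apfel</p>"))
```

Joining text nodes with a space keeps words from adjacent paragraphs apart, at the cost of splitting a word that is interrupted by inline markup. On the index side, OpenSearch also provides an `html_strip` character filter that can be added to a custom analyzer; whether the example mappings.json applies it to this field is not something this issue establishes.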

Any hints, @jensens or @ksuess?

Labels: help wanted, question
