Releases: MaartenGr/BERTopic
v0.17.4
Highlights:
- Add `.delete_topics` by @shuanglovesdata in #2322 (see the sketch after this list)
- Allow execution without plotly by @luismavs in #2401
- Add tqdm to `_litellm.py` by @NFrnk in #2408
- Drop support for python 3.9 by @afuetterer in #2419
- Make UMAP's init default to random on visualize_topics for reproducible visualization by @makramab in #2412
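A hedged sketch of the new `.delete_topics` method; the exact signature is an assumption here (a list of topic IDs), and `docs` stands in for your own documents:

```python
from bertopic import BERTopic

topic_model = BERTopic().fit(docs)

# Assumed usage: remove topics from the fitted model by their IDs
topic_model.delete_topics([3, 7])
```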
cuML:
Preparing for MEGA!-scale BERTopic with Multi-GPU UMAP and the following necessary updates (a cuML usage sketch follows this list):
- Update installation instructions for cuML with BERTopic by @csadorf in #2446
- Speed up `._create_topic_vectors` by replacing DataFrame `.loc` with NumPy masking by @jinsolp in #2406
- Modify `_reduce_dimensionality` to use `fit_transform` by @betatim in #2416
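For context, a minimal sketch of plugging cuML's GPU-accelerated UMAP and HDBSCAN into BERTopic, assuming a working RAPIDS installation:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated dimensionality reduction and clustering
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```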
Fixes:
- Fix incorrect label in zero-shot svg in documentation by @huisman in #2448
- Enable ruff rule RUF by @afuetterer in #2457
- CI: bump github actions versions by @afuetterer in #2427
- CI: prefer action-pre-commit-uv for lint job by @afuetterer in #2434
- CI: switch to uv based project installation by @afuetterer in #2445
- Chore: update pre-commit hooks by @afuetterer in #2414 and #2443
- Chore: remove obsolete `version_info` check by @afuetterer in #2444
v0.17.3
v0.17.1
Highlights:
- Added FastEmbed backend by @nickprock in #2213
- Added LangChain backend by @regaltsui in #2303
- Pass precomputed embeddings to `KeyBERTInspired.extract_topics` by @saikumaru in #2368 (see the sketch below)
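For the precomputed-embeddings change, a minimal sketch; it assumes `sentence-transformers` is installed and that `docs` is your own list of documents:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=KeyBERTInspired(),
)

# The embeddings passed here are now reused by KeyBERTInspired
# instead of being recomputed during representation fine-tuning
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```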
Fixes:
- Merge models without pytorch (using safetensors) by @MaartenGr in #2329
- Fix installation issue with uv by @MaartenGr in #2328
- Fix incorrect comparison in `update_topics` by @uply23333 in #2336
- Add missing comma under Exploration subsection by @angelonazzaro in #2374
- Fix typo in Lightweight installation under tips_and_tricks by @angelonazzaro in #2375
- Fix IndexError in zero-shot topic modeling by @MaartenGr in #2267
v0.17.0
Highlights:
- Light-weight installation without UMAP and HDBSCAN by @MaartenGr in #2289
- Add Model2Vec as an embedding backend by @MaartenGr in #2245
- Add LiteLLM as a representation model by @MaartenGr in #2213
- Interactive DataMapPlot by @MaartenGr in #2287
Fixes:
- Lightweight installation: use safetensors without torch by @hedgeho in #2306
- Fix missing links by @MaartenGr in #2305
- Set up pre-commit hooks by @afuetterer in #2283
- Fix handling OpenAI returning None objects by @jeaninejuliettes in #2280
- Add support for python 3.13 by @afuetterer in #2173
- Added system prompts by @Leo-LiHao in #2145
- More documentation for topic reduction by @MaartenGr in #2260
- Drop support for python 3.8 by @afuetterer in #2243
- Fixed online topic modeling on GPU by @SSivakumar12 in #2181
- Fixed hierarchical cluster visualization by @PipaFlores in #2191
- Remove duplicated phrase by @AndreaFrancis in #2197
Model2Vec
With Model2Vec, we now have a very interesting pipeline for light-weight embeddings. Combined with the light-weight installation, you can now run BERTopic without using pytorch!
Installation is straightforward:
```bash
pip install --no-deps bertopic
pip install --upgrade numpy pandas scikit-learn tqdm plotly pyyaml
```
This installs BERTopic without UMAP and HDBSCAN, so you can use other techniques instead. If they are not installed, BERTopic falls back to PCA and scikit-learn's HDBSCAN (a sketch of passing these explicitly follows the commands below). You can still install them, together with Model2Vec:
```bash
pip install model2vec umap-learn hdbscan
```
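If you prefer to be explicit about that fallback, a minimal sketch of passing the scikit-learn components yourself (assuming scikit-learn >= 1.3, which ships its own HDBSCAN):

```python
from bertopic import BERTopic
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3
from sklearn.decomposition import PCA

# Replace umap-learn and hdbscan with scikit-learn equivalents
topic_model = BERTopic(
    umap_model=PCA(n_components=5),
    hdbscan_model=HDBSCAN(min_cluster_size=15),
)
```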
Then, creating a BERTopic model is as straightforward as you are used to:
```python
from bertopic import BERTopic
from model2vec import StaticModel

# Model2Vec
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# BERTopic
topic_model = BERTopic(embedding_model=embedding_model)
```
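Fitting then works as usual; in this short usage note, `docs` is assumed to be your own list of documents:

```python
topics, probs = topic_model.fit_transform(docs)
```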
DataMapPlot
To use the interactive version of DataMapPlot, you only need to run the following:
```python
from umap import UMAP

# Reduce your document embeddings (`embeddings`) to 2 dimensions
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

# Create an interactive DataMapPlot figure
topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings, interactive=True)
```
v0.16.4
Fixes
- Fix ValueError in Guided Topic Modeling by @RTChou in #2115
- Fix saving BERTopic when c-TF-IDF is None by @sete39 in #2112
- Fix `KeyError: 'topics_from'` in #2101
- Fix issues related to Zero-shot Topic Modeling by @ianrandman in #2105
- Fix regex matching being used in PartOfSpeech representation model by @woranov in #2138
- Update typo by @saikumaru in #2162
v0.16.3
Highlights
- Simplify zero-shot topic modeling by @ianrandman in #2060
- Option to choose between c-TF-IDF and Topic Embeddings in many functions by @azikoss in #1894
  - Use the `use_ctfidf` parameter in the following functions to choose between c-TF-IDF and topic embeddings: `hierarchical_topics`, `reduce_topics`, `visualize_hierarchy`, `visualize_heatmap`, `visualize_topics` (see the sketch after this list)
- Linting with Ruff by @afuetterer in #2033
- Switch from setup.py to pyproject.toml by @afuetterer in #1978
- In multi-aspect context, allow Main model to be chained by @ddicato in #2002
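For the `use_ctfidf` option, a hedged sketch assuming a fitted `topic_model` and the training `docs`:

```python
# Build the topic hierarchy from topic embeddings instead of c-TF-IDF
hierarchical_topics = topic_model.hierarchical_topics(docs, use_ctfidf=False)

# The same parameter is available in the visualization functions listed above
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics, use_ctfidf=False)
```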
Fixes
- Added templates for issues and pull requests
- Update River documentation example by @Proteusiq in #2004
- Fix PartOfSpeech reproducibility by @Greenpp in #1996
- Fix PartOfSpeech ignoring first word by @Greenpp in #2024
- Make sklearn embedding backend auto-select more cautious by @freddyheppell in #1984
- Fix typos by @afuetterer in #1974
- Fix hierarchical_topics(...) when the distances between three clusters are the same by @azikoss in #1929
- Fixes to chain strategy example in outlier_reduction.md by @reuning in #2065
- Remove obsolete flake8 config and update line length by @afuetterer in #2066
v0.16.2
Fixes:
- Fix issue with zeroshot topic modeling missing outlier #1957
- Bump github actions versions by @afuetterer in #1941
- Drop support for python 3.7 by @afuetterer in #1949
- Add testing python 3.10+ in Github actions by @afuetterer in #1968
- Speed up fitting CountVectorizer by @dannywhuang in #1938
- Fix `transform` when using cuML HDBSCAN by @beckernick in #1960
- Fix wrong link in algorithm documentation by @naeyn in #1970
v0.16.1
Highlights:
- Add Quantized LLM Tutorial
- Add optional datamapplot visualization using `topic_model.visualize_document_datamap` by @lmcinnes in #1750
- Migrated OpenAIBackend to openai>=1 by @peguerosdc in #1724
- Add automatic height scaling and font resize by @ir2718 in #1863
- Use `[KEYWORDS]` tags with the LangChain representation model by @mcantimmy in #1871
Fixes:
- Fixed issue with `.merge_models` seemingly skipping topic #1898
- Fixed Cohere `client.embed` TypeError #1904
- Fixed `AttributeError: 'TextGeneration' object has no attribute 'random_state'` #1870
- Fixed topic embeddings not properly updated if all outliers were removed #1838
- Fixed issue with representation models not properly merging #1762
- Fixed embeddings not ordered correctly when using `.merge_models` #1804
- Fixed Outlier topic not in the 0th position when using zero-shot topic modeling, causing prediction issues (amongst others) #1804
- Fixed Incorrect label in ZeroShot doc SVG #1732
- Fixed MultiModalBackend throws error with clip-ViT-B-32-multilingual-v1 #1670
- Fixed AuthenticationError while using OpenAI() #1678
- Update FAQ on Apple Silicon by @benz0li in #1901
- Add documentation DataMapPlot + FAQ for running on Apple Silicon by @dkapitan in #1854
- Remove commas from pip install reference in readme by @luisoala in #1850
- Spelling corrections by @joouha in #1801
- Replacing the deprecated `text-ada-001` model with the latest `text-embedding-3-small` from OpenAI by @atmb4u in #1800
- Prevent invalid empty input error when retrieving embeddings with openai backend by @liaoelton in #1827
- Remove spurious warning about missing embedding model by @sliedes in #1774
- Fix type hint in ClassTfidfTransformer constructor by @snape in #1803
- Fix typo and simplify wording in OnlineCountVectorizer docstring by @chrisji in #1802
- Fixed warning when saving a topic model without an embedding model by @zilch42 in #1740
- Fix bug in `TextGeneration` by @manveersadhal in #1726
- Fix an incorrect link to usecases.md by @nicholsonjf in #1731
- Prevent `model` argument being passed twice when using `generator_kwargs` in OpenAI by @ninavandiermen in #1733
- Several fixes to the docstrings by @arpadikuma in #1719
- Remove unused `cluster_df` variable in `hierarchical_topics` by @shadiakiki1986 in #1701
- Removed redundant quotation mark by @LawrenceFulton in #1695
- Fix typo in merge models docs by @zilch42 in #1660
v0.16
Highlights:
- Merge pre-trained BERTopic models with `.merge_models`
  - Combine models with different representations together!
  - Use this for incremental/online topic modeling to detect new incoming topics
  - First step towards federated learning with BERTopic
- Zero-shot Topic Modeling
  - Use a predefined list of topics to assign documents
  - If needed, allows for further exploration of undefined topics
- Seed (domain-specific) words with `ClassTfidfTransformer` (see the sketch after this list)
  - Make sure selected words are more likely to end up in the representation without influencing the clustering process
- Added params to truncate documents to length when using LLMs
- Added LlamaCPP as a representation model
- LangChain: Support for LCEL Runnables by @joshuasundance-swca in #1586
- Added `topics` parameter to `.topics_over_time` to select a subset of documents and topics
- Documentation:
  - Best Practices Guide
  - Llama 2 Tutorial
  - Zephyr Tutorial
  - Improved embeddings guidance (MTEB)
- Improved logging throughout the package
- Added support for Cohere's Embed v3:

```python
import cohere
from bertopic.backend import CohereBackend

# Assumed setup: create a Cohere client with your own API key
client = cohere.Client("MY_API_KEY")

cohere_model = CohereBackend(
    client,
    embedding_model="embed-english-v3.0",
    embed_kwargs={"input_type": "clustering"},
)
```
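And for the seed-words highlight above, a minimal sketch; the seed words and multiplier are illustrative:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Boost domain-specific words in the topic representations
# without influencing the clustering process
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior"],
    seed_multiplier=2,
)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
```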
Fixes:
- Fixed n-gram Keywords need delimiting in OpenAI() #1546
- Fixed OpenAI v1.0 issues #1629
- Improved documentation/logging to address #1589, #1591
- Fixed engine support for Azure OpenAI embeddings #1577
- Fixed OpenAI Representation: KeyError: 'content' #1570
- Fixed Loading topic model with multiple topic aspects changes their format #1487
- Fix expired link in algorithm.md by @burugaria7 in #1396
- Fix guided topic modeling in cuML's UMAP by @stevetracvc in #1326
- OpenAI: Allow retrying on Service Unavailable errors by @agamble in #1407
- Fixed parameter naming for HDBSCAN in best practices by @rnckp in #1408
- Fixed typo in tips_and_tricks.md by @aronnoordhoek in #1446
- Fix typos in documentation by @bobchien in #1481
- Fix IndexError when all outliers are removed by reduce_outliers by @Aratako in #1466
- Fix TypeError on reduce_outliers "probabilities" by @ananaphasia in #1501
- Add new line to fix markdown bullet point formatting by @saeedesmaili in #1519
- Update typo in topicrepresentation.md by @oliviercaron in #1537
- Fix typo in FAQ by @sandijou in #1542
- Fixed typos in best practices documentation by @poomkusa in #1557
- Correct TopicMapper doc example by @chrisji in #1637
- Fix typing in hierarchical_topics by @dschwalm in #1364
- Fixed typing issue with treshold parameter in reduce_outliers by @dschwalm in #1380
- Fix several typos by @mertyyanik in #1307
- Fix inconsistent naming by @rolanderdei in #1073
Merge Pre-trained BERTopic Models
The new `.merge_models` feature allows any number of fitted BERTopic models to be merged. Doing so opens up a number of use cases:
- Incremental topic modeling -- Continuously merge models together to detect whether new topics have appeared
- Federated Learning -- Train BERTopic models on different clients and combine them on a central server
- Minimal compute -- We can essentially batch the training process into multiple instances to reduce compute
- Different datasets -- When you have different datasets that you want to train separately on, for example with different languages, you can train each model separately and join them after training
To demonstrate merging different topic models with BERTopic, we use the ArXiv paper abstracts to see which topics they generally contain.
First, we train three separate models on different parts of the data:
```python
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)
```
Then, we can combine all three models into one with `.merge_models`:
```python
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
```
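The merged model behaves like any fitted BERTopic model, so you can inspect the combined topics right away:

```python
# Topics found in the later models that did not exist in the first
# are added to the merged model
merged_model.get_topic_info()
```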
Zero-shot Topic Modeling
Zero-shot Topic Modeling is a technique that allows you to find predefined topics in large amounts of documents. This method not only finds those specific topics but also creates new topics for documents that do not fit the predefined ones. This allows for extensive flexibility as there are three scenarios to explore:
- No zero-shot topics were detected. This means that none of the documents fit the predefined topics and a regular BERTopic would be run.
- Only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
- Both zero-shot topics and clustered topics were detected. This means that some documents fit the predefined topics while others do not. For the latter, new topics are found.
In order to use zero-shot BERTopic, we create a list of topics that we want to assign to our documents. However, there may be several other topics that we know should be in the documents. The dataset that we use is a small subset of ArXiv papers. We know the data and believe there to be at least the following topics: clustering, topic modeling, and large language models. However, we are not sure whether other topics exist and want to explore those.
Using this feature is straightforward:
```python
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)
```
v0.15
Highlights:
- Multimodal Topic Modeling
  - Train your topic model on text, images, or images and text!
  - Use `bertopic.backend.MultiModalBackend` to embed images, text, both, or even caption images!
- Multi-Aspect Topic Modeling
  - Create multiple topic representations simultaneously
- Improved Serialization options
  - Push your model to the HuggingFace Hub with `.push_to_hf_hub`
  - Safer, smaller, and more flexible serialization options with `safetensors`
  - Thanks to a great collaboration with HuggingFace and the authors of BERTransfer!
- Added new embedding models
  - OpenAI: `bertopic.backend.OpenAIBackend`
  - Cohere: `bertopic.backend.CohereBackend`
- Added example of summarizing topics with OpenAI's GPT-models
- Added `nr_docs` and `diversity` parameters to OpenAI and Cohere representation models
- Use `custom_labels="Aspect1"` to use the aspect labels for visualizations instead
- Added cuML support for probability calculation in `.transform`
- Updated topic embeddings
  - Centroids by default and c-TF-IDF weighted embeddings for `partial_fit` and `.update_topics`
- Added `exponential_backoff` parameter to the `OpenAI` model
Fixes:
- Fixed custom prompt not working in `TextGeneration`
- Fixed #1142
- Add additional logic to handle cupy arrays by @metasyn in #1179
- Fix hierarchy viz and handle any form of distance matrix by @elashrry in #1173
- Updated languages list by @sam9111 in #1099
- Added level_scale argument to visualize_hierarchical_documents by @zilch42 in #1106
- Fix inconsistent naming by @rolanderdei in #1073
Multimodal Topic Modeling
With v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that each document is expected to have an image and vice versa. Instagram pictures, for example, almost always have some description attached to them.
In this example, we are going to use images from flickr that each have a caption associated with them:
```python
# NOTE: This requires the `datasets` package which you can
# install with `pip install datasets`
from datasets import load_dataset

ds = load_dataset("maderix/flickr_bw_rgb")
images = ds["train"]["image"]
docs = ds["train"]["caption"]
```
The `docs` variable contains the captions for each image in `images`. We can now use these variables to run our multimodal example:
```python
from bertopic import BERTopic
from bertopic.representation import VisualRepresentation

# Additional ways of representing a topic
visual_model = VisualRepresentation()

# Make sure to add the `visual_model` to a dictionary
representation_model = {
    "Visual_Aspect": visual_model,
}
topic_model = BERTopic(representation_model=representation_model, verbose=True)

# Train on both the captions and the accompanying images
topics, probs = topic_model.fit_transform(docs, images=images)
```
We can now access our image representations for each topic with `topic_model.topic_aspects_["Visual_Aspect"]`.
If you want an overview of the topic images together with their textual representations in jupyter, you can run the following:
```python
import base64
from io import BytesIO
from IPython.display import HTML

def image_base64(im):
    if isinstance(im, str):
        # `get_thumbnail` is assumed to be a helper that loads
        # a PIL image from a file path
        im = get_thumbnail(im)
    with BytesIO() as buffer:
        im.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()

def image_formatter(im):
    return f'<img src="data:image/jpeg;base64,{image_base64(im)}">'

# Extract dataframe
df = topic_model.get_topic_info().drop(columns=["Representative_Docs", "Name"])

# Visualize the images
HTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))
```
Multi-aspect Topic Modeling
In this new release, we introduce multi-aspect topic modeling! During the `.fit` or `.fit_transform` stages, you can now get multiple representations of a single topic. In practice, it works by generating and storing all kinds of different topic representations (see image below).
The approach is rather straightforward. We might want to represent our topics using a `PartOfSpeech` representation model but we might also want to try out `KeyBERTInspired` and compare those representation models. We can do this as follows:
```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.representation import PartOfSpeech
from bertopic.representation import MaximalMarginalRelevance
from sklearn.datasets import fetch_20newsgroups

# Documents to train on
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# The main representation of a topic
main_representation = KeyBERTInspired()

# Additional ways of representing a topic
aspect_model1 = PartOfSpeech("en_core_web_sm")
aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]

# Add all models together to be run in a single `fit`
representation_model = {
    "Main": main_representation,
    "Aspect1": aspect_model1,
    "Aspect2": aspect_model2
}
topic_model = BERTopic(representation_model=representation_model).fit(docs)
```
As shown above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
The main pipeline, which is used in most visualization options, is defined with the "Main" key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as "Aspect1" and "Aspect2".
After we have fitted our model, we can access all representations with `topic_model.get_topic_info()`:
As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
Serialization
Saving, loading, and sharing a BERTopic model can be done in several ways. With this new release, it is now advised to go with `safetensors` as that allows for a small, safe, and fast method for saving your BERTopic model. However, other formats, such as `pickle` and pytorch `.bin`, are also possible.
The methods are used as follows:
```python
topic_model = BERTopic().fit(my_docs)

# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 2 - pytorch
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 3 - pickle
topic_model.save("my_model", serialization="pickle")
```
Saving the topic model with `safetensors` or pytorch has a number of advantages:
- `safetensors` is a relatively safe format
- The resulting model can be very small (often < 20MB) since no sub-models need to be saved
- Although version control is important, there is a bit more flexibility with respect to specific versions of packages
- More easily used in production
- Share models with the HuggingFace Hub
The image above, showing a model trained on 100,000 documents, demonstrates the difference in size between safetensors, pytorch, and pickle. The difference in size can mostly be explained by the efficient saving procedure and by the fact that the clustering and dimensionality reduction models are not saved in safetensors/pytorch, since inference can be done based on the topic embeddings.
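Loading a saved model mirrors saving; a minimal sketch with an illustrative path:

```python
from bertopic import BERTopic

# Load the model back in; with safetensors/pytorch directories, the
# embedding model is restored from the saved pointer or can be passed
# explicitly via the `embedding_model` argument
loaded_model = BERTopic.load("path/to/my/model_dir")
```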
HuggingFace Hub
When you have created a BERTopic model, you can easily share it with others through the HuggingFace Hub. First, you need to log in to your HuggingFace account:
```python
from huggingface_hub import login

login()
```
When you have logged in to your HuggingFace account, you can save and upload the model as follows:
```python
from bertopic import BERTopic

# Train model
topic_model = BERTopic().fit(my_docs)

# Push to HuggingFace Hub; the repository name below is illustrative
topic_model.push_to_hf_hub(
    repo_id="username/my_topic_model"
)
```
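Once pushed, the model can be loaded straight from the Hub; the repository name below is illustrative:

```python
from bertopic import BERTopic

# Load a pretrained topic model directly from the HuggingFace Hub
loaded_model = BERTopic.load("username/my_topic_model")
```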

