Releases: MaartenGr/BERTopic
v0.17.4
Highlights:
- Add `.delete_topics` by @shuanglovesdata in #2322 (see the sketch after this list)
- Allow execution without plotly by @luismavs in #2401
- Add tqdm to `_litellm.py` by @NFrnk in #2408
- Drop support for python 3.9 by @afuetterer in #2419
- Make UMAP's init default to random on visualize_topics for reproducible visualization by @makramab in #2412
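A hedged sketch of the new `.delete_topics` method; the exact signature is an assumption here (a list of topic IDs), and `docs` stands in for your own documents:

```python
from bertopic import BERTopic

topic_model = BERTopic().fit(docs)

# Assumed usage: remove topics from the fitted model by their IDs
topic_model.delete_topics([3, 7])
```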
cuML:
Preparing for MEGA!-scale BERTopic with Multi-GPU UMAP and the following necessary updates (a cuML usage sketch follows this list):
- Update installation instructions for cuML with BERTopic by @csadorf in #2446
- Speed up `._create_topic_vectors` by replacing DataFrame `.loc` with NumPy masking by @jinsolp in #2406
- Modify `_reduce_dimensionality` to use `fit_transform` by @betatim in #2416
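For context, a minimal sketch of plugging cuML's GPU-accelerated UMAP and HDBSCAN into BERTopic, assuming a working RAPIDS installation:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated dimensionality reduction and clustering
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```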
Fixes:
- Fix incorrect label in zero-shot svg in documentation by @huisman in #2448
- Enable ruff rule RUF by @afuetterer in #2457
- CI: bump github actions versions by @afuetterer in #2427
- CI: prefer action-pre-commit-uv for lint job by @afuetterer in #2434
- CI: switch to uv based project installation by @afuetterer in #2445
- Chore: update pre-commit hooks by @afuetterer in #2414 and #2443
- Chore: remove obsolete `version_info` check by @afuetterer in #2444
v0.17.3
v0.17.1
Highlights:
- Added FastEmbed backend by @nickprock in #2213
- Added LangChain backend by @regaltsui in #2303
- Pass precomputed embeddings to `KeyBERTInspired.extract_topics` by @saikumaru in #2368 (see the sketch below)
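For the precomputed-embeddings change, a minimal sketch; it assumes `sentence-transformers` is installed and that `docs` is your own list of documents:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=KeyBERTInspired(),
)

# The embeddings passed here are now reused by KeyBERTInspired
# instead of being recomputed during representation fine-tuning
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```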
Fixes:
- Merge models without pytorch (using safetensors) by @MaartenGr in #2329
- Fix installation issue with uv by @MaartenGr in #2328
- Fix incorrect comparison in `update_topics` by @uply23333 in #2336
- Add missing comma under Exploration subsection by @angelonazzaro in #2374
- Fix typo in Lightweight installation under tips_and_tricks by @angelonazzaro in #2375
- Fix IndexError in zero-shot topic modeling by @MaartenGr in #2267
v0.17.0
Highlights:
- Light-weight installation without UMAP and HDBSCAN by @MaartenGr in #2289
- Add Model2Vec as an embedding backend by @MaartenGr in #2245
- Add LiteLLM as a representation model by @MaartenGr in #2213
- Interactive DataMapPlot by @MaartenGr in #2287
Fixes:
- Lightweight installation: use safetensors without torch by @hedgeho in #2306
- Fix missing links by @MaartenGr in #2305
- Set up pre-commit hooks by @afuetterer in #2283
- Fix handling OpenAI returning None objects by @jeaninejuliettes in #2280
- Add support for python 3.13 by @afuetterer in #2173
- Added system prompts by @Leo-LiHao in #2145
- More documentation for topic reduction by @MaartenGr in #2260
- Drop support for python 3.8 by @afuetterer in #2243
- Fixed online topic modeling on GPU by @SSivakumar12 in #2181
- Fixed hierarchical cluster visualization by @PipaFlores in #2191
- Remove duplicated phrase by @AndreaFrancis in #2197
Model2Vec
With Model2Vec, we now have a very interesting pipeline for light-weight embeddings. Combined with the light-weight installation, you can now run BERTopic without using pytorch!
Installation is straightforward:
```bash
pip install --no-deps bertopic
pip install --upgrade numpy pandas scikit-learn tqdm plotly pyyaml
```
This installs BERTopic without UMAP and HDBSCAN, so you can use other techniques instead. If they are not installed, BERTopic falls back to PCA and scikit-learn's HDBSCAN (a sketch of passing these explicitly follows the commands below). You can still install them, together with Model2Vec:
```bash
pip install model2vec umap-learn hdbscan
```
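If you prefer to be explicit about that fallback, a minimal sketch of passing the scikit-learn components yourself (assuming scikit-learn >= 1.3, which ships its own HDBSCAN):

```python
from bertopic import BERTopic
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3
from sklearn.decomposition import PCA

# Replace umap-learn and hdbscan with scikit-learn equivalents
topic_model = BERTopic(
    umap_model=PCA(n_components=5),
    hdbscan_model=HDBSCAN(min_cluster_size=15),
)
```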
Then, creating a BERTopic model is as straightforward as you are used to:
```python
from bertopic import BERTopic
from model2vec import StaticModel

# Model2Vec
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# BERTopic
topic_model = BERTopic(embedding_model=embedding_model)
```
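Fitting then works as usual; in this short usage note, `docs` is assumed to be your own list of documents:

```python
topics, probs = topic_model.fit_transform(docs)
```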
DataMapPlot
To use the interactive version of DataMapPlot, you only need to run the following:
```python
from umap import UMAP

# Reduce your document embeddings (`embeddings`) to 2 dimensions
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

# Create an interactive DataMapPlot figure
topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings, interactive=True)
```
v0.16.4
Fixes
- Fix ValueError in Guided Topic Modeling by @RTChou in #2115
- Fix saving BERTopic when c-TF-IDF is None by @sete39 in #2112
- Fix `KeyError: 'topics_from'` in #2101
- Fix issues related to Zero-shot Topic Modeling by @ianrandman in #2105
- Fix regex matching being used in PartOfSpeech representation model by @woranov in #2138
- Update typo by @saikumaru in #2162
v0.16.3
Highlights
- Simplify zero-shot topic modeling by @ianrandman in #2060
- Option to choose between c-TF-IDF and Topic Embeddings in many functions by @azikoss in #1894
  - Use the `use_ctfidf` parameter in the following functions to choose between c-TF-IDF and topic embeddings: `hierarchical_topics`, `reduce_topics`, `visualize_hierarchy`, `visualize_heatmap`, `visualize_topics` (see the sketch after this list)
- Linting with Ruff by @afuetterer in #2033
- Switch from setup.py to pyproject.toml by @afuetterer in #1978
- In multi-aspect context, allow Main model to be chained by @ddicato in #2002
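For the `use_ctfidf` option, a hedged sketch assuming a fitted `topic_model` and the training `docs`:

```python
# Build the topic hierarchy from topic embeddings instead of c-TF-IDF
hierarchical_topics = topic_model.hierarchical_topics(docs, use_ctfidf=False)

# The same parameter is available in the visualization functions listed above
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics, use_ctfidf=False)
```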
Fixes
- Added templates for issues and pull requests
- Update River documentation example by @Proteusiq in #2004
- Fix PartOfSpeech reproducibility by @Greenpp in #1996
- Fix PartOfSpeech ignoring first word by @Greenpp in #2024
- Make sklearn embedding backend auto-select more cautious by @freddyheppell in #1984
- Fix typos by @afuetterer in #1974
- Fix hierarchical_topics(...) when the distances between three clusters are the same by @azikoss in #1929
- Fixes to chain strategy example in outlier_reduction.md by @reuning in #2065
- Remove obsolete flake8 config and update line length by @afuetterer in #2066
v0.16.2
Fixes:
- Fix issue with zeroshot topic modeling missing outlier #1957
- Bump github actions versions by @afuetterer in #1941
- Drop support for python 3.7 by @afuetterer in #1949
- Add testing python 3.10+ in Github actions by @afuetterer in #1968
- Speed up fitting CountVectorizer by @dannywhuang in #1938
- Fix `transform` when using cuML HDBSCAN by @beckernick in #1960
- Fix wrong link in algorithm documentation by @naeyn in #1970
v0.16.1
Highlights:
- Add Quantized LLM Tutorial
- Add optional datamapplot visualization using `topic_model.visualize_document_datamap` by @lmcinnes in #1750
- Migrated OpenAIBackend to openai>=1 by @peguerosdc in #1724
- Add automatic height scaling and font resize by @ir2718 in #1863
- Use `[KEYWORDS]` tags with the LangChain representation model by @mcantimmy in #1871
Fixes:
- Fixed issue with `.merge_models` seemingly skipping topic #1898
- Fixed Cohere `client.embed` TypeError #1904
- Fixed `AttributeError: 'TextGeneration' object has no attribute 'random_state'` #1870
- Fixed topic embeddings not properly updated if all outliers were removed #1838
- Fixed issue with representation models not properly merging #1762
- Fixed embeddings not ordered correctly when using `.merge_models` #1804
- Fixed Outlier topic not in the 0th position when using zero-shot topic modeling, causing prediction issues (amongst others) #1804
- Fixed Incorrect label in ZeroShot doc SVG #1732
- Fixed MultiModalBackend throws error with clip-ViT-B-32-multilingual-v1 #1670
- Fixed AuthenticationError while using OpenAI() #1678
- Update FAQ on Apple Silicon by @benz0li in #1901
- Add documentation DataMapPlot + FAQ for running on Apple Silicon by @dkapitan in #1854
- Remove commas from pip install reference in readme by @luisoala in #1850
- Spelling corrections by @joouha in #1801
- Replacing the deprecated `text-ada-001` model with the latest `text-embedding-3-small` from OpenAI by @atmb4u in #1800
- Prevent invalid empty input error when retrieving embeddings with openai backend by @liaoelton in #1827
- Remove spurious warning about missing embedding model by @sliedes in #1774
- Fix type hint in ClassTfidfTransformer constructor by @snape in #1803
- Fix typo and simplify wording in OnlineCountVectorizer docstring by @chrisji in #1802
- Fixed warning when saving a topic model without an embedding model by @zilch42 in #1740
- Fix bug in `TextGeneration` by @manveersadhal in #1726
- Fix an incorrect link to usecases.md by @nicholsonjf in #1731
- Prevent `model` argument being passed twice when using `generator_kwargs` in OpenAI by @ninavandiermen in #1733
- Several fixes to the docstrings by @arpadikuma in #1719
- Remove unused `cluster_df` variable in `hierarchical_topics` by @shadiakiki1986 in #1701
- Removed redundant quotation mark by @LawrenceFulton in #1695
- Fix typo in merge models docs by @zilch42 in #1660
v0.16
Highlights:
- Merge pre-trained BERTopic models with `.merge_models`
  - Combine models with different representations together!
  - Use this for incremental/online topic modeling to detect new incoming topics
  - First step towards federated learning with BERTopic
- Zero-shot Topic Modeling
  - Use a predefined list of topics to assign documents
  - If needed, allows for further exploration of undefined topics
- Seed (domain-specific) words with `ClassTfidfTransformer` (see the sketch after this list)
  - Make sure selected words are more likely to end up in the representation without influencing the clustering process
- Added params to truncate documents to length when using LLMs
- Added LlamaCPP as a representation model
- LangChain: Support for LCEL Runnables by @joshuasundance-swca in #1586
- Added `topics` parameter to `.topics_over_time` to select a subset of documents and topics
- Documentation:
  - Best Practices Guide
  - Llama 2 Tutorial
  - Zephyr Tutorial
  - Improved embeddings guidance (MTEB)
- Improved logging throughout the package
- Added support for Cohere's Embed v3:

```python
import cohere
from bertopic.backend import CohereBackend

# Assumed setup: create a Cohere client with your own API key
client = cohere.Client("MY_API_KEY")

cohere_model = CohereBackend(
    client,
    embedding_model="embed-english-v3.0",
    embed_kwargs={"input_type": "clustering"},
)
```
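And for the seed-words highlight above, a minimal sketch; the seed words and multiplier are illustrative:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Boost domain-specific words in the topic representations
# without influencing the clustering process
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior"],
    seed_multiplier=2,
)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
```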
Fixes:
- Fixed n-gram Keywords need delimiting in OpenAI() #1546
- Fixed OpenAI v1.0 issues #1629
- Improved documentation/logging to address #1589, #1591
- Fixed engine support for Azure OpenAI embeddings #1577
- Fixed OpenAI Representation: KeyError: 'content' #1570
- Fixed Loading topic model with multiple topic aspects changes their format #1487
- Fix expired link in algorithm.md by @burugaria7 in #1396
- Fix guided topic modeling in cuML's UMAP by @stevetracvc in #1326
- OpenAI: Allow retrying on Service Unavailable errors by @agamble in #1407
- Fixed parameter naming for HDBSCAN in best practices by @rnckp in #1408
- Fixed typo in tips_and_tricks.md by @aronnoordhoek in #1446
- Fix typos in documentation by @bobchien in #1481
- Fix IndexError when all outliers are removed by reduce_outliers by @Aratako in #1466
- Fix TypeError on reduce_outliers "probabilities" by @ananaphasia in #1501
- Add new line to fix markdown bullet point formatting by @saeedesmaili in #1519
- Update typo in topicrepresentation.md by @oliviercaron in #1537
- Fix typo in FAQ by @sandijou in #1542
- Fixed typos in best practices documentation by @poomkusa in #1557
- Correct TopicMapper doc example by @chrisji in #1637
- Fix typing in hierarchical_topics by @dschwalm in #1364
- Fixed typing issue with treshold parameter in reduce_outliers by @dschwalm in #1380
- Fix several typos by @mertyyanik in #1307
- Fix inconsistent naming by @rolanderdei in #1073
Merge Pre-trained BERTopic Models
The new `.merge_models` feature allows any number of fitted BERTopic models to be merged. Doing so opens up a number of use cases:
- Incremental topic modeling -- Continuously merge models together to detect whether new topics have appeared
- Federated Learning -- Train BERTopic models on different clients and combine them on a central server
- Minimal compute -- We can essentially batch the training process into multiple instances to reduce compute
- Different datasets -- When you have different datasets that you want to train separately on, for example with different languages, you can train each model separately and join them after training
To demonstrate merging different topic models with BERTopic, we use the ArXiv paper abstracts to see which topics they generally contain.
First, we train three separate models on different parts of the data:
```python
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)
```
Then, we can combine all three models into one with `.merge_models`:
```python
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
```
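The merged model behaves like any fitted BERTopic model, so you can inspect the combined topics right away:

```python
# Topics found in the later models that did not exist in the first
# are added to the merged model
merged_model.get_topic_info()
```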
Zero-shot Topic Modeling
Zero-shot Topic Modeling is a technique that allows you to find predefined topics in large amounts of documents. This method not only finds those specific topics but also creates new topics for documents that do not fit the predefined ones. This allows for extensive flexibility as there are three scenarios to explore:
- No zero-shot topics were detected. This means that none of the documents fit the predefined topics and a regular BERTopic would be run.
- Only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
- Both zero-shot topics and clustered topics were detected. This means that some documents fit the predefined topics while others do not. For the latter, new topics are found.
In order to use zero-shot BERTopic, we create a list of topics that we want to assign to our documents. However, there may be several other topics that we know should be in the documents. The dataset that we use is a small subset of ArXiv papers. We know the data and believe there to be at least the following topics: clustering, topic modeling, and large language models. However, we are not sure whether other topics exist and want to explore those.
Using this feature is straightforward:
```python
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)
```
v0.15
Highlights:
- Multimodal Topic Modeling
  - Train your topic model on text, images, or images and text!
  - Use `bertopic.backend.MultiModalBackend` to embed images, text, both, or even caption images!
- Multi-Aspect Topic Modeling
  - Create multiple topic representations simultaneously
- Improved Serialization options
  - Push your model to the HuggingFace Hub with `.push_to_hf_hub`
  - Safer, smaller, and more flexible serialization options with `safetensors`
  - Thanks to a great collaboration with HuggingFace and the authors of BERTransfer!
- Added new embedding models
  - OpenAI: `bertopic.backend.OpenAIBackend`
  - Cohere: `bertopic.backend.CohereBackend`
- Added example of summarizing topics with OpenAI's GPT-models
- Added `nr_docs` and `diversity` parameters to OpenAI and Cohere representation models
- Use `custom_labels="Aspect1"` to use the aspect labels for visualizations instead
- Added cuML support for probability calculation in `.transform`
- Updated topic embeddings
  - Centroids by default and c-TF-IDF weighted embeddings for `partial_fit` and `.update_topics`
- Added `exponential_backoff` parameter to the `OpenAI` model
Fixes:
- Fixed custom prompt not working in `TextGeneration`
- Fixed #1142
- Add additional logic to handle cupy arrays by @metasyn in #1179
- Fix hierarchy viz and handle any form of distance matrix by @elashrry in #1173
- Updated languages list by @sam9111 in #1099
- Added level_scale argument to visualize_hierarchical_documents by @zilch42 in #1106
- Fix inconsistent naming by @rolanderdei in #1073
Multimodal Topic Modeling
With v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that each document is expected to have an image and vice versa. Instagram pictures, for example, almost always have some description attached to them.
In this example, we are going to use images from flickr that each have a caption associated with them:
```python
# NOTE: This requires the `datasets` package which you can
# install with `pip install datasets`
from datasets import load_dataset

ds = load_dataset("maderix/flickr_bw_rgb")
images = ds["train"]["image"]
docs = ds["train"]["caption"]
```
The `docs` variable contains the captions for each image in `images`. We can now use these variables to run our multimodal example:
```python
from bertopic import BERTopic
from bertopic.representation import VisualRepresentation

# Additional ways of representing a topic
visual_model = VisualRepresentation()

# Make sure to add the `visual_model` to a dictionary
representation_model = {
    "Visual_Aspect": visual_model,
}
topic_model = BERTopic(representation_model=representation_model, verbose=True)

# Train on both the captions and the accompanying images
topics, probs = topic_model.fit_transform(docs, images=images)
```
We can now access our image representations for each topic with `topic_model.topic_aspects_["Visual_Aspect"]`.
If you want an overview of the topic images together with their textual representations in jupyter, you can run the following:
```python
import base64
from io import BytesIO
from IPython.display import HTML

def image_base64(im):
    if isinstance(im, str):
        # `get_thumbnail` is assumed to be a helper that loads
        # a PIL image from a file path
        im = get_thumbnail(im)
    with BytesIO() as buffer:
        im.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()

def image_formatter(im):
    return f'<img src="data:image/jpeg;base64,{image_base64(im)}">'

# Extract dataframe
df = topic_model.get_topic_info().drop(columns=["Representative_Docs", "Name"])

# Visualize the images
HTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))
```
Multi-aspect Topic Modeling
In this new release, we introduce multi-aspect topic modeling! During the `.fit` or `.fit_transform` stages, you can now get multiple representations of a single topic. In practice, it works by generating and storing all kinds of different topic representations (see image below).
The approach is rather straightforward. We might want to represent our topics using a `PartOfSpeech` representation model but we might also want to try out `KeyBERTInspired` and compare those representation models. We can do this as follows:
```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.representation import PartOfSpeech
from bertopic.representation import MaximalMarginalRelevance
from sklearn.datasets import fetch_20newsgroups

# Documents to train on
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# The main representation of a topic
main_representation = KeyBERTInspired()

# Additional ways of representing a topic
aspect_model1 = PartOfSpeech("en_core_web_sm")
aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]

# Add all models together to be run in a single `fit`
representation_model = {
    "Main": main_representation,
    "Aspect1": aspect_model1,
    "Aspect2": aspect_model2
}
topic_model = BERTopic(representation_model=representation_model).fit(docs)
```
As shown above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
The main pipeline, which is used in most visualization options, is defined with the "Main" key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as "Aspect1" and "Aspect2".
After we have fitted our model, we can access all representations with `topic_model.get_topic_info()`:
As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
Serialization
Saving, loading, and sharing a BERTopic model can be done in several ways. With this new release, it is now advised to go with `safetensors` as that allows for a small, safe, and fast method for saving your BERTopic model. However, other formats, such as `pickle` and pytorch `.bin`, are also possible.
The methods are used as follows:
```python
topic_model = BERTopic().fit(my_docs)

# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 2 - pytorch
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 3 - pickle
topic_model.save("my_model", serialization="pickle")
```
Saving the topic model with `safetensors` or pytorch has a number of advantages:
- `safetensors` is a relatively safe format
- The resulting model can be very small (often < 20MB) since no sub-models need to be saved
- Although version control is important, there is a bit more flexibility with respect to specific versions of packages
- More easily used in production
- Share models with the HuggingFace Hub
The image above, showing a model trained on 100,000 documents, demonstrates the difference in size between safetensors, pytorch, and pickle. The difference in size can mostly be explained by the efficient saving procedure and by the fact that the clustering and dimensionality reduction models are not saved in safetensors/pytorch, since inference can be done based on the topic embeddings.
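Loading a saved model mirrors saving; a minimal sketch with an illustrative path:

```python
from bertopic import BERTopic

# Load the model back in; with safetensors/pytorch directories, the
# embedding model is restored from the saved pointer or can be passed
# explicitly via the `embedding_model` argument
loaded_model = BERTopic.load("path/to/my/model_dir")
```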
HuggingFace Hub
When you have created a BERTopic model, you can easily share it with others through the HuggingFace Hub. First, you need to log in to your HuggingFace account:
```python
from huggingface_hub import login

login()
```
When you have logged in to your HuggingFace account, you can save and upload the model as follows:
```python
from bertopic import BERTopic

# Train model
topic_model = BERTopic().fit(my_docs)

# Push to HuggingFace Hub; the repository name below is illustrative
topic_model.push_to_hf_hub(
    repo_id="username/my_topic_model"
)
```
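Once pushed, the model can be loaded straight from the Hub; the repository name below is illustrative:

```python
from bertopic import BERTopic

# Load a pretrained topic model directly from the HuggingFace Hub
loaded_model = BERTopic.load("username/my_topic_model")
```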

