Simba Text Assistant

Our tools help you understand German texts by summarising them in simplified language. They are designed to make reading easier and to support you in learning German.

Collaborate with us

Our goal is to make online texts and information accessible to as many people as possible. The target groups, such as non-native speakers or adults with disabilities, are highly varied. We believe that collaboration is key to improving our tools. We therefore invite researchers, professionals, and engaged users to work with us. Your expertise will help us refine our base model and create simplifications suited to different groups. Our AI model and code are open source.

Write us an email

What is Simba?

We have developed two AI-powered tools that help people understand German online texts. The first is a web app that lets you simplify your own texts. The second is a browser extension that automatically summarises texts on websites. Both use an AI-based language model to simplify German-language texts automatically.

Simplification means reducing complexity while retaining the core message, for example by replacing longer words with shorter synonyms, shortening sentences, or adding information that clarifies and explains the context. The model behind these simplifications was trained and evaluated on news articles, so our tools are particularly suited to this type of web content.

Our browser extension scans the first 2700 words of a webpage to produce a simplified summary. To simplify and summarise a longer German text, use our web app (the text simplifier), which has no word limit.

Please note that we cannot guarantee that the model always provides accurate information. Simba is based on a text generation model which, like other generative models, can occasionally produce "hallucinations." Please compare the output with the input text to verify its accuracy. You can also give feedback on the summaries created by our browser extension, which helps us improve the AI model.

What is the goal of Simba?

Our AI-assisted text simplification tools were developed by members of the “Public Interest AI” research group at the Alexander von Humboldt Institute for Internet and Society. The overarching aim of the research group is to determine what characteristics AI in the public interest should have (you can read more about our thoughts at publicinterest.ai). We aim to implement these characteristics in practical prototypes. Simba is one such prototype. Specifically, this means that the code and models behind Simba are open source. This not only facilitates collaboration with others but also provides meaningful transparency about the system. Simba's functionality is also a step towards a larger goal that we see as serving the public interest: making online texts (and thus the internet) more accessible to everyone.

How does a summarisation model work in general?

There are various methods for creating a summary automatically. Simba is based on a “text generation model,” a type of model also known as a Large Language Model or Foundation Model; ChatGPT and Llama are examples. These are very large neural networks trained on vast amounts of text data. They are trained to predict the next word in a sequence: given the text so far, the model assigns a likelihood to each possible next word, and it generates text by repeatedly choosing a likely continuation.
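To make the idea of next-word prediction more concrete, here is a minimal sketch in Python. It is not Simba's model or code: it is a toy that only counts which word follows which in a tiny, invented corpus, whereas real text generation models use large neural networks and much longer contexts. The function names and the example sentences are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each word, how often every other word follows it."""
    follower_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follower_counts[current_word][next_word] += 1
    return follower_counts


def next_word_probabilities(model, word):
    """Turn the follower counts for `word` into likelihoods that sum to 1."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {candidate: count / total for candidate, count in counts.items()}


def predict_next_word(model, word):
    """Return the most likely next word, or None if the word is unknown."""
    probabilities = next_word_probabilities(model, word)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Invented toy corpus (hypothetical example sentences, not Simba's training data).
toy_corpus = [
    "die katze trinkt",
    "die katze frisst",
    "die katze trinkt wasser",
    "der hund trinkt",
]

model = train_bigram_model(toy_corpus)
print(next_word_probabilities(model, "katze"))
print(predict_next_word(model, "katze"))
```

In this toy corpus, "trinkt" follows "katze" twice and "frisst" once, so "trinkt" receives the higher likelihood. A large language model applies the same principle, but estimates the likelihoods with a neural network rather than a lookup table.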

Which data did we use?

To fine-tune the Llama-3-8B-Instruct foundation model, we used German-language newspaper articles that had been simplified. The articles come from the Austria Presse Agentur and were simplified by professional translators to levels B1 and A2 of the Common European Framework of Reference for Languages (CEFR). You can find a sample of the dataset here.

What do we know about Simba's limitations?

As with all text generation models, and as the sample texts show, Simba's automatically generated summaries and simplifications may contain inaccurate information. Such errors are known as "hallucinations." We recommend comparing the input and output texts to check factual accuracy. The output may also repeat information. Our model was fine-tuned on Austrian newspaper articles, so it performs best on this type of text, and its outputs may contain linguistic features specific to Austrian German.

I have a specific question about the code, model or data…

You can find our code repository here and a sample of the dataset here. If these do not answer your question, feel free to open an issue in our code repository.