Simba Text Assistant

The Simba Text Assistant is designed to improve your online reading experience and support you on your language learning journey. It is a plug-in that runs in your browser and provides summaries and simplifications of the text found on web pages. Our models and code are open source.

Collaborate with us

The goal of Simba is to reduce the complexity of online texts and thereby make them accessible to a wider range of target groups. These target groups, which include non-native speakers and adults with disabilities, are very heterogeneous, and we believe that only through collaboration can the tool produce simplifications that work for all of them. We aim to grow our community by inviting researchers and professionals from the simplification field, as well as dedicated users, to build on our base model and contribute their expertise.

Write us an email

What is the Simba Text Assistant?

The Simba Text Assistant is a browser plug-in that produces summaries of German-language text on web pages. It is also designed to simplify the summaries by shortening sentences and explaining words. We have integrated the Hurraki dictionary as well: you can choose to highlight words in the online text that appear in the dictionary and see their definitions in Easy Language. We trained and evaluated the model that provides these simplifications on news articles, which is why it works better for this type of web content. The plug-in also lets you submit feedback on the summary that Simba produces.
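To illustrate the dictionary highlighting described above, here is a minimal Python sketch of looking up each word of a text in a dictionary and recording where it occurs. It is not the plug-in's actual code: the function name `find_dictionary_words` and the entries in `EASY_DICTIONARY` are hypothetical placeholders, and the real Hurraki definitions are not reproduced here.

```python
import re

# Hypothetical stand-in entries; these are NOT real Hurraki definitions.
EASY_DICTIONARY = {
    "parlament": "placeholder definition for 'Parlament'",
    "gesetz": "placeholder definition for 'Gesetz'",
}

def find_dictionary_words(text, dictionary):
    """Return (word, start, end, definition) for each word of `text` found in `dictionary`."""
    matches = []
    for match in re.finditer(r"\w+", text):
        # Lookup is case-insensitive because dictionary keys are stored in lowercase.
        definition = dictionary.get(match.group().lower())
        if definition is not None:
            matches.append((match.group(), match.start(), match.end(), definition))
    return matches

text = "Das Parlament beschließt ein Gesetz."
for word, start, end, definition in find_dictionary_words(text, EASY_DICTIONARY):
    print(f"{word} [{start}:{end}] -> {definition}")
```

The character positions are what a highlighter would need in order to mark the word in the page text. A real implementation would also have to handle inflected forms, which this exact-match sketch ignores.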

Please note that we cannot guarantee that the model will always produce factual information. Simba is based on a text generation model and, like other generation models, it can in some cases ‘hallucinate’. Please compare the output to the input and also use the integrated Hurraki dictionary to verify any definitions.

What's the aim?

Simba has been created by members of the Public Interest AI research group at the HIIG. The overarching goal of the research group is to carve out what characteristics “public interest AI” should have (our thoughts on this can be found on publicinterest.ai). We also aim to implement these characteristics in practical prototypes; Simba is one of these. Concretely, this means that the code and models behind Simba are open source, which not only allows for collaboration but also provides meaningful transparency about the system. The functionality of Simba is also one step in the direction of a larger goal that we believe to be in the public interest: making online text (and by proxy, the internet) more accessible.

FAQ

How does a summarisation model work in general?

There are different ways of automatically creating a summary. Simba is based on a so-called “text generation” model. These text generation models are also referred to as Large Language Models or foundation models; ChatGPT and Llama are examples. They are very large neural networks that are trained on a large amount of text data to calculate which word is most likely to come next in a sequence.
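The idea of calculating which word is most likely to come next can be shown with a toy example. The Python sketch below simply counts which word follows which in a tiny made-up text. This is a deliberately simplified stand-in: real text generation models such as the one behind Simba use neural networks rather than counts, and the function names and the example corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_counts(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def most_likely_next(following, word):
    """Return the most frequent follower of `word` and its estimated probability."""
    counts = following.get(word.lower())
    if not counts:
        # The word never appeared with a follower in the training text.
        return None, 0.0
    best, freq = counts.most_common(1)[0]
    return best, freq / sum(counts.values())

# Made-up training text, for illustration only.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_counts(corpus)
print(most_likely_next(model, "the"))
```

In this corpus, "the" is followed by "cat" twice and by "mat" once, so the sketch picks "cat" with an estimated probability of two thirds. A large language model does the same kind of prediction, but it takes the whole preceding text into account instead of only the last word.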

What data did we use?

We fine-tuned the foundation model Mistral-7B-v0.1 on German-language newspaper articles that had been simplified. The articles come from the Austrian Press Agency and were simplified by professional translators to the levels B1 and A2 of the Common European Framework of Reference for Languages (CEFR). A sample of the dataset can be found here.

What do we know about the limitations of Simba?

As with all text generation models, and as may be seen in the example texts, the automatically generated summaries and simplifications may contain information that is not true, so-called “hallucinations”. We recommend comparing the input and output text to ensure the output is factual. The output may also contain repeated information. We fine-tuned our model on newspaper articles from Austria, which means that it works best with this text type and that its outputs may contain linguistic characteristics specific to Austrian German.
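Comparing input and output can be partly supported by a simple script. The Python sketch below flags sentences that occur more than once in the output and lists output words that never appear in the source. It is a rough aid under stated assumptions, not part of Simba: the function names and the example texts are made up, the sentence splitting is naive, and a word missing from the source is not necessarily wrong, since simplification legitimately replaces difficult words.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words_of(text):
    """Return the set of lowercased words in the text (umlauts and ß are kept)."""
    return set(re.findall(r"\w+", text.lower()))

def review_output(source, output):
    """Flag repeated sentences and words in the output that never occur in the source."""
    seen, repeated = set(), []
    for sentence in split_sentences(output):
        key = sentence.lower()
        if key in seen and sentence not in repeated:
            repeated.append(sentence)
        seen.add(key)
    unseen = sorted(words_of(output) - words_of(source))
    return {"repeated_sentences": repeated, "words_not_in_source": unseen}

# Made-up example texts, for illustration only.
source = "Die Stadt Wien baut eine neue Brücke. Die Arbeiten dauern zwei Jahre."
output = "Wien baut eine Brücke. Wien baut eine Brücke. Das dauert drei Jahre."
print(review_output(source, output))
```

Here the repeated sentence is reported once, and the words "das", "dauert" and "drei" are listed as not present in the source. Only "drei" points to a factual change (two years became three), which shows why the list is a starting point for a manual check rather than a verdict.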

I have a specific question on the code, model or data…

Our code repository can be found here and a sample of the dataset can be found here. If your question is not answered there, feel free to file an issue on our Simba repository.