Taja Kuzman
Publication Activity (10 Years)
Years Active: 2017-2024
Publications (10 Years): 14
Top Topics
Language Model
Top Venues
CoRR
LREC/COLING
VarDial@EACL
EAMT
Publications
Nikola Ljubesic, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord
Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining.
CoRR
(2024)
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubesic, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages.
LREC/COLING
(2024)
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubesic, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages.
CoRR
(2024)
Nikola Ljubesic, Taja Kuzman
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation.
CoRR
(2024)
Nikola Ljubesic, Taja Kuzman
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation.
LREC/COLING
(2024)
Peter Rupnik, Taja Kuzman, Nikola Ljubesic
BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian.
VarDial@EACL
(2023)
Taja Kuzman, Igor Mozetic, Nikola Ljubesic
ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification.
CoRR
(2023)
Taja Kuzman, Peter Rupnik, Nikola Ljubesic
Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora.
VarDial@EACL
(2023)
Taja Kuzman, Igor Mozetic, Nikola Ljubesic
Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models.
Mach. Learn. Knowl. Extr.
5 (3) (2023)
Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano Jiménez, Taja Kuzman, Nikola Ljubesic, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Jaume Zaragoza-Bernabeu
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages.
EAMT
(2023)
Taja Kuzman, Peter Rupnik, Nikola Ljubesic
The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild.
CoRR
(2022)
Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubesic, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages.
EAMT
(2022)
Taja Kuzman, Peter Rupnik, Nikola Ljubesic
The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild.
LREC
(2022)
Polona Gantar, Simon Krek, Taja Kuzman
Verbal Multiword Expressions in Slovene.
Europhras
(2017)