$textit{lucie}$: An Improved Python Package for Loading Datasets from the UCI Machine Learning Repository

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning



arXiv:2410.09119v1 Announce Type: new
Abstract: The University of California–Irvine (UCI) Machine Learning (ML) Repository (UCIMLR) is consistently cited as one of the most popular dataset repositories, hosting hundreds of high-impact datasets. However, a significant portion, including 28.4% of the top 250, cannot be imported via the $textit{ucimlrepo}$ package that is provided and recommended by the UCIMLR website. Instead, they are hosted as .zip files, containing nonstandard formats that are difficult to import without additional ad hoc processing. To address this issue, here we present $textit{lucie}$ — $underline{l}oad$ $underline{U}niversity$ $underline{C}alifornia$ $underline{I}rvine$ $underline{e}xamples$ — a utility that automatically determines the data format and imports many of these previously non-importable datasets, while preserving as much of a tabular data structure as possible. $textit{lucie}$ was designed using the top 100 most popular datasets and benchmarked on the next 130, where it resulted in a success rate of 95.4% vs. 73.1% for $textit{ucimlrepo}$. $textit{lucie}$ is available as a Python package on PyPI with 98% code coverage.



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.