about

search the arXiv uses OpenAI's text-embedding-ada-002 model to embed the abstracts of over 500,000 machine learning papers from arXiv. The papers are regularly sourced from the arXiv metadata dataset on Kaggle and filtered to include only those belonging to the following categories:

cs.CV, cs.LG, cs.CL, cs.AI, cs.NE, cs.RO
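
As a rough illustration of the category filter, here is a minimal sketch in Python. It assumes the metadata comes as one JSON object per line with a space-separated "categories" field, and that a paper counts as "belonging" if any of its categories matches (rather than only its primary one). The function names are made up for this example and are not taken from the project's source code.

```python
import json

# The six categories listed above.
ML_CATEGORIES = {"cs.CV", "cs.LG", "cs.CL", "cs.AI", "cs.NE", "cs.RO"}


def is_ml_paper(record):
    """Return True if any of the record's categories is in ML_CATEGORIES."""
    # Assumption: "categories" is a space-separated string, e.g. "cs.LG stat.ML".
    categories = record.get("categories", "").split()
    return any(cat in ML_CATEGORIES for cat in categories)


def filter_papers(lines):
    """Yield parsed records (one JSON object per line) that pass the filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if is_ml_paper(record):
            yield record
```

Since filter_papers accepts any iterable of strings, it could be fed an open file handle directly, which avoids loading the whole metadata file into memory.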

When you perform a search, your query is embedded using the same model, after which the 100 papers with the highest cosine similarity to the query embedding are returned. The papers can then be sorted by similarity, citation count, or date.
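
The retrieval step described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the site's actual code: the query is taken to be already embedded, each paper is a dict with hypothetical "embedding", "citations", and "date" keys, and dates are ISO strings. A real deployment with 500,000 papers would more likely compute all similarities at once with a vectorized matrix product.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, papers, k=100):
    """Return the k papers most similar to the query, best first."""
    # Copy each paper and attach its similarity score (hypothetical field name).
    scored = [
        dict(p, similarity=cosine_similarity(query_embedding, p["embedding"]))
        for p in papers
    ]
    return heapq.nlargest(k, scored, key=lambda p: p["similarity"])


def sort_results(results, by="similarity"):
    """Re-sort results in descending order by "similarity", "citations", or "date"."""
    return sorted(results, key=lambda p: p[by], reverse=True)
```

With this shape, re-sorting never touches the embeddings again: top_k runs once per query, and sort_results only reorders the 100 returned papers.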

Citation counts are sourced from the Semantic Scholar API.

This is a personal project meant as an experiment more than anything else. You can find the source code on GitHub, the embeddings on Kaggle, and me on Twitter 🤗