about

search the arXiv uses OpenAI's text-embedding-ada-002 model to embed the abstracts of over 250,000 machine learning papers from arXiv. The papers are regularly sourced from the arXiv metadata dataset on Kaggle and filtered to include only those belonging to the following categories:

cs.CV, cs.LG, cs.CL, cs.AI, cs.NE, cs.RO
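
The category filter described above could look roughly like the sketch below. It is not the project's actual code. It assumes the metadata comes as one JSON object per line with a `categories` field holding a space-separated string (for example "cs.LG stat.ML"), and it keeps a paper if any of its categories is in the list, which is one possible reading of "belonging to". The function names and sample records are made up for illustration.

```python
import json

# The categories listed in the text above.
ML_CATEGORIES = {"cs.CV", "cs.LG", "cs.CL", "cs.AI", "cs.NE", "cs.RO"}


def is_ml_paper(record):
    """Return True if any of the paper's categories is in ML_CATEGORIES.

    Assumption: `categories` is a space-separated string such as
    "cs.LG stat.ML". A missing field counts as no categories.
    """
    return bool(ML_CATEGORIES & set(record.get("categories", "").split()))


def filter_ml_papers(lines):
    """Yield parsed records from an iterable of JSON lines that pass the filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if is_ml_paper(record):
            yield record


# Hypothetical sample records, made up for illustration.
sample = [
    '{"id": "a", "categories": "cs.LG stat.ML"}',
    '{"id": "b", "categories": "hep-ph"}',
    '{"id": "c", "categories": "math.OC cs.RO"}',
]
kept_ids = [record["id"] for record in filter_ml_papers(sample)]
print(kept_ids)  # ['a', 'c']
```

With a real metadata file, the same generator could be fed an open file handle instead of the `sample` list, so the whole file never has to be held in memory.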

When you perform a search, your query is embedded with the same model, and the 10 papers whose abstract embeddings have the highest cosine similarity to the query embedding are returned.
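
The ranking step can be sketched as follows. This is a minimal illustration and not the project's implementation: it uses tiny made-up 3-dimensional vectors in place of real embeddings, leaves out the embedding call itself, and assumes no vector is all zeros. The names `cosine_similarity`, `top_k`, and the paper ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, paper_embeddings, k=10):
    """Return the k (paper_id, score) pairs most similar to the query.

    `paper_embeddings` maps a paper id to its embedding vector.
    Results are sorted by descending cosine similarity.
    """
    scored = [
        (paper_id, cosine_similarity(query_embedding, embedding))
        for paper_id, embedding in paper_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (made up for illustration).
papers = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [1.0, 1.0, 0.0],
    "p4": [-1.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, papers, k=2)
print([paper_id for paper_id, _ in results])  # ['p1', 'p3']
```

For a few hundred thousand papers, a pure-Python loop like this would be slow. A vectorized version that stacks the embeddings into one matrix would be the more practical choice. If the embeddings are already normalized to unit length, as OpenAI documents for its embedding models, cosine similarity reduces to a plain dot product.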

This is a personal project meant as an experiment more than anything else. You can find the source code on GitHub, the embeddings on Kaggle, and me on Twitter 🤗