Show HN: API to turn entire websites into Markdown

While building Mendable, we found that feeding LLMs well-structured Markdown improved accuracy. We also found that producing it was surprisingly hard. There are some great tools online, but none reliably handled the entire process. We wanted an API that took a URL, crawled the pages under it, and gave us easy-to-use, up-to-date Markdown we could feed into our index.

So we released an open-source repo and an API that crawls entire websites and turns them into Markdown with just a few lines of code.

The API handles:

- Crawling without consistent sitemaps
- Infra to handle running many crawling jobs
- Proxying, hosting headless browsers at scale
- Conversion to clean Markdown
- Caching
- Handling images, videos (soon), and tables (soon)
- LLM extraction (soon)

It is open source, and we also offer an easy-to-use API that starts free. It has built-in loaders for both @llama_index and @langchain.

Excited to see people try it: https://ift.tt/n9BqJ75

April 16, 2024 at 11:26PM
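To make the "crawl, then convert" idea concrete, here is a minimal sketch in Python using only the standard library. It is not the API announced in this post and says nothing about how that service is implemented. The names `PageParser`, `crawl_to_markdown`, and the `fetch` callable are hypothetical, chosen for illustration. It assumes there is no sitemap, so it discovers pages by following same-host links breadth-first, and it handles only headings, paragraphs, and list items.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects outgoing links and a rough Markdown rendering of one page."""

    # Only these block-level tags are converted; everything else is ignored.
    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}

    def __init__(self):
        super().__init__()
        self.links = []    # raw href values found in <a> tags
        self.blocks = []   # finished Markdown blocks
        self._tag = None   # block-level tag currently being read
        self._text = []    # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # link text is kept, target is not
        elif tag in self.PREFIX:
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.blocks.append(self.PREFIX[tag] + text)
            self._tag = None


def crawl_to_markdown(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-host links; returns {url: markdown}.

    `fetch` is a hypothetical callable mapping a URL to HTML text, or None
    if the page is unavailable. Passing it in keeps the sketch network-free.
    """
    host = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = "\n\n".join(parser.blocks)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "website" standing in for real HTTP responses.
SITE = {
    "https://example.com/": '<h1>Home</h1><p>See the <a href="/docs">docs</a>.</p>',
    "https://example.com/docs": "<h2>Docs</h2><ul><li>Install</li><li>Run</li></ul>"
                                '<a href="https://other.example/x">external</a>',
}
pages = crawl_to_markdown("https://example.com/", SITE.get)
for url, markdown in pages.items():
    print(url, markdown, sep="\n", end="\n\n")
```

A real crawler would also need the pieces the list above mentions and this sketch omits, such as JavaScript rendering through a headless browser, proxying, caching, and job scheduling.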
Tags:
Hacker News