My own Search Engine..?
I have been thinking about what I want a search engine to do. It should do basically this:
Search Wikipedia. Apparently you can download the entirety of (English) Wikipedia in just 19GB (compressed). This is excluding any page history/talk pages, and the dump doesn’t contain media. But still, 19GB is not that much. It expands to around 90GB, but I think that’s manageable if I run it on a local server.
Crawl and index Stack Exchange sites. Think Stack Overflow, Server Fault, Super User, Unix & Linux, Mathematics, and Ask Ubuntu.
Optionally record my browsing history via a browser extension. This should be optional and contain a blacklist containing domains that shouldn’t be recorded — like search pages. Results from my own browser history should be less prominent.
Crawl websites on the small web. If you’re wondering what the heck I’m talking about, this explanation from Marginalia Search describes it pretty well:
Remember when you used to explore the Internet, when you used to discover cool little websites made by people and it wasn’t just a bunch of low effort content mill listicles and blog spam?
In recent years, something has been simmering: Some call it the “Small Internet”. I hesitate to call it a movement, that would imply a level of organization and intent that it does not possess. It’s a disjointed group of like-minded people that recognize that the Internet has lost a certain je ne sais quoi, it has turned from a wild and creative space, into more of shopping mall. Where ever you go, you’re prodded to subscribe to newsletters, to like and comment, to buy stuff.
There’s a bunch of small indexes and search engines like this. I was planning to curate my own index, based on existing indexes. There should also be an option for people to submit sites.
Index some prominent Linux-oriented wikis, like the Arch Wiki and Gentoo Wiki. They seem to run the same wiki software as Wikipedia, so maybe they have a download option as well?
Searching man pages and RFCs seems cool?
Support for bangs, like DuckDuckGo. I’d want
!gfor Whoogle results and
This hypothetical search engine would be mainly for personal use, but I wouldn’t mind putting it on a public domain if I were to build it.