
NudeKnow Now Crawls 6 Domains. Read more about Sisly and how it works.

Introduction

NudeKnow now successfully crawls 6 domains, up from two. We haven’t unleashed it on the internet on its own just yet, but eventually it will crawl the top 100 sites in SERPs for various search terms.

If you’re interested in how it works, keep reading.

The crawler behind NudeKnow: Sisly

NudeKnow uses a search engine called Sisly, which provides a command-line web crawler that targets specific domains. It lets you include and exclude URL patterns, which will eventually be generated automatically by a pre-crawl. It handles lazy-loaded images and tries to obtain the thumbnails and full-size photos on the site that belong to user-uploaded content.

Our Upcoming Sisly Management Tool: DARK

DARK stands for:

Data extraction, Asset retrieval, Reverse image search, and Copyright knowledge

DARK is our search management engine. It pre-crawls a website to generate the definition/script that Sisly needs to index the site in full. It looks for patterns and quirks in the website and determines how that website needs to be crawled. For example, if a CDN on a different domain is used, that domain is included in the Sisly definition so that it is indexed. Medium-sized thumbnails are also preferred, so the path to full-sized images would be excluded.
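
To make the idea of a crawl definition concrete, here is a minimal sketch of include/exclude filtering. The definition layout, field names, domains, and paths are all hypothetical and chosen for illustration; they are not Sisly's actual format.

```python
import re
from urllib.parse import urlparse

# Hypothetical crawl definition. The field names and values are assumptions
# for illustration only, not Sisly's real definition format.
DEFINITION = {
    # The site itself plus a CDN on a different domain found by the pre-crawl.
    "domains": {"example.com", "cdn.example-images.net"},
    # Paths worth crawling, e.g. galleries and medium-sized thumbnails.
    "include": [r"/gallery/", r"/thumbs/medium/"],
    # Paths to skip, e.g. full-sized images.
    "exclude": [r"/images/full/", r"/login"],
}


def should_crawl(url, definition=DEFINITION):
    """Return True if the URL is on an allowed domain and passes the patterns."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host not in definition["domains"]:
        return False
    path = parsed.path
    # Exclusions win over inclusions.
    if any(re.search(pattern, path) for pattern in definition["exclude"]):
        return False
    return any(re.search(pattern, path) for pattern in definition["include"])
```

With this sketch, a medium thumbnail on the CDN domain is accepted, while a full-sized image on the same CDN or any URL on an unlisted domain is rejected.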

DARK also provides Docker-based tools to start and stop a scan of a website and to monitor the logs, bandwidth, compute, and storage usage for each site. Keeping the pipeline optimized is essential as we move toward indexing video in the future. It also provides a way to view the data being indexed, so we can judge whether it is appropriate or whether the Sisly script/definitions need to be adjusted.

Downloading the Media

Once an image has been identified, it is queued in the Aria2 downloader, which downloads up to 50 files at a time and can hold a backlog in its queue if needed. The average download speed per website is about 100 KB/s or less, and we try to stay below 1 MB/s (a modest 8 Mbps) to reduce the burden on the website.
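
As a rough illustration of those limits, the sketch below builds an aria2c command line with 50 concurrent downloads and an overall 1 MiB/s cap. How NudeKnow actually drives Aria2 is not described here, so the function name, the URL list file, and the choice of command-line invocation are assumptions; the three aria2c options themselves are standard ones.

```python
def build_aria2_command(url_list_path, max_concurrent=50, max_rate="1M"):
    """Build an aria2c argument list for a throttled batch download.

    url_list_path is a hypothetical text file with one URL per line.
    """
    return [
        "aria2c",
        # Read the URLs to download from a file.
        f"--input-file={url_list_path}",
        # Download at most this many files at once; the rest wait in the queue.
        f"--max-concurrent-downloads={max_concurrent}",
        # Cap the combined download speed across all files.
        f"--max-overall-download-limit={max_rate}",
    ]


command = build_aria2_command("queued-images.txt")
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where aria2c is installed.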

Operational Costs For Site Owners

To understand how much bandwidth that adds up to per month at the 1 MB/s ceiling, we multiply the rate by the number of seconds in a 31-day month. There are 2,678,400 seconds in 31 days, which at 1 MiB/s equates to about 2.554 TiB (roughly 2,616 GiB) of bandwidth a month. At $0.01 / GiB, that amounts to about a $26 increase in bandwidth costs for the site owner in the first month. Typically site owners have dedicated pipes and do not pay by the amount transferred, so the cost won’t be seen. Websites are also not often recrawled once the content is obtained, since only new content is needed, so later usage depends on how much content is uploaded per month and whether Sisly finds it.
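
The arithmetic can be checked with a few lines of Python. This is a sketch that assumes the 1 MB/s ceiling is read as 1 MiB/s and that the crawler runs at that ceiling for the whole month; the function and parameter names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def monthly_bandwidth_cost(rate_mib_per_s=1.0, days=31, price_per_gib=0.01):
    """Return (seconds, GiB transferred, TiB transferred, cost in dollars)."""
    seconds = days * SECONDS_PER_DAY
    total_gib = rate_mib_per_s * seconds / 1024  # MiB -> GiB
    total_tib = total_gib / 1024                 # GiB -> TiB
    cost = total_gib * price_per_gib
    return seconds, total_gib, total_tib, cost


seconds, gib, tib, cost = monthly_bandwidth_cost()
print(f"{seconds:,} s, {gib:,.1f} GiB, {tib:.3f} TiB, ${cost:.2f}")
# prints "2,678,400 s, 2,615.6 GiB, 2.554 TiB, $26.16"
```

At the more typical 100 KB/s average, the same calculation gives roughly a tenth of that figure.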

When Sites Become Stale and Our On-Premise Options

A page is considered stale once a month has passed since it was last crawled. NudeKnow is dedicated to providing an excellent experience both to our customers and to the sites we crawl. We offer our indexing services on-premise, in the form of DMCA APIs, for $2500 for every 80 MB/s of data uploaded. If you would like to provide APIs for your website, get in touch to join our initiative. These servers process data at up to 80 MB/s, nearly saturating a 1 Gbps line, while performing perceptual hashing, regular hashing, and face “hashing” on the data. Servers for 10 Gbps lines, capable of 800 MB/s and higher, are also available. The servers come with GPUs for face recognition, which can also be used for age detection to automatically flag images of minors for review. Face Search is not included in that price, only indexing.
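
To show the difference between regular and perceptual hashing, here is a minimal sketch. The specific algorithms NudeKnow uses are not stated, so SHA-256 and a simple average hash are stand-ins chosen for illustration, and the input is assumed to be a small grayscale pixel grid that has already been downscaled.

```python
import hashlib


def exact_hash(data: bytes) -> str:
    """Regular hash: changes completely if a single byte changes."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels):
    """Simple perceptual hash over a 2D grid of grayscale values.

    Each pixel contributes one bit: 1 if brighter than the mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits; small distances suggest similar images."""
    return bin(a ^ b).count("1")


original = [[10, 200], [220, 30]]
brighter = [[15, 205], [225, 35]]   # same picture, slightly brighter
different = [[200, 10], [30, 220]]  # light and dark areas swapped
```

In this toy example the brighter copy has the same average hash as the original (distance 0), while the swapped image differs in all four bits, which is the property that lets near-duplicates be matched even when the raw bytes differ.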

Typically you are limited by the read speed of the drives on your CDN. Redundancy and distributed file storage are required to achieve maximum throughput. We can help you scale. Just contact us at Skitzen Technology Services [skitzen.com]

Our DMCA API, AI, and Robots.txt

Our initiative to provide image search capabilities for DMCA requests is part of a broader plan to expand robots.txt to include links to API definitions as well as to our DMCA APIs. We would also like to incorporate crypto transactions, computed from the content consumed and the billing set by the content owner, so that we can pay you for the bandwidth we consume. We believe that as AI expands, its crawling of the internet and the ethics of what it can use need to be determined by the website owner. There is no way to turn it off, but you could get paid for the content that is incorporated into an AI model or, in our case, collected to be searched.

What is Robots.txt and how would it be used?

robots.txt was originally intended to tell search engines what to crawl, but when the crawler serves an AI system, additional content is crawled as well, such as images or even video. We believe robots.txt is the perfect way to modernize website crawling for existing and future AI systems as we progress toward artificial general intelligence, or AGI, which will need realtime data to be the “smartest.”
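
As a sketch of how such an extension might be consumed, the code below reads extra fields from a robots.txt file. The field names `API-Definition` and `DMCA-API` and the example URLs are hypothetical; they are not part of any current robots.txt standard and are not a finalized NudeKnow proposal.

```python
# Hypothetical robots.txt with two invented extension fields.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
# Hypothetical extension fields, not part of any current standard:
API-Definition: https://example.com/openapi.json
DMCA-API: https://example.com/dmca/v1
"""


def parse_extension_fields(robots_txt, fields=("api-definition", "dmca-api")):
    """Collect the values of the given (case-insensitive) field names."""
    found = {}
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        name, value = line.split(":", 1)      # split on the first colon only
        name = name.strip().lower()
        if name in fields:
            found.setdefault(name, []).append(value.strip())
    return found


extensions = parse_extension_fields(SAMPLE_ROBOTS_TXT)
```

A crawler that understands the extension could then look up `extensions["dmca-api"]` to find the site's DMCA endpoint before deciding how to crawl.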

Conclusion

We hope to provide the services that NudeKnow offers to our users in an ethical and efficient way. While we store copies for indexing, we do not keep long-term copies of data, and our copies are not made public to end users. We hope that you have enjoyed this look inside how NudeKnow works. We hope to open source Sisly again soon to provide complete transparency into our web crawlers and to help you build your own, so stay tuned. This does not include our pre-crawler, which generates definitions for Sisly scans and manages ongoing scans. Only the command-line version will be open sourced, which allows a single site to be crawled at a time.
