
The Internet is dying. Everything is shifting from open standards toward managed corporate walled gardens. There is no place for RSS in that future. This is my personal opinion.

I use only RSS; that is how I obtain new information. I do not personally know anyone else who does that. I poll my sources every hour, and I poll at least 400 of them.

None of my sources provide Last-Modified in their response headers: Reddit, YouTube, personal sites (checked in Firefox, F12, Network tab). My own site supports it. It is nice that you support it too, but I doubt it has any real-world impact. Most attention goes through TikTok and YouTube videos, through the Chrome browser. Either it is supported now, or it doesn't really matter whether 40 dudes make a request every minute or so.
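For feeds that do send Last-Modified, a polite poller can avoid re-downloading unchanged content with a conditional GET. A minimal stdlib sketch, not the author's actual crawler; the User-Agent string is a placeholder:

```python
import urllib.request
import urllib.error

def conditional_headers(last_modified=None):
    """Headers for a polite conditional GET (User-Agent is a placeholder)."""
    headers = {"User-Agent": "MyFeedReader/1.0 (+https://example.com/about)"}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_feed(url, last_modified=None):
    """Return (body, last_modified), or (None, last_modified) if the server says 304."""
    req = urllib.request.Request(url, headers=conditional_headers(last_modified))
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read(), resp.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified since our last poll: nothing to download
            return None, last_modified
        raise
```

Store the returned Last-Modified value per feed and echo it back on the next poll; servers that support it answer 304 with an empty body.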

We should also provide clean titles and descriptions in Open Graph protocol metadata, and yet not everybody does that.
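Reading that metadata back out is straightforward; a sketch with Python's stdlib HTML parser that collects the og:* meta tags from a page (the sample HTML in the usage note is made up):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from a page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.og[prop] = a["content"]

def extract_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

Feeding it a head containing `<meta property="og:title" content="My Post">` yields `{"og:title": "My Post"}`; non-OG meta tags are ignored.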

We should also return correct HTTP status codes, and yet not everybody does that.

I am disenchanted with the current state of the Internet. Or maybe it was always a bit of a pile of various things/garbage.



+1 for shared sentiments. Do you have any blog posts that explain your setup and the processes you've described?


I do not have any blog entry worth sharing. I am running an "ethical web scraper". I don't think I can speak about processes; I may just be lacking knowledge. I think it is more about "experience" than process.

The web crawling core is in this file: https://github.com/rumca-js/Django-link-archive/blob/main/rs...

Some more project-specific things are in https://github.com/rumca-js/Django-link-archive/blob/main/rs...

I know there are already spiders and metadata-processing packages for Python, but I like having control over the process.

Old man yelling at the cloud. I also hate:

- being blocked with 403 because my user agent is not "mainstream". Why do I have to use undetected Chrome to read some RSS feeds? Why can't I use third-party clients? The content can still carry adverts; I just want my own layout and buttons

- RSS feeds protected by Cloudflare, so tools cannot read the feeds easily

- sites not using, or outright blocking, the RSS functionality in WordPress. Some sites could be more open that way, but no: RSS feeds are closed or removed

- sites that have a "/blog" location while the main domain is empty, nearly empty, or returns 404. Can I trust such a location?

- HTML metadata not being available. I like YouTube: it lets me scrape metadata but protects the video content, and that is good

- weird redirects. The domain has no content and does not describe what it is; it just has JavaScript redirects from the main domain to some weird locations within the domain

- URL shorteners and vanity links. You do not know where you will be transported. I understand they are counting clicks, but they sacrifice my security

- Google returning links with the syntax "https://www.google.com/url" instead of direct links. YouTube does the same with "https://www.youtube.com/redirect". To me, again, this is a vulnerability
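One way to defuse those wrapper links is to unwrap them locally instead of following them. A stdlib sketch; the host/path/parameter table below is an assumption for illustration, not an exhaustive or authoritative list of how those services encode the target:

```python
from urllib.parse import urlparse, parse_qs

# Assumed wrapper hosts and the query parameter carrying the real target.
REDIRECT_PARAMS = {
    "www.google.com": "q",
    "www.youtube.com": "q",
}

def unwrap_redirect(url):
    """Return the destination of a known redirect wrapper, else the URL unchanged."""
    parts = urlparse(url)
    param = REDIRECT_PARAMS.get(parts.netloc)
    if param and parts.path in ("/url", "/redirect"):
        target = parse_qs(parts.query).get(param)
        if target:
            return target[0]
    return url
```

Anything not in the table passes through untouched, so the function is safe to run on every outbound link before storing or visiting it.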

The results of my ethical web scraper are published at: https://github.com/rumca-js/Internet-Places-Database.


Thanks for the info.



