Web Content Extracting Projects


rg3 / youtube-dl

Command-line program to download videos from YouTube.com and other video sites

Python     25515   today


soimort / you-get

⏬ Dumb downloader that scrapes the web

Python     12504   today


codelucas / newspaper

News, full-text, and article metadata extraction in Python 3

Python     4533   21 days ago


grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python

Python     2446   1 month ago


scrapy / scrapely

A pure-python HTML screen-scraping library

Python     1217   29 days ago


buriy / python-readability

Fast Python port of arc90's readability tool, updated to match the latest readability.js!

Python     1167   3 months ago


miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Python     917   15 days ago


cantino / ruby-readability

Port of arc90's readability project to Ruby

Ruby     810   years ago


documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs

Ruby     732   2 months ago


jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL and returns its title, meta description, meta keywords, links, images, and more.

Ruby     719   2 months ago


bndr / node-read

Get readable content from any page. Based on Arc90's readability project, using the cheerio engine.

JavaScript     583   6 days ago


essence / essence

A library for extracting web media.

PHP     582   7 months ago


jeckman / youtube-downloader

PHP script for downloading videos from YouTube; also parses YouTube feeds into RSS enclosures for podcatchers

PHP     436   3 days ago


datalib / libextract

Extract data from websites using basic statistical magic

Python     416   years ago


fent / node-ytdl-core

YouTube downloader in JavaScript.

JavaScript     378   today


gottfrois / link_thumbnailer

Ruby gem that generates thumbnail images from a given URL. Much like the link previews on popular social websites.

Ruby     358   1 month ago


alir3z4 / html2text

Convert HTML to Markdown-formatted text.

Python     358   2 days ago


michaelhelmick / lassie

Web Content Retrieval for Humans™

Python     345   3 months ago


coleifer / micawber

A small library for extracting rich content from URLs

Python     336   29 days ago


wikiteam / wikiteam

Tools for downloading and preserving wikis

Python     146   24 days ago


mpratt / embera

An oEmbed consumer library that gives you information about URLs. It helps you replace URLs, for example to YouTube or Vimeo, with their HTML embed code.

PHP     142   24 days ago


mauricesvay / imageresolver

ImageResolver.js does its best to determine the main image on a URL without loading all images.

JavaScript     113   2 months ago


vinta / haul

An Extensible Image Crawler

Python     85   4 months ago