Web Content Extracting Projects


rg3 / youtube-dl

Command-line program to download videos from YouTube.com and other video sites

Python     28372   today


soimort / you-get

⏬ Dumb downloader that scrapes the web

Python     14238   3 days ago


codelucas / newspaper

News, full-text, and article metadata extraction in Python 3

Python     5003   4 days ago


grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python

Python     2653   3 months ago


scrapy / scrapely

A pure-python HTML screen-scraping library

Python     1274   3 months ago


buriy / python-readability

Fast Python port of arc90's readability tool, updated to match the latest readability.js!

Python     1242   6 months ago


miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Python     1059   29 days ago


cantino / ruby-readability

Port of arc90's readability project to Ruby

Ruby     810   years ago


jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL and returns its title, meta description, meta keywords, links, images...

Ruby     744   3 months ago


documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs

Ruby     740   5 months ago


essence / essence

A library for extracting web media.

PHP     599   3 months ago


bndr / node-read

Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.

JavaScript     589   4 months ago


jeckman / youtube-downloader

PHP script for downloading videos from YouTube; also parses YouTube feeds into RSS enclosures for podcatchers

PHP     481   23 days ago


fent / node-ytdl-core

YouTube downloader in JavaScript.

JavaScript     460   today


datalib / libextract

Extract data from websites using basic statistical magic

Python     435   years ago


alir3z4 / html2text

Convert HTML to Markdown-formatted text.

Python     393   3 days ago


gottfrois / link_thumbnailer

Ruby gem that generates thumbnail images from a given URL, much like the link previews on popular social websites.

Ruby     368   1 month ago


coleifer / micawber

A small library for extracting rich content from URLs

Python     359   27 days ago


michaelhelmick / lassie

Web Content Retrieval for Humans™

Python     356   9 days ago


wikiteam / wikiteam

Tools for downloading and preserving wikis

Python     176   2 months ago


mpratt / embera

An oEmbed consumer library that gives you information about URLs. It helps you replace URLs to YouTube or Vimeo, for example, with their HTML embed code.

PHP     148   5 months ago


mauricesvay / imageresolver

ImageResolver.js does its best to determine the main image on a URL without loading all images.

JavaScript     122   5 months ago


vinta / haul

An Extensible Image Crawler

Python     101   7 months ago