Web Content Extracting Projects

rg3 / youtube-dl

Command-line program to download videos from YouTube and other video sites

Python     25515   today

soimort / you-get

⏬ Dumb downloader that scrapes the web

Python     12504   today

codelucas / newspaper

News, full-text, and article metadata extraction in Python 3

Python     4533   21 days ago

grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python

Python     2446   1 month ago

scrapy / scrapely

A pure-python HTML screen-scraping library

Python     1217   29 days ago

buriy / python-readability

Fast Python port of arc90's readability tool, updated to match latest readability.js!

Python     1167   3 months ago

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Python     917   15 days ago

cantino / ruby-readability

Port of arc90's readability project to Ruby

Ruby     810   years ago

documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs

Ruby     732   2 months ago

jaimeiniesta / metainspector

Ruby gem for web scraping. It scrapes a given URL and returns its title, meta description, meta keywords, links, images...

Ruby     719   2 months ago

bndr / node-read

Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.

JavaScript     583   6 days ago

essence / essence

A library for extracting web media.

PHP     582   7 months ago

jeckman / youtube-downloader

PHP script for downloading videos from YouTube; also parses YouTube feeds into RSS enclosures for podcatchers

PHP     436   3 days ago

datalib / libextract

Extract data from websites using basic statistical magic

Python     416   years ago

fent / node-ytdl-core

YouTube downloader in JavaScript.

JavaScript     378   today

gottfrois / link_thumbnailer

Ruby gem that generates thumbnail images from a given URL, much like the link previews on popular social websites.

Ruby     358   1 month ago

alir3z4 / html2text

Convert HTML to Markdown-formatted text.

Python     358   2 days ago

michaelhelmick / lassie

Web Content Retrieval for Humans™

Python     345   3 months ago

coleifer / micawber

A small library for extracting rich content from URLs

Python     336   29 days ago

wikiteam / wikiteam

Tools for downloading and preserving wikis

Python     146   24 days ago

mpratt / embera

An oEmbed consumer library that gives you information about URLs. It helps you replace URLs (to YouTube or Vimeo, for example) with their HTML embed code.

PHP     142   24 days ago

mauricesvay / imageresolver

ImageResolver.js does its best to determine the main image on a URL without loading all images.

JavaScript     113   2 months ago

vinta / haul

An Extensible Image Crawler

Python     85   4 months ago