ashvardanian / StringZilla Star 1.8k Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc. html parser json information-retrieval csv string simd dataset string-manipulation sorting-algorithms beautifulsoup pattern-recognition ndjson substring string-matching string-search string-parsing common-crawl laion Updated May 18, 2024 C++
commoncrawl / cc-pyspark Star 390 Process Common Crawl data with Python and Spark spark pyspark sparksql wet commoncrawl common-crawl warc-files wat-files Updated Apr 8, 2024 Python
commoncrawl / news-crawl Star 302 News crawling with StormCrawler - stores content as WARC crawler news web-crawler apache-storm warc commoncrawl common-crawl storm-crawler Updated Dec 13, 2023 Java
michaelharms / comcrawl Star 217 A python utility for downloading Common Crawl data python data deep-learning scraping commoncrawl common-crawl training-dataset Updated Jun 8, 2023 Python
oscar-project / ungoliant Star 152 The pipeline for the OSCAR corpus nlp crawler corpus-linguistics fasttext oscar commoncrawl common-crawl language-classification Updated Dec 18, 2023 Rust
crissyfield / troll-a Star 130 Drill into WARC web archives security internet-archive command-line-tool warc security-tools common-crawl Updated Jan 4, 2024 Go
commoncrawl / cc-crawl-statistics Star 118 Statistics of Common Crawl monthly archives mined from URL index files statistics commoncrawl common-crawl Updated Jun 5, 2024 Python
oscar-project / goclassy Star 85 An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline. nlp corpus-linguistics fasttext common-crawl language-classification Updated Apr 21, 2021 Go
commoncrawl / cc-webgraph Star 69 Tools to construct and process webgraphs from Common Crawl data pagerank webgraph commoncrawl common-crawl centrality-measures webgraph-framework Updated Jun 3, 2024 Java
commoncrawl / cc-notebooks Star 40 Various Jupyter notebooks about Common Crawl data jupyter-notebook aws-athena commoncrawl common-crawl webarchiving webgraph-framework Updated Jun 2, 2022 Jupyter Notebook
IBM / cc-dbp Star 28 A dataset for knowledge base population research using Common Crawl and DBpedia. dbpedia common-crawl ibm-research-ai knowledge-base-population Updated Jan 27, 2022 Java
bminixhofer / gerpt2 Star 18 German small and large versions of GPT2. nlp machine-learning german language-model common-crawl gpt2 Updated May 11, 2022 Python
oscar-project / oscar-website Star 10 The website of the Oscar Project nlp website machine-learning hugo language-model common-crawl Updated Nov 9, 2023 TeX
cisnlp / GlotCC Star 9 GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages crawler multlingual corpus-linguistics glot language-identification commoncrawl common-crawl glotcc multilingual-dataset Updated May 31, 2024
Mgosi / Big-Data-Analysis-using-MapReduce-in-Hadoop Star 8 We explore data using Big Data analysis and visualization skills. To do this, we perform three main operations: i) data aggregation from different sources, ii) Big Data analysis using MapReduce, and iii) visualization through Tableau. Data analysis is critical to understanding the data and what we can do with it. For small d… docker big-data twitter-api hdfs tableau data-processing data-pipeline hadoop-docker common-crawl big-data-analytics tweet-collector Updated Oct 5, 2019 Jupyter Notebook
hrbrmstr / cc Star 6 Extract metadata of a specific target based on the results of "commoncrawl.org" r domains urls rstats recon reconnaissance common-crawl r-cyber Updated Aug 31, 2018 R
tokenmill / common-crawl-utils Star 6 Various Common Crawl utilities in Clojure. clojure clojure-library warc common-crawl cdx-api Updated Dec 5, 2023 Clojure
HRN-Projects / common_crawl_with_scrapy Star 6 Parses huge web archive files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy. python data-mining python3 web-scraping scrapy web-crawling webarchive common-crawl common-crawl-with-scrapy parse-common-crawl common-crawl-with-python common-crawl-scrapy common-crawl-python common-crawl-data webarchive-data-scraping Updated Jul 14, 2021 Python
toimik / CommonCrawl Star 6 Common Crawl's processing tools warc wat wet commoncrawl common-crawl warc-files wat-files common-crawl-data wet-files Updated May 2, 2024 C#
code402 / warc-benchmark Star 4 Sample code to grep Common Crawl WARC files in Go, Java, Node and Python. warc commoncrawl common-crawl Updated Apr 30, 2021 Shell