binhnguyennus / awesome-scalability Star 53.5k The Patterns of Scalable, Reliable, and Performant Large-Scale Systems computer-science lists devops distributed-systems machine-learning awesome web-development programming big-data system backend architecture scalability resources design-patterns interview awesome-list interview-practice interview-questions system-design Updated May 14, 2024
apache / spark Star 38.5k Apache Spark - A unified analytics engine for large-scale data processing python java r scala sql big-data spark jdbc Updated May 14, 2024 Scala
ClickHouse / ClickHouse Star 34.6k ClickHouse® is a free analytics DBMS for big data sql big-data analytics clickhouse dbms olap distributed-database mpp hacktoberfest Updated May 14, 2024 C++
donnemartin / data-science-ipython-notebooks Star 26.5k Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines. python aws data-science machine-learning caffe theano big-data spark deep-learning hadoop tensorflow numpy scikit-learn keras pandas kaggle scipy matplotlib mapreduce Updated Mar 20, 2024 Python
apache / flink Star 23.2k Apache Flink python java scala sql big-data flink Updated May 14, 2024 Java
amark / gun Star 17.8k An open source cybersecurity protocol for syncing decentralized graph data. machine-learning cryptography crypto encryption database big-data graph offline-first protocol end-to-end dapp decentralized blockchain realtime p2p artificial-intelligence crdt web3 metaverse dweb Updated Apr 15, 2024 JavaScript
prestodb / presto Star 15.6k The official home of the Presto distributed SQL query engine for big data java data query sql big-data presto hive hadoop lakehouse Updated May 14, 2024 Java
heibaiying / BigData-Notes Star 15.3k Big Data Getting Started Guide phoenix scala kafka big-data spark yarn hive hadoop storm bigdata hbase zookeeper hdfs mapreduce flume azkaban sqoop Updated Jan 5, 2024 Java
questdb / questdb Star 13.5k An open source time-series database for fast ingest and SQL queries java iot postgres sql database big-data time-series analytics cpp grafana postgresql simd low-latency financial-analysis tsdb hacktoberfest time-series-database questdb Updated May 14, 2024 Java
andkret / Cookbook Star 13k The Data Engineering Cookbook big-data best-practices cookbook data-engineering data-engineer Updated Mar 20, 2024
apache / predictionio Star 12.5k PredictionIO, a machine learning server for developers and ML engineers. scala big-data predictionio Updated Jan 9, 2021 Scala
yahoo / CMAK Star 11.7k CMAK is a tool for managing Apache Kafka clusters scala kafka big-data cluster-management Updated Aug 2, 2023 Scala
vesoft-inc / nebula Star 10.2k A distributed, fast open-source graph database featuring horizontal scalability and high availability distributed-systems database big-data cpp graph raft scalability distributed graph-database graphdb hacktoberfest nebula nebula-graph nebulagraph Updated May 13, 2024 C++
trinodb / trino Star 9.6k Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL ( https://trino.io ) java distributed-systems data-science sql database big-data presto hive hadoop analytics jdbc databases distributed-database query-engine iceberg datalake prestodb trino delta-lake Updated May 14, 2024 Java
cython / cython Star 9k The most widely used Python to C compiler python c performance big-data cpp cython cpython cpython-extensions Updated May 14, 2024 Python
provectus / kafka-ui Star 8.6k Open-Source Web UI for Apache Kafka Management opensource kafka big-data web-ui streams kafka-connect apache-kafka kafka-producer kafka-client kafka-streams hacktoberfest streaming-data kafka-manager kafka-cluster event-streaming cluster-management kafka-ui kafka-brokers Updated May 3, 2024 Java
StarRocks / starrocks Star 7.9k StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software. sql database big-data analytics olap join distributed-database realtime-database mpp cloudnative iceberg real-time-analytics datalake vectorized real-time-updates star-schema hudi delta-lake lakehouse lakehouse-platform Updated May 14, 2024 Java
catboost / catboost Star 7.8k A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU. python data-science machine-learning data-mining tutorial r big-data gpu cuda kaggle gbdt gbm gpu-computing decision-trees gradient-boosting coreml catboost categorical-features Updated May 14, 2024 Python
apache / beam Star 7.6k Apache Beam is a unified programming model for Batch and Streaming data processing. python java golang streaming sql big-data beam batch Updated May 14, 2024 Java
delta-io / delta Star 6.9k An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs big-data spark analytics acid delta-lake Updated May 14, 2024 Scala