data-lake
mindmap
what's a data lake
- A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale
- It's a concept, similar to cloud computing, rather than a specific technology
- It's an architectural approach that allows enterprises to consolidate large, heterogeneous data assets at scale and uncover actionable insights from the consolidated data through various types of analytics
concepts
- storage
- storage is a fundamental component of any data lake
- data lakes should be able to store structured, semi-structured, and unstructured data
- data lakes are usually built on scalable and elastic storage
- integration with other storage systems is very important so that a data lake can access and analyze data from different sources
- processing
- data lakes should be able to process data at large scale
- data lakes may support both batch processing and stream processing
- data lakes should support different processing methods, such as SQL, custom code, etc.
- metadata
- it's a key component of a data lake, keeping track of all the data assets
- metadata is used to discover, understand, and govern data
- features for metadata management include
- auto-discovery
- auto-classification
- auto-tagging
- data lineage
- user customization
- management
- data governance
- data lifecycle management
- data security
- data privacy
- beyond technology
implementation pieces
storage layer
- features
- store binary files with s3-compatible storage
- store tables with s3-compatible storage or databases
- use s3-compatible storage, such as minio, to store files
- use clickhouse to store tables (a minimal sketch of both follows this list)
- deploy via docker | k8s
- why clickhouse?
- it's a column-oriented database management system
- the MergeTree engine is powerful for storing and querying time-series data
- it supports many table engines for connecting to other storage systems, such as s3, kafka, postgresql, etc.
- it provides multiple wire-compatible interfaces so it can act like another (common) database, such as the PostgreSQL interface and MySQL interface
- it's easy to integrate with flink cdc to capture changes from other databases/systems through jdbc
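A minimal sketch of the storage layer above, assuming a local minio endpoint and a clickhouse instance; the endpoints, credentials, bucket, database, and table names are placeholders, not part of these notes.

```python
from datetime import datetime

import boto3
import clickhouse_connect

# --- binary files: upload a parquet file to s3-compatible storage (e.g. minio) ---
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # assumed minio endpoint
    aws_access_key_id="minioadmin",          # placeholder credentials
    aws_secret_access_key="minioadmin",
)
s3.upload_file(
    "events-2024-01-01.parquet",             # placeholder local file
    "datalake-raw",                          # placeholder bucket
    "events/2024/01/01/events.parquet",
)

# --- tables: store time-series rows in a clickhouse MergeTree table ---
ch = clickhouse_connect.get_client(host="localhost", port=8123)
ch.command("CREATE DATABASE IF NOT EXISTS datalake")
ch.command("""
    CREATE TABLE IF NOT EXISTS datalake.events (
        event_time DateTime,
        device_id  String,
        reading    Float64
    )
    ENGINE = MergeTree
    ORDER BY (device_id, event_time)
""")
ch.insert(
    "datalake.events",
    [[datetime(2024, 1, 1), "sensor-1", 0.42]],
    column_names=["event_time", "device_id", "reading"],
)
print(ch.query("SELECT count() FROM datalake.events").result_rows)
```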
processing layer
- features
- views through other databases or systems
- logical views
- periodically refreshed physical views
- real-time physical views
- easy to analyze data across multiple sources (files and tables)
- easy to integrate with algorithms
- with remote calls: grpc or rest api
- with local invocations: apache arrow
- use flink as the processing engine (a pyflink sketch follows this list)
- basic tutorials
- flink on k8s
- basic connectors
- s3 filesystem connector (source/sink) with parquet format
- jdbc connector (source/sink)
- NOTE: the jdbc source is implemented with InputFormat
- cdc connectors
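A minimal pyflink sketch of the processing layer above, assuming the filesystem/parquet and jdbc connector jars are on the classpath; the s3 path, jdbc url, credentials, and table names are placeholders. It reads parquet files from s3-compatible storage and writes a "periodically refreshed physical view" back through jdbc.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# source: parquet files in s3-compatible storage (filesystem connector)
t_env.execute_sql("""
    CREATE TABLE raw_events (
        event_time TIMESTAMP(3),
        device_id  STRING,
        reading    DOUBLE
    ) WITH (
        'connector' = 'filesystem',
        'path'      = 's3a://datalake-raw/events/',
        'format'    = 'parquet'
    )
""")

# sink: a table reachable over jdbc (on the source side, jdbc is backed by InputFormat)
t_env.execute_sql("""
    CREATE TABLE device_daily (
        day        DATE,
        device_id  STRING,
        avg_reading DOUBLE,
        PRIMARY KEY (day, device_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'jdbc',
        'url'        = 'jdbc:postgresql://localhost:5432/analytics',
        'table-name' = 'device_daily',
        'username'   = 'flink',
        'password'   = 'flink'
    )
""")

# a periodically refreshed physical view: batch aggregation written back via jdbc
t_env.execute_sql("""
    INSERT INTO device_daily
    SELECT CAST(event_time AS DATE) AS day, device_id, AVG(reading) AS avg_reading
    FROM raw_events
    GROUP BY CAST(event_time AS DATE), device_id
""").wait()
```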
metadata layer
- features
- auto-discovery
- auto-tagging (including classification)
- data-lineage
- user-customization
- easy to search data
- easy to fetch data
- easy to analyze data
- use datahub as the metadata backend (a minimal emitter sketch follows this list)
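A minimal sketch of pushing dataset metadata into datahub with its python emitter (the acryl-datahub package); the gms endpoint, dataset name, and custom properties are placeholders, not part of these notes.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# assumed local datahub gms endpoint
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# describe a clickhouse table so it becomes searchable in the catalog
dataset_urn = make_dataset_urn(platform="clickhouse", name="datalake.events", env="PROD")
properties = DatasetPropertiesClass(
    description="raw device events stored in the MergeTree table datalake.events",
    customProperties={"layer": "storage", "owner_team": "data-platform"},
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
```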
workflow and scheduling
- argo workflow
- easy to construct a pipeline with yaml, flink jobs, and container images (a minimal manifest sketch follows)
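A minimal sketch of generating an argo workflow manifest from python and submitting it with the argo cli; the image name and job command are placeholders for a containerized flink job, not part of these notes.

```python
import yaml

# a single-step workflow that runs a containerized flink job
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "datalake-pipeline-"},
    "spec": {
        "entrypoint": "flink-batch-job",
        "templates": [
            {
                "name": "flink-batch-job",
                "container": {
                    # placeholder image and command for the containerized flink job
                    "image": "registry.example.com/datalake/flink-job:latest",
                    "command": ["flink", "run", "-py", "/opt/jobs/aggregate.py"],
                },
            }
        ],
    },
}

with open("pipeline.yaml", "w") as f:
    yaml.safe_dump(workflow, f, sort_keys=False)
# submit with: argo submit pipeline.yaml
```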
api and sdk
- self-developed
machine learning and ai