data-lake

what's a data lake

  • A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale
  • It's a concept, similar to cloud computing, but not a specific technology
  • It's an architectural approach that allows enterprises to consolidate large heterogeneous data assets at scale and uncover actionable insights from the consolidated data through various types of analytics

concepts

  1. storage
    • storage is a fundamental component of any data lake
    • data lakes should be able to store structured, semi-structured, and unstructured data
    • data lakes are usually built on scalable and elastic storage
    • it's important for a data lake to integrate with other storage systems, so that data from different sources can be accessed and analyzed
  2. processing
    • data lakes should be able to process data at large scale
    • data lakes may support both batch processing and stream processing
    • data lakes should support different processing methods, such as SQL, custom code, etc.
  3. metadata
    • metadata is a key component of a data lake that keeps track of all the data assets
    • metadata is used to discover, understand, and govern data
    • features for metadata management include
      • auto-discovery
      • auto-classification
      • auto-tagging
      • data lineage
      • user customization
  4. management
    • data governance
    • data lifecycle management
    • data security
    • data privacy
  5. beyond technology

implementation pieces

  1. storage layer

    • features
      • store binary files with s3-compatible storage
      • store tables with s3-compatible storage or databases
    • use S3-compatible storage, such as MinIO, to store files (see the MinIO sketch after this list)
    • use ClickHouse to store tables (see the ClickHouse sketch after this list)
      • docker | k8s
      • why clickhouse?
        1. it's a column-oriented database management system
        2. the MergeTree engine is powerful for storing and querying time-series data
        3. it supports many table engines for connecting to other storage systems, such as S3, Kafka, PostgreSQL, etc.
        4. it exposes wire-compatible interfaces that let it act as another (common) database, such as the PostgreSQL and MySQL interfaces
        5. it integrates easily with Flink CDC to capture changes from other databases/systems through JDBC
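
As a concrete illustration of the file-storage path, here is a minimal sketch that lands a raw file in MinIO, assuming a local MinIO server at http://localhost:9000 with its default credentials and a hypothetical raw-data bucket; since MinIO speaks the S3 protocol, any S3 SDK (boto3 here) works:

```python
import boto3

# Any S3-compatible SDK works against MinIO; boto3 only needs the endpoint overridden.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # assumed local MinIO endpoint
    aws_access_key_id="minioadmin",        # MinIO's defaults; change outside of demos
    aws_secret_access_key="minioadmin",
)

# Land a binary/raw file in the lake's storage layer.
s3.create_bucket(Bucket="raw-data")
s3.upload_file("events.json", "raw-data", "landing/2024/events.json")

# List what has been landed so far.
for obj in s3.list_objects_v2(Bucket="raw-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```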
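
And a sketch of the table-storage path, assuming a local ClickHouse server and the clickhouse-driver Python package; the table name and schema are invented for the example, and the MergeTree engine is the one described above for time-series workloads:

```python
from datetime import datetime

from clickhouse_driver import Client

client = Client(host="localhost")  # assumes ClickHouse on the default native port

# MergeTree stores rows sorted by the ORDER BY key and merges parts in the
# background, which is what makes time-series queries cheap.
client.execute("""
    CREATE TABLE IF NOT EXISTS default.metrics (
        ts    DateTime,
        name  String,
        value Float64
    ) ENGINE = MergeTree
    ORDER BY (name, ts)
""")

client.execute(
    "INSERT INTO default.metrics (ts, name, value) VALUES",
    [
        (datetime(2024, 1, 1, 0, 0), "cpu_load", 0.42),
        (datetime(2024, 1, 1, 0, 1), "cpu_load", 0.57),
    ],
)

print(client.execute("SELECT name, avg(value) FROM default.metrics GROUP BY name"))
```
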
  2. processing layer

  3. metadata layer

    • features
      • auto-discovery
      • auto-tagging (including classification)
      • data-lineage
      • user-customization
      • easy to search data
      • easy to fetch data
      • easy to analyze data
    • use DataHub as the metadata backend (see the sketch below)
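
A minimal sketch of pushing metadata into DataHub, assuming the acryl-datahub Python package and a DataHub GMS reachable at http://localhost:8080; the dataset URN, description, and properties are invented for the example:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed GMS endpoint

# Register (or update) a dataset entity so it becomes searchable in DataHub.
dataset_urn = make_dataset_urn(platform="clickhouse", name="default.metrics", env="PROD")

mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(
        description="Time-series metrics landed in the data lake",
        customProperties={"layer": "storage", "format": "MergeTree"},
    ),
)
emitter.emit(mcp)
```
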
  4. workflow and scheduling

    • Argo Workflows
    • pipelines are easy to construct from YAML, Flink jobs, and container images (see the sketch below)
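
A sketch of submitting such a pipeline programmatically, assuming the kubernetes Python client, a reachable cluster, and Argo Workflows installed in an argo namespace; the manifest below is the Python-dict mirror of the YAML one would normally write:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig with access to the cluster

# A minimal "hello world" Workflow: one container step, no DAG yet.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "data-lake-demo-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "container": {
                    "image": "alpine:3.19",
                    "command": ["echo", "hello from the data lake pipeline"],
                },
            }
        ],
    },
}

# Workflows are CRDs, so they are created through the custom-objects API.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argo",  # assumed install namespace for Argo Workflows
    plural="workflows",
    body=workflow,
)
```
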
  5. api and sdk

    • self-developed
  6. machine learning and ai

  7. other examples