Data Engineering Concepts: A Project-Based Introduction

I recently finished the Data Engineering Zoomcamp by DataTalks Club. For my certification, I was required to undertake a capstone project that would culminate in a dashboard showing insights from the data I had processed in my pipeline.

Instead of a step-by-step guide (which can easily be found in the project's repository), this article explores data engineering concepts from a high-level view, explaining the decisions I made and the trade-offs I considered.



Table of Contents

  • Data Sourcing and Problem Definition
  • Containerization
  • Infrastructure-as-Code (IaC)
  • Orchestration vs Automation
  • Data Lake and Data Warehouse
  • Analytics Engineering and Data Modeling
  • Batch vs Streaming
  • Exposure: Visualization and Predictions
  • Conclusion
1. Data Sourcing and Problem Definition


Before coding, I sourced the data and defined the problem. I chose the LinkedIn Job Postings dataset from Kaggle due to its richness and descriptive documentation.

Problem Statement:


How can data from LinkedIn job posts (2023–2024) help us make informed decisions on a career path?

I then broke it down into the following questions, each of which would be addressed by a chart in my dashboard:

  • Which job titles offer the highest salaries?
  • Which companies, industries, and skills are the most lucrative?
  • What percentage of companies offer remote work?
  • What are the highest salaries and average experience levels?
  • Which countries have the most job postings?

These were, of course, not MECE-compliant (MECE: mutually exclusive and collectively exhaustive).

2. Containerization


Before I could even think about extracting my data, I took a high-level, long-term view of the project and considered aspects such as collaboration and reproducibility. I could easily have created the pipelines and files on my local machine with no packaging whatsoever, but that would make it hard for anyone looking to evaluate or reproduce my project. I therefore decided to use Docker containers to package the project so it can be replicated locally or in the cloud.

Docker containers have other advantages as well: they are lightweight; they are easy to replicate, which enables scalability through horizontal scaling and load balancing; Dockerfiles simplify environment management, which improves maintainability; isolation between containers prevents dependency conflicts; and versioned images support version control.

3. Infrastructure as Code (IaC)


I would be using GCP (Google Cloud Platform) for my data lake, data warehouse, and dashboard hosting, so I needed a reliable way to interact with the cloud. Infrastructure-as-Code (IaC) is the practice of managing and provisioning computing infrastructure (such as servers, networks, databases, and other resources) through machine-readable configuration files rather than manual processes.

IaC tools simplify cloud infrastructure management and enable scalability, version control, testability, and automation. Beyond provisioning infrastructure, they can also handle other management tasks, such as enabling APIs in GCP. IaC also promotes reuse, since it avoids creating new resources when the defined ones already exist.

Terraform is the IaC tool I used because of its simplicity. I made my configuration modular and used variables and outputs to integrate Terraform into my project's workflow. An alternative to Terraform is AWS CloudFormation, which is used in AWS setups.

4. Orchestration vs Automation


A data workflow is a sequence of automated data processing steps that specifies the steps, inputs, outputs, and dependencies in a data processing pipeline. Data workflows are also called DAGs (Directed Acyclic Graphs): directed means the edges have a direction, and acyclic means there are no cycles. Loops are allowed, but cycles are not; the difference here is that a loop has a known start and end and terminates once a condition is met, whereas a cycle does not. DAGs are run using orchestration tools/engines such as Apache Airflow, Luigi, Prefect, Dagster, and Kestra. Smaller workflows can be run using make and/or cron jobs, but this is usually done locally.

In software engineering and data management, an orchestrator is a tool that automates, manages, and coordinates various workflows and tasks across different services, systems, or applications. Because an orchestrator allows everything to run smoothly without the need for manual intervention, it is easy to confuse orchestration with automation.

Whereas automation refers to the execution of individual tasks or actions without manual intervention, orchestration goes beyond automation by managing the flow of multiple interconnected tasks or processes. Orchestration defines not only what happens but also when and how things happen, ensuring that all tasks (whether automated or not) are executed in the correct order, with the right dependencies and error handling in place. While automation focuses on individual tasks, orchestration ensures all those tasks are arranged and managed within a broader, cohesive system. This matters if you need to reliably handle complex processes with many interdependent steps.
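To make the distinction concrete, here is a minimal, illustrative Python sketch (not tied to any real tool): each task function on its own is automation, while the small runner that executes the tasks in dependency order, with retries, is a toy form of orchestration.

```python
# Toy example: "automation" = the individual task functions;
# "orchestration" = running them in dependency order with retries.
# Task names and bodies are hypothetical.
from typing import Callable

def extract() -> None:
    print("extracting data...")

def transform() -> None:
    print("transforming data...")

def load() -> None:
    print("loading data...")

TASKS: dict[str, Callable[[], None]] = {"extract": extract, "transform": transform, "load": load}
DEPENDENCIES: dict[str, list[str]] = {"extract": [], "transform": ["extract"], "load": ["transform"]}

def run_pipeline(max_retries: int = 2) -> None:
    """Run tasks in dependency order, retrying each failed task a few times."""
    completed: set[str] = set()
    while len(completed) < len(TASKS):
        for name, task in TASKS.items():
            if name in completed or not all(dep in completed for dep in DEPENDENCIES[name]):
                continue  # already done, or dependencies not ready yet
            for attempt in range(1, max_retries + 1):
                try:
                    task()
                    completed.add(name)
                    break
                except Exception as exc:
                    print(f"{name} failed on attempt {attempt}: {exc}")
            else:
                raise RuntimeError(f"task {name} exhausted its retries")

if __name__ == "__main__":
    run_pipeline()
```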

Use cases for automation include automated testing after code commits, automated backups and automated email notifications. Use cases for orchestration include data pipeline orchestration, CI/CD pipeline orchestration and cloud infrastructure orchestration.

Advantages of workflow orchestration include:

  • Scalability
  • Error handling and resilience
  • Improved monitoring and control
  • Process standardization
  • Faster time to value since no need to reinvent the wheel
What’s the Difference?

Category | Automation | Orchestration
Scope | Single task execution | Coordination of multiple tasks
Focus | Efficiency of individual actions | Dependency management, error handling
Example | Automated backups | CI/CD pipelines, data pipeline scheduling

For my workflow orchestration tool, I chose Apache Airflow because it is the most common in the industry. I ran Airflow via a docker-compose.yml file and a Dockerfile that installs the Google Cloud SDK (used to interact with GCP). I then created a DAG with multiple steps, including downloading, unzipping, and uploading the data to the GCS bucket created earlier.
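A simplified sketch of such a DAG is shown below; the dataset URL, file paths, task names, and bucket name are placeholders rather than the project's actual values.

```python
# Sketch of an ingestion DAG: download -> unzip -> upload to GCS.
# Assumes Airflow 2.x with the Google provider installed; all names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

with DAG(
    dag_id="linkedin_postings_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually for a one-off load
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download_dataset",
        bash_command="curl -L -o /tmp/postings.zip https://example.com/linkedin-postings.zip",
    )

    unzip = BashOperator(
        task_id="unzip_dataset",
        bash_command="unzip -o /tmp/postings.zip -d /tmp/postings",
    )

    upload = LocalFilesystemToGCSOperator(
        task_id="upload_to_gcs",
        src="/tmp/postings/postings.csv",
        dst="raw/postings.csv",
        bucket="my-data-lake-bucket",  # placeholder bucket name
    )

    download >> unzip >> upload
```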



Why I Used Apache Airflow

  • Industry standard for DAG orchestration
  • Allows complex workflows
  • Supports retries, alerts, and dependency management
  • Easily containerized using docker-compose.yml and custom Dockerfile
5. Data Lake vs Data Warehouse

Data Lake


A data lake is a centralized repository that allows you to store structured, semi-structured, and unstructured data at any scale, in its raw, native format until it's needed for analysis.

Features of a data lake:

  • Allows ingestion of structured and unstructured data
  • Catalogs and indexes data for analysis without data movement
  • Stores, secures and protects data at an unlimited scale
  • Connects data with analytics and ML tools
Why do we need a data lake:

  • Companies realized the value of data
  • Allows for quick storage and access of data (contingent on the storage tier if using S3)
  • It is hard to always be able to define the structure of data at the onset
  • Data usefulness is sometimes realized later in the project lifecycle
  • R&D on data products requires huge amounts of data
  • The need for cheap storage of Big Data

Cloud providers of data lakes include Google Cloud Storage by GCP, S3 by AWS and Azure Blob by Azure.
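For a sense of what working with a GCS-based data lake looks like in code, here is a minimal sketch using the official Python client; the bucket, file, and object names are made up for illustration.

```python
# Minimal sketch: upload a raw file to a Cloud Storage (GCS) data lake bucket.
# Assumes application-default credentials are configured; names are placeholders.
from google.cloud import storage

def upload_to_data_lake(bucket_name: str, source_file: str, destination_blob: str) -> None:
    """Upload a local file to a GCS bucket in its raw, native format."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(source_file)
    print(f"Uploaded {source_file} to gs://{bucket_name}/{destination_blob}")

if __name__ == "__main__":
    upload_to_data_lake("my-data-lake-bucket", "postings.csv", "raw/postings.csv")
```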

Dangers in a data lake:

  • Conversion into a data swamp (disorganized, inaccessible, and untrustworthy data lake)
  • No versioning
  • Incompatible schemas for same data without versioning
  • No metadata associated
  • Joins not possible
Data Warehouse


A data warehouse is a centralized, structured repository designed to store, manage, and analyze large volumes of cleaned and organized data from multiple sources to support business intelligence (BI), reporting, and decision-making. This is where we have the partitioning and clustering capabilities.

Feature | Data Lake | Data Warehouse
Data Type | Raw (structured + unstructured) | Refined (structured only)
Purpose | Storage for future use | Fast analytics & reporting
Design | Schema-on-read | Schema-on-write
Example Tool | Google Cloud Storage | BigQuery, Redshift, Snowflake

Cloud data warehouse offerings include BigQuery (GCP), Amazon Redshift (AWS), and Snowflake, a vendor-neutral platform that runs on GCP, AWS, and Azure.

Additional Topics about Data Warehousing

Unbundling


Data warehouse unbundling is the process of breaking apart a traditional, monolithic data warehouse into distinct, independently scalable components. In practice, this means decoupling ingestion, storage, processing, and compute, which allows you to:

  1. Scale storage and compute independently
  2. Adopt best-of-breed tools thanks to modularity, leading to better performance, agility, and innovation
  3. Improve agility and maintainability

Not all data warehouses are unbundled, so be sure to check whether the one you want to use is.

OLAP vs OLTP:


Online analytical processing (OLAP) and online transaction processing (OLTP) are both database management systems for storing and processing business data in large volumes, collected from sources such as websites, applications, smart meters, and internal systems. OLAP combines and groups data so you can analyze it from different points of view. OLTP, in contrast, stores and updates transactional data reliably and efficiently at high volume; OLTP databases are often one of several data sources for an OLAP system.

The primary purpose of online analytical processing (OLAP) is to analyze aggregated data, while the primary purpose of online transaction processing (OLTP) is to process database transactions. You use OLAP systems to generate reports, perform complex data analysis, and identify trends. In contrast, you use OLTP systems to process orders, update inventory, and manage customer accounts. A data warehouse is an OLAP solution.
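As a concrete illustration, an OLAP-style aggregation against a warehouse might look like the following BigQuery sketch; the project, dataset, table, and column names are hypothetical.

```python
# Sketch of an analytical (OLAP) query with the BigQuery Python client.
# The table and columns below are illustrative, not from the actual project.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT country, COUNT(*) AS postings, AVG(max_salary) AS avg_max_salary
    FROM `my_project.jobs.fact_postings`
    GROUP BY country
    ORDER BY postings DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.country, row.postings, row.avg_max_salary)
```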

Feature | OLTP | OLAP
Purpose | Manage and process real-time transactions/business operations | Analyze large volumes of data to support decision-making
Data Updates | Short, fast updates initiated by users | Data periodically refreshed with scheduled, long-running batch jobs
Database Design | Normalized databases for efficiency and consistency | Denormalized databases using star/snowflake schemas for analytical queries
Space Requirements | Generally small (if historical data is archived) | Generally large due to aggregating large datasets
Response Time | Milliseconds (optimized for speed) | Seconds or minutes (optimized for complex queries)
Backup and Recovery | Frequent backups required for business continuity | Data can be reloaded from OLTP systems in lieu of regular backups
Productivity | Increases productivity of end users and transaction handlers | Increases productivity of analysts, executives, and decision-makers
Data View | Detailed, day-to-day business transactions | Aggregated, multi-dimensional view of enterprise data
Example Applications | Order processing, payments, inventory updates | Trend analysis, forecasting, executive dashboards
User Examples | Customer-facing staff, clerks, online shoppers | Business analysts, data scientists, senior management
Data Structure | Row-based storage | Columnar storage (in most modern OLAP systems like BigQuery, Redshift, etc.)
Examples by Provider | Google Cloud SQL, Amazon Aurora, Cloud Spanner | BigQuery, Amazon Redshift, Snowflake
6. Analytics Engineering and Data Modeling

What Is Analytics Engineering?


Analytics engineering is a field that seeks to bridge the gap between data engineering and data analysis. It brings good software engineering practices (such as modularity, version control, testing, documentation, and DRY) to the work of data analysts and data scientists, using tools such as dbt, Dataform, AWS Glue, and SQLMesh.

Data Modeling:


Data modelling is the process of defining and organizing the structure of data within a system or database to ensure consistency, clarity, and usability. It involves creating abstract representations (models) of how data is stored, connected, and processed. There are three levels of data models:

  • Conceptual - High-level view of business entities and relationships (dimension tables)
  • Logical - Defines the structure and attributes of data without database-specific constraints: measurements, metrics, or facts (fact tables)
  • Physical - Maps the logical model to actual database schemas, tables, indexes, and storage.
ETL vs ELT


We can use either ETL or ELT when transforming data. The letters stand for the same words (Extract, Transform, Load versus Extract, Load, Transform), but the order matters.

Feature | ETL | ELT
Data Volume | Small | Large
Transformation Time | Before loading | After loading
Flexibility | Lower | Higher
Cost | Higher compute | Lower overall

For data modeling, I used dbt Cloud because it is easy to use and integrates with GitHub. It does limit non-premium accounts to a single dbt project, so I had to delete a previous project.

Instead of duplicating data by loading everything from the GCS bucket into BigQuery, I used dbt to create external tables, which reference the data without loading it into BigQuery's native storage. The trade-off is that access and query performance can be somewhat slower because the data is read on the fly.
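My setup declared these external tables through dbt, but for reference, here is a rough sketch of how an equivalent external table can be defined directly with the BigQuery Python client; the bucket, project, and table names are placeholders.

```python
# Sketch: define a BigQuery external table over CSV files sitting in GCS,
# so the data is queried in place instead of being loaded into native storage.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-data-lake-bucket/raw/postings*.csv"]
external_config.autodetect = True  # infer the schema from the files

table = bigquery.Table("my_project.jobs.external_postings")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```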

I also used dbt to partition and cluster my tables in BigQuery. Partitioning splits a table into segments (partitions) based on the values of a specific column (usually a date/time or integer range). Each partition stores a subset of the table's data, and queries can skip entire partitions that aren't relevant. The same query therefore processes less data on a partitioned table than on a non-partitioned one, saving both time and money for frequently run queries. You can have a maximum of 4,000 partitions in a table.

Clustering, on the other hand, organizes rows within a partition (or an unpartitioned table) based on the values of one or more columns. It enables fine-grained pruning of data during query execution and optimizes filters, joins, and aggregations by organizing data on disk. You can specify up to 4 clustering columns; they must be top-level, non-repeated fields. BigQuery automatically re-clusters newly added data at no cost.
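Here is a minimal sketch of what a partitioned and clustered table definition looks like with the BigQuery Python client (my project did this through dbt configs; the schema and column names below are illustrative only).

```python
# Sketch: create a BigQuery table partitioned by date and clustered by two columns.
# Schema and names are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("listed_date", "DATE"),
    bigquery.SchemaField("country", "STRING"),
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField("max_salary", "FLOAT"),
]

table = bigquery.Table("my_project.jobs.fact_postings", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="listed_date",  # coarse pruning: queries can skip whole days
)
table.clustering_fields = ["country", "title"]  # fine-grained pruning within partitions
client.create_table(table, exists_ok=True)
```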

Feature | Partitioning | Clustering
Granularity | Coarse (splits table into partitions) | Fine (organizes rows within partitions/tables)
Basis | One column (DATE/TIMESTAMP/INTEGER) | Up to 4 columns (any type)
Performance | Skips entire partitions | Skips blocks of rows within a table
Cost Efficiency | Reduces scan by entire partitions; cost is predictable | Reduces scan via pruning, but cost benefit varies
Storage Layout | Logical partitioning (physically separated partitions) | Physical sorting within storage blocks
Best Used For | Time-series or log data | Frequently filtered or grouped columns with repetition
Limitations | Max 4,000 partitions per table | Max 4 clustering columns; no nested/repeated fields
7. Batch vs Streaming


In data engineering and big data, large amounts of data are usually being generated all the time. A data engineer therefore needs to decide whether to process the data as it arrives (streaming) or to batch it up and process it at intervals.

How to decide between streaming and batch


A good heuristic is to use streaming only when there is an automated response to the data at the end of the pipeline, rather than just a human analyst looking at it (anything more would be over-engineering). Typical streaming use cases include fraud detection, hacked-account detection, and surge pricing (e.g., Uber).

Batch is best for use cases where data is generated in large volumes but not continuously and can be processed at intervals. You can even use micro-batching (15- to 60-minute batches) if you have a lot of data but not enough to justify streaming.

Streaming uses a pub-sub model in which publishers publish data and subscribers read and process it. Data is published to named channels known as topics, and each message carries its own timestamp.
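As a small sketch of the pub-sub model, here is what publishing and subscribing look like with Google Cloud Pub/Sub; the project, topic, and subscription names are placeholders, and the topic and subscription are assumed to already exist.

```python
# Sketch of the pub-sub model with Google Cloud Pub/Sub (all names are placeholders).
from google.cloud import pubsub_v1

project_id = "my-project"

# Publisher side: publish a message to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "job-postings")
future = publisher.publish(topic_path, b'{"title": "Data Engineer", "country": "KE"}')
print("Published message", future.result())

# Subscriber side: pull and process messages from a subscription on that topic.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "job-postings-sub")

def callback(message):
    print("Received", message.data, "published at", message.publish_time)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=10)  # listen briefly for the demo, then stop
except Exception:
    streaming_pull.cancel()
```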

Streaming has relatively few use cases for analytical data, which is the data that data engineers mostly work with. Operating a streaming system is more like running a server, website, or REST API than a batch pipeline or offline process. It is much more complex, and some organizations even use different titles for batch and streaming data engineers (e.g., at Netflix, data engineers handle batch processing, while software engineers specializing in data handle stream processing).

In terms of technology, Apache Spark is used for batch processing, Apache Kafka is used for streaming, and Apache Flink supports both batch and stream processing, although it was built primarily for stream processing.

8. Exposure: Visualization and Predictions


In data engineering - especially in the context of modern data tooling like dbt - an exposure refers to the end use or downstream dependency of data models: it shows where and how the data is used outside of the transformation layer. Examples of exposures include dashboards, machine learning models, external reports, and APIs.

My exposure was, of course, the Looker Studio dashboard.





I only used my final fact table in the dashboard as I had condensed all my previous staging and dimension tables into it using dbt.





Some of the insights I got were that:

  • Software development is the highest paying industry.
  • Top-paying skills: Sales, IT, Management, Manufacturing
  • Average required experience: 6 years
Conclusion


I learned a lot during the Zoomcamp and capstone project. The hands-on nature, real-world tooling, and community support made the journey insightful and practical.

If you're interested in data engineering, I highly recommend joining a live cohort of the DataTalks Club Zoomcamp to get the full experience and earn certification.

This article was just a high-level tour of the data engineering landscape—feel free to dig deeper into any concept that intrigued you.

Bon voyage!



 