Terminology
- Data: Information that is collected from various sources, like databases, spreadsheets, emails, photos, videos, and social media.
- Data Repositories: Places where data is stored, like databases, data warehouses, and data lakes.
- Data Integration: Combining data from different sources into a single view so that it can be easily accessed and used.
- Data Pipelines: Tools and processes that help move data from its source to its destination, while also transforming and cleaning it along the way.
- Query Languages: Languages used to ask questions and retrieve specific information from the data, like SQL.
- Programming Languages: Languages used to develop applications and perform more complex data tasks, like Python.
- BI and Reporting Tools: Tools used to collect and present data in a visual format, like interactive dashboards.
- Automation: Using tools and frameworks to automate the individual stages of data processing and analysis.
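To make the difference between a query language and a programming language concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table and its values are made up for illustration; SQL asks the question, and Python handles the result.

```python
import sqlite3

# Hypothetical in-memory repository; table name and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", 3, 2.50), ("gadget", 1, 10.00), ("widget", 2, 2.50)],
)

# Query language (SQL): retrieve revenue per product from the stored data.
rows = conn.execute(
    "SELECT product, SUM(quantity * price) AS revenue "
    "FROM sales GROUP BY product ORDER BY product"
).fetchall()

# Programming language (Python): turn the query result into a lookup table.
revenue_by_product = dict(rows)
conn.close()

print(revenue_by_product)
```

In practice the repository would be a real database or warehouse rather than an in-memory table, but the division of labor stays the same.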
Summary
Data engineering is about managing and processing data effectively. Its core tasks are integrating data from different sources, building data pipelines, storing data in repositories, and automating workflows. Data can be structured, semi-structured, or unstructured, and it arrives in various file formats from different sources. Data repositories are either transactional (built for day-to-day reads and writes) or analytical (built for querying and reporting on accumulated data), depending on the type of data and how it is used. Data integration tools combine data from different sources, and data pipelines process and transform it on the way to its destination. Query languages, programming languages, and scripting languages are used to query and manipulate data and to build applications on top of it. BI and reporting tools visualize data, and automation tools streamline the analytics process. Taken together, data engineering is a broad field that spans a wide range of tools and processes.
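The integration and pipeline steps described above could be sketched roughly as follows. This is a toy example under stated assumptions: the two sources (`CSV_SOURCE`, `JSON_SOURCE`), the field names, and the `warehouse` list standing in for a repository are all hypothetical.

```python
import csv
import io
import json

# Two hypothetical sources: a CSV export of customers and a JSON feed of orders.
CSV_SOURCE = "customer_id,name\n1, Alice \n2,Bob\n"
JSON_SOURCE = (
    '[{"customer_id": 1, "total": 30.0},'
    ' {"customer_id": 2, "total": 12.5},'
    ' {"customer_id": 1, "total": 5.0}]'
)


def extract():
    """Read raw records from both sources."""
    customers = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    orders = json.loads(JSON_SOURCE)
    return customers, orders


def transform(customers, orders):
    """Clean the records and integrate them into a single view per customer."""
    totals = {}
    for order in orders:
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["total"]

    unified = []
    for customer in customers:
        cid = int(customer["customer_id"])      # CSV values arrive as strings
        unified.append({
            "customer_id": cid,
            "name": customer["name"].strip(),   # remove stray whitespace
            "total_spent": totals.get(cid, 0.0),
        })
    return unified


def load(rows, destination):
    """Append the transformed rows to the destination (a list stands in for a repository)."""
    destination.extend(rows)


warehouse = []
load(transform(*extract()), warehouse)
print(warehouse)
```

Splitting the work into `extract`, `transform`, and `load` functions keeps each stage easy to test and to automate separately.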
Data can be categorized based on its structure into three main types:
- Structured Data: This type of data follows a well-defined format and can be organized neatly into rows and columns. It is typically found in databases and spreadsheets. Examples include customer information, sales data, and financial records.
- Semi-Structured Data: Semi-structured data is a mix of structured and unstructured data. It has some consistent characteristics but does not conform to a rigid structure. An example is an email, which has structured data like sender and recipient information, but also unstructured data in the form of the email content.
- Unstructured Data: Unstructured data is complex and does not fit into a traditional row and column structure. It includes qualitative information that cannot be easily organized. Examples of unstructured data include photos, videos, text files, PDFs, and social media content.
By categorizing data based on its structure, data engineers can determine the appropriate tools and techniques to process, store, and analyze the data effectively.
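As a rough illustration of why the category matters, the sketch below handles one small sample of each type with a different standard-library tool. All sample values are invented for the example.

```python
import csv
import io
import json
from collections import Counter

# Structured: fixed columns, so every row can be read the same way.
structured = "product,quantity,price\nwidget,3,2.50\ngadget,1,10.00\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total_quantity = sum(int(row["quantity"]) for row in rows)

# Semi-structured: keys and nesting exist, but fields may be missing or vary.
semi_structured = '{"user": "alice", "tags": ["new", "vip"], "address": {"city": "Seoul"}}'
record = json.loads(semi_structured)
city = record.get("address", {}).get("city")
phone = record.get("phone")  # absent in this record, so this is None

# Unstructured: no schema at all, so structure has to be derived (here, word counts).
unstructured = "Great product. Great price. Slow delivery."
word_counts = Counter(unstructured.lower().replace(".", "").split())

print(total_quantity, city, phone, word_counts["great"])
```

The structured sample can be aggregated directly, the semi-structured one needs defensive lookups, and the unstructured one needs some form of feature extraction before it can be analyzed.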
Here are some real-life examples of how data can be categorized based on its structure:
- Structured Data:
  - Customer information stored in a relational database, including names, addresses, and contact details.
  - Sales data organized in a spreadsheet, with columns for product names, quantities, and prices.
  - Financial records stored in a structured format, such as balance sheets and income statements.
- Semi-Structured Data:
  - Emails that contain structured data like sender and recipient information, but also unstructured data in the form of the email content.
  - JSON or XML files that have a defined structure but allow flexibility in the data elements they contain.
  - Web logs that capture structured data like timestamps and IP addresses, along with unstructured data like user comments.
- Unstructured Data:
  - Photos and videos that do not have a predefined structure but contain valuable visual information.
  - Text files, such as documents or articles, that contain paragraphs, sentences, and words without a specific structure.
  - Social media content, including tweets, posts, and comments, which often contain unstructured text and multimedia elements.
These examples show how widely data can vary in structure. Knowing which category a dataset falls into helps data engineers choose suitable methods and tools to process and analyze it.
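The email example can be made concrete with Python's standard `email` package. This is only a sketch: the message below is invented, and a real mailbox would involve encodings and multipart bodies that are not handled here.

```python
from email import message_from_string

# A made-up plain-text email; addresses and content are illustrative only.
raw_email = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 numbers\n"
    "\n"
    "Hi Bob, the Q3 numbers look good. Let's talk tomorrow.\n"
)

message = message_from_string(raw_email)

# Structured part: headers map cleanly onto named fields.
headers = {
    "sender": message["From"],
    "recipient": message["To"],
    "subject": message["Subject"],
}

# Unstructured part: the body is free text with no predefined fields.
body = message.get_payload()

print(headers)
print(body)
```

The headers could go straight into table columns, while the body would need text processing before it yields anything queryable.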