PCDE Course Overview
Introduction
This course, offered through MIT's xPRO program, prepares you for a certification in the professional skills required in data engineering. It covers topics including Python programming and the basics of database design.
First Some Lessons About Good Note Taking
Interacting with Lectures
It's important to take notes on the content released each week. Good note-taking can even save time on activities and assignments, since the information needed for them is often already in your notes.
Effective learning includes what goes on before, during, and after lectures. The University of British Columbia has put together some recommendations on those three phases. There is also a separate note on effective note-taking that includes the information presented in UBC's article.
Chat with your Learning Facilitators
There are times we're simply stumped, regardless of our best efforts. It's best to go through the modules as early as possible, with ample time to reach out to learning facilitators and peers. Attend office hours frequently, when possible, both for information reinforcement and to ask your questions. Submit support tickets to ask those questions and get extra guidance. You'll find instructions on how to submit support tickets in your Orientation Week Module.
Connect with your Fellow Learners
Just because you are viewing course material on your own doesn't mean that you're the only one pursuing this certificate. Use Slack to connect with others, ask questions, and get some help from your peers. They may have the same difficulties as you or might have some great tips that will make the topics click.
The moral of the story is that the more ways you interact with course content, through effective note-taking, review, and connecting with others, the better you'll grasp the concepts and the more you'll get out of the course.
Some Time Management Recommendations
Taken in a well-structured way, the course should take about 15 hours a week. This is a minimum recommendation, and some weeks you may find yourself spending more than 20 hours. Integrating this time into your schedule requires disciplined time management. Here are some more in-depth tips on managing time while in the program. Remember, though, that with the knowledge carried over from the previous, now-deferred cohort, the target is 10 hours a week. If a week needs more time than that, it's important to seek help early.
Course Outline
Here is the outline copy. Here are the due dates for each module outlined.
Career Guidance
Preparing a Data Engineering Portfolio
In this section, you will learn how to create a professional portfolio for the data engineering industry. This week, your learning facilitator will host the second in a series of mentorship workshops, designed to introduce you to the broader field of data engineering. The Industry Guidance mentorship workshop will discuss the following topics:
- Developing an industry portfolio and sharing examples of successful portfolios
- Incorporating your work from this program into your GitHub portfolio to share with prospective employers
- How to approach industry-specific case interviews or technical evaluations
Here are some additional resources to supplement this module’s workshop focus:
- Facebook, Inc. “Build optimized websites quickly, focus on your content.” Docusaurus. 2022. https://docusaurus.io/.
- GitHub. “Data-engineering-pipeline.” GitHub. https://github.com/topics/data-engineering-pipeline.
- Razevedo1994. “Data-engineering.” GitHub. https://github.com/razevedo1994/data-engineering.
- Sspaeti. “Building a Data Engineering Project in 20 Minutes.” Sspaeti. 9 March 2021. https://sspaeti.com/blog/data-engineering-project-in-twenty-minutes/.
- Vanhack Admin. “How to Build a GitHub Portfolio & Get Noticed by Recruiters.” Vanhack. 14 Dec. 2021.
After you have read through some of these resources, consider the following in your discussion post:
- How can you develop a strong industry portfolio?
- What content do you want to include in that portfolio?
- How can you incorporate your work from this program into your GitHub portfolio to share with prospective employers?
- How can you use your portfolio to assist you with industry-specific case interviews or technical evaluations?
- Finally, find some examples online of strong data engineering portfolios and share them with your peers.
Module 0: Course Orientation
Notes Links
- Note-taking Strategies
- Notetaking Strategies for Lectures
- Time Management Strategies Notes
- Time Blocking Strategies Notes
Key Activities
- Course Introduction
- Learning Platform Overview
- Introduce Yourself
- Course Agreement
- Install Tools Needed for Modules 1-3
Module 1: Introduction to Python
Notes Links
Learning Outcomes
- Starts: 2022-12-07
- Due: 2022-12-14
- Compare Python basic data types and operators.
- Create basic Python data types in a coding environment.
- Identify lists, tuples, sets, and dictionaries in Python.
- Create Python lists, tuples, sets, and dictionaries in a coding environment.
- Use indexing and slicing in Python.
- Interpret memory allocation for Python objects.
- Define loops and conditionals in a Python coding environment.
- Integrate loops and conditionals in a Python coding environment.
- Define Python functions and variable scope.
- Use Python functions in a coding environment.
- Interpret Python classes.
- Read and write files in Python.
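Several of the outcomes above (collection types, indexing, slicing, loops, and conditionals) fit in one short example. This is only an illustrative sketch: the variable names and the hours values are made up, and the date tuple simply reuses the due date listed above.

```python
# The four built-in collection types named in the outcomes above.
modules = ["Python", "NumPy", "pandas", "SQL"]  # list: ordered, mutable
due = (2022, 12, 14)                            # tuple: ordered, immutable
tools = {"git", "docker", "git"}                # set: duplicates collapse
hours = {"Python": 15, "NumPy": 15}             # dict: key -> value (made-up numbers)

first = modules[0]      # indexing starts at 0
last = modules[-1]      # negative indices count from the end
middle = modules[1:3]   # slice: start inclusive, stop exclusive

# Loop with a conditional: keep only the modules that have an hours entry.
tracked = []
for name in modules:
    if name in hours:
        tracked.append(name)

print(first, last, middle, len(tools), tracked)
```

Because the slice stop index is exclusive, `modules[1:3]` returns two items rather than three.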
Key Activities
- Discussions
- Activities
- Knowledge Checks
- Coding Assignment
Module 2: Introduction to NumPy
Notes on Topic
Learning Outcomes
- Create NumPy arrays, functions, and multidimensional arrays.
- Define NumPy arrays, functions, and multidimensional arrays.
- Interpret NumPy memory allocation.
- Describe basic probability concepts.
- Explain the connection between histograms and probability densities.
- Differentiate between discrete and continuous distributions.
- Define probability density functions and probability distribution functions.
- Create discrete and continuous distributions.
- Define Matplotlib graphs.
- Visualize data using Matplotlib graphs.
- Interpret data using Matplotlib graphs.
Module 3: Introduction to Pandas
Learning Outcomes
- Define pandas series and dataframes
- Implement pandas series and dataframes
- Perform data cleaning in pandas
- Prepare data using one-hot encoding in pandas
- Explain time and date functionality in pandas
- Analyze data in pandas
- Design dataframes in pandas
Note Links
Module 4: Databases & Intro to SQL
- Notes on topic
Module 5: Databases with SQL Statements
Notes on Topic
Key Activities
- Discussions
- Activities
- Knowledge Checks
- Coding Assignment
Outcomes
- Outline big data and database systems.
- Design databases conceptually and formally.
- Interpret database components.
- Correlate databases.
- Interpret cardinality and normalization of tables.
- Design physical components of databases.
- Define a database in a coding environment.
- Manipulate a database in a coding environment.
- Explain database data types and indexing.
- SQL Tutorial - Full Database Course for Beginners (from FreeCodeCamp)
- Workbench Files (from mysql.com)
- MySQL Workbench Tutorial (on YouTube)
- MySQL Workbench Video Walkthrough (by Telusko on YouTube)
Module 6: Database Analysis and the Client Server Interface
Notes on Topic
- Course materials
- SQL Notes
- Exploratory Data Analysis (EDA) in SQL
- Visualizing Data in SQL
- Cleaning Data in SQL
- Dates & Time in SQL
- Client Server Architecture Overview
Key Activities
- Discussions: 2
- Activities: 5
- Self Study Drag & Drop: 2
- Knowledge Checks: 7
- Coding Assignment: 1
- Video Lectures: 25
- Mini Lessons: 5
- Estimated 17.5hrs to complete
Time Log
- 23-01-26: 4.5hrs
Outcomes
- Write functional queries to explore a database.
- Analyze the structure of a database.
- Create visualizations of data using histograms in SQL.
- Clean a dataset in SQL.
- Handle date and time in SQL.
- Define the client-server interface.
- Read and write tables using a driver.
- Discriminate between RDBMS and in-memory databases.
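To illustrate the "read and write tables using a driver" outcome above, here is a minimal sketch using Python's built-in `sqlite3` driver. SQLite is an assumption, chosen only because it ships with Python; the module itself may use a different database and driver. The `listings` table and its rows are made up for the example.

```python
import sqlite3

# ":memory:" creates a throwaway database held in RAM for the life of the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, city TEXT, price REAL)")

# Write: parameter placeholders (?) keep values separate from the SQL text.
rows = [("Boston", 650000.0), ("Austin", 480000.0), ("Boston", 720000.0)]
conn.executemany("INSERT INTO listings (city, price) VALUES (?, ?)", rows)
conn.commit()

# Read: a small exploratory query, grouped and ordered by city.
summary = conn.execute(
    "SELECT city, COUNT(*), AVG(price) FROM listings GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(summary)
```

The connect, execute, commit, fetch pattern follows Python's DB-API, so drivers for other databases generally expose a similar shape, though placeholder syntax can differ between them.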
Module 7: A Model to Predict Housing Prices
Due Date: 4:29 PM UTC February 8, 2023. Available for late submission until: February 22, 2023
Notes on Topic
Key Activities
- Discussions: 4
- Activities: 0
- Self Study Drag & Drop: 0
- Knowledge Checks: 3
- Coding Assignment: 1 (PROJECT)
- Video Lectures: 6 LONG LECTURES
- Mini Lessons: 0
- Estimated 18hrs to complete
- Divided by 7 days & 40% overshoot = 4hrs/day
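The daily figure above can be reproduced with a short calculation. This is a minimal sketch that assumes the "40% overshoot" is a safety margin multiplied onto the estimate and that the result is rounded up to a whole hour; the function name `daily_hours` is hypothetical.

```python
import math


def daily_hours(estimated_hours: float, days: int = 7, overshoot: float = 0.40) -> int:
    """Spread a weekly estimate over `days`, pad it by `overshoot`, and round up."""
    padded_per_day = estimated_hours / days * (1 + overshoot)
    return math.ceil(padded_per_day)


print(daily_hours(18))  # 18 / 7 * 1.4 is about 3.6, which rounds up to 4
```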
Outcomes
- Describe how descriptive statistics are used in Python.
- Explain central limit theorem and correlation.
- Describe how to calculate a linear regression.
- Write Markdown syntax.
- Build a prediction model using linear regression.
Module 8: ETL, Analysis, Visualization
Due Date: 4:29 PM UTC February 15, 2023. Available for late submission until: February 22, 2023
Notes on Topic
Key Activities
- Discussions: 4
- Activities: 0
- Self Study Drag & Drop: 0
- Knowledge Checks: 3
- Coding Assignment: 1 (PROJECT)
- Video Lectures: 6 LONG LECTURES
- Mini Lessons: 0
- Estimated 18hrs to complete
- Divided by 7 days & 40% overshoot = 4hrs/day
Outcomes
- Describe how descriptive statistics are used in Python.
- Explain central limit theorem and correlation.
- Describe how to calculate a linear regression.
- Write Markdown syntax.
- Build a prediction model using linear regression.
Module 9: GitHub & Advanced Python
Notes on Topic
- Module 9 Materials
- VS Code
- Git
- GitHub
- Python: Classes
- Python: Advanced Functions
- Python: Decorators
- Python: Wrappers
Key Activities
- Discussions: 1
- Activities: 6
- Self Study: 2
- Knowledge Checks: 4
- Coding Assignment: 1
- Video Lectures: 90 minutes
- Mini Lessons: 0
Outcomes
- Debug Python code.
- Use GitHub for version control.
- Create a portfolio using GitHub Pages.
- Implement Python classes.
- Write code using advanced Python functions.
- Utilize Python decorators and wrappers.
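As a sketch of the decorator and wrapper outcome above, the example below wraps a function to record how long each call takes. The names `timed`, `total`, and the `last_elapsed` attribute are hypothetical and not taken from the course material.

```python
import functools
import time


def timed(func):
    """Decorator that records how long the most recent call to `func` took."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper


@timed
def total(values):
    """Sum an iterable of numbers."""
    return sum(values)


print(total(range(1000)), total.__name__)
```

Without `functools.wraps`, `total.__name__` would report `wrapper`, which makes debugging and documentation harder.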
Module 10: Networks
Outcomes
- Learn about how computer networks work
- HTTP
- Postman
- Strapi
- API
Notes on Topic
- CLI: Command Line Interface
- GNU CoreUtils
- Computer Networks
- HTTP: Hypertext Transfer Protocol
- HTTP Headers
- Software Containers
- Docker
- VS Code
- Postman
- Swagger
Module 11: Client Server Architecture
Note Links
Outcomes
In this module these topics will be covered:
- Cookies & session cookies
- How session cookies protect API (application programming interface) routes
- How Swagger can be used to detail an API
- Developing a Swagger interface
- Writing a Flask server
- Handling security tokens
- Kerberos, to understand the need for security tokens
- PKI (public key infrastructure)
- Signing documents using private keys
- Passing a public key into GitHub
The most difficult part of this section is correctly generating secure tokens for authentication. Getting it wrong can mean losing access to data or, worse, leaking data to an attacker.
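A minimal sketch of the token-generation point above, using only Python's standard library. It assumes a simple opaque session token; the function names `new_session_token` and `tokens_match` are hypothetical, and the module's own approach may differ.

```python
import hmac
import secrets


def new_session_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token drawn from the OS's secure random source."""
    return secrets.token_urlsafe(num_bytes)


def tokens_match(presented: str, stored: str) -> bool:
    """Compare tokens in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode(), stored.encode())


token = new_session_token()
print(tokens_match(token, token))                # True
print(tokens_match(token, new_session_token()))  # False
```

The `secrets` module is used rather than `random` because `random` is not designed for security-sensitive values.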
Module 12: Types of Databases & Database Containerization
Due Date
- Due Wednesday, March 22, 2023 at 4:29 PM UTC
Note Links
- Python
- Types of Databases
- Relational Databases
- Document Databases
- MongoDB Using Python
- Key-Value Databases
- Distributed Databases
- Cassandra (Distributed Database)
Outcomes
- Describe applications of various types of databases.
- Identify key concepts related to database containerization.
- Update and delete data in different types of containerized databases.
- Identify key concepts related to different types of databases.
References
- MIT xPRO Emeritus Certification Programs Homepage
- University of British Columbia: How to take rock-solid notes for online lectures
- MIT PCDE Pro Slack Channel
Web References
- Facebook, Inc. “Build optimized websites quickly, focus on your content.” Docusaurus. 2022. https://docusaurus.io/.
- GitHub. “Data-engineering-pipeline.” GitHub. https://github.com/topics/data-engineering-pipeline.
- Razevedo1994. “Data-engineering.” GitHub. https://github.com/razevedo1994/data-engineering.
- Sspaeti. “Building a Data Engineering Project in 20 Minutes.” Sspaeti. 9 March 2021. https://sspaeti.com/blog/data-engineering-project-in-twenty-minutes/.
- Vanhack Admin. “How to Build a GitHub Portfolio & Get Noticed by Recruiters.” Vanhack. 14 Dec. 2021.
Notes Links
- Note-taking Strategies
- Notetaking Strategies for Lectures
- Time Management Strategies Notes
- Time Blocking Strategies Notes
- Introduction to Python Notes
- Introduction to Python
- PCDE Course: Module 2 Content
- Mathematical Probability Overview
- NumPy: Numerical Python Library
- Matplotlib: Python Plotting Library
- Normal Distribution
- PCDE Course: Module 3 Content
- Pandas: Python Dataframes & Data Manipulation
- PCDE Course Materials (Module 5)
- SQL Overview
- Logical Operators in SQL
- Regular Expressions (RegEx)
- PCDE Course Module 6: Database Analysis & the Client Server Interface
- Exploratory Data Analysis in SQL
- Visualizing Data in SQL
- Cleaning Data in SQL
- Dates & Time in SQL
- Client Server Architecture Overview
- PCDE Course Module 7 Content: Model to Predict Housing Prices
- Statistics Using Python
- Markdown
- Predicting Housing Prices through Linear Regression & Python
- PCDE Course Module 8 Content: ETL, Analysis and Visualization
- PCDE Course: Module 9 Content
- VS Code
- Git
- GitHub
- Python: Classes
- Python: Advanced Functions
- Python: Decorators
- Python: Wrappers
- PCDE Course: Module 10 Content
- CLI: Command Line Interface
- GNU CoreUtils
- Computer Networks
- HTTP: Hypertext Transfer Protocol
- HTTP Headers
- Software Containers
- Docker
- VS Code
- Postman
- Swagger
- PCDE Course: Module 11 Content
- Flask
- Cookies
- OAuth2
- PCDE Course: Module 12 Content
- Python
- Types of Databases
- Relational Databases
- Document Databases
- MongoDB Using Python
- Key-Value Databases
- Distributed Databases
- Cassandra (Distributed Database)