Self Challenge on Data Science for 52 weeks
Hi,
These are my personal notes for tracking the learning content I am using to build data science skills.
I will update them as I find useful resources along the way.
The course structure is taken from the 52 Weeks Curriculum to Become a Data Scientist in 2021.
Course Structure
- Statistics and Probability (Week 1 to Week 6)
- Mathematics (Week 7 to Week 12)
- SQL (Week 13 to Week 21)
- Python and Programming (Week 22 to Week 28)
- Pandas (Week 29 to Week 33)
- Visualizing Data (Week 34 to Week 35)
- Data Exploration and Preparation (Week 36 to Week 39)
- Machine Learning (Week 40 to Week 51)
- Data Science Project (Week 52)
Statistics & Probability
Why Statistics and Probability?
Data science and machine learning are essentially a modern version of statistics. By learning statistics first, you’ll have a much easier time when it comes to learning machine learning concepts and algorithms! Even though it may seem like you’re not getting anything tangible out of the first few weeks, it will be worth it in the later weeks.
Week 1: Descriptive Statistics
Week 2: Probability
- Theoretical probability
- Sample spaces
- Set operations
- Addition rule
- Multiplication rule for independent events
- Multiplication rule for dependent events
- Conditional probability and independence
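As a small illustration of the rules listed above, here is a sketch that enumerates the sample space for two fair six-sided dice and checks the addition rule, the multiplication rule for independent events, and a conditional probability by counting outcomes. The dice scenario, the event names, and the helper `prob` are assumptions chosen for illustration; they are not part of the curriculum.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair six-sided dice.
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Theoretical probability of an event (a subset of the sample space)."""
    return Fraction(len(event), len(sample_space))

# Events expressed as sets of outcomes, so set operations apply directly.
first_is_six = {o for o in sample_space if o[0] == 6}
second_is_six = {o for o in sample_space if o[1] == 6}
sum_is_seven = {o for o in sample_space if sum(o) == 7}

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(first_is_six) + prob(second_is_six) - prob(first_is_six & second_is_six)
assert p_union == prob(first_is_six | second_is_six)

# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
assert prob(first_is_six & second_is_six) == prob(first_is_six) * prob(second_is_six)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_seven_given_first_six = prob(sum_is_seven & first_is_six) / prob(first_is_six)

print(p_union)                  # 11/36
print(p_seven_given_first_six)  # 1/6
```

Using `Fraction` keeps every probability exact, which makes it easy to compare the result with a hand calculation.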
Week 3: Combinations and Permutations
Week 4: Normal Distribution and Sampling Distributions
- Normal distribution and the Empirical rule
- Introduction to Sampling Distributions
- Sampling distribution of a sample proportion
- Sampling distribution of a sample mean
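The sampling distribution of a sample mean can be explored with a short simulation. The sketch below assumes a made-up, skewed population and a sample size of 50; the names `population` and `sample_means` are hypothetical. It compares the observed spread of the sample means with the standard error `pop_sd / sqrt(n)` and applies the empirical rule to the result.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 draws from a skewed (exponential) distribution.
population = [random.expovariate(1.0) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

def sample_means(n, repeats=2_000):
    """Means of `repeats` random samples of size n, drawn with replacement."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(repeats)]

means = sample_means(n=50)

# The sampling distribution is centred on the population mean, and its
# spread (the standard error) should be close to pop_sd / sqrt(n).
observed_se = statistics.stdev(means)
expected_se = pop_sd / 50 ** 0.5

# Empirical rule check: share of sample means within 2 standard errors.
within_2se = sum(abs(m - pop_mean) <= 2 * expected_se for m in means) / len(means)

print(f"population mean {pop_mean:.3f}, mean of sample means {statistics.fmean(means):.3f}")
print(f"expected SE {expected_se:.3f}, observed SE {observed_se:.3f}")
print(f"share within 2 SE: {within_2se:.2%}")
```

Even though the population is skewed, the share of sample means within two standard errors should land near the 95% that the empirical rule predicts for a normal distribution.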
Week 5: Confidence Intervals
Week 6: Hypothesis Testing
- Introduction to Hypothesis Testing
- Error probabilities and power
- Tests about a population proportion
- Tests about a population mean
- More videos
Mathematics
Why Mathematics?
Like statistics, many data science concepts build on fundamental mathematical concepts.
In order to understand cost functions, you need to know differential calculus. In order to understand hypothesis testing, you need to understand integration. And to give one more example, linear algebra is essential to learning deep learning concepts, recommendation systems, and principal component analysis.
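To make the link between cost functions and differential calculus concrete, here is a minimal sketch of gradient descent on a mean squared error cost with a single parameter. The data, the learning rate, and the function names are assumptions made for this example only.

```python
# Hypothetical data that follows y = 3x exactly, so the best slope is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def cost(w):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cost_derivative(w):
    """d(cost)/dw, obtained with the power rule and the chain rule."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0              # initial guess for the slope
learning_rate = 0.05
for _ in range(100):
    # Step against the derivative, i.e. downhill on the cost curve.
    w -= learning_rate * cost_derivative(w)

print(round(w, 4))        # 3.0
print(round(cost(w), 6))  # 0.0
```

The derivative tells the loop which direction lowers the cost, which is why differentiation comes up so often once you reach the machine learning weeks.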
Week 7: Vectors and Spaces
- Vectors
- Linear Combinations and Spans
- Linear Dependence and Independence
- Subspaces and the basis for a subspace
Week 8: Dot Product and Matrix Transformations pt. 1
- Vector dot and cross products
- Functions and Linear Transformations
- Transformations and Matrix Multiplications
Week 9: Matrix Transformations pt. 2
Week 10: Eigenvalues and Eigenvectors
- Eigenvalues and Eigenvectors
- Anything that you couldn’t finish in the past few weeks!
Week 11: Integrals
- Approximation with Riemann Sums
- Definite Integrals with Riemann Sums
- The Fundamental Theorem of Calculus and Accumulation Functions
- Properties of Definite Integrals
Week 12: Integrals Part 2!
- The Fundamental Theorem of Calculus and Definite Integrals
- Reverse Power Rule
- Indefinite Integrals of Common Functions
- Definite Integrals of Common Functions
SQL
Why SQL?
SQL is arguably the most important skill to learn across any type of data-related profession, whether you’re a data scientist, data engineer, data analyst, business analyst, the list goes on.
At its core, SQL is used to extract (or query) specific data from a database, so that you can do things like analyze the data, visualize the data, model the data, etc. Therefore, developing strong SQL skills will allow you to take your analyses, visualizations, and modeling to the next level because you will be able to extract and manipulate the data in advanced ways.
I came across Mode’s curriculum a while back and it is fantastic! So I would first get familiar with using SQL in Mode and then you’ll be able to go through the topics below!
Week 13: Basic SQL
Week 14: LOGICAL and COMPARISON Operators
Week 15: AGGREGATES
- Aggregate Functions (COUNT, SUM, MIN/MAX, AVG)
- GROUP BY clause
- HAVING clause
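The aggregate topics above can be tried without installing a database by using Python's built-in `sqlite3` module. The sketch below assumes a made-up `orders` table; the table name, columns, and values are illustrative only. Other SQL dialects may differ slightly in syntax.

```python
import sqlite3

# In-memory database with a small, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("ben", 10.0),
     ("cho", 50.0), ("cho", 70.0), ("cho", 5.0)],
)

query = """
    SELECT customer,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total,
           AVG(amount) AS average
    FROM orders
    GROUP BY customer          -- one output row per customer
    HAVING SUM(amount) > 50    -- filters groups, not individual rows
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Note that `HAVING` runs after grouping, so the customer whose total is 10 is dropped as a whole group, whereas a `WHERE` clause would have filtered individual orders before aggregation.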
Week 16: DISTINCT, CASE WHEN
Week 17: JOINS and UNIONS
Week 18: Subqueries and Common Table Expressions
Week 19: String Manipulations
- String Functions in SQL (LEFT/RIGHT, TRIM, STRPOS, SUBSTR, CONCAT, UPPER/LOWER, etc…)
Week 20: Date-time manipulation
- EXTRACT
- DATE_ADD()
- DATE_SUB()
- DATE_DIFF()
- See here for more functions (on the left of the webpage)
Week 21: Window Functions
- Window Functions (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG, LEAD, SUM, COUNT, AVG)
- See here for advanced window functions.
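Here is a hedged sketch of a few of these functions, run through Python's built-in `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions) and uses a made-up `sales` table whose names and values are illustrative only.

```python
import sqlite3

# In-memory database with a small, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, month INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("ana", 1, 100), ("ana", 2, 150), ("ana", 3, 120),
     ("ben", 1, 200), ("ben", 2, 200), ("ben", 3, 90)],
)

query = """
    SELECT rep,
           month,
           revenue,
           -- rank each rep's months by revenue; ties share a rank
           RANK() OVER (PARTITION BY rep ORDER BY revenue DESC) AS revenue_rank,
           -- revenue from the same rep's previous month (NULL for the first)
           LAG(revenue) OVER (PARTITION BY rep ORDER BY month) AS prev_revenue,
           -- cumulative revenue per rep up to the current month
           SUM(revenue) OVER (PARTITION BY rep ORDER BY month) AS running_total
    FROM sales
    ORDER BY rep, month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Unlike `GROUP BY`, a window function keeps every input row and adds a value computed over the rows in its partition, which is why all six rows come back.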
Python and Programming
Why Python?
I started with Python, and I’ll probably stick with Python for the rest of my life. It’s so far ahead in terms of open source contributions, and it’s straightforward to learn. Feel free to go with R if you want, but I have no opinions or advice to provide regarding R.
Week 22: Introduction to Python
Week 23: Lists, Tuples, Functions, Conditional Statements, Comparisons
Week 24: Dictionaries, Loops, Comments
Week 25: Try/Except, Reading & Writing files, Classes and Objects
Week 26: Recursion
Week 27: Binary Trees
Week 28: APIs and Anaconda
Pandas
Why Pandas?
Arguably the most important library to know in Python is Pandas, which is specifically meant for data manipulation and analysis.
Week 29: Getting and Knowing your data
Week 30: Filtering and Sorting
Week 31: Grouping
Week 32: Apply
Week 33: Merge
Visualizing Data
Why Data Visualizations?
The ability to visualize data and insights is so important because it’s the easiest way to communicate intricate information, and a lot of it, at once. As a data scientist, you’re always selling yourself and your ideas, whether you’re pitching a new project or convincing others that your model should be put into production. Data visualizations are a great tool to help you with that.
There are dozens of data visualization libraries out there, but I’m going to focus on two: Matplotlib and Plotly.
Week 34: Data Visualizations with Matplotlib
- Introduction to Matplotlib
- 3-D Visualizations in Matplotlib
- Types of Data Visualizations in Matplotlib
- Cheatsheet
Week 35: Data Visualizations with Plotly
- Types of Visualizations in Plotly (beginner)
- Types of Visualizations in Plotly (beginner and advanced)
Data Exploration and Preparation
Why Data Exploration and Preparation?
“Garbage in, garbage out”
The models that you create can only be as good as the data that you feed into them. To understand the state of your data, i.e. whether it’s “good” or not, you have to explore and prepare it. Therefore, for the next four weeks, I’m going to provide several amazing resources for you to go through and get a better understanding of what data exploration and preparation entail.
Week 36: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) can be difficult because there’s no one set way of doing it — but that’s also what keeps it exciting. Generally, you want to…
- Derive descriptive statistics (e.g. central tendency)
- Perform univariate analysis (distributions and spread)
- Perform multivariate analysis (scatterplots, correlation matrix, predictive power score, etc.)
- Check for missing data and check for outliers
Check out a beginner’s guide to EDA here.
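As a rough sketch of the checklist above, the following uses only Python's standard library on a single made-up column. The `ages` values are hypothetical, and the 1.5 * IQR rule is one common convention for flagging outliers, not the only one. In practice you would likely do the same steps with Pandas.

```python
import statistics

# Hypothetical column of ages; None marks a missing value.
ages = [23, 25, 31, None, 29, 27, 95, 30, None, 26]

# Check for missing data.
present = [a for a in ages if a is not None]
n_missing = len(ages) - len(present)

# Descriptive statistics: central tendency and spread.
mean = statistics.fmean(present)
median = statistics.median(present)
stdev = statistics.stdev(present)

# Outlier check with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [a for a in present if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print(f"missing: {n_missing}, mean: {mean:.2f}, median: {median}, stdev: {stdev:.2f}")
print(f"outliers: {outliers}")
```

The gap between the mean and the median is itself a hint that something is pulling the distribution to one side, which the outlier check then confirms.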
Week 37: Data Preparation: Feature Imputation and Normalization
- What is Feature Imputation?
- 6 ways to impute missing data
- Normalization vs Standardization
- Example of implementing normalization vs standardization
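The difference between the two rescaling approaches is easiest to see side by side. This is a minimal sketch with made-up values and hypothetical helper names `normalize` and `standardize`; it assumes the values are not all identical, since both formulas would otherwise divide by zero.

```python
import statistics

# Hypothetical feature values.
values = [10.0, 20.0, 30.0, 40.0, 50.0]

def normalize(xs):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization (z-score): shift to mean 0, scale to standard deviation 1."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)  # population standard deviation
    return [(x - mean) / sd for x in xs]

normalized = normalize(values)
standardized = standardize(values)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardized])
```

Normalization bounds the output but is sensitive to extreme values, because a single outlier sets `lo` or `hi`. Standardization has no fixed bounds but expresses each value as a distance from the mean in standard deviations.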
Week 38: Feature Engineering and Feature Selection
Week 39: Imbalanced Datasets
- An Introduction to Imbalanced Classification Problems
- The Right Way to Oversample in Predictive Modeling
Machine Learning
Why Machine Learning?
Everything that you’ve learned has led up to this point! Not only is machine learning interesting and exciting, but it is also a skill that all data scientists have. It’s true that modeling makes up a small portion of a data scientist’s time, but it doesn’t take away from its importance.
Later in your career, you might notice that I left out some machine learning algorithms, like K Nearest-Neighbors, Gradient Boost, and CatBoost. This is completely intentional — if you can understand the following machine learning concepts, you’ll have the skills to learn any other machine learning algorithms in the future.
Week 40: Introduction to Machine Learning
Week 41: Linear Regression
- Linear Models: Linear Regression
- Linear Models: Multiple Regression
- Mathematics behind linear regression
Week 42: Logistic Regression
- Introduction to Logistic Regression
- Part 1: Coefficients
- Part 2: Maximum likelihood
- Part 3: R-squared and P-value
Week 43: Regularization
Week 44: Decision Trees
- Decision Trees Introduction
- Feature Selection and Missing Data
- Implementing a Decision Tree in Python
Week 45: Naïve Bayes
Week 46: Support Vector Machines
- Intuition of Support Vector Machines
- Support Vector Machines in Python
- A mathematical explanation of Support Vector Machines
Week 47: Clustering
Week 48: Principal Component Analysis
- Principal Component Analysis (PCA) step-by-step
- Another detailed explanation by Luis Serrano (I highly suggest you watch both)
- Mathematical explanation of PCA
Week 49: Bootstrap Sampling, Bagging, and Boosting
Week 50: Random Forests and Boosted Trees
- Random Forests pt.1
- Random Forests pt.2
- XGBoost — Regression
- XGBoost — Classification
- XGBoost — Mathematical Details
- XGBoost in Python
Week 51: Model Evaluation Metrics
- Evaluation Metrics with Python Code
- Understanding the confusion matrix and how to implement it in Python
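For a binary classifier, the confusion matrix and the metrics derived from it can be computed by hand in a few lines. The labels and predictions below are made up for illustration; libraries such as scikit-learn provide equivalent functions, but this sketch sticks to the standard library.

```python
from collections import Counter

# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Count each (actual, predicted) pair.
counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]  # true positives
fn = counts[(1, 0)]  # false negatives
fp = counts[(0, 1)]  # false positives
tn = counts[(0, 0)]  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On an imbalanced dataset, accuracy alone can look good while recall is poor, which is why it helps to read the four counts before trusting any single metric.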
Week 52: Data Science Project
If you feel comfortable with the materials above, you’re definitely ready to start your own data science project! Just in case, I’ve provided three ideas that you can use as inspiration to get started, but feel free to do whatever you like.
Idea 1: SQL Case Study
The objective of this case is to determine the cause for a drop in user engagement for a social network called Yammer. Before diving into the data, you should read the overview of what Yammer does here. There are 4 tables that you should work with.
The link to the case above will provide you with much more detail pertaining to the problem, the data, and the questions that should be answered.
Check out how I approached this case study here if you’d like guidance.
Skills You’ll Develop
- SQL
- Data Analysis
- Data Visualization if you choose to visualize your insights.
Idea 2: Trustpilot Webscraper
Webscraping is simple to learn and extremely useful, especially when it comes to collecting data for personal projects. Scraping a customer review website, like Trustpilot, is valuable for a company because it lets them understand review trends (getting better or worse) and see what customers are saying via NLP.
First I would get familiar with how Trustpilot is organized, and decide upon which kinds of businesses to analyze. Then I would take a look at this walkthrough of how to scrape Trustpilot reviews.
Skills You’ll Develop
- Writing Python Scripts
- Data Wrangling
- BeautifulSoup/Selenium (webscraping libraries)
- Data Analysis
- Take it further and apply NLP to extract insights from reviews.
Idea 3: Titanic Machine Learning Competition
In my opinion, there’s no better way of showing that you’re ready for a data science job than to showcase your code through competitions. Kaggle hosts a variety of competitions that involve building a model to optimize a certain metric, one of them being the Titanic Machine Learning Competition.
If you want to get some inspiration and guidance, check out this step-by-step walkthrough of one of the solutions.
Skills You’ll Develop
- Data Exploration and Cleaning with Pandas
- Feature Engineering
- Machine Learning Modelling
With regards, Subham (@webobite)