Introduction

This Jupyter Book is about using Python to obtain data for your data science or machine learning projects. It was written especially for those starting out as Data Engineers or ML Engineers who want to learn how to generate test data, where to find real data, and how to extract and prepare it for analysis and modeling.

The book starts with the simplest way to obtain data: generating it with a random function. Most of the time, we have some idea of what we want to do with the data, but we just don’t have the data yet. Sometimes reading real-life data is too time-consuming when we only want to test a theory or visualise a pattern. Learning how to generate data in only a few lines of code can be a major time-saver, and it is a skill I would recommend every Data Engineer have in their toolbox.
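
To give a taste of what "a few lines of code" can look like, here is a minimal sketch that uses only Python's built-in `random` module. The function name `generate_points` and its parameters (`slope`, `intercept`, `noise`, `seed`) are my own illustrative choices, not something defined elsewhere in this book. It produces noisy points scattered around a straight line, which is often enough to try out a plot or a simple model.

```python
import random


def generate_points(n, slope=2.0, intercept=1.0, noise=0.5, seed=42):
    """Return n (x, y) pairs scattered around y = slope * x + intercept."""
    rng = random.Random(seed)  # a seeded generator keeps the data reproducible
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)                           # x drawn uniformly from [0, 10]
        y = slope * x + intercept + rng.gauss(0, noise)  # add Gaussian noise to y
        points.append((x, y))
    return points


points = generate_points(5)
for x, y in points:
    print(f"{x:.2f}, {y:.2f}")
```

Fixing the seed means the same "random" data comes back on every run, which makes experiments easier to repeat and debug.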

Next, we’ll go through some common techniques for reading from standard data files like CSV and JSON, as well as from databases like PostgreSQL and MongoDB. There’s also a little web scraping, to show you how to retrieve data from websites and save it for later use in your project.
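
As a small preview of the file-reading part, the sketch below parses the same records from CSV and from JSON using only the standard library. The sample records are made up for illustration, and they are held in strings (via `io.StringIO`) rather than real files so that the snippet runs on its own. The later chapters may well use other tools, so treat this as one possible approach rather than the book's method.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this would come from files on disk.
csv_text = "name,score\nana,90\nben,85\n"
json_text = '[{"name": "ana", "score": 90}, {"name": "ben", "score": 85}]'

# csv.DictReader yields one dict per row; every value is read as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# json.loads keeps the types encoded in the JSON itself (here, integers).
json_rows = json.loads(json_text)

print(csv_rows[0]["score"], type(csv_rows[0]["score"]).__name__)
print(json_rows[0]["score"], type(json_rows[0]["score"]).__name__)
```

The difference in types is worth noticing early: CSV values arrive as text and usually need converting before analysis, while JSON preserves numbers as numbers.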

The latter half of this Jupyter Book focuses on public datasets and on the dataset APIs of common machine learning libraries like Scikit-learn and TensorFlow. These APIs typically include handy functions for filtering data and even partitioning it for training and testing.

My hope is that this Jupyter Book helps readers realise the importance of strong foundational skills in data generation, extraction, and preparation on their journey to creating data pipelines or machine learning models.

I’ve learned a lot myself while writing this Jupyter Book! I hope that you do too.

Cheers,

Shiela


Topics in this Book

  • Generating data

    • Random

    • NumPy

    • Scikit-learn

  • Local data files

    • CSV files

    • Excel files

  • Databases

    • SQL databases

    • Redis

    • MongoDB

    • BigQuery

  • Web and APIs

    • HTTP requests

    • Scikit-learn

    • Seaborn

    • TensorFlow
