Apache Spark for Data Science Cookbook

Over 90 insightful recipes to get lightning-fast analytics with Apache Spark

Padma Priya Chitturi

Book Details

ISBN 13: 9781785880100
Paperback: 392 pages

Book Description

Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark’s selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle with ease the complexities that come with raw, unstructured data sets.

This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to problematic concepts in data science using data science libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.

Table of Contents

Chapter 1: Big Data Analytics with Spark
Introduction
Initializing SparkContext
Working with Spark's Python and Scala shells
Building standalone applications
Working with the Spark programming model
Working with pair RDDs
Persisting RDDs
Loading and saving data
Creating broadcast variables and accumulators
Submitting applications to a cluster
Working with DataFrames
Working with Spark Streaming
Chapter 2: Tricky Statistics with Spark
Introduction
Variable identification
Sampling data
Summary and descriptive statistics
Generating frequency tables
Installing Pandas on Linux
Installing Pandas from source
Using IPython with PySpark
Creating Pandas DataFrames over Spark
Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
Implementing co-variance and correlation using Pandas
Concatenating and merging operations over DataFrames
Complex operations over DataFrames
Sparkling Pandas
Chapter 3: Data Analysis with Spark
Introduction
Univariate analysis
Bivariate analysis
Missing value treatment
Outlier detection
Use case - analyzing the MovieLens dataset
Use case - analyzing the Uber dataset
Chapter 4: Clustering, Classification, and Regression
Introduction
Supervised learning
Unsupervised learning
Applying regression analysis for sales data
Variable identification
Data exploration
Feature engineering
Applying linear regression
Applying logistic regression on bank marketing data
Variable identification
Data exploration
Feature engineering
Applying logistic regression
Real-time intrusion detection using streaming k-means
Variable identification
Simulating real-time data
Applying streaming k-means
Chapter 5: Working with Spark MLlib
Introduction
Working with Spark ML pipelines
Implementing Naive Bayes' classification
Implementing decision trees
Building a recommendation system
Implementing logistic regression using Spark ML pipelines
Chapter 6: NLP with Spark
Introduction
Installing NLTK on Linux
Installing Anaconda on Linux
Anaconda for cluster management
POS tagging with PySpark on an Anaconda cluster
NER with IPython over Spark
Implementing OpenNLP - chunker over Spark
Implementing OpenNLP - sentence detector over Spark
Implementing Stanford NLP - lemmatization over Spark
Implementing sentiment analysis using Stanford NLP over Spark
Chapter 7: Working with Sparkling Water - H2O
Introduction
Features
Working with H2O on Spark
Implementing k-means using H2O over Spark
Implementing spam detection with Sparkling Water
Deep learning with airlines and weather data
Implementing a crime detection application
Running SVM with H2O over Spark
Chapter 8: Data Visualization with Spark
Introduction
Visualization using Zeppelin
Installing Zeppelin
Customizing Zeppelin's server and websocket port
Visualizing data on HDFS - parameterizing inputs
Running custom functions
Adding external dependencies to Zeppelin
Pointing to an external Spark Cluster
Creating scatter plots with Bokeh-Scala
Creating a time series MultiPlot with Bokeh-Scala
Creating plots with the lightning visualization server
Visualize machine learning models with Databricks notebook
Chapter 9: Deep Learning on Spark
Introduction
Installing CaffeOnSpark
Working with CaffeOnSpark
Running a feed-forward neural network with DeepLearning4j over Spark
Running an RBM with DeepLearning4j over Spark
Running a CNN for learning MNIST with DeepLearning4j over Spark
Installing TensorFlow
Working with Spark TensorFlow
Chapter 10: Working with SparkR
Introduction
Installing R
Interactive analysis with the SparkR shell
Creating a SparkR standalone application from RStudio
Creating SparkR DataFrames
SparkR DataFrame operations
Applying user-defined functions in SparkR
Running SQL queries from SparkR and caching DataFrames
Machine learning with SparkR

What You Will Learn

  • Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.
  • Solve real-world analytical problems with large data sets.
  • Address data science challenges with analytical tools on a distributed system like Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale.
  • Get hands-on experience with algorithms like classification, regression, and recommendation on real datasets using the Spark MLlib package.
  • Learn about numerical and scientific computing using NumPy and SciPy on Spark.
  • Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.
