Mastering Spark for Data Science

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

Andrew Morgan et al.


Book Details

ISBN 13: 9781785882142
Paperback: 560 pages

Book Description

Data science seeks to transform the world using data, typically by disrupting and changing real processes in real industries. To operate at this level, you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs.

This book takes a deep dive into using Spark to deliver production-grade data science solutions. The process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.

You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Table of Contents

Chapter 1: The Big Data Science Ecosystem
  • Introducing the Big Data ecosystem
  • Overall architecture
  • Data technologies
  • Companion tools
  • Summary
Chapter 2: Data Acquisition
  • Data pipelines
  • Content registry
  • Quality assurance
  • Summary
Chapter 3: Input Formats and Schema
  • A structured life is a good life
  • GDELT dimensional modeling
  • Loading your data
  • Avro
  • Parquet
  • Summary
Chapter 4: Exploratory Data Analysis
  • The problem, principles and planning
  • Preparation
  • Exploring GDELT
  • Summary
Chapter 5: Spark for Geographic Analysis
  • GDELT and oil
  • Formulating a plan of action
  • GeoMesa
  • Gauging oil prices
  • Summary
Chapter 6: Scraping Link-Based External Data
  • Building a web scale news scanner
  • Named entity recognition
  • GIS lookup
  • Names de-duplication
  • News index dashboard
  • Summary
Chapter 7: Building Communities
  • Building a graph of persons
  • Using the Accumulo database
  • Community detection algorithm
  • GDELT dataset
  • Summary
Chapter 8: Building a Recommendation System
  • Different approaches
  • Uninformed data
  • Building a song analyzer
  • Building a recommender
  • Summary
Chapter 9: News Dictionary and Real-Time Tagging System
  • The mechanical Turk
  • Designing a Spark Streaming application
  • Consuming data streams
  • Processing Twitter data
  • Fetching HTML content
  • Using Elasticsearch as a caching layer
  • Classifying data
  • Our Twitter mechanical Turk
  • Summary
Chapter 10: Story De-duplication and Mutation
  • Detecting near duplicates
  • Building stories
  • Story mutation
  • Summary
Chapter 11: Anomaly Detection on Sentiment Analysis
  • Following the US elections on Twitter
  • Analysing sentiment
  • Using Timely as a time series database
  • Twitter and the Godwin point
  • A Small Step into sarcasm detection
  • Summary
Chapter 12: TrendCalculus
  • Studying trends
  • The TrendCalculus algorithm
  • Practical applications
  • Summary
Chapter 13: Secure Data
  • Data security
  • Authentication and authorization
  • Access
  • Encryption
  • Data disposal
  • Kerberos authentication
  • Security ecosystem
  • Your Secure Responsibility
  • Summary
Chapter 14: Scalable Algorithms
  • General principles
  • Spark architecture
  • Challenges
  • Plotting your course
  • Design patterns and techniques
  • Summary

What You Will Learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • See how commercial data scientists design scalable, reusable code for data science services
  • Explore cutting-edge data science methods so that you can study trends and causality
  • Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
  • Find out how Spark can be used as a universal ingestion engine and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams
  • Study advanced Spark concepts, solution design patterns, and integration architectures
  • Demonstrate powerful data science pipelines

