MySQL and Hadoop

MySQL SF Meetup 2012
Chris Schneider

About Me
  • Chris Schneider, Data Architect @ Ning.com (a Glam Media Company)
  • Spent the last ~2 years working with Hadoop (CDH)
  • Spent the last 10 years building MySQL architecture for multiple companies
  • chriss@glam.com

What we’ll cover
  • Hadoop
  • CDH
  • Use cases for Hadoop
  • MapReduce
  • Sqoop
  • Hive
  • Impala

What is Hadoop?
  • An open-source framework for storing and processing data on a cluster of servers
  • Based on Google’s white papers on the Google File System (GFS) and MapReduce
  • Scales linearly
  • Designed for batch processing
  • Optimized for streaming reads

The Hadoop Distribution
  • Cloudera
      • The only distribution for Apache Hadoop

What Cloudera Does
  • Cloudera Manager
  • Enterprise Training
      • Hadoop Admin
      • Hadoop Development
      • HBase
      • Hive and Pig
  • Enterprise Support

Why Hadoop
  • Volume
      • Use Hadoop when you cannot or should not use a traditional RDBMS
  • Velocity
      • Can ingest terabytes of data per day
  • Variety
      • You can have structured or unstructured data

Use cases for Hadoop
  • Recommendation engine
      • Netflix recommends movies
  • Ad targeting, log processing, search optimization
      • eBay, Orbitz
  • Machine learning and classification
      • Yahoo Mail’s spam detection
      • Financial: identity theft and credit risk
  • Social graph
      • Facebook, LinkedIn and eHarmony connections
  • Predicting the outcome of an election before the election: 50 out of 50 states correct, thanks to Nate Silver!

Some Details about Hadoop
  • Two main pieces of Hadoop
  • Hadoop Distributed File System (HDFS)
      • Distributed and redundant data storage using many nodes
      • Hardware will inevitably fail
  • Read and process data with MapReduce
      • Processing is sent to the data
      • Many “map” tasks each work on a slice of the data
      • Failed tasks are automatically restarted on another node or replica
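
A minimal Python sketch of the map, group and reduce steps, run in a single process rather than on a cluster. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for illustration and are not part of any Hadoop API; on a real cluster the framework does the grouping and runs many map and reduce tasks in parallel.

```python
from collections import defaultdict


def map_words(offset, line):
    """Map step: emit (word, 1) for every word in the line.

    The key is the byte offset of the line; word count does not need it.
    """
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for what the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the list of ones collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    mapped = []
    offset = 0
    for line in lines:
        mapped.extend(map_words(offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is totally cool and big"]))
```

Because each map call only sees its own line and each reduce call only sees one key, both steps can be spread across nodes and rerun independently when a task fails.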

MapReduce Word Count
  • The key and value together represent a row of data, where the key is the byte offset and the value is the line

    map (key, value)
        foreach (word in value)
            output (word, 1)

  • Map is used for searching
  • Input row (key, value): 64, big data is totally cool and big…
  • Intermediate output of the map (on local disk):

    big, 1
    data, 1
    is, 1
    totally, 1
    cool, 1
    and, 1
    big, 1

  • Reduce is used to aggregate
  • Input to the reduce:

    big, (1,1)
    data, (1)
    is, (1)
    totally, (1)
    cool, (1)
    and, (1)

  • Output of the reduce:

    big, 2
    data, 1
    is, 1
    totally, 1
    cool, 1
    and, 1

  • Hadoop aggregates the keys and calls a reduce for each unique key… e.g. GROUP BY, ORDER BY

    reduce (key, list)
        sum the list
        output (key, sum)

Where does Hadoop fit in?
  • Think of Hadoop as an augmentation of your traditional RDBMS system
  • You want to store years of data
  • You need to aggregate all of the data over many years’ time
  • You want/need ALL your data stored and accessible, not forgotten or deleted
  • You need this to be free software running on commodity hardware

Where does Hadoop fit in?
  • [Diagram. Components shown: Tableau (business analytics), Hive, Pig, http, Flume, MySQL servers, Sqoop or ETL, and Hadoop (CDH4) with NameNode, NameNode2, SecondaryNameNode, JobTracker and DataNodes]

Data Flow
  • MySQL is used for OLTP data processing
  • An ETL process moves data from MySQL to Hadoop
      • Cron job: Sqoop
      • OR
      • Cron job: custom ETL
  • Use MapReduce to transform data, run batch analysis, join data, etc.
  • Export transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report
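
One way the cron-driven Sqoop step could look, as a hedged sketch: a small Python script that assembles the `sqoop import` command for a single table and writes each day's import to its own HDFS directory. The helper names, the `/etl` base directory and the one-directory-per-day layout are assumptions made for illustration, not something the talk prescribes; the flags used (`--connect`, `--table`, `--target-dir`, `--fields-terminated-by`, `--num-mappers`) are standard Sqoop import arguments.

```python
import shlex
import subprocess
from datetime import date


def build_sqoop_import(jdbc_url, table, base_dir, run_date, num_mappers=4):
    """Return the argv for one import of a single table into HDFS."""
    # Hypothetical layout: one HDFS directory per table and day,
    # so a rerun for another day does not collide with earlier imports.
    target_dir = f"{base_dir}/{table}/{run_date.isoformat()}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--fields-terminated-by", r"\t",  # same escape string as '\t' in a shell
        "--num-mappers", str(num_mappers),
    ]


def run_import(argv):
    """Run the import; a non-zero exit raises, so cron can report the failure."""
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    argv = build_sqoop_import(
        "jdbc:mysql://example.com/world", "City", "/etl", date.today()
    )
    # Dry run: print the command instead of executing it.
    print(shlex.join(argv))
```

Swapping the `print` for `run_import(argv)` and scheduling the script from cron would give the "Cron job: Sqoop" variant of the flow above.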

About Sqoop
  • Open source; the name stands for SQL-to-Hadoop
  • Parallel import and export between Hadoop and various RDBMSs
  • Default implementation is JDBC
  • Optimized for MySQL but not for performance
  • Integrated with connectors for Oracle, Netezza, Teradata (not open source)

Sqoop Data Into Hadoop

    $ sqoop import --connect jdbc:mysql://example.com/world \
        --table City \
        --fields-terminated-by '\t' \
        --lines-terminated-by '\n'

  • This command will submit a Hadoop job that queries your MySQL server and reads all the rows from world.City
  • The resulting TSV file(s) will be stored in HDFS

Sqoop Features
  • You can choose a specific table, columns or rows to import (--table, --columns, --where)
  • Controlled parallelism
      • Parallel mappers/connections (--num-mappers)
      • Specify the column to split on (--split-by)
  • Incremental loads
  • Integration with Hive and HBase
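
To illustrate how --num-mappers and --split-by work together: Sqoop looks up the minimum and maximum of the split column and gives each mapper its own slice of that range. The sketch below approximates that idea in Python for an integer column. It is not Sqoop's actual splitter code, and the function name `split_where_clauses` is hypothetical.

```python
def split_where_clauses(column, lo, hi, num_mappers):
    """Return one WHERE clause per mapper, covering [lo, hi] without overlap.

    lo and hi stand for the MIN and MAX of the split column.
    """
    if num_mappers < 1 or hi < lo:
        raise ValueError("need num_mappers >= 1 and lo <= hi")
    span = hi - lo
    # Evenly spaced integer boundaries from lo to hi, inclusive.
    bounds = [lo + span * i // num_mappers for i in range(num_mappers + 1)]
    clauses = []
    for i in range(num_mappers):
        start, end = bounds[i], bounds[i + 1]
        # Only the last slice includes its upper bound, so no row is read twice.
        upper_op = "<=" if i == num_mappers - 1 else "<"
        clauses.append(f"{column} >= {start} AND {column} {upper_op} {end}")
    return clauses


if __name__ == "__main__":
    for clause in split_where_clauses("ID", 0, 1000, 4):
        print(clause)
```

The slices are equal in key range, not in row count, so a split column with evenly distributed values keeps the mappers balanced.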

Sqoop Export

    $ sqoop export --connect jdbc:mysql://example.com/world \
        --table City \
        --export-dir /hdfs_path/City_data

  • The City table needs to exist
  • Default format is CSV
  • Can use a staging table (--staging-table)

About Hive
  • Offers a way around the complexities of MapReduce/Java
  • Hive is an open-source project managed by the Apache Software Foundation
  • Facebook uses Hadoop and wanted employees who don’t write Java to be able to access data
      • Language based on SQL
      • Easy to learn and use
      • Data is available to many more people
  • Hive is a SQL SELECT statement to MapReduce translator

More About Hive
  • Hive is NOT a replacement for an RDBMS
      • Not all SQL works
  • Hive is only an interpreter that converts HiveQL to MapReduce
  • HiveQL queries can take many seconds or minutes to produce a result set
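
To make the "HiveQL to MapReduce" translation concrete, here is a hedged Python sketch of how a query such as `SELECT CountryCode, COUNT(*) FROM City GROUP BY CountryCode` breaks down into a map step and a reduce step. It runs in one process and only mirrors the shape of the job; it is not how Hive generates its plans, and the function names and the three sample rows are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(rows):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    for row in rows:
        yield row["CountryCode"], 1


def reduce_phase(pairs):
    """Reduce: sum the values for each key.

    Sorting by key stands in for the shuffle that brings equal keys together.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def count_by_country(rows):
    """Emulate SELECT CountryCode, COUNT(*) ... GROUP BY CountryCode."""
    return list(reduce_phase(map_phase(rows)))


if __name__ == "__main__":
    # Illustrative rows in the shape of the world.City table.
    sample = [
        {"Name": "Kabul", "CountryCode": "AFG"},
        {"Name": "Amsterdam", "CountryCode": "NLD"},
        {"Name": "Rotterdam", "CountryCode": "NLD"},
    ]
    print(count_by_country(sample))
```

Every such query pays the start-up cost of a batch job, which is why even small HiveQL queries take seconds or minutes rather than milliseconds.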

RDBMS vs Hive

Sqoop and Hive

    $ sqoop import --connect jdbc:mysql://example.com/world \
        --table City \
        --hive-import

  • Alternatively, you can create table(s) within the Hive CLI and run an “fs -put” with an exported CSV file on the local file system

Impala
  • It’s new, it’s fast
  • Allows real-time analytics on very large data sets
  • Runs on top of Hive (it reuses the Hive metastore and table definitions but does not run queries as MapReduce jobs)
  • Based on Google’s Dremel
      • http://research.google.com/pubs/pub36632.html
  • Cloudera VM for Impala
      • https://ccp.cloudera.com/display/SUPPORT/Downloads

Thanks Everyone
  • Questions?
  • Good references
      • Cloudera.com
      • http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
  • VM downloads
      • https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4