# Construct a DataFrame from the "users" table in Hive
users = sqlContext.table("users")
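# Load a JSON file from S3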
logs = sqlContext.load("s3n://path/to/data.json", "json")
# Load a Parquet file from HDFS
clicks = sqlContext.load("hdfs://path/to/data.parquet", "parquet")
# Access MySQL via JDBC
comments = sqlContext.jdbc("jdbc:mysql://localhost/comments", "user")
# Convert an ordinary RDD into a DataFrame
rdd = sparkContext.textFile("article.txt") \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordCounts = sqlContext.createDataFrame(rdd, ["word", "count"])
# Convert a local data collection into a DataFrame
data = [("Alice", 21), ("Bob", 24)]
people = sqlContext.createDataFrame(data, ["name", "age"])
# Convert a pandas DataFrame into a Spark DataFrame (a Python-API-only feature)
sparkDF = sqlContext.createDataFrame(pandasDF)
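However the DataFrame is constructed, the same operations apply afterwards. As a minimal sketch reusing the people DataFrame defined above, you can register it as a temporary table and query it with SQL:
# Minimal sketch: reuses the `people` DataFrame created above
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 21")
adults.show()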
With Spark on YARN already set up, do the following to start using Spark SQL:
1. Copy hive-site.xml to the $SPARK_HOME/conf directory, and make sure hive.metastore.uris and hive.metastore.client.socket.timeout are configured.
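For reference, a minimal hive-site.xml sketch covering just these two properties (the metastore host, port, and timeout value are placeholders; replace them with your own):
<property>
  <name>hive.metastore.uris</name>
  <!-- placeholder host/port; point this at your Hive metastore -->
  <value>thrift://metastore-host:9083</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <!-- placeholder timeout in seconds -->
  <value>300</value>
</property>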
2. Copy mysql-connector-java.jar to the $SPARK_HOME/lib directory.
3. Configure spark-env.sh:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/spark/lib/mysql-connector-java.jar:/usr/lib/hive/lib/*
4. Start using it:
./bin/spark-sql --master yarn --num-executors 30 --executor-cores 4 --executor-memory 8g
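Once the shell starts, Hive tables can be queried directly with SQL; a minimal sketch, assuming a Hive table named users exists (like the one used at the top of this article):
-- assumes a Hive table named `users`
SELECT * FROM users LIMIT 10;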