A Knowledge Map for Big Data Engineers
2020-11-09 13:17:51 Editor: 小采


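What do you actually need to know to work on big data in an enterprise? I think the answer has two sides: technology and business. The technical side mainly involves probability and mathematical statistics, computer systems, algorithms, and programming; the business side varies with each company's line of business. Big data engineers need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they dig the drivers of business growth out of massive data and create more value for the company in a fiercely competitive market.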

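Business differs from company to company, but the technical points are shared. Here I briefly summarize the technical knowledge that big data engineers need to master. It mainly covers the fundamentals of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what the members of a team should cover collectively; each person will emphasize different areas depending on their role in the team. This is only a starting point, and comments are welcome.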

| Topic | Content | Key points | Reference |
|---|---|---|---|
| DB/OLTP & DW/OLAP | Database/OLTP basics | The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
| | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| | Distributed and parallel databases | Sharding, database proxy | |
| | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
| | OS | Linux | |
| | DB & DW systems | MySQL/Hive/Impala | |
| | Text formats and processing | JSON/XML, regex | |
| | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide. Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| | SQL on Hadoop: data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| | SQL on Hadoop: query & in-database analytics | Hive, Impala, UDF/UDAF | |
| | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
| | Stream processing | Storm | |
| | NoSQL | HBase/Cassandra (column-oriented database) | Lars George. HBase: The Definitive Guide. |
| | | MongoDB (document database) | |
| | | Neo4j (graph database) | |
| | | Redis (cache) | |
| Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| | Statistics | Data exploration (mean, median, range, standard deviation, variance, histogram), distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computing, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| | Supervised learning | Classifiers, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| | Unsupervised learning | Clustering, deep learning | |
| | Collaborative filtering | Item-based CF, user-based CF | |
| | Algorithm: classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
| | Algorithm: regression | Linear regression, logistic regression, ranking, perceptron | |
| | Algorithm: clustering | Hierarchical clustering, K-means clustering, spectral clustering | |
| | Algorithm: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
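
To make one entry from the table concrete, here is a minimal sketch of the precision/recall item listed under the data mining and machine learning basics, written in plain Python. The function name `precision_recall` and the sample labels are hypothetical, chosen only for illustration; they do not come from the article or from any library.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for a single positive class.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```

Precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found. In practice a team would more likely rely on a library such as scikit-learn, which the table lists under Python; the hand-written version is only meant to show the definitions.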