注意: 此插件只是在测试环境论证过,如若在生产环境使用,请自行评估风险
参考:
https://yq.aliyun.com/articles/157
https://blog.csdn.net/wzy0623/article/details/78845020

★下载对应的版本的rpm包
http://madlib.apache.org/

★RPM安装

  1. 1.确认madlib是否存在,存在则执行步骤2
  2. ># rpm -qa | grep madlib
  3. 2. 卸载madlib
  4. ># rpm -e madlib-1.14-1.x86_64
  5. 3. 安装madlib
  6. >#rpm -ivh apache-madlib-1.14-bin-Linux.rpm

★确认安装的平台信息

  1. ># su - demo
  2. >$ psql postgres
  3. postgres=# select version();
  4. version
  5. ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  6. ---------------------------------
  7. PostgreSQL 8.3.23 (Greenplum Database 5.0.0 build dev) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit compiled on Jul 2 2018
  8. 17:14:09 (with assert checking)
  9. (1 row)
  10. postgres=#

★安装

  1. ># su - demo
  2. >$ /usr/local/madlib/bin/madpack install -s madlib -p greenplum -c demo@127.0.0.1:5432/postgres
  3. madpack.py: INFO : Detected Greenplum DB version 5.0.0.
  4. madpack.py: INFO : *** Installing MADlib ***
  5. madpack.py: INFO : MADlib tools version = 1.14 (/usr/local/madlib/Versions/1.14/bin/../madpack/madpack.py)
  6. madpack.py: INFO : MADlib database version = None (host=127.0.0.1:5432, db=postgres, schema=madlib)
  7. madpack.py: INFO : Testing PL/Python environment...
  8. madpack.py: INFO : > Creating language PL/Python...
  9. madpack.py: INFO : > PL/Python environment OK (version: 2.7.5)
  10. madpack.py: INFO : Installing MADlib into MADLIB schema...
  11. madpack.py: INFO : > Creating MADLIB schema
  12. madpack.py: INFO : > Creating MADLIB.MigrationHistory table
  13. madpack.py: INFO : > Writing version info in MigrationHistory table
  14. madpack.py: INFO : > Creating objects for modules:
  15. madpack.py: INFO : > - array_ops
  16. madpack.py: INFO : > - bayes
  17. madpack.py: INFO : > - crf
  18. madpack.py: INFO : > - elastic_net
  19. madpack.py: INFO : > - linalg
  20. madpack.py: INFO : > - pmml
  21. madpack.py: INFO : > - prob
  22. madpack.py: INFO : > - sketch
  23. madpack.py: INFO : > - svec
  24. madpack.py: INFO : > - svm
  25. madpack.py: INFO : > - tsa
  26. madpack.py: INFO : > - stemmer
  27. madpack.py: INFO : > - conjugate_gradient
  28. madpack.py: INFO : > - knn
  29. madpack.py: INFO : > - lda
  30. madpack.py: INFO : > - stats
  31. madpack.py: INFO : > - svec_util
  32. madpack.py: INFO : > - utilities
  33. madpack.py: INFO : > - assoc_rules
  34. madpack.py: INFO : > - convex
  35. madpack.py: INFO : > - glm
  36. madpack.py: INFO : > - graph
  37. madpack.py: INFO : > - linear_systems
  38. madpack.py: INFO : > - recursive_partitioning
  39. madpack.py: INFO : > - regress
  40. madpack.py: INFO : > - sample
  41. madpack.py: INFO : > - summary
  42. madpack.py: INFO : > - kmeans
  43. madpack.py: INFO : > - pca
  44. madpack.py: INFO : > - validation
  45. madpack.py: INFO : MADlib 1.14 installed successfully in MADLIB schema.

检查安装

  1. >$ /usr/local/madlib/bin/madpack install-check -p greenplum -c demo@127.0.0.1:5432/postgres
  2. madpack.py: INFO : Detected Greenplum DB version 5.0.0.
  3. TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 3645 milliseconds
  4. TEST CASE RESULT|Module: bayes|gaussian_naive_bayes.sql_in|PASS|Time: 2175 milliseconds
  5. TEST CASE RESULT|Module: bayes|bayes.sql_in|PASS|Time: 6984 milliseconds
  6. TEST CASE RESULT|Module: crf|crf_train_small.sql_in|PASS|Time: 2600 milliseconds
  7. TEST CASE RESULT|Module: crf|crf_train_large.sql_in|PASS|Time: 3662 milliseconds
  8. TEST CASE RESULT|Module: crf|crf_test_small.sql_in|PASS|Time: 3070 milliseconds
  9. TEST CASE RESULT|Module: crf|crf_test_large.sql_in|PASS|Time: 2886 milliseconds
  10. TEST CASE RESULT|Module: elastic_net|elastic_net_install_check.sql_in|PASS|Time: 10679 milliseconds
  11. TEST CASE RESULT|Module: linalg|svd.sql_in|PASS|Time: 15686 milliseconds
  12. TEST CASE RESULT|Module: linalg|matrix_ops.sql_in|PASS|Time: 11672 milliseconds
  13. TEST CASE RESULT|Module: linalg|linalg.sql_in|PASS|Time: 645 milliseconds
  14. TEST CASE RESULT|Module: prob|prob.sql_in|PASS|Time: 2617 milliseconds
  15. TEST CASE RESULT|Module: sketch|support.sql_in|PASS|Time: 140 milliseconds
  16. TEST CASE RESULT|Module: sketch|mfv.sql_in|PASS|Time: 449 milliseconds
  17. TEST CASE RESULT|Module: sketch|fm.sql_in|PASS|Time: 3138 milliseconds
  18. TEST CASE RESULT|Module: sketch|cm.sql_in|PASS|Time: 6968 milliseconds
  19. TEST CASE RESULT|Module: svm|svm.sql_in|PASS|Time: 47427 milliseconds
  20. TEST CASE RESULT|Module: tsa|arima_train.sql_in|PASS|Time: 8105 milliseconds
  21. TEST CASE RESULT|Module: tsa|arima.sql_in|PASS|Time: 8881 milliseconds
  22. TEST CASE RESULT|Module: conjugate_gradient|conj_grad.sql_in|PASS|Time: 1023 milliseconds
  23. TEST CASE RESULT|Module: knn|knn.sql_in|PASS|Time: 4071 milliseconds
  24. TEST CASE RESULT|Module: lda|lda.sql_in|PASS|Time: 13918 milliseconds
  25. TEST CASE RESULT|Module: stats|wsr_test.sql_in|PASS|Time: 483 milliseconds
  26. TEST CASE RESULT|Module: stats|t_test.sql_in|PASS|Time: 535 milliseconds
  27. TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.sql_in|PASS|Time: 3388 milliseconds
  28. TEST CASE RESULT|Module: stats|pred_metrics.sql_in|PASS|Time: 3386 milliseconds
  29. TEST CASE RESULT|Module: stats|mw_test.sql_in|PASS|Time: 313 milliseconds
  30. TEST CASE RESULT|Module: stats|ks_test.sql_in|PASS|Time: 434 milliseconds
  31. TEST CASE RESULT|Module: stats|f_test.sql_in|PASS|Time: 253 milliseconds
  32. TEST CASE RESULT|Module: stats|cox_prop_hazards.sql_in|PASS|Time: 7802 milliseconds
  33. TEST CASE RESULT|Module: stats|correlation.sql_in|PASS|Time: 4900 milliseconds
  34. TEST CASE RESULT|Module: stats|chi2_test.sql_in|PASS|Time: 639 milliseconds
  35. TEST CASE RESULT|Module: stats|anova_test.sql_in|PASS|Time: 372 milliseconds
  36. TEST CASE RESULT|Module: svec_util|svec_test.sql_in|PASS|Time: 2495 milliseconds
  37. TEST CASE RESULT|Module: svec_util|gp_sfv_sort_order.sql_in|PASS|Time: 454 milliseconds
  38. TEST CASE RESULT|Module: utilities|text_utilities.sql_in|PASS|Time: 651 milliseconds
  39. TEST CASE RESULT|Module: utilities|sessionize.sql_in|PASS|Time: 552 milliseconds
  40. TEST CASE RESULT|Module: utilities|pivot.sql_in|PASS|Time: 4107 milliseconds
  41. TEST CASE RESULT|Module: utilities|path.sql_in|PASS|Time: 1948 milliseconds
  42. TEST CASE RESULT|Module: utilities|minibatch_preprocessing.sql_in|PASS|Time: 7324 milliseconds
  43. TEST CASE RESULT|Module: utilities|encode_categorical.sql_in|PASS|Time: 1764 milliseconds
  44. TEST CASE RESULT|Module: utilities|drop_madlib_temp.sql_in|PASS|Time: 390 milliseconds
  45. TEST CASE RESULT|Module: assoc_rules|assoc_rules.sql_in|PASS|Time: 6334 milliseconds
  46. TEST CASE RESULT|Module: convex|mlp.sql_in|PASS|Time: 13578 milliseconds
  47. TEST CASE RESULT|Module: convex|lmf.sql_in|PASS|Time: 5267 milliseconds
  48. TEST CASE RESULT|Module: glm|poisson.sql_in|PASS|Time: 4348 milliseconds
  49. TEST CASE RESULT|Module: glm|ordinal.sql_in|PASS|Time: 2583 milliseconds
  50. TEST CASE RESULT|Module: glm|multinom.sql_in|PASS|Time: 3669 milliseconds
  51. TEST CASE RESULT|Module: glm|inverse_gaussian.sql_in|PASS|Time: 4148 milliseconds
  52. TEST CASE RESULT|Module: glm|gaussian.sql_in|PASS|Time: 4124 milliseconds
  53. TEST CASE RESULT|Module: glm|gamma.sql_in|PASS|Time: 8138 milliseconds
  54. TEST CASE RESULT|Module: glm|binomial.sql_in|PASS|Time: 6418 milliseconds
  55. TEST CASE RESULT|Module: graph|wcc.sql_in|PASS|Time: 4071 milliseconds
  56. TEST CASE RESULT|Module: graph|sssp.sql_in|PASS|Time: 7463 milliseconds
  57. TEST CASE RESULT|Module: graph|pagerank.sql_in|PASS|Time: 31027 milliseconds
  58. TEST CASE RESULT|Module: graph|measures.sql_in|PASS|Time: 3481 milliseconds
  59. TEST CASE RESULT|Module: graph|hits.sql_in|PASS|Time: 3985 milliseconds
  60. TEST CASE RESULT|Module: graph|bfs.sql_in|PASS|Time: 7882 milliseconds
  61. TEST CASE RESULT|Module: graph|apsp.sql_in|PASS|Time: 3741 milliseconds
  62. TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.sql_in|PASS|Time: 985 milliseconds
  63. TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.sql_in|PASS|Time: 955 milliseconds
  64. TEST CASE RESULT|Module: recursive_partitioning|random_forest.sql_in|PASS|Time: 24188 milliseconds
  65. TEST CASE RESULT|Module: recursive_partitioning|decision_tree.sql_in|PASS|Time: 10239 milliseconds
  66. TEST CASE RESULT|Module: regress|robust.sql_in|PASS|Time: 14052 milliseconds
  67. TEST CASE RESULT|Module: regress|multilogistic.sql_in|PASS|Time: 4775 milliseconds
  68. TEST CASE RESULT|Module: regress|marginal.sql_in|PASS|Time: 20274 milliseconds
  69. TEST CASE RESULT|Module: regress|logistic.sql_in|PASS|Time: 13970 milliseconds
  70. TEST CASE RESULT|Module: regress|linear.sql_in|PASS|Time: 1114 milliseconds
  71. TEST CASE RESULT|Module: regress|clustered.sql_in|PASS|Time: 8617 milliseconds
  72. TEST CASE RESULT|Module: sample|train_test_split.sql_in|PASS|Time: 2707 milliseconds
  73. TEST CASE RESULT|Module: sample|stratified_sample.sql_in|PASS|Time: 1659 milliseconds
  74. TEST CASE RESULT|Module: sample|sample.sql_in|PASS|Time: 332 milliseconds
  75. TEST CASE RESULT|Module: sample|balance_sample.sql_in|PASS|Time: 3749 milliseconds
  76. TEST CASE RESULT|Module: summary|summary.sql_in|PASS|Time: 2799 milliseconds
  77. TEST CASE RESULT|Module: kmeans|kmeans.sql_in|PASS|Time: 18560 milliseconds
  78. TEST CASE RESULT|Module: pca|pca_project.sql_in|PASS|Time: 28497 milliseconds
  79. TEST CASE RESULT|Module: pca|pca.sql_in|PASS|Time: 127319 milliseconds
  80. TEST CASE RESULT|Module: validation|cross_validation.sql_in|PASS|Time: 7857 milliseconds

说明: /usr/local/madlib/bin/madpack install -s SCHEMA -p 平台-c 用户@IP:PORT/DATABASE

★SCHEMA验证

  1. postgres=# psql
  2. postgres=# \dn madlib*
  3. List of schemas
  4. Name | Owner
  5. --------------------------------+-------
  6. madlib | demo
  7. madlib_installcheck_tsa | demo
  8. (2 rows)
  9. postgres=#

★设置search_path

  1. postgres=# set search_path="$user",madlib,public;

★版本信息检查

  1. postgres=# \dt
  2. List of relations
  3. Schema | Name | Type | Owner | Storage
  4. -----------+----------------------+--------+---------+---------
  5. madlib | migrationhistory | table | demo | heap
  6. (1 row)
  7. postgres=# select * from migrationhistory;
  8. id | version | applied
  9. -----+---------+----------------------------
  10. 1 | 1.14 | 2018-07-02 18:02:41.384696
  11. (1 row)

★简单使用举例

  1. 参考: https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Users
  2. Sample Problem Using Logistic Regression
  3. **1. The sample data set and an introduction to logistic regression are described here.**
  4. The MADlib function used in this example is described in the MADlib logistic regression documentation.
  5. Suppose that we are working with doctors on a project related to heart failure. The dependent variable in the data set is whether the patient has had a second heart attack within 1 year (yes=1). We have two independent variables: one is whether the patient completed a treatment on anger control (yes=1), and the other is a score on a trait anxiety scale (higher score means more anxious).
  6. The idea is to train a model using labeled data, then use this model to predict second heart attack occurrence for other patients.
  7. **2. To interact with the data using MADlib, use the standard psql terminal provided by the database. You could also use a tool like pgAdmin.**
  8. DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;
  9. CREATE TABLE patients( id INTEGER NOT NULL,
  10. second_attack INTEGER,
  11. treatment INTEGER,
  12. trait_anxiety INTEGER);
  13. INSERT INTO patients VALUES
  14. (1, 1, 1, 70),
  15. (3, 1, 1, 50),
  16. (5, 1, 0, 40),
  17. (7, 1, 0, 75),
  18. (9, 1, 0, 70),
  19. (11, 0, 1, 65),
  20. (13, 0, 1, 45),
  21. (15, 0, 1, 40),
  22. (17, 0, 0, 55),
  23. (19, 0, 0, 50),
  24. (2, 1, 1, 80),
  25. (4, 1, 0, 60),
  26. (6, 1, 0, 65),
  27. (8, 1, 0, 80),
  28. (10, 1, 0, 60),
  29. (12, 0, 1, 50),
  30. (14, 0, 1, 35),
  31. (16, 0, 1, 50),
  32. (18, 0, 0, 45),
  33. (20, 0, 0, 60);
  34. **3. Call MADlib built-in function to train a classification model using the training data table as input:**
  35. SELECT madlib.logregr_train(
  36. 'patients', -- source table
  37. 'patients_logregr', -- output table
  38. 'second_attack', -- labels
  39. 'ARRAY[1, treatment, trait_anxiety]', -- features
  40. NULL, -- grouping columns
  41. 20, -- max number of iteration
  42. 'irls' -- optimizer
  43. );
  44. **4. View the model that has just been trained:**
  45. -- Set extended display on for easier reading of output (\x is for psql only)
  46. \x on
  47. SELECT * from patients_logregr;
  48. -- ************ --
  49. -- Result --
  50. -- ************ --
  51. coef | [-6.36346994178187, -1.02410605239327, 0.119044916668606]
  52. log_likelihood | -9.41018298389
  53. std_err | [3.21389766375094, 1.17107844860319, 0.0549790458269309]
  54. z_stats | [-1.97998524145759, -0.874498248699549, 2.16527796868918]
  55. p_values | [0.0477051870698128, 0.38184697353045, 0.0303664045046168]
  56. odds_ratios | [0.0017233763092323, 0.359117354054954, 1.12642051220895]
  57. condition_no | 326.081922792
  58. num_rows_processed | 20
  59. num_missing_rows_skipped | 0
  60. num_iterations | 5
  61. variance_covariance | [[10.3291381930637, -0.47430466519573, -0.171995901260052], [-0.47430466519573, 1.37142473278285, -0.00119520703381598], [-0.171995901260052, -0.00119520703381598, 0.00302269548003977]]
  62. -- Alternatively, unnest the arrays in the results for easier reading of output (\x is for psql only)
  63. \x off
  64. SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) as attribute,
  65. unnest(coef) as coefficient,
  66. unnest(std_err) as standard_error,
  67. unnest(z_stats) as z_stat,
  68. unnest(p_values) as pvalue,
  69. unnest(odds_ratios) as odds_ratio
  70. FROM patients_logregr;
  71. -- ************ --
  72. -- Result --
  73. -- ************ --
  74. +---------------+---------------+------------------+-----------+-----------+--------------+
  75. | attribute | coefficient | standard_error | z_stat | pvalue | odds_ratio |
  76. |---------------+---------------+------------------+-----------+-----------+--------------|
  77. | intercept | -6.36347 | 3.2139 | -1.97999 | 0.0477052 | 0.00172338 |
  78. | treatment | -1.02411 | 1.17108 | -0.874498 | 0.381847 | 0.359117 |
  79. | trait_anxiety | 0.119045 | 0.054979 | 2.16528 | 0.0303664 | 1.12642 |
  80. +---------------+---------------+------------------+-----------+-----------+--------------+
  81. **5. Now use the model to predict the dependent variable (second heart attack within 1 year) using the logistic regression model. For the purpose of demonstration, we will use the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used for prediction.**
  82. -- Display prediction value along with the original value
  83. SELECT p.id, madlib.logregr_predict(coef, ARRAY[1, treatment, trait_anxiety]),
  84. p.second_attack
  85. FROM patients p, patients_logregr m
  86. ORDER BY p.id;
  87. -- ************ --
  88. -- Result --
  89. -- ************ --
  90. +------+-------------------+-----------------+
  91. | id | logregr_predict | second_attack |
  92. |------+-------------------+-----------------|
  93. | 1 | True | 1 |
  94. | 2 | True | 1 |
  95. | 3 | False | 1 |
  96. | 4 | True | 1 |
  97. | 5 | False | 1 |
  98. | 6 | True | 1 |
  99. | 7 | True | 1 |
  100. | 8 | True | 1 |
  101. | 9 | True | 1 |
  102. | 10 | True | 1 |
  103. | 11 | True | 0 |
  104. | 12 | False | 0 |
  105. | 13 | False | 0 |
  106. | 14 | False | 0 |
  107. | 15 | False | 0 |
  108. | 16 | False | 0 |
  109. | 17 | True | 0 |
  110. | 18 | False | 0 |
  111. | 19 | False | 0 |
  112. | 20 | True | 0 |
  113. +------+-------------------+-----------------+
  114. -- Predicting the probability of the dependent variable being TRUE.
  115. -- Display prediction value along with the original value
  116. SELECT p.id, madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])
  117. FROM patients p, patients_logregr m
  118. ORDER BY p.id;
  119. -- ************ --
  120. -- Result --
  121. -- ************ --
  122. +------+------------------------+
  123. | id | logregr_predict_prob |
  124. |------+------------------------|
  125. | 1 | 0.720223 |
  126. | 2 | 0.894355 |
  127. | 3 | 0.19227 |
  128. | 4 | 0.685513 |
  129. | 5 | 0.167748 |
  130. | 6 | 0.798098 |
  131. | 7 | 0.928568 |
  132. | 8 | 0.959306 |
  133. | 9 | 0.877576 |
  134. | 10 | 0.685513 |
  135. | 11 | 0.586701 |
  136. | 12 | 0.19227 |
  137. | 13 | 0.116032 |
  138. | 14 | 0.0383829 |
  139. | 15 | 0.0674976 |
  140. | 16 | 0.19227 |
  141. | 17 | 0.545871 |
  142. | 18 | 0.267675 |
  143. | 19 | 0.398619 |
  144. | 20 | 0.685513 |
  145. +------+------------------------+
  146. The 1 entry in the ARRAY denotes an additional bias term in the model in the standard way, to allow for a non-zero intercept value.
  147. If the probability is greater than 0.5, the prediction is given as True. Otherwise it is given as False.
  148. Next Steps
  149. For details on all of the machine learning functions provided by MADlib, please refer to User Documentation
  150. Try out the available Jupyter notebooks for many MADlib algorithms
  151. To contribute new modules to MADlib, please refer to the Quick Start Guide for Developers
0 评论  
添加一条新评论