NASA hosts and/or maintains more than 32,000 datasets, covering topics from Earth science to aerospace engineering to the management of NASA itself. We can use the metadata for these datasets to understand the connections between them.
1 How data is organized at NASA
First, let's download the JSON file and take a look at the names of the fields stored in the metadata.
library(jsonlite)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
Here we see that we can extract information ranging from who released each dataset to what license it is published under.
It seems that the title, description, and keywords of each dataset may be the most useful fields for drawing connections between datasets. Let's check them out.
class(metadata$dataset$title)
1.1 Wrangling and tidying the data
Let's set up separate tidy data frames for the title, description, and keyword fields, keeping the dataset ID for each so that we can connect them in a later analysis if necessary.
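A minimal sketch of one way to build these tidy data frames, assuming the downloaded metadata contains `identifier`, `title`, `description`, and `keyword` fields (the names `nasa_desc` and `nasa_keyword` follow the `nasa_title` convention used later in this analysis):

```r
library(dplyr)
library(tidyr)
library(tidytext)

# one row per word, keeping the dataset id for later joins
nasa_title <- tibble(id = metadata$dataset$identifier,
                     title = metadata$dataset$title) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word")

nasa_desc <- tibble(id = metadata$dataset$identifier,
                    desc = metadata$dataset$description) %>%
  unnest_tokens(word, desc) %>%
  anti_join(stop_words, by = "word")

# keywords are stored as a list-column, so unnest() them
nasa_keyword <- tibble(id = metadata$dataset$identifier,
                       keyword = metadata$dataset$keyword) %>%
  unnest(keyword)
```

Removing stop words from the title and description fields keeps the later counts focused on meaningful words; keywords are left as-is since they are already curated labels.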
1.2 Some simple preliminary exploration
What are the most common words in NASA dataset titles? We can use count() from dplyr to check this.
nasa_title %>% count(word, sort = TRUE)
What are the most common keywords?
nasa_keyword %>%
group_by(keyword) %>% count(sort = TRUE)
## # A tibble: 1,774 x 2
## # Groups:   keyword [1,774]
##    keyword           n
##    <chr>         <int>
##  1 EARTH SCIENCE 14362
##  2 Project        7452
##  3 ATMOSPHERE     7321
##  4 Ocean Color    7268
##  5 Ocean Optics   7268
##  6 Oceans         7268
##  7 completed      6452
2.1 Networks of description and title words
We can use pairwise_count() from the widyr package to count how many times each pair of words occurs together in the title or description fields.
library(widyr)
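A sketch of the pairwise counts described above, assuming tidy `nasa_title` and `nasa_desc` data frames (one row per id/word pair) have been built as in the earlier tidying step:

```r
# count how often each pair of words co-occurs within a title
title_word_pairs <- nasa_title %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

# and within a description
desc_word_pairs <- nasa_desc %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)
```

Setting `upper = FALSE` keeps each unordered pair once instead of reporting both (a, b) and (b, a).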
These are the pairs of words that occur together most often in description fields. "Data" is a very common word in description fields; there is no lack of data in NASA's datasets!
We see some clear clustering in this network of title words; the words in NASA dataset titles are largely organized into several families that tend to go together.
What about the words in the description fields?
2.2 Networks of keywords
Next, let's build a network of keywords to see which keywords commonly occur together in the same datasets.
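A sketch of building the keyword_pairs data frame printed below, assuming the tidy `nasa_keyword` data frame (one row per id/keyword pair) from the earlier steps:

```r
library(widyr)

# count how often each pair of keywords is assigned to the same dataset
keyword_pairs <- nasa_keyword %>%
  pairwise_count(keyword, id, sort = TRUE, upper = FALSE)
```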
keyword_pairs
## # A tibble: 13,390 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 OCEANS OCEAN OPTICS 7324
## 2 EARTH SCIENCE ATMOSPHERE 7318
## 3 OCEANS OCEAN COLOR 7270
## 4 OCEAN OPTICS OCEAN COLOR 7270
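Beyond raw co-occurrence counts, we can measure how correlated keywords are across datasets. A sketch using pairwise_cor() from widyr, assuming the tidy `nasa_keyword` data frame; the minimum-frequency cutoff of 50 is an illustrative choice, not from the original analysis:

```r
library(dplyr)
library(widyr)

# correlation between keywords, based on which datasets they appear in,
# restricted to keywords used at least 50 times
keyword_cors <- nasa_keyword %>%
  group_by(keyword) %>%
  filter(n() >= 50) %>%
  pairwise_cor(keyword, id, sort = TRUE, upper = FALSE)
```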
Notice that the correlation coefficients for the keyword pairs at the top of this sorted data frame are equal to 1; these keywords always occur together. That means they are redundant keywords, and it may not make sense to keep using both keywords in these pairs; instead, just one of them could be used.
Let's visualize the network of keyword correlations, just as we did for keyword co-occurrences.
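One way to sketch this network plot with ggraph, assuming the tidy `nasa_keyword` data frame from earlier; the frequency cutoff, correlation threshold, and aesthetics here are illustrative choices, not from the original analysis:

```r
library(dplyr)
library(widyr)
library(igraph)
library(ggraph)

set.seed(1234)
nasa_keyword %>%
  group_by(keyword) %>%
  filter(n() >= 50) %>%
  pairwise_cor(keyword, id, sort = TRUE, upper = FALSE) %>%
  filter(correlation > 0.6) %>%          # keep only strongly correlated pairs
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +                # force-directed layout
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation),
                 edge_colour = "royalblue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```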
3 Calculating tf-idf for the description fields
The network graph showed us that the description fields are dominated by a few common words like "data", "global", and "resolution"; this is a good opportunity to use tf-idf as a statistic to find characteristic words for each description field. We can use tf-idf, term frequency times inverse document frequency, to identify words that are especially important to a document within a collection of documents. Let's apply this approach to the description fields of these NASA datasets.
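A sketch of the tf-idf calculation with bind_tf_idf() from tidytext, assuming the tidy `nasa_desc` data frame of description words; `desc_tf_idf` is an assumed name used in the following examples:

```r
library(dplyr)
library(tidytext)

# count words per description, then compute tf, idf, and tf-idf,
# treating each dataset's description as one document
desc_tf_idf <- nasa_desc %>%
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n) %>%
  arrange(-tf_idf)
```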
We now know which words in the descriptions have high tf-idf, and we also have labels for these descriptions in the keywords. Let's do a full join of the keyword data frame and the data frame of description words with tf-idf, and then find the highest tf-idf words for a given keyword.
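A sketch of this full join, assuming a `desc_tf_idf` data frame of tf-idf scores per description word and the tidy `nasa_keyword` data frame; the keyword "OCEANS" is just an illustrative choice:

```r
library(dplyr)

# connect description words to keywords via the shared dataset id,
# then pull out the most characteristic words for one keyword
desc_tf_idf %>%
  full_join(nasa_keyword, by = "id") %>%
  filter(keyword == "OCEANS") %>%
  arrange(-tf_idf)
```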
4 Topic modeling
Using tf-idf as a statistic has already given us insight into the content of NASA description fields, but let's try an additional approach to the question of what the NASA description fields are about: topic modeling.
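A sketch of fitting an LDA topic model with the topicmodels package, assuming the tidy `nasa_desc` data frame; the choice of 24 topics is an assumption (consistent with the 240 rows of 10 terms per topic shown below), and `desc_lda` is an assumed name:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# cast the tidy word counts into a document-term matrix
desc_dtm <- nasa_desc %>%
  count(id, word, sort = TRUE) %>%
  cast_dtm(id, word, n)

# fit a latent Dirichlet allocation model with 24 topics
desc_lda <- LDA(desc_dtm, k = 24, control = list(seed = 1234))
```

Setting a seed makes the stochastic fitting procedure reproducible.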
What is each topic about? Let's examine the top 10 terms for each topic.
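A sketch of extracting the per-topic word probabilities (beta) and the top terms, assuming a fitted model `desc_lda` as named above:

```r
library(dplyr)
library(tidytext)

# tidy the model into one row per topic-term combination
tidy_lda <- tidy(desc_lda, matrix = "beta")

# keep the 10 highest-probability terms in each topic
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
```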
## # A tibble: 240 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 data 0.0449
## 2 1 soil 0.0368
## 3 1 moisture 0.0295
## 4 1 amsr 0.0244
##  5     1 sst         0.0168
##  6     1 validation  0.0132
##  7     1 temperature 0.0132
##  8     1 surface     0.0129
##  9     1 accuracy    0.0123
## 10     1 set         0.0116
Some of the probabilities visible at the top of the data frame are low, while some are higher. Our model has assigned each description a probability of belonging to each of the topics we constructed from the sets of words. How are the probabilities distributed?
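A sketch of examining that distribution via the per-document-per-topic probabilities (gamma), assuming the fitted model `desc_lda` from earlier; the histogram settings are illustrative:

```r
library(tidytext)
library(ggplot2)

# one row per document-topic combination, with its probability gamma
tidy_gamma <- tidy(desc_lda, matrix = "gamma")

# how are the document-topic probabilities distributed?
ggplot(tidy_gamma, aes(gamma)) +
  geom_histogram(bins = 20) +
  scale_y_log10() +
  labs(x = expression(gamma), y = "Number of documents")
```

A log scale on the y-axis helps here because most document-topic probabilities cluster near zero.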