Data science with C and C++


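Although languages like Python and R are increasingly popular for data science, C and C++ can be a strong choice for efficient data science. In this article, we will use C99 and C++11 to write a program that uses Anscombe's quartet dataset, which I'll explain next.

I wrote about my motivation for continually learning languages in an article covering Python and GNU Octave, which is worth reviewing. All of the programs are meant to be run on the command line, not with a graphical user interface (GUI). The full examples are available in the polyglot_fit repository.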

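The programming task

The program you will write in this series:

Reads data from a CSV file
Fits the data with a straight line (i.e., f(x) = m⋅x + q)
Plots the result to an image file

This is a common situation that many data scientists encounter. The example data is the first set of Anscombe's quartet, shown in the table below. This is a set of artificially constructed data that gives the same results when fitted with a straight line, even though the plots of the four sets look very different. The data file is a text file with tabs as column separators and a few lines as a header. This task will use only the first set (i.e., the first two columns).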

Anscombe's quartet

I		II		III		IV
x	y	x	y	x	y	x	y
10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89
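To make the fitting task concrete before any library is involved, here is a minimal sketch of the closed-form least-squares line through a set of points. The function name fit_line and its signature are assumptions made for this illustration; they are not part of the program in the repository, which relies on the GNU Scientific Library instead.

```c
#include <stddef.h>

/* Hypothetical helper (not from the article's repository): ordinary
 * least-squares fit of f(x) = m*x + q.
 * Returns 0 on success, -1 if the fit is undefined (fewer than two
 * points, or all x values identical). */
int fit_line(const double *x, const double *y, size_t n,
             double *m, double *q)
{
    if (n < 2) {
        return -1;
    }

    /* Means of both columns */
    double mean_x = 0.0;
    double mean_y = 0.0;
    for (size_t i = 0; i < n; i += 1) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Sums of squared deviations and of cross products */
    double sxx = 0.0;
    double sxy = 0.0;
    for (size_t i = 0; i < n; i += 1) {
        const double dx = x[i] - mean_x;
        sxx += dx * dx;
        sxy += dx * (y[i] - mean_y);
    }

    if (sxx == 0.0) {
        return -1;
    }

    *m = sxy / sxx;               /* slope */
    *q = mean_y - (*m) * mean_x;  /* intercept */
    return 0;
}
```

Applied to the first set in the table above, this gives a slope of about 0.5001 and an intercept of about 3.0001.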

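The C way

C is a general-purpose programming language that is among the most widely used languages today (according to data from the TIOBE Index, the RedMonk Programming Language Rankings, the Popularity of Programming Language Index, and GitHub's State of the Octoverse). It is a quite old language (circa 1973), and many successful programs were written in it (the Linux kernel and Git, to name just two). It is also one of the closest languages to the inner workings of the computer, as it is used to manipulate memory directly. It is a compiled language; therefore, the source code has to be translated into machine code by a compiler. Its standard library is small and light on features, so other libraries have been developed to provide the missing functionality.

It is the language I use the most for number crunching, mostly because of its performance. I find it rather tedious to use, as it needs a lot of boilerplate code, but it is well supported in various environments. The C99 standard is a relatively recent revision that adds some nifty features and is well supported by compilers.

I will cover the necessary background of C and C++ programming along the way, so both beginners and advanced users can follow along.

Installation

To develop with C99, you need a compiler. I normally use Clang, but GCC is another valid open source compiler. For linear fitting, I chose to use the GNU Scientific Library. For plotting, I could not find any sensible library, so this program relies on an external program: Gnuplot. The example also uses a dynamic data structure to store data, which is defined in the Berkeley Software Distribution (BSD).

Installing in Fedora is as easy as running: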

 sudo dnf install clang gnuplot gsl gsl-devel

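Commenting code

In C99, a comment starts with // and runs to the end of the line; the compiler discards it. Alternatively, anything between /* and */ is also discarded.

// This is a comment ignored by the compiler.
/* Also this is ignored */

Necessary libraries

Libraries are composed of two parts:

A header file that contains the declarations of the functions
A source file that contains the functions' definitions

Header files are included in the source, while the libraries' compiled sources are linked against the executable. The header files needed for this example are: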

 // Input/Output utilities 
#include <stdio.h> 
// The standard library 
#include <stdlib.h> 
// String manipulation utilities 
#include <string.h> 
// BSD queue 
#include <sys/queue.h> 
// GSL scientific utilities 
#include <gsl/gsl_fit.h> 
#include <gsl/gsl_statistics_double.h>

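Main function

In C, the program must be inside a special function called main():

int main(void) {
    ...
}

This differs from Python, as covered in the last tutorial, which will run whatever code it finds in the source files.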

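Defining variables

In C, variables have to be declared before they are used, and they have to be associated with a type. Whenever you want to use a variable, you have to decide what kind of data to store in it. You can also specify if you intend to use a variable as a constant value; this is not necessary, but the compiler can benefit from the information. From the fitting_C99.c program in the repository:

const char *input_file_name = "anscombe.csv";
const char *delimiter = "\t";
const unsigned int skip_header = 3;
const unsigned int column_x = 0;
const unsigned int column_y = 1;
const char *output_file_name = "fit_C99.csv";
const unsigned int N = 100;

Arrays in C are not dynamic, in the sense that their length has to be decided in advance (i.e., before compilation):

int data_array[1024];

Since you normally do not know how many data points are in a file, use a singly linked list. This is a dynamic data structure that can keep growing as long as memory is available. Luckily, the BSD provides linked lists. Here is an example definition: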

struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

SLIST_HEAD(data_list, data_point) head = SLIST_HEAD_INITIALIZER(head);

SLIST_INIT(&head);

This example defines a data_point list made of structured values that contain both an x value and a y value. The syntax is rather complicated but intuitive, and describing it in detail would be too wordy.
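Since the list macros are easier to follow in motion, here is a small self-contained sketch of the whole life cycle of such a list: inserting, walking, and freeing. The helper names list_push, list_length, and list_free are my own for this illustration and do not appear in fitting_C99.c; the macros themselves come from <sys/queue.h>.

```c
#include <stdlib.h>
#include <sys/queue.h>

struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

/* Declares the list head type: struct data_list */
SLIST_HEAD(data_list, data_point);

/* Hypothetical helper: allocates a point and puts it at the front of
 * the list. Returns 0 on success, -1 if malloc() fails. */
int list_push(struct data_list *head, double x, double y)
{
    struct data_point *datum = malloc(sizeof *datum);

    if (datum == NULL) {
        return -1;
    }

    datum->x = x;
    datum->y = y;
    SLIST_INSERT_HEAD(head, datum, entries);

    return 0;
}

/* Hypothetical helper: walks the list and counts its elements. */
size_t list_length(struct data_list *head)
{
    size_t count = 0;
    struct data_point *datum;

    SLIST_FOREACH(datum, head, entries) {
        count += 1;
    }

    return count;
}

/* Hypothetical helper: removes and frees every element. */
void list_free(struct data_list *head)
{
    while (!SLIST_EMPTY(head)) {
        struct data_point *datum = SLIST_FIRST(head);

        SLIST_REMOVE_HEAD(head, entries);
        free(datum);
    }
}
```

Because SLIST_INSERT_HEAD() always adds at the front, the list ends up in reverse file order. That does not matter for a linear fit, but it is worth keeping in mind if the order of the points is significant.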

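Printing output

To print on the terminal, you can use the printf() function, which works like Octave's printf() function (described in the first article):

printf("#### Anscombe's first set with C99 ####\n");

The printf() function does not automatically add a newline at the end of the printed string, so you have to add it yourself. The first argument is a string that can contain format information for the other arguments passed to the function, such as:

printf("Slope: %f\n", slope);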
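As a small illustration of format strings with more than one argument, the sketch below uses snprintf(), which accepts the same format syntax as printf() but writes into a buffer, so the result can be inspected. The function name format_fit and the choice of three decimal places (%.3f) are assumptions made for this example.

```c
#include <stdio.h>

/* Hypothetical helper: formats both fit parameters into buffer.
 * Returns what snprintf() returns: the number of characters that the
 * full string needs, not counting the terminating '\0'. */
int format_fit(char *buffer, size_t size, double slope, double intercept)
{
    /* Each %.3f consumes one double argument, in order,
     * and prints it with three digits after the decimal point. */
    return snprintf(buffer, size, "Slope: %.3f, Intercept: %.3f\n",
                    slope, intercept);
}
```

Called with a slope of 0.5 and an intercept of 3.0, the buffer holds the text "Slope: 0.500, Intercept: 3.000" followed by a newline.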

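Reading data

Now comes the hard part... There are some libraries for CSV file parsing in C, but none seemed stable or popular enough to be in the Fedora packages repository. Instead of adding a dependency for this tutorial, I decided to write this part on my own. Again, going into details would be too wordy, so I will only explain the general idea. Some lines in the source will be ignored for the sake of brevity, but you can find the complete example in the repository.

First, open the input file:

FILE *input_file = fopen(input_file_name, "r");

Then read the file line by line until there is an error or the file ends:

while (!ferror(input_file) && !feof(input_file)) {
    size_t buffer_size = 0;
    char *buffer = NULL;

    getline(&buffer, &buffer_size, input_file);

    ...
}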
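One common variation of this loop tests the return value of getline() directly, which is -1 at the end of the file or on error, and reuses a single buffer that is freed once at the end. The sketch below shows that pattern in isolation; the function name count_data_lines is hypothetical, and the feature-test macro on the first line is only there so that getline() is declared when compiling in strict C99 mode.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: counts the lines that follow the header.
 * The first skip_header lines are read but not counted. */
size_t count_data_lines(FILE *input_file, unsigned int skip_header)
{
    char *buffer = NULL;     /* getline() allocates and grows this */
    size_t buffer_size = 0;
    size_t row = 0;
    size_t data_lines = 0;

    /* getline() returns -1 at end of file or on error,
     * so no separate feof()/ferror() test is needed here. */
    while (getline(&buffer, &buffer_size, input_file) != -1) {
        if (row >= skip_header) {
            data_lines += 1;
        }
        row += 1;
    }

    /* One buffer was reused for every line; release it once. */
    free(buffer);

    return data_lines;
}
```

In a real parser, the body of the loop would split the line into tokens instead of just counting it.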

The getline() function is a nice recent addition from the POSIX.1-2008 standard. It can read a whole line in a file and take care of allocating the necessary memory. Each line is then split into tokens with the strtok() function. Looping over the tokens, select the columns that you want:

char *token = strtok(buffer, delimiter);

while (token != NULL)
{
    double value;
    sscanf(token, "%lf", &value);

    if (column == column_x) {
        x = value;
    } else if (column == column_y) {
        y = value;
    }

    column += 1;
    token = strtok(NULL, delimiter);
}
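The token loop can be wrapped in a function that also reports whether both columns were actually found, which is handy for skipping header lines. This is only a sketch under my own assumptions: the name parse_columns is hypothetical, and it swaps sscanf() for strtod() because strtod() makes it easy to detect tokens that are not numbers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extracts two numeric columns from one line.
 * The line is modified in place by strtok().
 * Returns 1 if both columns were found and numeric, 0 otherwise. */
int parse_columns(char *line, const char *delimiter,
                  unsigned int column_x, unsigned int column_y,
                  double *x, double *y)
{
    unsigned int column = 0;
    int found_x = 0;
    int found_y = 0;

    for (char *token = strtok(line, delimiter);
         token != NULL;
         token = strtok(NULL, delimiter)) {
        char *end = NULL;
        const double value = strtod(token, &end);

        /* end == token means that no character was converted */
        if (end != token) {
            if (column == column_x) {
                *x = value;
                found_x = 1;
            } else if (column == column_y) {
                *y = value;
                found_y = 1;
            }
        }

        column += 1;
    }

    return found_x && found_y;
}
```

Note that strtok() treats a run of consecutive delimiters as a single one, so a line with empty fields would shift the column numbers. The data rows of the Anscombe file have no empty fields, so the approach works there.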

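Finally, when the x and y values are selected, insert the new data point in the linked list:

struct data_point *datum = malloc(sizeof(struct data_point));

datum->x = x;
datum->y = y;

SLIST_INSERT_HEAD(&head, datum, entries);

The malloc() function dynamically allocates (reserves) memory for the new data point. That memory persists until it is explicitly released with free().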

Fitting data

The GSL linear fitting function gsl_fit_linear() expects simple arrays for its input. Therefore, since you cannot know in advance the size of the arrays you create, you must manually allocate their memory:

const size_t entries_number = row - skip_header - 1;

double *x = malloc(sizeof(double) * entries_number);
double *y = malloc(sizeof(double) * entries_number);

Then, loop over the linked list to save the relevant data to the arrays:

SLIST_FOREACH(datum, &head, entries) {
    const double current_x = datum->x;
    const double current_y = datum->y;

    x[i] = current_x;
    y[i] = current_y;

    i += 1;
}

Now that you are done with the linked list, clean it up. Always release the memory that has been manually allocated to prevent a memory leak. Memory leaks are bad, bad, bad. Every time memory is not released, a garden gnome loses its head:

while (!SLIST_EMPTY(&head)) {
    struct data_point *datum = SLIST_FIRST(&head);

    SLIST_REMOVE_HEAD(&head, entries);

    free(datum);
}

Finally, finally(!), you can fit your data:

gsl_fit_linear(x, 1, y, 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x, 1, y, 1, entries_number);

printf("Slope: %f\n", slope);
printf("Intercept: %f\n", intercept);
printf("Correlation coefficient: %f\n", r_value);
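The last argument of gsl_fit_linear() is not explained above. As far as I understand the GSL documentation, it receives the sum of the squared residuals of the best-fit line. The sketch below computes that quantity by hand for a given line; the function name sum_squared_residuals is hypothetical and the code does not call GSL at all.

```c
#include <stddef.h>

/* Hypothetical helper: sum over all points of
 * (y[i] - (intercept + slope * x[i]))^2 */
double sum_squared_residuals(const double *x, const double *y, size_t n,
                             double slope, double intercept)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i += 1) {
        /* Vertical distance between the point and the line */
        const double residual = y[i] - (intercept + slope * x[i]);

        sum += residual * residual;
    }

    return sum;
}
```

For the first Anscombe set and the fitted slope and intercept, this sum is roughly 13.76. A smaller value means that the points lie closer to the line.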

Plotting

You must use an external program for the plotting. Therefore, save the fitting function to an external file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    fprintf(output_file, "%f\t%f\n", current_x, current_y);
}
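The snippet above assumes that min_x, max_x, and output_file already exist. Here is a self-contained sketch of the same sampling step that also finds the range of the data; the function name write_fit_samples and its parameters are my own choices for this illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: samples the fitted line at n_samples points,
 * starting at min_x - 1 and stopping one step before max_x + 1, and
 * writes them as tab-separated lines.
 * Returns the number of lines written. */
unsigned int write_fit_samples(FILE *output_file,
                               const double *x, size_t entries_number,
                               double slope, double intercept,
                               unsigned int n_samples)
{
    if (entries_number == 0 || n_samples == 0) {
        return 0;
    }

    /* Find the range covered by the data */
    double min_x = x[0];
    double max_x = x[0];
    for (size_t i = 1; i < entries_number; i += 1) {
        if (x[i] < min_x) {
            min_x = x[i];
        }
        if (x[i] > max_x) {
            max_x = x[i];
        }
    }

    const double step_x = ((max_x + 1) - (min_x - 1)) / n_samples;
    unsigned int written = 0;

    for (unsigned int i = 0; i < n_samples; i += 1) {
        const double current_x = (min_x - 1) + step_x * i;
        const double current_y = intercept + slope * current_x;

        if (fprintf(output_file, "%f\t%f\n", current_x, current_y) < 0) {
            break;  /* stop at the first write error */
        }
        written += 1;
    }

    return written;
}
```

Since a straight line is fully described by two points, the number of samples only matters if you later replace the line with a curve.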

The Gnuplot command for plotting both files is:

plot 'fit_C99.csv' using 1:2 with lines title 'Fit', 'anscombe.csv' using 1:2 with points pointtype 7 title 'Data'
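The plot command on its own draws to whatever terminal Gnuplot is currently using. To end up with an image file, Gnuplot also needs to be told the terminal type and the output file. One way to do that from C is to write a small script file and then run it with gnuplot; the sketch below only writes the script. The function name write_plot_script and the file names in the usage note are assumptions, and the png terminal is assumed to be available in your Gnuplot build.

```c
#include <stdio.h>

/* Hypothetical helper: writes a Gnuplot script that plots the fit and
 * the data into an image file.
 * Returns the number of characters written, or a negative value on
 * error (the return value of fprintf()). */
int write_plot_script(FILE *script,
                      const char *fit_file_name,
                      const char *data_file_name,
                      const char *image_file_name)
{
    return fprintf(script,
                   "set terminal png\n"
                   "set output '%s'\n"
                   "plot '%s' using 1:2 with lines title 'Fit', "
                   "'%s' using 1:2 with points pointtype 7 title 'Data'\n",
                   image_file_name, fit_file_name, data_file_name);
}
```

For example, writing the script to a file called plot.gp with the image name fit_C99.png and then running gnuplot plot.gp on the command line should produce the image.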

Results

Before running the program, it must be compiled:

clang -std=c99 -I/usr/include/ fitting_C99.c -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_C99

This command tells the compiler to use the C99 standard, read the fitting_C99.c file, link against the gsl and gslcblas libraries, and save the result to fitting_C99. The resulting output on the command line is:

#### Anscombe's first set with C99 ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421

Here is the resulting image generated with Gnuplot.

Plot and fit of the dataset obtained with C99

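The C++11 way

C++ is a general-purpose programming language that is also among the most popular languages in use today. It was created as a successor of C (in 1983) with an emphasis on object-oriented programming (OOP). C++ is commonly regarded as a superset of C, so a C program should be able to be compiled with a C++ compiler. This is not exactly true, as there are some corner cases where they behave differently. In my experience, C++ needs less boilerplate than C, but the syntax is more difficult if you want to develop objects. The C++11 standard is a relatively recent revision that adds some nifty features and is more or less supported by compilers.

Since C++ is largely compatible with C, I will just highlight the differences between the two. If I do not cover a topic in this part, it means that it is the same as in C.

Installation

The dependencies for the C++ example are the same as the C example. On Fedora, run:

sudo dnf install clang gnuplot gsl gsl-devel

Necessary libraries

Libraries work in the same way as in C, but the include directives are slightly different: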

 #include <cstdlib> 
#include <cstring> 
#include <iostream> 
#include <fstream> 
#include <string> 
#include <vector> 
#include <algorithm> 

extern "C" { 
#include <gsl/gsl_fit.h> 
#include <gsl/gsl_statistics_double.h> 
}

Since the GSL libraries are written in C, you must inform the compiler about this peculiarity.
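The extern "C" block tells a C++ compiler to give the enclosed declarations C linkage, so the linker can match them with functions that were compiled as C. Many C libraries put this guard inside their own headers, so the same file works from both languages. The sketch below shows that pattern on a made-up header; the names line_eval.h and line_eval are hypothetical and have nothing to do with GSL.

```c
/* line_eval.h: hypothetical header, not part of GSL or the repository */
#ifndef LINE_EVAL_H
#define LINE_EVAL_H

#ifdef __cplusplus
extern "C" {  /* only C++ compilers define __cplusplus and see this */
#endif

/* Evaluates f(x) = m*x + q */
double line_eval(double m, double q, double x);

#ifdef __cplusplus
}
#endif

#endif /* LINE_EVAL_H */

/* line_eval.c: the matching definition, compiled as C */
double line_eval(double m, double q, double x)
{
    return m * x + q;
}
```

A C compiler never sees the extern "C" lines, while a C++ compiler that includes the header treats line_eval() as a C function.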

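Defining variables

C++ supports more data types (classes) than C, such as a string type that has many more features than its C counterpart. Update the definition of the variables accordingly:

const std::string input_file_name("anscombe.csv");

For structured objects like strings, you can define the variable without using the = sign.

Printing output

You can use the printf() function, but the cout object is more idiomatic. Use the operator << to indicate the string (or objects) that you want to print with cout: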

std::cout << "#### Anscombe's first set with C++11 ####" << std::endl;

...

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

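Reading data

The scheme is the same as before. The file is opened and read line by line, but with a different syntax:

std::ifstream input_file(input_file_name);

while (input_file.good()) {
    std::string line;

    getline(input_file, line);

    ...
}

The line tokens are extracted with the same function as in the C99 example. Instead of using standard C arrays, use two vectors. Vectors are an extension of C arrays in the C++ standard library that allows dynamic management of memory without explicitly calling malloc():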

std::vector<double> x;
std::vector<double> y;

// Adding an element to x and y:
x.emplace_back(value);
y.emplace_back(value);
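To appreciate what a vector does for you, it helps to see roughly what the equivalent looks like in plain C. The sketch below is a growable array of doubles built on realloc(). The names double_vector, vector_push, and vector_free are hypothetical, and the growth policy (start at 8 elements, then double) is an arbitrary choice of mine; standard library implementations pick their own.

```c
#include <stdlib.h>

/* Hypothetical growable array of doubles */
struct double_vector {
    double *data;     /* contiguous storage, as in a plain C array */
    size_t size;      /* number of elements in use */
    size_t capacity;  /* number of elements allocated */
};

/* Appends a value, growing the storage when it is full.
 * Returns 0 on success, -1 if the allocation fails. */
int vector_push(struct double_vector *v, double value)
{
    if (v->size == v->capacity) {
        const size_t new_capacity = (v->capacity == 0) ? 8 : v->capacity * 2;
        double *new_data = realloc(v->data, new_capacity * sizeof *new_data);

        if (new_data == NULL) {
            return -1;  /* the old storage is still valid */
        }

        v->data = new_data;
        v->capacity = new_capacity;
    }

    v->data[v->size] = value;
    v->size += 1;

    return 0;
}

/* Releases the storage and resets the vector to empty. */
void vector_free(struct double_vector *v)
{
    free(v->data);
    v->data = NULL;
    v->size = 0;
    v->capacity = 0;
}
```

A vector starts out as { NULL, 0, 0 }, and its data member can be passed to any function that expects a plain array of doubles.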

Fitting data

For fitting in C++, you do not have to loop over a list, as vectors are guaranteed to have contiguous memory. You can directly pass the pointers to the vectors' buffers to the fitting functions:

gsl_fit_linear(x.data(), 1, y.data(), 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x.data(), 1, y.data(), 1, entries_number);

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

Plotting

Plotting is done with the same approach as before. Write to a file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    output_file << current_x << "\t" << current_y << std::endl;
}

output_file.close();

And then use Gnuplot for the plotting.

Results

Before running the program, it must be compiled with a similar command:

clang++ -std=c++11 -I/usr/include/ fitting_Cpp11.cpp -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_Cpp11

The resulting output on the command line is:

#### Anscombe's first set with C++11 ####
Slope: 0.500091
Intercept: 3.00009
Correlation coefficient: 0.816421

And this is the resulting image generated with Gnuplot.

Plot and fit of the dataset obtained with C++11

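Conclusion

[...] the GObject and Jansson libraries.

For number crunching, I prefer working in C99 due to its simpler syntax and widespread support. Until recently, C++11 was not as widely supported, and I tended to avoid the rough edges in the previous versions. For more complex software, C++ could be a good choice.

Do you use C or C++ for data science as well? Share your experiences in the comments.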

Translated from: https://opensource.com/article/20/2/c-data-science
